Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volume. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.

0 votes
0 answers
15 views

What would be the 'best' (efficient) way to implement a function that needs to integrate a large array of numbers?

I'm working with a function that needs to integrate the values of an array of numbers following an integrand function. This is implemented already in my code, but my problem is that each time I call ...

0 votes
2 answers
45 views

How to write each tagged output to a different file in Apache Beam

I have this code which tags outputs based on some data of the input file: class filters(beam.DoFn): def process(self, element): data = json.loads(element) yield TaggedOutput(data['EventName']...

0 votes
0 answers
11 views

VS Code crashed (reason: 'oom', code: '-536870904')

I am trying to open a folder that I opened before, but it crashes. I can open other projects, and restarting the computer didn't help. Maybe it's because I had a big file (400 MB) open in this folder, but ...

0 votes
0 answers
16 views

Need topic ideas for a university project related to digitalization and data science

I must prepare a project for my E-Business course in university. It has to be related to digitalization and data science and the field I'll be working on will be process models (for example CRISP-DM, ...

0 votes
0 answers
9 views

Hive: Will SMB join work if I bucket on 2 columns but use only one column in the join condition?

SMB requires that both tables are bucketed on the same columns and also sorted. What if I have bucketing on 2 columns say ColA and ColB of both tables, but I use only ColA as joining column. Will SMB ...

0 votes
0 answers
21 views

Ggplot does not label all interesting peptides for volcano plot

I have a dataframe containing 2479 peptides with their sequence, p-value and logfold change. # A tibble: 6 x 3 Sequence p log2fold <chr> <dbl> <dbl> 1 ...

1 vote
1 answer
52 views

Unnest Query optimisation for singular record

I'm trying to optimise my query for when an internal customer only wants to return one result (and its associated nested dataset). My aim is to reduce the query process size. However, it appears to ...

0 votes
0 answers
25 views

SQL HIVE Conversion

I'm trying to convert a piece of SQL code to HiveQL, and it's not working as expected. Please find below the code snippet in SQL that I'm attempting to convert: SQL Code: UPDATE C SET C.prod_l =...

0 votes
1 answer
29 views

How can I make this algorithm more efficient using dataframes?

I am trying to get the outliers of a column (with IQR). Once I have the outliers, I want to set the corresponding values in my main dataframe to null in order to impute them afterwards. This is ...

1 vote
1 answer
17 views

Designing Twitter Search - How to sort large datasets?

I'm reading an article about how to design a Twitter Search. The basic idea is to map tweets based on their ids to servers where each server has the mapping English word -> A set of tweetIds having ...

0 votes
1 answer
33 views

How to open networks with over 30,000 nodes in Gephi

I am running Gephi 0.9.2 on Windows 10 (x64, 8 GB RAM) with normal performance on networks under 30,000 nodes. When I try with bigger datasets the program just crashes. Sometimes it happens when the ...

0 votes
0 answers
14 views

How does Apache Spark create an ML model when the data is distributed across machines?

For example, if I am creating a linear regression model and the data is on 3 different machines, will model creation happen on all 3 machines, and will the final equation be an average of all the ...

0 votes
1 answer
38 views

Reshaping big data long based on column name patterns

I have a (large, in reality) dataset with street blocks; it has house numbers for the beginning ("from", in the variables below) and end ("to") of the block, for both the right and ...

0 votes
0 answers
41 views

How can I efficiently iterate through a pandas dataset that has millions of rows and pass a function to every row?

I have a pandas dataframe with 7 million instances of flight data. The flight data comes with the location and the time, which I am using to pull the weather for that time. Right now for 1000 instances, my ...

1 vote
1 answer
26 views

Apache NiFi: Remove Username and Password login by default on UI

I'm currently setting up NiFi. I've noticed that as part of the version 1.14.0 release, security features are enabled by default, which require a username and password to access the UI. ...