Questions tagged [bigdata]
Big data is a concept that deals with data sets of extreme volumes. Questions tend to be related to infrastructure, algorithms, statistics, and data structures.
7,460 questions
0 votes
0 answers
15 views
What would be the 'best' (efficient) way to implement a function that needs to integrate a large array of numbers?
I'm working with a function that needs to integrate the values of an array of numbers following an integrand function. This is implemented already in my code, but my problem is that each time I call ...
0 votes
2 answers
45 views
How to write each tagged output to a different file in Apache Beam
I have this code which tags outputs based on some data of the input file:
class filters(beam.DoFn):
    def process(self, element):
        data = json.loads(element)
        yield TaggedOutput(data['EventName']...
0 votes
0 answers
11 views
VS Code crashed (reason: 'oom', code: '-536870904')
I am trying to open a folder that I opened before, but it crashed.
I can open other projects, and restarting the computer didn't help.
Maybe it's because I had a big file (400 MB) open in this folder, but ...
0 votes
0 answers
16 views
Need topic ideas for a university project related to digitalization and data science
I must prepare a project for my E-Business course in university. It has to be related to digitalization and data science and the field I'll be working on will be process models (for example CRISP-DM, ...
0 votes
0 answers
9 views
Hive: Will SMB Join work if I bucket on 2 columns but use only one column in the join condition?
SMB requires that both tables are bucketed on the same columns and also sorted. What if both tables are bucketed on 2 columns, say ColA and ColB, but I use only ColA as the join column? Will SMB ...
0 votes
0 answers
21 views
Ggplot does not label all interesting peptides for volcano plot
I have a dataframe containing 2479 peptides with their sequence, p-value and log fold change.
# A tibble: 6 x 3
Sequence p log2fold
<chr> <dbl> <dbl>
1 ...
1 vote
1 answer
52 views
Unnest Query optimisation for singular record
I'm trying to optimise my query for when an internal customer only wants to return one result (and its associated nested dataset). My aim is to reduce the query process size.
However, it appears to ...
0 votes
0 answers
25 views
SQL HIVE Conversion
I'm trying to convert a piece of SQL code to HiveQL, and it's not working as expected.
Please find below the code snippet in SQL that I'm attempting to convert:
SQL Code:
UPDATE
C
SET
C.prod_l =...
0 votes
1 answer
29 views
How can I make this algorithm more efficient using dataframes?
I am trying to get the outliers of a column (with IQR). Once I have the outliers, I want to set the corresponding values in my main dataframe to null in order to impute them afterwards. This is ...
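For reference, one common vectorized way to express the step described in this excerpt (flag values outside the IQR fences and replace them with nulls) might look like the sketch below. It assumes pandas is available; the column name `value`, the helper `null_out_iqr_outliers`, and the conventional 1.5 multiplier are illustrative assumptions, not details from the question.

```python
import pandas as pd


def null_out_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df with IQR outliers in `column` replaced by NaN."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    out = df.copy()
    # Series.where keeps values where the condition holds and writes NaN elsewhere,
    # so no Python-level loop over rows is needed.
    out[column] = out[column].where(out[column].between(lower, upper))
    return out


# Hypothetical data: 100.0 lies far outside the bulk of the values.
frame = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 100.0]})
cleaned = null_out_iqr_outliers(frame, "value")
```

The original frame is left untouched, so the nulled column can be imputed separately and compared against the raw values.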
1 vote
1 answer
17 views
Designing Twitter Search - How to sort large datasets?
I'm reading an article about how to design a Twitter Search. The basic idea is to map tweets based on their ids to servers where each server has the mapping
English word -> A set of tweetIds having ...
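The sharding scheme described in this excerpt can be sketched roughly as follows. This is a toy, in-memory illustration only: the `Shard` class, the modulo-based routing, the whitespace tokenization, and the "higher id means newer" ordering are assumptions made for the example, not details from the article under discussion.

```python
from collections import defaultdict


class Shard:
    """One hypothetical server holding a word -> set of tweet ids mapping."""

    def __init__(self) -> None:
        self.index = defaultdict(set)

    def add(self, tweet_id: int, text: str) -> None:
        for word in text.lower().split():
            self.index[word].add(tweet_id)

    def lookup(self, word: str) -> set:
        # dict.get does not create a new key on a miss, unlike indexing.
        return self.index.get(word.lower(), set())


def route(tweet_id: int, shards: list) -> Shard:
    # Assumption: tweets are assigned to servers by id modulo the shard count.
    return shards[tweet_id % len(shards)]


def search(word: str, shards: list) -> list:
    # Query every shard, union the partial results, then sort ids descending
    # as a stand-in for "newest first".
    hits = set()
    for shard in shards:
        hits |= shard.lookup(word)
    return sorted(hits, reverse=True)


shards = [Shard() for _ in range(3)]
for tweet_id, text in [(1, "big data rocks"), (2, "small data"), (5, "big plans")]:
    route(tweet_id, shards).add(tweet_id, text)

results = search("big", shards)  # ids of tweets containing "big"
```

Because each shard only returns ids for one word, the final sort happens on the merged result set rather than on the full dataset.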
0 votes
1 answer
33 views
How to open networks over 30,000 nodes in Gephi
I am running Gephi 0.9.2 on Windows 10 (x64, 8 GB RAM) with normal performance on networks under 30,000 nodes. When I try with bigger datasets the program just crashes. Sometimes it happens when the ...
0 votes
0 answers
14 views
How does Apache Spark create an ML model when the data is distributed across machines?
For example, if I am creating a linear regression model and the data is on 3 different machines, will model creation happen on all 3 machines, and will the final equation be an average of all the ...
0 votes
1 answer
38 views
Reshaping big data long based on column name patterns
I have a (large, in reality) dataset with street blocks; it has house numbers for the beginning ("from", in the variables below) and end ("to") of the block, for both the right and ...
0 votes
0 answers
41 views
How can I efficiently iterate through a pandas dataset that has millions of rows and pass a function to every row?
I have a pandas dataframe with 7 million instances of flight data. The flight data comes with the location and the time, which I am using to pull weather for that time. Right now for 1000 instances, my ...
1 vote
1 answer
26 views
Apache NiFi: Remove Username and Password login by default on UI
I'm currently working on setting up NiFi. I've noticed that, as part of the version 1.14.0 release, security features are enabled by default, which requires a username and password to access the UI. ...