Questions tagged [pyspark]
The Spark Python API (PySpark) exposes the apache-spark programming model to Python.
32,098 questions
275 votes
20 answers
475k views
How to change dataframe column names in pyspark?
I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful with the command:
df.columns = ...
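A minimal sketch of the usual PySpark equivalents (the column names below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["old_id", "old_val"])

# rename one column
df = df.withColumnRenamed("old_id", "id")

# rename every column at once, analogous to pandas' df.columns = [...]
df = df.toDF("id", "val")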
205 votes
2 answers
56k views
Spark performance for Scala vs Python
I prefer Python over Scala. But since Spark is natively written in Scala, I was expecting my code to run faster in the Scala version than in the Python version, for obvious reasons.
With that assumption, I thought ...
182 votes
3 answers
257k views
How to add a constant column in a Spark DataFrame?
I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows:
dt.withColumn('new_column', 10).head(5)
-------------...
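A minimal sketch of the commonly suggested fix, assuming dt is the DataFrame from the question: wrap the constant in lit() so withColumn receives a Column rather than a plain int.

from pyspark.sql import functions as F

# lit() turns the Python value 10 into a Column expression
dt = dt.withColumn('new_column', F.lit(10))
dt.head(5)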
171 votes
16 answers
153k views
How to turn off INFO logging in Spark?
I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also run the Quick Start guide successfully.
However, I ...
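A sketch of the runtime approach (the persistent alternative is copying conf/log4j.properties.template to conf/log4j.properties and lowering the root logger level there):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sc.setLogLevel("WARN")   # or "ERROR" to silence the INFO output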
167 votes
10 answers
415k views
How do I add a new column to a Spark DataFrame (using PySpark)?
I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column.
I've tried the following without any success:
type(randomed_hours) # => list
# Create in Python and transform ...
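A rough sketch of what usually works, assuming df is the DataFrame in question and using hypothetical column names: withColumn needs a Column expression (derived from existing columns, or a literal), not a plain Python list.

from pyspark.sql import functions as F

# derive the new column from an existing one (column names are hypothetical)
df = df.withColumn('hours_norm', F.col('hours') / 24.0)

# or attach a literal constant
df = df.withColumn('flag', F.lit(1))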
148 votes
10 answers
403k views
Filter Pyspark dataframe column with None value
I'm trying to filter a PySpark dataframe that has None as a row value:
df.select('dt_mvmt').distinct().collect()
[Row(dt_mvmt=u'2016-03-27'),
Row(dt_mvmt=u'2016-03-28'),
Row(dt_mvmt=u'2016-03-29'),...
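A minimal sketch of the standard approach: compare with isNull()/isNotNull() rather than == None.

# keep rows where dt_mvmt has a value
df.filter(df.dt_mvmt.isNotNull())

# keep rows where dt_mvmt is None
df.filter(df.dt_mvmt.isNull())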
144 votes
10 answers
318k views
Show distinct column values in pyspark dataframe
With a pyspark dataframe, how do you do the equivalent of pandas df['col'].unique()?
I want to list out all the unique values in a pyspark dataframe column.
Not the SQL type way (registertemplate then ...
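A short sketch of the DataFrame-API route, assuming df is the dataframe and 'col_name' is a hypothetical column name:

# show the distinct values
df.select('col_name').distinct().show()

# or pull them back to the driver as a plain Python list
values = [row.col_name for row in df.select('col_name').distinct().collect()]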
142 votes
11 answers
307k views
Convert spark DataFrame column to python list
I work on a dataframe with two columns, mvv and count.
+---+-----+
|mvv|count|
+---+-----+
| 1 | 5 |
| 2 | 9 |
| 3 | 3 |
| 4 | 1 |
I would like to obtain two lists containing the mvv values and ...
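A minimal sketch of how that is usually done: collect once and unpack the Row objects (bracket access is used for 'count' because Row also exposes a count() method).

rows = df.select('mvv', 'count').collect()
mvv_list = [r['mvv'] for r in rows]
count_list = [r['count'] for r in rows]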
138 votes
16 answers
147k views
How to check if spark dataframe is empty?
Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that?
PS: I want to check if it's empty so that I only ...
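A sketch of the cheaper check: fetch at most one row instead of counting the whole DataFrame (df.rdd.isEmpty() is another commonly cited option).

def is_empty(df):
    # take(1) stops after the first row, unlike count()
    return len(df.take(1)) == 0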
131 votes
5 answers
332k views
How to change a dataframe column from String type to Double type in PySpark?
I have a dataframe with a column of type String.
I want to change the column type to Double in PySpark.
Following is the way I did it:
toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType())
...
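A minimal sketch of the usual answer: cast() does this without a UDF (the column name is hypothetical, and df is assumed to be the dataframe in question).

from pyspark.sql.functions import col

df = df.withColumn('price', col('price').cast('double'))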
128 votes
9 answers
295k views
How to delete columns in pyspark dataframe
>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
>>> a.join(b, a.id=...
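A hedged sketch using the names from the excerpt: drop() removes the unwanted columns after the join.

c = a.join(b, a.id == b.id).drop('julian_date').drop('user_id')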
128 votes
6 answers
260k views
How to find the size or shape of a DataFrame in PySpark?
I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this.
In Python, I can do this:
data.shape
Is there a similar function in PySpark? This ...
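A short sketch: there is no built-in shape, but it can be assembled from count() and the columns list (df is assumed to be the DataFrame in question).

n_rows = df.count()
n_cols = len(df.columns)
print((n_rows, n_cols))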
127 votes
20 answers
197k views
importing pyspark in python shell
This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/...
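One commonly suggested sketch, assuming the separate findspark package is installed (pip install findspark); it locates SPARK_HOME and adds the PySpark libraries to sys.path before the import.

import findspark
findspark.init()            # or findspark.init('/path/to/spark')

import pyspark
sc = pyspark.SparkContext(appName="shell-test")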
126 votes
5 answers
225k views
How to kill a running Spark application?
I have a running Spark application that occupies all the cores, so my other applications won't be allocated any resources.
I did some quick research and people suggested using YARN kill or /bin/...
121 votes
13 answers
380k views
Load CSV file with Spark
I'm new to Spark and I'm trying to read CSV data from a file with Spark.
Here's what I am doing:
sc.textFile('file.csv')
.map(lambda line: (line.split(',')[0], line.split(',')[1]))
.collect()
...
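A minimal sketch of the modern route (Spark 2.x+), assuming a SparkSession named spark: the DataFrame reader parses CSV directly, so there is no need to split lines by hand (column names below are hypothetical).

df = spark.read.csv('file.csv', header=True, inferSchema=True)
df.select('col0', 'col1').show()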