Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

275 votes · 20 answers · 475k views

How to change dataframe column names in pyspark?

I come from a pandas background and am used to reading data from CSV files into a dataframe and then changing the column names to something useful with the simple command: df.columns = ...
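A hedged sketch of the common answers, assuming a Spark 2.x+ SparkSession named spark (the column names here are invented for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["c1", "c2"])

    # Rename one column at a time
    df = df.withColumnRenamed("c1", "id")

    # Or replace every name at once, the closest match to pandas' df.columns = [...]
    df = df.toDF("id", "label")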
205 votes · 2 answers · 56k views

Spark performance for Scala vs Python

I prefer Python over Scala. But since Spark is natively written in Scala, I was expecting my code to run faster in the Scala version than in the Python one, for obvious reasons. With that assumption, I thought ...
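One factor that usually dominates this comparison: Python UDFs ship each row out to a Python worker process and back, while built-in column functions stay in the JVM, where the driver language matters little. A minimal sketch using standard pyspark.sql.functions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("spark",), ("pyspark",)], ["word"])

    # Python UDF: every row is serialized to a Python worker and back
    py_len = F.udf(lambda s: len(s), IntegerType())
    slow = df.withColumn("len", py_len("word"))

    # Built-in function: executes entirely inside the JVM
    fast = df.withColumn("len", F.length("word"))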
182 votes · 3 answers · 257k views

How to add a constant column in a Spark DataFrame?

I want to add a column to a DataFrame with some arbitrary value (the same for each row). I get an error when I use withColumn as follows: dt.withColumn('new_column', 10).head(5) ...
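The error comes from passing a bare Python value where a Column is expected; lit() wraps the constant. A sketch with a stand-in DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    dt = spark.createDataFrame([(1,), (2,)], ["id"])  # stand-in for the question's dt

    # withColumn needs a Column expression, so wrap the constant in lit()
    dt = dt.withColumn('new_column', F.lit(10))
    dt.head(5)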
171 votes · 16 answers · 153k views

How to turn off INFO logging in Spark?

I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also complete the Quick Start guide successfully. However, I ...
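For a running session the log level can be raised directly; a sketch, assuming an already-created SparkSession:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Drop INFO chatter for this session; "ERROR" is quieter still
    spark.sparkContext.setLogLevel("WARN")

For a permanent change, copy conf/log4j.properties.template to conf/log4j.properties and set log4j.rootCategory=WARN, console (recent Spark releases ship a log4j2.properties.template instead).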
167 votes · 10 answers · 415k views

How do I add a new column to a Spark DataFrame (using PySpark)?

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column. I've tried the following without any success:

    type(randomed_hours)  # => list
    # Create in Python and transform ...
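The usual diagnosis: a Python list cannot be attached directly; the new column must be a Column expression (or the list joined in as its own DataFrame). A minimal sketch with invented column names:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])

    df = df.withColumn("flag", F.lit(0))                   # constant column
    df = df.withColumn("total", F.col("a") + F.col("b"))   # derived column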
148 votes · 10 answers · 403k views

Filter Pyspark dataframe column with None value

I'm trying to filter a PySpark dataframe that has None as a row value:

    df.select('dt_mvmt').distinct().collect()
    [Row(dt_mvmt=u'2016-03-27'),
     Row(dt_mvmt=u'2016-03-28'),
     Row(dt_mvmt=u'2016-03-29'),...
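SQL NULL follows three-valued logic, so == None never matches; the dedicated predicates do. A sketch reproducing the question's column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('2016-03-27',), (None,)], ['dt_mvmt'])

    df.filter(df.dt_mvmt.isNotNull()).show()  # rows with a value
    df.filter(df.dt_mvmt.isNull()).show()     # only the NULL rows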
144 votes · 10 answers · 318k views

Show distinct column values in pyspark dataframe

With a PySpark dataframe, how do you do the equivalent of pandas df['col'].unique()? I want to list all the unique values in a PySpark dataframe column. Not the SQL-type way (registertemplate then ...
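A sketch of the non-SQL route, assuming the distinct result is small enough to collect to the driver:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('a',), ('a',), ('b',)], ['col'])

    # Distinct values as a plain Python list
    values = [row.col for row in df.select('col').distinct().collect()]

    # Or just for inspection
    df.select('col').distinct().show()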
142 votes · 11 answers · 307k views

Convert spark DataFrame column to python list

I work on a dataframe with two columns, mvv and count:

    +---+-----+
    |mvv|count|
    +---+-----+
    | 1 | 5 |
    | 2 | 9 |
    | 3 | 3 |
    | 4 | 1 |

I would like to obtain two lists containing mvv values and ...
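A sketch, assuming the DataFrame fits on the driver once collected:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 5), (2, 9), (3, 3), (4, 1)], ['mvv', 'count'])

    rows = df.collect()  # pulls everything to the driver
    mvv = [r['mvv'] for r in rows]
    # 'count' collides with Row's tuple method, so index by name
    cnt = [r['count'] for r in rows]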
138 votes · 16 answers · 147k views

How to check if spark dataframe is empty?

Right now I have to use df.count() > 0 to check whether the DataFrame is empty, but that is inefficient. Is there a better way to do it? PS: I want to check if it's empty so that I only ...
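Two cheaper checks than a full count, sketched against a throwaway DataFrame (Spark 3.3+ also has df.isEmpty()):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,)], ['id'])

    # Fetch at most one row instead of counting them all
    is_empty = len(df.take(1)) == 0

    # Or ask the underlying RDD, which also short-circuits
    is_empty = df.rdd.isEmpty()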
131 votes · 5 answers · 332k views

How to change a dataframe column from String type to Double type in PySpark?

I have a dataframe with a column of String type and want to change the column to Double type in PySpark. Here is how I did it: toDoublefunc = UserDefinedFunction(lambda x: x, DoubleType()) ...
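No UDF is needed; Column.cast converts natively. A sketch with an invented column name:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('1.5',), ('2.0',)], ['price'])

    # cast() stays in the JVM; unparseable strings become NULL
    df = df.withColumn('price', df['price'].cast(DoubleType()))
    # equivalently: df['price'].cast('double')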
128 votes · 9 answers · 295k views

How to delete columns in pyspark dataframe

    >>> a
    DataFrame[id: bigint, julian_date: string, user_id: bigint]
    >>> b
    DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
    >>> a.join(b, a.id=...
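A sketch of the usual answer, rebuilding a small stand-in for the question's table:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    a = spark.createDataFrame([(1, '2016001', 10)], ['id', 'julian_date', 'user_id'])

    # drop() returns a new DataFrame without the named columns
    a = a.drop('julian_date', 'user_id')

On old Spark versions that accept only one column per call, chain the drops instead: a.drop('julian_date').drop('user_id').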
128 votes · 6 answers · 260k views

How to find the size or shape of a DataFrame in PySpark?

I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this. In pandas, I can do data.shape. Is there a similar function in PySpark? This ...
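There is no single shape attribute; the usual workaround combines a row count with the schema width. A sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'val'])

    rows = df.count()        # triggers a job over the data
    cols = len(df.columns)   # metadata only, no job
    print((rows, cols))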
127 votes · 20 answers · 197k views

importing pyspark in python shell

This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/...
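Assuming Spark is installed but its Python bindings are not on sys.path, one common fix is the third-party findspark helper (the manual alternative is exporting SPARK_HOME and adding python/ plus the bundled py4j zip to PYTHONPATH):

    # pip install findspark  (third-party helper, not part of Spark)
    import findspark
    findspark.init()  # locates SPARK_HOME and patches sys.path

    import pyspark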
126 votes · 5 answers · 225k views

How to kill a running Spark application?

I have a running Spark application that occupies all the cores, so my other applications won't be allocated any resources. I did some quick research and people suggested using YARN kill or /bin/...
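On YARN the usual sequence is the two commands below; the application id is whatever -list reports, not a value to type literally:

    yarn application -list                    # find the id of the running app
    yarn application -kill <application_id>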
121 votes · 13 answers · 380k views

Load CSV file with Spark

I'm new to Spark and I'm trying to read CSV data from a file. Here's what I am doing:

    sc.textFile('file.csv')
      .map(lambda line: (line.split(',')[0], line.split(',')[1]))
      .collect()

...
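On Spark 2.0+ the built-in CSV reader replaces the textFile-and-split approach, which breaks on quoted commas. A sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # header/inferSchema are optional; both default to off
    df = spark.read.csv('file.csv', header=True, inferSchema=True)

On Spark 1.x the same goes through the external spark-csv package (com.databricks.spark.csv).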
