Use SparkR DataFrame for Selecting, Filtering, and...

suxinji · 04-06-2022

Overview

In this recipe, you'll learn how to use the select(), filter(), arrange(), and group_by() functions to work with a Spark DataFrame in R.

select() is used to extract columns
filter() is used to keep only the rows that match a condition
arrange() is used to sort rows by one or more columns
group_by() is used to group the DataFrame by the specified columns

You can refer to a column either as a Column object, select(df, df$column_name), or by passing its name as a string, select(df, "column_name").

About the Data Set - Titanic Survivor

This data set comes from the Kaggle competition "Titanic - Machine Learning from Disaster".

Each record is a Titanic passenger, identified by PassengerId.
Survived = 1 means that the passenger survived the disaster and Survived = 0 means that the passenger perished.
Age is the passenger's age in years, which may be fractional.
Pclass is the ticket class: 1 for 1st class, 2 for 2nd class, and 3 for 3rd class.
Sex has two values, male and female.
SibSp is the number of siblings and spouses aboard the Titanic.
Parch is the number of parents and children aboard the Titanic.
Ticket is the ticket number.
Fare is the fare that the passenger paid.
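
The Solution below works on a SparkDataFrame named df, but the recipe does not show how it is created. Here is a minimal sketch of one way to load it, assuming the Kaggle train.csv file has been saved locally. The path and the titanic_path variable are hypothetical, so adjust them to your environment.

```r
library(SparkR)

# Start a Spark session, or attach to one that already exists
sparkR.session()

# Hypothetical location of the Kaggle train.csv file; replace with your own path
titanic_path <- "/tmp/titanic/train.csv"

# Read the CSV into a SparkDataFrame, using the first row as the header
# and letting Spark infer the column types
df <- read.df(titanic_path, source = "csv", header = "true", inferSchema = "true")

# Check that columns such as PassengerId, Survived, Pclass, and Fare are present
printSchema(df)
head(df, 5)
```

With inferSchema enabled, the numeric columns should be read as numeric types, which the group_by() example relies on when averaging.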

Solution

# select()
df_sparkr_select <- select(df, df$PassengerId, df$Name, df$Survived)
# You can also pass in column names as strings
df_sparkr_select <- select(df, "PassengerId", "Name", "Survived")

# filter()
df_sparkr_filter <- filter(df, df$Survived == 0)

# arrange()
df_sparkr_arrange <- arrange(df, df$Pclass)

# group_by()
# Compute the average of every numeric column, grouped by Pclass.
# Use the full df here, because df_sparkr_select has no Pclass column.
df_sparkr_group_by <- avg(group_by(df, df$Pclass))
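
The four functions can also be combined and given richer arguments. The sketch below assumes df is the Titanic SparkDataFrame used above. The result names (df_female_survivors, df_by_fare, df_by_class) and the aggregate column names (survival_rate, passengers) are illustrative choices, not part of the original recipe.

```r
library(SparkR)

# Assumes `df` is the Titanic SparkDataFrame from the Solution above

# filter(): combine conditions with the vectorized & operator (not &&)
df_female_survivors <- filter(df, df$Survived == 1 & df$Sex == "female")

# arrange(): wrap a column in desc() to sort from highest to lowest fare
df_by_fare <- arrange(df, desc(df$Fare))

# group_by() with agg(): named arguments become the output column names.
# Survived is 0 or 1, so its average is the share of passengers who survived.
df_by_class <- agg(
  group_by(df, "Pclass"),
  survival_rate = avg(df$Survived),
  passengers = count(df$PassengerId)
)

# Nothing is computed until a result is requested;
# head() runs the query and returns a local R data.frame
head(arrange(df_by_class, "Pclass"))
```

Using agg() with named aggregates keeps the output to the columns you need, whereas avg() on the grouped data averages every numeric column, including identifiers such as PassengerId.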