In this recipe, you'll learn how to use the select(), filter(), arrange(), and group_by() functions to work with Spark DataFrames in R.
- select() is used to extract columns
- filter() is used to filter the rows of the DataFrame
- arrange() is used to sort rows by columns
- group_by() is used to group the DataFrame by specified columns
To select a column, either reference it as a Column object, as in select(df, df$column_name), or pass its name as a string, as in select(df, "column_name").
About the Data Set - Titanic Survivor
This data set comes from the Kaggle competition Titanic - Machine Learning from Disaster.
- Each record is a Titanic passenger, identified by PassengerId.
- Survived = 1 means that the passenger survived the disaster and Survived = 0 means that the passenger perished.
- Age is the passenger's age in years; it can be fractional.
- Pclass is the ticket class: 1 is 1st class, 2 is 2nd class, and 3 is 3rd class.
- Sex has two values, male and female.
- SibSp is the number of siblings / spouses aboard the Titanic.
- Parch is the number of parents / children aboard the Titanic.
- Ticket is the ticket number.
- Fare is the fare that the passenger paid.
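# Select columns by referencing them as Column objects
df_sparkr_select <- select(df, df$PassengerId, df$Name, df$Survived)
# You can also pass in column names as strings
df_sparkr_select <- select(df, "PassengerId", "Name", "Survived")
# Keep only the passengers who did not survive
df_sparkr_filter <- filter(df, df$Survived == 0)
# Sort rows by ticket class (ascending by default)
df_sparkr_arrange <- arrange(df, df$Pclass)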
# Compute the average of all numeric columns grouped by Pclass.
# Group the full DataFrame: df_sparkr_select has no Pclass column.
df_sparkr_group_by <- avg(group_by(df, df$Pclass))
After applying the Spark DataFrame functions, use display(), an Incorta Notebook extension function, to display the data.
Use Select to get a subset of columns.
Use Filter to get only the passengers that perished.
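Filter conditions can also be combined. The following is a minimal sketch, assuming a SparkR session is available; the small data frame and its rows are made up for illustration and only mimic the Titanic columns. It keeps the passengers who perished and travelled in 3rd class, first with Column expressions and then with an equivalent SQL-style string.

```r
library(SparkR)
sparkR.session()  # reuses the existing session if one is already running

# Hypothetical sample rows, not taken from the real data set
local_df <- data.frame(
  PassengerId = 1:5,
  Survived    = c(0L, 1L, 1L, 0L, 0L),
  Pclass      = c(3L, 1L, 3L, 1L, 3L)
)
df <- createDataFrame(local_df)

# Combine Column conditions with & (and) or | (or)
perished_3rd <- filter(df, df$Survived == 0 & df$Pclass == 3)

# The same condition written as a SQL expression string
perished_3rd_sql <- filter(df, "Survived = 0 AND Pclass = 3")

# collect() brings the rows back as a local R data.frame
result <- collect(perished_3rd)
result_sql <- collect(perished_3rd_sql)
```

In an Incorta notebook, display(perished_3rd) could be used in place of collect() to view the rows.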
Sort the data by Pclass.
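arrange() sorts in ascending order by default. As a hedged sketch of a multi-column sort, the example below orders a made-up sample (hypothetical rows that only mimic the Titanic columns) by Pclass ascending and then by Fare descending within each class, using SparkR's desc() on the Column.

```r
library(SparkR)
sparkR.session()  # reuses the existing session if one is already running

# Hypothetical sample rows, not taken from the real data set
local_df <- data.frame(
  PassengerId = 1:5,
  Pclass      = c(3L, 1L, 3L, 1L, 3L),
  Fare        = c(7.25, 71.28, 7.92, 51.86, 8.05)
)
df <- createDataFrame(local_df)

# Sort by Pclass ascending, then by Fare descending within each class
sorted <- arrange(df, df$Pclass, desc(df$Fare))

# collect() returns the rows in sorted order as a local R data.frame
sorted_local <- collect(sorted)
```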
Get the average of Age and Fare by Pclass. On average, first-class passengers were older and paid higher fares.
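To average only specific columns rather than every numeric column, agg() can be applied to the grouped data. This is a sketch under assumptions: the sample rows are invented for illustration, and the output names avg_age and avg_fare are hypothetical aliases chosen here, not names from the original recipe.

```r
library(SparkR)
sparkR.session()  # reuses the existing session if one is already running

# Hypothetical sample rows, not taken from the real data set
local_df <- data.frame(
  PassengerId = 1:5,
  Pclass      = c(3L, 1L, 3L, 1L, 3L),
  Age         = c(22, 38, 26, 54, 35),
  Fare        = c(7.25, 71.28, 7.92, 51.86, 8.05)
)
df <- createDataFrame(local_df)

# Average only Age and Fare per ticket class; the argument names
# become the names of the output columns
avg_by_class <- agg(
  group_by(df, df$Pclass),
  avg_age  = avg(df$Age),
  avg_fare = avg(df$Fare)
)

# Grouped output has no guaranteed order, so sort before collecting
avg_local <- collect(arrange(avg_by_class, "Pclass"))
```

Naming the aggregates keeps the result limited to the columns of interest and avoids averaging identifier columns such as PassengerId.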