suxinji
Employee Alumni

Overview

In this recipe, you'll learn how to use the SparkR functions select(), filter(), arrange(), and group_by() to work with a Spark DataFrame in R.

  • select() is used to extract columns
  • filter() is used to keep only the rows that match a condition
  • arrange() is used to sort rows by one or more columns
  • group_by() is used to group the DataFrame by the specified columns

You can refer to a column either as a Column object, as in select(df, df$column_name), or by its name as a string, as in select(df, "column_name").

About the Data Set - Titanic Survivor

This is a data set from Kaggle - Titanic - Machine Learning from Disaster.

  • Each record is a Titanic passenger, identified by PassengerId.
  • Survived = 1 means that the passenger survived the disaster and Survived = 0 means that the passenger perished.
  • Age is the passenger's age in years and may be fractional.
  • Pclass is the ticket class: 1 is 1st class, 2 is 2nd class, and 3 is 3rd class.
  • Sex has two values, male and female.
  • SibSp is the number of siblings and spouses aboard the Titanic.
  • Parch is the number of parents and children aboard the Titanic.
  • Ticket is the ticket number.
  • Fare is the fare that the passenger paid.
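
To experiment with this column layout without the full Kaggle file, a small Spark DataFrame can be built from a local R data.frame. This is only a sketch: the rows below are invented placeholders (not records from the data set), and the session setup assumes a local Spark installation that SparkR can find.

```r
library(SparkR)

# Start (or reuse) a Spark session; assumes Spark is installed locally.
sparkR.session(master = "local[1]")

# Hypothetical rows that only mimic the column layout described above.
local_df <- data.frame(
  PassengerId = 1:4,
  Survived    = c(0L, 1L, 1L, 0L),
  Pclass      = c(3L, 1L, 2L, 3L),
  Name        = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
  Sex         = c("male", "female", "female", "male"),
  Age         = c(22, 38, 26.5, 35),
  SibSp       = c(1L, 1L, 0L, 0L),
  Parch       = c(0L, 0L, 0L, 0L),
  Ticket      = c("T-1", "T-2", "T-3", "T-4"),
  Fare        = c(7.25, 71.28, 13.0, 8.05),
  stringsAsFactors = FALSE
)

# Convert the local data.frame into a Spark DataFrame.
df <- createDataFrame(local_df)
printSchema(df)
```

In an Incorta notebook, df would normally come from an existing table rather than from hand-written rows.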

Solution

# select()
df_sparkr_select <- select(df, df$PassengerId, df$Name, df$Survived)
# You can also pass the column names as strings
df_sparkr_select <- select(df, "PassengerId", "Name", "Survived")

# filter()
df_sparkr_filter <- filter(df, df$Survived == 0)

# arrange()
df_sparkr_arrange <- arrange(df, df$Pclass)

# group_by()
# compute the average of every numeric column, grouped by Pclass
# (group on df, because df_sparkr_select does not contain the Pclass column)
df_sparkr_group_by <- avg(group_by(df, df$Pclass))

After applying the Spark DataFrame functions, use display(), an Incorta Notebook extension function, to show the resulting data.
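
display() is specific to the Incorta notebook. In a plain SparkR session, the standard SparkR functions head() and showDF() can be used to inspect a result instead. A minimal sketch, assuming a local Spark installation and a made-up three-row DataFrame:

```r
library(SparkR)
sparkR.session(master = "local[1]")

# Hypothetical sample rows, for illustration only.
df <- createDataFrame(data.frame(
  PassengerId = 1:3,
  Survived    = c(0L, 1L, 0L)
))

df_perished <- filter(df, df$Survived == 0)

# head() returns the first rows as a local R data.frame.
first_rows <- head(df_perished, 5)
print(first_rows)

# showDF() prints the rows as a text table (up to 20 rows by default).
showDF(df_perished)
```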

Use select() to get a subset of columns.


Use filter() to keep only the passengers who perished.

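
filter() also accepts compound conditions, either as Column expressions joined with & and | or as a SQL-style string. The sketch below uses a made-up four-row DataFrame and assumes a local Spark installation:

```r
library(SparkR)
sparkR.session(master = "local[1]")

# Hypothetical sample rows, for illustration only.
df <- createDataFrame(data.frame(
  PassengerId = 1:4,
  Survived    = c(0L, 1L, 0L, 0L),
  Pclass      = c(3L, 1L, 1L, 3L)
))

# Column-expression form: combine conditions with & (and) or | (or).
perished_third <- filter(df, df$Survived == 0 & df$Pclass == 3)

# SQL-string form of the same condition.
perished_third_sql <- filter(df, "Survived = 0 AND Pclass = 3")

head(perished_third)
```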

Use arrange() to sort the data by Pclass.

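
arrange() sorts in ascending order by default. To control the direction per column, the SparkR helpers asc() and desc() can be applied to each Column. A sketch with invented rows, assuming a local Spark installation:

```r
library(SparkR)
sparkR.session(master = "local[1]")

# Hypothetical sample rows, for illustration only.
df <- createDataFrame(data.frame(
  PassengerId = 1:4,
  Pclass      = c(3L, 1L, 2L, 1L),
  Fare        = c(7.25, 71.28, 13.0, 53.1)
))

# Ascending Pclass, then highest Fare first within each class.
df_sorted <- arrange(df, asc(df$Pclass), desc(df$Fare))

# collect() brings the sorted rows back as a local R data.frame.
sorted_local <- collect(df_sorted)
print(sorted_local)
```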

Get the average Age and Fare for each Pclass. On average, first-class passengers were older and paid higher fares.

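
When only some averages are needed, agg() on the grouped data lets you name the columns to aggregate instead of averaging every numeric column. The following is a sketch with invented rows, assuming a local Spark installation; the output names avg_age and avg_fare are arbitrary choices for this example.

```r
library(SparkR)
sparkR.session(master = "local[1]")

# Hypothetical sample rows, for illustration only.
df <- createDataFrame(data.frame(
  Pclass = c(1L, 1L, 3L, 3L),
  Age    = c(40, 50, 20, 30),
  Fare   = c(80, 60, 8, 10)
))

# Average only Age and Fare within each ticket class.
df_avg <- agg(
  group_by(df, df$Pclass),
  avg_age  = avg(df$Age),
  avg_fare = avg(df$Fare)
)

# Sort by Pclass so the collected rows come back in a stable order.
avg_local <- collect(arrange(df_avg, df_avg$Pclass))
print(avg_local)
```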

Last update:
‎04-14-2022 04:27 PM