on 04-06-2022 04:43 PM - edited on 04-14-2022 04:27 PM by Tristan
In this recipe, you'll learn how to use the select(), filter(), arrange(), and group_by() functions to work with a Spark DataFrame in R.
To select columns, either reference them as select(df, df$column_name) or pass the column names as strings, as in select(df, "column_name").
The examples use the Titanic - Machine Learning from Disaster data set from Kaggle.
# select()
df_sparkr_select <- select(df, df$PassengerId, df$Name, df$Survived)
# You can also pass the column names as strings
df_sparkr_select <- select(df, "PassengerId", "Name", "Survived")
# filter()
df_sparkr_filter <- filter(df, df$Survived == 0)
# arrange()
df_sparkr_arrange <- arrange(df, df$Pclass)
# group_by()
# compute the average of Age and Fare grouped by Pclass
# (group the full df: df_sparkr_select above has no Pclass, Age, or Fare column)
df_sparkr_group_by <- avg(group_by(df, df$Pclass), "Age", "Fare")
After applying the Spark DataFrame functions, use display(), an Incorta Notebook extension function, to display the resulting data.
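Because display() is specific to the Incorta Notebook, it may help to know how to inspect the same results in a plain SparkR session. The sketch below is only an illustration: it assumes a local Spark installation, and the toy data.frame is made up to mimic a few Titanic columns rather than taken from the Kaggle file.

```r
library(SparkR)

# Start (or reuse) a local Spark session. Assumption: inside an Incorta
# Notebook a session is already provided, so this call would not be needed.
sparkR.session(master = "local[1]")

# Hypothetical toy rows that only mimic a few Titanic columns.
toy <- data.frame(
  PassengerId = c(1L, 2L, 3L, 4L),
  Pclass      = c(3L, 1L, 3L, 1L),
  Survived    = c(0L, 1L, 1L, 0L)
)
df <- createDataFrame(toy)

df_sparkr_filter <- filter(df, df$Survived == 0)

# showDF() prints rows to the console without returning them.
showDF(df_sparkr_filter)

# head() and collect() bring rows back as a local R data.frame.
first_rows <- head(df_sparkr_filter, 2)
all_rows   <- collect(df_sparkr_filter)
```

Note that collect() pulls every row to the driver, so on a large DataFrame head() or showDF() is usually the safer choice.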
Use select() to get a subset of columns.
Use filter() to keep only the passengers who perished.
Use arrange() to sort the data by Pclass.
Use group_by() with avg() to get the average Age and Fare by Pclass. On average, the passengers in first class are older and paid higher fares.
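If you want to control the names of the aggregated columns, one option is SparkR's agg() with named Column expressions. This is a minimal sketch rather than the recipe's own method: it assumes a local Spark session, and the toy rows and the avg_age and avg_fare names are hypothetical.

```r
library(SparkR)

# Assumption: a local session; inside Incorta one is already available.
sparkR.session(master = "local[1]")

# Hypothetical toy rows, not the real Titanic data.
toy <- data.frame(
  Pclass = c(1L, 1L, 3L, 3L),
  Age    = c(40, 50, 20, 30),
  Fare   = c(80, 100, 8, 12)
)
df <- createDataFrame(toy)

# Group by Pclass and name each aggregate explicitly.
by_class <- agg(
  group_by(df, df$Pclass),
  avg_age  = avg(df$Age),
  avg_fare = avg(df$Fare)
)

# Sort by Pclass, then bring the small result back as an R data.frame.
result <- collect(arrange(by_class, by_class$Pclass))
```

In an Incorta Notebook, display(by_class) would show the same grouped result without collecting it first.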