on 03-08-2022 03:18 PM
As part of data cleaning for machine learning, you may need to convert data from one data type to another. In this article, you will learn how to use Incorta Notebook to convert the data type of a column.
After you read data in Incorta Notebook, you can use incorta.printSchema(df) to view the schema of the data frame.
For example, in the House Price Dataset, MSSubClass is a categorical feature that is encoded with numbers. To use this column in the ML model, I would like to convert it from the long type to the string type.
You can use withColumn and cast to convert the data type. withColumn creates a new column or, when given the name of an existing column, replaces that column. cast converts the values of a column directly to a different type.
from pyspark.sql.types import StringType
df = df.withColumn('MSSubClass', df["MSSubClass"].cast(StringType()))
# Equivalent shorthand: pass the type name as a string instead of importing StringType
# df = df.withColumn('MSSubClass', df["MSSubClass"].cast('string'))
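When more than one column needs converting, the same withColumn and cast pattern can be wrapped in a small helper. The following is a minimal sketch, not part of Incorta or PySpark: the function name cast_columns and its type_map argument are hypothetical, and it assumes df is a PySpark data frame like the one used above.

```python
def cast_columns(df, type_map):
    """Return a data frame with each column in type_map cast to the given type name.

    Hypothetical helper: type_map maps a column name to a Spark type name
    such as "string" or "double".
    """
    for column_name, type_name in type_map.items():
        # Reusing an existing column name in withColumn replaces that column.
        df = df.withColumn(column_name, df[column_name].cast(type_name))
    return df


# Usage in a notebook, assuming df was read earlier:
# df = cast_columns(df, {"MSSubClass": "string"})
```

Columns that are not listed in type_map are left unchanged, so the helper can be applied to only the features that need a new type.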
To confirm that the conversion succeeded, use incorta.printSchema(df) again to view the schema of the data frame.
In the schema output, the data type of the MSSubClass column is now string (StringType).
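The same check can also be made in code rather than by reading the printed schema. This is a minimal sketch under the assumption that df is a PySpark data frame; the helper name column_type is hypothetical. It relies on the dtypes attribute of a PySpark data frame, which lists each column name with its type name.

```python
def column_type(df, column_name):
    """Return the Spark type name of one column, for example 'string' or 'bigint'."""
    # df.dtypes is a list of (column name, type name) pairs.
    return dict(df.dtypes)[column_name]


# Usage in a notebook, assuming df was converted earlier:
# assert column_type(df, "MSSubClass") == "string"
```

Note that dtypes reports a long column as bigint, so that is the name to expect for MSSubClass before the conversion.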