suxinji
Employee Alumni

As part of data cleaning in machine learning, you may need to convert data from one data type to another. In this article, you will learn how to use Incorta Notebook to convert the data.

After you read data in Incorta Notebook, you can use incorta.printSchema(df) to view the schema of the data frame.

suxinji_0-1646338922057.png

For example, in the House Price dataset, MSSubClass is a categorical feature that is encoded with numbers. To use this column in the ML model, I want to convert it from the long type to the string type.

You can use withColumn and cast to convert the data type. withColumn creates a new column or, when given the name of an existing column, replaces that column. cast converts a column's values directly to a different type.

 

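from pyspark.sql.types import StringType

# Cast MSSubClass to a string; reusing the existing column name replaces the column
df = df.withColumn('MSSubClass', df['MSSubClass'].cast(StringType()))

# Equivalent shorthand: pass the type name as a string instead of a DataType object
# df = df.withColumn('MSSubClass', df['MSSubClass'].cast('string'))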

suxinji_1-1646338971707.png
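If several numerically encoded columns need the same treatment, the withColumn and cast pattern above can be wrapped in a small loop. The following is only a sketch: the helper name cast_columns is hypothetical, and it assumes df is a regular PySpark DataFrame and that each type is given as a Spark type name string such as 'string'.

```python
def cast_columns(df, type_by_column):
    """Return a data frame with each listed column cast to the given type name.

    Hypothetical helper: `df` is assumed to be a PySpark DataFrame, and
    `type_by_column` maps a column name to a Spark type name such as 'string'.
    """
    for column_name, type_name in type_by_column.items():
        # withColumn with an existing column name replaces that column
        df = df.withColumn(column_name, df[column_name].cast(type_name))
    return df
```

In a notebook this could be called as df = cast_columns(df, {'MSSubClass': 'string'}), adding further entries to the dictionary for any other columns that need converting.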

 

To confirm that the conversion succeeded, use incorta.printSchema(df) again to view the schema of the data frame.

As you can see below, the data type of the column MSSubClass has been converted to StringType. 

suxinji_2-1646338971735.png
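The same check can also be done in code rather than by reading the printed schema. This is a minimal sketch, assuming the data frame is a regular PySpark DataFrame, whose dtypes attribute returns a list of (column name, type name) pairs. The helper name check_column_type is hypothetical, and the sample list is a hand-written stand-in for df.dtypes.

```python
def check_column_type(dtypes, column_name, expected_type):
    """Raise an error unless the column has the expected type name.

    `dtypes` is a list of (name, type) pairs, the shape returned by the
    dtypes attribute of a PySpark DataFrame.
    """
    types_by_name = dict(dtypes)
    if column_name not in types_by_name:
        raise KeyError(f"column not found: {column_name}")
    actual_type = types_by_name[column_name]
    if actual_type != expected_type:
        raise TypeError(
            f"{column_name} is {actual_type}, expected {expected_type}"
        )
    return actual_type


# Illustrative stand-in for df.dtypes after the conversion
sample_dtypes = [("MSSubClass", "string"), ("LotArea", "bigint")]
check_column_type(sample_dtypes, "MSSubClass", "string")
```

In a notebook, the call would be check_column_type(df.dtypes, 'MSSubClass', 'string'). Note that PySpark reports the long type as 'bigint' in dtypes.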

Last update: 03-08-2022 03:18 PM