HOWTO: Pivot a table with Python
I have worked on finding a good way to pivot data and have finally settled on a method that works well. One benefit of this approach is that when new fields are added to the source, the pivoted table picks up the additional columns automatically; just open the definition and click save. Running this against a source table with 2 million rows, it creates and loads the pivoted table in 34 seconds.
The first() function is needed because CUSTOMSTRINGVALUE is a string, so rather than a numeric aggregation function such as sum or avg I had to use first().
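Since sum and avg require numeric input, first() simply keeps the first value seen for each group. A plain-Python sketch of that behavior (an illustration only, not the PySpark implementation):

```python
# Sketch of first()-style aggregation: keep the first value seen per
# (account, code) pair and ignore any later duplicates.
def first_per_key(rows):
    out = {}
    for acct, code, value in rows:
        out.setdefault((acct, code), value)  # later values for the same key are ignored
    return out

rows = [
    ("123456", "CODEA", "VALUEA"),
    ("123456", "CODEA", "VALUEA-DUPLICATE"),  # ignored: not the first value
]
print(first_per_key(rows))  # {('123456', 'CODEA'): 'VALUEA'}
```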
I also found that I needed to create a source table containing only the fields referenced in the Python script; when I ran it against a table with additional fields, it failed silently.
The data is stored in the database like this:
ACCTNUMBER, CUSTOMFIELDCODE, CUSTOMSTRINGVALUE
123456, CODEA, VALUEA
123456, CODEB, VALUEB
The end result I needed was this:
ACCTNUMBER, CODEA, CODEB
123456, VALUEA, VALUEB
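For illustration, the same reshape can be sketched in plain Python on the sample rows above (the actual job uses PySpark):

```python
from collections import defaultdict

# Plain-Python sketch of the pivot: one output row per ACCTNUMBER, with
# each CUSTOMFIELDCODE becoming a column holding its CUSTOMSTRINGVALUE.
rows = [
    ("123456", "CODEA", "VALUEA"),
    ("123456", "CODEB", "VALUEB"),
]

pivoted = defaultdict(dict)
for acct, code, value in rows:
    pivoted[acct].setdefault(code, value)  # first() semantics on duplicates

print(dict(pivoted))  # {'123456': {'CODEA': 'VALUEA', 'CODEB': 'VALUEB'}}
```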
Here's the code that pivots the table:
from pyspark.sql.functions import first

# read() and save() are environment-provided helpers for loading and
# persisting tables.
df = read("ACCOUNTDATA.ACCOUNT_CUSTOMFIELD")

# Group by account, turn each CUSTOMFIELDCODE into a column, and take the
# first CUSTOMSTRINGVALUE seen for each account/code pair.
newdf = df.groupBy("ACCTNUMBER").pivot("CUSTOMFIELDCODE").agg(first("CUSTOMSTRINGVALUE"))

save(newdf)
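One optional tweak if load time matters at 2 million rows: PySpark's pivot() also accepts an explicit list of pivot values, which skips the extra pass Spark otherwise makes to compute the distinct CUSTOMFIELDCODE values. A sketch only (read() and save() are the same environment helpers as above, and the code list here is illustrative):

```python
from pyspark.sql.functions import first

df = read("ACCOUNTDATA.ACCOUNT_CUSTOMFIELD")

# Passing the pivot values explicitly avoids a distinct-values scan, but the
# list must then be maintained by hand, which gives up the automatic
# new-column behavior described above.
newdf = df.groupBy("ACCTNUMBER").pivot("CUSTOMFIELDCODE", ["CODEA", "CODEB"]).agg(
    first("CUSTOMSTRINGVALUE")
)

save(newdf)
```

This is a trade-off, not a fix: use it only if the set of custom field codes is stable.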