
HOWTO: Pivot a table with Python

I have been looking for a good way to pivot data and have finally settled on a method that works really well. One benefit of this approach is that if new fields are added to the source, the pivoted table picks them up automatically: just open the definition and click save. Running this against a source table with 2 million rows, it creates and loads the pivoted table in 34 seconds.

The first() function is needed because CUSTOMSTRINGVALUE is a string, so a numeric aggregation function such as sum() or avg() won't work; first() simply takes the first value seen in each group.

I also found that I needed to create a source table containing only the fields referenced in the Python script; when I ran it against a table with additional fields, it failed silently.

The data is stored in the database like this:

ACCTNUMBER, CUSTOMFIELDCODE, CUSTOMSTRINGVALUE

123456, CODEA, VALUEA

123456, CODEB, VALUEB

The end result I needed was this:

ACCTNUMBER, CODEA, CODEB

123456, VALUEA, VALUEB
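To make the transformation concrete, here is a plain-Python sketch of the same pivot logic applied to the sample rows above (no Spark involved; this is just an illustration of what the pivot does, not part of the actual job):

```python
# Plain-Python sketch of the pivot: group rows by ACCTNUMBER and turn each
# CUSTOMFIELDCODE into a column holding its first CUSTOMSTRINGVALUE.
rows = [
    ("123456", "CODEA", "VALUEA"),
    ("123456", "CODEB", "VALUEB"),
]

pivoted = {}
for acct, code, value in rows:
    record = pivoted.setdefault(acct, {"ACCTNUMBER": acct})
    # "first" semantics: keep the value already stored for this code, if any
    record.setdefault(code, value)

result = list(pivoted.values())
# result == [{"ACCTNUMBER": "123456", "CODEA": "VALUEA", "CODEB": "VALUEB"}]
```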

 

Here's the code that pivots the table:

from pyspark.sql.functions import first

# read() and save() are platform-provided helpers for loading and
# persisting tables in this environment
df = read("ACCOUNTDATA.ACCOUNT_CUSTOMFIELD")

# one row per ACCTNUMBER; each distinct CUSTOMFIELDCODE becomes a column
# holding the first CUSTOMSTRINGVALUE seen for that account/code pair
newdf = df.groupBy("ACCTNUMBER").pivot("CUSTOMFIELDCODE").agg(first("CUSTOMSTRINGVALUE"))

save(newdf)
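If you want to sanity-check the result outside Spark, pandas can express the same pivot. This is a sketch assuming pandas is available and using the sample data above, not part of the original job:

```python
import pandas as pd

# Same sample data as above, pivoted with pandas for comparison.
pdf = pd.DataFrame(
    [("123456", "CODEA", "VALUEA"), ("123456", "CODEB", "VALUEB")],
    columns=["ACCTNUMBER", "CUSTOMFIELDCODE", "CUSTOMSTRINGVALUE"],
)

# aggfunc="first" mirrors first() in the Spark version: the values are
# strings, so sum/avg would not apply.
pivoted = pdf.pivot_table(
    index="ACCTNUMBER",
    columns="CUSTOMFIELDCODE",
    values="CUSTOMSTRINGVALUE",
    aggfunc="first",
).reset_index()
```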