Time series analysis is used to predict future values based on past observations.
Time series analysis can be used in a wide variety of applications; the order quantity forecasting example in this article is just one example to spark some inspiration.
An ARIMA (AutoRegressive Integrated Moving Average) model is a popular time series forecasting method that combines autoregression (AR), differencing (I), and moving average (MA) components to make predictions about future data points in a time series.
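As a minimal standalone sketch of these components (not part of the Incorta notebook below; the synthetic series y and the (1, 1, 1) order are arbitrary assumptions for illustration), an ARIMA model can be fit directly with pmdarima:
import numpy as np
import pmdarima as pm
# synthetic series used only to illustrate the (p, d, q) notation
y = np.arange(100, dtype=float) + np.random.normal(scale=5.0, size=100)
# ARIMA(1, 1, 1): one autoregressive lag (AR), first-order differencing (I), one moving-average lag (MA)
model = pm.ARIMA(order=(1, 1, 1), suppress_warnings=True)
model.fit(y)
# forecast the next 5 points
print(model.predict(n_periods=5))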
Leverage an Incorta Analyzer table to prepare your data for modeling. When preparing the data, pay attention to common data issues that could degrade the model's predictive power and performance.
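Common issues include missing values, duplicate or unsorted timestamps, and extreme outliers. As a rough sketch (assuming the data has already been brought into a Pandas data frame named pdf with the ORDERED_DATE and ORDERED_QUANTITY columns used later in this article), a quick data-quality check might look like:
print(pdf['ORDERED_QUANTITY'].isna().sum())    # missing values
print(pdf['ORDERED_DATE'].duplicated().sum())  # duplicate timestamps
print(pdf['ORDERED_QUANTITY'].describe())      # min/max help spot extreme outliers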
In the code example below, we will use pmdarima, a Pandas-friendly Python library for ARIMA modeling.
To install the library in your cloud, please refer to this article: Installing Python Packages in the cloud.
Using the interactive notebooks (Python), enter the following code. Note that if you are referencing another MV, the code for this MV will need to be added in a separate schema.
# load data
df = read("Orders.Order_Source_Analyzer")
incorta.show(df)
# create temp view
df.createOrReplaceTempView('Order_Source_Analyzer')
# create a data frame using a SQL query against the Incorta Analyzer table
df_item_with_data = spark.sql("""
select ORGANIZATION_CODE,
INVENTORY_ITEM_ID,
count(*) ROWCOUNT,
MIN(ORDERED_DATE) EARLY_DATE,
MAX(ORDERED_DATE) LATE_DATE,
DATEDIFF( MAX(ORDERED_DATE),MIN(ORDERED_DATE) ) EXPECTED_DAYS,
AVG(ORDERED_QUANTITY) AVG_QTY,
STDDEV(ORDERED_QUANTITY) STDDEV_QTY,
MIN(ORDERED_QUANTITY) MIN_QTY,
MAX(ORDERED_QUANTITY) MAX_QTY
from Order_Source_Analyzer
--where INVENTORY_ITEM_ID = 11923
group by
ORGANIZATION_CODE,
INVENTORY_ITEM_ID
having count(*) > 300
""")
# count the org/item combinations that have enough history
df_item_with_data.count()
incorta.show(df_item_with_data)
from pyspark.sql.functions import *
# filter org and inventory item, M2, 11923
df_item = df.filter("ORGANIZATION_CODE = 'M2' AND INVENTORY_ITEM_ID = 11923")
# optionally drop columns
# df_item = df_item.drop("ORGANIZATION_CODE","INVENTORY_ITEM_ID", "LINE_ID")
# We need to sort the data before sending the data into Pandas
df_item = df_item.sort("ORDERED_DATE")
incorta.show(df_item)
import pandas as pd
# convert Spark data frame to Pandas data frame
pdf = df_item.toPandas()
# set index
pdf = pdf.set_index(pdf['ORDERED_DATE'])
import matplotlib.pyplot as plt
# set figure size
plt.figure(figsize=(16,10))
# create Line Plot
plt.plot(pdf['ORDERED_QUANTITY'])
plt.xlabel('Date', fontsize=22)
plt.ylabel('Ordered QTY', fontsize=22)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Typically, the training and testing data are split using a cut-off date when creating a time series model.
startdate = pd.to_datetime("2006-3-1").date()
enddate = pd.to_datetime("2010-3-1").date()
outlier_data = pdf.loc[:startdate]
train_data = pdf.loc[startdate:enddate]
test_data = pdf.loc[pdf['ORDERED_DATE'] > enddate]
# check the row counts of each split
len(pdf), len(outlier_data), len(train_data), len(test_data)
# plot the training and testing data
plt.plot(train_data['ORDERED_QUANTITY'])
plt.plot(test_data['ORDERED_QUANTITY'])
from pmdarima import auto_arima
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
train_data.head()
# apply auto_arima function
model = auto_arima(train_data['ORDERED_QUANTITY'], trace = True, suppress_warnings=True)
model.summary()
# predict data
prediction = pd.DataFrame(model.predict(n_periods = len(test_data)),index=test_data.index)
prediction.columns = ['FORECAST_ORDERED_QUANTITY']
prediction['ORDERED_DATE'] = prediction.index
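# Optional (a sketch, not part of the original flow): compare the forecast
# against the held-out test data using a simple root mean squared error
import numpy as np
rmse = np.sqrt(np.mean((test_data['ORDERED_QUANTITY'].values - prediction['FORECAST_ORDERED_QUANTITY'].values) ** 2))
print(rmse)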
# convert Pandas data frame to Spark data frame
output_df = spark.createDataFrame(prediction)
incorta.show(output_df)
save(output_df)