Time series analysis is used to predict future values based on past observations.
Time series analysis can be used in a wide variety of applications; the order quantity forecasting example in this article is just one example to spark some inspiration.
An ARIMA (AutoRegressive Integrated Moving Average) model is a popular time series forecasting method that combines autoregression (AR), differencing (I), and moving average (MA) components to make predictions about future data points in a time series.
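As a minimal standalone sketch of these components (not part of the Incorta notebook below; the synthetic series y and the (1, 1, 1) order are arbitrary assumptions for illustration), an ARIMA model can be fit directly with pmdarima:
import numpy as np
import pmdarima as pm
# synthetic series used only to illustrate the (p, d, q) notation
y = np.arange(100, dtype=float) + np.random.normal(scale=5.0, size=100)
# ARIMA(1, 1, 1): one autoregressive lag (AR), first-order differencing (I), one moving-average lag (MA)
model = pm.ARIMA(order=(1, 1, 1), suppress_warnings=True)
model.fit(y)
# forecast the next 5 points
print(model.predict(n_periods=5))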
Leverage an Incorta Analyzer table to prepare your data for modeling. When preparing the data, pay attention to common data issues that could degrade the model's predictive power and performance.
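Common issues include missing values, duplicate or unsorted timestamps, and extreme outliers. As a rough sketch (assuming the data has already been brought into a Pandas data frame named pdf with the ORDERED_DATE and ORDERED_QUANTITY columns used later in this article), a quick data-quality check might look like:
print(pdf['ORDERED_QUANTITY'].isna().sum())    # missing values
print(pdf['ORDERED_DATE'].duplicated().sum())  # duplicate timestamps
print(pdf['ORDERED_QUANTITY'].describe())      # min/max help spot extreme outliers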
In the code example below, we will use pmdarima, a Pandas-friendly Python library for ARIMA modeling.
To install the library in your cloud, please refer to this article: Installing Python Packages in the cloud.
Using the interactive notebooks (Python), enter the following code. Note that if you are referencing another MV, the code for this MV will need to be added in a separate schema.
# load data
df = read("Orders.Order_Source_Analyzer")
incorta.show(df)
# create temp view
df.createOrReplaceTempView('Order_Source_Analyzer')
# create a data frame using a SQL query against the Incorta Analyzer table
df_item_with_data = spark.sql("""
select ORGANIZATION_CODE,
INVENTORY_ITEM_ID,
count(*) ROWCOUNT,
MIN(ORDERED_DATE) EARLY_DATE,
MAX(ORDERED_DATE) LATE_DATE,
DATEDIFF( MAX(ORDERED_DATE),MIN(ORDERED_DATE) ) EXPECTED_DAYS,
AVG(ORDERED_QUANTITY) AVG_QTY,
STDDEV(ORDERED_QUANTITY) STDDEV_QTY,
MIN(ORDERED_QUANTITY) MIN_QTY,
MAX(ORDERED_QUANTITY) MAX_QTY
from Order_Source_Analyzer
--where INVENTORY_ITEM_ID = 11923
group by
ORGANIZATION_CODE,
INVENTORY_ITEM_ID
having count(*) > 300
""")
# count the org/item combinations that have enough history
df_item_with_data.count()
incorta.show(df_item_with_data)
from pyspark.sql.functions import *
# filter org and inventory item, M2, 11923
df_item = df.filter("ORGANIZATION_CODE = 'M2' AND INVENTORY_ITEM_ID = 11923")
# optionally drop columns
# df_item = df_item.drop("ORGANIZATION_CODE","INVENTORY_ITEM_ID", "LINE_ID")
# We need to sort the data before sending the data into Pandas
df_item = df_item.sort("ORDERED_DATE")
incorta.show(df_item)
import pandas as pd
# convert Spark data frame to Pandas data frame
pdf = df_item.toPandas()
# set index
pdf = pdf.set_index(pdf['ORDERED_DATE'])
import matplotlib.pyplot as plt
# set figure size
plt.figure(figsize=(16,10))
# create Line Plot
plt.plot(pdf['ORDERED_QUANTITY'])
plt.xlabel('Date', fontsize=22)
plt.ylabel('Ordered QTY', fontsize=22)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Typically, the training and testing data are split using a cut-off date when creating a time series model.
startdate = pd.to_datetime("2006-3-1").date()
enddate = pd.to_datetime("2010-3-1").date()
outlier_data = pdf.loc[:startdate]
train_data = pdf.loc[startdate:enddate]
test_data = pdf.loc[pdf['ORDERED_DATE'] > enddate]
# check the row counts of each split
len(pdf), len(outlier_data), len(train_data), len(test_data)
# plot the training and testing data
plt.plot(train_data['ORDERED_QUANTITY'])
plt.plot(test_data['ORDERED_QUANTITY'])
from pmdarima import auto_arima
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
train_data.head()
# apply auto_arima function
model = auto_arima(train_data['ORDERED_QUANTITY'], trace = True, suppress_warnings=True)
model.summary()
# predict data
prediction = pd.DataFrame(model.predict(n_periods = len(test_data)),index=test_data.index)
prediction.columns = ['FORECAST_ORDERED_QUANTITY']
prediction['ORDERED_DATE'] = prediction.index
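# Optional (a sketch, not part of the original flow): compare the forecast
# against the held-out test data using a simple root mean squared error
import numpy as np
rmse = np.sqrt(np.mean((test_data['ORDERED_QUANTITY'].values - prediction['FORECAST_ORDERED_QUANTITY'].values) ** 2))
print(rmse)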
# convert Pandas data frame to Spark data frame
output_df = spark.createDataFrame(prediction)
incorta.show(output_df)
save(output_df)