Time Series Forecasting with ARIMA

Jayanti Prasad, Ph.D.
8 min read · Jun 14, 2023


Before I begin, allow me to offer a few general comments:

  • This tutorial uses synthetic data, which is advantageous because you won’t need to acquire any external data that cannot be modified. Here, you have the freedom to alter the data and adjust the modeling process to make it more challenging or easier (please refer to the function provided at the end for creating the synthetic data).
  • Additionally, certain intricate theoretical topics, such as “unit roots,” will be explained elsewhere.

If you have specific questions, please don’t hesitate to ask in the comments.

Let’s get started!

The full form of ARIMA is AutoRegressive Integrated Moving Average. There are three different concepts involved here: (1) AutoRegressive, (2) Integrated (which actually refers to differencing), and (3) Moving Average.

Before delving into time series forecasting and how ARIMA aids in that, let me explain what time series data is.

Time series data is a type of one-dimensional data where we have measurements at fixed time intervals, such as hourly, daily, weekly, monthly, yearly, or any other fixed interval. The crucial condition is that the interval remains consistent.

Another important characteristic of time series data is that the data points are not random; they exhibit some form of correlation. For instance, what happens today to some extent depends on what happened yesterday and perhaps even the day before that, and so on. The correlation may extend far back in time over any number of intervals, or it can be zero, indicating that what happens on a given day is entirely independent of what occurs on any other day (referred to as white noise).

The value of a time series data at any time instant ‘t’ can be represented as X(t), where ‘t’ is typically a discrete variable. For example, we can have t = 1, 2, 3, 4, and so on.
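For example, here is a minimal sketch of such a series in pandas (the daily interval and the values are arbitrary, purely for illustration):

import pandas as pd

# five daily observations: a fixed interval and an ordered index
t = pd.date_range('2023-01-01', periods=5, freq='D')
X = pd.Series([10.0, 10.5, 10.2, 10.8, 11.1], index=t)
print(X)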

Realistically, the value of X(t), such as the temperature or stock price of a particular stock, may depend on some underlying processes that we may or may not understand. Time series forecasting often only requires knowledge of past data values. This universality and domain independence make time series forecasting powerful, as we don’t need to be an expert in the financial market or meteorology to forecast stock prices or temperatures (although domain knowledge can certainly be helpful!).

Similar to other types of data, time series data also contains noise that needs to be modeled. Without noise, there would be nothing to forecast, as everything would be deterministic.

Theory:

  • Auto Regression (AR)

X(t) = phi_1 * X(t-1) + phi_2 * X(t-2) + ... + phi_p * X(t-p) + n_t

This equation says that the current value X(t) depends on the past ‘p’ values, and ‘n_t’ is the noise term.
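To make this concrete, here is a minimal sketch that simulates an AR(2) process; the coefficients phi_1 = 0.6 and phi_2 = 0.3 are arbitrary choices (kept small enough for the process to be stationary):

import numpy as np

np.random.seed(42)
n, phi1, phi2 = 500, 0.6, 0.3
noise = np.random.normal(0.0, 1.0, n)
x = np.zeros(n)
for t in range(2, n):
    # current value = weighted sum of the two previous values + noise
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + noise[t]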

  • Moving Average (MA)

The name is misleading! There is no actual moving average here; what it tells us is that the current value depends on past ‘error/noise’ or ‘mismatch’ terms:

X(t) = psi_1 * E(t-1) + psi_2 * E(t-2) + ... + psi_q * E(t-q) + n_t
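Similarly, a minimal sketch simulating an MA(2) process; the coefficients psi_1 = 0.8 and psi_2 = 0.4 are again arbitrary:

import numpy as np

np.random.seed(42)
n, psi1, psi2 = 500, 0.8, 0.4
noise = np.random.normal(0.0, 1.0, n)
x = np.zeros(n)
for t in range(2, n):
    # current value = current noise + weighted sum of the two previous noise terms
    x[t] = noise[t] + psi1 * noise[t - 1] + psi2 * noise[t - 2]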

  • Integration

There is no integration either! We need to find out how many times the time series must be differenced before it becomes stationary.
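In pandas, differencing is a one-liner; here is a minimal sketch (a random walk is used because it becomes stationary after a single difference):

import numpy as np
import pandas as pd

walk = pd.Series(np.cumsum(np.random.normal(size=200)))  # a random walk: non-stationary
diff1 = walk.diff().dropna()          # 1st difference: X(t) - X(t-1)
diff2 = walk.diff().diff().dropna()   # 2nd difference, if one round is not enough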

Note that I have omitted a lot of theory here!

Let us move forward now.

There are some key properties/features of time series data which play an important role in forecasting; they are as follows:

1. Stationarity

If the statistical properties, such as the mean and variance, of any stretch of the time series are the same as those of any other stretch, we call the time series stationary. Before we apply ARIMA to any time series data we need to find out whether it is stationary or not, and for that there are many statistical tests (hypothesis testing). One of the important tests is the Dickey-Fuller test. Without going into the details of the test (which I will cover elsewhere), here we will focus on how to use it to find out whether a time series is stationary or not.

As in any other hypothesis test, here too we have the following items:

  • Null hypothesis (H0): the time series is not stationary.
  • Alternative hypothesis (H1): the time series is stationary.
  • Test statistic or score.
  • Significance level (alpha), generally 0.05.

Now we have to watch for the following:

  • If the p-value is smaller than 0.05 (and the test statistic is more negative than the critical values), the null hypothesis is rejected and the time series is stationary.

Figure 1: A stationary time series

For the time series shown in Figure 1, which is very close to stationary, here are the results of the Dickey-Fuller test.

Figure 2: Results of the Dickey-Fuller test for the time series shown in Figure 1
Figure 3: A non-stationary time series
Figure 4: Results of the Dickey-Fuller test for the time series shown in Figure 3

From the above discussion it is clear that the Dickey-Fuller test does pick out the non-stationary case.
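As a minimal sketch, this is how the test is invoked directly with statsmodels (the adf_test helper in the full code at the end wraps exactly this call; ‘data’ stands for any pandas Series or 1-D array):

from statsmodels.tsa.stattools import adfuller

# run the Augmented Dickey-Fuller test on 'data'
result = adfuller(data, autolag='AIC')
print('Test statistic:', result[0])
print('p-value:', result[1])
if result[1] < 0.05:
    print('Time series is stationary')
else:
    print('Time series is not stationary')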

2. Trend

In general a trend, as shown in Figure 3, makes a time series non-stationary, and in order to make it stationary we can difference the time series a few times. The number of times we need to difference a time series to make it stationary is one of the important parameters (d) of the ARIMA model. In most cases, one or two rounds of differencing are enough. See the following example.

Figure 5: We can find out after how many rounds of differencing the time series becomes stationary

Here are the results of the Dickey-Fuller test after one round of differencing:

Figure 6: Dickey-Fuller test summary after the 1st differencing

3. Auto-Correlation

There are two other parameters, p and q, apart from d, and in order to tune their values we need to look at the following two types of auto-correlation.

  • Auto-correlation (for the ‘q’ parameter)
  • Partial auto-correlation (for the ‘p’ parameter)

Figure 7: Partial auto-correlation and auto-correlation functions

Partial auto-correlation does not consider the intermediate data points. From Figure 7 it is clear that p = 4 (look at the partial auto-correlation) and q = 1 (look at the auto-correlation function) should be good choices.
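Here is a minimal sketch of how these two functions can be computed and plotted with statsmodels (a random series stands in for your data; in the full code the same calls are applied to the synthetic series):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

x = np.random.normal(size=300)  # stand-in for your time series values
auto_cor = acf(x, nlags=40)     # auto-correlation, guides the choice of q
pauto_cor = pacf(x, nlags=40)   # partial auto-correlation, guides the choice of p

# built-in plotting helpers with confidence bands
plot_acf(x, lags=40)
plot_pacf(x, lags=40)
plt.show()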

4. Fitting the ARIMA model

Once we have some idea about the parameters p, d and q of ARIMA we can split the data into train and test sets and fit the model. We can then make predictions for the test data and compare the actual test data points with the predicted ones to see how good our forecasting is.
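A minimal sketch of the fitting step (assuming ‘data’ is the synthetic series built by create_data() in the full code, and the parameter values p = 4, d = 1, q = 2 used there):

from statsmodels.tsa.arima.model import ARIMA

# hold out the last 10 points for testing; 'data' comes from create_data()
data_train, data_test = data.iloc[:-10], data.iloc[-10:]
model = ARIMA(data_train, order=(4, 1, 2))  # order = (p, d, q)
model_fit = model.fit()
print(model_fit.summary())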

Figure 8: Fitting ARIMA
Figure 9: Summary of the fitted ARIMA model

5. Predictions

After splitting the data into train and test sets we make predictions for the test data and compare the actual data with the predictions in the following figures. As performance metrics we compute the mean squared error as well as R2 on the test data.

Figure 10: Forecasts of the trained model on the test data, compared with the actual values

Full Code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

np.random.seed(202)

def create_data():
    """
    For the current example, we use synthetic data created by this function.
    Here users have the option to:
    1) control the trend
    2) add one or more seasonality components
    3) control the Gaussian noise
    This may help a lot in understanding how we tune the model for different types of data.
    """
    # daily time stamps for one full year (ISO format avoids day/month ambiguity)
    dates = pd.date_range(start='2020-01-01', end='2020-12-31')
    days = np.linspace(0, len(dates) - 1, len(dates))
    # linear trend + two sinusoidal seasonal components + Gaussian noise
    y_trend = 0.2 + 0.1 * days
    y_seasn = 3.0 * np.sin(2.0 * np.pi * days / 7) + 1.0 * np.sin(2.0 * np.pi * days / 14)
    y_noise = np.random.normal(0.0, 2.0, [len(days)])
    data = y_trend + y_seasn + y_noise
    X = pd.Series(data=data, index=dates)
    return X


def adf_test(timeseries):
    """
    Test stationarity using the Augmented Dickey-Fuller test.
    """
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4],
                         index=['Test Statistic', 'p-value', '#Lags Used',
                                'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)
    if dfoutput['p-value'] < 0.05:
        print("Time Series is stationary")
    else:
        print("Time series is not stationary")


if __name__ == "__main__":

    # Get the data first
    data = create_data()

    # Plot and see the data
    fig, axs = plt.subplots(1, 1, figsize=(12, 6))
    axs.plot(data)
    plt.show()

    # check stationarity
    adf_test(data)

    # Plot the original series as well as the 1st and 2nd differences
    fig, axs = plt.subplots(3, 1, figsize=(12, 12))
    axs[0].plot(data)
    axs[0].set_title('Original Series')
    axs[0].axes.xaxis.set_visible(False)

    # 1st differencing
    axs[1].plot(data.diff())
    axs[1].set_title('1st Order Differencing')
    axs[1].axes.xaxis.set_visible(False)

    # 2nd differencing
    axs[2].plot(data.diff().diff())
    axs[2].set_title('2nd Order Differencing')
    plt.show()

    # we can run the ADF test again on the 1st difference
    adf_test(data.diff().dropna())

    # Now check the auto-correlation function for finding the parameter 'q'
    auto_cor = acf(data.values.squeeze())

    # Now check the partial auto-correlation function for finding the parameter 'p'
    pauto_cor = pacf(data.values)

    fig, axs = plt.subplots(1, 1, figsize=(12, 6))
    axs.plot(auto_cor, label='Auto-Correlation')
    axs.plot(pauto_cor, label='Partial Auto-correlation')
    axs.grid()
    axs.legend()
    plt.show()

    # we can also use the built-in functions for plotting
    plot_acf(data.values, lags=40)
    plot_pacf(data.values, lags=40)
    plt.show()

    # Now let us fit the ARIMA model

    # split the data into train & test parts
    data_train = data.iloc[:-10]
    data_test = data.iloc[-10:]

    # set the parameters
    p = 4
    d = 1
    q = 2
    model = ARIMA(data_train, order=(p, d, q))

    model_fit = model.fit()
    print(model_fit.summary())

    # line plot of the residuals
    residuals = pd.DataFrame(model_fit.resid)
    residuals.plot()
    plt.show()

    # density plot of the residuals
    residuals.plot(kind='kde')
    plt.show()

    # summary stats of the residuals
    print(residuals.describe())

    # forecast the next 10 points with a 95% confidence interval
    results = model_fit.get_forecast(10)
    df1 = results.summary_frame(alpha=0.05)
    print(df1.columns)

    lower = df1['mean_ci_lower']
    upper = df1['mean_ci_upper']

    # Now we can make predictions for the test data
    data_pred = model_fit.predict(start=len(data_train), end=len(data) - 1, dynamic=True)
    fig, axs = plt.subplots(2, 1, figsize=(12, 12))

    axs[0].plot(data_train, label='Training data')
    axs[0].plot(data_test, 'g-', label='Test data')
    axs[0].plot(data_pred, 'r-', label='Predictions')
    axs[0].fill_between(lower.index, lower, upper, color='bisque', label='95% confidence')
    axs[0].legend()

    # plot just the later section
    axs[1].plot(data_train.iloc[-40:], label='Training data')
    axs[1].plot(data_test, 'g-', label='Test data')
    axs[1].plot(data_pred, 'r-', label='Predictions')
    axs[1].fill_between(lower.index, lower, upper, color='bisque', label='95% confidence')
    axs[1].legend()
    plt.show()

    # now let us check the performance metrics
    y_true = data_test.values
    y_pred = data_pred.values

    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print("mse", mse)
    print("r2 =", r2)

The code given above has been carefully checked; if you copy and paste it into a Jupyter notebook it should work!

Please share your comments and feedback. I know I have left out many theoretical details here, which I will cover in another article on time series data.
