SARIMA stands for Seasonal Autoregressive Integrated Moving Average. It is a simple but quite powerful model for analyzing time series with seasonality.
For a good and detailed explanation of SARIMA theory, please refer to the official statsmodels documentation at https://www.statsmodels.org/stable/statespace.html
In this tutorial, I will not explain the theory behind the SARIMA model; I will focus only on its practical application.
We will load a sample dataset and examine its trend and seasonality, then use a grid search to find the best-fit SARIMA model, and finally fit that model on the historical data and forecast the future.
First of all, let's import all the necessary libraries into our Jupyter notebook.
Let's load the sample dataset and display its head to see how our data looks.
This is a random sample dataset that I generated with Pandas, with an artificial trend and seasonality added for the purpose of this tutorial. The data has no real meaning; it only serves to demonstrate the practical application of the SARIMA model.
Notice that this dataset is not a time series yet. The year and month are in separate columns. We need to combine them into a datetime and set it as the dataframe index to turn the dataset into a time series. This is a typical scenario with real-life data.
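As a minimal sketch of this conversion, the snippet below builds a small stand-in dataset and then combines the two columns into a datetime index. The column names `year`, `month` and `value`, the date range, and the way the values are generated are all assumptions made for illustration; they are not the tutorial's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 10 years of monthly values with trend and seasonality
dates = pd.date_range("2010-01-01", periods=120, freq="MS")
t = np.arange(120)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": dates.year,
    "month": dates.month,
    "value": 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120),
})

# Combine year and month (day fixed to 1) into a single datetime column
df["date"] = pd.to_datetime(df[["year", "month"]].assign(day=1))

# Use the datetime as the index and declare the monthly frequency explicitly
df = df.set_index("date").drop(columns=["year", "month"]).asfreq("MS")

print(df.head())
```

Declaring the frequency with `asfreq("MS")` (month start) is optional, but it lets statsmodels label forecasts with proper dates later on.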
Now, we can see that the index is no longer a numeric index, but changed to a time series index.
Let's plot the data to see how it looks.
As the plot shows, the data is a time series with an upward trend over time and clear seasonality.
We use ETS Decomposition to extract the Trend and Seasonal attributes from the data.
As we can see in the ETS decomposition plot above, the 'Trend' and 'Seasonal' components are very clear.
We can also plot ACF and PACF to examine the auto-correlation of the time series data.
The ACF and PACF plots confirm once more that the data is clearly seasonal.
At this point, we know we want to use a SARIMA model to analyze this dataset. But how do we know which order and seasonal order to use? We can split the data into train/test sets, try many different combinations of parameters (i.e. orders), and calculate the error of each combination to find the one with the smallest error. That is the best-fit model.
However, there is a more practical way, called grid search, which automates this by testing the combinations within the order ranges we provide and reporting the best model.
Now, let's use an ARIMA grid search library called pyramid-arima (since renamed to pmdarima) to find the best-fit model.
The grid search usually takes quite a long time because it has to fit a separate model for every combination of orders it tries before it can pick the best one. Note that auto_arima ranks the candidates by an information criterion (AIC by default) rather than by test-set error.
In our auto_arima execution above, it found that SARIMAX with order (0, 1, 1) and seasonal order (2, 0, 2, 12) is the best model.
That's amazing! We don't have to do lots of data examinations to find the right model but just execute the grid search and sit-and-wait for it to find out for us.
Now, we can split the data into train/test sets and test this model's predictions to see how well they compare to the test data.
Checking the length of the data set.
Split train/test datasets.
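A minimal sketch of a chronological split follows, assuming a synthetic monthly series and a hold-out of the last 12 months; the split point used in the notebook is not shown, so the 12 is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (an assumption, standing in for the tutorial's data)
idx = pd.date_range("2010-01-01", periods=120, freq="MS")
t = np.arange(120)
rng = np.random.default_rng(0)
series = pd.Series(
    100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120),
    index=idx,
)

print(len(series))  # total number of observations

# Time series must be split in order: the test set is the most recent slice
n_test = 12
train = series.iloc[:-n_test]
test = series.iloc[-n_test:]

print(len(train), len(test))
```

Unlike a random split, this keeps every training observation earlier than every test observation, so the model is never evaluated on data from its own past.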
Now, we can fit the model on the training dataset.
After training the model, we can generate predictions for the test period to see how close they are to the test data.
Let's plot to compare prediction and test data.
As we can see the prediction and the real test data are quite close.
Now, let's quantify the accuracy of the model using the root mean square error (RMSE).
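RMSE is the square root of the average squared difference between predicted and actual values. Here is a small sketch with numpy; the helper name `rmse` and the sample numbers are hypothetical.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical values, only to show the call
print(rmse([100.0, 110.0, 120.0], [102.0, 108.0, 121.0]))
```

The result is in the same units as the data, so it is easiest to judge against the typical magnitude of the series, for example its mean.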
Now we have a good model. Let's retrain it on the whole dataset and forecast into the unknown future.
Once we have the forecast, we can plot it together with the historical data to see how well they line up.
We can see that the forecast data look reasonably accurate.
In conclusion, the ARIMA/SARIMA model is simple but can be powerful and effective in analyzing time series data.
This Jupyter notebook and the sample dataset can be found in my GitHub repository here.
Happy Forecasting!