Time Series Forecasting in Python

Published: August 28, 2025

tags: python, machine-learning, time-series, forecasting

In this post we will be introducing the concept of time series forecasting, with a focus on the ARIMA framework and how this can be implemented in Python. We will be using a publicly available data set and the following open source packages:

Time series

In time series analysis we are interested in sequential data made up of a series of observations taken at regular intervals. Examples include:

Weekly hospital occupancy
Monthly sales figures
Annual global temperature

In many cases we want to use the observations up to the present day to predict (or forecast) the next N time points. For example, a hospital could reduce running costs if an appropriate number of beds are provisioned.

This is where time series modelling fits in. The most basic time series model is a simple linear regression, where we assume that the time series evolves linearly over time. For non-linear time series we can consider piecewise linear regression.

What about more complex cases where we want to accurately capture subtle variations in the data? We will now demonstrate the ARIMA framework in Python using a real world data set.

ARIMA

ARIMA stands for “Auto-Regressive Integrated Moving Average” and is made up of three key parts:

Auto-regression: captures the relationship between an observation and the last k points (often referred to as “lagged” observations).
Integration: accounts for “non-stationary” trends by taking the difference between consecutive observations (a non-stationary trend could include an overall upward trend where the mean observation is increasing over time).
Moving average: accounts for the relationship between an observation and the residual error that would result from using a moving average model applied to the lagged observations.

The three components (AR, I, MA) are controlled by the parameters (p, d, q). Setting one of these to zero will eliminate that component of the model. For example, if the time series already appears to be stationary we could set d = 0 so that we do not perform differencing.

To demonstrate ARIMA on a real-world example, let’s load in the flights data set from the seaborn library:

import seaborn as sns

flights = sns.load_dataset("flights")
flights.head()

##    year month  passengers
## 0  1949   Jan         112
## 1  1949   Feb         118
## 2  1949   Mar         132
## 3  1949   Apr         129
## 4  1949   May         121

Let’s visualise the data:

import matplotlib.pyplot as plt

plt.plot(flights["passengers"])
plt.xlabel("month")
plt.ylabel("passengers")

Time series plot displaying the number of flight passengers every month for a 12-year period. We observe a fluctuating seasonal trend with a clear peak in the summer months of each year. Both the annual average and the size of the seasonal fluctuation appear to be increasing over time. At month 0 we observe just over 100 passengers and at month 144 we observe over 400 passengers.

The data includes the number of passengers that flew each month over a period of 12 years. We will start by fitting a model on the full data set, then try holding out some test data for forecasting.

Data inspection

We should begin by exploring the time series. There are a number of questions that could be asked. For example:

Is the trend non-stationary?
Does the plot feature a seasonal variation?

Just from looking at the plot above the answer to both of these questions is a clear “yes”! But what if the data was more noisy and it was not clear from a quick visual inspection? In that case we could try decomposing the time series into the following components:

Trend
Seasonal
Residual (i.e. after we have subtracted the trend and seasonal components)

Fortunately the statsmodels library has a seasonal_decompose() function for this exact purpose:

from statsmodels.tsa.seasonal import seasonal_decompose

y = flights["passengers"] # convenience variable
decomposition = seasonal_decompose(y, period=12)

For convenience we have assigned the passengers column of the original DataFrame to a variable called y. Because we expect a seasonal variation in the data we have chosen a period of 12 months. Let’s inspect the decomposition:

decomposition.plot()

Four plots arranged vertically showing a breakdown of the flight passengers time series. Viewed from top to bottom: the first panel displays the raw time series; the second panel displays the trend component which is increasing over time at a rate that is close to linear; the third panel displays the repeating seasonal fluctuation, which spans from -50 to +50; and the fourth panel displays the residuals that remain after the trend and seasonal variation have been subtracted. The residuals do not appear random, featuring instead a fluctuating pattern, suggesting that there are additional seasonal effects not accounted for.

The top panel shows the original raw time series.
In the second panel we see that there is indeed an increasing trend. We will therefore need to include some differencing in the model (d > 0).
The third panel shows the repeating seasonal component.
The fourth panel shows that there is still a non-random residual after the trend and seasonal component have been subtracted. This can result from the fact that the seasonal “peaks” in the original plot appear to grow in amplitude over time (i.e. it is not really a fixed seasonal pattern).

It is also important to study the autocorrelation function (ACF) and partial autocorrelation function (PACF). Again, the statsmodels library has everything we need:

The ACF plot shows how correlated observations are with other observations that are k time points away (we call this “lag-k”):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(y)
plt.xlabel("$k$")

Plot of the autocorrelation function applied to the raw time series data. We observe an autocorrelation of 1 at lag-0, which is guaranteed in autocorrelation plots. The autocorrelation then gradually decreases to roughly 0.5 at lag-20. A shaded confidence region is also plotted, which the autocorrelation drops inside at approximately lag-14, suggesting that the correlation between time points that are separated by more than 14 months is not statistically significant.

The PACF plot shows the “direct” correlation between observations at lag-k after removing the linear dependence of intermediate lags:

plot_pacf(y)
plt.xlabel("$k$")

Plot of the partial autocorrelation function applied to the raw time series data. As expected it starts at value 1 at lag-0, and rapidly drops to within 0.2 at lag-3. The partial autocorrelation remains fluctuating between -0.2 and +0.2 (aside from a few outliers) until lag-22. This suggests that the direct correlation, after the linear dependence of intermediate lags is subtracted, between time points that are separated by more than two months is statistically insigificant.

Both plots start with a lag of 0, where the correlation is always 1. The ACF and PACF then typically drop down to close to zero. The point at which this happens can help to inform the values for our p and q parameters:

The value of k at which the ACF reduces to statistically insignificant values is regarded as a good choice for the q parameter. From the plot we see the ACF drops close to the confidence region at approximately k = 10.
The value of k at which the PACF appears to drop close to 0 is a sensible choice for the p parameter. Here the value k = 2 appears reasonable.

These are really just educated guesses. In practice it would be worth experimenting with the ranges 1 < = p < = 3 and 5 < = q < = 15 (we’ll not worry about this here).

The d parameter controls the amount of differencing:

d = 1 means we take the difference between every observation and the previous observation.
d = 2 means we difference the differenced time series again.
… and so on.

The process should continue until the non-stationary trend is regarded as statistically insignificant. This can be done by eye, but a better way is to use the Augmented Dickey-Fuller (ADF) test:

from statsmodels.tsa.stattools import adfuller

adfuller(y)

## (0.8153688792060528, 0.9918802434376411, 13, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 996.6929308390189)

The first two values returned give us the test statistic and p-value for the null hypothesis, respectively. We also get the critical value cutoffs at the 1%, 5% and 10% levels. Without going into too much detail, the general rule is that if the test statistic is greater than the 5% cutoff then the null hypothesis is accepted, meaning that the trend is non-stationary.

In our case we should consider differencing the data and trying again. Let’s use the diff() function from statsmodels:

Taking a single difference results in a test statistic that is comparable to the 5% cutoff:

from statsmodels.tsa.statespace.tools import diff

adfuller(diff(y))

## (-2.829266824169999, 0.05421329028382552, 12, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 988.5069317854084)

Taking a second difference results in a test statistic that is much lower than even the 1% cutoff:

from statsmodels.tsa.statespace.tools import diff

adfuller(diff(y, k_diff=2))

## (-16.38423154246854, 2.7328918500140445e-29, 11, 130, {'1%': -3.4816817173418295, '5%': -2.8840418343195267, '10%': -2.578770059171598}, 988.6020417275605)

We see that a second order difference (d = 2) produces a stationary trend according to the ADF test. It is important to avoid excessively high choices of d since this can introduce artefacts in the final model. So in practice it would be worth experimenting with both d = 1 and d = 2.

What about the seasonal trend in the data? This suggests that we should really be using the SARIMA framework (where the S stands for “seasonal”). That would involve twice the number of parameters, so let’s proceed with our simplified model and see how we get on.

Model fitting

Having analysed the time series we have arrived at a reasonable choice of our parameters: (p, d, q) = (2, 2, 10). As stated above, in practice we should really test a range of values but for now we will not worry.

The statsmodels library provides an ARIMA object for model fitting and forecasting. Let’s call it with our parameter choices:

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(y, order=(2, 2, 10))
model_fit = model.fit()

We can inspect the model using:

model_fit.summary()

## <class 'statsmodels.iolib.summary.Summary'>
## """
##                                SARIMAX Results                                
## ==============================================================================
## Dep. Variable:             passengers   No. Observations:                  144
## Model:                ARIMA(2, 2, 10)   Log Likelihood                -673.434
## Date:                Wed, 27 Aug 2025   AIC                           1372.867
## Time:                        14:21:20   BIC                           1411.293
## Sample:                             0   HQIC                          1388.482
##                                 - 144                                         
## Covariance Type:                  opg                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## ar.L1          0.0376      0.027      1.386      0.166      -0.016       0.091
## ar.L2         -0.9770      0.026    -37.090      0.000      -1.029      -0.925
## ma.L1         -0.4439    153.608     -0.003      0.998    -301.510     300.622
## ma.L2          0.9971    142.058      0.007      0.994    -277.432     279.426
## ma.L3         -0.4440     90.869     -0.005      0.996    -178.543     177.655
## ma.L4          0.2001    163.089      0.001      0.999    -319.449     319.849
## ma.L5         -0.2122    163.925     -0.001      0.999    -321.499     321.075
## ma.L6          0.2396    165.114      0.001      0.999    -323.377     323.857
## ma.L7         -0.0025     79.453  -3.19e-05      1.000    -155.728     155.723
## ma.L8         -0.6393    156.926     -0.004      0.997    -308.208     306.930
## ma.L9          0.1965    136.829      0.001      0.999    -267.984     268.377
## ma.L10        -0.8908      0.125     -7.104      0.000      -1.137      -0.645
## sigma2       660.0238      1.030    640.771      0.000     658.005     662.043
## ===================================================================================
## Ljung-Box (L1) (Q):                   0.42   Jarque-Bera (JB):                10.61
## Prob(Q):                              0.52   Prob(JB):                         0.00
## Heteroskedasticity (H):               6.53   Skew:                             0.08
## Prob(H) (two-sided):                  0.00   Kurtosis:                         4.33
## ===================================================================================
## 
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## [2] Covariance matrix is singular or near-singular, with condition number 1.78e+22. Standard errors may be unstable.
## """

The summary of the fit provides the log likelihood, AIC and BIC metrics. If you’re testing different choices of the (p, d, q) parameters it’s worth comparing the AIC and BIC metrics (lower values suggest a better fit).

The model summary also includes a couple of warnings, in this case concerning the covariance matrix. We will not worry about these messages for now, and inspect model residuals and forecasting ability as a way of assessing the quality of the fit.

Using pandas we can inspect the residuals:

import pandas as pd

residuals = pd.DataFrame(model_fit.resid)
residuals.describe()

##                 0
## count  144.000000
## mean     0.489740
## std     28.441220
## min    -91.170547
## 25%    -14.729881
## 50%     -0.702994
## 75%     16.190037
## max    112.000000

The residuals appear to be distributed close to zero.

residuals.plot(kind="kde", legend=False)
plt.xlabel("residuals")
plt.ylabel("density")

Kernel density estimation plot showing how the ARIMA model residuals are distributed. We observe a symmetrical distribution that is centred close to value 0.

We can also plot the residuals over time to inspect the outliers.

residuals.plot(legend=False)
plt.hlines(0, 0, 144, color="black")
plt.xlabel("month")
plt.ylabel("residuals")

Plot showing the ARIMA model residuals plotted against time. We observe a fluctuating trend which appears random, that is centred on the horizontal zero line. Most of the residuals are within -50 and +50.

For an initial model this appears reasonable.

Forecasting

Now that we have a model we can try forecasting future time points. There are a number of possible use cases:

We may only be interested in forecasting the next month. We can simulate this with our data set by using a “rolling forecast” where the model is retrained on all of the data up to the current time point before predicting the next time point.
The model could also be used for quarterly or yearly forecasting, where we predict multiple future time points at once.

Let’s go with approach 1 first. We will start by splitting the time series into an initial training set and a hold-out test set:

y_values = list(y.values)
train, test = y_values[:96], y_values[96:132]

Since time series models are typically used to forecast into the future, a common practice for testing is to remove the end of the time series from the training set and hold it out for testing. Here we have set aside 3 years of data for testing.

We will now simulate 3 years worth of monthly forecasting, where every month we retrain the model with the latest data and produce a forecast for the next month. Forecasts are produced using the .forecast() method, which predicts the next time point by default.

predictions = []
current_params = None
for i in range(len(test)):
    model = ARIMA(train, order=(2, 2, 10))
    model_fit = model.fit(start_params=current_params)
    current_params = model_fit.params # update the parameters
    output = model_fit.forecast()
    predictions.append(output[0]) # store the prediction
    train.append(test[i]) # update the training set

Depending on the model complexity this can take a few minutes to run (the above code chunk took 1-2 minutes). To save some optimisation time, at every time step we have used the best-fit parameters produced for the previous model as the starting parameters for the next model (using the start_params argument).

We now have a list of predictions to compare against our test observations. Let’s plot these together and compute the root-mean-squared error (RMSE):

from sklearn.metrics import mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(test, predictions))
print(f"RMSE: {rmse}")

## RMSE: 30.875187739583858


plt.plot(test, color="blue", label="observed")
plt.plot(predictions, color="red", label="ARIMA")
plt.xlabel("month")
plt.ylabel("passengers")
plt.legend()

Plot showing the passengers time series for three years of monthly data. The observed time series is plotted in blue and the ARIMA rolling predictions are plotted in red. The predictions appear to align closely with the observations and capture the seasonal pattern.

The agreement looks reasonable.

Alternatively, we may want to predict all 12 months in the next year. Let’s use the final 12 months of data (which were left out of the above analysis):

test = y_values[132:]

We now retrain the model on the first 11 years worth of data and this time use it to forecast the next 12 months:

model = ARIMA(train, order=(2, 2, 10))
model_fit = model.fit()

output = model_fit.forecast(steps=12) # predict 12 time points
predictions = output[:12] # store the predictions

Let’s compare the predictions with the test observations:

rmse = np.sqrt(mean_squared_error(test, predictions))
print(f"RMSE: {rmse}")

## RMSE: 73.86518520797583


plt.plot(test, color="blue", label="observed")
plt.plot(predictions, color="red", label="ARIMA")
plt.xlabel("month")
plt.ylabel("passengers")
plt.legend()

Plot showing the passenger time series for a year of monthly data. We again show the observations in blue. In red we show the 12-month forecast produced by the ARIMA model. The agreement is poor here. The predictions produce a noisy trend that gradually increases over time, from 450 passengers at month 0 to over 500 passengers at month 12. This disagrees with the observations which display a maximum of over 600 passengers at month 6 and minima of less than 400 passengers at months 1 and 10, suggesting that the model has failed to capture the seasonality of the data.

The agreement is poor here. As noted earlier, ARIMA does not account for seasonal variation and we can see here the model is not able to reproduce the peak at month 6. It would therefore be worth repeating this analysis with the SARIMA method, which is also implemented in statsmodels.

Summary

In summary, we have introduced the ARIMA framework for time series forecasting using a real world example in Python. Along the way, we have learned about popular data visualisations for time series data and explored the time series analysis functions provided by the statsmodels package. Check out the statsmodels documentation for more examples.

It’s worth mentioning that, while ARIMA is a powerful method for time series forecasting, there are a number of other popular frameworks for different use cases:

SARIMA: expands on ARIMA by including a seasonal variation.
Prophet: an alternative time series framework that can capture yearly, weekly and daily seasonality.
DeepAR: an efficient deep learning algorithm designed to fit multiple time series with a single global model. This can outperform ARIMA in scenarios where hundreds of time series have to be modelled.

We may revisit these models in a later post. In the meantime, check out our recent blog series on MLOps, including model versioning, deployment and monitoring using the Vetiver framework.