A Guide on Time Series Analysis in Python

Soumodeep Das
7 min read · Nov 11, 2022


Time series analysis

Introduction to time series analysis

Time series analysis (TSA) is a mathematical approach in which a series of data points, recorded over a particular interval of time, is studied. In many everyday settings, being effective depends on predicting or forecasting the future. Time is the most important variable in TSA, and the whole purpose of TSA is to understand how variables change over time.

In this article, we are going to predict the future international crude oil prices using a time series model. To download the dataset please visit Yahoo! Finance.

Table of contents

  1. What is TSA
  2. Components of TSA
  3. Different TSA models
  4. Steps involved in TSA
  5. TSA steps in Python
  6. Conclusion
  7. Takeaways

What is TSA

TSA is a mathematical approach to predicting or forecasting the future pattern of data using historical data arranged in a successive order for a particular time period.

Assumption: The key assumption in TSA is that the data is “stationary”, which means that its statistical properties, such as the mean and variance, do not change over time.
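To make the idea of stationarity more concrete, here is a minimal sketch on synthetic data (not the crude oil dataset). It compares the mean of the first and second half of two series: pure noise, and noise on top of a rising trend. The series length, the slope of 0.5 and the helper name half_mean_gap are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Stationary: white noise around a constant mean of 0
noise = rng.normal(loc=0.0, scale=1.0, size=n)

# Non-stationary: the same kind of noise riding on a rising trend
trended = 0.5 * np.arange(n) + rng.normal(loc=0.0, scale=1.0, size=n)

def half_mean_gap(x):
    """Mean of the second half minus mean of the first half."""
    half = len(x) // 2
    return x[half:].mean() - x[:half].mean()

noise_gap = half_mean_gap(noise)
trend_gap = half_mean_gap(trended)
print(f"white noise gap:    {noise_gap:.2f}")
print(f"trended series gap: {trend_gap:.2f}")
```

The gap stays close to zero for the noise series and is large for the trended one, which is the kind of time dependence a stationarity check is meant to catch.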

Components of TSA

Trend: The long-term movement of the series over time. The trend can be either linear or nonlinear in nature.

Seasonality: Repetitive changes in the data that recur at a fixed interval, such as every calendar year.

Cyclicity: Recurring changes in the data that have no fixed period and extend beyond the calendar year.

Randomness: Unknown, irregular movements or changes in the data.
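The components above can be sketched by building a synthetic monthly series from a trend, a seasonal pattern and random noise, and then recovering the seasonal profile again. This is only an illustration under assumed values (slope 0.1, seasonal amplitude 2.0, noise scale 0.1); cyclicity is left out to keep it short.

```python
import numpy as np

rng = np.random.default_rng(1)
months = np.arange(120)                            # 10 years of monthly points

trend = 0.1 * months                               # steady upward drift
seasonal = 2.0 * np.sin(2 * np.pi * months / 12)   # repeats every 12 months
noise = rng.normal(scale=0.1, size=months.size)    # irregular component
series = trend + seasonal + noise

# Remove a fitted straight line, then average each position within the year
slope, intercept = np.polyfit(months, series, deg=1)
detrended = series - (slope * months + intercept)
monthly_profile = detrended.reshape(10, 12).mean(axis=0)

peak_month = int(monthly_profile.argmax())
print("estimated slope:", round(slope, 3))
print("position of the seasonal peak within the year:", peak_month)
```

Averaging the detrended values by position within the year smooths out the noise and leaves the repeating seasonal shape.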

Different TSA models

TSA has different models such as AR, MA, ARMA, and ARIMA. Of these, ARIMA is the most frequently used. Why ARIMA is used so often is outside the scope of this article.
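As a small illustration of what the "AR" in these model names means, the sketch below simulates an AR(1) process, where each value is a fraction of the previous value plus noise, and then estimates that fraction from the lag-1 autocorrelation. The coefficient 0.7 and the series length are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
phi = 0.7      # assumed AR(1) coefficient, chosen for illustration
n = 5000

x = np.zeros(n)
for t in range(1, n):
    # each value depends on the previous value plus fresh noise
    x[t] = phi * x[t - 1] + rng.normal()

# For an AR(1) process the lag-1 autocorrelation approximates phi
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
print("lag-1 autocorrelation:", round(lag1, 2))
```

An MA model would instead express each value through past noise terms, and ARIMA combines both ideas with differencing.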

TSA also provides us with additional information about the data points, but in this article, we are going to understand how to perform a time series analysis in Python.

Steps involved in TSA

  1. Plot the time series: Look for trends, seasonality, outliers, etc.
  2. Transform data so that the residuals are stationary: Log transforms or differencing.
  3. Fit the residuals: AR, MA, etc.

TSA steps in Python

Importing libraries

pandas lets us work with data frames, and statsmodels provides several TSA tools. The ADF (Augmented Dickey-Fuller) test is used to check whether the data is stationary. matplotlib is imported to visualize the findings.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
# ARIMA lives in statsmodels.tsa.arima.model in current statsmodels releases
from statsmodels.tsa.arima.model import ARIMA
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

Loading the dataset:

In this article, the data loading process is oriented toward a Google Colab notebook. In another environment, such as Anaconda, follow that environment's own steps to load the dataset.

We have used International Crude Oil price data to predict future oil prices. Data is from Nov-2012 to Sep-2022 (Monthly observations). Adjusted close price is the target variable.

# mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive/')
# list the datasets available
!ls "/content/drive/My Drive/datasets/"
filepath = "/content/drive/My Drive/datasets/"
# use the 'Date' column as a parsed datetime index
data = pd.read_csv(filepath + 'Crude data.csv', index_col='Date', parse_dates=True)

Replace ‘Crude data.csv’ with your file name.

Note: Our dataset is stored in Google Drive, which is why the code mounts the drive in Colab.
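If you are not on Colab, the same read_csv call works on any file path or file-like object. The sketch below uses a tiny in-memory CSV in place of the real file; the two rows and their values are hypothetical and only show how index_col and parse_dates turn the Date column into a datetime index.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the downloaded CSV file
csv_text = """Date,Open,Adj Close**
2012-11-01,86.2,88.9
2012-12-01,88.9,91.8
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)
print(type(sample.index).__name__)
print(sample["Adj Close**"].tolist())
```

A datetime index matters later on, because the plotting and decomposition steps rely on the rows being ordered and spaced in time.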

data.columns

To make the data frame easier to work with, we drop the unnecessary columns from the dataset.

df=data.drop(['Number', 'Open', 'High', 'Low', 'Close*','Volume'], axis=1)
df.head()

Checking for null values in the dataset.

df.isnull().sum()

We can see that there is no null value present in our dataset.
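If your own dataset does contain null values, one common option is to interpolate between neighboring observations before modeling. The following is a minimal sketch on a hypothetical four-month series, not on the crude oil data.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly prices with one missing observation
idx = pd.date_range("2020-01-01", periods=4, freq="MS")
prices = pd.Series([50.0, np.nan, 54.0, 56.0], index=idx)

missing_before = int(prices.isnull().sum())
filled = prices.interpolate()      # linear interpolation between neighbors
missing_after = int(filled.isnull().sum())

print(missing_before, missing_after)
print(filled.tolist())
```

Whether interpolation is appropriate depends on how many values are missing and why; for long gaps it can hide real movements in the series.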

You can also perform some exploratory data analysis (EDA) on the data at this point.

Step 1: Plot the time series: Look for trends, seasonality, outliers, etc.

# ETS Decomposition
result = seasonal_decompose(df['Adj Close**'],model ='multiplicative')
# ETS plot
result.plot()

From the chart, we can say that our data is multiplicative, meaning that its components combine by multiplication rather than by addition.
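One way to picture the difference between additive and multiplicative data is to compare how wide the seasonal swing is early and late in a rising series. The sketch below uses two synthetic series with assumed parameters (a level rising from 50, an additive swing of 5.0, a multiplicative swing of 10%); it does not use the crude oil data.

```python
import numpy as np

months = np.arange(120)
level = 50.0 + months                          # rising trend
wave = np.sin(2 * np.pi * months / 12)         # yearly pattern

additive = level + 5.0 * wave                  # swing stays the same size
multiplicative = level * (1.0 + 0.1 * wave)    # swing scales with the level

def yearly_range(x, year):
    """Max minus min within one 12-month block."""
    block = x[12 * year: 12 * (year + 1)]
    return block.max() - block.min()

add_first, add_last = yearly_range(additive, 0), yearly_range(additive, 9)
mul_first, mul_last = yearly_range(multiplicative, 0), yearly_range(multiplicative, 9)

print("additive:      ", round(add_first, 2), round(add_last, 2))
print("multiplicative:", round(mul_first, 2), round(mul_last, 2))
```

In the additive series the yearly range is the same in the first and last year, while in the multiplicative series it grows as the level rises. This is also why a log transform is a natural next step: it turns a product of components into a sum.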

Step 2: Transform data so that the residuals are stationary: Log transforms or differencing.

The ADF test is used to check the stationarity of the data.

adfuller(df['Adj Close**'])

The p-value is 0.32, which is greater than 0.05, so we cannot reject the null hypothesis of a unit root: our data is not stationary. We therefore need to transform the data to make it stationary. Let’s apply a log transform to the target variable.

# Base-2 log transform, stored as 'data_d' because the modeling code below reads df['data_d']
df['data_d'] = np.log2(df['Adj Close**'])
# Show the dataframe
df

After the log transform, the p-value comes into an acceptable range. If your p-value is still above 0.05, apply differencing and repeat the ADF test until the p-value falls below 0.05.

Note: Additional code for differencing is given below.

data_d=df.diff(axis = 0, periods = 1)
data_d
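To see what diff actually does, here is a minimal sketch on a four-value toy series. It also shows how the differencing can be undone with a cumulative sum, which is useful when forecasts of a differenced series have to be turned back into levels.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 6.0, 10.0])

# The first row has no predecessor, so its difference is NaN and is dropped
first_diff = s.diff(periods=1).dropna()
print(first_diff.tolist())   # [2.0, 3.0, 4.0]

# Undo the differencing: cumulative sum plus the first original value
rebuilt = first_diff.cumsum() + s.iloc[0]
print(rebuilt.tolist())      # [3.0, 6.0, 10.0]
```

Note that one observation is lost for each round of differencing.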

Step 3: Fit the residuals: AR, MA, etc.

We are using ARIMA to forecast future oil prices.

In Python, there is a library named pmdarima. Within this library, auto_arima automatically tunes the parameters (p, d, q), where p is the number of autoregressive terms, d is the number of nonseasonal differences required for stationarity, and q is the number of lagged forecast errors in the prediction equation.

# To install the library
!pip install pmdarima
# Import the library
from pmdarima import auto_arima
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
# Fit auto_arima function to dataset
stepwise_fit = auto_arima(df['data_d'], start_p=1, start_q=1,
                          max_p=3, max_q=3, m=12,
                          start_P=0, seasonal=True,
                          d=None, D=1, trace=True,
                          error_action='ignore',   # we don't want to know if an order does not work
                          suppress_warnings=True,  # we don't want convergence warnings
                          stepwise=True)           # set to stepwise
# To print the summary
stepwise_fit.summary()

From the result, we got the optimal model for our data.

# Split data into train / test sets
train = df.iloc[:len(df)-12]
test = df.iloc[len(df)-12:]  # set one year (12 months) for testing
# Fit a SARIMAX(0, 1, 1)x(2, 1, 1, 12) on the training set
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(train['data_d'],
                order=(0, 1, 1),
                seasonal_order=(2, 1, 1, 12))
result = model.fit()
result.summary()
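The split above keeps the time order intact: the last 12 months are held out, and nothing is shuffled. The sketch below repeats the same idea on a hypothetical 36-month frame so the result is easy to inspect; the names toy, train_part and test_part are assumptions for this example.

```python
import pandas as pd

# Hypothetical 36-month frame standing in for the real data
idx = pd.date_range("2019-01-01", periods=36, freq="MS")
toy = pd.DataFrame({"value": range(36)}, index=idx)

holdout = 12                        # one year of monthly observations
train_part = toy.iloc[:-holdout]
test_part = toy.iloc[-holdout:]

print(len(train_part), len(test_part))
print(train_part.index.max(), "<", test_part.index.min())
```

A random split, as is common in other machine learning tasks, would let the model see observations from the future of the test period, so it is usually avoided for time series.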

Visualize the prediction results and actual results

start = len(train)
end = len(train) + len(test) - 1
# Predictions for one year against the test set
# (SARIMAX already predicts on the scale of the modeled series, so the old 'typ' argument is dropped)
predictions = result.predict(start, end).rename("Predictions")
# plot predictions and actual values
predictions.plot(legend=True)
test['data_d'].plot(legend=True)

Next, we check the error. We measure RMSE (Root Mean Squared Error) and MSE (Mean Squared Error) to judge the accuracy.

# Load specific evaluation tools
from sklearn.metrics import mean_squared_error
from statsmodels.tools.eval_measures import rmse
# Calculate root mean squared error
rmse(test["data_d"], predictions)
# Calculate mean squared error
mean_squared_error(test["data_d"], predictions)

In our case, the error is acceptable at 0.09. The ideal value is 0.00, so the closer the MSE is to zero, the better the model fits.
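To show what these two metrics compute, here is a minimal hand calculation on four hypothetical actual and predicted values, using only the standard library.

```python
import math

# Hypothetical values, chosen so the arithmetic is easy to follow
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [1.0, 4.0, 7.0, 10.0]

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)   # mean of the squared errors
rmse_value = math.sqrt(mse)                       # back on the scale of the data

print(mse)                    # 1.5
print(round(rmse_value, 4))   # 1.2247
```

Keep in mind that when the model is fitted on a transformed series, the error is measured in the units of that transformed series, not in the original price units.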

Plotting future crude oil prices for the next few years.

# Train the model on the full dataset
model = SARIMAX(df['data_d'],
                order=(0, 1, 1),
                seasonal_order=(2, 1, 1, 12))
result = model.fit()
# Forecast for the next 3 years
# (SARIMAX already predicts on the scale of the modeled series, so the old 'typ' argument is dropped)
forecast = result.predict(start=len(df),
                          end=(len(df) - 1) + 3 * 12).rename('Forecast')
# Plot the forecast values
df['data_d'].plot(figsize=(12, 5), legend=True)
forecast.plot(legend=True)

Printing the forecast values.

print(forecast)

The final forecast shows that oil prices will fluctuate within the range of 6.01 to 6.55. These values are on the log scale. Because the transform above used a base-2 logarithm (np.log2), we can return to the original prices by raising 2 to the power of each forecast value.
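The back-transformation can be sketched as follows. The two log values are hypothetical round numbers picked to make the arithmetic visible, not the actual forecast output.

```python
import math

# Hypothetical base-2 log values, not the real forecast
log2_forecasts = [6.0, 6.5]

# Inverse of a base-2 log: raise 2 to the power of each value
prices = [2 ** v for v in log2_forecasts]
print([round(p, 2) for p in prices])   # [64.0, 90.51]

# Round trip: taking log2 again returns the original log value
print(math.log2(prices[0]))            # 6.0
```

If a natural log had been used instead, the inverse would be the exponential function rather than a power of 2.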

Conclusion

From historical data, we can forecast crude oil prices for the next three years. We found that the future crude oil price will fluctuate between 69 USD and 110 USD. We used a seasonal ARIMA model with an AIC value of 75.17 and a BIC value of 83.16, and we visualized the forecast oil prices. Python's auto_arima suggests an optimal model, but it has some limitations: we can still look for a better model with lower AIC and BIC values by setting the p, d, and q values manually. Time-series software such as Gretl can also be used to look for a better model.
