Time Series: Exploratory Data Analysis

Javed Afroz
8 min read · Feb 27, 2023

In this section I will discuss various techniques used to analyze time series data in order to prepare it for forecasting using Python libraries. The goal is to prepare a time series data set that forecasting algorithms can use to forecast wind power generation. The forecasting algorithms themselves are not part of this write-up; I will cover them in a separate blog.

What is Time Series Data Analysis?

Time series data analysis is a statistical method used to analyze data points collected over time. It involves studying how the data evolves in order to identify underlying patterns, trends, and relationships between the data points.

In time series analysis, the data is usually collected at regular intervals, such as daily, weekly, or monthly, and the data points are arranged in chronological order. The analysis typically involves looking at the statistical properties of the data, such as mean, variance, and correlation, to identify trends, patterns, and relationships over time.
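As a toy illustration (the numbers below are made up, not taken from the turbine data set), these basic statistical properties can be computed with pandas on a date-indexed series:

```python
import pandas as pd

# a small made-up daily series (not from the turbine data set)
ts = pd.Series(
    [10.0, 12.0, 11.0, 13.0, 12.0, 14.0],
    index=pd.date_range("2023-01-01", periods=6, freq="D"),
)

print(ts.mean())           # mean level of the series -> 12.0
print(ts.var())            # sample variance around the mean -> 2.0
print(ts.autocorr(lag=1))  # correlation with the previous day's value
```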

Time series analysis is commonly used in various fields, including economics, finance, weather forecasting, signal processing, and engineering, to name a few. It helps in predicting future trends and patterns in the data and helps in making informed decisions based on the analysis.

What is Exploratory Data Analysis (EDA)?

Exploratory data analysis (EDA) is the process of analyzing and summarizing data sets to gain insights and identify patterns, relationships, anomalies, and potential issues or outliers. It is an essential step in data science, machine learning, and statistical analysis that helps to understand the data better and formulate hypotheses for further investigation.

The goal of EDA is to explore the data, identify important features and characteristics, and generate visualizations and statistical summaries that reveal meaningful information. The process involves various techniques such as histograms, scatter plots, box plots, heat maps, correlation matrices, and other exploratory plots to investigate the data.

EDA is also used to identify data quality issues, such as missing or inconsistent data, outliers, and anomalies, and to develop strategies to address them. Overall, EDA helps to ensure that the data is suitable for the intended analysis and that the results are accurate and reliable.

Once you have completed EDA you need to prepare the data for further analysis and forecasting. Data preparation is a crucial step in time series data analysis because time series data is often noisy and inconsistent, and can contain missing or erroneous values. Data preparation involves cleaning, transforming, and normalizing the data to make it suitable for analysis.

  1. Quality of data: Time series data can contain errors and missing values, which can affect the accuracy of analysis. Data preparation helps identify and remove such errors and missing values from the data.
  2. Consistency of data: Time series data is collected at different intervals and can have inconsistencies due to changes in measurement instruments or methods. Data preparation ensures that the data is consistent and can be compared across time.
  3. Feature selection: Time series data can have many variables or features that are not relevant to the analysis. Data preparation helps in selecting the relevant features for analysis.
  4. Normalization: Time series data can have different scales and ranges for different variables, making it difficult to compare them. Data preparation normalizes the data, making it comparable.

Overall, data preparation is necessary for time series data analysis to ensure that the data is accurate, consistent, and relevant for analysis and forecasting.
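As one concrete example, the normalization mentioned in point 4 is often done with min-max scaling, which maps every column onto the 0 to 1 range. A minimal sketch with made-up column names and values:

```python
import pandas as pd

# made-up frame with two columns on very different scales
df = pd.DataFrame({
    "wind_speed": [2.0, 6.0, 10.0],
    "power_kw": [0.0, 900.0, 1800.0],
})

# min-max scaling: (x - min) / (max - min), applied column-wise
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)  # every column now runs from 0.0 to 1.0
```

After scaling, both columns are directly comparable even though their raw ranges differ by two orders of magnitude.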

Data Set Used for EDA

I have used a wind energy data set from Kaggle for the analysis. This data set contains various weather, turbine, and rotor features. Readings were recorded at a 10-minute interval from January 2018 till March 2020.

Let’s define the goal of this analysis: to predict wind power generation on a weekly basis.

#Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#Load the dataset
data = pd.read_csv("Turbine_Data.csv")

As a first step, run the commands below on the loaded data. The objective of this step is to familiarize yourself with the data set. Have a good look at the data, check the columns and rows, and pay attention to missing values, data types, etc.

# show the first few rows of the dataset
data.head()
# show the last few rows of the dataset
data.tail()
# show information about the dataset such as data types and missing values
data.info()
# show statistical summaries of the dataset
data.describe()

The first column, which contains the timestamp, has an empty name. Let’s rename it to “Date”. Since the later steps (monthly boxplots, resampling) need a time-based index, we also parse the column and set it as the index.

# rename first column to "Date" and use it as a datetime index
data = data.rename(columns={data.columns[0]: 'Date'})
data['Date'] = pd.to_datetime(data['Date'])
data = data.set_index('Date')

Check Outliers

Let’s plot a boxplot of the power generation values for all the months in the dataset.

sns.set(rc={'figure.figsize':(14,10)})
sns.boxplot(x=data.index.month_name(), y='ActivePower', data=data, palette='muted')
plt.ylabel('ActivePower')
plt.xlabel('Month')
plt.title("Power Generated")

Below is the boxplot generated.

You can notice that there are negative values for active power in the above boxplot, which should not be possible. Let’s check further.

#Print ActivePower negative values
data['ActivePower'][data['ActivePower']<0]

Below is the output

Date
2018-01-01 00:00:00+00:00 -5.357727
2018-01-01 00:10:00+00:00 -5.822360
2018-01-01 00:20:00+00:00 -5.279409
2018-01-01 00:30:00+00:00 -4.648054
2018-01-01 00:40:00+00:00 -4.684632
...
2020-03-30 03:50:00+00:00 -7.005695
2020-03-30 04:00:00+00:00 -5.576951
2020-03-30 04:10:00+00:00 -4.945515
2020-03-30 04:20:00+00:00 -6.565684
2020-03-30 04:40:00+00:00 -9.814717
Name: ActivePower, Length: 15644, dtype: float64

Negative power generation doesn’t make sense. Let’s drop the rows with negative power generation values.

#Keep only rows with non-negative power generation values
data = data[data['ActivePower']>=0]

Check unique values

Let’s start with checking the unique values in the data set.

#Check unique values
data.nunique()

Below is the output. Notice that two columns have a value of 1, which means they hold a constant value in all rows.

ActivePower                     78449
AmbientTemperatue               77943
BearingShaftTemperature         52091
Blade1PitchAngle                33383
Blade2PitchAngle                33431
Blade3PitchAngle                33431
ControlBoxTemperature               1
GearboxBearingTemperature       52099
GearboxOilTemperature           52171
GeneratorRPM                    51730
GeneratorWinding1Temperature    52188
GeneratorWinding2Temperature    52191
HubTemperature                  31207
MainBoxTemperature              41376
NacellePosition                  4990
ReactivePower                   78413
RotorRPM                        51641
TurbineStatus                     121
WTG                                 1
WindDirection                    4990
WindSpeed                       78497

Constant values usually have no implication for our analysis. Therefore, let’s drop the columns with constant values, namely ‘WTG’ and ‘ControlBoxTemperature’.

# drop columns with constant values
data = data.drop(['WTG', 'ControlBoxTemperature'], axis=1)
data.nunique()

Let’s quickly verify the output below after removing the two columns.

ActivePower                     78449
AmbientTemperatue               77943
BearingShaftTemperature         52091
Blade1PitchAngle                33383
Blade2PitchAngle                33431
Blade3PitchAngle                33431
GearboxBearingTemperature       52099
GearboxOilTemperature           52171
GeneratorRPM                    51730
GeneratorWinding1Temperature    52188
GeneratorWinding2Temperature    52191
HubTemperature                  31207
MainBoxTemperature              41376
NacellePosition                  4990
ReactivePower                   78413
RotorRPM                        51641
TurbineStatus                     121
WindDirection                    4990
WindSpeed                       78497

Check Missing Values

The columns may contain null values which will need to be handled. Let’s check the null values in the data set.

# count the number of missing values in each column
data.isnull().sum()

Below is the output. Notice that many columns contain a large number of null or missing values.

ActivePower                         0
AmbientTemperatue                1032
BearingShaftTemperature         26932
Blade1PitchAngle                43397
Blade2PitchAngle                43480
Blade3PitchAngle                43480
GearboxBearingTemperature       26930
GearboxOilTemperature           26915
GeneratorRPM                    26919
GeneratorWinding1Temperature    26901
GeneratorWinding2Temperature    26894
HubTemperature                  27041
MainBoxTemperature              26952
NacellePosition                 20429
ReactivePower                      42
RotorRPM                        26925
TurbineStatus                   26577
WindDirection                   20429
WindSpeed                         308

Missing value treatment using linear interpolation

Linear interpolation is a technique used to estimate the value of a variable at a given point based on the values of the variable at surrounding points. It assumes that the variable changes at a constant rate between adjacent data points.
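A tiny example with made-up numbers shows the idea: pandas fills each gap by assuming a constant step between the known neighbouring values.

```python
import numpy as np
import pandas as pd

# a short made-up series with a two-point gap
s = pd.Series([10.0, np.nan, np.nan, 16.0])

# the gap is filled with equal steps of 2.0 between 10.0 and 16.0
print(s.interpolate(method="linear").tolist())  # [10.0, 12.0, 14.0, 16.0]
```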

Let’s apply linear interpolation on the missing data points.

## Linear Interpolation Method
data=data.interpolate(method ='linear', limit_direction ='backward')
data.isnull().sum()

Below is the output after linear interpolation of missing values is done successfully.

ActivePower                         0
AmbientTemperatue                   0
BearingShaftTemperature             0
Blade1PitchAngle                    0
Blade2PitchAngle                    0
Blade3PitchAngle                    0
GearboxBearingTemperature           0
GearboxOilTemperature               0
GeneratorRPM                        0
GeneratorWinding1Temperature        0
GeneratorWinding2Temperature        0
HubTemperature                      0
MainBoxTemperature                  0
NacellePosition                     0
ReactivePower                       0
RotorRPM                            0
TurbineStatus                       0
WindDirection                       0
WindSpeed                           0

Now that the missing values are treated, we can start plotting the data. Our variable of interest is “ActivePower”, as it contains the power generated at each timestamp. Let’s quickly plot the power generated against time.

# plot 'ActivePower' for the dates
data.reset_index().plot(x = 'Date', y = 'ActivePower')

The above plot is not clear because too many data points are involved: since the data is sampled every 10 minutes, the line is extremely dense. We are ultimately interested in weekly values, so we will first resample the data to a daily frequency, rolling the points up by their mean.

# resample the data on daily frequency
data = data.resample("D").mean()
data.reset_index().plot(x = "Date", y = 'ActivePower')
plt.legend(loc='best')
plt.title('Turbine Data')
plt.show(block=False)

Below is the output of resampling. Pretty clean, right?

Correlation of variables

Correlation refers to the degree to which two or more variables are related or associated with each other. The correlation coefficient is a statistical measure that describes the strength and direction of the linear relationship between two variables.
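As a quick illustration with made-up numbers, the Pearson coefficient computed by `corr()` is +1 for a perfectly increasing linear relationship and -1 for a perfectly decreasing one:

```python
import pandas as pd

# made-up numbers: "power" rises linearly with "speed", "loss" falls linearly
df = pd.DataFrame({
    "speed": [1.0, 2.0, 3.0, 4.0],
    "power": [10.0, 20.0, 30.0, 40.0],
    "loss":  [8.0, 6.0, 4.0, 2.0],
})

# Pearson coefficients: speed vs power -> 1.0, speed vs loss -> -1.0
print(df.corr())
```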

# visualize the degree of correlation between variables
sns.heatmap(data.corr(), annot=True)

Below is the output displaying the heat map with Pearson correlation coefficient.

In the above graph, the correlation coefficient is the number in each box. Let’s look at our variable of interest, “ActivePower”, and consider the magnitude of its correlation coefficients.

Generally, correlation coefficients above 0.5 or below -0.5 are considered strong, while those between 0.3 and 0.5 or -0.3 and -0.5 are considered moderate.

Let’s remove the variables with weak coefficients, i.e. those between -0.5 and 0.5.

# dropping columns with weak correlation coefficients
data = data.drop(['AmbientTemperatue', 'TurbineStatus', 'WindDirection', 'NacellePosition', 'MainBoxTemperature','HubTemperature','Blade3PitchAngle','Blade2PitchAngle', 'Blade1PitchAngle'], axis=1)
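The drop list above was read off the heat map by hand. As an alternative sketch (the helper name and its use here are illustrative, not part of the original analysis), the same filter can be expressed programmatically:

```python
import pandas as pd

def keep_strongly_correlated(df: pd.DataFrame, target: str,
                             threshold: float = 0.5) -> pd.DataFrame:
    # keep the target plus every column whose absolute Pearson correlation
    # with the target meets the threshold
    corr = df.corr()[target].abs()
    return df[corr[corr >= threshold].index]
```

Calling `keep_strongly_correlated(data, 'ActivePower')` would keep “ActivePower” plus every column whose absolute correlation with it is at least 0.5, matching the rule of thumb above.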

Now we have the data prepared that we can use for forecasting. We can draw a pair plot as the last step to understand the pairwise relationships between variables in a dataset.

Multivariate Analysis

Let’s create a pair plot that shows scatter plots of all possible pairs of variables in the DataFrame, with the points colored based on the value of the “ActivePower” column. Additionally, the diagonal subplots will show histograms of each variable in the DataFrame.

# creating a pairplot for 'ActivePower' values with remaining columns
sns.pairplot(data, hue='ActivePower', diag_kind='hist')
plt.show()

From the above analysis, it can be concluded that power generation is directly correlated with wind speed and generator RPM.

This completes the basic EDA on wind energy data set. There are many libraries available that can be used to do further analysis if required.

I will use the data prepared in this write-up for time series forecasting in an upcoming blog.

Stay tuned!


Javed is a solution architect with 15 years of experience in diverse technology domains.