Time Series: Exploratory Data Analysis

Javed Afroz
8 min read · Feb 27, 2023

In this section I will discuss various techniques used to analyze time series data in order to prepare it for forecasting using Python libraries. The goal is to prepare a time series data set that forecasting algorithms can use to forecast wind power generation. The forecasting algorithms themselves are not part of this write-up; I will cover them in a separate blog.

What is Time Series Data Analysis?

Time series data analysis is a statistical method used to analyze data points collected over time. It involves studying how the data evolves in order to identify underlying patterns, trends, and relationships between the data points.

In time series analysis, the data is usually collected at regular intervals, such as daily, weekly, or monthly, and the data points are arranged in chronological order. The analysis typically involves looking at the statistical properties of the data, such as mean, variance, and correlation, to identify trends, patterns, and relationships over time.
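As a toy illustration (the numbers below are made up, not taken from the turbine data set), these basic statistical properties can be computed with pandas on a date-indexed series:

```python
import pandas as pd

# a small made-up daily series (not from the turbine data set)
ts = pd.Series(
    [10.0, 12.0, 11.0, 13.0, 12.0, 14.0],
    index=pd.date_range("2023-01-01", periods=6, freq="D"),
)

print(ts.mean())           # mean level of the series -> 12.0
print(ts.var())            # sample variance around the mean -> 2.0
print(ts.autocorr(lag=1))  # correlation with the previous day's value
```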

Time series analysis is commonly used in various fields, including economics, finance, weather forecasting, signal processing, and engineering, to name a few. It helps in predicting future trends and patterns in the data and helps in making informed decisions based on the analysis.

What is Exploratory Data Analysis (EDA)?

Exploratory data analysis (EDA) is the process of analyzing and summarizing data sets to gain insights and identify patterns, relationships, anomalies, and potential issues or outliers. It is an essential step in data science, machine learning, and statistical analysis that helps to understand the data better and formulate hypotheses for further investigation.

The goal of EDA is to explore the data, identify important features and characteristics, and generate visualizations and statistical summaries that reveal meaningful information. The process involves various techniques such as histograms, scatter plots, box plots, heat maps, correlation matrices, and other exploratory plots to investigate the data.

EDA is also used to identify data quality issues, such as missing or inconsistent data, outliers, and anomalies, and to develop strategies to address them. Overall, EDA helps to ensure that the data is suitable for the intended analysis and that the results are accurate and reliable.

Once you have completed EDA you need to prepare the data for further analysis and forecasting. Data preparation is a crucial step in time series data analysis because time series data is often noisy and inconsistent, and can contain missing or erroneous values. Data preparation involves cleaning, transforming, and normalizing the data to make it suitable for analysis.

  1. Quality of data: Time series data can contain errors and missing values, which can affect the accuracy of analysis. Data preparation helps identify and remove such errors and missing values from the data.
  2. Consistency of data: Time series data is collected at different intervals and can have inconsistencies due to changes in measurement instruments or methods. Data preparation ensures that the data is consistent and can be compared across time.
  3. Feature selection: Time series data can have many variables or features that are not relevant to the analysis. Data preparation helps in selecting the relevant features for analysis.
  4. Normalization: Time series data can have different scales and ranges for different variables, making it difficult to compare them. Data preparation normalizes the data, making it comparable.

Overall, data preparation is necessary for time series data analysis to ensure that the data is accurate, consistent, and relevant for analysis and forecasting.
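As one concrete example, the normalization mentioned in point 4 is often done with min-max scaling, which maps every column onto the 0 to 1 range. A minimal sketch with made-up column names and values:

```python
import pandas as pd

# made-up frame with two columns on very different scales
df = pd.DataFrame({
    "wind_speed": [2.0, 6.0, 10.0],
    "power_kw": [0.0, 900.0, 1800.0],
})

# min-max scaling: (x - min) / (max - min), applied column-wise
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)  # every column now runs from 0.0 to 1.0
```

After scaling, both columns are directly comparable even though their raw ranges differ by two orders of magnitude.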

Data Set Used for EDA

I have used a wind energy data set from Kaggle for the analysis. This data set contains various weather, turbine, and rotor features. Readings were recorded at a 10-minute interval from January 2018 till March 2020.

Let’s define the goal of this analysis: to predict wind power generation on a weekly basis.

#Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#Load the dataset
data = pd.read_csv("Turbine_Data.csv")

As a first step, run the commands below on the loaded data. The objective of this step is to familiarize yourself with the data set. Have a good look at the data, check the columns and rows, and pay attention to missing values, data types, etc.

# show the first few rows of the dataset
data.head()
# show the last few rows of the dataset
data.tail()
# show information about the dataset such as data types and missing values
data.info()
# show statistical summaries of the dataset
data.describe()

The first column, which contains the timestamp, has an empty name. Let’s rename it to “Date”. Since the later steps (monthly boxplots, resampling) need a time-based index, we also parse the column and set it as the index.

# rename first column to "Date" and use it as a datetime index
data = data.rename(columns={data.columns[0]: 'Date'})
data['Date'] = pd.to_datetime(data['Date'])
data = data.set_index('Date')

Check Outliers

Let’s plot a boxplot of the power generation values for all the months in the dataset.

sns.set(rc={'figure.figsize':(14,10)})
sns.boxplot(x=data.index.month_name(), y='ActivePower', data=data, palette='muted')
plt.ylabel('ActivePower')
plt.xlabel('Month')
plt.title("Power Generated")

Below is the boxplot generated.

You can notice that there are negative values for active power in the above boxplot, which should not be possible. Let’s check further.

#Print ActivePower negative values
data['ActivePower'][data['ActivePower']<0]

Below is the output

Date
2018-01-01 00:00:00+00:00 -5.357727
2018-01-01 00:10:00+00:00 -5.822360
2018-01-01 00:20:00+00:00 -5.279409
2018-01-01 00:30:00+00:00 -4.648054
2018-01-01 00:40:00+00:00 -4.684632
...
2020-03-30 03:50:00+00:00 -7.005695
2020-03-30 04:00:00+00:00 -5.576951
2020-03-30 04:10:00+00:00 -4.945515
2020-03-30 04:20:00+00:00 -6.565684
2020-03-30 04:40:00+00:00 -9.814717
Name: ActivePower, Length: 15644, dtype: float64

Negative power generation doesn’t make sense. Let’s drop the rows with negative power generation values.

#Keep only rows with non-negative power generation values
data = data[data['ActivePower']>=0]

Check unique values

Let’s start with checking the unique values in the data set.

#Check unique values
data.nunique()

Below is the output. Notice that two columns have a value of 1, which means they hold a constant value in all rows.

ActivePower                     78449
AmbientTemperatue               77943
BearingShaftTemperature         52091
Blade1PitchAngle                33383
Blade2PitchAngle                33431
Blade3PitchAngle                33431
ControlBoxTemperature               1
GearboxBearingTemperature       52099
GearboxOilTemperature           52171
GeneratorRPM                    51730
GeneratorWinding1Temperature    52188
GeneratorWinding2Temperature    52191
HubTemperature                  31207
MainBoxTemperature              41376
NacellePosition                  4990
ReactivePower                   78413
RotorRPM                        51641
TurbineStatus                     121
WTG                                 1
WindDirection                    4990
WindSpeed                       78497

Constant values usually have no implication for our analysis. Therefore, let’s drop the columns with constant values, namely ‘WTG’ and ‘ControlBoxTemperature’.

# drop columns with constant values
data = data.drop(['WTG', 'ControlBoxTemperature'], axis=1)
data.nunique()

Let’s quickly verify the output below after removing the two columns.

ActivePower                     78449
AmbientTemperatue               77943
BearingShaftTemperature         52091
Blade1PitchAngle                33383
Blade2PitchAngle                33431
Blade3PitchAngle                33431
GearboxBearingTemperature       52099
GearboxOilTemperature           52171
GeneratorRPM                    51730
GeneratorWinding1Temperature    52188
GeneratorWinding2Temperature    52191
HubTemperature                  31207
MainBoxTemperature              41376
NacellePosition                  4990
ReactivePower                   78413
RotorRPM                        51641
TurbineStatus                     121
WindDirection                    4990
WindSpeed                       78497

Check Missing Values

The columns may contain null values which will need to be handled. Let’s check the null values in the data set.

# count the number of missing values in each column
data.isnull().sum()

Below is the output. Notice that many columns contain a large number of null or missing values.

ActivePower                         0
AmbientTemperatue                1032
BearingShaftTemperature         26932
Blade1PitchAngle                43397
Blade2PitchAngle                43480
Blade3PitchAngle                43480
GearboxBearingTemperature       26930
GearboxOilTemperature           26915
GeneratorRPM                    26919
GeneratorWinding1Temperature    26901
GeneratorWinding2Temperature    26894
HubTemperature                  27041
MainBoxTemperature              26952
NacellePosition                 20429
ReactivePower                      42
RotorRPM                        26925
TurbineStatus                   26577
WindDirection                   20429
WindSpeed                         308

Missing value treatment using linear interpolation

Linear interpolation is a technique used to estimate the value of a variable at a given point based on the values of the variable at surrounding points. It assumes that the variable changes at a constant rate between adjacent data points.
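A tiny example with made-up numbers shows the idea: pandas fills each gap by assuming a constant step between the known neighbouring values.

```python
import numpy as np
import pandas as pd

# a short made-up series with a two-point gap
s = pd.Series([10.0, np.nan, np.nan, 16.0])

# the gap is filled with equal steps of 2.0 between 10.0 and 16.0
print(s.interpolate(method="linear").tolist())  # [10.0, 12.0, 14.0, 16.0]
```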

Let’s apply linear interpolation on the missing data points.

## Linear Interpolation Method
data=data.interpolate(method ='linear', limit_direction ='backward')
data.isnull().sum()

Below is the output after linear interpolation of missing values is done successfully.

ActivePower                         0
AmbientTemperatue                   0
BearingShaftTemperature             0
Blade1PitchAngle                    0
Blade2PitchAngle                    0
Blade3PitchAngle                    0
GearboxBearingTemperature           0
GearboxOilTemperature               0
GeneratorRPM                        0
GeneratorWinding1Temperature        0
GeneratorWinding2Temperature        0
HubTemperature                      0
MainBoxTemperature                  0
NacellePosition                     0
ReactivePower                       0
RotorRPM                            0
TurbineStatus                       0
WindDirection                       0
WindSpeed                           0

Now that the missing values are treated, we can start plotting the data. Our variable of interest is “ActivePower”, as it contains the power generated at each timestamp. Let’s quickly plot the power generated against time.

# plot 'ActivePower' for the dates
data.reset_index().plot(x = 'Date', y = 'ActivePower')

The above plot is not clear because too many data points are involved: since the data is sampled every 10 minutes, the line is extremely dense. We are ultimately interested in weekly values, so we will first resample the data to a daily frequency, rolling the points up by their mean.

# resample the data on daily frequency
data = data.resample("D").mean()
data.reset_index().plot(x = "Date", y = 'ActivePower')
plt.legend(loc='best')
plt.title('Turbine Data')
plt.show(block=False)

Below is the output of resampling. Pretty clean, right?

Correlation of variables

Correlation refers to the degree to which two or more variables are related or associated with each other. The correlation coefficient is a statistical measure that describes the strength and direction of the linear relationship between two variables.
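As a quick illustration with made-up numbers, the Pearson coefficient computed by `corr()` is +1 for a perfectly increasing linear relationship and -1 for a perfectly decreasing one:

```python
import pandas as pd

# made-up numbers: "power" rises linearly with "speed", "loss" falls linearly
df = pd.DataFrame({
    "speed": [1.0, 2.0, 3.0, 4.0],
    "power": [10.0, 20.0, 30.0, 40.0],
    "loss":  [8.0, 6.0, 4.0, 2.0],
})

# Pearson coefficients: speed vs power -> 1.0, speed vs loss -> -1.0
print(df.corr())
```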

# visualize the degree of correlation between variables
sns.heatmap(data.corr(), annot=True)

Below is the output displaying the heat map with Pearson correlation coefficient.

In the above graph, the correlation coefficient is the number in each box. Let’s look at our variable of interest, “ActivePower”, and consider the magnitude of its correlation coefficients.

Generally, correlation coefficients above 0.5 or below -0.5 are considered strong, while those between 0.3 and 0.5 or -0.3 and -0.5 are considered moderate.

Let’s remove the variables with weak coefficients, i.e. those between -0.5 and 0.5.

# dropping columns with weak correlation coefficients
data = data.drop(['AmbientTemperatue', 'TurbineStatus', 'WindDirection', 'NacellePosition', 'MainBoxTemperature','HubTemperature','Blade3PitchAngle','Blade2PitchAngle', 'Blade1PitchAngle'], axis=1)
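The drop list above was read off the heat map by hand. As an alternative sketch (the helper name and its use here are illustrative, not part of the original analysis), the same filter can be expressed programmatically:

```python
import pandas as pd

def keep_strongly_correlated(df: pd.DataFrame, target: str,
                             threshold: float = 0.5) -> pd.DataFrame:
    # keep the target plus every column whose absolute Pearson correlation
    # with the target meets the threshold
    corr = df.corr()[target].abs()
    return df[corr[corr >= threshold].index]
```

Calling `keep_strongly_correlated(data, 'ActivePower')` would keep “ActivePower” plus every column whose absolute correlation with it is at least 0.5, matching the rule of thumb above.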

Now we have the data prepared that we can use for forecasting. We can draw a pair plot as the last step to understand the pairwise relationships between variables in a dataset.

Multivariate Analysis

Let’s create a pair plot that shows scatter plots of all possible pairs of variables in the DataFrame, with the points colored based on the value of the “ActivePower” column. Additionally, the diagonal subplots will show histograms of each variable in the DataFrame.

# creating a pairplot for 'ActivePower' values with remaining columns
sns.pairplot(data, hue='ActivePower', diag_kind='hist')
plt.show()

From the above analysis, it can be concluded that power generation is directly correlated with wind speed and generator RPM.

This completes the basic EDA on wind energy data set. There are many libraries available that can be used to do further analysis if required.

I will use the data prepared in this write-up for time series forecasting in an upcoming blog.

Stay tuned!


Javed is a solution architect with 15 years of experience in diverse technology domains.