Predicting wind energy

Introduction

I completed this project as part of a Coursera course that focused on data science and social good. I’m always keen to learn more about predictive analytics, given it has multiple benefits in the energy sector and learning more about the energy / sustainability sector.

Wind power forecasting is crucial for optimising grid stability and enhancing the efficiency of renewable energy sources. Data science and predictive analytics is a critical tool for estimating future output from wind farms that can leverage large amounts of data to improve predictions.

Data Source & Methodology

This project investigates the Spatial Dynamic Wind Power Forecasting (SDWPF) dataset, which contains data from 134 wind turbines from a wind farm in China. The SDWPF data was provided by the Longyuan Power Group, which is the largest wind power producer in China and Asia. We will design a solution to forecast wind power.

Data Preparation

Show the code

import numpy as np # package for numerical calculations
import pandas as pd # package for reading in and manipulating data
import utils_1 as utils # utility functions for this lab

import warnings
warnings.filterwarnings('ignore')

print('All packages imported successfully!')
## All packages imported successfully!

Load the dataset

The original dataset contains information of 134 turbines. We will select the top 10 turbines that produced the most power on average, and convert the day and timestamp columns into a single datetime column.

Show the code
# Load the data from the csv file
raw_data = pd.read_csv("./data/wtbdata_245days.csv")

# Select only the top 10 turbines
top_turbines = utils.top_n_turbines(raw_data, 10)
## Original data has 4727520 rows from 134 turbines.
## 
## Sliced data has 352800 rows from 10 turbines.
# 
# # Format datetime (this takes around 15 secs)
top_turbines = utils.format_datetime(top_turbines, initial_date_str="01 05 2020")
# 
# # Print out the first few lines of data
top_turbines.head()
##              Datetime  TurbID  Wspd  Wdir  ...  Pab2  Pab3  Prtv    Patv
## 0 2020-05-01 00:00:00       1   NaN   NaN  ...   NaN   NaN   NaN     NaN
## 1 2020-05-01 00:10:00       1  6.17 -3.99  ...   1.0   1.0 -0.25  494.66
## 2 2020-05-01 00:20:00       1  6.27 -2.18  ...   1.0   1.0 -0.24  509.76
## 3 2020-05-01 00:30:00       1  6.42 -0.73  ...   1.0   1.0 -0.26  542.53
## 4 2020-05-01 00:40:00       1  6.25  0.89  ...   1.0   1.0 -0.23  509.36
## 
## [5 rows x 12 columns]

Catalog abnormal values

In the paper associated with this dataset, the authors explain that some values should be excluded from the analysis because they are either missing, unknown or abnormal.

missing values are self explanatory but here are the definitions for the other two types:

unknown: - if Patv ≤ 0 and Wspd > 2.5 - if Pab1 > 89° or Pab2 > 89° or Pab3 > 89°

abnormal: - if Ndir < -720 or Ndir > 720 - if Wdir < -180 or Wdir > 180

We create a new column called Include in the dataframe and set the value to False for every missing / unknown / abnormal value:

Show the code
# Initially include all rows
top_turbines["Include"] = True

# Define conditions for abnormality
conditions = [
    np.isnan(top_turbines.Patv),
    (top_turbines.Pab1 > 89) | (top_turbines.Pab2 > 89) | (top_turbines.Pab3 > 89),
    (top_turbines.Ndir < -720) | (top_turbines.Ndir > 720),
    (top_turbines.Wdir < -180) | (top_turbines.Wdir > 180),
    (top_turbines.Patv <= 0) & (top_turbines.Wspd > 2.5)
]

# Exclude abnormal features
for condition in conditions:
    top_turbines = utils.tag_abnormal_values(top_turbines, condition)
    
clean_data = top_turbines[top_turbines.Include].drop(["Include"], axis=1)

clean_data.head()
##              Datetime  TurbID  Wspd  Wdir  ...  Pab2  Pab3  Prtv    Patv
## 1 2020-05-01 00:10:00       1  6.17 -3.99  ...   1.0   1.0 -0.25  494.66
## 2 2020-05-01 00:20:00       1  6.27 -2.18  ...   1.0   1.0 -0.24  509.76
## 3 2020-05-01 00:30:00       1  6.42 -0.73  ...   1.0   1.0 -0.26  542.53
## 4 2020-05-01 00:40:00       1  6.25  0.89  ...   1.0   1.0 -0.23  509.36
## 5 2020-05-01 00:50:00       1  6.10 -1.03  ...   1.0   1.0 -0.27  482.21
## 
## [5 rows x 12 columns]

Establish a baseline for wind power estimation

Next we create a baseline for wind power estimation using a linear regression model to fit the relationship between wind speed and power output. Plots of predicted vs actual power output values and mean absolute error for Turbine 1 are generated below.

Show the code

utils.fit_and_plot_linear_model(
  data_og=clean_data,
  turbine = 1,
  features = ["Wspd"]
)
## Turbine 1, Mean Absolute Error (kW): 106.27

Feature engineering

During the Feature Engineering process we will transform existing features into better representations, combine features, fix issues with them and create new features.

Delete redundant features - Pab

All of the Pab# features (which stands for pitch angle blade #) are perfectly correlated, which means that they are redundant. Instead we keep only one of these features and rename it as Pab.

Show the code
# Aggregate pab features
clean_data = utils.cut_pab_features(clean_data)

clean_data.head(5)
##              Datetime  TurbID  Wspd  Wdir  ...   Ndir  Pab  Prtv    Patv
## 1 2020-05-01 00:10:00       1  6.17 -3.99  ...  25.92  1.0 -0.25  494.66
## 2 2020-05-01 00:20:00       1  6.27 -2.18  ...  20.91  1.0 -0.24  509.76
## 3 2020-05-01 00:30:00       1  6.42 -0.73  ...  20.91  1.0 -0.26  542.53
## 4 2020-05-01 00:40:00       1  6.25  0.89  ...  20.91  1.0 -0.23  509.36
## 5 2020-05-01 00:50:00       1  6.10 -1.03  ...  20.91  1.0 -0.27  482.21
## 
## [5 rows x 10 columns]

Transform angle features

There are 3 features (Wdir, Ndir, Pab) which are encoded in degrees. This is problematic because the model has no way of knowing that angles with very different values (such as 0° and 360°) are actually very similar (the same in this case) to each other. To address this you can transform these features into their sine/cosine representations.

Show the code
# Transform all angle-encoded features
for feature in ["Wdir", "Ndir", "Pab"]:
    utils.transform_angles(clean_data, feature)  
    
clean_data.head(5)
##              Datetime  TurbID  Wspd  ...   NdirSin    PabCos    PabSin
## 1 2020-05-01 00:10:00       1  6.17  ...  0.437116  0.999848  0.017452
## 2 2020-05-01 00:20:00       1  6.27  ...  0.356901  0.999848  0.017452
## 3 2020-05-01 00:30:00       1  6.42  ...  0.356901  0.999848  0.017452
## 4 2020-05-01 00:40:00       1  6.25  ...  0.356901  0.999848  0.017452
## 5 2020-05-01 00:50:00       1  6.10  ...  0.356901  0.999848  0.017452
## 
## [5 rows x 13 columns]

Fix temperatures and active power

The variables Etmp and Itmp both contain large negative values. These minimum values are very close to the absolute zero (-273.15 °C) which is most certainly an error (linear interpolation is used to fix these values). The paper indicates that negative values indicate active power, which should be treated as zero.

Show the code
# Fix temperature values
clean_data = utils.fix_temperatures(clean_data)

# Fix negative active powers
clean_data["Patv"] = clean_data["Patv"].apply(lambda x: max(0, x))

clean_data.head(5)
##              Datetime  TurbID  Wspd  ...   NdirSin    PabCos    PabSin
## 1 2020-05-01 00:10:00       1  6.17  ...  0.437116  0.999848  0.017452
## 2 2020-05-01 00:20:00       1  6.27  ...  0.356901  0.999848  0.017452
## 3 2020-05-01 00:30:00       1  6.42  ...  0.356901  0.999848  0.017452
## 4 2020-05-01 00:40:00       1  6.25  ...  0.356901  0.999848  0.017452
## 5 2020-05-01 00:50:00       1  6.10  ...  0.356901  0.999848  0.017452
## 
## [5 rows x 13 columns]

Create time features

You will create features that encode the time-of-day signals for each data point in the dataset. Further details can be found in this post.

Show the code
# Generate time signals
clean_data = utils.generate_time_signals(clean_data)

clean_data.head(5)
##              Datetime  TurbID  Wspd  ...    PabSin  Time-of-day sin  Time-of-day cos
## 1 2020-05-01 00:10:00       1  6.17  ...  0.017452         0.043619         0.999048
## 2 2020-05-01 00:20:00       1  6.27  ...  0.017452         0.087156         0.996195
## 3 2020-05-01 00:30:00       1  6.42  ...  0.017452         0.130526         0.991445
## 4 2020-05-01 00:40:00       1  6.25  ...  0.017452         0.173648         0.984808
## 5 2020-05-01 00:50:00       1  6.10  ...  0.017452         0.216440         0.976296
## 
## [5 rows x 15 columns]

These are the final steps to prepare data for modeling.

Show the code
# Define predictor features 
predictors = [f for f in clean_data.columns if f not in ["Datetime", "TurbID", "Patv"]]

# Define target feature
target = ["Patv"]

# Re-arrange features before feeding into models
model_data = clean_data[["TurbID"]+predictors+target]

model_data.head(5)
##    TurbID  Wspd   Etmp  ...  Time-of-day sin  Time-of-day cos    Patv
## 1       1  6.17  30.73  ...         0.043619         0.999048  494.66
## 2       1  6.27  30.60  ...         0.087156         0.996195  509.76
## 3       1  6.42  30.52  ...         0.130526         0.991445  542.53
## 4       1  6.25  30.49  ...         0.173648         0.984808  509.36
## 5       1  6.10  30.47  ...         0.216440         0.976296  482.21
## 
## [5 rows x 14 columns]

Update linear model baseline with more features

Now we’ll model with the new set of features. We have the same plots as before as well as a plot that shows the average feature importance for every feature included:

Show the code
# Create a linear model with more features

utils.fit_and_plot_linear_model(
  data_og=model_data,
  turbine = 1,
  features = predictors
)
## Turbine 1, Mean Absolute Error (kW): 101.49

Use a neural network to improve wind power estimation

A neural network will be run for comparison to the linear model. This model will initially contain all predictors:

Show the code

# Train a neural network model

from sklearn import metrics
import matplotlib.pyplot as plt

FONT_SIZE_TICKS = 15
FONT_SIZE_TITLE = 20
FONT_SIZE_AXES = 20

data = model_data[model_data.TurbID == 1]

(X_train, X_test, y_train, y_test), (
    X_train_mean,
    X_train_std,
    y_train_mean,
    y_train_std,
) = utils.split_and_normalize(data = data, features = predictors)

train_loader, test_loader = utils.batch_data(
    X_train, X_test, y_train, y_test, batch_size=32
)

model, loss_fn, optimizer = utils.compile_model(features = predictors)
model = utils.train_model(
    model, loss_fn, optimizer, train_loader, test_loader, epochs=5
)
## Epoch: 0 | Train loss: 0.05706 | Test loss: 0.12550
## Epoch: 1 | Train loss: 0.05730 | Test loss: 0.11854
## Epoch: 2 | Train loss: 0.04337 | Test loss: 0.12140
## Epoch: 3 | Train loss: 0.04194 | Test loss: 0.11696
## Epoch: 4 | Train loss: 0.04325 | Test loss: 0.11612

X_test_2 = X_test.detach().numpy()
X_test_denormalized = (X_test_2[:, 0] * X_train_std[0]) + X_train_mean[0]

y_test_denormalized = (y_test * y_train_std) + y_train_mean
test_preds = model(X_test).detach().numpy()
test_preds_denormalized = (test_preds * y_train_std) + y_train_mean
X_plot = data["Wspd"]
Y_real = data["Patv"]
X_eq_Y = np.linspace(0, max(y_test_denormalized), 100)

print(
    f"Mean Absolute Error: {metrics.mean_absolute_error(y_test_denormalized, test_preds_denormalized):.2f}\n"
)
## Mean Absolute Error: 29.02

mae_nn = round(metrics.mean_absolute_error(y_test_denormalized, test_preds_denormalized),2)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

utils.plot_wind_speed_vs_power(ax1, X_plot, Y_real, X_test_denormalized, test_preds_denormalized)
utils.plot_predicted_vs_real(ax2, y_test_denormalized, test_preds_denormalized, X_eq_Y, X_eq_Y)
plt.tight_layout()
plt.show() 

Show the code
# Plot importance of predictors

import shap
import seaborn as sns

train_loader, test_loader = utils.batch_data(
    X_train, X_test, y_train, y_test, batch_size=128
)

x_train_batch, _ = next(iter(train_loader))
x_test_batch, _ = next(iter(test_loader))

model.eval()
## RegressorNet(
##   (fc_layers): Sequential(
##     (0): Linear(in_features=12, out_features=64, bias=True)
##     (1): ReLU()
##     (2): Linear(in_features=64, out_features=32, bias=True)
##     (3): ReLU()
##     (4): Linear(in_features=32, out_features=1, bias=True)
##   )
## )
e = shap.DeepExplainer(model, x_train_batch)
shap_values = e.shap_values(x_test_batch)

means = np.mean(np.abs(shap_values), axis=0)
results = sorted(zip(predictors, means), key = lambda x: x[1], reverse=True)
results_df = pd.DataFrame.from_dict({k: [v] for (k, v) in results})

# Create a plot for feature importance
fig, ax = plt.subplots(figsize=(7.5, 6))
ax.set_xlabel("Importance Score", fontsize=FONT_SIZE_AXES)
## Text(0.5, 0, 'Importance Score')
ax.set_ylabel("Feature", fontsize=FONT_SIZE_AXES)
## Text(0, 0.5, 'Feature')
ax.set_title("Feature Importance", fontsize=FONT_SIZE_TITLE)
## Text(0.5, 1.0, 'Feature Importance')
ax.tick_params(labelsize=FONT_SIZE_TICKS)

x_vals = [float(x) for x in results_df.iloc[0]]
y_vals = results_df.columns

sns.barplot(x=x_vals, y=y_vals, orient="h", ax=ax, color="deepskyblue", width=0.3)
## <Axes: title={'center': 'Feature Importance'}, xlabel='Importance Score', ylabel='Feature'>
plt.tight_layout()

Conclusions

As expected, the relationship of wind speed and power generation was strong. To a lesser extent, temperature was found to predict power generation. Other predictors did contribute to the predictive ability of the model with an increased MAE (Mean Absolute Error) when other predictors were removed from the model.

The Mean Absolute Error reduced to around 30 for the neural network compared to around 100 for the linear regression model, which indicates our predictive ability has vastly improved with the neural network.