I completed this project as part of a Coursera course focused on data science and social good. I’m always keen to learn more about predictive analytics, given its many applications in the energy and sustainability sector.
Wind power forecasting is crucial for maintaining grid stability and improving the efficiency of renewable energy sources. Data science and predictive analytics are critical tools for estimating future output from wind farms, since they can leverage large amounts of data to improve predictions.
Data Source & Methodology
This project investigates the Spatial Dynamic Wind Power Forecasting (SDWPF) dataset, which contains data from 134 wind turbines from a wind farm in China. The SDWPF data was provided by the Longyuan Power Group, which is the largest wind power producer in China and Asia. We will design a solution to forecast wind power.
Data Preparation
import numpy as np  # package for numerical calculations
import pandas as pd  # package for reading in and manipulating data
import utils_1 as utils  # utility functions for this lab
import warnings
warnings.filterwarnings('ignore')
print('All packages imported successfully!')
## All packages imported successfully!
Load the dataset
The original dataset contains data for 134 turbines. We will select the 10 turbines that produced the most power on average, and convert the day and timestamp columns into a single datetime column.
# Load the data from the csv file
raw_data = pd.read_csv("./data/wtbdata_245days.csv")
# Select only the top 10 turbines
top_turbines = utils.top_n_turbines(raw_data, 10)
## Original data has 4727520 rows from 134 turbines.
##
## Sliced data has 352800 rows from 10 turbines.
# Format datetime (this takes around 15 secs)
top_turbines = utils.format_datetime(top_turbines, initial_date_str="01 05 2020")
# Print out the first few lines of data
top_turbines.head()
##               Datetime  TurbID  Wspd  Wdir  ...  Pab2  Pab3  Prtv    Patv
## 0  2020-05-01 00:00:00       1   NaN   NaN  ...   NaN   NaN   NaN     NaN
## 1  2020-05-01 00:10:00       1  6.17 -3.99  ...   1.0   1.0 -0.25  494.66
## 2  2020-05-01 00:20:00       1  6.27 -2.18  ...   1.0   1.0 -0.24  509.76
## 3  2020-05-01 00:30:00       1  6.42 -0.73  ...   1.0   1.0 -0.26  542.53
## 4  2020-05-01 00:40:00       1  6.25  0.89  ...   1.0   1.0 -0.23  509.36
##
## [5 rows x 12 columns]
Catalog abnormal values
In the paper associated with this dataset, the authors explain that some values should be excluded from the analysis because they are either missing, unknown or abnormal.
Missing values are self-explanatory, but here are the definitions for the other two types:
unknown:
- if Patv ≤ 0 and Wspd > 2.5
- if Pab1 > 89° or Pab2 > 89° or Pab3 > 89°
abnormal:
- if Ndir < -720 or Ndir > 720
- if Wdir < -180 or Wdir > 180
We create a new column called Include in the dataframe and set the value to False for every missing / unknown / abnormal value:
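The rules above can be sketched in pandas roughly as follows. This is a minimal illustration, not the course's utility code: the helper name `flag_abnormal` and the toy data are hypothetical, and treating a row as missing when any column is NaN is an assumption. Only the column names and thresholds come from the definitions above.

```python
import numpy as np
import pandas as pd


def flag_abnormal(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a boolean Include column (hypothetical helper)."""
    out = df.copy()
    # Missing (assumption): any reading in the row is NaN
    missing = out.isna().any(axis=1)
    # Unknown: no power despite sufficient wind, or a blade pitched beyond 89 degrees
    unknown = ((out["Patv"] <= 0) & (out["Wspd"] > 2.5)) | (
        out[["Pab1", "Pab2", "Pab3"]] > 89
    ).any(axis=1)
    # Abnormal: nacelle or wind direction outside its valid range
    abnormal = (out["Ndir"].abs() > 720) | (out["Wdir"].abs() > 180)
    out["Include"] = ~(missing | unknown | abnormal)
    return out


# Toy rows (made up): valid, unknown, abnormal, missing
toy = pd.DataFrame({
    "Wspd": [6.2, 5.0, 6.1, np.nan],
    "Wdir": [-3.9, 2.0, 200.0, 1.0],
    "Ndir": [10.0, 20.0, 30.0, 40.0],
    "Pab1": [1.0, 1.0, 1.0, 1.0],
    "Pab2": [1.0, 1.0, 1.0, 1.0],
    "Pab3": [1.0, 1.0, 1.0, 1.0],
    "Patv": [494.7, 0.0, 500.0, 510.0],
})
flagged = flag_abnormal(toy)
print(flagged["Include"].tolist())
```

Keeping the flag as a column, rather than dropping rows immediately, leaves the original time series intact so the excluded rows can still be inspected later.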
Next we create a baseline for wind power estimation using a linear regression model to fit the relationship between wind speed and power output. Plots of predicted vs actual power output values and mean absolute error for Turbine 1 are generated below.
utils.fit_and_plot_linear_model(data_og=clean_data, turbine=1, features=["Wspd"])
## Turbine 1, Mean Absolute Error (kW): 106.27
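The internals of the utility function are not shown here. As a rough illustration of what a single-feature baseline involves, the sketch below fits a straight line with `numpy.polyfit` and computes the mean absolute error on a held-out slice. The data is synthetic and the variable names are made up, so the numbers have no connection to Turbine 1.

```python
import numpy as np

# Synthetic wind speed (m/s) and power (kW); not the SDWPF data
rng = np.random.default_rng(0)
wspd = rng.uniform(3.0, 12.0, size=200)
patv = 120.0 * wspd - 250.0 + rng.normal(0.0, 40.0, size=200)

# Fit on the first 80% of the samples, evaluate on the rest
split = int(0.8 * len(wspd))
slope, intercept = np.polyfit(wspd[:split], patv[:split], deg=1)

# Mean absolute error: average size of the prediction errors, in kW
pred = slope * wspd[split:] + intercept
mae = np.mean(np.abs(patv[split:] - pred))
print(f"Mean Absolute Error (kW): {mae:.2f}")
```

A real turbine power curve is not linear (it flattens at rated power), which is one reason a straight-line fit leaves a large error.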
Feature engineering
During the Feature Engineering process we will transform existing features into better representations, combine features, fix issues with them and create new features.
Delete redundant features - Pab
All of the Pab# features (where Pab# stands for the pitch angle of blade #) are perfectly correlated, which means that two of them are redundant. We keep only one of these features and rename it Pab.
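One way to check the redundancy and collapse the three columns is sketched below. The toy values are made up, and the choice to keep Pab1 is an assumption, since any of the three would do.

```python
import pandas as pd

# Toy frame (made-up values) where the three pitch angles move together
df = pd.DataFrame({
    "Pab1": [0.0, 1.0, 5.0, 12.0],
    "Pab2": [0.0, 1.0, 5.0, 12.0],
    "Pab3": [0.0, 1.0, 5.0, 12.0],
    "Wspd": [6.1, 6.3, 9.0, 11.5],
})

# Pairwise Pearson correlation between the blade pitch angles
pab_corr = df[["Pab1", "Pab2", "Pab3"]].corr()
print(pab_corr)

# Keep one column, drop the other two, and rename the survivor
df = df.drop(columns=["Pab2", "Pab3"]).rename(columns={"Pab1": "Pab"})
```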
There are 3 features (Wdir, Ndir, Pab) which are encoded in degrees. This is problematic because the model has no way of knowing that angles with very different values (such as 0° and 360°) are actually very similar (the same in this case) to each other. To address this you can transform these features into their sine/cosine representations.
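A minimal sketch of the sine/cosine transformation, assuming the angles are given in degrees. The helper name `encode_angle` is hypothetical.

```python
import numpy as np


def encode_angle(degrees):
    """Map angles in degrees to (sin, cos) pairs (hypothetical helper)."""
    radians = np.deg2rad(np.asarray(degrees, dtype=float))
    return np.sin(radians), np.cos(radians)


sin_vals, cos_vals = encode_angle([0.0, 90.0, 360.0])

# 0 and 360 degrees map to (almost exactly) the same point on the unit circle
print(np.allclose([sin_vals[0], cos_vals[0]], [sin_vals[2], cos_vals[2]]))
```

Both components are needed: the sine alone cannot distinguish 0° from 180°, and the cosine alone cannot distinguish 90° from 270°.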
The variables Etmp and Itmp both contain large negative values. These minimum values are very close to absolute zero (-273.15 °C), which is almost certainly an error, so linear interpolation is used to replace them. The paper also indicates that negative values of active power (Patv) should be treated as zero.
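Both fixes can be sketched in a few lines of pandas. The cut-off of -50 °C used to identify implausible temperatures is an assumption made for this example and does not come from the lab or the paper. The toy values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Etmp": [30.0, -273.0, 32.0, 33.0],   # second reading is physically implausible
    "Patv": [494.7, -0.3, 510.0, -1.2],   # small negative power readings
})

# Mark implausible temperatures as missing (the -50 threshold is an assumption)
df.loc[df["Etmp"] < -50, "Etmp"] = np.nan
# Fill the gaps by linear interpolation between neighbouring readings
df["Etmp"] = df["Etmp"].interpolate(method="linear")

# Treat negative active power as zero
df["Patv"] = df["Patv"].clip(lower=0)
print(df)
```

With 10-minute sampling, interpolating between neighbouring readings is a reasonable repair for temperature, which changes slowly. It would be a poorer choice for a fast-changing signal such as wind speed.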
These are the final steps to prepare data for modeling.
# Define predictor features
predictors = [f for f in clean_data.columns if f not in ["Datetime", "TurbID", "Patv"]]
# Define target feature
target = ["Patv"]
# Re-arrange features before feeding into models
model_data = clean_data[["TurbID"] + predictors + target]
model_data.head(5)
##    TurbID  Wspd   Etmp  ...  Time-of-day sin  Time-of-day cos    Patv
## 1       1  6.17  30.73  ...         0.043619         0.999048  494.66
## 2       1  6.27  30.60  ...         0.087156         0.996195  509.76
## 3       1  6.42  30.52  ...         0.130526         0.991445  542.53
## 4       1  6.25  30.49  ...         0.173648         0.984808  509.36
## 5       1  6.10  30.47  ...         0.216440         0.976296  482.21
##
## [5 rows x 14 columns]
Update linear model baseline with more features
Now we’ll model with the new set of features. We have the same plots as before as well as a plot that shows the average feature importance for every feature included:
# Create a linear model with more features
utils.fit_and_plot_linear_model(data_og=model_data, turbine=1, features=predictors)
## Turbine 1, Mean Absolute Error (kW): 101.49
Use a neural network to improve wind power estimation
A neural network will be run for comparison to the linear model. This model will initially contain all predictors:
As expected, the relationship between wind speed and power generation was strong. To a lesser extent, temperature also predicted power generation. The remaining predictors contributed as well: the MAE (Mean Absolute Error) increased when they were removed from the model.
The Mean Absolute Error fell to around 30 for the neural network, compared with around 100 for the linear regression model, which indicates that the neural network predicts power output substantially more accurately.