This notebook is an element of the risk-engineering.org courseware. It can be distributed under the terms of the Creative Commons Attribution-ShareAlike licence.
Author: Eric Marsden eric.marsden@risk-engineering.org
In this notebook, we illustrate features of the NumPy, Pandas and statsmodels libraries features for building linear regression models in Python. Consult the accompanying course materials for background on linear regression analysis, some notes on when the technique is useful, and to download this content as a Jupyter/Python notebook.
Linear regression analysis of combined cycle power plant data¶
import numpy
import pandas
import matplotlib.pyplot as plt
import scipy.stats
plt.style.use("bmh")
%config InlineBackend.figure_formats=["png"]
We will analyze data from a combined cycle power plant to attempt to build a predictive model for output power. The data comes from the UCI machine learning repository.
http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
The dataset contains 9568 data points collected from a combined cycle power plant over 6 years, when power plant was under full load. A combined cycle power plant is composed of gas turbines, steam turbines and heat recovery steam generators. Electricity is generated by gas & steam turbines, which are combined in one cycle. Three ambient variables affect the performance of the gas turbine, and exhaust vacuum affects the performance of the steam turbine.
Data consists of hourly averages taken from various sensors located around the plant that record the ambient variables every second.
Let’s load it into Python and examine it using the pandas
library. For convenience, we have unzipped the dataset and made it web accessible.
data = pandas.read_csv("https://risk-engineering.org/static/data/CCPP.csv")
data.head()
The table below provides the meaning of the various columns.
Meaning | Name | Range |
---|
| Ambient Temperature | AT | 1.81 – 37.11°C | Ambient Pressure | AP | 992.89 – 1033.30 millibar | Relative Humidity | RH | 25.56% – 100.16% | Exhaust Vacuum | V | 25.36 – 81.56 cm Hg | Net hourly electrical energy output | PE | 420.26 – 495.76 MW |
data.describe()
Visualization¶
We can obtain a first impression of the dependency between variables by examining a multidimensional scatterplot.
from pandas.plotting import scatter_matrix
scatter_matrix(data, diagonal="kde");
In this matrix, the diagonal contains a plot of the distribution of each variable. We observe that:
there is an approximately linear relationship between PE and the negative of AT
there is an approximately linear relationship between PE and negative of V
We can also generate a 3D plot of the observations, which can sometimes help to interpret the data more easily. Here we plot PE as a function of AT and V.
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(projection="3d")
fig.add_axes(ax)
ax.scatter(data["AT"], data["V"], data["PE"])
ax.set_xlabel("AT")
ax.set_ylabel("V")
ax.set_zlabel("PE")
ax.set_facecolor("white")
Prediction¶
We will created a fitted linear model using the formula API of the statsmodels
library. We use all the observations except for PE as predictor variables in the multiple linear regression.
import statsmodels.formula.api as smf
lm = smf.ols(formula="PE ~ AT + V + AP + RH", data=data).fit()
lm.params
This means that the best formula to estimate output power as a function of AT, V, AP and RH is
$$PE = 451 . 067793 − 1 . 974731 AT − 0 . 234992 V + 0 . 065540 AP − 0 . 157598 RH$$
For any particular observation (values for the predictor variables), we can use the linear model to estimate the output variable PE.
lm.predict(pandas.DataFrame({"AT": [9.48], "V": [44.71], "AP": [1019.12], "RH": [66.43]}))
The predicted output power for this combination of inputs is 478 MW (note that we report the result with the same number of significant figures as that in our inputs).
Residuals plots¶
We check the residuals of each predictor variable for any pattern that might indicate that a linear model is not appropriate for this dataset.
residuals = lm.predict(data) - data.PE
plt.plot(data.AT, residuals, ".", alpha=0.5)
plt.xlabel("AT")
plt.ylabel("Residual");
plt.plot(data.V, residuals, ".", alpha=0.5)
plt.xlabel("V")
plt.ylabel("Residual");
plt.plot(data.AP, residuals, ".", alpha=0.5)
plt.xlabel("AP")
plt.ylabel("Residual");
plt.plot(data.RH, residuals, ".", alpha=0.5)
plt.xlabel("RH")
plt.ylabel("Residual");
Indeed, except for a minor quadratic shape to the residuals of variable AT (which we will ignore here), the residuals look random, without any systematic feature apparent that might indicate that our linear model is not appropriate for this data.
We also check that the variance of the residuals is normally distributed by plotting a histogram or a QQ plot of the residuals, as shown below.
plt.hist(residuals, bins=40, alpha=0.5)
plt.title("Histogram of the residuals");
scipy.stats.probplot(residuals, dist=scipy.stats.norm, plot=plt.figure().add_subplot(111));
Goodness of fit¶
lm.summary()
The $R^2$ for this model is 0.929, which means that it explains roughly 93% of the variance of the output power. Interpretation of this number is dependent on the application (for instance, models used to understand health effects tend to have lower $R^2$ values than those used for physical models), but is quite satisfactory for this application.