This notebook is an element of the risk-engineering.org courseware. It can be distributed under the terms of the Creative Commons Attribution-ShareAlike licence.
Author: Eric Marsden eric.marsden@risk-engineering.org
In this notebook, we illustrate NumPy features for building simple univariate linear regression models. Consult the accompanying course material for background on linear regression analysis, some notes on when the technique is useful, and to download this content as a Jupyter/Python notebook.
Linear regression analysis of smoking data¶
import numpy
import pandas
import scipy.stats
import matplotlib.pyplot as plt
plt.style.use("bmh")
%config InlineBackend.figure_formats=["svg"]
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
The British Doctors’ Study followed the health of a large number of physicians in the UK over the period 1951–2001. It provided conclusive evidence of linkage between smoking and lung cancer, myocardial infarction, respiratory disease and other smoking-related illnesses.
The study provided data on annual mortality for a variety of diseases at four levels of cigarette smoking:
- never smoked
- 1-14 per day
- 15-24 per day
- more than 25 per day
The data is as follows:
df = pandas.DataFrame({"cigarettes" : [0,10,20,30], "CVD" : [572,802,892,1025], "lung" : [14, 105, 208, 355]})
df.head()
Note: CVD is an acronym for cardio-vascular disease.
Let’s plot the data to see whether a linear model is a plausible representation for the health impact of smoking.
plt.plot(df.cigarettes, df.CVD, "o")
plt.margins(0.1)
plt.title("Deaths for different smoking intensities", weight="bold")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("CVD deaths");
It is quite tempting from this graph to conclude that cardiovascular disease deaths increase linearly with cigarette consumption. We can fit a linear model to this data, using the statsmodels
library (an alternative possibility is to use the scikit-learn
library, which has more functionality related to machine learning). We will use the formula interface to ordinary least squares regression, available in statsmodels.formula.api
.
For simple linear regression (with a single predictor variable), formulas are written outcome ~ observation
.
import statsmodels.formula.api as smf
lm = smf.ols("CVD ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# params[0] is the intercept (beta0)
# params[1] is the slope (beta1)
Y = lm.params[0] + lm.params[1] * X
plt.plot(df.cigarettes, df.CVD, "o")
plt.plot(X, Y, color="darkgreen")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("Deaths from cardiovascular disease")
plt.margins(0.1)
In the regression model, $\beta_0$ is the intercept of the regression line and $\beta_1$ is its slope.
We can make a similar plot for lung cancer deaths.
lm = smf.ols("lung ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# params[0] is the intercept (beta0)
# params[1] is the slope (beta1)
Y = lm.params[0] + lm.params[1] * X
plt.plot(df.cigarettes, df.lung, "o")
plt.plot(X, Y, color="darkgreen")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("Lung cancer deaths")
plt.margins(0.1)
We can use the linear model for prediction, to estimate the likelihood of death from lung cancer or from cardiovascular disease for a level of smoking for which we have no direct data. For example, the expected lung cancer mortality risk for a group of people who smoke 15 cigarettes per day is
# use the fitted model for prediction
lm.predict({"cigarettes": [15]}) / 100000.0
Note that we have divided by 100000, because the data given is the number of deaths per 100000 people.
We can examine some information on the “goodness of fit”, or the quality of the linear model:
lm.summary()
In particular, the $R^2$ value of 0.987 is very high, indicating a good level of fit to our dataset. However, given the small size of our dataset (only 4 observations), the 95% confidence interval for our model parameters $\beta_0$ and $\beta_1$ is quite large. In fact, our 4 data points are based on a large number of observations, so our confidence in this data is high, but unfortunately we don't have access to the original data (where each observation corresponds to one person).
The Seaborn package provides convenient functions for making plots of linear regression models. In particular, the regplot
function generates a regression plot that includes 95% confidence intervals for the model parameters.
import seaborn as sns
sns.regplot(df, x="cigarettes", y="CVD");