{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"This notebook is an element of the [risk-engineering.org courseware](https://risk-engineering.org/). It can be distributed under the terms of the [Creative Commons Attribution-ShareAlike licence](https://creativecommons.org/licenses/by-sa/4.0/).\n",
"\n",
"Author: Eric Marsden \n",
"\n",
"---\n",
"\n",
"In this notebook, we illustrate NumPy features for building simple univariate linear regression models. Consult the [accompanying course material](https://risk-engineering.org/linear-regression-analysis/) for background on linear regression analysis, some notes on when the technique is useful, and to download this content as a Jupyter/Python notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Linear regression analysis of smoking data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy\n",
"import pandas\n",
"import scipy.stats\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use(\"bmh\")\n",
"%config InlineBackend.figure_formats=[\"svg\"]\n",
"\n",
"import warnings\n",
"warnings.simplefilter(action=\"ignore\", category=FutureWarning)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The *British Doctors’ Study* followed the health of a large number of physicians in the UK over the period 1951–2001. It provided conclusive evidence of linkage between smoking and lung cancer, myocardial infarction, respiratory disease and other smoking-related illnesses. \n",
"\n",
"The study provided data on annual mortality for a variety of diseases at four levels of cigarette smoking:\n",
"\n",
"- never smoked\n",
"- 1-14 per day\n",
"- 15-24 per day\n",
"- more than 25 per day\n",
"\n",
"The data is as follows:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
cigarettes
\n",
"
CVD
\n",
"
lung
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
0
\n",
"
572
\n",
"
14
\n",
"
\n",
"
\n",
"
1
\n",
"
10
\n",
"
802
\n",
"
105
\n",
"
\n",
"
\n",
"
2
\n",
"
20
\n",
"
892
\n",
"
208
\n",
"
\n",
"
\n",
"
3
\n",
"
30
\n",
"
1025
\n",
"
355
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cigarettes CVD lung\n",
"0 0 572 14\n",
"1 10 802 105\n",
"2 20 892 208\n",
"3 30 1025 355"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pandas.DataFrame({\"cigarettes\" : [0,10,20,30], \"CVD\" : [572,802,892,1025], \"lung\" : [14, 105, 208, 355]})\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: CVD is an acronym for cardio-vascular disease. \n",
"\n",
"Let’s plot the data to see whether a linear model is a plausible representation for the health impact of smoking."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.plot(df.cigarettes, df.CVD, \"o\")\n",
"plt.margins(0.1)\n",
"plt.title(\"Deaths for different smoking intensities\", weight=\"bold\")\n",
"plt.xlabel(\"Cigarettes smoked per day\")\n",
"plt.ylabel(\"CVD deaths\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is quite tempting from this graph to conclude that cardiovascular disease deaths increase linearly with cigarette consumption. We can fit a linear model to this data, using the `statsmodels` library (an alternative possibility is to use the `scikit-learn` library, which has more functionality related to machine learning). We will use the formula interface to ordinary least squares regression, available in `statsmodels.formula.api`. \n",
"\n",
"For simple linear regression (with a single predictor variable), formulas are written `outcome ~ observation`. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import statsmodels.formula.api as smf\n",
"\n",
"lm = smf.ols(\"CVD ~ cigarettes\", data=df).fit()\n",
"xmin = df.cigarettes.min()\n",
"xmax = df.cigarettes.max()\n",
"X = numpy.linspace(xmin, xmax, 100)\n",
"# params[0] is the intercept (beta0)\n",
"# params[1] is the slope (beta1)\n",
"Y = lm.params[0] + lm.params[1] * X\n",
"plt.plot(df.cigarettes, df.CVD, \"o\")\n",
"plt.plot(X, Y, color=\"darkgreen\")\n",
"plt.xlabel(\"Cigarettes smoked per day\")\n",
"plt.ylabel(\"Deaths from cardiovascular disease\")\n",
"plt.margins(0.1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the regression model, $\\beta_0$ is the **intercept** of the regression line and $\\beta_1$ is its **slope**. \n",
"\n",
"We can make a similar plot for lung cancer deaths. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"lm = smf.ols(\"lung ~ cigarettes\", data=df).fit()\n",
"xmin = df.cigarettes.min()\n",
"xmax = df.cigarettes.max()\n",
"X = numpy.linspace(xmin, xmax, 100)\n",
"# params[0] is the intercept (beta0)\n",
"# params[1] is the slope (beta1)\n",
"Y = lm.params[0] + lm.params[1] * X\n",
"plt.plot(df.cigarettes, df.lung, \"o\")\n",
"plt.plot(X, Y, color=\"darkgreen\")\n",
"plt.xlabel(\"Cigarettes smoked per day\")\n",
"plt.ylabel(\"Lung cancer deaths\")\n",
"plt.margins(0.1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the linear model for **prediction**, to estimate the likelihood of death from lung cancer or from cardiovascular disease for a level of smoking for which we have no direct data. For example, the expected lung cancer mortality risk for a group of people who smoke 15 cigarettes per day is"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.001705\n",
"dtype: float64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# use the fitted model for prediction\n",
"lm.predict({\"cigarettes\": [15]}) / 100000.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we have divided by 100000, because the data given is the number of deaths per 100000 people. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can examine some information on the “goodness of fit”, or the quality of the linear model:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/emarsden/Dropbox/RiskEngineering/courseware/notebooks/RE-venv/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 4 samples were given.\n",
" warn(\"omni_normtest is not valid with less than 8 observations; %i \"\n"
]
},
{
"data": {
"text/html": [
"
\n",
"
OLS Regression Results
\n",
"
\n",
"
Dep. Variable:
lung
R-squared:
0.987
\n",
"
\n",
"
\n",
"
Model:
OLS
Adj. R-squared:
0.980
\n",
"
\n",
"
\n",
"
Method:
Least Squares
F-statistic:
151.8
\n",
"
\n",
"
\n",
"
Date:
Thu, 06 Jul 2023
Prob (F-statistic):
0.00652
\n",
"
\n",
"
\n",
"
Time:
15:24:55
Log-Likelihood:
-16.359
\n",
"
\n",
"
\n",
"
No. Observations:
4
AIC:
36.72
\n",
"
\n",
"
\n",
"
Df Residuals:
2
BIC:
35.49
\n",
"
\n",
"
\n",
"
Df Model:
1
\n",
"
\n",
"
\n",
"
Covariance Type:
nonrobust
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
coef
std err
t
P>|t|
[0.025
0.975]
\n",
"
\n",
"
\n",
"
Intercept
1.6000
17.097
0.094
0.934
-71.964
75.164
\n",
"
\n",
"
\n",
"
cigarettes
11.2600
0.914
12.321
0.007
7.328
15.192
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Omnibus:
nan
Durbin-Watson:
2.086
\n",
"
\n",
"
\n",
"
Prob(Omnibus):
nan
Jarque-Bera (JB):
0.534
\n",
"
\n",
"
\n",
"
Skew:
-0.143
Prob(JB):
0.766
\n",
"
\n",
"
\n",
"
Kurtosis:
1.233
Cond. No.
31.4
\n",
"
\n",
"
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/latex": [
"\\begin{center}\n",
"\\begin{tabular}{lclc}\n",
"\\toprule\n",
"\\textbf{Dep. Variable:} & lung & \\textbf{ R-squared: } & 0.987 \\\\\n",
"\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.980 \\\\\n",
"\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 151.8 \\\\\n",
"\\textbf{Date:} & Thu, 06 Jul 2023 & \\textbf{ Prob (F-statistic):} & 0.00652 \\\\\n",
"\\textbf{Time:} & 15:24:55 & \\textbf{ Log-Likelihood: } & -16.359 \\\\\n",
"\\textbf{No. Observations:} & 4 & \\textbf{ AIC: } & 36.72 \\\\\n",
"\\textbf{Df Residuals:} & 2 & \\textbf{ BIC: } & 35.49 \\\\\n",
"\\textbf{Df Model:} & 1 & \\textbf{ } & \\\\\n",
"\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lcccccc}\n",
" & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n",
"\\midrule\n",
"\\textbf{Intercept} & 1.6000 & 17.097 & 0.094 & 0.934 & -71.964 & 75.164 \\\\\n",
"\\textbf{cigarettes} & 11.2600 & 0.914 & 12.321 & 0.007 & 7.328 & 15.192 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lclc}\n",
"\\textbf{Omnibus:} & nan & \\textbf{ Durbin-Watson: } & 2.086 \\\\\n",
"\\textbf{Prob(Omnibus):} & nan & \\textbf{ Jarque-Bera (JB): } & 0.534 \\\\\n",
"\\textbf{Skew:} & -0.143 & \\textbf{ Prob(JB): } & 0.766 \\\\\n",
"\\textbf{Kurtosis:} & 1.233 & \\textbf{ Cond. No. } & 31.4 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"%\\caption{OLS Regression Results}\n",
"\\end{center}\n",
"\n",
"Notes: \\newline\n",
" [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: lung R-squared: 0.987\n",
"Model: OLS Adj. R-squared: 0.980\n",
"Method: Least Squares F-statistic: 151.8\n",
"Date: Thu, 06 Jul 2023 Prob (F-statistic): 0.00652\n",
"Time: 15:24:55 Log-Likelihood: -16.359\n",
"No. Observations: 4 AIC: 36.72\n",
"Df Residuals: 2 BIC: 35.49\n",
"Df Model: 1 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept 1.6000 17.097 0.094 0.934 -71.964 75.164\n",
"cigarettes 11.2600 0.914 12.321 0.007 7.328 15.192\n",
"==============================================================================\n",
"Omnibus: nan Durbin-Watson: 2.086\n",
"Prob(Omnibus): nan Jarque-Bera (JB): 0.534\n",
"Skew: -0.143 Prob(JB): 0.766\n",
"Kurtosis: 1.233 Cond. No. 31.4\n",
"==============================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lm.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In particular, the $R^2$ value of 0.987 is very high, indicating a good level of fit to our dataset. However, given the small size of our dataset (only 4 observations), the 95% confidence interval for our model parameters $\\beta_0$ and $\\beta_1$ is quite large. In fact, our 4 data points are based on a large number of observations, so our confidence in this data is high, but unfortunately we don't have access to the original data (where each observation corresponds to one person)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [Seaborn package](https://seaborn.pydata.org/) provides convenient functions for making plots of linear regression models. In particular, the `regplot` function generates a regression plot that includes 95% confidence intervals for the model parameters."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import seaborn as sns\n",
"\n",
"sns.regplot(df, x=\"cigarettes\", y=\"CVD\");"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}