{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\"\"\n", "\n", "This notebook is an element of the [risk-engineering.org courseware](https://risk-engineering.org/). It can be distributed under the terms of the [Creative Commons Attribution-ShareAlike licence](https://creativecommons.org/licenses/by-sa/4.0/).\n", "\n", "Author: Eric Marsden \n", "\n", "---\n", "\n", "In this notebook, we illustrate NumPy features for building simple univariate linear regression models. Consult the [accompanying course material](https://risk-engineering.org/linear-regression-analysis/) for background on linear regression analysis, some notes on when the technique is useful, and to download this content as a Jupyter/Python notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear regression analysis of smoking data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy\n", "import pandas\n", "import scipy.stats\n", "import matplotlib.pyplot as plt\n", "plt.style.use(\"bmh\")\n", "%config InlineBackend.figure_formats=[\"svg\"]\n", "\n", "import warnings\n", "warnings.simplefilter(action=\"ignore\", category=FutureWarning)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *British Doctors’ Study* followed the health of a large number of physicians in the UK over the period 1951–2001. It provided conclusive evidence of linkage between smoking and lung cancer, myocardial infarction, respiratory disease and other smoking-related illnesses. \n", "\n", "The study provided data on annual mortality for a variety of diseases at four levels of cigarette smoking:\n", "\n", "- never smoked\n", "- 1-14 per day\n", "- 15-24 per day\n", "- more than 25 per day\n", "\n", "The data is as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cigarettesCVDlung
0057214
110802105
220892208
3301025355
\n", "
" ], "text/plain": [ " cigarettes CVD lung\n", "0 0 572 14\n", "1 10 802 105\n", "2 20 892 208\n", "3 30 1025 355" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pandas.DataFrame({\"cigarettes\" : [0,10,20,30], \"CVD\" : [572,802,892,1025], \"lung\" : [14, 105, 208, 355]})\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: CVD is an acronym for cardio-vascular disease. \n", "\n", "Let’s plot the data to see whether a linear model is a plausible representation for the health impact of smoking." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " 2024-10-30T11:02:37.726461\n", " image/svg+xml\n", " \n", " \n", " Matplotlib v3.8.3, https://matplotlib.org/\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(df.cigarettes, df.CVD, \"o\")\n", "plt.margins(0.1)\n", "plt.title(\"Deaths for different smoking intensities\", weight=\"bold\")\n", "plt.xlabel(\"Cigarettes smoked per day\")\n", "plt.ylabel(\"CVD deaths\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is quite tempting from this graph to conclude that cardiovascular disease deaths increase linearly with cigarette consumption. We can fit a linear model to this data, using the `statsmodels` library (an alternative possibility is to use the `scikit-learn` library, which has more functionality related to machine learning). We will use the formula interface to ordinary least squares regression, available in `statsmodels.formula.api`. \n", "\n", "For simple linear regression (with a single predictor variable), formulas are written `outcome ~ observation`. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " 2024-10-30T11:02:37.960385\n", " image/svg+xml\n", " \n", " \n", " Matplotlib v3.8.3, https://matplotlib.org/\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import statsmodels.formula.api as smf\n", "\n", "lm = smf.ols(\"CVD ~ cigarettes\", data=df).fit()\n", "xmin = df.cigarettes.min()\n", "xmax = df.cigarettes.max()\n", "X = numpy.linspace(xmin, xmax, 100)\n", "# params[0] is the intercept (beta0)\n", "# params[1] is the slope (beta1)\n", "Y = lm.params[0] + lm.params[1] * X\n", "plt.plot(df.cigarettes, df.CVD, \"o\")\n", "plt.plot(X, Y, color=\"darkgreen\")\n", "plt.xlabel(\"Cigarettes smoked per day\")\n", "plt.ylabel(\"Deaths from cardiovascular disease\")\n", "plt.margins(0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the regression model, $\\beta_0$ is the **intercept** of the regression line and $\\beta_1$ is its **slope**. \n", "\n", "We can make a similar plot for lung cancer deaths. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " 2024-10-30T11:02:38.087929\n", " image/svg+xml\n", " \n", " \n", " Matplotlib v3.8.3, https://matplotlib.org/\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "lm = smf.ols(\"lung ~ cigarettes\", data=df).fit()\n", "xmin = df.cigarettes.min()\n", "xmax = df.cigarettes.max()\n", "X = numpy.linspace(xmin, xmax, 100)\n", "# params[0] is the intercept (beta0)\n", "# params[1] is the slope (beta1)\n", "Y = lm.params[0] + lm.params[1] * X\n", "plt.plot(df.cigarettes, df.lung, \"o\")\n", "plt.plot(X, Y, color=\"darkgreen\")\n", "plt.xlabel(\"Cigarettes smoked per day\")\n", "plt.ylabel(\"Lung cancer deaths\")\n", "plt.margins(0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the linear model for **prediction**, to estimate the likelihood of death from lung cancer or from cardiovascular disease for a level of smoking for which we have no direct data. For example, the expected lung cancer mortality risk for a group of people who smoke 15 cigarettes per day is" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.001705\n", "dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use the fitted model for prediction\n", "lm.predict({\"cigarettes\": [15]}) / 100000.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have divided by 100000, because the data given is the number of deaths per 100000 people. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can examine some information on the “goodness of fit”, or the quality of the linear model:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/lib/python3/dist-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 4 samples were given.\n", " warn(\"omni_normtest is not valid with less than 8 observations; %i \"\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: lung R-squared: 0.987
Model: OLS Adj. R-squared: 0.980
Method: Least Squares F-statistic: 151.8
Date: Wed, 30 Oct 2024 Prob (F-statistic): 0.00652
Time: 11:02:38 Log-Likelihood: -16.359
No. Observations: 4 AIC: 36.72
Df Residuals: 2 BIC: 35.49
Df Model: 1
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 1.6000 17.097 0.094 0.934 -71.964 75.164
cigarettes 11.2600 0.914 12.321 0.007 7.328 15.192
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: nan Durbin-Watson: 2.086
Prob(Omnibus): nan Jarque-Bera (JB): 0.534
Skew: -0.143 Prob(JB): 0.766
Kurtosis: 1.233 Cond. No. 31.4


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & lung & \\textbf{ R-squared: } & 0.987 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.980 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 151.8 \\\\\n", "\\textbf{Date:} & Wed, 30 Oct 2024 & \\textbf{ Prob (F-statistic):} & 0.00652 \\\\\n", "\\textbf{Time:} & 11:02:38 & \\textbf{ Log-Likelihood: } & -16.359 \\\\\n", "\\textbf{No. Observations:} & 4 & \\textbf{ AIC: } & 36.72 \\\\\n", "\\textbf{Df Residuals:} & 2 & \\textbf{ BIC: } & 35.49 \\\\\n", "\\textbf{Df Model:} & 1 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 1.6000 & 17.097 & 0.094 & 0.934 & -71.964 & 75.164 \\\\\n", "\\textbf{cigarettes} & 11.2600 & 0.914 & 12.321 & 0.007 & 7.328 & 15.192 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & nan & \\textbf{ Durbin-Watson: } & 2.086 \\\\\n", "\\textbf{Prob(Omnibus):} & nan & \\textbf{ Jarque-Bera (JB): } & 0.534 \\\\\n", "\\textbf{Skew:} & -0.143 & \\textbf{ Prob(JB): } & 0.766 \\\\\n", "\\textbf{Kurtosis:} & 1.233 & \\textbf{ Cond. No. } & 31.4 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: lung R-squared: 0.987\n", "Model: OLS Adj. R-squared: 0.980\n", "Method: Least Squares F-statistic: 151.8\n", "Date: Wed, 30 Oct 2024 Prob (F-statistic): 0.00652\n", "Time: 11:02:38 Log-Likelihood: -16.359\n", "No. Observations: 4 AIC: 36.72\n", "Df Residuals: 2 BIC: 35.49\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 1.6000 17.097 0.094 0.934 -71.964 75.164\n", "cigarettes 11.2600 0.914 12.321 0.007 7.328 15.192\n", "==============================================================================\n", "Omnibus: nan Durbin-Watson: 2.086\n", "Prob(Omnibus): nan Jarque-Bera (JB): 0.534\n", "Skew: -0.143 Prob(JB): 0.766\n", "Kurtosis: 1.233 Cond. No. 31.4\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In particular, the $R^2$ value of 0.987 is very high, indicating a good level of fit to our dataset. However, given the small size of our dataset (only 4 observations), the 95% confidence interval for our model parameters $\\beta_0$ and $\\beta_1$ is quite large. In fact, our 4 data points are based on a large number of observations, so our confidence in this data is high, but unfortunately we don't have access to the original data (where each observation corresponds to one person)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [Seaborn package](https://seaborn.pydata.org/) provides convenient functions for making plots of linear regression models. In particular, the `regplot` function generates a regression plot that includes 95% confidence intervals for the model parameters." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " 2024-10-30T11:02:38.349180\n", " image/svg+xml\n", " \n", " \n", " Matplotlib v3.8.3, https://matplotlib.org/\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "\n", "sns.regplot(df, x=\"cigarettes\", y=\"CVD\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 1 }