Author: Eric Marsden email@example.com.
This notebook contains an short case study using Python and SciPy to analyze some online data on the safety performance of different helmet technologies. It includes regression analysis to estimate the impact of technology on performance, and bootstrap resampling to obtain an estimate despite a small sample size. See the associated course materials for background information and to download this content as a Jupyter notebook.
import numpy import scipy.stats import pandas import matplotlib.pyplot as plt plt.style.use("bmh") %config InlineBackend.figure_formats=["svg"]
Acknowledgement. This notebook is heavily inspired by a blog post by Daniel Phadley where the analysis is undertaken in R.
We are going to analyze test data produced by researchers at Virginia Tech (a public university in the US state of Virginia). The researchers run impact tests on a range of helmets and assess the likelihood of sustaining a concussion from different types of impacts. From this, they attribute a performance score. This type of testing, and of course publication of the results to help consumers make an informed choice, is very important to encourage manufacturers to improve the safety performance of their products; congratulations to the Virginia Tech researchers.
Their data is published on a web page, but isn’t available in a convenient format that we can use directly from Python, such as CSV or JSON. We’re going to have to “scrape” the data from the web page, which means download the web page content and use some simple parsing techniques to extract the data. This is a fairly common activity for data analysts, because quite often people who generate interesting data don’t think of, or sometimes don’t want to, make it available as a dataset.
Note: this notebook was updated in April 2022 because the content of the web page was changed, so the scraping code had to be updated. A frequent and annoying problem when web scraping...
Techniques for scraping data from the web are fairly well documented online. They require a little technical knowledge of HTML parsing and a little Python text processing.
--------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) Cell In, line 1 ----> 1 import requests 2 import re 3 import io ModuleNotFoundError: No module named 'requests'
The cost is expressed in an inconvenient format with a prefixed
$ sign; we convert that to a numerical field. Also drop a column that we don't need in our analysis.
helmets["price"] = helmets["cost"].map(lambda s: float(s[1:])) helmets.drop(columns=["photo"], inplace=True)
Let's explore our dataset a little, looking at the distribution of helmet price and of performance score.
plt.hist(helmets.price, bins=20, alpha=0.5) plt.title("Distribution of helmet price");
plt.hist(helmets.score, bins=20, alpha=0.5) plt.title("Distribution of helmet score");
We can examine whether there is a correlation between helmet price and performance score.
helmets.plot("price", "score", "scatter", alpha=0.5) plt.xlabel("Helmet price (USD)") plt.ylabel("Helmet score (lower is better)");
The helmets in this dataset use three design technologies:
MIPS or Multi-directional Impact Protection System, which reduces the harmful rotational motion generated on impact which would otherwise be transferred to the brain
WaveCel, a shock-absorbing material with a collapsible cellular structure. This technology is only available from a single manufacturer, so few helmets are available.
conventional impact-absorbing materials, with a certification provided by the US Consumer Product Safety Commission (all helmets sold in the USA must meet this certification)
There is some commercial and technical squabbling about the relative performance of these design approaches, so it’s interesting to see what the test data says.
The helmet technology is not represented explicitly in our dataset, but it’s encoded in the helmet name.
def technology_from_model(name: str): if "MIPS" in name: return "MIPS" if "WaveCel" in name: return "WaveCel" return "conventional"
We can add a column to the dataset with the design technology used.
helmets["technology"] = helmets["model"].apply(technology_from_model) helmets.head(20)
Let’s examine a few plots of this dataset to get a feel for the impact of helmet technology on performance score, and see whether there’s a relationship between price and performance.
import seaborn as sns sns.violinplot(x="technology", y="score", data=helmets);
A violin plot, shown above, is an enhanced version of the traditional boxplot, or “box-and-whisker” plot.
This plot shows that MIPS and WaveCel are clearly scoring better (remember that lower scores indicate better safety performance) than conventional helmets, but it isn’t very helpful in determining whether MIPS performs better than WaveCel (the median points for the two categories, shown by the white dot in the violin plot, are very similar). We'll undertake a more sophisticated analysis below to help resolve this question.
sns.relplot(x="price", y="score", hue="technology", data=helmets);
Note how the plot above encodes information on three dimensions of our dataset, the price and score but also the technology as the color of the points. This plot highlights the fact that we have few helmets in our dataset using the WaveCel technology.
We can also check whether the type of helmet (targeting road use, urban use or mountainbike use) is correlated with performance.
sns.violinplot(x="style", y="score", data=helmets);
We will run a regression analysis to analyze the comparative impact of helmet technology (MIPS or WaveCel) on performance score. To do this, we need to add columns to the Pandas dataframe indicating whether each technology is present.
# generate extra boolean columns that indicate whether MIPS and WaveCell technologies # are present helmets["has_mips"] = helmets["technology"].apply(lambda t: t == "MIPS") helmets["has_wavecel"] = helmets["technology"].apply(lambda t: t == "WaveCel") helmets.head(20)
We can now run a linear regression on the dataset to determine which boolean indicator variable (
has_wavecel) has the largest impact (the largest absolute value). The two indicator variables have negative impact here, which means (remember that lower performance scores are better) that the presence of either technology, compared with the bulk of our sample that uses conventional shock-absorbing materials, improves performance.
import statsmodels.formula.api as smf lm = smf.ols(formula="score ~ style + has_wavecel + has_mips", data=helmets).fit() lm.params
# lm.params is a pandas Series, which we can index by the parameter name lm.params.loc["has_mips[T.True]"]
On the full sample, WaveCel performs slightly better than MIPS. Our sample size is fairly small, so we can undertake a bootstrap analysis to see what level of confidence we have in that observation.
Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. Assuming the original dataset is representative of the larger population that it was drawn from, the simulated samples (resampled values) should have similar statistical characteristics to the full dataset. Therefore, if we estimate statistics on our resampled values, it should give as a reasonable estimation for population statistics.
We include background on the bootstrap method and a procedure to implement it in Python (and in Excel if you're into that kind of thing!) in our free course materials on statistical modelling and data analysis methods, and illustrate its use to estimate confidence intervals in this notebook.
Below, we are using the bootstrap method to run a large number (N) of linear regression analyses on the resampled data, to help determine the comparative impact of the two helmet technologies that we want to compare, MIPS and WaveCel.
# a bootstrap analysis comparing the effect of has_wavecel and has_mips runs = 1000 effect_mips = numpy.empty(runs) effect_wavecel = numpy.empty(runs) for i in range(runs): sample = helmets.sample(len(helmets), replace=True) lm = smf.ols(formula="score ~ style + has_wavecel + has_mips", data=sample).fit() effect_mips[i] = lm.params.loc["has_mips[T.True]"] effect_wavecel[i] = lm.params.loc["has_wavecel[T.True]"]
plt.hist(effect_mips, bins=30, label="MIPS", color="blue", alpha=0.3) plt.hist(effect_wavecel, bins=30, label="WaveCel", color="red", alpha=0.3) plt.legend();
We can calculate the proportion of bootstrap samples where WaveCel dominates MIPS (ie its performance score is lower, remember that a lower score is better here).
(effect_wavecel < effect_mips).sum() / float(len(effect_wavecel))