Bootstrap analysis of helmet safety data¶

This notebook is an element of the risk-engineering.org courseware. It can be distributed under the terms of the Creative Commons Attribution-ShareAlike licence.

Author: Eric Marsden eric.marsden@risk-engineering.org.

This notebook contains an short case study using Python and SciPy to analyze some online data on the safety performance of different helmet technologies. It includes regression analysis to estimate the impact of technology on performance, and bootstrap resampling to obtain an estimate despite a small sample size. See the associated course materials for background information and to download this content as a Jupyter notebook.

In [1]:

import numpy
import scipy.stats
import pandas
import matplotlib.pyplot as plt
plt.style.use("bmh")
%config InlineBackend.figure_formats=["svg"]

Acknowledgement. This notebook is heavily inspired by a blog post by Daniel Phadley where the analysis is undertaken in R.

We are going to analyze test data produced by researchers at Virginia Tech (a public university in the US state of Virginia). The researchers run impact tests on a range of helmets and assess the likelihood of sustaining a concussion from different types of impacts. From this, they attribute a performance score. This type of testing, and of course publication of the results to help consumers make an informed choice, is very important to encourage manufacturers to improve the safety performance of their products; congratulations to the Virginia Tech researchers.

Their data is published on a web page, but isn’t available in a convenient format that we can use directly from Python, such as CSV or JSON. We’re going to have to “scrape” the data from the web page, which means download the web page content and use some simple parsing techniques to extract the data. This is a fairly common activity for data analysts, because quite often people who generate interesting data don’t think of, or sometimes don’t want to, make it available as a dataset.

Note: this notebook was updated in April 2022 because the content of the web page was changed, so the scraping code had to be updated. A frequent and annoying problem when web scraping...

Scraping data from web page¶

Techniques for scraping data from the web are fairly well documented online. They require a little technical knowledge of HTML parsing and a little Python text processing.

In [2]:

import requests
import re
import io

page = requests.get("https://helmet.beam.vt.edu/js/bicycleData.js")
js = page.content
if not js.startswith(b"var bicycleDataRaw ="):
    raise RuntimeError("HTML has changed, need to update the parser!")
js = js[21:-1]
# we use a regexp to convert the Javascript array containing the data
# into the JSON format that is easily parsed from Python. 
json_bytes = re.sub(b'\\s([a-z]+):\\s', b'"\\1":', js) + b"]"
helmets = pandas.read_json(io.BytesIO(json_bytes))
len(helmets)

Out[2]:

In [3]:

helmets.head()

Out[3]:

	brand	model	score	rating	cost	style	date	photo	certifications	low	high
0	Specialized	Tactic 4	8.554	5	$110	Mountain	2021	specialized-tactic-4.jpg	CPSC	2.019	6.535
1	Sweet Protection	Trailblazer MIPS	8.687	5	$180	Mountain	2021	sweet-protection-trailblazer-mips.jpg	CPSC	2.031	6.655
2	Specialized	Mode	8.803	5	$120	Urban	2021	specialized-mode.jpg	CPSC	2.423	6.379
3	Fox	Dropframe Pro (2020 version)	8.853	5	$200	Mountain	2020	fox-dropframe-pro-2020.jpg	CPSC	2.298	6.553
4	Giant	Rev Comp MIPS	9.125	5	$65	Road	2021	giant-rev-comp-mips.jpg	CPSC	2.202	6.922

The cost is expressed in an inconvenient format with a prefixed $ sign; we convert that to a numerical field. Also drop a column that we don't need in our analysis.

In [4]:

helmets["price"] = helmets["cost"].map(lambda s: float(s[1:]))
helmets.drop(columns=["photo"], inplace=True)

In [5]:

helmets.head()

Out[5]:

	brand	model	score	rating	cost	style	date	certifications	low	high	price
0	Specialized	Tactic 4	8.554	5	$110	Mountain	2021	CPSC	2.019	6.535	110.0
1	Sweet Protection	Trailblazer MIPS	8.687	5	$180	Mountain	2021	CPSC	2.031	6.655	180.0
2	Specialized	Mode	8.803	5	$120	Urban	2021	CPSC	2.423	6.379	120.0
3	Fox	Dropframe Pro (2020 version)	8.853	5	$200	Mountain	2020	CPSC	2.298	6.553	200.0
4	Giant	Rev Comp MIPS	9.125	5	$65	Road	2021	CPSC	2.202	6.922	65.0

Let's explore our dataset a little, looking at the distribution of helmet price and of performance score.

In [6]:

plt.hist(helmets.price, bins=20, alpha=0.5)
plt.title("Distribution of helmet price");

No description has been provided for this image

In [7]:

plt.hist(helmets.score, bins=20, alpha=0.5)
plt.title("Distribution of helmet score");

We can examine whether there is a correlation between helmet price and performance score.

In [8]:

helmets.plot("price", "score", "scatter", alpha=0.5)
plt.xlabel("Helmet price (USD)")
plt.ylabel("Helmet score (lower is better)");

Analyze impact of helmet technology¶

The helmets in this dataset use three design technologies:

MIPS or Multi-directional Impact Protection System, which reduces the harmful rotational motion generated on impact which would otherwise be transferred to the brain
WaveCel, a shock-absorbing material with a collapsible cellular structure. This technology is only available from a single manufacturer, so few helmets are available.
conventional impact-absorbing materials, with a certification provided by the US Consumer Product Safety Commission (all helmets sold in the USA must meet this certification)

There is some commercial and technical squabbling about the relative performance of these design approaches, so it’s interesting to see what the test data says.

The helmet technology is not represented explicitly in our dataset, but it’s encoded in the helmet name.

In [9]:

def technology_from_model(name: str):
    if "MIPS" in name: return "MIPS"
    if "WaveCel" in name: return "WaveCel"
    return "conventional"

We can add a column to the dataset with the design technology used.

In [10]:

helmets["technology"] = helmets["model"].apply(technology_from_model)
helmets.head(20)

Out[10]:

	brand	model	score	rating	cost	style	date	certifications	low	high	price	technology
0	Specialized	Tactic 4	8.554	5	$110	Mountain	2021	CPSC	2.019	6.535	110.0	conventional
1	Sweet Protection	Trailblazer MIPS	8.687	5	$180	Mountain	2021	CPSC	2.031	6.655	180.0	MIPS
2	Specialized	Mode	8.803	5	$120	Urban	2021	CPSC	2.423	6.379	120.0	conventional
3	Fox	Dropframe Pro (2020 version)	8.853	5	$200	Mountain	2020	CPSC	2.298	6.553	200.0	conventional
4	Giant	Rev Comp MIPS	9.125	5	$65	Road	2021	CPSC	2.202	6.922	65.0	MIPS
5	Lazer	G1 MIPS	9.226	5	$240	Road	2020	CPSC	2.396	6.830	240.0	MIPS
6	Bontrager	Rally MIPS	9.345	5	$150	Mountain	2019	CPSC	2.335	7.010	150.0	MIPS
7	Specialized	Align II	9.554	5	$50	Multi-sport	2020	CPSC	2.954	6.600	50.0	conventional
8	Lazer	Tonic MIPS	9.848	5	$80	Road	2021	CPSC	2.557	7.290	80.0	MIPS
9	Troy Lee Designs	A2 MIPS Decoy	9.987	5	$179	Mountain	2019	CPSC	2.523	7.464	179.0	MIPS
10	Lazer	Century MIPS	9.996	5	$180	Road	2019	CPSC	3.016	6.969	180.0	MIPS
11	SCOTT	Centric Plus MIPS	10.199	5	$200	Road	2020	CPSC	2.808	7.390	200.0	MIPS
12	Giant	Rev Pro MIPS	10.219	5	$273	Road	2021	CPSC	3.144	7.074	273.0	MIPS
13	Liv	Rev Pro MIPS	10.219	5	$273	Road	2021	CPSC	3.144	7.074	273.0	MIPS
14	Lazer	Sphere MIPS	10.325	5	$160	Road	2020	CPSC	2.788	7.536	160.0	MIPS
15	Lazer	Cyclone MIPS	10.502	5	$75	Multi-sport	2019	CPSC	3.265	7.236	75.0	MIPS
16	POC	Octal X SPIN	10.587	5	$250	Road	2020	CPSC	3.207	7.379	250.0	conventional
17	SCOTT	ARX PLUS MIPS 2020	10.747	5	$100	Multi-sport	2020	CPSC	2.794	7.950	100.0	MIPS
18	Bontrager	Specter WaveCel	10.790	5	$150	Road	2019	CPSC	3.538	7.261	150.0	WaveCel
19	Fox	Speedframe Pro MIPS	10.839	5	$160	Mountain	2020	CPSC	3.444	7.394	160.0	MIPS

Let’s examine a few plots of this dataset to get a feel for the impact of helmet technology on performance score, and see whether there’s a relationship between price and performance.

In [11]:

import seaborn as sns

sns.violinplot(x="technology", y="score", data=helmets);

A violin plot, shown above, is an enhanced version of the traditional boxplot, or “box-and-whisker” plot.

This plot shows that MIPS and WaveCel are clearly scoring better (remember that lower scores indicate better safety performance) than conventional helmets, but it isn’t very helpful in determining whether MIPS performs better than WaveCel (the median points for the two categories, shown by the white dot in the violin plot, are very similar). We'll undertake a more sophisticated analysis below to help resolve this question.

In [12]:

sns.relplot(x="price", y="score", hue="technology", data=helmets);

Note how the plot above encodes information on three dimensions of our dataset, the price and score but also the technology as the color of the points. This plot highlights the fact that we have few helmets in our dataset using the WaveCel technology.

We can also check whether the type of helmet (targeting road use, urban use or mountainbike use) is correlated with performance.

In [13]:

sns.violinplot(x="style", y="score", data=helmets);

Regression analysis¶

We will run a regression analysis to analyze the comparative impact of helmet technology (MIPS or WaveCel) on performance score. To do this, we need to add columns to the Pandas dataframe indicating whether each technology is present.

In [14]:

# generate extra boolean columns that indicate whether MIPS and WaveCell technologies
# are present
helmets["has_mips"] = helmets["technology"].apply(lambda t: t == "MIPS")
helmets["has_wavecel"] = helmets["technology"].apply(lambda t: t == "WaveCel")
helmets.head(20)

Out[14]:

	brand	model	score	rating	cost	style	date	certifications	low	high	price	technology	has_mips	has_wavecel
0	Specialized	Tactic 4	8.554	5	$110	Mountain	2021	CPSC	2.019	6.535	110.0	conventional	False	False
1	Sweet Protection	Trailblazer MIPS	8.687	5	$180	Mountain	2021	CPSC	2.031	6.655	180.0	MIPS	True	False
2	Specialized	Mode	8.803	5	$120	Urban	2021	CPSC	2.423	6.379	120.0	conventional	False	False
3	Fox	Dropframe Pro (2020 version)	8.853	5	$200	Mountain	2020	CPSC	2.298	6.553	200.0	conventional	False	False
4	Giant	Rev Comp MIPS	9.125	5	$65	Road	2021	CPSC	2.202	6.922	65.0	MIPS	True	False
5	Lazer	G1 MIPS	9.226	5	$240	Road	2020	CPSC	2.396	6.830	240.0	MIPS	True	False
6	Bontrager	Rally MIPS	9.345	5	$150	Mountain	2019	CPSC	2.335	7.010	150.0	MIPS	True	False
7	Specialized	Align II	9.554	5	$50	Multi-sport	2020	CPSC	2.954	6.600	50.0	conventional	False	False
8	Lazer	Tonic MIPS	9.848	5	$80	Road	2021	CPSC	2.557	7.290	80.0	MIPS	True	False
9	Troy Lee Designs	A2 MIPS Decoy	9.987	5	$179	Mountain	2019	CPSC	2.523	7.464	179.0	MIPS	True	False
10	Lazer	Century MIPS	9.996	5	$180	Road	2019	CPSC	3.016	6.969	180.0	MIPS	True	False
11	SCOTT	Centric Plus MIPS	10.199	5	$200	Road	2020	CPSC	2.808	7.390	200.0	MIPS	True	False
12	Giant	Rev Pro MIPS	10.219	5	$273	Road	2021	CPSC	3.144	7.074	273.0	MIPS	True	False
13	Liv	Rev Pro MIPS	10.219	5	$273	Road	2021	CPSC	3.144	7.074	273.0	MIPS	True	False
14	Lazer	Sphere MIPS	10.325	5	$160	Road	2020	CPSC	2.788	7.536	160.0	MIPS	True	False
15	Lazer	Cyclone MIPS	10.502	5	$75	Multi-sport	2019	CPSC	3.265	7.236	75.0	MIPS	True	False
16	POC	Octal X SPIN	10.587	5	$250	Road	2020	CPSC	3.207	7.379	250.0	conventional	False	False
17	SCOTT	ARX PLUS MIPS 2020	10.747	5	$100	Multi-sport	2020	CPSC	2.794	7.950	100.0	MIPS	True	False
18	Bontrager	Specter WaveCel	10.790	5	$150	Road	2019	CPSC	3.538	7.261	150.0	WaveCel	False	True
19	Fox	Speedframe Pro MIPS	10.839	5	$160	Mountain	2020	CPSC	3.444	7.394	160.0	MIPS	True	False

We can now run a linear regression on the dataset to determine which boolean indicator variable (has_mips or has_wavecel) has the largest impact (the largest absolute value). The two indicator variables have negative impact here, which means (remember that lower performance scores are better) that the presence of either technology, compared with the bulk of our sample that uses conventional shock-absorbing materials, improves performance.

In [15]:

import statsmodels.formula.api as smf

lm = smf.ols(formula="score ~ style + has_wavecel + has_mips", data=helmets).fit()
lm.params

Out[15]:

Intercept               13.033344
style[T.Mountain]        1.923756
style[T.Multi-sport]     4.432715
style[T.Road]            2.079305
style[T.Urban]           5.294214
has_wavecel[T.True]     -3.500482
has_mips[T.True]        -3.255376
dtype: float64

In [16]:

# lm.params is a pandas Series, which we can index by the parameter name
lm.params.loc["has_mips[T.True]"]

Out[16]:

-3.255375754439028

On the full sample, WaveCel performs slightly better than MIPS. Our sample size is fairly small, so we can undertake a bootstrap analysis to see what level of confidence we have in that observation.

Bootstrap regression¶

Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. Assuming the original dataset is representative of the larger population that it was drawn from, the simulated samples (resampled values) should have similar statistical characteristics to the full dataset. Therefore, if we estimate statistics on our resampled values, it should give as a reasonable estimation for population statistics.

We include background on the bootstrap method and a procedure to implement it in Python (and in Excel if you're into that kind of thing!) in our free course materials on statistical modelling and data analysis methods, and illustrate its use to estimate confidence intervals in this notebook.

Below, we are using the bootstrap method to run a large number (N) of linear regression analyses on the resampled data, to help determine the comparative impact of the two helmet technologies that we want to compare, MIPS and WaveCel.

In [17]:

# a bootstrap analysis comparing the effect of has_wavecel and has_mips
runs = 1000
effect_mips = numpy.empty(runs)
effect_wavecel = numpy.empty(runs)
for i in range(runs):
    sample = helmets.sample(len(helmets), replace=True)
    lm = smf.ols(formula="score ~ style + has_wavecel + has_mips", data=sample).fit()
    effect_mips[i] = lm.params.loc["has_mips[T.True]"]
    effect_wavecel[i] = lm.params.loc["has_wavecel[T.True]"]

In [18]:

plt.hist(effect_mips, bins=30, label="MIPS", color="blue", alpha=0.3)
plt.hist(effect_wavecel, bins=30, label="WaveCel", color="red", alpha=0.3)
plt.legend();

We can calculate the proportion of bootstrap samples where WaveCel dominates MIPS (ie its performance score is lower, remember that a lower score is better here).

In [19]:

(effect_wavecel < effect_mips).sum() / float(len(effect_wavecel))

Out[19]:

0.768

By Eric Marsden