This notebook is an element of the risk-engineering.org courseware. It can be distributed under the terms of the Creative Commons Attribution-ShareAlike licence.
Author: Eric Marsden eric.marsden@risk-engineering.org.
In this notebook, we illustrate NumPy features for working with correlated data. Check the associated lecture slides for background material and to download this content as a Jupyter/Python notebook.
Linear correlation¶
import numpy
import matplotlib.pyplot as plt
import scipy.stats
plt.style.use("bmh")
%config InlineBackend.figure_formats=["svg"]
X = numpy.random.normal(10, 1, 100)
X
Y = -X + numpy.random.normal(0, 1, 100)
plt.scatter(X, Y);
Looking at the scatterplot above, we can see that the random variables $X$ and $Y$ are correlated. There are various statistical measures that allow us to quantify the degree of linear correlation. The most commonly used is Pearson’s product-moment correlation coefficient. It is available in scipy.stats
.
scipy.stats.pearsonr(X, Y)
The first return value is the linear correlation coefficient, a value between -1 and 1 which measures the strength of the linear correlation. A value greater than 0.9 indicates a strong positive linear correlation, and a value lower than -0.9 indicates strong negative linear correlation (when $X$ increases, $Y$ decreases).
(The second return value is a p-value, which is a measure of the confidence which can be placed in the estimation of the correlation coefficient (smaller = more confidence). It tells you the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. Here we have a very very low p-value, so high confidence in the estimated value of the correlation coefficient.)
Exercises¶
Exercise: show that when the error in $Y$ decreases, the correlation coefficient increases.
Exercise: produce data and a plot with a negative correlation coefficient.
Anscombe’s quartet¶
Let’s examine four datasets produced by the statistician Francis Anscombe to illustrate the importance of exploring your data qualitatively (for example by plotting the data), rather than relying only on summary statistics such as the linear correlation coefficient.
x = numpy.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = numpy.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = numpy.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = numpy.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = numpy.array([8,8,8,8,8,8,8,19,8,8,8])
y4 = numpy.array([6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89])
plt.scatter(x, y1)
plt.title("Anscombe quartet n°1")
plt.margins(0.1)
scipy.stats.pearsonr(x, y1)
plt.scatter(x, y2)
plt.title("Anscombe quartet n°2")
plt.margins(0.1)
scipy.stats.pearsonr(x, y2)
plt.scatter(x, y3)
plt.title("Anscombe quartet n°3")
plt.margins(0.1)
scipy.stats.pearsonr(x, y3)
plt.scatter(x4, y4)
plt.title("Anscombe quartet n°4")
plt.margins(0.1)
scipy.stats.pearsonr(x4, y4)
Notice that the linear correlation coefficient (Pearson's $r$) of the four datasets is identical, though clearly the relationship between $X$ and $Y$ is very different in each case! This illustrates the risks of depending only on quantitative descriptors to understand your datasets: you should also use different types of plots to give you a better overview of the data.
The Datasaurus¶
The Datasaurus provides another illustration of the importance of plotting your data to make sure it doesn't contain any surprises, rather than relying only on summary statistics.
import pandas
ds = pandas.read_csv("https://risk-engineering.org/static/data/datasaurus.csv", header=None)
ds.describe()
scipy.stats.pearsonr(ds[0], ds[1])
These summary statistics don't look too nasty, but check out the data once it's plotted!
plt.scatter(ds[0], ds[1]);
This article titled Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing runs a lot further with this concept.