This notebook is an element of the risk-engineering.org courseware. It can be distributed under the terms of the Creative Commons Attribution-ShareAlike licence.

Author: Eric Marsden eric.marsden@risk-engineering.org.


In this notebook, we illustrate NumPy features for working with correlated data. Check the associated lecture slides for background material.

Linear correlation

In [1]:
import numpy
import matplotlib.pyplot as plt
import scipy.stats
import seaborn as sns
%config InlineBackend.figure_formats=['svg']
In [2]:
X = numpy.random.normal(10, 1, 100)
X
Out[2]:
array([10.54634181,  8.27731031, 11.48441557, 10.40032297,  9.11513404,
       10.55836025, 10.49809662, 10.56591951,  9.95087915,  9.4555811 ,
        9.68135968,  9.62785161,  9.91575233,  9.62519388,  7.96951419,
        9.52102449,  9.51596514,  7.54188909, 10.57085713, 10.07737404,
       10.92852595, 10.62225885,  9.34647349, 10.09543129, 10.92334675,
        9.84047235,  9.62999803,  9.93365459,  9.20085571, 10.52259386,
        8.67601227,  8.94753369, 11.99948364, 10.07005381,  8.3984818 ,
        9.20744896, 12.0654234 , 10.77675771, 10.14602797,  9.97576352,
       11.63661034,  9.45015653,  9.75922628, 10.108028  ,  8.45893401,
       11.58097689, 10.91769024,  9.69646503, 12.13807618,  8.87333389,
        9.28958556,  9.05999126,  9.23319174, 10.87419542, 10.23828919,
       10.56169507,  9.02722755, 11.3183529 , 10.29524581, 10.02812171,
        9.94477828, 10.31919171, 10.52152547, 10.34576547,  9.67683946,
        9.91671971,  9.94313009, 11.81581693, 10.03627378, 10.73295603,
        9.02673737,  9.86333805, 10.98206334, 11.59051769,  8.78956037,
        9.83468298,  9.17751584,  9.106725  , 11.63178127,  9.1397129 ,
       10.60414261, 10.75093527,  9.2358853 ,  8.20442847, 10.37175358,
        9.370904  , 10.35796362,  9.25723684, 11.67368628, 10.84753015,
       10.56253818, 10.99778184,  8.25150301,  9.9699814 , 11.04678008,
        9.7546405 ,  8.886361  , 10.24278702,  8.44996347,  9.66457532])
In [3]:
Y = -X + numpy.random.normal(0, 1, 100)
In [4]:
plt.scatter(X, Y);

Looking at the scatterplot above, we can see that the random variables $X$ and $Y$ are correlated. There are various statistical measures that allow us to quantify the degree of linear correlation. The most commonly used is Pearson’s product-moment correlation coefficient. It is available in scipy.stats.

In [5]:
scipy.stats.pearsonr(X, Y)
Out[5]:
(-0.7090187366289592, 1.5287803187216544e-16)

The first return value is the linear correlation coefficient, a value between -1 and 1 which measures the strength of the linear correlation. A value greater than 0.9 indicates a strong positive linear correlation, and a value lower than -0.9 indicates strong negative linear correlation (when $X$ increases, $Y$ decreases).

(The second return value is a p-value, which is a measure of the confidence which can be placed in the estimation of the correlation coefficient (smaller = more confidence). It tells you the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. Here we have a very very low p-value, so high confidence in the estimated value of the correlation coefficient.)

Exercises

Exercise: show that when the error in $Y$ decreases, the correlation coefficient increases.

Exercise: produce data and a plot with a negative correlation coefficient.

Anscombe’s quartet

Let’s examine four datasets produced by the statistician Francis Anscombe to illustrate the importance of exploring your data qualitatively (for example by plotting the data), rather than relying only on summary statistics such as the linear correlation coefficient.

In [6]:
x =  numpy.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = numpy.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = numpy.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = numpy.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = numpy.array([8,8,8,8,8,8,8,19,8,8,8])
y4 = numpy.array([6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89])
In [7]:
plt.plot(x, y1, 'ks', marker='o', color='blue')
plt.title("Anscombe quartet n°1")
plt.margins(0.1)
In [8]:
scipy.stats.pearsonr(x, y1)
Out[8]:
(0.81642051634484, 0.002169628873078789)
In [9]:
plt.plot(x, y2, 'ks', marker='o', color='blue')
plt.title("Anscombe quartet n°2")
plt.margins(0.1)
In [10]:
scipy.stats.pearsonr(x, y2)
Out[10]:
(0.8162365060002427, 0.002178816236910803)
In [11]:
plt.plot(x, y3, 'ks', marker='o', color='blue')
plt.title(u"Anscombe quartet n°3")
plt.margins(0.1)
In [12]:
scipy.stats.pearsonr(x, y3)
Out[12]:
(0.8162867394895981, 0.0021763052792280304)
In [13]:
plt.plot(x4, y4, 'ks', marker='o', color='blue')
plt.title("Anscombe quartet n°4")
plt.margins(0.1)
In [14]:
scipy.stats.pearsonr(x4, y4)
Out[14]:
(0.816521436888503, 0.0021646023471972127)

Notice that the linear correlation coefficient of the four datasets is identical, though clearly the relationship between $X$ and $Y$ is very different in each case! This illustrates the risks of depending only on quantitative descriptors to understand your datasets: you should also use different types of plots to give you a better overview of the data.

The Datasaurus

The Datasaurus provides another illustration of the importance of plotting your data to make sure it doesn't contain any surprises, rather than relying only on summary statistics.

In [15]:
import pandas

ds = pandas.read_csv("https://risk-engineering.org/static/data/datasaurus.csv", header=None)
ds.describe()
Out[15]:
0 1
count 142.000000 142.000000
mean 54.263273 47.832253
std 16.765142 26.935403
min 22.307700 2.948700
25% 44.102600 25.288450
50% 53.333300 46.025600
75% 64.743600 68.525675
max 98.205100 99.487200
In [16]:
scipy.stats.pearsonr(ds[0], ds[1])
Out[16]:
(-0.06447185270095167, 0.4458965980247055)

These summary statistics don't look too nasty, but check out the data once it's plotted!

In [17]:
plt.scatter(ds[0], ds[1]);

This article titled Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing runs a lot further with this concept.

In [ ]: