
Basic statistics

This notebook is an element of the free risk-engineering.org courseware. It can be distributed under the terms of the Creative Commons Attribution-ShareAlike licence.

Author: Eric Marsden eric.marsden@risk-engineering.org.


This notebook contains an introduction to the use of Python and the NumPy library for basic statistical calculations. See the associated course materials for background information and to download this content as a Jupyter notebook.

We start by importing the numpy library. Its functions and variables can then be used by prefixing their names with numpy. (for example, numpy.sqrt).

In [1]:
import numpy

We can use Python as a simple interactive calculator:

In [2]:
2 + 3 + 4
Out[2]:
9

Here we call the sqrt function from the numpy library.

In [3]:
numpy.sqrt(2 + 2)
Out[3]:
2.0

Some useful constants are predefined.

In [4]:
numpy.pi
Out[4]:
3.141592653589793
In [5]:
numpy.sin(numpy.pi)
Out[5]:
1.2246467991473532e-16

The notation e-16 above means “multiplied by $10^{-16}$”, so the number above is extremely small: it is a floating-point approximation to the exact mathematical answer, which is zero.

We can generate a random number from a uniform distribution between 20 and 30. If you evaluate this cell several times (in most Jupyter interfaces, press Shift-Enter or click the Run button in the toolbar above), it will generate a different random number each time.
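Because results like this are only approximately zero, comparing them with an exact value using == can be misleading. The following is a minimal sketch of a tolerance-based comparison using numpy.isclose; the tolerance 1e-12 is an illustrative assumption, not a recommended value.

```python
import numpy

# sin(pi) is mathematically zero, but floating-point arithmetic leaves a tiny residue
value = numpy.sin(numpy.pi)

# An exact comparison with zero fails because of the rounding error
exact_match = (value == 0.0)

# numpy.isclose compares within a tolerance; atol is the absolute tolerance
# (1e-12 is an arbitrary illustrative choice)
close_match = numpy.isclose(value, 0.0, atol=1e-12)

print(value, exact_match, close_match)
```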

In [6]:
numpy.random.uniform(20, 30)
Out[6]:
21.631435248293304
In [7]:
numpy.random.uniform(20, 30)
Out[7]:
21.404055478623928
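If you need the same "random" numbers on every run (for example, to reproduce a result), the generator can be seeded. This is a small sketch using numpy.random.seed; the seed value 42 is an arbitrary assumption chosen for illustration.

```python
import numpy

# Seeding the global generator makes the sequence of draws repeatable
numpy.random.seed(42)  # 42 is an arbitrary illustrative seed
first = numpy.random.uniform(20, 30)

numpy.random.seed(42)  # reset to the same seed
second = numpy.random.uniform(20, 30)

print(first == second)     # the two draws are identical
print(20 <= first < 30)    # the draw lies in the requested interval
```

Recent NumPy versions also offer numpy.random.default_rng(seed), which returns a generator object with its own uniform method, instead of relying on global state.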

We can generate an array of random numbers by passing a third argument to the numpy.random.uniform function, saying how many random numbers we want. We store the array in a variable named obs.

In [8]:
obs = numpy.random.uniform(20, 30, 10)
obs
Out[8]:
array([28.45152568, 24.57815844, 20.72261282, 29.20824763, 26.30761728,
       28.66203762, 25.75511386, 21.9665234 , 24.27604195, 24.17604464])

The built-in Python function len returns the length of an array or a list.

In [9]:
len(obs)
Out[9]:
10

We can do arithmetic on arrays, adding them together or subtracting a constant from each element.

In [10]:
obs + obs
Out[10]:
array([56.90305136, 49.15631687, 41.44522565, 58.41649525, 52.61523455,
       57.32407525, 51.51022773, 43.93304679, 48.55208389, 48.35208928])
In [11]:
obs - 25
Out[11]:
array([ 3.45152568, -0.42184156, -4.27738718,  4.20824763,  1.30761728,
        3.66203762,  0.75511386, -3.0334766 , -0.72395805, -0.82395536])

We can apply a numpy function to all the elements of an array.

In [12]:
numpy.sqrt(obs)
Out[12]:
array([5.33399716, 4.95763638, 4.55220966, 5.40446553, 5.12909517,
       5.35369383, 5.07494964, 4.68684578, 4.92707235, 4.91691414])
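Since obs contains random values, the results above differ on every run. As a sketch, the same element-wise operations can be checked by hand on a small fixed array (the array a below is a made-up example, not part of the original notebook).

```python
import numpy

a = numpy.array([1.0, 4.0, 9.0])

doubled = a + a          # element-wise addition of two arrays
shifted = a - 1          # the constant is subtracted from every element
roots = numpy.sqrt(a)    # the function is applied to every element

print(doubled)
print(shifted)
print(roots)
```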

Arrays have methods, which are functions attached to the array object that operate on its contents.

In [13]:
obs.mean()
Out[13]:
25.410392331264497
In [14]:
obs.sum()
Out[14]:
254.10392331264498
In [15]:
obs.min()
Out[15]:
20.722612822631138

There are similar functions in the numpy library that take an array as argument:

In [16]:
numpy.mean(obs)
Out[16]:
25.410392331264497
In [17]:
numpy.sum(obs)
Out[17]:
254.10392331264498
In [18]:
numpy.min(obs)
Out[18]:
20.722612822631138
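Other common descriptive statistics are available in the same way. The sketch below uses a small fixed data set (an assumption made for illustration, so that the results can be verified by hand) and shows numpy.median and numpy.std alongside the mean.

```python
import numpy

# Hypothetical data set, chosen so that the mean is 5 and the population
# standard deviation is 2
data = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()
median = numpy.median(data)

# By default numpy.std divides by n (population standard deviation);
# ddof=1 divides by n - 1 instead (sample standard deviation)
std_population = numpy.std(data)
std_sample = numpy.std(data, ddof=1)

print(mean, median, std_population, std_sample)
```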

Simple plotting

The matplotlib library allows you to generate many types of plots and statistical graphs in a convenient way. The online gallery shows the variety of plots available, and the documentation is also available online. We import the pyplot component of matplotlib and give it an alias plt.

In [19]:
import matplotlib.pyplot as plt
plt.style.use("bmh")  # this affects the style (colors etc.) of plots
%config InlineBackend.figure_formats=["svg"]
In [20]:
X = numpy.random.uniform(20, 30, 10)
Y = numpy.random.uniform(50, 100, 10)
plt.scatter(X, Y);
[Figure: scatter plot of Y against X]
In [21]:
x = numpy.linspace(-2, 10, 100)
plt.plot(x, numpy.sin(x));
[Figure: plot of sin(x) for x between -2 and 10]
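The numpy.linspace function used for the sine plot generates evenly spaced values between two endpoints, both included. A short sketch on a smaller example makes the behaviour easier to see (the variable name points is introduced here for illustration only).

```python
import numpy

# linspace(start, stop, num) returns num evenly spaced values,
# including both endpoints
points = numpy.linspace(0, 1, 5)
print(points)

# The call used for the sine plot produces 100 points spanning -2 to 10
x = numpy.linspace(-2, 10, 100)
print(len(x), x[0], x[-1])
```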

We can add two vectors together element by element, provided that they have the same shape. Our array $x$ has one dimension of size 100, and so does numpy.sin(x). We can add another random vector of size 100 to it, containing numbers drawn from a uniform probability distribution between -0.1 and 0.1 (these represent some random “noise” which is added to our sine curve).

In [22]:
x = numpy.linspace(0, 10, 100)
obs = numpy.sin(x) + numpy.random.uniform(-0.1, 0.1, 100)
plt.plot(x, obs);
[Figure: noisy sine curve, obs plotted against x]
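To see why the sizes of the two vectors matter, the following sketch tries the same addition with a noise vector of the wrong length. The names noise_ok and noise_bad are hypothetical, introduced only for this illustration.

```python
import numpy

x = numpy.linspace(0, 10, 100)
noise_ok = numpy.random.uniform(-0.1, 0.1, 100)   # same size as x
noise_bad = numpy.random.uniform(-0.1, 0.1, 50)   # different size

# Matching sizes: the addition is done element by element
total = numpy.sin(x) + noise_ok
print(total.shape)

# Mismatched sizes: numpy refuses to add the arrays
try:
    numpy.sin(x) + noise_bad
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```

A single number is a special case: it is combined with every element of the array, as in the earlier obs - 25 example.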

The central limit theorem states that the sum of a large number of independent, identically distributed random variables with finite variance tends toward a normal distribution, even if the original variables themselves are not normally distributed. We illustrate this result by simulating 10000 sums, each of 100 realizations of a random variable uniformly distributed between 30 and 40, and plotting the distribution of these sums as a histogram.

In [23]:
N = 10000
sim = numpy.zeros(N)
for i in range(N):
    sim[i] = numpy.random.uniform(30, 40, 100).sum()
plt.hist(sim, bins=40, alpha=0.5, density=True);
[Figure: histogram of sim]
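As a rough check on the simulation, the sample mean and standard deviation of the sums can be compared with their theoretical values. For a sum of $n$ independent Uniform($a$, $b$) variables, the mean is $n(a+b)/2$ and the variance is $n(b-a)^2/12$. The sketch below is one possible way to do this, not the notebook's own method: it draws all the values at once instead of looping, and uses a seeded generator (seed 0 is an arbitrary assumption) so that the run is repeatable. The names n_sums, n_terms, low and high are introduced here for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)  # seeded generator; the seed is arbitrary

n_sums = 10000    # number of simulated sums (N in the cell above)
n_terms = 100     # number of uniform draws added up in each sum
low, high = 30, 40

# Draw all values at once: one row per sum, then add up each row
sim = rng.uniform(low, high, size=(n_sums, n_terms)).sum(axis=1)

# Theoretical values for a sum of n_terms independent Uniform(low, high) variables
theory_mean = n_terms * (low + high) / 2
theory_std = numpy.sqrt(n_terms * (high - low) ** 2 / 12)

print(sim.mean(), theory_mean)
print(sim.std(), theory_std)
```

The sample values should be close to, but not exactly equal to, the theoretical ones; the gap shrinks as n_sums grows.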