Data Science from Scratch (ch5) - Statistics
Descriptive Statistics & Correlations
Table of contents
Overview
This post is chapter 5 in continuation of my coverage of Data Science from Scratch by Joel Grus.
It should be noted upfront that everything covered in this post can be done more expediently and efficiently in libraries like NumPy as well as the statistics module in Python.
The primary value of this book, and by extension this post, in my opinion, is the emphasis on learning how Python primitives can be used to build tools from the ground up.
Specifically, we’ll examine how specific features of the Python language as well as functions we built in a previous post on linear algebra can be used to build tools used to describe data and relationships within data (aka statistics).
I think this is pretty cool. Hopefully you agree.
Example Data
This chapter continues the narrative of you as a newly hired data scientist at DataScienster, the social network for data scientists, and your job is to describe how many friends members in this social network has. We have two lists
of float
to work with. We’ll work with num_friends
first, then daily_minutes
later.
I wanted this post to be self-contained, and in order to do that we’ll have to read in a larger than average list
of floats
. The alternative would be to get the data directly from the book’s
github repo (statistics.py)
num_friends = [100.0,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
daily_minutes = [1,68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,26.02,27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,22.61,26.89,23.48,8.38,27.81,32.35,23.84]
daily_hours = [dm / 60 for dm in daily_minutes]
Describing
The num_friends
list is a list of numbers representing “number of friends” a person has, so for example, one person has 100 friends. The first thing we do to describe the data is to create a bar chart plotting the number of people who have 100 friends, 49 friends, 41 friends, and so on.
We’ll import Counter
from collections
and import matplotlib.pyplot
.
We’ll use Counter
to turn num_friends
list into a defaultdict(int)
-like object mapping keys to counts. For more info, please refer to this
previous post on the Counters.
Once we use the Counter
collection, a
high-performance container datatype, we can use methods like most_common
to find the keys with the most common values. Here we see that the five most common number of friends are 6, 1, 4, 3 and 9, respectively.
from collections import Counter
import matplotlib.pyplot as plt
friend_counts = Counter(num_friends)
# the five most common values are: 6, 1, 4, 3 and 9 friends
# [(6, 22), (1, 22), (4, 20), (3, 20), (9, 18)]
friend_counts.most_common(5)
To proceed with plotting, we’ll use friend_counts
to create a list comprehension
that will loop through friends_count
and for all keys from 0-101 (xs) and print a corresponding value (if it exists). This becomes the y-axis to num_friends
, which is the x-axis:
xs = range(101) # x-axis: largest num_friend value is 100
ys = [friend_counts[x] for x in xs] # y-axis
plt.bar(xs, ys)
plt.axis([0, 101, 0, 25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()
Here is the plot below. You can see one person with 100 friends.
You can also read more about data visualization here.
Alternatively, we could generate simple statistics to describe the data using built-in Python methods: len
, min
, max
and sorted
.
num_points = len(num_friends) # number of data points in num_friends: 204
largest_value = max(num_friends) # largest value in num_friends: 100
smallest_value = min(num_friends) # smallest value in num_friends: 1
sorted_values = sorted(num_friends) # sort the values in ascending order
second_largest_value = sorted_values[-2] # second largest value from the back: 49
Central Tendencies
The most common way of describing a set of data is to find it’s mean, which is the sum of all the values, divided by the number of values. note : we’ll continue to use type annotations. In my opinion, it helps you be a more deliberate and mindful Python programmer.
from typing import List
def mean(xs: List[float]) -> float:
return sum(xs) / len(xs)
assert 7.3333 < mean(num_friends) < 7.3334
However, the mean is notoriously sensitive to outliers so statisticians often supplement with other measures of central tendencies like median. Because the median is the middle-most value, it matters whether there is an even or odd number of data points.
Here, we’ll create two private functions for both situations - even and odd number of data points - in calculating the median. First, we’ll sort the data values. Then, for even number values, we’ll find the two middle values and split them. For odd number of values, we’ll divide the length of the dataset by 2 (i.e., 50).
Our median function will return either of the private function _median_even
or _median_odd
conditionally depending on if the length of a list of numbers is divisible (%2==0) by 2.
def _median_even(xs: List[float]) -> float:
"""If len(xs) is even, it's the average of the middle two elements"""
sorted_xs = sorted(xs)
hi_midpoint = len(xs) // 2 # e.g. length 4 => hi_midpoint 2
return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2
def _median_odd(xs: List[float]) -> float:
"""If len(xs) is odd, its the middle element"""
return sorted(xs)[len(xs) // 2]
def median(v: List[float]) -> float:
"""Finds the 'middle-most' value of v"""
return _median_even(v) if len(v) % 2 == 0 else _median_odd(v)
assert median([1,10,2,9,5]) == 5
assert median([1, 9, 2, 10]) == (2 + 9) / 2
Because the median is the middle-most value, it does not fully depend on every value in the data. For illustration, hypothetically if we have a another list num_friends2
where one person had 10,000 friends, the mean would be much more sensitive to that change than the median would be.
num_friends2 = [10000.0,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14
,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9
,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7
,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5
,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3
,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1
,1,1,1,1,1,1,1,1,1,1,1,1]
mean(num_friends2) # more sensitive to outliers: 7.333 => 55.86274509803921
median(num_friends2) # less sensitive to outliers: 6.0 => 6.0
You may also used quantiles
to describe your data. Whenever you’ve heard “X percentile”, that is a description of quantiles relative to 100. In fact, the median is the 50th percentile (where 50% of the data lies below this point and 50% lies above).
Because quantile
is a position from 0-100, the second argument is a float from 0.0 to 1.0. We’ll use that float to multiply with the length of the list. Then we’ll wrap in int
to create an integer index which we’ll use on a sorted xs to find the quantile.
def quantile(xs: List[float], p: float) -> float:
"""Returns the pth-percentile value in x"""
p_index = int(p * len(xs))
return sorted(xs)[p_index]
assert quantile(num_friends, 0.10) == 1
assert quantile(num_friends, 0.25) == 3
assert quantile(num_friends, 0.75) == 9
assert quantile(num_friends, 0.90) == 13
Finally, we have the mode, which looks at the most common values. First, we use the Counter
method on our list parameter and since Counter is a subclass of dict
we have access to methods like values()
to find all the values and items()
to find key value pairs.
We define max_count
to find the max value (22), then the function returns a list comprehension which loops through counts.items()
to find the key associated with the max_count (22). That is 1 and 6, meaning twenty-two people (the mode) had one or six friends.
def mode(x: List[float]) -> List[float]:
"""Returns a list, since there might be more than one mode"""
counts = Counter(x)
max_count = max(counts.values())
return [x_i for x_i, count in counts.items() if count == max_count]
assert set(mode(num_friends)) == {1, 6}
Because we had already used Counter on num_friends
previously (see friend_counts
), we could have just called the most_common(2)
method to get the same results:
mode(num_friends) # [6, 1]
friend_counts.most_common(2) # [(6, 22), (1, 22)]
Dispersion
Aside from our data’s central tendencies, we’ll also want to understand it’s spread or dispersion. The tools to do this are data_range
, variance
, standard deviation
and interquartile range
.
Range is a straightforward max value minus min value.
Variance measures how far a set of numbers is from their average value. What’s more interesting, for our purpose, is how we need to borrow the functions we had previously built in the linear algebra post to create the variance function.
If you look at its wikipedia page, variance is the squared deviation of a variable from its mean.
First, we’ll need to create the de_mean
function that takes a list of numbers and subtract from all numbers in the list, the mean value (this gives us the deviation from the mean).
Then, we’ll sum_of_squares
all those deviations, which means we’ll take all the values, multiply them with itself (square it), then add the values (and divide by length of the list minus one) to get the variance.
Recall that the sum_of_squares
is a special case of the dot
product function.
# variance
from typing import List
Vector = List[float]
# see vectors.py in chapter 4 for dot and sum_of_squares
def dot(v: Vector, w: Vector) -> float:
"""Computes v_1 * w_1 + ... + v_n * w_n"""
assert len(v) == len(w), "vectors must be the same length"
return sum(v_i * w_i for v_i, w_i in zip(v,w))
def sum_of_squares(v: Vector) -> float:
"""Returns v_1 * v_1 + ... + v_n * v_n"""
return dot(v,v)
def de_mean(xs: List[float]) -> List[float]:
"""Translate xs by subtracting its mean (so the result has mean 0)"""
x_bar = mean(xs)
return [x - x_bar for x in xs]
def variance(xs: List[float]) -> float:
"""Almost the average squared deviation from the mean"""
assert len(xs) >= 2, "variance requires at least two elements"
n = len(xs)
deviations = de_mean(xs)
return sum_of_squares(deviations) / (n - 1)
assert 81.54 < variance(num_friends) < 81.55
The variance is sum_of_squares
deviations, which can be tricky to interpret. For example, we have a num_friends
with values ranging from 0 to 100.
What does a variance of 81.54 mean?
A more common alternative is the standard deviation. Here we take the square root of the variance using Python’s math
module.
With a standard deviation of 9.03, and we know the mean of num_friends
is 7.3, anything below 7 + 9 = 16 or 7 - 9 (0 friends) friends is still within a standard deviation of the mean. And we can check by running friend_counts
that most people are within a standard deviation of the mean.
On the other hand, we know that someone with 20 friends is more than one standard deviation away from the mean.
import math
def standard_deviation(xs: List[float]) -> float:
"""The standard deviation is the square root of the variance"""
return math.sqrt(variance(xs))
assert 9.02 < standard_deviation(num_friends) < 9.04
However, because the standard deviation builds on the variance, which is dependent on the mean, we know that just like the mean, it can be sensitive to outliers, we can use an alternative called the interquartile range, which is based on the median and less sensitive to outliers.
Specifically, the interquartile range can be used to examine num_friends
between the 25th and 75th percentile. A large chunk of people are going to have around 6 friends.
def interquartile_range(xs: List[float]) -> float:
"""Returns the difference between the 75%-ile and the 25%-ile"""
return quantile(xs, 0.75) - quantile(xs, 0.25)
assert interquartile_range(num_friends) == 6
Now that we describe a single list of data, we’ll also want to look at potential relationship between two data sources. For example, we may have a hypothesis that the amount of time spent on the DataScienster social network is somehow related to the number of friends someone has.
We’ll examine covariance and correlations next.
Correlation
If variance is how much a single set of numbers deviates from its mean (i.e., see de_mean
above), then covariance measures how two sets of numbers vary from their means. With the idea that if they co-vary the same amount, then they could be related.
Here we’ll borrow the dot
production function we developed in the
linear algebra post.
Moreover, we’ll examine if there’s a relationship between num_friends
and daily_minutes
and daily_hours
(see above).
def covariance(xs: List[float], ys: List[float]) -> float:
assert len(xs) == len(ys), "xs and ys must have same number of elements"
return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)
assert 22.42 < covariance(num_friends, daily_minutes) < 22.43
assert 22.42 / 60 < covariance(num_friends, daily_hours) < 22.43 / 60
As with variance, a similar critique can be made of covariance, you have to do extra steps to interpret it. For example, the covariance of num_friends
and daily_minutes
is 22.43.
What does that mean? Is that considered a strong relationship?
A more intuitive measure would be a correlation:
def correlation(xs: List[float], ys: List[float]) -> float:
"""Measures how much xs and ys vary in tandem about their means"""
stdev_x = standard_deviation(xs)
stdev_y = standard_deviation(ys)
if stdev_x > 0 and stdev_y > 0:
return covariance(xs,ys) / stdev_x / stdev_y
else:
return 0 # if no variation, correlation is zero
assert 0.24 < correlation(num_friends, daily_minutes) < 0.25
assert 0.24 < correlation(num_friends, daily_hours) < 0.25
By dividing out the standard deviation of both input variables, correlation is always between -1 (perfect (anti) correlation) and 1 (perfect correlation). A correlation of 0.24 is relatively weak correlation (although what is considered weak, moderate, strong depends on the context of the data).
One thing to keep in mind is simpson’s paradox or when the relationship between two variables change when accounting for a third, confounding variable. Moreover, we should keep this cliché in mind (it’s a cliché for a reason): correlation does not imply causation.
Summary
We are just five chapters in and we can begin to see how we’re building the tools now, that we’ll use later on. Here’s a visual summary of what we’ve covered in this post and how it connects to previous posts, namely linear algebra and the python crash course.
For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.