Data Science from Scratch (ch5) - Statistics

Descriptive Statistics & Correlations


Overview

This post continues my coverage of Data Science from Scratch by Joel Grus, this time with chapter 5.

It should be noted upfront that everything covered in this post can be done more expediently and efficiently in libraries like NumPy as well as the statistics module in Python.

The primary value of this book, and by extension this post, in my opinion, is the emphasis on learning how Python primitives can be used to build tools from the ground up.

Specifically, we’ll examine how features of the Python language, along with functions we built in a previous post on linear algebra, can be used to create tools for describing data and the relationships within data (aka statistics).

I think this is pretty cool. Hopefully you agree.

Example Data

This chapter continues the narrative of you as a newly hired data scientist at DataSciencester, the social network for data scientists, and your job is to describe how many friends the members of this social network have. We have two lists of floats to work with. We’ll work with num_friends first, then daily_minutes later.

I wanted this post to be self-contained, and in order to do that we’ll have to read in a larger-than-average list of floats. The alternative would be to get the data directly from the book’s GitHub repo (statistics.py).

num_friends = [100.0,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]

daily_minutes = [1,68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,26.02,27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,22.61,26.89,23.48,8.38,27.81,32.35,23.84]

daily_hours = [dm / 60 for dm in daily_minutes]

Describing

The num_friends list is a list of numbers representing “number of friends” a person has, so for example, one person has 100 friends. The first thing we do to describe the data is to create a bar chart plotting the number of people who have 100 friends, 49 friends, 41 friends, and so on.

We’ll import Counter from collections and import matplotlib.pyplot.

We’ll use Counter to turn the num_friends list into a defaultdict(int)-like object mapping keys to counts. For more info, please refer to this previous post on Counter.

Once we have a Counter, a high-performance container datatype, we can use methods like most_common to find the keys with the highest counts. Here we see that the five most common numbers of friends are 6, 1, 4, 3 and 9, respectively.

from collections import Counter

import matplotlib.pyplot as plt

friend_counts = Counter(num_friends)

# the five most common values are: 6, 1, 4, 3 and 9 friends
# [(6, 22), (1, 22), (4, 20), (3, 20), (9, 18)]
friend_counts.most_common(5) 

To proceed with plotting, we’ll use a list comprehension that loops through every possible friend count from 0 to 100 (xs) and looks up the corresponding number of people in friend_counts (a Counter returns zero for missing keys). Those counts become the y-axis; the number of friends is the x-axis:

xs = range(101)                     # x-axis: largest num_friend value is 100
ys = [friend_counts[x] for x in xs] # y-axis
plt.bar(xs, ys)
plt.axis([0, 101, 0, 25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()

Here is the plot; you can see the lone person with 100 friends at the far right.

[Figure: Histogram of Friend Counts]

You can also read more about data visualization here.

Alternatively, we could generate simple statistics to describe the data using the built-in Python functions len, min, max and sorted.

num_points = len(num_friends) # number of data points in num_friends: 204
largest_value = max(num_friends) # largest value in num_friends: 100
smallest_value = min(num_friends) # smallest value in num_friends: 1

sorted_values = sorted(num_friends) # sort the values in ascending order
second_largest_value = sorted_values[-2] # second-largest value (second from the end): 49

Central Tendencies

The most common way of describing a set of data is to find its mean, which is the sum of all the values divided by the number of values. Note: we’ll continue to use type annotations; in my opinion, they help you be a more deliberate and mindful Python programmer.

from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)
    
assert 7.3333 < mean(num_friends) < 7.3334

However, the mean is notoriously sensitive to outliers, so statisticians often supplement it with other measures of central tendency like the median. Because the median is the middle-most value, it matters whether there is an even or odd number of data points.

Here, we’ll create two private functions to handle both situations, even and odd numbers of data points, when calculating the median. In both cases we sort the values first. For an even number of values, we find the two middle values and average them. For an odd number of values, we take the single middle element, at index len(xs) // 2 of the sorted list.

Our median function will call either _median_even or _median_odd, depending on whether the length of the list is evenly divisible by 2 (len(v) % 2 == 0).

def _median_even(xs: List[float]) -> float:
    """If len(xs) is even, it's the average of the middle two elements"""
    sorted_xs = sorted(xs)
    hi_midpoint = len(xs) // 2   # e.g. length 4 => hi_midpoint 2
    return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2
    
def _median_odd(xs: List[float]) -> float:
    """If len(xs) is odd, its the middle element"""
    return sorted(xs)[len(xs) // 2]
    
def median(v: List[float]) -> float:
    """Finds the 'middle-most' value of v"""
    return _median_even(v) if len(v) % 2 == 0 else _median_odd(v)
    
assert median([1,10,2,9,5]) == 5
assert median([1, 9, 2, 10]) == (2 + 9) / 2

Because the median is the middle-most value, it does not fully depend on every value in the data. For illustration, suppose we had another list, num_friends2, where one person has 10,000 friends: the mean would be far more sensitive to that change than the median.

num_friends2 = [10000.0,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14
    ,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9
    ,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7
    ,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5
    ,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3
    ,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1
    ,1,1,1,1,1,1,1,1,1,1,1,1]
    
mean(num_friends2)   # more sensitive to outliers: 7.333 => 55.86274509803921
median(num_friends2) # less sensitive to outliers: 6.0 => 6.0

You may also use quantiles to describe your data. Whenever you’ve heard “Xth percentile”, that is a quantile expressed relative to 100. In fact, the median is the 50th percentile (where 50% of the data lies below this point and 50% lies above).

The second argument, p, is a fraction from 0.0 to 1.0 representing a position from 0 to 100 percent. We multiply that fraction by the length of the list, truncate it with int to get an integer index, and use that index on a sorted copy of xs to find the quantile.

def quantile(xs: List[float], p: float) -> float:
    """Returns the pth-percentile value in x"""
    p_index = int(p * len(xs))  
    return sorted(xs)[p_index]
    
assert quantile(num_friends, 0.10) == 1
assert quantile(num_friends, 0.25) == 3
assert quantile(num_friends, 0.75) == 9
assert quantile(num_friends, 0.90) == 13
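
As a quick cross-check of the claim that the median is the 50th percentile, the two functions agree on this data (this is my own check, not the book’s; with this simple index-truncation approach, exact agreement isn’t guaranteed for every dataset):

# the 50th percentile matches the median for this dataset
assert quantile(num_friends, 0.50) == median(num_friends) == 6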

Finally, we have the mode, which looks at the most common values. First, we call Counter on our list parameter, and since Counter is a subclass of dict, we have access to methods like values() to get all the counts and items() to get key-value pairs.

We define max_count to find the highest count (22), then the function returns a list comprehension that loops through counts.items() to find the keys associated with max_count. Those keys are 1 and 6: twenty-two people had one friend and twenty-two people had six friends, so both values are modes.

def mode(x: List[float]) -> List[float]:
    """Returns a list, since there might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]
    

assert set(mode(num_friends)) == {1, 6}

Because we had already used Counter on num_friends previously (see friend_counts), we could have just called the most_common(2) method to get the same results:

mode(num_friends) # [6, 1]
friend_counts.most_common(2) # [(6, 22), (1, 22)]

Dispersion

Aside from our data’s central tendencies, we’ll also want to understand its spread, or dispersion. The tools for this are data_range, variance, standard deviation and interquartile range.

Range is a straightforward max value minus min value.
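
Following the same pattern as our other helpers, a minimal data_range is a one-liner:

def data_range(xs: List[float]) -> float:
    """The difference between the largest and smallest values"""
    return max(xs) - min(xs)

assert data_range(num_friends) == 99  # 100 - 1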

Variance measures how far a set of numbers spreads out from its average value. What’s more interesting, for our purposes, is that we need to borrow the functions we previously built in the linear algebra post to create the variance function.

If you look at its Wikipedia page, variance is the expected value of the squared deviation of a variable from its mean.

First, we’ll create the de_mean function, which takes a list of numbers and subtracts the mean from every number in the list (this gives us the deviations from the mean).

Then we’ll apply sum_of_squares to those deviations: multiply each deviation by itself (square it), add up the results, and divide by the length of the list minus one to get the variance.

Recall that the sum_of_squares is a special case of the dot product function.

# variance

from typing import List

Vector = List[float]

# see vectors.py in chapter 4 for dot and sum_of_squares

def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v,w))
    
def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return dot(v,v)
    
def de_mean(xs: List[float]) -> List[float]:
    """Translate xs by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]
    
def variance(xs: List[float]) -> float:
    """Almost the average squared deviation from the mean"""
    assert len(xs) >= 2, "variance requires at least two elements"
    n = len(xs)
    deviations = de_mean(xs)
    return sum_of_squares(deviations) / (n - 1)
    
assert 81.54 < variance(num_friends) < 81.55

The variance is built from squared deviations, so its units are “friends squared,” which can be tricky to interpret. For example, num_friends has values ranging from 1 to 100.

What does a variance of 81.54 mean?

A more common alternative is the standard deviation. Here we take the square root of the variance using Python’s math module.

With a standard deviation of 9.03, and knowing the mean of num_friends is 7.3, anything between 7 - 9 (effectively 0, since you can’t have negative friends) and 7 + 9 = 16 friends is still within one standard deviation of the mean. And we can check, by consulting friend_counts, that most people are within a standard deviation of the mean.

On the other hand, we know that someone with 20 friends is more than one standard deviation away from the mean.

import math

def standard_deviation(xs: List[float]) -> float:
    """The standard deviation is the square root of the variance"""
    return math.sqrt(variance(xs))
    
assert 9.02 < standard_deviation(num_friends) < 9.04
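
To make that claim concrete, here’s a quick sanity check (my own, not from the book) of the fraction of people within one standard deviation of the mean:

avg = mean(num_friends)
sd = standard_deviation(num_friends)

# count people whose friend count falls within [mean - sd, mean + sd]
within = sum(1 for x in num_friends if avg - sd <= x <= avg + sd)
within / len(num_friends)  # roughly 0.95 for this data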

However, the standard deviation builds on the variance, which in turn depends on the mean, so just like the mean it can be sensitive to outliers. An alternative is the interquartile range, which is based on quantiles and is much less sensitive to outliers.

Specifically, the interquartile range measures the spread of num_friends between the 25th and 75th percentiles. Here that is 9 - 3 = 6: the middle half of people have between 3 and 9 friends.

def interquartile_range(xs: List[float]) -> float: 
    """Returns the difference between the 75%-ile and the 25%-ile"""
    return quantile(xs, 0.75) - quantile(xs, 0.25)
    
assert interquartile_range(num_friends) == 6

Now that we’ve described a single list of data, we’ll also want to look at potential relationships between two data sources. For example, we may have a hypothesis that the amount of time spent on the DataSciencester social network is somehow related to the number of friends someone has.

We’ll examine covariance and correlations next.

Correlation

If variance is how much a single set of numbers deviates from its mean (i.e., see de_mean above), then covariance measures how two sets of numbers vary in tandem from their means, with the idea that if they co-vary, they could be related.

Here we’ll borrow the dot product function we developed in the linear algebra post.

Moreover, we’ll examine whether there’s a relationship between num_friends and daily_minutes, as well as daily_hours (see above).

def covariance(xs: List[float], ys: List[float]) -> float:
    """Measures how much xs and ys vary in tandem from their means"""
    assert len(xs) == len(ys), "xs and ys must have same number of elements"
    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)

assert 22.42 < covariance(num_friends, daily_minutes) < 22.43
assert 22.42 / 60 < covariance(num_friends, daily_hours) < 22.43 / 60

As with variance, a similar critique can be made of covariance: you have to do extra work to interpret it, because its units are the product of the inputs’ units (here, friend-minutes). For example, the covariance of num_friends and daily_minutes is about 22.42.

What does that mean? Is that considered a strong relationship?

A more intuitive measure would be a correlation:

def correlation(xs: List[float], ys: List[float]) -> float:
    """Measures how much xs and ys vary in tandem about their means"""
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs,ys) / stdev_x / stdev_y
    else:
        return 0 # if no variation, correlation is zero

assert 0.24 < correlation(num_friends, daily_minutes) < 0.25
assert 0.24 < correlation(num_friends, daily_hours) < 0.25

By dividing out the standard deviations of both input variables, correlation is always between -1 (perfect anti-correlation) and 1 (perfect correlation). A correlation of 0.24 is relatively weak (although what counts as weak, moderate, or strong depends on the context of the data).
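
As a quick check of those bounds, using toy numbers of my own with the correlation function above:

toy = [1, 2, 3, 4, 5]
correlation(toy, [2 * t for t in toy])   # ≈ 1.0: perfectly correlated
correlation(toy, [-2 * t for t in toy])  # ≈ -1.0: perfectly anti-correlated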

One thing to keep in mind is Simpson’s paradox, where the relationship between two variables changes, or even reverses, when accounting for a third, confounding variable (the toy example below shows this). Moreover, we should keep this cliché in mind (it’s a cliché for a reason): correlation does not imply causation.
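
Here’s that toy illustration of Simpson’s paradox, with entirely made-up numbers: within each group the trend is perfectly negative, yet pooling the groups makes the correlation come out positive.

# two hypothetical groups (made-up data)
group_a_x, group_a_y = [1, 2, 3], [10, 9, 8]
group_b_x, group_b_y = [6, 7, 8], [20, 19, 18]

correlation(group_a_x, group_a_y)                          # -1.0 within group A
correlation(group_b_x, group_b_y)                          # -1.0 within group B
correlation(group_a_x + group_b_x, group_a_y + group_b_y)  # ≈ 0.89 when pooled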

Summary

We are just five chapters in, and we can already see how we’re building the tools now that we’ll use later on. Here’s a visual summary of what we’ve covered in this post and how it connects to previous posts, namely linear algebra and the Python crash course.

[Figure: visual summary of chapter 5 and its connections to previous posts]

For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter.
