Data Science from Scratch (ch1)

Frequently Used Python Operations

Table of Content:

DataScienster_pt1

Collections and Comprehensions

Data Science from Scratch opens with a narrative motivating example where you, dear reader, are newly hired to lead data science at DataSciencester, a social network exclusively for data scientists.

Joel Grus, the author, explains:

Throughout the book, we’ll be learning about data science concepts by solving problems that you encounter at work. Sometimes we’ll look at data explicitly supplied by users, sometimes we’ll look at data generated through their interactions with the site, and sometimes we’ll even look at data from experiments that we’ll design…we’ll be building our tools from scratch.

This chapter is meant as a teaser for the rest of the book, but I wanted to revisit this chapter with our python crash course fresh on our minds to highlight some frequently used concepts we can expect to see for the rest of the book.

You are just hired as “VP of Networking” and are tasked with finding out which data scientist is the most well connected in the DataSciencster network, you’re giving a data dump 👇. It’s a list of users, each with a unique id.

users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
    {"id": 3, "name": "Chi"},
    {"id": 4, "name": "Thor"},
    {"id": 5, "name": "Clive"},
    {"id": 6, "name": "Hicks"},
    {"id": 7, "name": "Devin"},
    {"id": 8, "name": "Kate"},
    {"id": 9, "name": "Klein"}
]

Of note here is that the users variable is a list of dict (dictionaries).

Moving along, we also receive “friendship” data. Of note here that this is a list of tuples:

friendship_pairs = [(0,1), (0,2), (1,2), (1,3), (2,3), (3,4),
                    (4,5), (5,6), (5,7), (6,8), (7,8), (8,9)]

I had initially (and erroneously) thought of list, dict and tuple as data types (like int64, float64, string).

They’re rather collections, and somewhat unique to Python and more importantly, informs the way Pythonistas approach and solve problems.

You may feel that having “friendship” data in a list of tuple is not the easiest way to work with data (nor may it be the best way to represent data, but we’ll suspend those thoughts for now). Our first task is to convert this list of tuple into a form that’s more workable; the author proposes we turn it into a dict where the keys are user_ids and the values are list of friends.

The argument is that its faster to look things up in a dict rather than a list of tuple (where we’d have to iterate over every tuple). Here’s how we’d do that:

# Initialize the dict with an empty list for each user id
friendships = { user["id"]: [] for user in users }

# Loop over friendship pairs 
# This operation grabs the first, then second integer in each tuple
# It then appends each integer to the newly initialized friendships dict
for i, j in friendship_pairs:
    friendships[i].append(j)
    friendships[j].append(i)

We’re initializing a dict (called friendships), then looping over friendship_pairs to populate friendships. This is the outcome:

friendships

{
 0: [1, 2],
 1: [0, 2, 3],
 2: [0, 1, 3],
 3: [1, 2, 4],
 4: [3, 5],
 5: [4, 6, 7],
 6: [5, 8],
 7: [5, 8],
 8: [6, 7, 9],
 9: [8]
}

Each key in friendships is matched with a value that is initially an empty list, which then gets populated as we loop over friendship_pairs and systematically append the user_id that is paired together.

To understand how the looping happends and, specifically how each pair of user_ids are connected to each other, I created my own mini-toy example. Let’s say we’re just going to focus on looping through friendship_pairs for the user Hero whose id is 0.

# we'll set hero to an empty list
hero = []

# for every friendship_pair, if the first integer is 0, which is Hero's id,
# then append the second integer
for x, y in friendship_pairs:
    if x == 0:
        hero.append(y)
        
# outcome: we can confirm that Hero is connected to  Dunn and Sue
hero # [1,2]

The above gave me better intuition for how this works:

for i, j in friendship_pairs:
    friendships[i].append(j)  # Add j as a friend of user i
    friendships[j].append(i)  # Add i as a friend of user j

Here are some other questions we may be interested in:

What is the total number of connections?

Look at how the problem is solved. What’s notable to me is how we first define a function number_of_friends(user) that returns the number of friends for a particular user.

Then, total_connections is calculated using a comprehension (tuple?):

def number_of_friends(user):
    """How many friends does _user_ have?"""
    user_id = user["id"]
    friend_ids = friendships[user_id]
    return len(friend_ids)

total_connections = sum(number_of_friends(user) for user in users)

To be clear, the (tuple) comprehension is a pattern where a function is applied over a for-loop, in one line:

# (2, 3, 3, 3, 2, 3, 2, 2, 3, 1)
tuple((number_of_friends(user) for user in users))

# you can double check by calling friendships dict and counting the number of friends each user has
friendships

{
 0: [1, 2],
 1: [0, 2, 3],
 2: [0, 1, 3],
 3: [1, 2, 4],
 4: [3, 5],
 5: [4, 6, 7],
 6: [5, 8],
 7: [5, 8],
 8: [6, 7, 9],
 9: [8]
}

This pattern of using a one-line for-loop (aka comprehension) will come up often. If we add up all the connections, we get 24 and to find the average, we simply divide by the number of users (10) for 2.4, this part is straight-forward.

Can we sort who has most-to-least friends to find the most connected individuals?

To answer this question, again, a list comprehension is used. The cool thing is that we re-use functions we had previously created (number_of_friends(user)).

# Create a list that loops over users dict, applying a previously defined function
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]

# Then sort
num_friends_by_id.sort(                                 # Sort the list
    key=lambda id_and_friends: id_and_friends[1],       # by number friends
    reverse=True)                                       # descending order

We have just identified how central an individual is to the network, and we can expect to explore degree centrality and networks more in future chapters, but for the purposes of this post, we have identified the central role that collections (lists, dictionaries, tuples) as well as comprehensions play in Python operations.

In the next post, we’ll examing how friendship connections may or may not overlap with interests.

DataScienster_pt2

In the previous section, we began examining a toy data set see what kind of Python concepts from the crash course we’d see in action.

What stands out is the use of collections and comprehension. We’ll see this trend continue as data is given to us in the form of a list of dict or tuples.

Often time, we’re manipulating the data to make it faster and more efficient to iterate through the data. The tool that comes up quite often is using defaultdict to initialize an empty list. Followed by list comprehensions to iterate through data.

Indeed, either we’re seeing how the author, specifically, approaches problem or how problems are approached in Python, in general.

What I’m keeping in mind is that there are more than one way to approach data science problems and this is one of them.

With that said, let’s pick up where the previous section left off.

Friends you may know

We have a sense of the total number of connections and a sorting of the most connected individuals. Now, we may want to design a “people you may know” suggester.

Quick recap, here’s what the friendship dictionary looks like.

friendships

{
 0: [1, 2],
 1: [0, 2, 3],
 2: [0, 1, 3],
 3: [1, 2, 4],
 4: [3, 5],
 5: [4, 6, 7],
 6: [5, 8],
 7: [5, 8],
 8: [6, 7, 9],
 9: [8]
}

Again, the first step is to iterate over friends and collect friends’ friend. The following function returns a list comprehension. Let’s examine this function line-by-line to understand how it works. It returns friend_of_a_friend (foaf) id for each of the individuals’ id, then grabing the id of their friends.

We’ll break it down in code below this function:

def foaf_ids_bad(user):
    """foaf is short for 'friend of a friend' """
    return [foaf_id
            for friend_id in friendships[user["id"]]
            for foaf_id in friendships[friend_id]]

# Let's take Hero, to see Hero's friends 
# we'll call the first key of the friendships dict
# Hero has two friends with ids 1 and 2
friendships[0]  # [1,2]

# then we'll loop over *each* of the friends
friendships[1]  # [0, 2, 3]
friendships[2]  # [0, 1, 3]

# assert that function works
assert foaf_ids_bad(users[0]) == [0, 2, 3, 0, 1, 3]

Can we count mutual friends?

To answer this we’ll use a Counter, which we learned is a dict subclass. Moreover, the function friends_of_friends(user),

from collections import Counter

def friends_of_friends(user):
    user_id = user["id"]
    return Counter(
        foaf_id
        for friend_id in friendships[user_id]    # for each of my friends,
        for foaf_id in friendships[friend_id]    # find their friends
        if foaf_id != user_id                    # who aren't me
        and foaf_id not in friendships[user_id]  # and aren't my friends
    )

# lets look at Hero
# he has two common friends with Chi 
# Chi is neither Hero nor his direct friends
friends_of_friends(users[0])  # Counter({3: 2})

In addition to friendship data, we also have interest data. Here we see a list of tuples, containing a user_id and a string representing a specific of technology.

interests = [
    (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
    (0, "Spark"), (0, "Storm"), (0, "Cassandra"),
    (1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
    (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
    (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
    (3, "statistics"), (3, "regression"), (3, "probability"),
    (4, "machine learning"), (4, "regression"), (4, "decision trees"),
    (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
    (5, "Haskell"), (5, "programming langauges"), (6, "statistics"),
    (6, "probability"), (6, "mathematics"), (6, "theory"),
    (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
    (7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
    (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
    (9, "Java"), (9, "MapReduce"), (9, "Big Data")
    ]

First thing we’ll do is find users with a specific interest. This is function returns a list comprehension. It first split each tuple into user_id (integer) and user_interest (string), then conditionally check if the string in the tuple matches the input parameter.

def data_scientists_who_like(target_interest):
    """Find the ids of all users who like the target interests."""
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]
            
# let's see all user_id who likes "statistics"
data_scientists_who_like("statistics")   # [3, 6]

We may also want to count the number of times a specific interest comes up. Here’s a function for that. We use a basic for-loop and if-statement to check truthiness of user_interest == target_interest.

def num_user_with_interest_in(target_interest):
    interest_count = 0
    for user_id, user_interest in interests:
        if user_interest == target_interest:
            interest_count += 1
    return interest_count

A concern is having to examine a whole list of interests for every search. The author proposes building an index from interests to users. Here, a defaultdict is imported, then populated with user_id

from collections import defaultdict

# user_ids matched to specific interest
user_ids_by_interest = defaultdict(list)

for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)

# three users interested in Python
assert user_ids_by_interest["Python"] == [2,3,5]

# list of interests by user_id
interests_by_user_id = defaultdict(list)

for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)

# check all of Hero's interests
assert interests_by_user_id[0] == ['Hadoop', 'Big Data', 'HBase', 'Java', 'Spark', 'Storm', 'Cassandra']

We can find who has the most interests in common with a given user. Looks like Klein (#9) has the most common interests with Hero (#0). Here we return a Counter with for-loops and an if-statement.

def most_common_interests_with(user):
    return Counter(
        interested_user_id
        for interest in interests_by_user_id[user["id"]]
        for interested_user_id in user_ids_by_interest[interest]
        if interested_user_id != user["id"]
        )
        
# let's check to see who has the most common interest with Hero
most_common_interests_with(users[0]) # Counter({9: 3, 8: 1, 1: 2, 5: 1})

Finally, we can also find which topics are most popular among the network. Previously, we calculated the number of users interested in a particular topic, but now we want to compare the whole list.

words_and_counts = Counter(word
                           for user, interest in interests
                           for word in interest.lower().split())

Salaries and Experience Data

We’re also given anonymous salary and tenure (number of years work experience) data, let’s see what we can do with that information. First we’ll find the average salary. Again, we’ll start by creating a list (defaultdict), then loop through salaries_and_tenures.

salaries_and_tenures = [(83000, 8.7), (88000, 8.1),
                        (48000, 0.7), (76000, 6),
                        (69000, 6.5), (76000, 7.5),
                        (60000, 2.5), (83000, 10),
                        (48000, 1.9), (63000, 4.2)]
                        
salary_by_tenure = defaultdict(list)

for salary, tenure in salaries_and_tenures:
    salary_by_tenure[tenure].append(salary)

# find average salary by tenure
average_salary_by_tenure = {
    tenure: sum(salaries) / len(salaries)
    for tenure, salaries in salary_by_tenure.items()
    }

The problem is that this is not terribly informative as each tenure value has a different salary. Not even the average_salary_by_tenure is informative, so our next move is to group similar tenure values together.

First, we’ll create the groupings/categories using a control-flow, then we’ll create a list(defaultdict), and loop through salaries_and_tenures to populate the newly created salary_by_tenure_bucket. Finally calculate the average.

def tenure_bucket(tenure):
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    else:
        return "more than five"
        
salary_by_tenure_bucket = defaultdict(list)

for salary, tenure in salaries_and_tenures:
    bucket = tenure_bucket(tenure)
    salary_by_tenure_bucket[bucket].append(salary)
    
# finally calculate average
average_salary_by_bucket = {
    tenure_bucket: sum(salaries) / len(salaries)
    for tenure_bucket, salaries in salary_by_tenure_bucket.items()
    }

One thing to note is that the “given” data, in this hypothetical toy example is either in a list of dictionaries or tuples, which may be atypical if we’re used to working with tabular data in dataFrame (pandas) or native data.frame in R.

Again, we are reminded that the higher purpose of this book - Data Science from Scratch (by Joel Grus; 2nd Ed) is to eschew libraries in favor of plain python to build everything from the ground up.

Should your goal be to learn how various algorithms work by building them up from scratch, and in the process learn how data problems can be solved with python and minimal libraries, this is your book.

Joel Grus does make clear that you would use libraries and frameworks (pandas, scikit-learn, matplotlib etc), rather than coded-from-scratch algorithms when working in production environments and will point out resource for further reading at the end of the chapters.

In the next post, we’ll get into visualizing data.

Paul Apivat
Paul Apivat
CryptoData Analyst ⛓️

Onchain data, DeFi, pipelines.

Related