# Making sense of matplotlib

Data Visualization with Matplotlib

## Data Visualization Chapter 3 of Data Science from Scratch introduces us to visualizing data using matplotlib. This is widely used in the Python ecosystem, although my sense is that people are just as happy, if not more, to use other libraries like seaborn, Altair and bokeh. (note: seaborn is built on top of matplotlib).

This chapter is fairly brief and is meant as a quick introduction to matplotlib - to get readers familiar with basic charts. Whole books can be written on data visualization alone, so this is meant more as an appetizer, rather than a full-course.

There’s a fair amount of detail involved in using matplotlib, so we’ll break it down to demystify it.

#### Basic Plotting

This chapter goes through the main basic charts including Line, Bar, Histograms, and Scatter Plots. At first glance, they follow a similar pattern. Data is provided as a `list` of numbers (usually more than one list). `pyplot` is imported from `matplotlib` as `plt`. The `plt` module has several functions which are accessed to create the plot.

Here’s an example line chart visualizing growth in GDP over time: Here’s the code:

``````
from matplotlib import pyplot as plt

# the data
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# the plot
plt.plot(years, gdp, color="green", marker='o', linestyle='solid')
plt.title("Nominal GDP")
plt.ylabel("Billions of \$")
plt.xlabel("Years")
plt.show()

``````

You can somewhat get by with just knowing this. Briefly consulting the documentation will let you see some other chart types like so: Let’s say we wanted to convert our line chart into a stacked area chart, we can just change one line:

``````from matplotlib import pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

plt.stackplot(years, gdp, color="green") # this is the only line we changed

plt.title("Nominal GDP")
plt.ylabel("Billions of \$")
plt.xlabel("Years")
plt.show()

``````

Here’s what the stacked area chart version of the previous graph looks like: To keep things simple, we can change the chart type with just one line and we just need to remember that when converting from chart to chart, we have to be mindful of the parameters that each chart type takes. For example, a stacked area chart takes in different parameters than line charts (for example, you’ll get an `AttributionError` if you try to use `marker` in a stacked area chart.)

Here’s an example bar chart comparing movies by the number of Academy awards they’ve won: Here’s a stem plot version: As with the previous example, changing just one function from `plt.bar` to `plt.stem` gave us a different plot:

``````#---- Original Bar Chart ----#

movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5,11,3,8,10]

plt.bar(range(len(movies)), num_oscars)
plt.title("My Favorite Movies")
plt.xticks(range(len(movies)), movies)
plt.show()

# ---- Stem Chart ---- #

movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5,11,3,8,10]

plt.stem(range(len(movies)), num_oscars) # the only change
plt.title("My Favorite Movies")
plt.xticks(range(len(movies)), movies)
plt.show()
``````

#### There are levels to this: Hierarchy

I’m all for keeping matplotlib as simple as possible but one thing the above examples gloss over is the matplotlib object hierarchy, which is something worth understanding to get a feel for how the various functions operate.

This next figure is borrowed from Real Python and it nicely highlights the hierarchy inherent in every plot: You’ll note the levels: Figure, Axes and Axis. When digging into matplotlib documentation on axes, these levels are brought to the foreground.

To really see this in action, we’ll need to code our plot slightly differently. For the last chart this chapter examines the bias-variance tradeoff which is something we’ll learn more about in future chapters, but it highlights the trade-off in trying to simultanenously minimize two sources of error so our algorithm generalizes to new situations.

Here’s the code:

``````# BOOK version
variance = [1,2,4,8,16,32,64,128,256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x,y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]

plt.plot(xs, variance, 'g-', label='variance')
plt.plot(xs, bias_squared, 'r-', label='bias^2')
plt.plot(xs, total_error, 'b:', label='total error')
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.xticks([])
plt.show()

# ALTERNATE version
variance = [1,2,4,8,16,32,64,128,256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x,y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]

fig, ax = plt.subplots(figsize=(8,5))
ax.plot(xs, variance, 'g-', label='variance')
ax.plot(xs, bias_squared, 'r-', label='bias^2')
ax.plot(xs, total_error, 'b:', label='total error')
ax.legend(loc='upper center')
ax.set_xlabel("model complexity")
fig.tight_layout()
fig.show()

``````

Instead using the `plt` module, we use `fig` and `ax`, here are their data types:

``````type(fig) # matplotlib.figure.Figure
type(ax)  # matplotlib.axes._subplots.AxesSubplot
``````

This makes explicit the matplotlib object hierarchy, particularly as we see how we access function at the `axes._subplits.AxesSubplot` level (the documentation has much more detail).

Here’s the chart: In summary, we learned that matplotlib can be fairly simple to use for static, simple plots, but we’re better served having some understanding of matplotlib’s object hierarchy. We’ll examine more chart types as we proceed with the rest of the chapters. ##### Paul Apivat
###### web3 data

My interests include data science, machine learning and R/Python programming.