Going beyond summary statistics
A course supplement
Datasaurus Introduction
I recently came across the Datasaurus dataset by Alberto Cairo on #TidyTuesday and wanted to create a series of charts illustrating the lessons associated with this dataset, primarily to: never trust summary statistics alone.
First, some context. Here’s Alberto’s original tweet from years ago when he created this dataset:
This tweet alone doesn’t communicate why we shouldn’t trust summary statistics alone, so let’s unpack this. First we’ll load the various packages and data we’ll use.
Load Packages
library(tidyverse)
## ── Attaching packages ─────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.1
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggcorrplot)
library(ggridges)
Load Data
note : datasaurus
and datasaurus_dozen
are identical. The former is provided via #TidyTuesday, the latter from this research paper discussing more advanced concepts beyond the scope of this document (i.e., simulated annealing).
You’ll also note that datasaurus_dozen
and datasaurus_wide
are the same data, organized differently. The former in long format and the latter, in wide format - see here for details.
For the most part, we’ll use datasaurus_dozen
throughout this document. We’ll use datasaurus_wide
when we get to the correlation section.
datasaurus <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-13/datasaurus.csv')
## Parsed with column specification:
## cols(
## dataset = col_character(),
## x = col_double(),
## y = col_double()
## )
datasaurus_dozen <- read_tsv('./data/DatasaurusDozen.tsv')
## Parsed with column specification:
## cols(
## dataset = col_character(),
## x = col_double(),
## y = col_double()
## )
datasaurus_wide <- read_tsv('./data/DatasaurusDozen-wide.tsv')
## Warning: Duplicated column names deduplicated: 'away' => 'away_1' [2],
## 'bullseye' => 'bullseye_1' [4], 'circle' => 'circle_1' [6], 'dino' =>
## 'dino_1' [8], 'dots' => 'dots_1' [10], 'h_lines' => 'h_lines_1' [12],
## 'high_lines' => 'high_lines_1' [14], 'slant_down' => 'slant_down_1' [16],
## 'slant_up' => 'slant_up_1' [18], 'star' => 'star_1' [20], 'v_lines'
## => 'v_lines_1' [22], 'wide_lines' => 'wide_lines_1' [24], 'x_shape' =>
## 'x_shape_1' [26]
## Parsed with column specification:
## cols(
## .default = col_character()
## )
## See spec(...) for full column specifications.
Eyeballing the data
Here are the first six rows of datasaurus_dozen
(long):
## # A tibble: 6 x 3
## dataset x y
## <chr> <dbl> <dbl>
## 1 dino 55.4 97.2
## 2 dino 51.5 96.0
## 3 dino 46.2 94.5
## 4 dino 42.8 91.4
## 5 dino 40.8 88.3
## 6 dino 38.7 84.9
Here are the first six rows of datasaurus_wide
(wide):
## # A tibble: 6 x 26
## away away_1 bullseye bullseye_1 circle circle_1 dino dino_1 dots dots_1
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 x y x y x y x y x y
## 2 32.3… 61.41… 51.2038… 83.339776… 55.99… 79.2772… 55.3… 97.17… 51.1… 90.86…
## 3 53.4… 26.18… 58.9744… 85.499817… 50.03… 79.0130… 51.5… 96.02… 50.5… 89.10…
## 4 63.9… 30.83… 51.8720… 85.829737… 51.28… 82.4359… 46.1… 94.48… 50.2… 85.46…
## 5 70.2… 82.53… 48.1799… 85.045116… 51.17… 79.1652… 42.8… 91.41… 50.0… 83.05…
## 6 34.1… 45.73… 41.6832… 84.017940… 44.37… 78.1646… 40.7… 88.33… 50.5… 82.93…
## # … with 16 more variables: h_lines <chr>, h_lines_1 <chr>, high_lines <chr>,
## # high_lines_1 <chr>, slant_down <chr>, slant_down_1 <chr>, slant_up <chr>,
## # slant_up_1 <chr>, star <chr>, star_1 <chr>, v_lines <chr>, v_lines_1 <chr>,
## # wide_lines <chr>, wide_lines_1 <chr>, x_shape <chr>, x_shape_1 <chr>
There are 13 variables, each with X- and Y- axes.
Summary Statistics
First, we’ll note that if we just look at summary statistics (i.e., mean and standard deviation), we might conclude that these variables are all the same. Moreover, within each variable, x
and y
values have very similarly low correlations at ranging from -0.06 to -0.07.
datasaurus_dozen %>%
group_by(dataset) %>%
summarize(
x_mean = mean(x),
x_sd = sd(x),
y_mean = mean(y),
y_sd = sd(y),
corr = cor(x,y)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 13 x 6
## dataset x_mean x_sd y_mean y_sd corr
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 away 54.3 16.8 47.8 26.9 -0.0641
## 2 bullseye 54.3 16.8 47.8 26.9 -0.0686
## 3 circle 54.3 16.8 47.8 26.9 -0.0683
## 4 dino 54.3 16.8 47.8 26.9 -0.0645
## 5 dots 54.3 16.8 47.8 26.9 -0.0603
## 6 h_lines 54.3 16.8 47.8 26.9 -0.0617
## 7 high_lines 54.3 16.8 47.8 26.9 -0.0685
## 8 slant_down 54.3 16.8 47.8 26.9 -0.0690
## 9 slant_up 54.3 16.8 47.8 26.9 -0.0686
## 10 star 54.3 16.8 47.8 26.9 -0.0630
## 11 v_lines 54.3 16.8 47.8 26.9 -0.0694
## 12 wide_lines 54.3 16.8 47.8 26.9 -0.0666
## 13 x_shape 54.3 16.8 47.8 26.9 -0.0656
Boxplots
You could use boxplots
to show slight variation in the distribution and median values of these 13 variables. However, the mean values, indicated with the red circles, are identical.
datasaurus_dozen %>%
ggplot(aes(x = dataset, y = x, fill = dataset)) +
geom_boxplot(alpha = 0.6) +
stat_summary(fun = mean, geom = "point", shape = 20, size = 6, color = "red", fill = "red") +
scale_fill_brewer(palette = "Set3") +
theme_classic() +
theme(legend.position = 'none') +
labs(
y = '13 variables',
x = 'X-values',
title = "Boxplots: Slight differences in the distribution and median values (X-axis)",
subtitle = "Identical mean values"
)
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set3 is 12
## Returning the palette you asked for with that many colors
Here’s the same plot for y
values:
datasaurus_dozen %>%
ggplot(aes(x = dataset, y = y, fill = dataset)) +
geom_boxplot(alpha = 0.6) +
stat_summary(fun = mean, geom = "point", shape = 20, size = 6, color = "red", fill = "red") +
scale_fill_brewer(palette = "Paired") +
theme_classic() +
theme(legend.position = 'none') +
labs(
y = '13 variables',
x = 'Y-values',
title = "Boxplots: Slight differences in the distribution and median values (Y-axis)",
subtitle = "Identical mean values"
)
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Paired is 12
## Returning the palette you asked for with that many colors
Ridgeline Plot
We can begin to get a sense for how these variables are different if we plot the distribution in different ways. The ridgeline plot begins to reveal aspects of the data that were hidden before.
We can begin to see that certain variables have markedly different distribution shapes (i.e., v_lines
, dots
, x_shape
, wide_lines
), while having the same mean value.
datasaurus_dozen %>%
ggplot(aes(x = x, y = dataset, fill = dataset)) +
geom_density_ridges_gradient(scale = 3, quantile_lines = T, quantile_fun = mean) +
scale_fill_manual(values = c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99', '#b15928', 'grey')) +
theme_classic() +
theme(legend.position = 'none') +
labs(
x = "X-values",
y = "13 variables",
title = "Ridgeline Plot: More variation in the distribution (X-axis)",
subtitle = "Identical mean values"
)
## Picking joint bandwidth of 5.46
For y
values, high_lines
, dots
, circle
and star
have obviously different distributions from the rest. Again, the mean values are identical across variables.
datasaurus_dozen %>%
ggplot(aes(x = y, y = dataset, fill = dataset)) +
geom_density_ridges_gradient(scale = 3, quantile_lines = T, quantile_fun = mean) +
scale_fill_manual(values = c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99', '#b15928', 'grey')) +
theme_classic() +
theme(legend.position = 'none') +
labs(
x = "Y-values",
y = "13 variables",
title = "Ridgeline Plot: More variation in the distribution (Y-axis)",
subtitle = "Identical mean values"
)
## Picking joint bandwidth of 9
Correlations
If you skip visualizing the distribution and central tendencies and go straight to seeing how the variables correlate with each other, you could also miss some fundamental differences in the data.
In particular, the x
and y
values across all 13 variables are highlight correlated. With just knowledge of the summary statistics, one could be led to believe that these variables are highly similar.
Below is an abbreviated correlation matrix.
library(ggcorrplot)
# X-values
# selecting rows 2-143
# turning all values from character to numeric
datasaurus_wide_x <- datasaurus_wide %>%
slice(2:143) %>%
select(away, bullseye, circle, dino, dots, h_lines, high_lines, slant_down, slant_up, star, v_lines, wide_lines, x_shape) %>%
mutate_if(is.character, as.numeric)
# Y-values
# selecting rows 2-143
# turning all values from character to numeric
datasaurus_wide_y <- datasaurus_wide %>%
slice(2:143) %>%
select(away_1, bullseye_1, circle_1, dino_1, dots_1, h_lines_1, high_lines_1, slant_down_1, slant_up_1, star_1, v_lines_1, wide_lines_1, x_shape_1) %>%
mutate_if(is.character, as.numeric)
# correlation matrix for X values
corr_x <- round(cor(datasaurus_wide_x), 1)
# correlation matrix for Y values
corr_y <- round(cor(datasaurus_wide_y), 1)
head(corr_x[, 1:6])
## away bullseye circle dino dots h_lines
## away 1.0 -0.3 -0.3 -0.3 -0.3 -0.3
## bullseye -0.3 1.0 0.9 0.9 0.9 0.9
## circle -0.3 0.9 1.0 0.9 0.8 0.9
## dino -0.3 0.9 0.9 1.0 0.9 1.0
## dots -0.3 0.9 0.8 0.9 1.0 0.9
## h_lines -0.3 0.9 0.9 1.0 0.9 1.0
Visualizing the correlation matrix
Here is a correlation between the x-values
between all 13 variables. You can see that all variables, aside from away
, are highly correlated with each other.
# correlation between X-values
ggcorrplot(corr_x, hc.order = TRUE,
type="lower",
outline.color = "white",
ggtheme = ggplot2::theme_gray,
colors = c("#d8b365", "#f5f5f5", "#5ab4ac"),
lab = TRUE)
Here is a correlation between the ‘y-values’ between all 13 variables. Again, aside from away
, all the variables are highly correlated with each other.
# correlation between Y-values
ggcorrplot(corr_y, hc.order = TRUE,
type="lower",
outline.color = "white",
ggtheme = ggplot2::theme_gray,
colors = c("#ef8a62", "#f7f7f7", "#67a9cf"),
lab = TRUE)
Facets
At this point, the boxplots show us variables with similar median and identical mean; the ridgelines begin to show us that some variables have different distributions. And the correlation matrix suggests the variables are more similar than not.
To really see their differences, we’ll need to use facet_wrap
.
Here we’ll use facet_wrap
to examine the histogram for x
and y
values of all 13 variables. We started to see the differences in distribution between variables from the ridgeline
plots, but overlapping histograms provide another perspective.
# facet histogram (both-values)
datasaurus_dozen %>%
group_by(dataset) %>%
ggplot() +
geom_histogram(aes(x=x, fill='red'), alpha = 0.5, bins = 30) +
geom_histogram(aes(x=y, fill='green'), alpha = 0.5, bins = 30) +
facet_wrap(~dataset) +
scale_fill_discrete(labels = c('y', 'x')) +
theme_classic() +
labs(
fill = 'Axes',
x = '',
y = 'Count',
title = 'Faceted Histogram: x- and y-values'
)
Scatter Plot
However, if there’s one thing this dataset is trying to communicate its that there’s no subtitute for plotting the actual data points. No amount of summary statistics, central tendency or distribution is going to replace plotting actually data points.
Once we create the scatter plot with geom_point
, we see the big reveal with this dataset. That despite the similarities in central measures, for the most part similar distributions and high correlations, the 13 variables are wildly different from each other.
datasaurus_dozen %>%
group_by(dataset) %>%
ggplot(aes(x=x, y=y, color=dataset)) +
geom_point(alpha = 0.5) +
facet_wrap(~dataset) +
scale_color_manual(values = c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99', '#b15928', 'grey')) +
theme_classic() +
theme(legend.position = "none") +
labs(
x = 'X-axis',
y = 'Y-axis',
title = 'Faceted Scatter Plot'
)
There are other less common alternatives to the scatter plot.
Geom Density 2D
While not as clear as the scatter plot, plotting the contours of a 2D density estimate does show how very different the variables are from each other, despite similar summary statistics.
# contours of a 2D Density estimate
datasaurus_dozen %>%
ggplot(aes(x=x, y=y)) +
geom_density_2d() +
theme_classic() +
facet_wrap(~dataset) +
labs(
x = 'X-axis',
y = 'Y-axis',
title = 'Contours of a 2D density estimate'
)
This is a slight variation using stat_density_2d
:
# stat density 2d
datasaurus_dozen %>%
ggplot(aes(x=x, y=y)) +
stat_density_2d(aes(fill=y), geom = "polygon", colour = 'white') +
theme_classic() +
facet_wrap(~dataset) +
labs(
x = 'X-axis',
y = 'Y-axis',
title = 'Stat Density 2D estimate'
)
Using the density_2d
plots are quite effective in showing how different the variables are and serve as a nice alternative to the more familiar scatter plot.
Hopefully this vignette illustrates the importance of never trusting summary statistics (alone). Moreover, when visualizing, we should go beyond simply visualizing the data’s distribution or central tendency, but plotting the actually data points.