Calculating 25th, 50th and 75th Percentile of Column Values
Calculating Percentiles
When we have a list of values in a column, how can we determine which values are under/over the 25th percentile, 50th percentile or 75th percentile?
Here the example are countries’ average percentages of the population with, broadly speaking, ICT Skills as determine by the Sustainable Development Goals, Indicator 4.4.1.
There are three methods. First, manually calculating values for the 25th, 50th and 75th percentile with the quantile()
function.
# Country mean_values at 25th, 50th and 75th percentile
data %>%
select(GeoAreaName, Value, Sex, `Type of skill`, TimePeriod, Units) %>%
rename(type_of_skill = `Type of skill`) %>%
mutate(Value = as.numeric(Value)) %>%
group_by(GeoAreaName) %>%
summarize(
mean_value = mean(Value)
) %>%
mutate(
min_mean = min(mean_value),
iqr_25_percentile = quantile(mean_value, probs = c(0.25)),
iqr_50_percentile = quantile(mean_value, probs = c(0.50)),
iqr_75_percentile = quantile(mean_value, probs = c(0.75)),
max_mean = max(mean_value)
) %>%
arrange(desc(mean_value))
The second approach is to use the ntile()
function:
# Creating bins using ntile()
data %>%
select(GeoAreaName, Value, Sex, `Type of skill`, TimePeriod, Units) %>%
rename(type_of_skill = `Type of skill`) %>%
mutate(Value = as.numeric(Value)) %>%
group_by(GeoAreaName) %>%
summarize(
mean_value = mean(Value)
) %>%
mutate(
mean_value_binned = ntile(mean_value, 4)
) %>%
arrange(desc(mean_value))
The third approach uses the purrr
package and the partial
function that can be used with dplyr's
summarize_at()
function. Check out the
source
## Using purrr
library(purrr)
p <- c(0.25, 0.50, 0.75)
p_names <- map_chr(p, ~paste0(.x*100, "%"))
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
set_names(nm = p_names)
p_funs
data %>%
select(GeoAreaName, Value, Sex, `Type of skill`, TimePeriod, Units) %>%
rename(type_of_skill = `Type of skill`) %>%
mutate(Value = as.numeric(Value)) %>%
group_by(GeoAreaName) %>%
summarize(
mean_value = mean(Value)
) %>%
summarize_at(vars(mean_value), funs(!!!p_funs))