Survey Data Cleaning
Survey Data Cleaning
NOTE: These tips are a distillation of data cleaning techniques I picked up in the course of cleaning data for the first DAO Creators Survey (Gitcoin x BanklessDAO).
The DAO Creators Survey was a two part survey sampling 442 and 256 respondents to approximately 50 survey questions ranging from demographics to web3 tooling, DAO compensation/healthcare and income stability, to name a few.
The questions ranged from highly structured (i.e., multiple choice, multiple response options and dropdown boxes) to highly unstructured (i.e., qualitative responses).
I created approximately 50 charts for this report and each chart presented unique data cleaning challenges. However, I will describe a base foundation and areas of overlap so the next project is easier.
Two articles were used for reference, but because this project optimized for speed, I did not do a full text analysis.
Pre-Cleaning Steps
The first move for any survey is to change column names into more manageable short codes and then delete identifying information to preserve privacy, for example:
df1 <- df %>%
# Rename: shorten column names to be manageable
rename(
timestamp = "Timestamp",
daos_work_for = "what DAO(s) do you work for? for each DAO, how many hours/month do you work? (feel free to include multiple)",
city = "what city are you based in?",
twitter = "whats your twitter username?",
eth_addr = "whats your ETH address?",
) %>%
# delete identifying information
select(-twitter, -eth_addr, -email)
Baseline Step: Convert Text to Tidy Format
This requires the tidytext
package and a couple functions. The flow is to use unnest_tokens()
to separate a string of words into a vector of individual words. Then follow-up with anti_join()
to get rid of stop words (a corpus of words is provided with tidytext).
Then, group and tally, which can be achieved with group_by()
and tally(sort = TRUE)
or one function count(, sort = TRUE)
.
daos_work_tbl %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
view()
If there are too many words, we can filter()
and drop NA responses. With dplyr
these operations can be chained to ggplot2
to visualize the output.
daos_work_tbl %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(n > 3) %>%
drop_na() %>%
String Detect
Sometimes you need to use str_detect()
to see how many instances of a string are present in a column. If there is a match of string detected, you want to categorize survey responses. This is structured combining if_else()
conditionals with str_detect()
.
This first requires creating an empty column:
# create empty column
daos_work_long$bin <- NA
# use if_else and str_detect
daos_work_long$bin <- if_else((str_detect(daos_work_long$word, "cre8")==TRUE), "cre8rdao", "NA")
daos_work_long$bin <- if_else((str_detect(daos_work_long$word, "mstable")==TRUE), "mstable", daos_work_long$bin)
daos_work_long$bin <- if_else((str_detect(daos_work_long$word, "marrow")==TRUE), "marrow dao", daos_work_long$bin)
daos_work_long$bin <- if_else((str_detect(daos_work_long$word, "badger")==TRUE), "badger dao", daos_work_long$bin)
daos_work_long$bin <- if_else((str_detect(daos_work_long$word, "raid")==TRUE), "raid guild", daos_work_long$bin)
daos_work_long$bin <- if_else((str_detect(daos_work_long$word, "metagame")==TRUE), "metagame", daos_work_long$bin)
String Match
In some situations, you may want to see if a string contains a specific word. The function to use here is str_match()
. Here, we’re seeing if a string contains either yes
or yeah
or no
or not
:
income_stability_tbl2 <- income_stability_tbl %>%
mutate(phrase = strsplit(as.character(text), ",")) %>%
unnest(phrase) %>%
count(phrase, sort = TRUE) %>%
mutate(
phrase_no = str_match(phrase, "[Nn]o|[Nn]ot")[,1],
phrase_no = str_to_lower(phrase_no)
) %>%
mutate(
phrase_yes = str_match(phrase, "[Yy]es|[Yy]eah")[,1],
phrase_yes = str_to_lower(phrase_yes)
)
Handling each survey question (column) separately
This requires splitting each column off. You could turn it into a vector
first, then tibble
or just subset a dataframe:
comp_denom_v <- as.vector(df1$comp_denom)
comp_denom_tbl <- tibble(line = 1:445, text = comp_denom_v)
Manually add numbers
Surprisingly, it was not easy to add items from the same category:
Item | Number |
---|---|
Zebra | 8 |
Zebra | 17 |
It should be more straight forward to add Zebra. But instead we have to really manually add. For example, here i’m manually changing the n
for bankless dao to 35
:
# bankless dao = 35
daos_work_long2$n[12] <- 35
Delete specific rows
There are two ways to delete rows. First is to subset (a base R operation):
daos_work_long3 <- daos_work_long2[-c(4, 6, 13, 14, 15, 19, 24, 28, 31, 42, 46, 51, 52, 53,
61, 63, 67, 68, 69, 70, 78, 81, 87, 95, 100, 103),] %>%
arrange(desc(n))
The second way is to use slice
in dplyr
. Slice
can be used to select or re-order rows as well:
usd_earning_tbl3 <- usd_earning_tbl2 %>%
slice(4, 6, 7, 1:3, 5, 8:9)
Assigning Factors to Preserve Order for Visualization
After using slice
to re-order rows, we can use mutate()
and as_factor()
to create factors for visualization. This preserves the order we want (e.g., age range on the x-axis):
# reorder rows, save as new df
usd_earning_tbl3 <- usd_earning_tbl2 %>%
slice(4, 6, 7, 1:3, 5, 8:9)
# need to sort by factors before visualize
usd_earning_tbl3 %>%
mutate(text_factor = as_factor(text))
Separate String at Comma
Sometimes, simply turning a string into tidytext
doesn’t work because meaning phrases of two or three words inadvertently get split, so we may need to split by comma with mutate()
and strsplit()
, in lieu of using unnest_tokens()
, then group and tally:
task_tabl2 <- task_tbl %>%
mutate(phrase = strsplit(as.character(text), ",")) %>%
unnest(phrase) %>%
count(phrase, sort = TRUE) %>%
view()
Github Repo
See data cleaning scripts here