Approaches to finding Non-Matching Rows between Two Columns
Approaches to finding non-matching rows between 2 columns
Situation: You have two email lists - a full and already_sent. The latter is a subset of the former.
The first condition to test is that the full
and already_sent
lists are mutually exclusive.
Second, the examples below assume the columns are in different dataframes.
Third, you generally put the longer column first in a function.
Then, you want to find non-matching rows.
Demo Two Tables
Table 1: a1
a | b |
---|---|
1 | a |
2 | b |
3 | c |
4 | d |
5 | e |
Table 2: a2
a | b |
---|---|
1 | a |
2 | b |
3 | c |
# generate two dataframes to demo
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
Dplyr: anti_join
The anti_join
function comes with dplyr
. You will find non-matching rows between 2 tables;
library(dplyr)
anti_join(a1, a2, by = 'b')
Dplyr: setdiff
This returns the non-matching rows by individual characters. You’ll need to wrap the output in a data.frame()
. This solution is not as clean as a anti_join
or the setDT
down below.
library(dplyr)
# individual characters
setdiff(a1$b, a2$b)
# wrapped in a dataframe
data.frame(setdiff(a1$b, a2$b))
Data.Table: setDT
library(data.table)
setDT(a1)[!a1$b %chin% a2$b]
Double check in Excel
Check if values column a1
exists in column a2
. You’re going to create a middle column between the two with values of either Exist
, where both emails exists or Not Exist
, where one column is missing.
Assuming a1
starts on cell A4.
Assuming a2
starts on cell C4.
In the scenario of Full Email List vs Emails Sent case, a1
(Full Email List) had 4286 rows, while a2
(Emails Sent) had 2373 rows. Here’s the Excel function:
# Excel
=IF(ISERROR(VLOOKUP(C2,$A$2:$A$4286, 1, FALSE)),"Not Exist","Exist" )
Finding Non-Matching Rows: anti_join()
We often need to join two columns from different data frames. Rows to be joined are assumed to have the same value.
Even different casing means those values will not be joined. For example: “Nigeria” and “NIGERIA” will not be joined.
It’s particularly useful to know which values are not in sync when you have a list of countries and you want to join with one of the map
libraries (e.g., ggmap
). If the country is spelt differently, the join doesn’t happen.
Enter anti_join:
world_map1 <- world_map %>%
mutate(id = region)
df1 <- df %>%
mutate(id = country)
# use anti_join to figure out which rows are not aligned
anti_join(world_map1, df1, by = "id")