Data Manipulation using DPLYR : Part 3

Merging Data Using DPLYR

In this blog, you will learn how to Merge Data using the dplyr R package. When you will be working with multiple datasets then often, you will need to combine these datasets to perform the analysis that you are interested in. The dplyr package provides several join functions to combine relational data. The fundamental syntax for each of these function is:

xxxx_join(x = , y = , by = )

The DPLYR Join:

The dplyr join functions can be broken down into two categories, mutating joins and filtering joins:

1] Mutating Joins :- Add new variables to one data frame from matching observations in another.

  • inner_join : Return only rows that appear in both datasets.

  • full_join : Returns all rows in both data sets

  • left_join : Returns all rows from x and matching rows from y

  • right_join : Returns all rows from y and the matching rows in x

2] Filtering Joins :- Filter observations from one data frame based on whether or not they match an observation in another.

  • semi_join : Returns rows of x that are in y

  • anti_join : Returns rows of x that don’t match in y

Required R package

First, you need to install the dplyr package and load the dplyr library then after you can able to perform the following data merging operations.

Mutating Inner Joins

An inner join matches the pairs of observations whenever their keys are equal. As an example, let’s look at the following two data frames.

library(dplyr)std1 <- data.frame(Student_Id = c(1012301, 1012302, 1012303, 1012304),
 Age = c(20, 19, 20, 19))


std2 <- data.frame(Student_Id = c(1012301, 1012302, 1012304, 1012305),
                      Sex = c('M', 'M', 'F', 'F'))


In this example, Student_Id 1012301, 1012302 and 1012304 are common among both datasets.

InnerJoin <- inner_join(x = std1, y = std2, by = "Student_Id")


Mutating Outer Joins(Full, Left, Right)

An outer join keeps observations that appear in at least one of the datasets. Let’s again look at the same two data frames.

Full Join:- A full join results in Student_Id 1012303 and 1012305 appearing in the output dataset despite being unique to std1 and std2 respectively:

FullJoin <- full_join(x = std1, y = std2, by = "Student_Id")


Left Join:- Left join keeps all observations in x, and matching observations from y. A left join results in a dataset with Student_Id’s 1012301, 1012302, 1012303 and 1012304, plus the data from std2 for Student_Id’s 1012301, 1012302 and 1012304.

LeftJoin <- left_join(x = std1, y = std2, by = "Student_Id")


Right Join:- Right join keeps all observations in y, and matching observations from x. A left join results in a dataset with Student_Id’s 1012301, 1012302, 1012304 and 1012305, plus the data from std2 for Student_Id’s 1012301, 1012302 and 1012304.

RightJoin <- right_join(x = std1, y = std2, by = "Student_Id")