5.4 Data Manipulation with dplyr
dplyr is the core tidyverse package for data manipulation (Wickham et al. 2023). It provides a consistent set of functions — called “verbs” — for the most common data operations. We demonstrate each using the built-in mtcars dataset.
5.4.1 Core Functions and Examples
5.4.1.3 Arranging with arrange()
arrange() reorders rows by the values in one or more columns. Use desc() for descending order.
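As a minimal sketch of the description above, the following sorts mtcars first by a single column and then by two columns with one in descending order. The object names lightest_first and by_cyl_then_mpg are illustrative choices, not names used elsewhere in this chapter, and the native pipe assumes R 4.1 or later.

```r
library(dplyr)

# Sort cars from lightest to heaviest (ascending order is the default)
lightest_first <- mtcars |>
  arrange(wt)

# Sort by cylinder count, then by mpg from highest to lowest within each count
by_cyl_then_mpg <- mtcars |>
  arrange(cyl, desc(mpg))

head(by_cyl_then_mpg)
```

When several columns are supplied, later columns only break ties among rows that share the same values in the earlier ones.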
5.4.1.4 Mutating with mutate()
mutate() adds new columns or modifies existing ones. Note that wt in mtcars is recorded in thousands of pounds (e.g., 2.62 = 2,620 lbs).
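The unit conversion mentioned above can be sketched as follows. The column names wt_lbs and wt_kg and the object name cars_with_weight are illustrative assumptions; the recoding of am relies on the mtcars documentation, which defines 0 as automatic and 1 as manual.

```r
library(dplyr)

cars_with_weight <- mtcars |>
  mutate(
    # New column: convert thousands of pounds to pounds
    wt_lbs = wt * 1000,
    # New column built from the one created just above (1 lb = 0.45359237 kg)
    wt_kg = wt_lbs * 0.45359237,
    # Existing column modified in place: 0/1 codes become labelled factor levels
    am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
  )

head(cars_with_weight)
```

Expressions inside a single mutate() call are evaluated in order, which is why wt_kg can refer to wt_lbs.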
5.4.1.5 Summarizing with summarize()
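summarize() reduces data to summary statistics. Applied to an ungrouped data frame, it returns a single row; combined with group_by(), it returns one row per group, which is where it is most useful.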
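A minimal sketch of summarize() on the ungrouped mtcars data is shown here. The choice of statistics and the names overall_summary, n_cars, mean_mpg, and max_hp are illustrative assumptions.

```r
library(dplyr)

# Collapse all 32 rows of mtcars into a single row of summary statistics
overall_summary <- mtcars |>
  summarize(
    n_cars = n(),                        # number of rows
    mean_mpg = mean(mpg, na.rm = TRUE),  # average fuel efficiency
    max_hp = max(hp)                     # highest horsepower
  )

overall_summary
```

Each argument becomes one column of the result, so the output here has one row and three columns.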
5.4.1.6 Grouping and Summarizing with group_by() and summarize()
group_by() segments a data frame into groups based on one or more variables. Any subsequent summarize() call computes statistics within each group rather than across the entire dataset.
# Load dplyr if it is not already attached
library(dplyr)

# Group data by the number of cylinders, then average mpg within each group
average_mpg_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(average_mpg = mean(mpg, na.rm = TRUE))
head(average_mpg_by_cyl)

In this example, group_by(cyl) segments the data by number of cylinders, and summarize() computes the mean mpg within each group. The result shows that average fuel efficiency falls as the number of cylinders rises.
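Because group_by() accepts more than one variable, the same pattern extends to combinations of groups. The sketch below groups by cylinder count and transmission type (am, where 0 is automatic and 1 is manual); the names mpg_by_cyl_and_am and n_cars are illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# One row per combination of cylinder count and transmission type
mpg_by_cyl_and_am <- mtcars |>
  group_by(cyl, am) |>
  summarize(
    n_cars = n(),                           # cars in each combination
    average_mpg = mean(mpg, na.rm = TRUE),  # mean mpg in each combination
    .groups = "drop"                        # return an ungrouped result
  )

mpg_by_cyl_and_am
```

Setting .groups = "drop" is a design choice here: it avoids carrying leftover grouping into later steps, which can otherwise change how subsequent verbs behave.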