5.4 Data Manipulation with dplyr

dplyr is the core tidyverse package for data manipulation (Wickham et al. 2023). It provides a consistent set of functions, called “verbs”, for the most common data operations. We demonstrate each using the built-in mtcars dataset and the native pipe operator |>, which requires R 4.1 or later.

library(dplyr)  # load the dplyr verbs used throughout this section

data("mtcars")
head(mtcars)

5.4.1 Core Functions and Examples

5.4.1.1 Filtering with filter()

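filter() extracts rows that meet specified conditions. Because mtcars contains only 4-, 6-, and 8-cylinder cars, the condition cyl > 6 below keeps the 8-cylinder cars.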

filtered_data <- mtcars |>
  filter(cyl > 6)

head(filtered_data)
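
filter() also accepts several conditions at once. The following is a minimal sketch, assuming dplyr is installed; the object names eight_cyl_efficient and four_or_six are illustrative and do not come from the original text.

```r
library(dplyr)

# Conditions separated by commas are combined with AND:
# keep 8-cylinder cars that also exceed 15 mpg
eight_cyl_efficient <- mtcars |>
  filter(cyl == 8, mpg > 15)

# %in% matches any of several values (an OR over the listed values)
four_or_six <- mtcars |>
  filter(cyl %in% c(4, 6))

nrow(four_or_six)
```

Separating conditions with commas tends to read more cleanly than chaining them with &, although both forms give the same rows.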

5.4.1.2 Selecting with select()

select() isolates specific columns from a data frame.

selected_data <- mtcars |>
  select(mpg, hp, wt)

head(selected_data)
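
Beyond naming columns one by one, select() can drop columns, match them by pattern, and rename them on the way out. The sketch below is illustrative only; the object names and the new column names horsepower and weight are assumptions made for the example.

```r
library(dplyr)

# Drop columns by prefixing them with a minus sign
without_flags <- mtcars |>
  select(-vs, -am)

# Selection helper: keep every column whose name starts with "d"
d_columns <- mtcars |>
  select(starts_with("d"))

# Rename columns while selecting them (new_name = old_name)
renamed <- mtcars |>
  select(horsepower = hp, weight = wt)

names(d_columns)
names(renamed)
```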

5.4.1.3 Arranging with arrange()

arrange() reorders rows by the values in one or more columns. Use desc() for descending order.

arranged_data <- mtcars |>
  arrange(desc(hp))

head(arranged_data)
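
Sorting by more than one column works the same way: later columns break ties in earlier ones. As a small sketch (the name arranged_multi is illustrative), the code below sorts by cylinder count first and then by mpg from highest to lowest within each cylinder count.

```r
library(dplyr)

# Sort ascending by cyl, then descending by mpg within each cyl value
arranged_multi <- mtcars |>
  arrange(cyl, desc(mpg))

head(arranged_multi, 3)
```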

5.4.1.4 Mutating with mutate()

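mutate() adds new columns or modifies existing ones. Note that wt in mtcars is recorded in thousands of pounds (e.g., 2.62 = 2,620 lbs). Because one pound is about 0.453592 kg, multiplying wt by 453.592 (that is, 1,000 × 0.453592) converts it to kilograms.

mutated_data <- mtcars |>
  mutate(wt_kg = wt * 453.592)  # 1,000 lb per unit * 0.453592 kg per lb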

head(mutated_data)
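
A single mutate() call can create several columns, and a later column may refer to one created earlier in the same call. The sketch below spells out the weight conversion in two steps; lb_to_kg, wt_lb, and converted are names introduced here for illustration, and the conversion factor is rounded.

```r
library(dplyr)

lb_to_kg <- 0.453592  # kilograms per pound (rounded)

converted <- mtcars |>
  mutate(
    wt_lb = wt * 1000,        # thousands of pounds -> pounds
    wt_kg = wt_lb * lb_to_kg  # uses wt_lb, created on the line above
  )

converted |>
  select(wt, wt_lb, wt_kg) |>
  head(3)
```

For the first row (wt = 2.62), this gives 2,620 lb, or roughly 1,188 kg.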

5.4.1.5 Summarizing with summarize()

summarize() reduces data to summary statistics. Without grouping, it collapses the entire data frame into a single row. It is most powerful when combined with group_by().

summary_data <- mtcars |>
  summarize(average_mpg = mean(mpg, na.rm = TRUE))

summary_data  # a single row, so head() is not needed
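
summarize() is not limited to one statistic per call. As an illustrative sketch (the name overall and the column names are assumptions for this example), several summaries can be computed together and returned as one row.

```r
library(dplyr)

# Several summary statistics in one call; the result has one row
overall <- mtcars |>
  summarize(
    n_cars   = n(),       # number of rows
    mean_mpg = mean(mpg),
    sd_mpg   = sd(mpg),
    max_hp   = max(hp)
  )

overall
```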

5.4.1.6 Grouping and Summarizing with group_by() and summarize()

group_by() segments a data frame into groups based on one or more variables. Any subsequent summarize() call computes statistics within each group rather than across the entire dataset.

# Group data by the number of cylinders
average_mpg_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(average_mpg = mean(mpg, na.rm = TRUE))

head(average_mpg_by_cyl)

In this example, group_by(cyl) segments the data by number of cylinders, and summarize() computes the mean mpg within each group. The result shows that mean fuel efficiency falls as the cylinder count rises: roughly 26.7, 19.7, and 15.1 mpg for 4-, 6-, and 8-cylinder cars, respectively.
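
Grouping also works with more than one variable. The sketch below, which is illustrative rather than part of the original example, groups by cylinder count and transmission type (am, where 0 is automatic and 1 is manual) and uses the .groups argument of summarize() to return an ungrouped result; the name by_cyl_am is an assumption.

```r
library(dplyr)

# One row per observed combination of cyl and am
by_cyl_am <- mtcars |>
  group_by(cyl, am) |>
  summarize(
    mean_mpg = mean(mpg),
    n = n(),
    .groups = "drop"  # return an ungrouped data frame
  )

by_cyl_am
```

Setting .groups = "drop" avoids carrying leftover grouping into later steps, where it can cause surprising results.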

5.4.1.7 Counting with count()

count() is a convenient shorthand for group_by() followed by summarize(n = n()), one of the most common operations in exploratory analysis.

# Count the number of cars by cylinder count
mtcars |> count(cyl)
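
To make the equivalence concrete, the sketch below computes the same counts both ways and compares them. The object names are illustrative; sort = TRUE is an argument of count() that orders the result from the largest group to the smallest.

```r
library(dplyr)

# Shorthand form
via_count <- mtcars |>
  count(cyl)

# Long form that count() abbreviates
via_summarize <- mtcars |>
  group_by(cyl) |>
  summarize(n = n())

# The per-group counts agree
identical(via_count$n, via_summarize$n)

# Largest groups first
sorted <- mtcars |>
  count(cyl, sort = TRUE)

sorted
```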