5.3 The Tidyverse
The Tidyverse is an ecosystem of R packages designed for data science that share a coherent design philosophy (Wickham et al. 2019). Understanding this philosophy — not just the individual functions — is what makes the tidyverse approach powerful and transferable.
5.3.1 Design Philosophy
The tidyverse is built around several key principles:
Compose simple operations. Rather than building monolithic functions that do everything at once, tidyverse functions each do one thing well. Complex transformations are expressed by chaining simple operations together with the pipe operator (|>). This makes code easier to write, read, and debug — you can inspect the result at any point in the chain.
Consistency across packages. Tidyverse packages share naming conventions, argument patterns, and data structures. Once you learn how dplyr::filter() works, you can predict how other tidyverse functions will behave. For the data-manipulation verbs, the first argument is the data, functions are named as verbs that describe what they do, and the output is again a data frame, which is what allows each step to feed into the next.
Tidy data as the common currency. Most tidyverse tools are designed to take tidy input and produce tidy output. This means you spend less time converting between formats and more time on analysis. The tidy data principles covered in the previous section are the structural foundation that makes this possible.
Human-readable code. Tidyverse code is designed to read like a description of what it does. A pipeline like data |> filter(age > 30) |> group_by(department) |> summarize(avg_salary = mean(salary)) reads almost like English: “take the data, keep rows where age is over 30, group by department, and compute the average salary.” This readability is critical in professional BI settings where code must be reviewed, maintained, and trusted by others.
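The principles above can be illustrated with a minimal sketch. The employees data frame below is hypothetical and invented for illustration only; the column names age, department, and salary simply mirror the pipeline quoted above. The native pipe (|>) requires R 4.1 or later.

```r
library(dplyr)

# Hypothetical employee data, invented for illustration
employees <- data.frame(
  name       = c("Ana", "Ben", "Caro", "Dev", "Eli"),
  age        = c(28, 35, 41, 52, 33),
  department = c("Sales", "Sales", "IT", "IT", "Sales"),
  salary     = c(42000, 50000, 61000, 70000, 54000)
)

# Inspect an intermediate result: stop the chain after the first step
over_30 <- employees |>
  filter(age > 30)
over_30

# The full pipeline: filter, then group, then summarize
avg_by_dept <- employees |>
  filter(age > 30) |>
  group_by(department) |>
  summarize(avg_salary = mean(salary))
avg_by_dept
```

Because each verb takes a data frame and returns a data frame, the chain can be cut after any step to check what that step produced.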
5.3.2 Core Packages
The tidyverse includes many packages, but this chapter focuses on the core tools for data preparation:
| Package | Purpose | Key Functions |
|---|---|---|
| dplyr | Data manipulation | filter(), select(), mutate(), summarize(), group_by(), *_join() |
| tidyr | Reshaping data | pivot_longer(), pivot_wider() |
| readr | Importing flat files | read_csv(), read_delim(), read_tsv() |
| ggplot2 | Visualization | Covered in Chapter 7 |
| stringr | String manipulation | str_replace(), str_detect(), str_to_lower() |
| forcats | Factor handling | fct_relevel(), fct_recode(), fct_lump() |
Additional packages like lubridate (dates and times), purrr (functional programming), and tibble (enhanced data frames) round out the ecosystem. You do not need to memorize all of these — the key point is that they work together seamlessly because they share the same design philosophy.
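As a small sketch of how these packages combine, the example below uses stringr and forcats functions from the table inside a dplyr pipeline. The survey data frame and its values are hypothetical, made up only to show the pattern.

```r
library(dplyr)
library(stringr)
library(forcats)

# Hypothetical survey data with inconsistent capitalization
survey <- data.frame(
  respondent = 1:4,
  dept       = c("Sales", "SALES", "IT", "it")
)

cleaned <- survey |>
  mutate(
    dept = str_to_lower(dept),                 # stringr: normalize case
    dept = fct_relevel(factor(dept), "sales")  # forcats: move "sales" to the first level
  )

levels(cleaned$dept)
```

The stringr and forcats functions operate on a single column (a vector), while mutate() supplies the data-frame context, so the pieces slot together without any conversion step.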
5.3.3 Loading the Tidyverse
We load the tidyverse with pacman, as introduced in Chapter 3:
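A minimal sketch of that call, assuming pacman is already installed as set up in Chapter 3 (the exact form used there is not reproduced here):

```r
# Assumes the pacman package is already installed (see Chapter 3)
pacman::p_load(tidyverse)
```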
This single call loads the core tidyverse packages. You may see a startup message listing which packages were attached and any function name conflicts (for example, dplyr::filter() masks stats::filter()). This is expected and can usually be ignored.
5.3.4 Importing Data with readr
The first step in most analyses is importing data from an external file. The readr package provides fast, consistent functions for reading flat files:
# Read a comma-separated file
data <- read_csv("data/my_dataset.csv")
# Read a file with a different delimiter (e.g., semicolons)
data <- read_delim("data/my_dataset.csv", delim = ";")

read_csv() and read_delim() guess column types from the data, handle quoted strings, and parse dates written in standard formats. They also return a “tibble” (the tidyverse’s enhanced data frame) and print a column specification message showing how each column was parsed. In Chapter 6, we use read_delim() to load the Absenteeism at Work dataset, which uses semicolons as delimiters.
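The column specification can also be stated explicitly rather than guessed. The following is a sketch under stated assumptions: the file contents and the column names id, start_date, and hours are hypothetical, written to a temporary file only so the example is self-contained.

```r
library(readr)

# Write a small semicolon-delimited file to a temporary location
# (hypothetical data, used only to make the example runnable)
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("id;start_date;hours",
    "1;2024-01-15;8",
    "2;2024-02-01;4"),
  tmp
)

# Option 1: let readr guess the column types, then inspect the result
guessed <- read_delim(tmp, delim = ";")
spec(guessed)

# Option 2: declare the column types up front
explicit <- read_delim(
  tmp,
  delim = ";",
  col_types = cols(
    id         = col_integer(),
    start_date = col_date(format = "%Y-%m-%d"),
    hours      = col_double()
  )
)
explicit
```

Declaring col_types makes the import reproducible: if a later version of the file contains an unexpected value, readr reports a parsing problem instead of silently guessing a different type.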