5.3 The Tidyverse

The tidyverse is an ecosystem of R packages for data science that share a coherent design philosophy (Wickham et al. 2019). Understanding this philosophy, not just the individual functions, is what makes the tidyverse approach powerful and transferable.

5.3.1 Design Philosophy

The tidyverse is built around several key principles:

Compose simple operations. Rather than building monolithic functions that do everything at once, tidyverse functions each do one thing well. Complex transformations are expressed by chaining simple operations together with the pipe operator (|>, R's native pipe, available since R 4.1), which passes the result on its left as the first argument of the function on its right. This makes code easier to write, read, and debug, because you can inspect the result at any point in the chain.
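As a minimal sketch of this idea, the chain below is built one step at a time and inspected in between. The sales data frame and its columns are hypothetical, invented only for illustration.

```r
library(dplyr)

# Hypothetical example data (not one of the chapter's datasets)
sales <- tibble::tibble(
  region = c("North", "South", "North", "West"),
  units  = c(10, 25, 40, 5)
)

# Step 1: keep the larger orders, then stop and inspect the intermediate result
large_orders <- sales |> filter(units >= 10)
large_orders

# Step 2: extend the same chain with one more simple operation
large_orders_sorted <- sales |>
  filter(units >= 10) |>
  arrange(desc(units))
large_orders_sorted
```

Because each step returns a data frame, the chain can be cut after any pipe to check what the data looks like at that point.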

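Consistency across packages. Tidyverse packages share naming conventions, argument patterns, and data structures. Once you learn how dplyr::filter() works, you can predict how other tidyverse functions will behave. For the data-manipulation verbs in dplyr and tidyr, the first argument is the data, function names are verbs that describe what they do, and the output is again a data frame, which is what allows the functions to be chained. Functions in stringr and forcats follow the same data-first pattern but operate on vectors rather than data frames.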

Tidy data as the common currency. Tidyverse tools are designed around tidy data: most functions expect tidy input and return tidy output, and tidyr exists to reshape data that is not yet tidy. This means you spend less time converting between formats and more time on analysis. The tidy data principles covered in the previous section are the structural foundation that makes this possible.

Human-readable code. Tidyverse code is designed to read like a description of what it does. A pipeline like data |> filter(age > 30) |> group_by(department) |> summarize(avg_salary = mean(salary)) reads almost like English: “take the data, keep rows where age is over 30, group by department, and compute the average salary.” This readability is critical in professional BI settings where code must be reviewed, maintained, and trusted by others.
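The pipeline quoted above can be run as follows. The employees data frame is a hypothetical stand-in with the three columns the pipeline refers to; the values are made up for illustration.

```r
library(dplyr)

# Hypothetical employee records, invented for illustration
employees <- tibble::tibble(
  department = c("Sales", "Sales", "IT", "IT", "HR"),
  age        = c(28, 45, 35, 52, 41),
  salary     = c(40000, 60000, 70000, 90000, 50000)
)

avg_by_department <- employees |>
  filter(age > 30) |>                    # keep rows where age is over 30
  group_by(department) |>                # group by department
  summarize(avg_salary = mean(salary))   # compute the average salary per group

avg_by_department
```

Each line of the pipeline corresponds to one clause of the English description, which is what makes this style straightforward to review.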

5.3.2 Core Packages

The tidyverse includes many packages, but this chapter focuses on the core tools for data preparation:

| Package | Purpose | Key Functions |
|---|---|---|
| dplyr | Data manipulation | filter(), select(), mutate(), summarize(), group_by(), *_join() |
| tidyr | Reshaping data | pivot_longer(), pivot_wider() |
| readr | Importing flat files | read_csv(), read_delim(), read_tsv() |
| ggplot2 | Visualization | Covered in Chapter 7 |
| stringr | String manipulation | str_replace(), str_detect(), str_to_lower() |
| forcats | Factor handling | fct_relevel(), fct_recode(), fct_lump() |
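To give a feel for the two packages that work on vectors rather than data frames, here is a small sketch using stringr and forcats. The department labels are hypothetical, and str_trim() is an additional stringr function not listed in the table.

```r
library(stringr)
library(forcats)

# Hypothetical messy labels, invented for illustration
departments <- c("Sales ", "sales", "IT", "it", "HR")

# stringr: normalise case, then trim stray whitespace
clean <- str_trim(str_to_lower(departments))
str_detect(clean, "^s")   # which labels start with "s"?

# forcats: turn the cleaned labels into a factor and move "it" to the front
dept_factor <- fct_relevel(factor(clean), "it")
levels(dept_factor)
```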

Additional packages like lubridate (dates and times), purrr (functional programming), and tibble (enhanced data frames) round out the ecosystem. You do not need to memorize all of these — the key point is that they work together seamlessly because they share the same design philosophy.

5.3.3 Loading the Tidyverse

We load the tidyverse with pacman, as introduced in Chapter 3:

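# Install pacman if it is not already available
if (!require("pacman")) install.packages("pacman")
# Install the tidyverse if needed, then attach its core packages
pacman::p_load("tidyverse")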

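This single call installs the tidyverse if it is missing and attaches the core packages. You may see a startup message listing the attached packages and any function name conflicts. For example, dplyr::filter() and dplyr::lag() mask the functions of the same name in the stats package. This is normal: after loading, a bare filter() call refers to the dplyr version, and the masked function remains available as stats::filter().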
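The following sketch illustrates what such a conflict means in practice. It attaches dplyr directly rather than through pacman to stay self-contained, and the scores data frame is hypothetical.

```r
library(dplyr)

# Hypothetical data for illustration
scores <- data.frame(id = 1:4, score = c(55, 72, 90, 64))

# With dplyr attached, a bare filter() call resolves to dplyr::filter()
passed <- filter(scores, score >= 60)
passed

# The masked function is still reachable through its namespace prefix;
# stats::filter() applies a linear filter (here a two-point moving average)
smoothed <- stats::filter(scores$score, rep(1 / 2, 2))
```

Writing the package prefix explicitly, as in dplyr::filter(), is one way to keep code unambiguous when two attached packages export the same name.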

5.3.4 Importing Data with readr

The first step in most analyses is importing data from an external file. The readr package provides fast, consistent functions for reading flat files:

# Read a comma-separated file
data <- read_csv("data/my_dataset.csv")

# Read a file with a different delimiter (e.g., semicolons)
data <- read_delim("data/my_dataset.csv", delim = ";")

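read_csv() and read_delim() guess each column's type from the data (by default from the first 1,000 rows), handle quoted strings, and recognize dates written in ISO 8601 form (YYYY-MM-DD). They return a “tibble”, the tidyverse’s enhanced data frame, and print a column specification message showing how each column was parsed. It is worth reading this message, because a wrong guess is easier to fix at import than later in the analysis. In Chapter 6, we use read_delim() to load the Absenteeism at Work dataset, which uses semicolons as delimiters.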
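When the guessed types matter, they can be stated explicitly instead. The sketch below assumes readr 2.0 or later, where literal text wrapped in I() is read as data rather than as a file path. The column names and values are hypothetical and do not come from the Absenteeism at Work dataset.

```r
library(readr)

# Hypothetical semicolon-delimited content; I() marks it as literal data
raw_text <- I("id;start_date;hours\n1;2024-01-15;7.5\n2;2024-02-01;8\n")

# Option 1: let readr guess the column types, then review the guess
guessed <- read_delim(raw_text, delim = ";", show_col_types = FALSE)
spec(guessed)

# Option 2: declare the types up front instead of relying on guessing
explicit <- read_delim(
  raw_text,
  delim = ";",
  col_types = cols(
    id         = col_integer(),
    start_date = col_date(format = "%Y-%m-%d"),
    hours      = col_double()
  )
)
explicit
```

Declaring col_types also suppresses the column specification message, since nothing had to be guessed.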