5 Data Preparation and Cleaning
5.1 Chapter Goals
Upon concluding this chapter, readers will be equipped with the skills to:
- Understand the principles of tidy data and its importance in data analysis, including the organization of datasets to facilitate easy access and manipulation.
- Utilize the Tidyverse collection of R packages for efficient data import, tidying, manipulation, and visualization, enhancing the data analysis workflow.
- Implement data importing techniques for common file formats such as CSV and Excel, using Tidyverse packages like readr and readxl.
- Apply dplyr functions for data manipulation tasks, including filtering, selecting, arranging, mutating, and summarizing datasets to extract meaningful insights.
- Conduct data tidying operations with tidyr, transforming datasets between wide and long formats, to adhere to tidy data principles and prepare data for analysis.
5.2 The Essence of Tidy Data
The often-quoted statistic in data analysis is that 80% of the work involves cleaning and preparing the data, a task that is not just a preliminary step but a continuous necessity throughout the analytical process. As new challenges emerge or additional data is gathered, the cycle of cleaning and preparation repeats. This chapter delves into a crucial subset of data cleaning known as data tidying, which is the practice of structuring datasets to streamline analysis.
Data tidying is underpinned by a set of principles that provide a standardized approach to organizing the values within a dataset. Such a standard is invaluable as it simplifies the initial stages of data cleaning, eliminating the need to reinvent procedures for each new dataset. The tidy data paradigm is designed to ease the exploration and analysis of data, as well as the development of analysis tools that are interoperable, reducing the time spent on translating outputs from one tool to input for another. Tidy datasets and tools that adhere to this standard work in tandem, allowing analysts to concentrate on the substantive questions of their domain rather than the logistical hurdles of data manipulation.
5.2.1 Defining Tidy Data
The concept of tidy data draws an analogy to Leo Tolstoy’s observation about families: while every tidy dataset shares common characteristics, each messy dataset is unique in its disorder. Tidy datasets align the dataset’s structure—its physical layout—with its semantics—the meaning of its data. This alignment involves a clear vocabulary for describing data structure and semantics, leading to a definition of tidy data.
5.2.1.1 Data Structure
Statistical datasets are typically structured as data frames, consisting of rows and columns, with columns usually labeled and rows occasionally so. However, the mere description of data in terms of rows and columns is insufficient to capture the essence of a dataset’s structure, as the same data can be organized in various layouts without altering its underlying meaning.
5.2.1.2 Data Semantics
A dataset comprises values, which can be numbers (quantitative) or strings (qualitative), organized by variables and observations. A variable groups all values measuring the same attribute across different units, while an observation aggregates all values from the same unit across various attributes.
5.2.2 Principles of Tidy Data
In the realm of tidy data, the organization of data adheres to three fundamental principles:
Each variable forms a column: This principle ensures that each column in the dataset represents one and only one variable, facilitating easy access to all values associated with a variable.
Each observation forms a row: This ensures that each row in the dataset contains all information pertaining to a single observation, making it easier to analyze or compare individual data points.
Each value occupies a single cell: By adhering to this principle, the dataset maintains a granularity that ensures each cell in the dataset contains only one piece of information, thereby eliminating ambiguity and complexity in data interpretation.
These principles align with Codd’s 3rd normal form, a concept from database design, but are reframed in statistical language and applied to a single dataset rather than to the interconnected datasets typical in relational databases. By contrast, any dataset arrangement that deviates from these principles is considered “messy” data.
The tidy data format not only clarifies the dataset’s semantics but also explicitly defines an observation, aiding in the identification of missing values and their implications. For instance, in a classroom dataset, an absence might be indicated by a missing value, influencing the approach to computing final grades or class averages.
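To make these principles concrete, here is a small, hypothetical illustration: the same test scores stored in a messy layout (one column per year) and in a tidy layout (one column per variable, one row per observation).
# Messy layout: the variable "year" is spread across the column names
messy_scores <- data.frame(
  student = c("Ana", "Ben"),
  score_2021 = c(85, 90),
  score_2022 = c(88, 92)
)
# Tidy layout: each variable (student, year, score) forms a column and
# each row is one observation (one student in one year)
tidy_scores <- data.frame(
  student = c("Ana", "Ana", "Ben", "Ben"),
  year = c(2021, 2022, 2021, 2022),
  score = c(85, 88, 90, 92)
)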
5.2.3 The Advantages of Tidy Data
Tidy data offers a standardized method for mapping a dataset’s meaning to its structure, simplifying the extraction of variables for analysis. This standardization is particularly beneficial in vectorized programming languages like R, where tidy data aligns with the language’s data manipulation and analysis paradigms. Moreover, tidy data facilitates a logical ordering of variables and observations, enhancing readability and easing the identification of variables fixed by design versus those measured during an experiment.
5.3 Introduction to the Tidyverse
The Tidyverse is an ecosystem of R packages designed to facilitate data science tasks with a consistent philosophy, grammar, and data structure. It simplifies the process of data import, cleaning, manipulation, visualization, and analysis. Core packages include ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, each serving a specific function in the data science workflow.
5.3.1 Core Packages Overview
- dplyr: For data manipulation, offering a set of verbs for data operations.
- ggplot2: For data visualization, based on the Grammar of Graphics.
- tidyr: For data tidying, ensuring a consistent structure.
- readr: For importing data from flat files like CSV and TSV.
5.4 Installing R and the Tidyverse
To streamline package management, use the pacman package. pacman allows for efficient installation and loading of R packages. Install pacman and then use it to install and load the Tidyverse with the following code:
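# A minimal sketch of the setup described below: install pacman if it
# is not already available, then use it to install and load the Tidyverse
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}
pacman::p_load(tidyverse)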
This script checks for pacman and installs it if necessary. It then uses pacman to load the Tidyverse, installing it first if it’s not already installed on your system.
5.5 Data Manipulation with dplyr
dplyr is a cornerstone package within the Tidyverse for data manipulation in R. It introduces several key functions that make data manipulation tasks more intuitive and efficient. Below, we explore these functions using the mtcars dataset, a classic dataset built into R that comprises various aspects of automobile design and performance for 32 automobiles.
dplyr introduces key functions for efficient data manipulation:
- Filtering: filter() extracts subsets of rows based on conditions.
- Selecting: select() focuses on specific columns.
- Arranging: arrange() reorders rows.
- Mutating: mutate() adds or modifies columns.
- Summarizing: summarize() reduces data to summary statistics.
5.5.1 Core Functions and Examples
5.5.1.1 Filtering with filter()
The filter() function extracts subsets of rows from a dataframe based on specified conditions. It’s useful for focusing on observations that meet certain criteria.
Example: Select cars with more than 6 cylinders from the mtcars dataset.
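A minimal sketch of that filter (the result name high_cyl_cars is illustrative):
# Keep only the cars with more than six cylinders
high_cyl_cars <- mtcars |>
  filter(cyl > 6)
high_cyl_cars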
5.5.1.2 Selecting with select()
The select() function allows you to isolate specific columns from a dataframe, simplifying the dataset to only include the information you need.
Example: Choose columns related to car performance, such as miles per gallon (mpg), horsepower (hp), and weight (wt).
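A sketch of that selection (the name performance_data is illustrative):
# Keep only the performance-related columns
performance_data <- mtcars |>
  select(mpg, hp, wt)
head(performance_data)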
5.5.1.3 Arranging with arrange()
arrange() reorders the rows of a dataframe based on the values in one or more columns. It’s helpful for sorting your data in an ascending or descending order.
Example: Order the cars by decreasing horsepower.
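A sketch of that ordering (the name cars_by_hp is illustrative):
# Sort the cars from highest to lowest horsepower
cars_by_hp <- mtcars |>
  arrange(desc(hp))
head(cars_by_hp)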
5.5.1.4 Mutating with mutate()
The mutate() function is used to add new columns to a dataframe or modify existing ones. This is often done by performing operations on other columns.
Example: Create a new column for car weight in kilograms, assuming 1 lb = 0.453592 kg.
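A sketch of that conversion (the name mtcars_kg is illustrative); note that wt in mtcars is recorded in thousands of pounds, so the code first scales to pounds before applying the conversion factor:
# Add weight in kilograms; wt is in units of 1000 lbs
mtcars_kg <- mtcars |>
  mutate(wt_kg = wt * 1000 * 0.453592)
head(mtcars_kg[, c("wt", "wt_kg")])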
5.5.1.5 Summarizing with summarize()
summarize() reduces your data to summary statistics. This function is particularly powerful when used in conjunction with group_by(), allowing you to compute summaries for groups of data.
Example: Calculate the average miles per gallon across all cars.
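A sketch of that summary:
# Average fuel efficiency across all cars
mtcars |>
  summarize(average_mpg = mean(mpg, na.rm = TRUE))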
Each of these functions plays a vital role in the data manipulation process, enabling you to filter, select, arrange, mutate, and summarize your data efficiently. By integrating these functions into your data analysis workflow, you can enhance the clarity and efficiency of your R programming.
5.5.1.6 Grouping and Summarizing with group_by() and summarize()
While summarize() is great for reducing your entire dataset to summary statistics, its true power is unleashed when combined with group_by(). This pairing allows you to compute summary statistics for distinct groups within your data, making it invaluable for comparative analysis.
Example: Calculate the average miles per gallon (mpg) for each number of cylinders (cyl) in the mtcars dataset.
# Group data by the number of cylinders
average_mpg_by_cyl <- mtcars |>
group_by(cyl) |>
summarize(average_mpg = mean(mpg, na.rm = TRUE))
head(average_mpg_by_cyl)

In this example, group_by(cyl) segments the mtcars dataset into groups based on the number of cylinders (cyl). Then, summarize(average_mpg = mean(mpg, na.rm = TRUE)) calculates the average miles per gallon for each group of cylinders. This approach is particularly useful for exploring how certain characteristics (like fuel efficiency in this case) vary across different categories (such as engine size denoted by the number of cylinders).
5.6 Data Tidying with tidyr using Pipe Operator
tidyr is a crucial component of the Tidyverse, designed to facilitate the tidying of data into a format suitable for analysis. The pivot_longer() and pivot_wider() functions are key for reshaping data frames, and their use alongside the pipe operator (|>) significantly improves the readability and efficiency of the code.
5.6.1 Pivoting Longer with pivot_longer()
The pivot_longer() function is adept at transforming data from a wide format, where multiple columns represent different levels of a single variable, into a longer format, where all data for a variable is contained in a single column.
Example with Hypothetical Survey Data: In this example, we’ll work with a fabricated wide-format dataset named survey_data, representing average satisfaction scores for two groups (A and B) over two years (2021 and 2022).
# Hypothetical survey data in a wide format
survey_data <- tibble(
year = c(2021, 2022),
A = c(8.2, 8.4),
B = c(7.8, 7.9)
)
survey_data

# Transforming the wide-format survey data into a longer format
long_survey_data <- survey_data |>
pivot_longer(
cols = -year,
names_to = "Group",
values_to = "Satisfaction_Score"
)
long_survey_data

In this transformation, pivot_longer() consolidates the group columns A and B into two new columns: Group, which records the group each score belongs to, and Satisfaction_Score, which holds the scores themselves. The cols = -year argument tells tidyr to pivot every column except year, which is retained as an identifier for each row.
5.6.2 Pivoting Wider with pivot_wider()
The pivot_wider() function performs the inverse operation, spreading the values of a single column across multiple new columns to return data to a wide format.
# Transforming the long-format data back into a wide format
wide_survey_data_again <- long_survey_data |>
pivot_wider(
names_from = c(Group),
values_from = Satisfaction_Score
)
wide_survey_data_again

In this transformation, pivot_wider() takes long_survey_data and spreads the Group values back across multiple columns, using Satisfaction_Score as the values to fill the cells. The names_from argument specifies which column in the long-format data supplies the new column names, while values_from specifies which column provides the values.
The result, wide_survey_data_again, is a dataframe where each satisfaction score is associated with a specific group and year, similar to the original survey_data but potentially reordered or reformatted based on the transformations applied during the pivot processes.
5.7 Joining Data Sets with dplyr
Combining information from different sources is a common task in data analysis. dplyr offers a variety of join functions to facilitate this, each tailored to a different joining scenario:
- inner_join(): Returns rows with matching values in both datasets.
- left_join(): Returns all rows from the left dataset, with matching rows from the right dataset.
- right_join(): Returns all rows from the right dataset, with matching rows from the left dataset.
- full_join(): Returns all rows from both datasets, with matches where available.
- semi_join(): Returns all rows from the left dataset where there are matching values in the right dataset, but only columns from the left dataset.
- anti_join(): Returns all rows from the left dataset where there are no matching values in the right dataset, keeping only columns from the left dataset.
5.7.1 Preparing Sample Data Frames
Before discussing the various join functions available in dplyr for combining datasets, let’s set up two sample data frames, mtcars1 and mtcars2, to use as examples. These data frames will be subsets of the built-in mtcars dataset in R, with some modifications to illustrate different joining scenarios.
5.7.1.1 mtcars1 DataFrame
This data frame will include a subset of the original mtcars columns, such as mpg, cyl, disp, and hp, along with a new column car_model as a unique identifier.
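A sketch of that construction; the row range 1:15 and the marker column df.1 are assumptions, chosen to mirror the mtcars2 code below and to overlap only partially with its rows 10:25:
# Subset rows 1-15 and the performance columns; add car_model as a key
mtcars1 <- mtcars[1:15, c("mpg", "cyl", "disp", "hp")]
mtcars1$car_model <- rownames(mtcars[1:15, ])
mtcars1$df.1 <- "mtcars1" # marker column, mirroring df.2 in mtcars2
mtcars1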
5.7.1.2 mtcars2 DataFrame
This data frame will contain a different subset of columns, namely gear and carb, along with the car_model column. It will include some car models not present in mtcars1 and exclude others that are, to demonstrate various join behaviors.
mtcars2 <- mtcars[10:25, c("gear", "carb")]
mtcars2$car_model <- rownames(mtcars[10:25, ])
mtcars2$df.2 <- "mtcars2"
mtcars2

With these sample data frames, we can explore how different dplyr join functions operate.
5.7.2 inner_join()
Purpose: Merges rows with matching values in both tables, providing a dataset that includes only the rows common to both tables.
Figure: Venn diagram of mtcars1 and mtcars2. The highlighted section depicts the result of the inner join: the intersection of the two datasets, that is, the rows with matching entries in the key columns of both datasets.
Example:
# Inner join between mtcars1 and mtcars2 on car_model
joined_data_inner <- inner_join(mtcars1, mtcars2, by = "car_model")
joined_data_inner

This join results in a dataset containing cars present in both mtcars1 and mtcars2, with all columns from both datasets included for the matching rows.
5.7.3 left_join()
Purpose: Returns all rows from the left table (first specified), along with the matching rows from the right table. Rows in the left table without matches in the right table will have NA in the columns from the right table.
Figure: Venn diagram of mtcars1 and mtcars2, where the left dataset is mtcars1. The highlighted section depicts the result of the left join: all rows in the left dataset mtcars1 plus the intersection of the two datasets, that is, the rows with matching entries in the key columns of both datasets.
Example:
# Left join mtcars1 with mtcars2 on car_model
joined_data_left <- left_join(mtcars1, mtcars2, by = "car_model")
joined_data_left

This join produces a dataset with all cars from mtcars1, supplemented with gear and carb data from mtcars2 where available.
5.7.4 right_join()
Purpose: Returns all rows from the right table, with matching rows from the left table. Rows in the right table without matches in the left table will have NA in the columns from the left table.
Figure: Venn diagram of mtcars1 and mtcars2, where the right dataset is mtcars2. The highlighted section depicts the result of the right join: all rows in the right dataset mtcars2 plus the intersection of the two datasets, that is, the rows with matching entries in the key columns of both datasets.
Example:
# Right join mtcars1 with mtcars2 on car_model
joined_data_right <- right_join(mtcars1, mtcars2, by = "car_model")
joined_data_right

This join yields a dataset containing all cars from mtcars2, with mpg, cyl, disp, and hp data from mtcars1 where matches are found.
5.7.5 full_join()
Purpose: Combines all rows from both tables, with matches where available. Rows without matches in the opposite table will have NA in the corresponding columns.
Figure: Venn diagram of mtcars1 and mtcars2. The highlighted section depicts the result of the full join: all rows in both datasets. Rows in the intersection of the two datasets have entries in all columns; rows not in both datasets have NAs in the missing columns.
Example:
# Full join between mtcars1 and mtcars2 on car_model
joined_data_full <- full_join(mtcars1, mtcars2, by = "car_model")
joined_data_full

This join results in a dataset that includes all cars from both mtcars1 and mtcars2, ensuring no data is lost, and filling missing values with NA.
5.7.6 semi_join()
Purpose: Returns all rows from the left table where there are matching values in the right table, but only includes columns from the left table.
Example:
# Semi join mtcars1 with mtcars2 on car_model
joined_data_semi <- semi_join(mtcars1, mtcars2, by = "car_model")
joined_data_semi

This join creates a dataset with rows from mtcars1 that have a corresponding match in mtcars2, but only retains columns from mtcars1.
5.7.7 anti_join()
Purpose: Returns all rows from the left table where there are no matching values in the right table, keeping only columns from the left table.
Example:
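# Anti join mtcars1 with mtcars2 on car_model (a sketch mirroring the
# other join examples; the name joined_data_anti follows their pattern)
joined_data_anti <- anti_join(mtcars1, mtcars2, by = "car_model")
joined_data_anti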
This join produces a dataset consisting of cars in mtcars1 that do not have a match in mtcars2, with columns only from mtcars1.
5.7.8 Considerations
- Be mindful of data duplication, particularly when key columns contain non-unique values.
- Ensure that key columns used for joining have matching data types in both tables.
- Resolve column name conflicts; dplyr will append suffixes to duplicate column names to distinguish them.
5.8 Data Cleaning Techniques
Before diving into data cleaning techniques, let’s create a sample DataFrame that contains both missing and duplicate observations to illustrate the process more concretely.
5.8.1 Creating a Sample DataFrame
We can create a sample DataFrame in R with the following characteristics:
- Some rows have missing values (NA) in one or more columns.
- Some rows are duplicates, meaning they are identical across all columns.
# Creating a sample DataFrame with missing values and duplicates
set.seed(123) # For reproducibility
sensor_data <- data.frame(
timestamp = seq(as.POSIXct("2024-01-01"), by = "hour", length.out = 20),
temperature = rnorm(20, mean = 20, sd = 5),
humidity = rnorm(20, mean = 50, sd = 10)
)
# Introducing missing values randomly
sensor_data[sample(1:20, 5), "temperature"] <- NA
sensor_data[sample(1:20, 5), "humidity"] <- NA
# Adding duplicate rows
sensor_data <- rbind(sensor_data, sensor_data[sample(1:20, 5), ])
sensor_data

In this sample sensor_data DataFrame, we have temperature and humidity readings taken hourly. We introduce missing values in both columns and add some duplicate rows to simulate a realistic dataset that needs cleaning.
5.8.2 Handling Missing Values
Managing missing values is a crucial step in data cleaning. The strategies for handling missing data include removal and imputation.
5.8.2.1 Removing Rows with Missing Values
To clean the dataset, rows with missing values can be removed if losing some data does not significantly impact the analysis.
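A sketch of that step (the result name sensor_data_complete is illustrative):
# Removing every row that contains at least one NA
sensor_data_complete <- sensor_data |>
  drop_na()
sensor_data_complete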
Here, drop_na() is used on sensor_data to eliminate any row that contains at least one NA value, simplifying subsequent analyses but potentially losing valuable information.
5.8.2.2 Replacing Missing Values
Alternatively, missing values can be replaced with a specific value like the mean or median, which helps maintain the dataset size.
# Replacing missing values with the mean
mean_temp <- mean(sensor_data$temperature, na.rm = TRUE)
sensor_data_clean <- sensor_data |>
mutate(
temperature_mean = ifelse(is.na(temperature),mean_temp,temperature)
)
sensor_data_clean

In this case, missing temperature values are replaced with the column’s mean temperature, computed while ignoring NAs with na.rm = TRUE.
5.8.3 Dealing with Duplicates
Duplicate observations can bias the analysis, making it essential to remove them to maintain data integrity.
5.8.3.1 Removing Duplicate Rows
Removing duplicates ensures that each observation uniquely contributes to the analysis.
# Removing duplicate rows to retain only unique observations
sensor_data_unique <- distinct(sensor_data)
sensor_data_unique

Using distinct(), we remove duplicate rows from sensor_data, keeping only the first instance of each duplicate by default, thus ensuring data uniqueness.
By addressing both missing values and duplicates as demonstrated, we can enhance the quality and reliability of our data analyses.