5.1 Introduction

Data preparation is the bridge between raw data and useful analysis. No matter how sophisticated a model or visualization is, its value depends on the quality and structure of the data behind it. In practice, data rarely arrives in a form ready for analysis — it contains missing values, inconsistent formats, duplicated records, and variables spread across multiple files or arranged in inconvenient layouts (Wickham 2014).

5.1.1 The Data Preparation Workflow

Data preparation in a BI context typically follows a general workflow:

  1. Data ingestion — Importing data from its source: databases (via SQL queries), flat files (CSV, Excel), APIs, or web scraping. The format and quality of the incoming data determine how much cleaning is needed.
  2. Initial inspection — Examining the structure, data types, dimensions, and summary statistics of the dataset. Functions like str(), summary(), head(), and dplyr's glimpse() are essential at this stage.
  3. Cleaning — Addressing data quality issues: correcting data types, handling missing values, removing duplicates, standardizing formats, and recoding variables.
  4. Transformation — Reshaping, filtering, aggregating, and creating derived variables to prepare the data for a specific analysis or model.
  5. Validation — Verifying that the cleaned dataset meets expectations: correct dimensions, expected ranges, no remaining anomalies.
  6. Documentation — Recording what was done and why, through code comments, data dictionaries, and organized project structures.
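The steps above can be sketched in a few lines of R. This is a minimal illustration using a small made-up data frame (the `sales` data and all column names are hypothetical, not from this book's datasets); it uses the base pipe `|>`, which requires R 4.1 or later.

```r
library(dplyr)

# 1. Ingestion -- here the data is defined inline; in practice you would
#    use read.csv(), readr::read_csv(), or a database connection.
sales <- data.frame(
  order_id = c(1, 2, 2, 3, 4),
  region   = c("North", "south", "south", "North", NA),
  amount   = c(120.5, 80.0, 80.0, NA, 95.25)
)

# 2. Initial inspection: structure and summary statistics
str(sales)
summary(sales)

# 3-4. Cleaning and transformation: standardize an inconsistent
#      category, drop the duplicated record, remove incomplete rows
sales_clean <- sales |>
  mutate(region = ifelse(region == "south", "South", region)) |>
  distinct() |>
  filter(!is.na(amount), !is.na(region))

# 5. Validation: verify the result meets expectations
stopifnot(nrow(sales_clean) == 2, all(sales_clean$amount > 0))
```

Step 6, documentation, is represented here by the comments themselves — in a real project it would also include a data dictionary and an organized script or notebook.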

This workflow is not strictly linear — you will often cycle back to earlier steps as you discover new issues. But having a systematic approach prevents ad hoc cleaning decisions that are difficult to reproduce or explain.

5.1.2 Chapter Overview

This chapter covers the principles behind each step in this workflow and the R tools to execute them. We begin with the foundations: tidy data, the characteristics of analysis-ready data, and project organization. We then introduce the Tidyverse and work through its core operations: manipulating data with dplyr, reshaping it with tidyr, joining datasets from multiple sources, and handling missing values and duplicates. We also show how dplyr operations map to SQL — the standard language for querying databases — and examine how AI tools are beginning to assist with data preparation.
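As a small preview of the dplyr-to-SQL correspondence covered later in the chapter, each verb in a pipeline has a direct SQL counterpart (shown here as comments; the example runs on R's built-in mtcars data):

```r
library(dplyr)

# A dplyr pipeline and the SQL clauses it corresponds to
mtcars |>
  filter(cyl == 6) |>      # WHERE cyl = 6
  select(mpg, cyl) |>      # SELECT mpg, cyl
  arrange(desc(mpg))       # ORDER BY mpg DESC
```

When the same pipeline is applied to a database-backed table, dbplyr can translate it into actual SQL (e.g. via show_query()) rather than running it in R.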

The R skills from Chapter 3 — variables, data frames, functions, and the pipe operator — are put to extensive use here. If any of those concepts feel unfamiliar, revisit the relevant sections before proceeding.