5 Data Preparation and Cleaning

A commonly cited estimate in data analysis is that up to 80% of the work involves cleaning and preparing the data. This is not just a preliminary step — it is a continuous necessity throughout any BI project. As new data arrives, variables need recoding, formats need standardizing, and errors need correcting. The skills in this chapter — structuring data into tidy format, manipulating it with dplyr, reshaping it with tidyr, joining datasets, and handling missing values and duplicates — form the practical backbone of every analysis in this textbook and in professional BI work.

Chapter Goals

Upon concluding this chapter, readers will be able to:

Define tidy data and explain why structuring datasets according to tidy principles simplifies analysis.
Apply dplyr functions — filter(), select(), mutate(), summarize(), and group_by() — to manipulate data frames.
Reshape data between wide and long formats using tidyr’s pivot_longer() and pivot_wider().
Combine datasets from multiple sources using dplyr join functions and identify which join type fits a given scenario.
Identify and handle missing values and duplicate observations in a dataset, and explain how AI tools can assist with data cleaning tasks.