5 Data Preparation and Cleaning

A commonly cited estimate in data analysis is that up to 80% of the work involves cleaning and preparing the data. This is not just a preliminary step — it is a continuous necessity throughout any BI project. As new data arrives, variables need recoding, formats need standardizing, and errors need correcting. The skills in this chapter — structuring data into tidy format, manipulating it with dplyr, reshaping it with tidyr, joining datasets, and handling missing values and duplicates — form the practical backbone of every analysis in this textbook and in professional BI work.

Chapter Goals

Upon concluding this chapter, readers will be able to:

  1. Define tidy data and explain why structuring datasets according to tidy principles simplifies analysis.
  2. Apply dplyr functions — filter(), select(), mutate(), summarize(), and group_by() — to manipulate data frames.
  3. Reshape data between wide and long formats using tidyr’s pivot_longer() and pivot_wider().
  4. Combine datasets from multiple sources using dplyr join functions and identify which join type fits a given scenario.
  5. Identify and handle missing values and duplicate observations in a dataset, and explain how AI tools can assist with data cleaning tasks.