5 Data Preparation and Cleaning
A commonly cited estimate in data analysis is that up to 80% of the work involves cleaning and preparing the data. This is not just a preliminary step — it is a continuous necessity throughout any BI project. As new data arrives, variables need recoding, formats need standardizing, and errors need correcting. The skills in this chapter — structuring data into tidy format, manipulating it with dplyr, reshaping it with tidyr, joining datasets, and handling missing values and duplicates — form the practical backbone of every analysis in this textbook and in professional BI work.
Chapter Goals
Upon concluding this chapter, readers will be able to:
- Define tidy data and explain why structuring datasets according to tidy principles simplifies analysis.
- Apply
dplyrfunctions —filter(),select(),mutate(),summarize(), andgroup_by()— to manipulate data frames. - Reshape data between wide and long formats using
tidyr’spivot_longer()andpivot_wider(). - Combine datasets from multiple sources using
dplyrjoin functions and identify which join type fits a given scenario. - Identify and handle missing values and duplicate observations in a dataset, and explain how AI tools can assist with data cleaning tasks.