5.10 Summary

This chapter covered both the principles and the practice of data preparation. We began with the foundations: tidy data structure, the qualities of analysis-ready data, and the importance of project organization and documentation. These principles ensure that your data work is not only correct but also reproducible and understandable to others.

We then worked through the core tidyverse tools: dplyr for filtering, selecting, transforming, and summarizing data; tidyr for reshaping between wide and long formats; and join functions for combining datasets from multiple sources. We addressed two of the most common data quality problems, missing values and duplicates, and demonstrated both removal and imputation strategies. Finally, we showed how AI coding assistants can accelerate data preparation, while emphasizing that the analyst’s understanding of the data remains the essential safeguard against errors.

In the next chapter, we apply these techniques to the Absenteeism at Work dataset, walking through a complete data cleaning pipeline from raw import to analysis-ready data frame — putting the principles of project organization and data documentation into practice.