5.2 Principles of Data Preparation
5.2.1 Tidy Data
Data tidying is the practice of structuring datasets to streamline analysis (Wickham 2014). Rather than inventing a new cleaning strategy for each dataset, tidy data provides a standardized set of principles that, when followed, make every subsequent step — filtering, summarizing, visualizing, modeling — simpler and more predictable. Tidy datasets and tidy-compatible tools work in tandem, allowing analysts to focus on substantive questions rather than logistical hurdles.
5.2.1.1 Defining Tidy Data
A dataset is composed of values (numbers or text), organized by variables (attributes measured across units) and observations (all values from the same unit). The same underlying data can be arranged in many different layouts — some make analysis easy, others make it painful. Tidy data aligns a dataset’s physical structure with its meaning.
5.2.1.2 Principles of Tidy Data
Tidy data adheres to three principles:
Each variable forms a column: Every column represents one and only one variable.
Each observation forms a row: Every row contains all information for a single observation.
Each value occupies a single cell: Each cell holds exactly one piece of information.
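To make the principles concrete, here is a small sketch using hypothetical absence data. In the messy layout, the variable “quarter” is hidden in the column names; `tidyr::pivot_longer()` moves it into a column of its own:

```r
library(tidyr)

# Hypothetical messy layout: one column per quarter
messy <- data.frame(
  employee_id = c(1, 2),
  q1_hours = c(8, 0),
  q2_hours = c(4, 16)
)

# Tidy layout: one column per variable (employee_id, quarter, hours),
# one row per employee-quarter observation
tidy <- pivot_longer(
  messy,
  cols = c(q1_hours, q2_hours),
  names_to = "quarter",
  values_to = "hours"
)
tidy
```

After pivoting, each row holds all the values for a single employee-quarter observation, satisfying all three principles at once.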
These principles mirror the third normal form of relational database design (Codd 1970), reframed in statistical language for a single dataset. Any arrangement that violates these principles produces “messy” data — and messy data is the norm in practice, which is why data tidying is such a critical skill.
5.2.1.3 The Advantages of Tidy Data
Tidy data provides a standardized mapping between a dataset’s meaning and its structure. This is particularly powerful in R, where the tidyverse packages are designed to work with tidy data natively — operations like filter(), mutate(), group_by(), and ggplot() all assume tidy input. When your data is tidy, these tools compose seamlessly; when it is not, you spend time restructuring before you can begin analysis.
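A brief sketch of this composability, using hypothetical tidy data: because each verb expects and returns a tidy dataset, the steps chain without any intermediate restructuring.

```r
library(dplyr)

# Hypothetical tidy data: one row per employee record
absences <- data.frame(
  department = c("sales", "sales", "ops", "ops"),
  absence_hours = c(8, 2, 0, 16)
)

# Filter, group, and summarize in one readable pipeline
absences |>
  filter(absence_hours > 0) |>
  group_by(department) |>
  summarise(mean_hours = mean(absence_hours))
```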
5.2.2 What Does Analysis-Ready Data Look Like?
Beyond tidy structure, professionally prepared data exhibits several additional qualities that make it reliable and usable:
Correct data types. Every column should have the appropriate type — numeric columns should not contain text, dates should be parsed as date objects (not stored as character strings), and categorical variables should be encoded as factors with meaningful levels. Mistyped columns are one of the most common sources of downstream errors.
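A minimal sketch of fixing mistyped columns, assuming a hypothetical import where every column arrived as character text:

```r
# Hypothetical raw import: everything read in as character strings
raw <- data.frame(
  hire_date = c("2021-03-15", "2019-07-01"),
  salary    = c("52000", "61000"),
  seniority = c("junior", "senior"),
  stringsAsFactors = FALSE
)

typed <- raw
typed$hire_date <- as.Date(raw$hire_date)      # character -> Date
typed$salary    <- as.numeric(raw$salary)      # character -> numeric
typed$seniority <- factor(raw$seniority,
                          levels = c("junior", "senior"))  # meaningful levels
str(typed)  # verify each column now has the intended type
```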
Consistent naming conventions. Column names should be descriptive, lowercase, and use a consistent separator (e.g., employee_id, absence_hours, service_time). Avoid spaces, special characters, and ambiguous abbreviations. The janitor::clean_names() function can standardize messy column names automatically.
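For instance, given a hypothetical export with spaces and punctuation in its headers:

```r
library(janitor)

# Hypothetical spreadsheet export with awkward column names
df <- data.frame(
  `Employee ID`     = 1:2,
  `Absence (Hours)` = c(8, 0),
  check.names = FALSE
)

clean_names(df)  # names become employee_id, absence_hours
```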
No embedded calculations or formatting. Raw data should contain values, not formulas or formatted text. Percentages should be stored as decimals (0.15, not “15%”), currencies as numbers without symbols, and dates in ISO format (YYYY-MM-DD) where possible.
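One way to strip such formatting on import, sketched with hypothetical values as they might arrive from a spreadsheet:

```r
library(readr)

pct  <- "15%"
cost <- "$1,250.00"

parse_number(pct) / 100   # 0.15 -- store percentages as decimals
parse_number(cost)        # 1250 -- store currency as a plain number

# Normalize a non-ISO date string to a proper Date object
as.Date("03/15/2021", format = "%m/%d/%Y")
```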
Documented units and definitions. Every variable should have a clear definition. What does “weight” mean — pounds or kilograms? What does “duration” measure — minutes, hours, or days? These details should be recorded in a data dictionary (see below).
Validated ranges. Values should fall within expected ranges. A human age of 250 or a negative salary indicates a data entry error. Validation checks — even simple ones like summary() or range() — should be run early and often.
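A quick sketch of such a check, using a hypothetical age column containing the kind of entry error described above (the cutoff of 100 is an illustrative assumption):

```r
# Hypothetical column with one data entry error
age <- c(34, 51, 250, 28)

summary(age)   # quick five-number overview
range(age)     # min and max -- the 250 stands out immediately

# Flag records outside a plausible range
which(age < 16 | age > 100)  # indices of suspect values
```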
5.2.3 Project Organization and Data Documentation
Good data preparation extends beyond the data itself to how you organize your project and document your work. In professional BI settings, analyses must be reproducible — a colleague (or your future self) should be able to understand and re-run your work months later.
Directory structure. A well-organized project separates raw data, processed data, scripts, and output. A common structure:
project/
|-- data/
| |-- raw/ # Original, untouched data files
| +-- processed/ # Cleaned, analysis-ready data
|-- scripts/ # R scripts for cleaning and analysis
|-- output/ # Reports, plots, exported results
+-- README.md # Project description and instructions
The key principle is to never modify raw data files directly. Always read the raw data into R, apply transformations in code, and save the cleaned result to a separate file. This way, the raw data is always available as a reference, and every cleaning decision is documented in your script.
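The read-transform-save pattern can be sketched as follows; the file paths follow the directory structure above, and the file names and cleaning step are hypothetical:

```r
# Read the untouched raw file
raw <- read.csv("data/raw/absenteeism.csv")

# Apply cleaning steps in code, e.g. dropping implausible ages
cleaned <- subset(raw, Age <= 100)

# Save the analysis-ready version separately; raw/ is never overwritten
write.csv(cleaned, "data/processed/absenteeism_clean.csv", row.names = FALSE)
```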
Data dictionaries. A data dictionary (or codebook) is a document that describes every variable in a dataset: its name, data type, units, allowed values, and a plain-language description. For the Absenteeism at Work dataset used in this textbook, for example, a data dictionary would note that Reason for absence is a categorical variable coded using ICD (International Classification of Diseases) categories, and that Absenteeism time in hours is a non-negative integer.
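A data dictionary can itself live as a small table alongside the project; here is an illustrative sketch based on the two variables described above (the wording of the entries is an assumption):

```r
# Illustrative data dictionary as a data frame
dictionary <- data.frame(
  variable    = c("Reason for absence", "Absenteeism time in hours"),
  type        = c("categorical", "integer"),
  units       = c("ICD category code", "hours"),
  description = c("Stated reason, coded using ICD categories",
                  "Total absence per record; non-negative")
)
dictionary
```

Keeping the dictionary in code (or a version-controlled CSV) means it stays with the project and can be updated in the same commit as the data cleaning that prompts the change.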
Documentation in code. Use comments in your R scripts to explain why you made each cleaning decision — not just what the code does. For example: # Remove 3 records where age > 100 — confirmed data entry errors with HR department. This context is invaluable when revisiting the analysis later.