5.11 Glossary of Terms

  1. Anti Join: A join that returns rows from the left dataset that have no match in the right dataset, keeping only left columns.

  2. Data Cleaning: The process of identifying and correcting errors, inconsistencies, missing values, and duplicates in a dataset to improve data quality.

  3. Data Dictionary: A document describing every variable in a dataset — its name, data type, units, allowed values, and plain-language definition. Also called a codebook.

  4. Data Frame: R’s primary structure for tabular data, consisting of columns (variables) of equal length.

  5. Data Tidying: The practice of structuring a dataset so that each variable forms a column, each observation forms a row, and each value occupies a single cell.

  6. dplyr: A tidyverse package providing a consistent set of “verbs” for data manipulation, including filter(), select(), mutate(), summarize(), and group_by().

  7. Duplicate: A row in a dataset that is identical to another row across all columns. Duplicates should be investigated and, if confirmed as errors, removed with distinct(). Some duplicates may represent legitimate repeated measurements.

  8. Full Join: A join that returns all rows from both datasets, filling unmatched values with NA.

  9. group_by(): A dplyr function that segments a data frame into groups based on one or more variables, so that subsequent operations (like summarize()) are computed within each group.

  10. Imputation: The practice of replacing missing values with estimated values (e.g., the mean or median) rather than removing the affected rows.

  11. Inner Join: A join that returns only rows with matching values in both datasets.

  12. Left Join: A join that returns all rows from the left dataset, with matching rows from the right dataset. Left rows with no match have NA in the columns that come from the right dataset.

  13. Messy Data: Any dataset arrangement that violates the principles of tidy data. Common forms include variables stored in rows, multiple variables in a single column, or values spread across column names.

  14. Missing Value (NA): R’s representation of a missing or unavailable data point. Many summary functions, such as mean(), return NA when their input contains a missing value; specify na.rm = TRUE to drop missing values before computing.

  15. Pipe Operator (|>): R’s native pipe (available since R 4.1), which passes the value of the expression on its left as the first argument to the function call on its right, enabling readable chains of data operations.

  16. pivot_longer(): A tidyr function that transforms data from wide format to long format by gathering column names into a variable column and their values into a value column.

  17. pivot_wider(): A tidyr function that transforms data from long format to wide format by spreading a variable’s values across multiple columns.

  18. Reproducibility: The ability for others (or your future self) to re-run an analysis and obtain the same results. Achieved through organized project structure, documented code, and separation of raw and processed data.

  19. Right Join: A join that returns all rows from the right dataset, with matching rows from the left dataset. Right rows with no match have NA in the columns that come from the left dataset.

  20. Semi Join: A join that returns rows from the left dataset that have a match in the right dataset, but includes only columns from the left dataset.

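  21. SQL (Structured Query Language): The standard language for querying and manipulating data in relational databases. Core SQL clauses have close dplyr counterparts: SELECT corresponds to select(), WHERE to filter(), GROUP BY to group_by(), and JOIN to the join functions such as inner_join() and left_join().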

  22. Tidy Data: A dataset structured so that each variable forms a column, each observation forms a row, and each value occupies a single cell. This standard simplifies analysis in R’s tidyverse ecosystem.

  23. tidyr: A tidyverse package for reshaping data between wide and long formats using pivot_longer() and pivot_wider().

  24. Tidyverse: An ecosystem of R packages designed for data science, sharing a consistent philosophy, grammar, and data structures. Core packages include dplyr, tidyr, ggplot2, and readr.

  25. Validation: The process of checking that data values fall within expected ranges and conform to expected types, used to catch data entry errors and anomalies early in the preparation process.
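
Several of the terms above are easier to tell apart side by side. The following is a minimal sketch, assuming R 4.1 or later with dplyr and tidyr installed. The data frames patients, visits, and wide, and all of their column names and values, are hypothetical and invented only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical example data (not from the chapter)
patients <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cho"))
visits   <- data.frame(id = c(2, 3, 4), weight_kg = c(70, NA, 82))

# Joins: which rows and columns survive depends on the join type
inner <- inner_join(patients, visits, by = "id")  # ids 2 and 3 only
left  <- left_join(patients, visits, by = "id")   # ids 1, 2, 3; id 1 gets NA weight
full  <- full_join(patients, visits, by = "id")   # ids 1, 2, 3, 4
semi  <- semi_join(patients, visits, by = "id")   # ids 2 and 3, left columns only
anti  <- anti_join(patients, visits, by = "id")   # id 1, left columns only

# Missing values: mean() propagates NA unless na.rm = TRUE
mean_with_na    <- mean(visits$weight_kg)                # NA
mean_without_na <- mean(visits$weight_kg, na.rm = TRUE)  # 76

# Reshaping: wide to long, then back to wide
wide <- data.frame(id = c(1, 2),
                   score_2023 = c(10, 20),
                   score_2024 = c(15, 25))

long <- wide |>
  pivot_longer(cols = starts_with("score_"),
               names_to = "year",
               names_prefix = "score_",
               values_to = "score")

back <- long |>
  pivot_wider(names_from = year,
              values_from = score,
              names_prefix = "score_")

# Grouped summary: one mean score per year
by_year <- long |>
  group_by(year) |>
  summarize(mean_score = mean(score))
```

Printing inner, left, full, semi, and anti one after another shows how each join treats unmatched rows. Comparing wide with back confirms that pivot_wider() reverses pivot_longer() in this small case.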