5.11 Glossary of Terms
Anti Join: A join that returns rows from the left dataset that have no match in the right dataset, keeping only left columns.
Data Cleaning: The process of identifying and correcting errors, inconsistencies, missing values, and duplicates in a dataset to improve data quality.
Data Dictionary: A document describing every variable in a dataset — its name, data type, units, allowed values, and plain-language definition. Also called a codebook.
Data Frame: R’s primary structure for tabular data, consisting of columns (variables) of equal length.
Data Tidying: The practice of structuring a dataset so that each variable forms a column, each observation forms a row, and each value occupies a single cell.
dplyr: A tidyverse package providing a consistent set of “verbs” for data manipulation, including filter(), select(), mutate(), summarize(), and group_by().
Duplicate: A row in a dataset that is identical to another row across all columns. Duplicates should be investigated and, if confirmed as errors, removed with distinct(). Some duplicates may represent legitimate repeated measurements.
Full Join: A join that returns all rows from both datasets, filling unmatched values with NA.
group_by(): A dplyr function that segments a data frame into groups based on one or more variables, so that subsequent operations (like summarize()) are computed within each group.
Imputation: The practice of replacing missing values with estimated values (e.g., the mean or median) rather than removing the affected rows.
Inner Join: A join that returns only rows with matching values in both datasets.
Left Join: A join that returns all rows from the left dataset, with matching rows from the right dataset. Unmatched rows have NA in the right columns.
Messy Data: Any dataset arrangement that violates the principles of tidy data. Common forms include variables stored in rows, multiple variables in a single column, or values spread across column names.
Missing Value (NA): R’s representation of a missing or unavailable data point. Functions like mean() return NA by default when input contains missing values unless na.rm = TRUE is specified.
Pipe Operator (|>): An operator that passes the result of one expression as the first argument to the next function, enabling readable chains of data operations.
pivot_longer(): A tidyr function that transforms data from wide format to long format by gathering column names into a variable column and their values into a value column.
pivot_wider(): A tidyr function that transforms data from long format to wide format by spreading a variable’s values across multiple columns.
Reproducibility: The ability for others (or your future self) to re-run an analysis and obtain the same results. Achieved through organized project structure, documented code, and separation of raw and processed data.
Right Join: A join that returns all rows from the right dataset, with matching rows from the left dataset.
Semi Join: A join that returns rows from the left dataset that have a match in the right dataset, but includes only columns from the left dataset.
SQL (Structured Query Language): The standard language for querying and manipulating data in relational databases. SQL operations like SELECT, WHERE, GROUP BY, and JOIN map directly to dplyr functions.
Tidy Data: A dataset structured so that each variable forms a column, each observation forms a row, and each value occupies a single cell. This standard simplifies analysis in R’s tidyverse ecosystem.
tidyr: A tidyverse package for reshaping data between wide and long formats using pivot_longer() and pivot_wider().
Tidyverse: An ecosystem of R packages designed for data science, sharing a consistent philosophy, grammar, and data structure. Core packages include dplyr, tidyr, ggplot2, and readr.
Validation: The process of checking that data values fall within expected ranges and conform to expected types, used to catch data entry errors and anomalies early in the preparation process.
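
Several of the terms above (missing values, the pipe operator, group_by(), and the join types) can be seen together in one short sketch. The data frames patients and visits and their columns are hypothetical, invented purely for illustration. The sketch assumes the dplyr package is installed and R 4.1 or later is in use, so that the native pipe |> is available.

```r
library(dplyr)

# Hypothetical example data (not from any real study)
patients <- data.frame(
  id        = c(1, 2, 3, 4),
  site      = c("A", "A", "B", "B"),
  weight_kg = c(70, NA, 82, 90)
)
visits <- data.frame(
  id         = c(1, 3, 5),
  visit_date = c("2024-01-10", "2024-01-12", "2024-01-15")
)

# Missing value (NA): mean() returns NA unless na.rm = TRUE is given
mean_with_na    <- mean(patients$weight_kg)                # NA
mean_without_na <- mean(patients$weight_kg, na.rm = TRUE)  # mean of 70, 82, 90

# Pipe operator + group_by() + summarize(): one mean per site
site_means <- patients |>
  group_by(site) |>
  summarize(mean_weight = mean(weight_kg, na.rm = TRUE))

# Joins on the shared key "id"
left  <- left_join(patients, visits, by = "id")   # all 4 patients; NA visit_date for ids 2 and 4
inner <- inner_join(patients, visits, by = "id")  # only ids 1 and 3
full  <- full_join(patients, visits, by = "id")   # ids 1 to 5, NA where unmatched
semi  <- semi_join(patients, visits, by = "id")   # ids 1 and 3, left columns only
anti  <- anti_join(patients, visits, by = "id")   # ids 2 and 4, left columns only
```

Comparing nrow() and names() across left, inner, full, semi, and anti is a quick way to check which rows and columns each join type keeps.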