13.4 Data Specifications and Quality Criteria

A data specification documents what data the project requires, where it comes from, what each variable means, and what quality standards it must meet. This is the bridge between the business requirements and the actual analysis — it ensures that the data the analyst works with is appropriate for answering the stakeholder’s questions.

13.4.1 Components of a Data Specification

Data sources. Where the data comes from — internal databases, CSV exports, APIs, third-party providers — and how it will be accessed. This should include any access restrictions or permissions required.

Data dictionary. A table describing every variable in the dataset: its name, data type, units, allowed values, and a plain-language definition. The data dictionary concept was introduced in Chapter 5; in the spec-driven workflow, it is created before the analysis begins, not after.

Required transformations. The cleaning and transformation steps that must be applied to the raw data — recoding variables, handling missing values, creating derived features. Documenting these in advance ensures reproducibility and makes the analyst’s decisions transparent.
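As a minimal sketch of how documented transformations might be written down as code, the R function below applies three illustrative steps (a recode, a missing-value rule, and a derived feature) to a small made-up data frame. The column names, the median-imputation rule, and the 8-hour working day are assumptions chosen for illustration; a real specification would state its own rules.

```r
# Toy data: every value here is made up for illustration.
raw <- data.frame(
  department   = c("HR", "hr ", "Finance", "finance"),
  bmi          = c(30.1, NA, 24.3, 27.0),
  hours_absent = c(4, 0, 8, 40),
  stringsAsFactors = FALSE
)

# Each step mirrors one entry in a (hypothetical) transformation list.
apply_transformations <- function(df) {
  out <- df

  # T1 (recode): trim whitespace and upper-case department labels
  out$department <- toupper(trimws(out$department))

  # T2 (missing values): record which BMI values were missing,
  # then fill them with the median of the observed values
  out$bmi_was_missing <- is.na(out$bmi)
  out$bmi[out$bmi_was_missing] <- median(out$bmi, na.rm = TRUE)

  # T3 (derived feature): days absent, assuming an 8-hour working day
  out$days_absent <- out$hours_absent / 8

  out
}

clean <- apply_transformations(raw)
print(clean)
```

Keeping all steps in one function, with the specification's step labels in the comments, makes it easy to check the code against the written specification and to rerun the same cleaning on a fresh extract.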

Quality criteria. Minimum standards the data must meet before analysis proceeds. The foundational dimensions of data quality — accuracy, completeness, timeliness, and consistency — provide a useful framework for defining these criteria (Wang and Strong 1996; Cai and Zhu 2015):

  • Completeness: What percentage of missing values is acceptable per variable?
  • Accuracy: Are values within expected ranges? (e.g., age between 18 and 70, BMI between 15 and 50)
  • Timeliness: Is the data current enough for the analysis? (e.g., data must be from the past 3 years)
  • Consistency: Are formats standardized? (e.g., dates in ISO 8601 format, YYYY-MM-DD; department names spelled consistently)
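The four criteria above can be turned into automated checks. The following R sketch runs one check per dimension on a made-up data frame; the 10% missing-value threshold, the reference date, and the column names are assumptions for illustration, while the age, BMI, and 3-year limits reuse the examples from the list.

```r
# Toy data: every value here is made up for illustration.
records <- data.frame(
  age         = c(34, 51, NA, 17, 72),
  bmi         = c(22.5, 31.0, 27.3, 19.8, 55.2),
  recorded_on = c("2023-04-01", "2021-06-15", "2024-01-09",
                  "15/03/2023", "2023-08-30"),
  stringsAsFactors = FALSE
)

# Assumed thresholds; a real specification would agree these with stakeholders.
max_missing_share <- 0.10
reference_date    <- as.Date("2025-01-01")

# Completeness: number of variables whose missing share exceeds the threshold
missing_share   <- colMeans(is.na(records))
incomplete_vars <- sum(missing_share > max_missing_share)

# Accuracy: values outside the expected ranges (missing values are not counted here)
age_out <- sum(records$age < 18 | records$age > 70, na.rm = TRUE)
bmi_out <- sum(records$bmi < 15 | records$bmi > 50, na.rm = TRUE)

# Timeliness: parseable dates older than 3 years before the reference date
cutoff       <- seq(reference_date, by = "-3 years", length.out = 2)[2]
parsed_dates <- as.Date(records$recorded_on, format = "%Y-%m-%d")
stale        <- sum(parsed_dates < cutoff, na.rm = TRUE)

# Consistency: date strings that do not match the YYYY-MM-DD pattern
is_iso  <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", records$recorded_on)
non_iso <- sum(!is_iso)

quality_report <- data.frame(
  check      = c("completeness", "accuracy_age", "accuracy_bmi",
                 "timeliness", "consistency_dates"),
  violations = c(incomplete_vars, age_out, bmi_out, stale, non_iso)
)
quality_report$passed <- quality_report$violations == 0
print(quality_report)
```

A report like this gives the analyst a simple gate: analysis proceeds only when every row passes, or when each failure has been reviewed and accepted explicitly.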

13.4.2 Worked Example: Partial Data Dictionary

Here is a partial data dictionary for the Absenteeism at Work dataset, showing the format and level of detail expected:

Variable                    Type         Range/Values  Description
ID                          Integer      1-36          Employee identifier
Reason.for.absence          Categorical  0-28          Absence reason code: 1-21 are ICD categories, 22-28 are non-ICD reasons (0 = no reason recorded)
Month.of.absence            Integer      0-12          Month in which absence occurred (0 = no absence)
Age                         Integer      27-58         Employee age in years
Body.mass.index             Numeric      19-38         BMI (kg/m²)
Absenteeism.time.in.hours   Integer      0-120         Hours absent in the recorded period

A complete data dictionary would include all 21 variables in the dataset. An AI assistant can generate a draft by inspecting the data file directly; the analyst then reviews it and adds domain context (e.g., explaining that reason codes 1-21 follow ICD categories, while codes 22-28 cover non-ICD reasons such as medical consultations).
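A data dictionary becomes more useful when it is stored in a machine-readable form, because the documented ranges can then be checked against incoming data. The R sketch below encodes the ranges from the partial dictionary above and counts out-of-range values per variable. The sample rows are made up, `check_ranges` is a hypothetical helper, and 0 is treated as an allowed month code because the dictionary documents it.

```r
# Ranges taken from the partial data dictionary; Month.of.absence allows 0
# because the dictionary documents 0 as "no absence".
dictionary <- data.frame(
  variable = c("ID", "Reason.for.absence", "Month.of.absence",
               "Age", "Body.mass.index", "Absenteeism.time.in.hours"),
  min = c(1, 0, 0, 27, 19, 0),
  max = c(36, 28, 12, 58, 38, 120),
  stringsAsFactors = FALSE
)

# Made-up rows for illustration; the third row deliberately breaks two ranges.
sample_rows <- data.frame(
  ID                        = c(11, 36, 40),
  Reason.for.absence        = c(26, 0, 23),
  Month.of.absence          = c(7, 8, 13),
  Age                       = c(33, 50, 38),
  Body.mass.index           = c(30, 31, 24),
  Absenteeism.time.in.hours = c(4, 0, 2)
)

# Hypothetical helper: count values outside [min, max] for each dictionary entry.
check_ranges <- function(data, dict) {
  violations <- vapply(seq_len(nrow(dict)), function(i) {
    x <- data[[dict$variable[i]]]
    sum(x < dict$min[i] | x > dict$max[i], na.rm = TRUE)
  }, integer(1))
  data.frame(variable = dict$variable, violations = violations,
             stringsAsFactors = FALSE)
}

range_report <- check_ranges(sample_rows, dictionary)
print(range_report)
```

Here the report flags one violation each for ID and Month.of.absence, which is the kind of discrepancy the analyst would resolve before analysis begins.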

13.4.3 AI-Assisted Data Documentation

Creating a data dictionary manually is tedious but essential. AI can accelerate this by inspecting a dataset and generating a draft dictionary automatically. The R code below demonstrates the kind of summary that seeds this process:

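# Generate a basic data summary that could seed a data dictionary.
# The file is semicolon-delimited, hence sep = ";".
absenteeism <- read.csv("data/Absenteeism_at_work.csv", sep = ";")

data.frame(
  Variable = names(absenteeism),
  # First class only, so that each column yields exactly one string
  Type = sapply(absenteeism, function(x) class(x)[1]),
  Unique = sapply(absenteeism, function(x) length(unique(x))),
  Missing = sapply(absenteeism, function(x) sum(is.na(x))),
  # First three distinct values, in order of appearance
  Example = sapply(absenteeism, function(x) paste(head(unique(x), 3), collapse = ", ")),
  row.names = NULL  # avoid repeating the variable names as row names
) |> head(8)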

Variable                         Type     Unique  Missing  Example
ID                               integer  36      0        11, 36, 3
Reason.for.absence               integer  28      0        26, 0, 23
Month.of.absence                 integer  13      0        7, 8, 9
Day.of.the.week                  integer  5       0        3, 4, 5
Seasons                          integer  4       0        1, 4, 2
Transportation.expense           integer  24      0        289, 118, 179
Distance.from.Residence.to.Work  integer  25      0        36, 13, 51
Service.time                     integer  18      0        13, 18, 14

This output provides the raw material for a data dictionary. An AI assistant can take this summary and produce a formatted, human-readable dictionary with descriptions — which the analyst then reviews for accuracy.
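One possible next step, sketched below in R, is to extend the summary with observed ranges and an empty Description column, so that the draft already has the shape of the dictionary in Section 13.4.2. The function name `draft_dictionary` and the toy data (including the `Site` column) are hypothetical; with the real file, the function would be called on the imported data frame instead.

```r
# Hypothetical helper: build a dictionary skeleton from any data frame.
draft_dictionary <- function(df) {
  data.frame(
    Variable = names(df),
    Type = vapply(df, function(x) class(x)[1], character(1)),
    # Numeric columns get "min-max"; other columns get a count of distinct values
    Range = vapply(df, function(x) {
      if (is.numeric(x)) {
        paste(range(x, na.rm = TRUE), collapse = "-")
      } else {
        paste(length(unique(x)), "levels")
      }
    }, character(1)),
    # Left empty on purpose: descriptions need human or AI-assisted review
    Description = NA_character_,
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data for illustration only.
toy <- data.frame(
  Age             = c(33, 50, 27, 58),
  Body.mass.index = c(30, 31, 19, 38),
  Site            = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

draft <- draft_dictionary(toy)
print(draft)
```

Leaving Description blank keeps the division of labor explicit: the code supplies what can be computed, and the definitions are filled in and verified by someone who knows the domain.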