13.5 Validation-First Analytics

Validation-first analytics inverts the typical workflow: instead of building an analysis and then checking whether it is correct, you define what “correct” means before you write any code. This is analogous to test-driven development (TDD) in software engineering, where tests are written before the code that must pass them (Beck 2003). Systems like Amazon’s Deequ have demonstrated that “unit tests for data” — declarative validation checks applied to data pipelines — can be automated at scale (Schelter et al. 2018).

13.5.1 Writing Validation Checks

Validation checks are specific, testable conditions that the analysis must satisfy. They fall into three categories:

Data validation — checks applied to the cleaned dataset before modeling:

Does the dataset have the expected number of rows and columns?
Are all required variables present?
Do variable types match the data specification?
Are values within expected ranges?
Is the proportion of missing values within acceptable limits?
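The type question in the list above can be made concrete with a short R sketch. It is illustrative only: the `expected_types` specification, the `check_types()` helper, and the toy data frame are all hypothetical, not part of any dataset's actual specification.

```r
# Hypothetical data specification: column name -> expected class.
# These names and classes are assumptions for illustration only.
expected_types <- c(ID = "integer", Age = "integer", Body.mass.index = "numeric")

# Compare each column's class with the specification.
# A column that is absent from df is reported as "<missing>".
check_types <- function(df, expected) {
  actual <- vapply(names(expected), function(col) {
    if (col %in% names(df)) class(df[[col]])[1] else "<missing>"
  }, character(1))
  data.frame(
    Column = names(expected),
    Expected = unname(expected),
    Actual = unname(actual),
    Pass = unname(actual == expected),
    row.names = NULL
  )
}

# Toy data frame standing in for a cleaned dataset.
# Body.mass.index is deliberately stored as text so that one check fails.
toy <- data.frame(
  ID = 1:3,
  Age = c(30L, 41L, 52L),
  Body.mass.index = c("24", "31", "27")
)

type_report <- check_types(toy, expected_types)
type_report
```

Only the first class of each column is compared, which is a simplification; a column with several classes (for example, a date-time) would need a more careful comparison.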

Analysis validation — checks applied to the model or analysis output:

Does the model meet the minimum performance threshold specified in the requirements?
Are the results consistent with domain knowledge?
Do aggregations match known totals or benchmarks?
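Analysis checks like these can be expressed in the same pass/fail style as data checks. The following is a minimal sketch on toy data: the threshold `min_r_squared`, the benchmark `known_total`, the variables, and the expected sign of the coefficient are all assumptions made up for illustration, not values from any real requirements document.

```r
# Hypothetical requirement values; in practice these would come from the
# requirements spec and from a trusted external benchmark.
min_r_squared <- 0.80
known_total <- 60

# Toy data standing in for an analysis dataset
toy <- data.frame(
  department = c("A", "A", "B", "B", "C", "C"),
  tenure = c(1, 2, 3, 4, 5, 6),
  hours = c(4, 7, 9, 11, 14, 15)
)

# The "analysis": a simple linear model and a grouped aggregation
fit <- lm(hours ~ tenure, data = toy)
by_dept <- aggregate(hours ~ department, data = toy, FUN = sum)

# Each named element evaluates to a single TRUE or FALSE
analysis_checks <- c(
  "R-squared meets threshold" =
    summary(fit)$r.squared >= min_r_squared,
  "Department totals sum to known total" =
    isTRUE(all.equal(sum(by_dept$hours), known_total)),
  "Tenure coefficient is positive (assumed domain expectation)" =
    coef(fit)[["tenure"]] > 0
)

analysis_report <- data.frame(
  Check = names(analysis_checks),
  Pass = unname(analysis_checks),
  row.names = NULL
)
analysis_report
```

Using `all.equal()` rather than `==` for the totals avoids spurious failures from floating-point rounding when the aggregated values are not whole numbers.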

Deliverable validation — checks applied to the final output:

Does the report or dashboard answer every key question from the requirements spec?
Are all required visualizations included?
Is the documentation complete?

13.5.2 Validation in R

Validation checks can be implemented as R functions that return pass/fail results. Here is an example using the absenteeism dataset:

# Validation function for the absenteeism dataset
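validate_absenteeism <- function(df) {
  # Each named element is one check that evaluates to a single TRUE or FALSE
  checks <- list(
    "Has at least 700 rows" = nrow(df) >= 700,
    "Has ID column" = "ID" %in% names(df),
    "Has absence hours column" = "Absenteeism.time.in.hours" %in% names(df),
    # Range checks skip NAs; missingness is covered by the last check.
    # If a column is absent, df$col is NULL and all() returns TRUE,
    # so column presence has to be checked separately.
    "Age range 18-70" = all(df$Age >= 18 & df$Age <= 70, na.rm = TRUE),
    "BMI range 15-50" = all(df$Body.mass.index >= 15 &
                            df$Body.mass.index <= 50, na.rm = TRUE),
    "No negative absence hours" = all(df$Absenteeism.time.in.hours >= 0,
                                      na.rm = TRUE),
    # vapply guarantees one numeric proportion per column
    "Less than 5% missing in any column" = all(
      vapply(df, function(x) mean(is.na(x)), numeric(1)) < 0.05
    )
  )

  # One row per check; row.names = NULL keeps check names out of the row names
  data.frame(
    Check = names(checks),
    Pass = unlist(checks),
    row.names = NULL
  )
}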

# Run validation on the raw dataset
validate_absenteeism(absenteeism)

Check                                 Pass
Has at least 700 rows                 TRUE
Has ID column                         TRUE
Has absence hours column              TRUE
Age range 18-70                       TRUE
BMI range 15-50                       TRUE
No negative absence hours             TRUE
Less than 5% missing in any column    TRUE

Running this function produces a clear pass/fail report. If any check fails, the analyst investigates before proceeding — preventing errors from propagating through the analysis.
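One way to make "investigate before proceeding" automatic is to halt the script whenever a check fails. The sketch below assumes a report with the same `Check` and `Pass` columns that `validate_absenteeism()` returns; the helper name `stop_if_failed()` and the toy report are hypothetical.

```r
# Stop with an informative error if any check in a Check/Pass report failed.
# A Pass value of NA is treated as a failure, since the check was inconclusive.
stop_if_failed <- function(report) {
  failed <- report$Check[is.na(report$Pass) | !report$Pass]
  if (length(failed) > 0) {
    stop("Validation failed: ", paste(failed, collapse = "; "), call. = FALSE)
  }
  invisible(report)
}

# Toy report standing in for real validation output
toy_report <- data.frame(
  Check = c("Has ID column", "Age range 18-70"),
  Pass = c(TRUE, FALSE)
)

# tryCatch is used here only so that the example runs to completion;
# in a real script the error would stop execution at this point.
msg <- tryCatch(
  {
    stop_if_failed(toy_report)
    "all checks passed"
  },
  error = function(e) conditionMessage(e)
)
msg
```

In a pipeline, a call such as `stop_if_failed(validate_absenteeism(absenteeism))` placed before the modeling code would prevent later steps from running on data that failed validation.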

13.5.3 AI-Assisted Validation

AI can generate validation scripts directly from a data specification. The analyst describes the expected conditions, and the AI produces an R function customized to those criteria. This is one of the most practical applications of AI in the spec-driven workflow — it translates a human-readable spec into executable code with minimal manual effort.
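To show what "translating a spec into executable code" can look like, here is a hand-written sketch (not AI output) of the kind of function an analyst might ask for. The `spec` table, its column names and limits, the function `validate_from_spec()`, and the toy data are all hypothetical.

```r
# Hypothetical data specification, written as a table an analyst could review.
# Column names and limits are illustrative assumptions.
spec <- data.frame(
  column = c("Age", "Body.mass.index"),
  min = c(18, 15),
  max = c(70, 50),
  max_missing = c(0.05, 0.05),
  stringsAsFactors = FALSE
)

# Turn each spec row into three pass/fail checks on df:
# presence, range (ignoring NAs), and proportion of missing values.
validate_from_spec <- function(df, spec) {
  rows <- lapply(seq_len(nrow(spec)), function(i) {
    col <- spec$column[i]
    present <- col %in% names(df)
    x <- if (present) df[[col]] else NA_real_
    data.frame(
      Check = paste(col, c("is present", "within range", "missingness below limit")),
      Pass = c(
        present,
        present && all(x >= spec$min[i] & x <= spec$max[i], na.rm = TRUE),
        present && mean(is.na(x)) < spec$max_missing[i]
      )
    )
  })
  do.call(rbind, rows)
}

# Toy data with one out-of-range age and one missing BMI value
toy <- data.frame(Age = c(25, 40, 17), Body.mass.index = c(22, NA, 30))

spec_report <- validate_from_spec(toy, spec)
spec_report
```

Keeping the spec as data rather than hard-coding each condition means a change to the specification only requires editing the table, and generated code of this shape is short enough for the analyst to read and verify before relying on it.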