13.5 Validation-First Analytics

Validation-first analytics inverts the typical workflow: instead of building an analysis and then checking whether it is correct, you define what “correct” means before you write any code. This is analogous to test-driven development (TDD) in software engineering, where tests are written before the code that must pass them (Beck 2003). Systems like Amazon’s Deequ have demonstrated that “unit tests for data” — declarative validation checks applied to data pipelines — can be automated at scale (Schelter et al. 2018).

13.5.1 Writing Validation Checks

Validation checks are specific, testable conditions that the analysis must satisfy. They fall into several categories:

Data validation — checks applied to the cleaned dataset before modeling:

  • Does the dataset have the expected number of rows and columns?
  • Are all required variables present?
  • Do variable types match the data specification?
  • Are values within expected ranges?
  • Is the proportion of missing values within acceptable limits?

Analysis validation — checks applied to the model or analysis output:

  • Does the model meet the minimum performance threshold specified in the requirements?
  • Are the results consistent with domain knowledge?
  • Do aggregations match known totals or benchmarks?
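
Analysis validation checks can be expressed as code in the same style as data checks. The sketch below assumes a fitted regression model `model` predicting absence hours; the RMSE threshold of 10 stands in for whatever the requirements spec demands:

```r
# Analysis validation: compare model output against spec thresholds.
# The RMSE threshold (10) is a hypothetical placeholder -- substitute
# the value from your own requirements spec.
validate_model <- function(model, df, rmse_threshold = 10) {
  preds <- predict(model, newdata = df)
  rmse  <- sqrt(mean((df$Absenteeism.time.in.hours - preds)^2, na.rm = TRUE))

  checks <- list(
    "RMSE within spec threshold" = rmse <= rmse_threshold,
    "Mean prediction is non-negative" = mean(preds, na.rm = TRUE) >= 0
  )

  data.frame(Check = names(checks), Pass = unlist(checks), row.names = NULL)
}
```

As with data validation, the function returns a pass/fail table rather than raising an error, so the analyst can inspect all results at once.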

Deliverable validation — checks applied to the final output:

  • Does the report or dashboard answer every key question from the requirements spec?
  • Are all required visualizations included?
  • Is the documentation complete?
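
Even deliverable checks can be partly automated. The sketch below scans an R Markdown source file for required section headings; the file name and heading list are hypothetical placeholders for whatever the requirements spec calls for:

```r
# Deliverable validation: confirm a report source contains required sections.
# The heading list passed in is a stand-in for the requirements spec.
validate_report <- function(path, required_headings) {
  lines <- readLines(path, warn = FALSE)
  found <- sapply(required_headings, function(h)
    any(grepl(h, lines, fixed = TRUE)))
  data.frame(Check = paste("Contains section:", required_headings),
             Pass  = unname(found),
             row.names = NULL)
}

# Example usage:
# validate_report("report.Rmd",
#                 c("## Key Findings", "## Methods", "## Limitations"))
```

Checks like documentation completeness still need human judgment, but presence-of-section checks catch the most common omissions cheaply.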

13.5.2 Validation in R

Validation checks can be implemented as R functions that return pass/fail results. Here is an example using the absenteeism dataset:

# Validation function for the absenteeism dataset
validate_absenteeism <- function(df) {
  checks <- list(
    "Has at least 700 rows" = nrow(df) >= 700,
    "Has ID column" = "ID" %in% names(df),
    "Has absence hours column" = "Absenteeism.time.in.hours" %in% names(df),
    "Age range 18-70" = all(df$Age >= 18 & df$Age <= 70, na.rm = TRUE),
    "BMI range 15-50" = all(df$Body.mass.index >= 15 &
                            df$Body.mass.index <= 50, na.rm = TRUE),
    "No negative absence hours" = all(df$Absenteeism.time.in.hours >= 0,
                                      na.rm = TRUE),
    "Less than 5% missing in any column" = all(
      sapply(df, function(x) mean(is.na(x))) < 0.05
    )
  )

  data.frame(
    Check = names(checks),
    Pass = unlist(checks),
    row.names = NULL  # drop the duplicated check names that unlist() carries over
  )
}

# Run validation on the raw dataset
validate_absenteeism(absenteeism)
                               Check Pass
               Has at least 700 rows TRUE
                       Has ID column TRUE
            Has absence hours column TRUE
                     Age range 18-70 TRUE
                     BMI range 15-50 TRUE
           No negative absence hours TRUE
  Less than 5% missing in any column TRUE

Running this function produces a clear pass/fail report. If any check fails, the analyst investigates before proceeding — preventing errors from propagating through the analysis.
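
The "investigate before proceeding" rule can also be enforced mechanically by halting the pipeline when any check fails. A minimal sketch that wraps the pass/fail data frame returned by a validation function (the error-message wording here is my own):

```r
# Halt the pipeline if any validation check fails.
# Expects a data frame with Check and Pass columns, as returned
# by validate_absenteeism() above.
assert_valid <- function(results) {
  failed <- results$Check[!results$Pass]
  if (length(failed) > 0) {
    stop("Validation failed: ", paste(failed, collapse = "; "), call. = FALSE)
  }
  invisible(results)  # pass results through unchanged on success
}

# Example: assert_valid(validate_absenteeism(absenteeism))
```

Because `assert_valid()` returns its input invisibly, it can sit in the middle of a pipeline without altering the data flow.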

13.5.3 AI-Assisted Validation

AI can generate validation scripts directly from a data specification. The analyst describes the expected conditions, and the AI produces an R function customized to those criteria. This is one of the most practical applications of AI in the spec-driven workflow — it translates a human-readable spec into executable code with minimal manual effort.
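
One way to make this concrete is to represent the spec itself as data and write a generic validator that interprets it; an AI assistant can then generate either the spec list or the interpreter from a prose description. The column names and bounds below are illustrative, not from a real spec:

```r
# A machine-readable slice of a data spec: expected ranges per column.
# The specific columns and bounds here are hypothetical examples.
spec <- list(
  Age                       = c(min = 18, max = 70),
  Body.mass.index           = c(min = 15, max = 50),
  Absenteeism.time.in.hours = c(min = 0,  max = Inf)
)

# Generic validator: one range check per column named in the spec.
validate_ranges <- function(df, spec) {
  pass <- sapply(names(spec), function(col) {
    x <- df[[col]]
    !is.null(x) && all(x >= spec[[col]]["min"] & x <= spec[[col]]["max"],
                       na.rm = TRUE)
  })
  data.frame(Check = paste0(names(spec), " in [", sapply(spec, `[`, "min"),
                            ", ", sapply(spec, `[`, "max"), "]"),
             Pass  = unname(pass),
             row.names = NULL)
}
```

Keeping the spec in this declarative form means updating a bound is a one-line change, and the same interpreter works across datasets.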