5.8 AI in Data Preparation
AI tools are increasingly useful in the data preparation phase of BI — the very phase where analysts spend the most time. While AI cannot replace the judgment needed to decide how data should be cleaned, it can dramatically accelerate the process of writing the code to do it (Zhu et al. 2025).
5.8.1 How AI Assists with Data Cleaning
An AI coding assistant like Claude Code can help with data preparation tasks in several ways:
- Generating cleaning code from descriptions: Describe the problem (“the date column has mixed formats — some are MM/DD/YYYY and others are YYYY-MM-DD”) and the AI generates R code to standardize them.
- Detecting anomalies: Ask the AI to examine a dataset for outliers, unexpected data types, or values outside expected ranges.
- Suggesting transformations: Given a messy dataset, AI can suggest a sequence of tidyverse operations to reshape it into tidy format.
- Explaining existing code: When inheriting a cleaning pipeline from a colleague, AI can walk through each step and explain what it does.
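As a sketch of the first pattern above, the code an assistant might generate for the mixed-date-format prompt could look like the following. The data frame `orders_raw` and column `order_date` are hypothetical names for illustration; the parsing relies on lubridate's `parse_date_time()`, which tries each format in `orders` until one matches.

```r
library(dplyr)
library(lubridate)

# Hypothetical example: order_date mixes "03/15/2024" (MM/DD/YYYY)
# and "2024-03-15" (YYYY-MM-DD). parse_date_time() tries the formats
# listed in `orders` in turn for each value.
orders_clean <- orders_raw |>
  mutate(order_date = as.Date(parse_date_time(order_date,
                                              orders = c("mdY", "Ymd"))))
```

Note the silent judgment call buried in `orders = c("mdY", "Ymd")`: a value like "01/02/2024" is ambiguous, and the first matching format wins. Checking `sum(is.na(orders_clean$order_date))` after parsing, and spot-checking ambiguous dates, is exactly the kind of verification the analyst still owes.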
5.8.2 Limitations
AI tools have important limitations in data preparation. They may suggest plausible-sounding transformations that subtly alter the meaning of the data — for example, imputing missing values with the mean when the missingness is not random, or dropping duplicates that are actually legitimate repeated measurements. The analyst must understand the data’s context and verify that AI-suggested cleaning operations are appropriate (Naeem et al. 2024).
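One way to guard against the imputation pitfall described above is to check whether missingness is associated with other variables before filling anything in. A minimal sketch, using hypothetical column names (`sensor_id`, `temperature`):

```r
library(dplyr)

# Before imputing `temperature`, check whether its missing values
# cluster in particular sensors rather than occurring at random.
sensor_data |>
  group_by(sensor_id) |>
  summarise(n = n(),
            n_missing = sum(is.na(temperature)),
            pct_missing = mean(is.na(temperature)))
```

If the missing values concentrate in a few sensors, the missingness is likely not random, and mean or median imputation would paper over a real device problem rather than fix a data problem.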
AI is best used as an accelerator for data preparation, not a substitute for understanding your data. The dplyr, tidyr, and cleaning techniques covered in this chapter give you the vocabulary to evaluate and modify AI-generated cleaning code — and to know when the AI has made a mistake.
Example: AI-Assisted Data Cleaning
Prompt to Claude Code:
The temperature column has some NA values and the humidity column has negative values that shouldn't be possible. Clean both issues.
AI-generated code:
sensor_data_clean <- sensor_data |>
  mutate(temperature = ifelse(is.na(temperature),
                              median(temperature, na.rm = TRUE),
                              temperature)) |>
  filter(humidity >= 0 | is.na(humidity))

What the student should verify: The AI chose median imputation (reasonable for potentially skewed sensor data) and filtered out negative humidity values while preserving NAs. But is filtering the right choice, or should negative values be investigated first? Could they indicate a sensor malfunction that affects other columns too? The AI made defensible choices, but the analyst must decide whether they are appropriate for this context.
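Before accepting the AI's `filter()` step, a quick investigation along these lines would show whether the negative readings are isolated glitches or a systematic fault. This is a sketch only; the `sensor_id` and `timestamp` columns are assumed to exist in `sensor_data`.

```r
library(dplyr)

# How many negative humidity readings are there, and how extreme?
sensor_data |>
  filter(humidity < 0) |>
  summarise(n = n(),
            min_humidity = min(humidity))

# Do the same rows look suspicious in other columns too?
sensor_data |>
  filter(humidity < 0) |>
  select(sensor_id, timestamp, temperature, humidity)
```

If the negative readings all come from one sensor, or coincide with anomalous temperatures, dropping the rows wholesale may discard evidence of a malfunction that also contaminates rows the filter keeps.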