5.7 Data Cleaning Techniques
Even after structuring data into tidy format, real-world datasets typically contain missing values and duplicate records that must be addressed before analysis. We demonstrate these techniques using a simulated sensor dataset with intentionally introduced problems.
5.7.1 Creating a Sample DataFrame
# Creating a sample DataFrame with missing values and duplicates
set.seed(123) # For reproducibility
sensor_data <- data.frame(
  timestamp = seq(as.POSIXct("2024-01-01"), by = "hour", length.out = 20),
  temperature = rnorm(20, mean = 20, sd = 5),
  humidity = rnorm(20, mean = 50, sd = 10)
)
# Introducing missing values randomly
sensor_data[sample(1:20, 5), "temperature"] <- NA
sensor_data[sample(1:20, 5), "humidity"] <- NA
# Adding duplicate rows
sensor_data <- rbind(sensor_data, sensor_data[sample(1:20, 5), ])
sensor_data

This sensor_data frame contains hourly temperature and humidity readings with randomly introduced NA values and duplicate rows: a realistic scenario for any BI dataset.
5.7.2 Handling Missing Values
The two main strategies for missing data are removal and imputation. The right choice depends on why the data is missing:
- Missing completely at random (MCAR): The missingness has no relationship to the data. Example: a sensor fails at random. Either removal or imputation is reasonable.
- Missing not at random (MNAR): The missingness is related to the missing value itself. Example: high-income individuals decline to report their salary. Removal introduces bias; imputation requires careful modeling.
In practice, most missing data falls somewhere between these extremes (often called missing at random, or MAR, where the missingness depends on other observed variables). The key is to investigate the pattern of missingness before choosing a strategy; never assume that drop_na() or mean imputation is safe without checking.
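A quick first check is to count the NA values in each column. For the sensor_data frame built above:

```r
# Count missing values in each column of sensor_data
colSums(is.na(sensor_data))
```

Columns with a large share of missing values may call for imputation, or for investigating the sensor itself, rather than simply dropping rows.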
5.7.2.1 Removing Rows with Missing Values
drop_na() (from tidyr) removes any row containing at least one NA. This is appropriate when the proportion of missing data is small and the missingness appears random.
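Applied to the sensor data:

```r
library(tidyr)

# Drop every row that has an NA in any column
sensor_data_complete <- sensor_data |> drop_na()
nrow(sensor_data_complete)
```

By default every column participates in the check; to consider only specific columns, pass them as arguments, e.g. drop_na(temperature).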
5.7.2.2 Replacing Missing Values
Alternatively, missing values can be replaced (imputed) with a summary statistic such as the mean or median, which preserves the number of rows in the dataset.
# Replacing missing temperature values with the column mean
mean_temp <- mean(sensor_data$temperature, na.rm = TRUE)
sensor_data_clean <- sensor_data |>
  mutate(
    temperature_mean = ifelse(is.na(temperature), mean_temp, temperature)
  )
sensor_data_clean

In this case, missing temperature values are replaced with the column’s mean temperature, computed while ignoring NAs with na.rm = TRUE. For skewed distributions, the median is often a better choice than the mean, as it is less influenced by extreme values. Replace mean() with median() in the code above to use median imputation instead.
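5.7.3 Removing Duplicate Rows
The duplicate rows introduced in 5.7.1 are exact copies of existing observations, so a common approach is dplyr's distinct(), which keeps only the first occurrence of each unique row:

```r
library(dplyr)

# Remove exact duplicate rows, keeping the first occurrence of each
sensor_data_dedup <- sensor_data |> distinct()
nrow(sensor_data_dedup) # back to the original 20 observations
```

When only certain columns define a duplicate (here, the timestamp), pass them to distinct() and add .keep_all = TRUE to retain the remaining columns.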
5.7.4 Cleaning Text Data
Messy text data — inconsistent capitalization, trailing whitespace, typos in category names — is extremely common in business datasets. The stringr package (loaded with the tidyverse) provides functions for common string operations:
# Sample messy text data
departments <- c(" Sales", "sales", "SALES", "Marketing ", "marketing")
# Trim whitespace and convert to lowercase
departments |> str_trim() |> str_to_lower()

## [1] "sales"     "sales"     "sales"     "marketing" "marketing"
Other useful stringr functions include str_replace() for find-and-replace operations and str_detect() for filtering rows that match a text pattern.
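As a brief illustration, using a hypothetical vector of category labels:

```r
library(stringr)

# Hypothetical category labels
labels <- c("prod-101", "prod-102", "svc-201")

# Find-and-replace the first match of a pattern in each string
str_replace(labels, "prod-", "product-")

# Detect which elements match a pattern (useful inside filter())
str_detect(labels, "^svc")
```

str_detect() returns a logical vector, so it slots directly into dplyr's filter() to keep only the rows whose text matches a pattern.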