3.9 Working with Data Frames, Lists, and Formulas

3.9.1 Loading and Saving Data

R provides functions to read data from and write data to external files. The most common format is CSV.

# Load data from a CSV file
data <- read.csv("data.csv")

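# Save data to a CSV file
# row.names = FALSE leaves out the automatic row-number column,
# which would otherwise come back as an extra column when the file is re-read
write.csv(data, "output.csv", row.names = FALSE)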

In later chapters, we use read_delim() and related functions from the readr package (part of the tidyverse), which offer more control over how data is parsed. The base R functions shown here work well for simple cases.
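As a quick check that reading and writing behave as described, the following sketch writes a small data frame to a temporary file and reads it back. The `sales` data frame and its column names are made up for illustration; only `read.csv()`, `write.csv()`, and `tempfile()` are base R functions.

```r
# Hypothetical example data; the column names are illustrative only
sales <- data.frame(
  region = c("North", "South"),
  revenue = c(1200.5, 980)
)

# tempfile() returns a throwaway path, so the working directory stays clean
path <- tempfile(fileext = ".csv")

# Write without the row-name column, then read the file back
write.csv(sales, path, row.names = FALSE)
sales_back <- read.csv(path)

# Inspect the column names and types that read.csv() inferred
str(sales_back)
```

Calling `str()` right after loading is a cheap way to confirm that each column was parsed with the type you expected.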

3.9.2 Factors

Factors represent categorical data in R: variables that take one of a fixed set of values, called levels. They matter for statistical modeling because R treats factors differently from character vectors. For example, regression functions such as lm() automatically convert factors to indicator (dummy) variables.

# Create a factor
gender <- factor(c("Male", "Female", "Male", "Male"))
print(gender)
## [1] Male   Female Male   Male  
## Levels: Female Male
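To make the link between factors and indicator variables concrete, the sketch below inspects the same `gender` factor. It uses only base R functions; `gender2` is a name introduced here for illustration.

```r
gender <- factor(c("Male", "Female", "Male", "Male"))

# Levels are sorted alphabetically by default
levels(gender)

# Count the observations in each level
table(gender)

# Internally, each value is stored as an integer index into the levels
as.integer(gender)

# model.matrix() shows the indicator coding a regression would use;
# the first level ("Female") acts as the reference category
model.matrix(~ gender)

# Setting the level order explicitly changes the reference category
gender2 <- factor(gender, levels = c("Male", "Female"))
levels(gender2)
```

The design matrix contains an intercept and a single `genderMale` column, because a factor with two levels needs only one indicator.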

3.9.3 Data Frames

Data frames are R’s primary structure for tabular data. A data frame resembles a spreadsheet: each column is a vector, all columns have the same length, and different columns may hold different types. Most data you work with in BI will be stored in data frames.

# Create a data frame
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  salary = c(50000, 60000, 70000)
)
print(df)
##      name age salary
## 1   Alice  25  50000
## 2     Bob  30  60000
## 3 Charlie  35  70000
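Because each column is a vector, the usual vector operations apply to the columns of a data frame. The following sketch shows a few common ways to inspect and subset `df`. The `older` variable, the age threshold, and the 10% raise are arbitrary choices made for illustration.

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  salary = c(50000, 60000, 70000)
)

# Inspect the column types and the dimensions
str(df)
nrow(df)
ncol(df)

# Extract a single column as a vector and summarize it
df$age
mean(df$salary)

# Select rows by a condition and columns by name
older <- df[df$age > 28, c("name", "salary")]
print(older)

# Add a derived column (illustrative 10% raise)
df$salary_raised <- df$salary * 1.10
```

The pattern `df[rows, columns]` is the base R way to filter and select. The tidyverse offers alternatives, but this form works without extra packages.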

3.9.4 Lists

Lists can hold elements of different data types — vectors, data frames, and even other lists. They are useful for organizing complex results (e.g., the output of a statistical model is typically a list).

# Create a list
my_list <- list(
  name = "John",
  age = 30,
  hobbies = c("reading", "playing guitar"),
  address = data.frame(street = "123 Main St", city = "Anytown")
)
print(my_list)
## $name
## [1] "John"
## 
## $age
## [1] 30
## 
## $hobbies
## [1] "reading"        "playing guitar"
## 
## $address
##        street    city
## 1 123 Main St Anytown
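Elements of a list can be retrieved by name or by position. A short sketch of the access operators, reusing `my_list` from above, follows. The `email` element is a hypothetical addition for illustration.

```r
my_list <- list(
  name = "John",
  age = 30,
  hobbies = c("reading", "playing guitar"),
  address = data.frame(street = "123 Main St", city = "Anytown")
)

# $ and [[ ]] return the element itself
my_list$name
my_list[["hobbies"]][2]

# Single brackets return a sub-list, not the element
class(my_list["age"])
class(my_list[["age"]])

# Nested structures are reached by chaining the operators
my_list$address$city

# Assigning to a new name adds an element (hypothetical value)
my_list$email <- "john@example.com"
names(my_list)
```

The difference between `[ ]` and `[[ ]]` is a frequent source of confusion: the former keeps the list wrapper, the latter unwraps it.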

3.9.5 Formulas

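Formulas specify relationships between variables using the ~ symbol. They are used extensively in statistical modeling: the left side is the response variable, and the right side lists the predictors, separated by +. A formula only describes a model; nothing is computed until it is passed to a modeling function.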

# Create a formula: mpg depends on cyl and disp
my_formula <- mpg ~ cyl + disp
print(my_formula)
## mpg ~ cyl + disp
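A formula becomes useful once it is handed to a modeling function together with a data set. The sketch below assumes the variables refer to columns of `mtcars`, a data set that ships with R. The text above does not name the data set, so treat that as an assumption. The name `fit` is illustrative.

```r
my_formula <- mpg ~ cyl + disp

# A formula is its own object type and stores variable names unevaluated
class(my_formula)
all.vars(my_formula)

# Fit a linear regression; lm() looks up the variables in `data`
# (mtcars is assumed here because it contains mpg, cyl, and disp)
fit <- lm(my_formula, data = mtcars)

# One coefficient per predictor, plus the intercept
coef(fit)

# The fitted model is itself a list, so its parts can be accessed with $
is.list(fit)
names(fit)
```

Storing the formula in a variable, as done here, makes it easy to reuse the same model specification across several functions or data sets.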