Introduction to data frames, a widely used data structure in R.
How to create, access, and modify data frames.
Practical examples of working with business datasets in data frames.
1. Introduction to Data Frames
A data frame is a two-dimensional data structure in R where:
Each column represents a different variable, and columns can store different data types (numeric, character, logical, etc.).
Each row represents an observation or record.
Data frames are similar to tables in a spreadsheet or SQL database, making them ideal for business data analysis.
Key Uses of Data Frames:
Organizing datasets for analysis.
Storing business data such as sales records, employee details, customer information, etc.
Data cleaning and transformation for analysis.
2. Creating Data Frames
You can create a data frame using the data.frame() function, where each argument represents a column in the data frame.
Example: Creating a Sales Data Frame
# Creating a data frame for sales datasales_data <-data.frame(Product =c("A", "B", "C", "D", "E"),Sales_Q1 =c(120, 150, 90, 100, 130),Sales_Q2 =c(170, 200, 140, 180, 190))sales_data
Product Sales_Q1 Sales_Q2
1 A 120 170
2 B 150 200
3 C 90 140
4 D 100 180
5 E 130 190
Explanation: In this example, sales_data is a data frame where:
The Product column contains product names.
Sales_Q1 and Sales_Q2 columns contain sales figures for the first and second quarters.
Checking the Structure of a Data Frame
You can inspect the structure of a data frame using the str() function, which shows the types of each column.
str(sales_data)
'data.frame': 5 obs. of 3 variables:
$ Product : chr "A" "B" "C" "D" ...
$ Sales_Q1: num 120 150 90 100 130
$ Sales_Q2: num 170 200 140 180 190
Explanation: This code reveals the structure of sales_data, showing that Product is a character column, while Sales_Q1 and Sales_Q2 are numeric columns.
3. Accessing Data in a Data Frame
Accessing Columns in a Data Frame
You can access individual columns in a data frame using the $ operator or square brackets.
Example: Accessing a Single Column
# Accessing the Sales_Q1 columnsales_data$Sales_Q1
[1] 120 150 90 100 130
Explanation: This retrieves the Sales_Q1 column from the sales_data data frame.
Example: Accessing Multiple Columns
# Accessing the Product and Sales_Q2 columnssales_data[, c("Product", "Sales_Q2")]
Product Sales_Q2
1 A 170
2 B 200
3 C 140
4 D 180
5 E 190
Explanation: This extracts the Product and Sales_Q2 columns, showing both in the output.
Accessing Rows in a Data Frame
You can access rows by specifying row indices using square brackets [row, ].
Example: Accessing a Specific Row
# Accessing the first row of the data framesales_data[1, ]
Product Sales_Q1 Sales_Q2
1 A 120 170
Explanation: This code retrieves the first row, showing all the details for Product A in the first quarter (Q1) and second quarter (Q2).
4. Modifying Data in a Data Frame
You can update specific elements, rows, or columns of a data frame by assigning new values.
Example: Modifying an Element in a Data Frame
# Changing the sales value for Product B in Q2sales_data[2, "Sales_Q2"] <-210sales_data
Product Sales_Q1 Sales_Q2
1 A 120 170
2 B 150 210
3 C 90 140
4 D 100 180
5 E 130 190
Explanation: This updates the sales figure for Product B in the second quarter from 200 to 210.
Example: Adding a New Column
# Adding a new column for Sales in Q3sales_data$Sales_Q3 <-c(190, 220, 150, 200, 210)sales_data
Product Sales_Q1 Sales_Q2 Sales_Q3
1 A 120 170 190
2 B 150 210 220
3 C 90 140 150
4 D 100 180 200
5 E 130 190 210
Explanation: This adds a new column, Sales_Q3, to the sales_data data frame with sales figures for the third quarter.
5. Practical Example: Analyzing Sales Data
Let’s calculate the total sales across all three quarters for each product by creating a new column.
Example: Calculating Total Sales
# Creating a new column for total sales across all quarterssales_data$Total_Sales <- sales_data$Sales_Q1 + sales_data$Sales_Q2 + sales_data$Sales_Q3sales_data
Product Sales_Q1 Sales_Q2 Sales_Q3 Total_Sales
1 A 120 170 190 480
2 B 150 210 220 580
3 C 90 140 150 380
4 D 100 180 200 480
5 E 130 190 210 530
Explanation: This code adds a new column Total_Sales that sums up sales from Sales_Q1, Sales_Q2, and Sales_Q3 for each product.
6. Sorting and Filtering Data in a Data Frame
Sorting Data Frames
You can sort a data frame by one or more columns using the order() function.
Example: Sorting by Total Sales in Descending Order
# Sorting the data frame by Total_Sales in descending ordersales_data_sorted <- sales_data[order(-sales_data$Total_Sales), ]sales_data_sorted
Product Sales_Q1 Sales_Q2 Sales_Q3 Total_Sales
2 B 150 210 220 580
5 E 130 190 210 530
1 A 120 170 190 480
4 D 100 180 200 480
3 C 90 140 150 380
Explanation: This code sorts the sales_data by Total_Sales in descending order, with the highest sales appearing first.
Filtering Data Frames
You can filter rows of a data frame based on specific conditions.
Example: Filtering Products with Sales Greater Than 150 in Q1
# Filtering products with Sales_Q1 greater than 150high_sales_q1 <- sales_data[sales_data$Sales_Q1 >150, ]high_sales_q1
Explanation: This code filters the sales_data to show only the products with sales greater than 150 in the first quarter.
Key Takeaways
Data frames are one of the most important data structures in R for handling datasets with different types of data in columns.
You can easily create, access, modify, sort, and filter data frames in R.
Data frames are widely used in business analytics for organizing, analyzing, and transforming data.
Looking Forward
In the next lecture, we’ll explore lists and factors, two other important data structures in R, and how to work with categorical data and complex collections of data.