6 Case Study: Simple Linear Regression

PRELIMINARY AND INCOMPLETE

6.1 Introduction

In this case study, we will explore how simple linear regression can be used to analyze car market data. Specifically, we aim to model the relationship between a vehicle’s horsepower and its energy efficiency, measured in miles per gallon (MPG). This analysis can provide valuable insights into optimizing vehicle performance, balancing the trade-offs between power and fuel economy, and understanding how engine characteristics impact energy efficiency.

6.2 Objective

The primary objective of this case study is to demonstrate how to use simple linear regression to predict energy efficiency (MPG) based on horsepower. Through this process, we will:
- Introduce the fundamental concepts of regression analysis.
- Highlight the steps involved in building, diagnosing, and refining a regression model.
- Investigate potential outliers and assess their impact on the model’s validity.

6.3 Workflow of Regression Analysis

The workflow for performing a regression analysis typically involves the following key steps:

Data Preparation: The first step is loading and preparing the dataset, which involves selecting relevant variables and handling missing or incomplete data. For this case study, we will focus on two key variables: horsepower (the independent variable) and energy efficiency (MPG) (the dependent variable).
Exploratory Data Analysis (EDA): Next, we conduct an initial examination of the data to understand relationships and potential trends. This includes creating scatter plots to visualize the relationship between the variables and identifying any potential outliers that could affect the analysis.
Model Building: Once the data has been explored, we move on to fitting a simple linear regression model. This involves estimating the coefficients of the regression equation, which quantifies the relationship between the independent and dependent variables.
Model Diagnostics: After building the model, it is essential to check its assumptions and evaluate its performance. This step involves analyzing residuals (the difference between observed and predicted values) to ensure that the model is appropriate. We will also check the normality of residuals to confirm that the model assumptions are met.
Model Refinement: If necessary, we refine the model by removing outliers, transforming variables, or considering alternative model structures to improve the fit and predictive power.
Interpretation and Reporting: Finally, we interpret the results of the regression model, summarizing the key findings and providing actionable insights. For this case study, we will determine how horsepower influences energy efficiency and highlight any outliers that may affect this relationship.

Through this case study, you will learn how to apply simple linear regression to real-world car market data, diagnose potential issues with the model, and interpret the results for informed decision-making.

6.4 Dataset Overview

The dataset used in this case study contains a mixture of both numeric and categorical variables, simulating realistic vehicle data that might be used by manufacturers and analysts. This data, while simulated, is designed to represent key factors in vehicle performance, customer preferences, and cost.

The variables in the dataset cover a wide range of attributes, from technical specifications such as horsepower and engine size to customer-related data like satisfaction ratings and cost of ownership. These variables allow for a detailed analysis of the relationships between various vehicle characteristics and outcomes such as energy efficiency.

Below is an overview of the relevant variables for this case study:

Variable Name	Description	Data Type	Categorical Type
`Brand`	Manufacturer of the vehicle (e.g., Tesla, Toyota, Ford)	Categorical	Nominal
`Model`	Specific model of the vehicle (e.g., Camry, Model 3)	Categorical	Nominal
`Trim`	Version or package level of the vehicle model	Categorical	Nominal
`Trim_Level`	Luxury or feature set of the vehicle (Base, Medium, Premium)	Categorical	Ordinal
`Style`	Body type of the vehicle (Sedan, SUV, Pickup)	Categorical	Nominal
`Size`	Size class of the vehicle (Compact, Midsize, Full-size)	Categorical	Ordinal
`MSRP (USD)`	Manufacturer’s Suggested Retail Price in U.S. dollars	Numeric	-
`Energy Efficiency (MPG)`	Fuel efficiency measured in miles per gallon (MPG)	Numeric	-
`Horsepower`	Power output of the vehicle, measured in horsepower (HP)	Numeric	-
`Engine Size (L)`	Engine displacement measured in liters	Numeric	-
`Customer Rating`	Customer satisfaction rating (out of 5 stars)	Numeric	-
`Safety Rating`	Safety rating based on crash test performance (1-5 stars)	Numeric	-
`Hybrid`	Whether the vehicle is a hybrid (Hybrid/Non-Hybrid)	Categorical	Nominal
`Electric`	Whether the vehicle is fully electric (Electric/Non-Electric)	Categorical	Nominal
`Four_Wheel_Drive`	Indicates if the vehicle has 4WD (4WD/2WD)	Categorical	Nominal
`Sunroof`	Whether the vehicle has a sunroof (Yes/No)	Categorical	Nominal
`Bluetooth`	Whether the vehicle has Bluetooth connectivity (Yes/No)	Categorical	Nominal
`Backup Camera`	Whether the vehicle has a backup camera (Yes/No)	Categorical	Nominal
`Main Market`	The primary market where the vehicle is sold (North America/Europe)	Categorical	Nominal
`Average Annual Cost of Ownership (USD)`	Estimated total annual cost of owning the vehicle in USD	Numeric	-

Data Characteristics

Categorical Data: Variables such as Brand, Model, Trim, and Main Market are nominal, representing different categories without any inherent ranking. Others, like Trim_Level and Size, are ordinal, indicating an ordered relationship (e.g., Trim_Level ranging from Base to Premium).
Numeric Data: Variables like MSRP, Horsepower, and Energy Efficiency (MPG) are continuous, representing quantifiable values essential for the analysis of vehicle performance and customer costs.
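The nominal/ordinal distinction can be made explicit in R with factors. The sketch below is illustrative only: the two vectors are a hypothetical mini-sample built from the category labels in the table above, not rows read from the actual file.

```r
# Hypothetical mini-sample using category labels from the table above
trim_level <- c("Base", "Medium", "Premium", "Base", "Premium")
brand      <- c("Toyota", "Toyota", "Toyota", "Ford", "Ford")

# Nominal: an unordered factor (levels carry no ranking)
brand_f <- factor(brand)

# Ordinal: an ordered factor with levels listed from lowest to highest
trim_f <- factor(trim_level,
                 levels  = c("Base", "Medium", "Premium"),
                 ordered = TRUE)

is.ordered(brand_f)   # FALSE
is.ordered(trim_f)    # TRUE
trim_f < "Premium"    # comparisons are defined only for ordered factors
table(trim_f)         # counts are reported in level order
```

Declaring the level order matters because R otherwise sorts factor levels alphabetically, which would not match a ranking such as Compact, Midsize, Full-size.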

In the next section, we will prepare the dataset for analysis, focusing on selecting the relevant columns and cleaning the data to ensure a robust regression model.

6.5 Data Loading and Preparation

In this section, we will load the necessary packages, import the dataset, and clean the data to prepare it for analysis. The dataset contains vehicle characteristics, including horsepower and energy efficiency, which will be the primary variables for our regression analysis. Additionally, we will leverage summary statistics to ensure that the dataset is properly structured and ready for modeling.

Note: Readers can download the dataset used in this analysis from https://ljkelly3141.github.io/real-world-statistics-with-r/data/car_price.xlsx.

6.5.1 Loading Necessary Packages

Before loading and cleaning the data, we need to load the required R packages for data handling and statistical analysis (if they are not yet installed, install them once with install.packages()). The readxl package is used to read Excel files, while the psych package provides a variety of functions for data summary and descriptive statistics.

# Load necessary libraries
library(readxl)  # For reading Excel files
library(psych)   # For descriptive statistics

Breaking Down the Code

library(readxl): Loads the readxl package, which allows us to read Excel files into R. This is crucial for importing the dataset, as our data is stored in an Excel format.
library(psych): Loads the psych package, which provides powerful functions for descriptive statistics, such as the describe() function, which gives a detailed summary of the dataset, including means, standard deviations, and skewness.

6.5.2 Loading and Cleaning Data

Once the packages are loaded, we proceed with importing the car dataset from an Excel file. After loading the data, we will clean it by selecting only the relevant columns (in this case, brand, model, horsepower, and energy efficiency) and removing any rows with missing values. Additionally, we will generate summary statistics to ensure the data is ready for analysis.

# Load the dataset from an Excel file
car_data <- read_excel("data/car_price.xlsx")

# Preview the first few rows of the dataset to understand its structure
head(car_data)

Brand	Model	Trim	Trim Level	Style	Size	MSRP (USD)	Energy Efficiency (MPG)	Horsepower	Engine Size (L)	Customer Rating	Safety Rating	Hybrid	Electric	Four_Wheel_Drive	Sunroof	Bluetooth	Backup_Camera	Main Market	Average Annual Cost of Ownership (USD)
Toyota	Camry	LE	Base	Sedan	Midsize	2.9e+04	32	203	2.5	4.5	5	Non-Hybrid	Non-Electric	2WD	Sunroof	Bluetooth	Backup Camera	North America	6.2e+03
Toyota	Camry	XSE	Medium	Sedan	Midsize	3.4e+04	31	301	3.5	4.7	5	Non-Hybrid	Non-Electric	2WD	Sunroof	Bluetooth	Backup Camera	North America	6.4e+03
Toyota	Camry	Hybrid	Premium	Sedan	Midsize	3.7e+04	50	208	2.5	4.8	5	Hybrid		2WD	Sunroof	Bluetooth	Backup Camera	North America	5.8e+03
Ford	F-150	XLT	Base	Pickup	Full-size	5.2e+04	20	290	3.3	4.4	5	Non-Hybrid		4WD		Bluetooth	Backup Camera	North America	9.1e+03
Ford	F-150	Lariat	Medium	Pickup	Full-size	6.1e+04	18	400	5	4.6	5	Non-Hybrid		4WD	Sunroof	Bluetooth	Backup Camera	North America	9.5e+03
Ford	F-150	Platinum	Premium	Pickup	Full-size	7.2e+04	18	400	5	4.8	5	Non-Hybrid		4WD	Sunroof	Bluetooth	Backup Camera	North America	9.8e+03

# Select relevant columns: 'Brand', 'Model', 'Horsepower', and 'Energy Efficiency (MPG)'
car_data <- car_data[, c("Brand", "Model", "Horsepower", "Energy Efficiency (MPG)")]

# Remove rows with missing values
car_data <- na.omit(car_data)

# Display summary statistics of the selected data
describe(car_data, skew = FALSE, omit = TRUE)

vars	n	mean	sd	median	min	max	range	se
3	44	259	84.8	255	115	420	305	12.8
4	44	33.2	22.1	29	16	130	114	3.33

Breaking Down the Code

car_data <- read_excel("data/car_price.xlsx"): Reads the Excel file car_price.xlsx and stores the data as a data frame in the variable car_data.
head(car_data): Displays the first few rows of the dataset to provide a preview of its structure and ensure that the data has been successfully loaded.
car_data <- car_data[, c("Brand", "Model", "Horsepower", "Energy Efficiency (MPG)")]: Selects the relevant columns (Brand, Model, Horsepower, and Energy Efficiency (MPG)) from the dataset for our analysis. This step ensures we focus on the necessary data attributes.
car_data <- na.omit(car_data): Removes any rows containing missing values to ensure that the dataset is complete and ready for analysis.
describe(car_data, skew = FALSE, omit = TRUE): This function from the psych package provides detailed descriptive statistics for the selected columns. It returns important metrics such as mean, standard deviation, and minimum and maximum values for numeric variables, which will help ensure data integrity.
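Before calling na.omit(), it can help to see how many values are missing and which rows would be dropped. The following is a minimal sketch under stated assumptions: sample_data is a hypothetical five-row data frame whose values are copied from the preview above, the column name MPG is a shortened stand-in for Energy Efficiency (MPG), and one Horsepower value has been deliberately set to NA for illustration.

```r
# Hypothetical mini-sample: rows copied from the preview above, with one
# Horsepower value deliberately replaced by NA to illustrate the check
sample_data <- data.frame(
  Brand      = c("Toyota", "Toyota", "Toyota", "Ford", "Ford"),
  Model      = c("Camry", "Camry", "Camry", "F-150", "F-150"),
  Horsepower = c(203, 301, 208, NA, 400),
  MPG        = c(32, 31, 50, 20, 18)
)

# Count missing values per column before dropping anything
colSums(is.na(sample_data))

# Show the rows that na.omit() would remove
sample_data[!complete.cases(sample_data), ]

# Drop incomplete rows and report how many were removed
clean <- na.omit(sample_data)
cat("Rows removed:", nrow(sample_data) - nrow(clean), "\n")
```

Because the chapter selects the four columns before calling na.omit(), only gaps in those four columns cause a row to be dropped; blanks in unrelated columns do not shrink the sample.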

Next, we will perform exploratory data analysis (EDA) to visualize the relationships between horsepower and energy efficiency, and identify any potential outliers.

6.6 Exploratory Data Analysis (EDA)

In this section, we will explore the relationship between horsepower and energy efficiency (MPG) through scatter plots. We will start by visualizing the raw data and then use the Tukey method to identify and highlight outliers. Afterward, we will remove the outliers and replot the data to reassess the relationship.

6.6.1 Initial Scatter Plot

We begin by creating a scatter plot to visualize the relationship between horsepower and energy efficiency (MPG) for the vehicles in the dataset.

# Scatter plot of horsepower vs energy efficiency
plot(car_data$Horsepower, car_data$`Energy Efficiency (MPG)`, 
     main = "Scatter Plot of Horsepower vs Energy Efficiency",
     xlab = "Horsepower (HP)",
     ylab = "Miles Per Gallon (MPG)",
     pch = 19, col = "blue")

Breaking Down the Code

plot(car_data$Horsepower, car_data$`Energy Efficiency (MPG)`): Creates a scatter plot showing the relationship between horsepower and energy efficiency (MPG). Each point on the plot represents a vehicle, with horsepower on the x-axis and energy efficiency (MPG) on the y-axis. The points are colored blue to distinguish them from the outliers that will be highlighted in the next step.

6.6.2 Identifying and Highlighting Outliers (Tukey Method)

Next, we use the Tukey method to identify potential outliers in the Energy Efficiency (MPG) variable. The Tukey method flags values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, specifically below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

# Calculate IQR and Tukey outlier thresholds
iqr <- IQR(car_data$`Energy Efficiency (MPG)`)
q1 <- quantile(car_data$`Energy Efficiency (MPG)`, 0.25)
q3 <- quantile(car_data$`Energy Efficiency (MPG)`, 0.75)
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Identify outliers
outliers_indices <- car_data$`Energy Efficiency (MPG)` < lower_bound | 
                    car_data$`Energy Efficiency (MPG)` > upper_bound

# Scatter plot with outliers highlighted
plot(car_data$Horsepower, car_data$`Energy Efficiency (MPG)`, 
     main = "Scatter Plot of Horsepower vs Energy Efficiency (Outliers Highlighted)",
     xlab = "Horsepower (HP)",
     ylab = "Miles Per Gallon (MPG)",
     pch = 19, col = "blue")

# Highlight outliers in red
points(car_data$Horsepower[outliers_indices], 
       car_data$`Energy Efficiency (MPG)`[outliers_indices], 
       col = "red", 
       pch = 19)

Breaking Down the Code

iqr <- IQR(car_data$`Energy Efficiency (MPG)`): Computes the interquartile range (IQR) for energy efficiency (MPG), which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
q1 <- quantile(car_data$`Energy Efficiency (MPG)`, 0.25): Computes the 25th percentile (Q1) of energy efficiency (MPG) values.
q3 <- quantile(car_data$`Energy Efficiency (MPG)`, 0.75): Computes the 75th percentile (Q3) of energy efficiency (MPG) values.
lower_bound <- q1 - 1.5 * iqr: Defines the lower threshold for outliers as Q1 minus 1.5 times the IQR. Any value below this threshold is considered a potential low outlier.
upper_bound <- q3 + 1.5 * iqr: Defines the upper threshold for outliers as Q3 plus 1.5 times the IQR. Any value above this threshold is considered a potential high outlier.
outliers_indices <- car_data$`Energy Efficiency (MPG)` < lower_bound | car_data$`Energy Efficiency (MPG)` > upper_bound: This line creates a logical vector, outliers_indices, that identifies whether each vehicle’s energy efficiency (MPG) is an outlier. The OR operator (|) checks if the value is either below the lower bound or above the upper bound. If either condition is met, the corresponding row is marked as TRUE (outlier), otherwise FALSE.
points(): This function is used to highlight the outliers identified by the Tukey method in red on the scatter plot. Each red point represents a vehicle that has been classified as an outlier in terms of energy efficiency.
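The Tukey rule is applied to one variable at a time, so the same steps can be wrapped in a small function and reused. This is a sketch only: tukey_outliers() is a hypothetical helper that is not part of the chapter's code, and R's built-in mtcars data (columns mpg and hp) stands in for the chapter's dataset.

```r
# Hypothetical helper (not from the chapter): returns a logical vector that
# is TRUE where x lies outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence_width <- k * (q[[2]] - q[[1]])   # k times the IQR
  x < q[[1]] - fence_width | x > q[[2]] + fence_width
}

# Stand-in data: built-in mtcars, not the chapter's car_price.xlsx
flag_mpg <- tukey_outliers(mtcars$mpg)
flag_hp  <- tukey_outliers(mtcars$hp)

# Inspect the rows flagged on each variable
mtcars[flag_mpg, c("mpg", "hp")]
mtcars[flag_hp,  c("mpg", "hp")]

# Small check: only the value 100 lies outside the fences of 1, ..., 9, 100
tukey_outliers(c(1:9, 100))
```

Keeping the multiplier k as an argument makes it easy to try a stricter rule (for example k = 3) without rewriting the threshold code.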

6.6.3 Identifying Outlier Car Makes and Models

After identifying the outliers in the dataset, it is useful to investigate which car makes and models are classified as outliers based on their Energy Efficiency (MPG) values. By looking at these vehicles, we can determine if certain types of cars (e.g., electric vehicles or high-performance cars) are consistently being flagged as outliers.

# Display the makes and models of the cars identified as outliers
car_data[outliers_indices, ]

Brand	Model	Horsepower	Energy Efficiency (MPG)
Toyota	Camry	208	50
Tesla	Model 3	258	125
Tesla	Model 3	346	130
BMW	X5	389	50

Breaking Down the Code

car_data[outliers_indices, ]: This command extracts the rows from the dataset where the outliers were identified. The resulting data frame contains the brand, model, horsepower, and energy efficiency values of the vehicles that were flagged as outliers.
Interpretation: All four flagged vehicles lie above the upper Tukey bound; none falls below the lower bound. The two Tesla Model 3 entries stand out with extremely high energy efficiency (125 and 130 MPG), while the Toyota Camry (208 HP, matching the Hybrid trim in the data preview) and the BMW X5 are flagged at 50 MPG, which is high relative to the rest of the dataset but far less extreme. Note that the Tukey rule used here looks only at MPG; it does not take horsepower into account.

By examining the brands and models of the outlier vehicles, we can identify whether certain types of cars, such as electric vehicles (like the Tesla Model 3), might skew the analysis due to their different energy efficiency characteristics compared to traditional gasoline vehicles.

6.6.4 Strategies for Dealing with Outliers

Outliers can have a significant impact on the results of a regression analysis. They may skew the relationship between the variables, distort model parameters, or influence the overall interpretation of the data. When faced with outliers, it is essential to carefully consider how to handle them to ensure robust and meaningful results. Below are several strategies for dealing with outliers in a dataset like the one we are analyzing:

Remove Outliers (Trimming the Data): One common approach is to remove the outliers from the dataset. This method is appropriate when the outliers are extreme and clearly result from data entry errors or represent rare, atypical cases that do not fit the general trend.

Advantages:
- Simplifies the dataset by focusing only on the central trend.
- Can improve model accuracy by reducing the influence of extreme values.
Disadvantages:
- Risk of losing valuable information if the outliers represent valid but rare cases.
- Removal of outliers could bias the results if the outliers are systematically related to the outcome of interest.
In this case, removing vehicles such as the Tesla Model 3 (which has very high MPG) might give a clearer view of the relationship between horsepower and energy efficiency for traditional gasoline-powered vehicles.

Transform the Data: Another strategy is to apply a mathematical transformation to the data, such as a log transformation, to reduce the influence of outliers. This can be particularly useful when the relationship between variables is non-linear or when the distribution of the data is skewed.

Advantages:
- Retains all data points, including outliers.
- Can make the relationship between variables more linear, improving the fit of a regression model.
Disadvantages:
- Interpretation of transformed variables can be less intuitive.
- Some outliers may still exert significant influence, even after transformation.
For example, applying a log transformation to energy efficiency (MPG) might reduce the impact of electric vehicles like the Tesla Model 3, which have much higher efficiency compared to gasoline-powered vehicles.

Winsorize the Data: Winsorization is a technique where extreme values are replaced by the nearest non-outlier values. This method is useful when you want to limit the influence of extreme values without completely removing the data points.

Advantages:
- Preserves all observations, including those classified as outliers.
- Reduces the influence of extreme outliers on the analysis.
Disadvantages:
- Artificially alters the data by changing the values of extreme points.
- May reduce the richness of the data, especially if outliers are meaningful observations.
In our case, we could Winsorize the energy efficiency values by capping the highest MPG at a more reasonable level (e.g., at the upper bound defined by the Tukey method).

Analyze Outliers Separately: In some cases, it may be valuable to treat the outliers as a separate group for analysis. This approach allows you to preserve the insights from the outliers without allowing them to distort the overall analysis.

Advantages:
- Preserves the integrity of the dataset while still accounting for outliers.
- Allows for insights specific to the outlier group, which might reveal interesting patterns.
Disadvantages:
- Adds complexity to the analysis, as separate models or analyses may be required for different groups.
- Does not address how outliers interact with the rest of the data.
For instance, it may be useful to analyze electric vehicles, like the Tesla Model 3, separately from gasoline-powered vehicles, as their energy efficiency characteristics are fundamentally different.

Use Robust Regression Techniques: Robust regression methods, such as quantile regression or M-estimation (for example, Huber regression), can reduce the influence of outliers without the need to remove them or transform the data.

Advantages:
- No need to remove or alter data points.
- Provides reliable estimates even in the presence of outliers.
Disadvantages:
- Can be more complex to implement and interpret.
- May still not fully address the issue if there is a large number of outliers.
Robust regression techniques could allow us to model the relationship between horsepower and energy efficiency without the need to remove or Winsorize the extreme MPG values.
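
Three of the strategies above (Winsorizing, transforming, and robust regression) can be sketched in a few lines of R. This is an illustration under stated assumptions, not the chapter's method: the built-in mtcars data stands in for the chapter's dataset, the Tukey fences are used as the Winsorizing caps, and robust regression is done with rlm() from the MASS package, which is a recommended package bundled with most R installations.

```r
library(MASS)  # provides rlm(); assumed to be installed (bundled with most R setups)

# Stand-in data: built-in mtcars (mpg, hp), not the chapter's dataset
cars <- mtcars[, c("mpg", "hp")]

# Winsorizing: cap mpg at the Tukey fences instead of dropping rows
q     <- quantile(cars$mpg, c(0.25, 0.75))
fence <- 1.5 * (q[[2]] - q[[1]])
cars$mpg_wins <- pmin(pmax(cars$mpg, q[[1]] - fence), q[[2]] + fence)

# Transformation: model log(mpg) so that very large values are compressed
fit_log <- lm(log(mpg) ~ hp, data = cars)

# Robust regression: Huber M-estimation down-weights large residuals
fit_ols    <- lm(mpg ~ hp, data = cars)
fit_robust <- rlm(mpg ~ hp, data = cars)

# Compare intercepts and slopes for the fits that share the mpg scale
rbind(
  ols        = coef(fit_ols),
  winsorized = coef(lm(mpg_wins ~ hp, data = cars)),
  robust     = coef(fit_robust)
)
```

The log-scale fit is left out of the comparison table because its coefficients describe changes in log(mpg) and are not directly comparable with the other three.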

The choice of strategy for handling outliers depends on the nature of the data and the research objectives. In this case study, electric vehicles like the Tesla Model 3 are clear outliers due to their high energy efficiency (MPG). Depending on the focus of the analysis, we could remove these outliers to focus on traditional gasoline-powered vehicles, or we could use robust regression techniques to include all data points without letting the outliers distort the model. Ultimately, it is important to carefully consider the potential impact of outliers and choose an approach that aligns with the goals of the analysis. For this case study, we will remove the outliers. In later chapters, we will discuss other strategies.

6.6.5 Removing Outliers and Replotting

After identifying the outliers, we will remove them from the dataset and create a new scatter plot to reassess the relationship between horsepower and energy efficiency (MPG) without the influence of these extreme values.

# Remove the outliers from the dataset
car_data_clean <- car_data[!(outliers_indices), ]

# Replot the data without outliers
plot(car_data_clean$Horsepower, car_data_clean$`Energy Efficiency (MPG)`, 
     main = "Scatter Plot of Horsepower vs Energy Efficiency (Outliers Removed)",
     xlab = "Horsepower (HP)",
     ylab = "Miles Per Gallon (MPG)",
     pch = 19, col = "blue")

Breaking Down the Code

car_data_clean <- car_data[!(outliers_indices), ]: Creates a cleaned dataset, car_data_clean, by excluding the rows identified as outliers. The ! negates the logical vector, so only the non-outlier rows (FALSE values) are retained.
plot(): Replots the cleaned data to show the relationship between horsepower and energy efficiency (MPG) without the influence of the outliers.

6.6.6 Evaluating Linearity

With the outliers removed, the scatter plot provides a clearer view of the relationship between horsepower and energy efficiency (MPG). Visually, the points appear to have a roughly linear, downward-sloping relationship, though the spread of the points does not look constant, a pattern called heteroscedasticity. We will discuss heteroscedasticity in a later chapter; for now, we treat the linearity assumption as met.
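
A visual judgment of linearity can be supported by overlaying a smoother on the scatter plot and comparing it with a straight-line fit. The sketch below uses the built-in mtcars data as a stand-in for the cleaned dataset and adds a LOWESS curve, which the chapter itself does not use.

```r
# Stand-in data: built-in mtcars (hp, mpg), not the chapter's cleaned dataset
x <- mtcars$hp
y <- mtcars$mpg

plot(x, y,
     main = "Horsepower vs MPG with Linear and LOWESS Fits",
     xlab = "Horsepower (HP)",
     ylab = "Miles Per Gallon (MPG)",
     pch = 19, col = "blue")

# Straight-line fit (dashed) and locally weighted smoother (solid red)
abline(lm(y ~ x), lty = 2)
smooth <- lowess(x, y)
lines(smooth, col = "red", lwd = 2)

# Pearson correlation summarises the strength of the linear association
cor(x, y)
```

If the smoother stays close to the dashed line across the range of horsepower, a linear model is a reasonable starting point; a clear bend suggests trying a transformation or a non-linear term.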

In the next section, we will proceed to fit a linear regression model to confirm whether the relationship is linear and evaluate the strength of this relationship.

6.7 Simple Linear Regression Model

We will now fit a simple linear regression model to predict energy efficiency (MPG) using horsepower as the predictor. This analysis will help us understand the relationship between these two variables and quantify the impact of horsepower on a vehicle’s fuel efficiency.

# Fit the linear regression model
model <- lm(`Energy Efficiency (MPG)` ~ Horsepower, data = car_data_clean)

# Summary of the model
summary(model)


Call:
lm(formula = `Energy Efficiency (MPG)` ~ Horsepower, data = car_data_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-9.968 -1.966  0.683  1.933  9.241 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 41.270812   1.864643  22.133  < 2e-16 ***
Horsepower  -0.053695   0.006956  -7.719 2.68e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.689 on 38 degrees of freedom
Multiple R-squared:  0.6106,    Adjusted R-squared:  0.6003 
F-statistic: 59.58 on 1 and 38 DF,  p-value: 2.677e-09

Breaking Down the Code

lm(): This function fits a simple linear regression model where horsepower is the independent variable (predictor) and energy efficiency (MPG) is the dependent variable (response).
summary(model): This function provides detailed output from the fitted model, including the estimated regression coefficients, R-squared value, and p-values, which will help us interpret the model’s effectiveness.

The next step is to generate and display the regression equation that quantifies the relationship between horsepower and energy efficiency (MPG).

The estimated regression equation is: \[\hat{MPG} = 41.271 - 0.054 \cdot Horsepower\]

Breaking Down the Code

coef(model): Extracts the coefficients from the fitted model. The first value is the intercept, and the second value is the slope (change in MPG for each unit change in horsepower).
cat(): This function is used to output the regression equation in LaTeX format. The equation represents the relationship between horsepower and energy efficiency (MPG), with a negative slope indicating that as horsepower increases, MPG decreases.
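
The code that these notes describe is not displayed in this section. A minimal sketch consistent with the description, using coef() and cat(), might look like the following. It is an assumption about the general approach, not the author's actual chunk, and it fits the model on the built-in mtcars data, so the printed numbers will differ from the equation shown above.

```r
# Stand-in data: built-in mtcars, so the numbers differ from the chapter's fit
fit <- lm(mpg ~ hp, data = mtcars)

# coef() returns a named vector: intercept first, then the slope
b <- coef(fit)

# Build the LaTeX string; the sign is handled explicitly so that a negative
# slope is printed with a minus sign rather than as "+ -value"
equation <- sprintf("\\[\\hat{MPG} = %.3f %s %.3f \\cdot Horsepower\\]",
                    b[[1]],
                    ifelse(b[[2]] < 0, "-", "+"),
                    abs(b[[2]]))
cat(equation, "\n")
```

Rounding to three decimals with %.3f mirrors the precision of the equation displayed in this section.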

6.8 Model Diagnostics: Residual Analysis

After fitting the simple linear regression model, it is important to check whether the assumptions of linear regression hold. We begin by analyzing the residuals to ensure that the model is well-suited to the data.

A residual plot can help us determine if the residuals (the differences between observed and predicted values) are randomly distributed. Ideally, the residuals should scatter randomly around zero, indicating that the model’s predictions do not systematically deviate from the observed data. We can plot the residuals from the model and add a horizontal line at zero for reference:

# Plot the residuals
plot(model, which = 1)
abline(h = 0)

Breaking Down the Code

plot(model, which = 1): This command creates a residual plot, showing how the residuals (differences between observed and predicted values) are distributed relative to the fitted values. A well-fitting model should display residuals that scatter randomly around zero, without any clear patterns or trends.
abline(h = 0): Adds a horizontal line at zero to highlight deviations in the residuals. This reference line helps us assess whether the residuals are randomly distributed around zero, as expected in a good model fit.

The residual plot generated from the code shows how well the model captures the linear relationship between horsepower and energy efficiency (MPG). A well-fitting model would display no clear pattern or trend in the residuals, meaning they scatter randomly around zero. If the residuals exhibit a clear pattern (e.g., a curve or funnel shape), it may indicate problems with the model, such as non-linearity or heteroscedasticity.

Next, we check if the residuals follow a normal distribution. One of the assumptions of linear regression is that the residuals should be normally distributed. To evaluate this, we plot a histogram of the residuals:

# Plot a histogram of the residuals
hist(residuals(model), 
     main = "Histogram of Residuals", 
     xlab = "Residuals", 
     col = "lightblue", 
     breaks = 10)

Breaking Down the Code

hist(residuals(model)): This function generates a histogram of the residuals from the linear regression model. The residuals are the differences between observed and predicted values, and this histogram helps us check if they are normally distributed.
breaks = 10: Specifies that the histogram should have 10 bins. The number of bins controls the level of detail in the histogram and can help provide a clearer view of the residual distribution.
Interpretation: A bell-shaped curve in the histogram indicates that the residuals are normally distributed, which is one of the key assumptions of linear regression.

The histogram of residuals provides a visual check for normality. If the residuals are normally distributed, the histogram should form a bell-shaped curve. This would indicate that the residuals are symmetrically distributed around zero, which is important for ensuring valid statistical inferences from the model.

In summary, the residual plot should show random scatter around zero, suggesting that the model adequately captures the relationship between horsepower and energy efficiency without systematic bias. The histogram of residuals should approximate a bell-shaped curve, confirming the assumption of normality. If either of these diagnostics shows issues (e.g., patterns in the residuals or non-normality), further steps might be necessary, such as transforming the variables or considering a different model structure.
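
A histogram can be hard to read with a small sample, so a normal Q-Q plot and a formal test are common complements. The sketch below is not part of the chapter's workflow; it uses the built-in mtcars data as a stand-in for the cleaned dataset and relies only on base R functions (qqnorm(), qqline(), and shapiro.test()).

```r
# Stand-in data: built-in mtcars, not the chapter's cleaned dataset
fit <- lm(mpg ~ hp, data = mtcars)
res <- residuals(fit)

# Normal Q-Q plot: points close to the reference line suggest normality
qqnorm(res, main = "Normal Q-Q Plot of Residuals", pch = 19, col = "blue")
qqline(res, col = "red")

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value
```

With few observations the test has limited power, so it is best read alongside the plots rather than as a verdict on its own.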

6.9 Conclusion

In this analysis, we modeled the relationship between horsepower and energy efficiency (MPG) using a simple linear regression. The results indicate a clear negative relationship between the two variables: as horsepower increases, energy efficiency decreases.

6.9.1 Key Findings:

Relationship:
- The negative coefficient of horsepower (-0.0537) suggests that for each additional unit of horsepower, energy efficiency (MPG) decreases by approximately 0.054 miles per gallon on average.
- The p-value for the horsepower coefficient (2.68e-09) is far below conventional significance levels, providing strong evidence of a relationship between horsepower and energy efficiency.
- The R-squared value of 0.6106 indicates that approximately 61% of the variability in energy efficiency (MPG) is explained by horsepower, a moderate fit for a simple linear regression model.
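
To make the slope concrete, the reported coefficients can be plugged into the regression equation directly. The sketch below copies the estimates from the summary(model) output above; the horsepower values 200 and 300 are hypothetical inputs chosen for illustration, and the confidence interval uses the usual estimate plus or minus t times standard error formula.

```r
# Coefficients and standard error copied from the summary(model) output above
intercept <- 41.270812
slope     <- -0.053695
slope_se  <- 0.006956
df_resid  <- 38

# Predicted MPG at two illustrative (hypothetical) horsepower values
hp   <- c(200, 300)
pred <- intercept + slope * hp
pred                      # about 30.53 and 25.16 MPG

# Expected change in MPG for a 100 HP increase
slope * 100               # -5.3695

# Approximate 95% confidence interval for the slope: estimate +/- t * SE
t_crit <- qt(0.975, df_resid)
slope + c(-1, 1) * t_crit * slope_se
```

In words, two otherwise comparable vehicles that differ by 100 HP are predicted to differ by a little over 5 MPG, within the range of horsepower covered by the data.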

6.9.2 Addressing Outliers:

Outliers were identified and removed before fitting, so the model describes the relationship for the remaining 40 vehicles rather than for the full dataset.
The removed observations were all high-efficiency vehicles (e.g., the Tesla Model 3), so the fitted equation should not be used to describe them. Among the retained vehicles, the model captures much of the linear relationship, but some sizable deviations remain: the residuals range from about -10 to +9 MPG.

6.9.3 Next Steps:

Non-linear Models: Explore whether non-linear models (e.g., quadratic regression or polynomial models) might better capture the relationship between horsepower and energy efficiency, particularly for high-performance or electric vehicles.
Outlier Analysis: Investigate the characteristics of the outliers to understand if they represent unique types of vehicles, such as electric or hybrid models, that behave differently from traditional gasoline vehicles.
Refining the Model: Address issues revealed through residual analysis, including potential transformations of variables or alternative models, to improve fit and residual behavior.

6.10 Summary of Packages and Functions Used

In this case study, we used several R packages and functions for data manipulation, model fitting, and diagnostic checking. Below is a summary of the packages, functions, and key arguments used.

6.10.1 Packages Used

readxl: Used to read Excel files into R. Key function: read_excel(), which reads data from an Excel file.
psych: Provides tools for descriptive statistics. Key function: describe(), which generates descriptive statistics for the variables.

6.10.2 Key Functions Used

read_excel("data/car_price.xlsx"): This function reads the dataset from an Excel file into a data frame in R. The key argument is the path to the Excel file, "data/car_price.xlsx", which specifies where the data file is located.

na.omit(): This function removes rows with missing values from the dataset. The argument passed is the data frame, ensuring that any rows containing NA values are excluded from further analysis.

plot(): This function is used to create scatter plots for visualizing relationships between variables. The main arguments include x for the independent variable (e.g., horsepower), y for the dependent variable (e.g., energy efficiency (MPG)), main for the title of the plot, xlab and ylab for axis labels, pch to specify the plotting symbol (e.g., 19 for filled circles), and col to define the color of the points (e.g., "blue").

IQR(): This function calculates the interquartile range (IQR) of a numeric variable. The argument is a numeric vector (e.g., Energy Efficiency (MPG)), and the function returns the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

quantile(): This function computes quantiles of a numeric variable, which are used to determine thresholds for identifying outliers. The arguments include x for the numeric vector and probs for specifying the probability at which to calculate the quantiles (e.g., 0.25 for Q1 and 0.75 for Q3).
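The sketch below shows one way quantile() and IQR() can be combined into outlier thresholds. It assumes the common 1.5 × IQR rule and uses R's built-in mtcars data (column mpg) as a stand-in for the case-study dataset. The variable names are illustrative.

```r
# Stand-in data: built-in mtcars, not the case-study file
mpg <- mtcars$mpg

# First and third quartiles (names = FALSE returns plain numbers)
q1 <- quantile(mpg, probs = 0.25, names = FALSE)
q3 <- quantile(mpg, probs = 0.75, names = FALSE)

# Interquartile range; equals q3 - q1
iqr <- IQR(mpg)

# Assumed rule: flag values more than 1.5 * IQR beyond the quartiles
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
is_outlier  <- mpg < lower_bound | mpg > upper_bound

# Show the flagged rows, if any
mtcars[is_outlier, c("mpg", "hp")]
```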

points(): This function adds additional points (e.g., outliers) to an existing scatter plot. The key arguments are x for the x-coordinates (e.g., horsepower of outliers), y for the y-coordinates (e.g., energy efficiency (MPG) of outliers), col to specify the color of the points (e.g., "red"), and pch to set the plotting symbol (e.g., 19 for filled circles).
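To illustrate how plot() and points() work together, the following sketch draws a scatter plot and then overlays flagged points in red. It uses the built-in mtcars data as a stand-in for the case-study dataset, and the 1.5 × IQR outlier rule is an assumption made for the example.

```r
# Stand-in data: built-in mtcars (hp = horsepower, mpg = miles per gallon)
hp  <- mtcars$hp
mpg <- mtcars$mpg

# Assumed rule: MPG values more than 1.5 * IQR outside the quartiles
q <- quantile(mpg, probs = c(0.25, 0.75), names = FALSE)
is_outlier <- mpg < q[1] - 1.5 * IQR(mpg) | mpg > q[2] + 1.5 * IQR(mpg)

# Base scatter plot of all vehicles
plot(x = hp, y = mpg,
     main = "Energy Efficiency vs. Horsepower",
     xlab = "Horsepower", ylab = "Energy Efficiency (MPG)",
     pch = 19, col = "blue")

# Overlay the flagged points in red on the existing plot
points(x = hp[is_outlier], y = mpg[is_outlier], col = "red", pch = 19)
```

points() only draws on a plot that already exists, so it must be called after plot().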

lm(): This function fits a linear regression model. The main arguments are formula, which specifies the relationship to model (e.g., Energy Efficiency (MPG) ~ Horsepower), and data, which defines the data frame containing the variables.

summary(): This function provides a detailed summary of a fitted regression model, including coefficients, R-squared values, and p-values. The argument passed is the fitted model object (e.g., model).

coef(): This function extracts the estimated coefficients (intercept and slope) from a fitted model. The argument is the fitted model object (e.g., model), and it returns the intercept and slope as numeric values.
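The three functions above are typically used in sequence. The sketch below fits a model on the built-in mtcars data (mpg as the response, hp as the predictor), which stands in for the case-study dataset. The object names are illustrative.

```r
# Stand-in data: built-in mtcars, with mpg as response and hp as predictor
model <- lm(mpg ~ hp, data = mtcars)

# Full summary: coefficient table, R-squared, p-values
model_summary <- summary(model)
print(model_summary)

# Named numeric vector holding the intercept and the slope
estimates <- coef(model)
intercept <- estimates["(Intercept)"]
slope     <- estimates["hp"]

# Individual pieces of the summary can be pulled out by name
r_squared <- model_summary$r.squared
slope_p   <- model_summary$coefficients["hp", "Pr(>|t|)"]
```

When a column name contains spaces or parentheses, wrap it in backticks inside the formula so that R reads it as a single variable name.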

residuals(): This function extracts the residuals from a fitted linear regression model, which are the differences between the observed and predicted values. The argument is the fitted model object (e.g., model).

plot(model, which = 1): This function creates a residual plot to assess the distribution of residuals and check for potential violations of the linear regression assumptions. The key arguments are model, which specifies the fitted model object, and which = 1, which generates the residuals vs. fitted values plot.

abline(): This function adds a reference line to a plot. The main arguments are h = 0, which adds a horizontal line at zero (useful for residual plots), and col, which specifies the color of the line (e.g., "red").
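A short sketch of the residual plot follows, again using a model fitted to the built-in mtcars data as a stand-in. It draws the built-in diagnostic first and then a hand-made equivalent with a zero reference line. fitted() is a base R function that is not listed above and is used here only to obtain the predicted values.

```r
# Stand-in model on built-in mtcars data
model <- lm(mpg ~ hp, data = mtcars)

# Residuals are observed minus fitted values
res <- residuals(model)
fit <- fitted(model)

# Built-in diagnostic: residuals vs. fitted values
plot(model, which = 1)

# Hand-made version of the same plot, with a zero reference line
plot(fit, res,
     main = "Residuals vs. Fitted",
     xlab = "Fitted values", ylab = "Residuals",
     pch = 19, col = "blue")
abline(h = 0, col = "red")
```

If the linear model is appropriate, the points should scatter evenly above and below the red line with no visible curve or funnel shape.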

hist(): This function creates a histogram of the residuals to check for normality. The key arguments are x for the residuals, main for the plot title, xlab for labeling the x-axis, col for specifying the color of the bars (e.g., "lightblue"), and breaks for suggesting the number of bins in the histogram (e.g., 10). R treats a single number passed to breaks as a suggestion and may choose a nearby value that gives round bin edges.
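As a sketch of the normality check, the code below draws a histogram of the residuals from a stand-in model fitted to the built-in mtcars data. The object names are illustrative.

```r
# Stand-in model on built-in mtcars data
model <- lm(mpg ~ hp, data = mtcars)
res   <- residuals(model)

# Histogram of residuals; breaks = 10 is a suggested number of bins
h <- hist(res,
          main = "Histogram of Residuals",
          xlab = "Residuals",
          col = "lightblue",
          breaks = 10)

# hist() invisibly returns the bin edges and counts it used
length(h$counts)  # number of bins actually drawn
```

A roughly symmetric, bell-shaped histogram centered near zero is consistent with normally distributed residuals. Strong skew or isolated bars far from the center suggest otherwise.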

6.11 Lecture Notes

Lecture 1: Introduction to Simple Linear Regression	html	pdf
Lecture 2: Exploratory Data Analysis (EDA)	html	pdf
Lecture 3: Fitting the Simple Linear Regression Model	html	pdf
Lecture 4: Residual Diagnostics - Evaluating Model Fit	html	pdf
Lecture 5: Interpreting the Model Results	html	pdf

6.12 Homework Assignment: Exploring the Relationship Between Energy Efficiency and MSRP

In this homework assignment, you will apply the simple linear regression workflow to explore the relationship between Energy Efficiency (MPG) and MSRP (USD) using the same vehicle dataset. This exercise will guide you through the steps of loading data, performing exploratory data analysis (EDA), fitting a simple linear regression model, and evaluating the model’s assumptions through residual diagnostics.

6.12.1 Objectives:

Load and clean the dataset.
Perform exploratory data analysis (EDA) to visualize the relationship between Energy Efficiency (MPG) and MSRP (USD).
Fit a simple linear regression model to analyze the relationship.
Perform residual diagnostics to assess the model’s assumptions.
Interpret the model results and discuss your findings.

6.12.2 Instructions:

1. Load the Dataset

Use the following code chunk to load the required package and dataset directly from the web:

# Install the openxlsx package if not installed
if (!require(openxlsx)) install.packages("openxlsx")
library(openxlsx)

# Read the Excel file directly from the URL
car_data <- read.xlsx(
  "https://ljkelly3141.github.io/real-world-statistics-with-r/data/car_price.xlsx",
  check.names = FALSE,
  sep.names = " "
)

# View the first few rows of the dataset
head(car_data)

Note

The switch to the openxlsx package from readxl is recommended because openxlsx allows for directly reading Excel files from URLs, whereas readxl only supports reading from local file paths. This eliminates the need to manually download files to a temporary location before reading them, making the process more efficient and reducing code complexity. Additionally, openxlsx is versatile, offering a range of functions for reading, writing, and manipulating Excel files, making it a more comprehensive solution for Excel data handling in R.

The following is an explanation of the code above:

Loading the openxlsx Package:

if (!require(openxlsx)) install.packages("openxlsx")
library(openxlsx)

if (!require(openxlsx)): require() tries to load the openxlsx package and returns FALSE (with a warning) if the package is not installed. In that case, the install.packages("openxlsx") command installs it.
library(openxlsx): After ensuring that the openxlsx package is installed, this command loads the package into the current R session so that its functions can be used.

Reading the Excel File:

car_data <- read.xlsx(
  "https://ljkelly3141.github.io/real-world-statistics-with-r/data/car_price.xlsx",
  check.names = FALSE,
  sep.names = " "
)

read.xlsx(): This function, from the openxlsx package, reads an Excel file directly from the specified URL.
"https://ljkelly3141.github.io/real-world-statistics-with-r/data/car_price.xlsx": This is the URL from which the Excel file will be read.
check.names = FALSE: This argument tells R not to automatically adjust variable names in the dataset to make them syntactically valid. For instance, R won’t replace spaces or special characters in column names.
sep.names = " ": This sets the character used in place of blanks in the column names. Setting it to a space keeps the names as they appear in the Excel file, instead of replacing each blank with a period (the default).

After loading the data, inspect the first few rows of the dataset to understand its structure and ensure it was loaded correctly.
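Because the column names keep their spaces and parentheses, they must be quoted when used in code. The sketch below shows the idea on a small toy data frame. Only the column names follow the dataset; the values and the name toy_cars are hypothetical.

```r
# Toy data with hypothetical values; only the column names follow the dataset
toy_cars <- data.frame(
  "Energy Efficiency (MPG)" = c(35, 28, 22, 18),
  "MSRP (USD)"              = c(21000, 27500, 36000, 52000),
  check.names = FALSE   # keep the spaces and parentheses in the names
)

names(toy_cars)  # confirm the names were kept as written
str(toy_cars)    # column types and a preview of the values

# Backticks let you refer to names that are not syntactically valid
mean_mpg <- mean(toy_cars$`Energy Efficiency (MPG)`)

# Double-bracket indexing with a string works as well
max_msrp <- max(toy_cars[["MSRP (USD)"]])
```

The same backtick quoting applies wherever a column is named in code, including plotting calls and model formulas.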

2. Exploratory Data Analysis (EDA)

Before fitting a model, perform some initial exploratory data analysis to visualize the relationship between Energy Efficiency (MPG) and MSRP (USD).

Create a scatter plot to visualize the relationship between the two variables.
Observe the scatter plot. Does there appear to be a linear relationship between Energy Efficiency (MPG) and MSRP (USD)?

3. Fit a Simple Linear Regression Model

Fit a simple linear regression model where Energy Efficiency (MPG) is the predictor and MSRP (USD) is the response variable.

Interpret the output from the regression model.
What does the coefficient of Energy Efficiency (MPG) tell you about the relationship between energy efficiency and MSRP?
Is the relationship statistically significant? Use the p-value of the slope to judge significance, and use the R-squared value to describe how much of the variation in MSRP the model explains.

4. Model Diagnostics: Residual Analysis

Perform residual analysis to evaluate whether the linear regression model meets its assumptions.

Plot the residuals to check for randomness.
Plot a histogram of the residuals to check for normality.
Are the residuals randomly scattered around zero in the residual plot? Does the histogram of residuals indicate that they are normally distributed?

5. Discussion and Interpretation

In your final analysis, address the following questions:

Based on the model output, does Energy Efficiency (MPG) have a significant effect on MSRP (USD)?
What can you conclude about the relationship between energy efficiency and vehicle price?
Do the residual plots suggest any issues with the model’s assumptions (e.g., linearity, homoscedasticity, normality)?
What further steps could you take if the model assumptions are violated (e.g., transforming variables or using a different model)?

6.12.3 Submission Instructions

Please follow these detailed instructions for completing and submitting your assignment. This assignment is to be conducted within the class assignment workspace provided to you. You will create an R Quarto document, incorporating code and analysis as demonstrated in the provided examples. Follow the structure provided in the example code and explanations to guide your analysis. Each required step should correspond to a separate section within your R Quarto document. Utilize the headings feature in Quarto to organize your document (# for main sections, ## for subsections).

Once you have completed the analysis and are satisfied with your document, compile it into an MS Word document and submit it as instructed.