4 Case Study: Data Preparation and Exploration
PRELIMINARY AND INCOMPLETE
In this case study, we will work through the process of preparing a dataset from the Google Play Store, which contains information on the mobile applications available on the platform. The goal of this exercise is to clean, explore, and organize the data so that it is ready for deeper analysis, allowing us to gain insights into app performance, user preferences, and market trends.
4.1 Dataset Overview
This dataset was sourced from Kaggle and has been preprocessed to remove inconsistencies, handle missing values, and standardize formats for easier analysis; the original, unprocessed version remains available on Kaggle. It includes the following columns:
- App Name: The name of the mobile application.
- Category: The category under which the app is listed on the Play Store.
- Rating: The user rating of the app (on a scale from 0 to 5).
- Reviews: The number of user reviews.
- Size: The size of the app (e.g., in MB).
- Installs: The number of times the app has been installed.
- Type: Whether the app is free or paid.
- Price: The price of the app, if paid.
- Content Rating: The age group suitable for the app.
- Genres: The genres associated with the app.
- Last Updated: The last date the app was updated.
- Current Version: The current version of the app.
- Android Version: The minimum Android OS version required to run the app.
4.2 Case Study Objectives
The primary objective of this case study is to guide you through the process of loading, cleaning, and preparing a dataset for analysis. By the end of this exercise, you will have developed the skills to explore key features of the dataset, understand the distribution of app ratings, pricing, and other variables, and aggregate and summarize data to extract meaningful insights about the Google Play Store app market.
4.3 Loading the Data
The first step in our analysis is to load the dataset into R. We assume that the dataset is stored in a CSV file named google_play_store.csv. To load the dataset, we use the read.csv() function, which reads the data into R and stores it in a data frame called play_store_data. This initial step is crucial as it allows us to bring the raw data into the R environment where it can be manipulated and analyzed.
The read.csv() function takes the path to the CSV file as its argument and reads the data into a data frame format, which is a fundamental data structure in R that organizes data into rows and columns. This structure is ideal for data analysis tasks.
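A minimal version of this step, assuming the file sits in the working directory, looks like the following:

# Read the CSV file into a data frame called play_store_data
play_store_data <- read.csv("google_play_store.csv")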
4.4 Processing the Data
4.4.1 Handling Missing Values
Once the data is loaded, our next task is to handle any missing values in the dataset. Missing data can lead to inaccuracies in analysis and skewed results if not properly addressed. To check for missing values, we use the is.na() function, which identifies any NA values in the dataset. After identifying the missing values, a common approach is to replace them with the mean value of the respective column, especially if the missing data is in a numeric column like Rating. This method, while simple, fills the gaps with a representative value so that every app can be retained in the analysis.
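A minimal sketch of the missing-value check, assuming the data frame is named play_store_data, is:

# Count the total number of missing values across the entire dataset
sum(is.na(play_store_data))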
[1] 0
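A sketch of the mean-imputation step for the Rating column follows the same assumptions:

# Replace missing ratings with the mean of the non-missing ratings
play_store_data$Rating <- ifelse(is.na(play_store_data$Rating),
                                 mean(play_store_data$Rating, na.rm = TRUE),
                                 play_store_data$Rating)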
In the code above, the sum(is.na(play_store_data)) line counts the total number of missing values in the entire dataset. The next block of code replaces any missing values in the Rating column with the mean of the non-missing ratings. The ifelse() function checks each value in the Rating column: if it is NA, it is replaced with the mean rating; otherwise, the original value is retained.
4.4.2 Converting Data Types
It is also essential to ensure that each column in the dataset has the appropriate data type for analysis. For example, columns like Installs, Reviews, and Price might initially be read as character strings due to the presence of non-numeric characters like commas, plus signs, or dollar signs. These columns need to be converted to numeric data types to allow for meaningful calculations and comparisons. To achieve this, we use functions like gsub() to remove non-numeric characters and as.numeric() to convert the cleaned strings into numbers. This step is critical because incorrect data types can lead to errors in subsequent analysis steps, such as filtering and aggregating data.
# Removing non-numeric characters and converting 'Installs' to numeric
play_store_data$Installs <- as.numeric(gsub("[+,]", "", play_store_data$Installs))
# Converting 'Price' to numeric after removing the dollar sign
play_store_data$Price <- as.numeric(gsub("\\$", "", play_store_data$Price))
# Converting 'Reviews' to numeric
play_store_data$Reviews <- as.numeric(play_store_data$Reviews)

In these lines of code, gsub() is used to remove unwanted characters from the Installs and Price columns. For example, gsub("[+,]", "", play_store_data$Installs) removes commas and plus signs, and gsub("\\$", "", play_store_data$Price) removes the dollar sign. The cleaned strings are then converted to numeric format using as.numeric(), making them suitable for arithmetic operations and comparisons.
4.4.3 Filtering Data
To focus our analysis on a specific subset of the data, we may apply filters to the dataset. In the context of this case study, we choose to filter the dataset to include only free apps that have been installed more than 1,000,000 times. This filtering narrows our analysis to highly popular free apps, allowing us to gain insights into what makes these apps successful. Filtering is an important technique in data analysis as it helps isolate the most relevant data, making the analysis more targeted and meaningful.
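One way to apply this filter in base R, assuming the column names used above, is:

# Keep only free apps with more than 1,000,000 installs
filtered_data <- subset(play_store_data, Type == "Free" & Installs > 1000000)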
Here, the dataset is filtered using a logical condition that checks whether the app is free (Type == "Free") and whether it has more than 1,000,000 installs (Installs > 1000000). The result is stored in a new data frame called filtered_data, which contains only the rows that meet these criteria.
4.4.4 Aggregating Data by Category
After filtering the data, the next step is to aggregate the data to understand trends and patterns within specific categories. For example, we might want to calculate the average user rating and the total number of installs for each app category. This can be done using the aggregate() function, which groups the data by the Category column and applies functions to calculate summary statistics such as the mean rating and the total installs. Aggregation is a powerful tool in data analysis as it allows us to condense large datasets into more manageable summaries that highlight key insights.
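One way to build this summary in base R, assuming the filtered_data frame from the previous step, is:

# Group by Category and compute the mean and sum of Rating and Installs
category_summary <- aggregate(cbind(Rating, Installs) ~ Category,
                              data = filtered_data,
                              FUN = function(x) c(mean = mean(x), sum = sum(x)))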
In this code, the aggregate() function groups the data by Category. The cbind(Rating, Installs) call binds the Rating and Installs columns together so that the summary function can be applied to both simultaneously. The function passed to the FUN argument returns both the mean and the sum of each column, which is why the summarized data examined later contains the columns Rating.mean, Rating.sum, Installs.mean, and Installs.sum. The result is a summary table in which each row represents a category and contains the aggregated statistics for that category.
4.4.5 Summarizing Data
Once the data has been aggregated, it can be further summarized to provide a clear overview of the findings. By converting the aggregated results into a data frame, we can easily examine the summarized data, making it easier to interpret and report. This step is essential for presenting the results of the analysis in a clear and concise manner.
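Continuing the sketch from the previous step, the conversion looks like this:

# Flatten the matrix columns returned by aggregate() into ordinary columns
category_summary <- do.call(data.frame, category_summary)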
The do.call(data.frame, category_summary) call flattens the matrix columns returned by aggregate() into ordinary data frame columns (Rating.mean, Rating.sum, Installs.mean, and Installs.sum), making the result easier to view and work with. This data frame contains the summarized statistics for each app category, providing a condensed view of the key metrics.
4.5 Examining the Final Data Frame
After processing, filtering, and summarizing the data, it is important to examine the final data frame to ensure that it is ready for analysis. The head() function is commonly used to display the first few rows of the data frame, allowing us to quickly inspect the structure and contents of the processed data. This final check helps verify that all steps have been carried out correctly and that the data is in the desired format for further analysis or reporting.
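Continuing with the category_summary data frame from the previous step:

# Display the first few rows of the summarized data
head(category_summary)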
| Category | Rating.mean | Rating.sum | Installs.mean | Installs.sum |
|---|---|---|---|---|
| ART_AND_DESIGN | 4.43 | 39.9 | 1.28e+07 | 1.15e+08 |
| AUTO_AND_VEHICLES | 4.38 | 21.9 | 7e+06 | 3.5e+07 |
| BEAUTY | 4.25 | 8.5 | 7.5e+06 | 1.5e+07 |
| BOOKS_AND_REFERENCE | 4.4 | 180 | 4.61e+07 | 1.89e+09 |
| BUSINESS | 4.26 | 217 | 1.85e+07 | 9.45e+08 |
| COMICS | 4.2 | 25.2 | 6.67e+06 | 4e+07 |
Using head(category_summary), we can view the first six rows of the summarized data. This allows us to confirm that the data has been processed as expected and that it is ready for any further analysis or reporting tasks.
4.6 Saving the Processed Data
Finally, the processed and summarized data can be saved to a CSV file using the write.csv() function. This allows the cleaned and aggregated data to be easily shared with others or used in future analyses. Saving the data is a crucial step in the data analysis workflow as it preserves the results of your work and makes them accessible for future reference or collaboration.
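A sketch of this step, assuming the category_summary data frame from the previous section, is:

# Write the summarized data to a CSV file, omitting row names
write.csv(category_summary, "google_play_store_summary.csv", row.names = FALSE)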
The write.csv() function writes the data frame to a CSV file, specified by the file name "google_play_store_summary.csv". The row.names = FALSE argument ensures that the row names are not included in the CSV file, which helps keep the file clean and focused on the data itself.
4.7 Conclusion
This case study demonstrates the entire process of loading, cleaning, and preparing a dataset from the Google Play Store for analysis. By following these steps, the dataset is transformed from its raw state into a structured format that is ready for business analysis. This preparation is essential for deriving meaningful insights into app performance and market trends, which can inform decisions in areas such as app development, marketing, and user engagement strategies.
4.8 Homework Assignment
4.8.1 Objective
The objective of this homework assignment is to reinforce the data processing and analysis techniques covered in the case study by applying them to a new exploration of the Google Play Store dataset. Through this exercise, you will gain hands-on experience with data manipulation, filtering, and aggregation in R, which will deepen your understanding of how to prepare and analyze real-world datasets.
4.8.2 Instructions
To complete this assignment, start by loading the google_play_store.csv dataset into R, as demonstrated in the case study. Here is code to load this dataset.
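This assumes the file is available in your working directory; adjust the path to match your workspace if necessary.

# Load the Google Play Store dataset into a data frame
play_store_data <- read.csv("google_play_store.csv")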
Next, verify that key columns in the dataset, such as Installs, Reviews, and Price, are correctly formatted as numeric data types. This step is crucial because it ensures that subsequent analysis steps, such as filtering and aggregation, produce accurate results. If any of these columns are read as characters, you will need to convert them to numeric types, removing any non-numeric characters such as commas or dollar signs.
For this assignment, you will apply a different filter from the one used in the case study. Instead of filtering for free apps with more than 1,000,000 installs, you will narrow your analysis to include only apps in the “Everyone” content rating category that have a user rating of 4.0 or higher. This filter will allow you to explore a subset of apps that are widely accessible and highly rated by users.
After filtering the data, aggregate the information by Content Rating instead of by Category. Calculate the average user rating and the total number of installs for each content rating group. This will help you identify trends and patterns within the different content rating categories on the Google Play Store, providing insights into user preferences and app performance across various age groups.
Once you have completed these steps, convert the aggregated data into a data frame and examine the results to ensure they align with your expectations. Finally, save the cleaned and processed data to a CSV file for future reference or sharing.
4.8.3 Submission Instructions
Please follow these detailed instructions for completing and submitting your assignment. This assignment is to be conducted within the class assignment workspace provided to you. You will create an R Quarto document, incorporating code and analysis as demonstrated in the provided examples. Follow the structure provided in the example code and explanations to guide your analysis. Each required step should correspond to a separate section within your R Quarto document. Utilize the headings feature in Quarto to organize your document (# for main sections, ## for subsections). Once you have completed the analysis and are satisfied with your document, compile it into an MS Word document and submit it as instructed.