Author: Giang Son Nguyen.
Date: 11.08.2021
Visit my portfolio at: giangson.tech
This case study is part of the Google Data Analytics Capstone Project on Coursera.
The data used in this project is the public Divvy trip history dataset.
In this project, I will be acting as a junior data analyst for a fictional company Cyclistic. My task is to perform 6 steps in the data analysis process and produce a report with the corresponding deliverables:
The tools I will use for this project are:
# Import libraries
shhh <- suppressPackageStartupMessages #do this to hide ("supress") the warnings when loading libraries
start_time <- Sys.time() #measure notebook runtime
shhh(library(rvest)) #for webscraping
shhh(library(tidyverse)) #for working with data
shhh(library(lubridate)) #for working with datetime data
shhh(library(data.table)) #for faster aggregating
library(ggthemes) #for custom ggplot themes
Briefing
Cyclistic is a company based in Chicago, US. It provides bide-sharing service with 5800 bikes and 600 docking stations. Cyclistic categorizes its customers into 2 types:
The finance team concludes that annual members bring significantly larger profits than casual riders. To maximize the number of annual members, the marketing team wants to launch a campaign that covert casual customers into members.
To help achieve that goal, I - the data analyst - need to analyze the trip data from the past year, generate usefull insights, and provide my suggestions for further actions.
Business objetive
Derive a marketing program to convert Cyclistic's casual riders into annual members to increase profits.
My task
Analyze the data and provide an answer to the question:
How do annual members and casual riders use Cyclistic bikes differently?
Cyclistic stores all their trip data in a single folder, which can be accessed at this URL: https://divvy-tripdata.s3.amazonaws.com/index.html. The data is collected at a ride-level: each row contains information (detailed below) about one single trip.
The recent data is organized into monthly files (each file contains trip data for a single month). All of these files contain 13 columns, the names of which are consistent accross all files. A description of these columns are presented in the next section.
Previously (before April 2020), Cyclistic used to sort its data differently (quartery/ annually) and used different attributes. I won't go into details since this information is mostly irrelevant.
For the purpose of this case study, I'll perform analysis on the data from the lastest 12 months at the time of writing (June 2020 to May 2021).
I use the below script to scrape the data from its source and save it onto my computer. Alternatively, this task could be done manually without taking too much time.
# Scrape the source website for links to the last 12 months of data.
root <- "https://divvy-tripdata.s3.amazonaws.com/"
index <- read_html(root)
# Get the zip file names: elements with the tag <key> and the word tripdata
zip_names <- index %>%
html_nodes("key") %>%
html_text()
list()
zip_names <- grep("tripdata", zip_names, value=TRUE)
# Only get 12 months of data
zip_names <- tail(zip_names, 12)
#Load the data directly from the source, then save as csv files
for (zip_name in zip_names){
# assign the url to the zip file
url <- paste(root, zip_name,sep="")
# assign the name of the actual csv file
csv_name <- paste(substring(zip_name,1,nchar(zip_name)-4), ".csv",sep="")
# load the data from the csv inside the zip file
temp <- tempfile()
download.file(url, temp)
montly_data <- read_csv(unz(temp, csv_name),
col_types = cols(end_station_id = col_character(),
start_station_id = col_character())) #have to specify column types
# save the monthly data as csv
write_csv(montly_data, paste('data',csv_name,sep="/"))
}
print("Finished downloading all files.")
# Define a function to read csv files later
# Have to do this because I need to specify the column types while reading to avoid errors
read_files <- function(file){
read_csv(file, col_types = cols(end_station_id = col_character(),
start_station_id = col_character()))
}
# Get the directories to the csv files
files <- list.files(path = "data/", pattern = ".*csv")
# Read all the files and combine them into one dataframe
df_list <- lapply(paste("data/",files,sep=""), read_files)
all_data <- bind_rows(df_list)
# Save a copy of the 12-month data
write_csv(all_data, 'all-tripdata.csv')
About the dataset
A quick glance at the dataset shows that this dataset consists of:
Checking for issues
started_at
> ended_at
) and will interfere with the analysis. They must be removed.# Load all data
all_data <- read_csv('all-tripdata.csv',
col_types = cols(end_station_id = col_character(),
start_station_id = col_character()))
print(paste("This data has", dim(all_data)[1], "rows and", dim(all_data)[2],"columns"))
print("First 6 rows:")
head(all_data) #See the first 6 rows of data frame.
# check for duplicates
# no duplicate
# sum(duplicated(data))
# check for missing values
# missing values in the name, id and location of stations
# can be ignored
colSums(is.na(all_data))
To make the data analysis-ready, I've performed some processing steps, including:
Since the data manipulation is rather extensive, I decide to perform the changes on a copy of the dataset instead of using the original.
Below is the list of all the changes that have been made:
started_at
> ended_at
ride_length
(= ended_at
- started_at
)hour
(hour of the day of started_at
day_of_week
(the day of the week of started_at
)month
(the month of started_at
)The new dataset contains: 4063225 rows and 17 columns.
# remove bad data and save as a new dataframe
data <- all_data[!(all_data$started_at > all_data$ended_at),]
# see how many rows were removed
rows_removed <- dim(all_data)[1] - dim(data)[1]
paste(rows_removed, "rows have been removed.")
# month column
data$month <- lubridate::month(data$started_at, label = TRUE)
# day_of_week column
data$day_of_week <- lubridate::wday(data$started_at, label = TRUE, week_start = 1)
# hour column
data$hour <- lubridate::hour(data$started_at)
# calculate ride_length (in seconds)
data$ride_length <- difftime(data$ended_at,data$started_at)
head(data$ride_length,10)
data$ride_length <- as.numeric(as.character(data$ride_length))
# inspecting the new dataframe
print(paste("The new dataframe has", dim(data)[1], "rows and", dim(data)[2],"columns"))
head(data)
Based on the available data, I will analyze 2 key metrics:
and break them down by:
overall <- setDT(data)[,list(mean_ride_length = mean(ride_length),
number_of_rides = .N)]
overall
by_membership <- setDT(data)[,list(mean_ride_length = mean(ride_length),
total_ride_length = sum(ride_length),
number_of_rides = .N,
percent_of_rides = .N/dim(data)[1]),
by = 'member_casual']
by_membership
by_membership %>%
arrange(desc(member_casual)) %>%
mutate(percent_of_rides = percent_of_rides*100) %>%
mutate(ypos = cumsum(percent_of_rides) - percent_of_rides*0.5) %>%
# making a pie chart in ggplot is a bit tricky
ggplot(aes(x="", y=percent_of_rides, fill=member_casual)) +
ggtitle("Percentage of rides by membership") +
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0) +
geom_text(aes(y = ypos, label = paste0(round(percent_of_rides, 2),"%")), color = "white", size=6) +
scale_fill_discrete(name = "Type of membership", labels = c("Casual riders", "Annual members")) +
theme_economist() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x=element_blank(),
axis.line=element_blank(),
panel.border = element_blank(),
panel.grid=element_blank(),
axis.ticks = element_blank())
by_bike_type <- setDT(data)[,list(mean_ride_length = mean(ride_length), number_of_rides = .N, percent_of_rides = .N/dim(data)[1]),
by = 'rideable_type']
by_bike_type
by_membership_and_bike_type <- setDT(data)[order(member_casual),
list(mean_ride_length = mean(ride_length),
number_of_rides = .N),
by = 'member_casual,rideable_type']
by_membership_and_bike_type
ggplot(data = by_membership_and_bike_type, aes(x = rideable_type, y = number_of_rides, fill = member_casual))+
geom_col() +
labs(title="Number of rides by type of membership and type of bike", x = "Type of bike", y = "Number of rides") +
scale_fill_discrete(name = "Type of membership", labels = c("Casual riders", "Annual members")) +
theme_economist()
by_membership_and_month <- setDT(data)[order(month,member_casual),
list(number_of_rides = .N,
mean_ride_length = mean(ride_length)),
by = 'month,member_casual']
by_membership_and_month
ggplot(data = by_membership_and_month, aes(x = month, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title="Number of rides by type of membership and month", x = "Month", y = "Number of rides") +
scale_fill_discrete(name = "Type of membership", labels = c("Casual riders", "Annual members")) +
theme_economist()
by_membership_and_dayofweek <- setDT(data)[order(day_of_week,member_casual),
list(number_of_rides = .N,
mean_ride_length = mean(ride_length)),
by = 'member_casual,day_of_week']
# Alternatively, this could be done with dplyr pipe operations - but that's much slower with a dataset as large as this one.
# create a column for weekday/weekend
weekdays <- c('Mon', 'Tue','Wed', 'Thu', 'Fri')
by_membership_and_dayofweek <- by_membership_and_dayofweek %>%
mutate(weekday = if_else(day_of_week %in% weekdays, "weekday", "weekend"))
by_membership_and_dayofweek
ggplot(data = by_membership_and_dayofweek, aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title="Number of rides by type of membership and day of the week", x = "Day of the week", y = "Number of rides") +
annotate(geom="text",x=6.55,y=4.25e+05, label = "More trips by casual riders", color = 'blue', size = 3.5, fontface = 2) +
annotate(geom="text",x=6.5,y=4.1e+05, label = "on weekends", color = 'blue', size = 3.5, fontface = 2) +
scale_fill_discrete(name = "Type of membership", labels = c("Casual riders", "Annual members")) +
geom_vline(xintercept = 5.5, linetype="dashed", color ="red", size = 0.5) +
theme_economist()
ggplot(data = by_membership_and_dayofweek, aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title="Number of rides by type of membership and weekday/weekend", x = "Weekday/ Weekend", y = "Number of rides") +
scale_fill_discrete(name = "Type of membership", labels = c("Casual riders", "Annual members")) +
theme_economist()
ggplot(data = by_membership_and_dayofweek, aes(x = day_of_week, y = mean_ride_length, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title="Ride duration by type of membership and day of the week", x = "Day of the week", y = "Average ride duration (seconds)") +
scale_fill_discrete(name = "Type of membership", labels = c("Casual riders", "Annual members")) +
theme_economist()
by_membership_and_hour <- setDT(data)[order(hour,member_casual),
list(number_of_rides = .N, mean_ride_length = mean(ride_length)),
by = 'member_casual,hour']
by_membership_and_hour
by_membership_and_hour %>%
group_by(member_casual) %>%
summarize(max_ride_length = max(mean_ride_length),
min_ride_length = min(mean_ride_length))
ggplot(data = by_membership_and_hour, aes(x = hour, y = number_of_rides, color = member_casual)) +
geom_line() +
geom_point() +
labs(title="Number of rides by type of membership and hour", x = "Hour", y = "Number of rides") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 23),expand = c(0, 0)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
scale_color_discrete(name = "Type of membership", labels = c("Casual riders", "Annual members")) +
theme_economist()
by_membership_and_start_station <- setDT(data[start_station_name != 'NA'])[order(start_station_name,member_casual),
list(number_of_rides = .N),
by = 'start_station_name,member_casual']
head(by_membership_and_start_station)
top_start_station_casual <- head(by_membership_and_start_station[member_casual == 'casual'][order(-number_of_rides)], 10)
top_start_station_casual
top_start_station_member <- head(by_membership_and_start_station[member_casual == 'member'][order(-number_of_rides)], 10)
top_start_station_member
ggplot(data = top_start_station_casual,
aes (x = reorder(start_station_name, -number_of_rides),
y = number_of_rides, fill = member_casual)) +
geom_col() +
labs(title="Most popular start stations for casual riders",
x = "", y = "Number of rides") +
scale_fill_discrete(name = "Type of membership",
labels = c("Casual riders", "Annual members")) +
theme_economist() +
theme(axis.text.x = element_text(angle = 90),
axis.line=element_blank(),
axis.ticks.x = element_blank())
by_membership_and_end_station <- setDT(data[end_station_name != 'NA'])[order(end_station_name,member_casual),
list(number_of_rides = .N),
by = 'end_station_name,member_casual']
head(by_membership_and_end_station)
top_end_station_casual <- head(by_membership_and_end_station[member_casual == 'casual'][order(-number_of_rides)], 10)
top_end_station_casual
top_end_station_member <- head(by_membership_and_end_station[member_casual == 'member'][order(-number_of_rides)], 10)
top_end_station_member
ggplot(data = top_end_station_casual,
aes (x = reorder(end_station_name, -number_of_rides),
y = number_of_rides, fill = member_casual)) +
geom_col() +
labs(title="Most popular end stations for casual riders",
x = "", y = "Number of rides") +
scale_fill_discrete(name = "Type of membership",
labels = c("Casual riders", "Annual members")) +
theme_economist() +
theme(axis.text.x = element_text(angle = 90),
axis.line=element_blank(),
axis.ticks.x = element_blank())
Overview of the last 12 months
Detailed breakdown
There was signifcant differences between top start and end stations of casual riders and members.
Top stations (both start and end) for casual riders include:
While top stations for annual members are:
Note: At the time of writing (Jul - Aug 2021), I plan to use Tableau for better visualizations but cannot follow through yet because I'm in the middle of my finals exams. Maybe I will return to this project once I have the time. For now, these rough ggplot2 graphs will have to do.
# save csv files for later visualizations in Tableau
write_csv(overall, "viz_data/overall.csv")
write_csv(by_membership, "viz_data/by_membership.csv")
write_csv(by_bike_type, "viz_data/by_bike_type.csv")
write_csv(by_membership_and_bike_type, "viz_data/by_membership_and_bike_type.csv")
write_csv(by_membership_and_bike_type, "viz_data/by_membership_and_month.csv")
write_csv(by_membership_and_bike_type, "viz_data/by_membership_and_dayofweek.csv")
write_csv(by_membership_and_bike_type, "viz_data/by_membership_and_hour.csv")
Based on my analysis, here are 3 main suggestions for converting casual riders into annual members:
Create a "weekend" membership plan: Since the number of casual rides is spectacularly high at the weekend, a plan tailored to that demand (with a lower price than the standard plan) would incentivize casual riders to sign up for a membership.
Introduce point reward system based on ride length: Riders can accumulate points based on their past ride duration, and convert those points to get a discount when buying/ renewing their membership. For example, each 100 minutes ridden can be counted as 1 point / 1 dollar, so a customers who has ridden 300 minutes can get a 3 dollars discount (this number is only illustrative).
Run marketing campaigns targeted at top stations for casual riders: Since the top stations for casual riders are very different from those of members, campaigns at these areas are very likely to reach casual riders. The marketing team could design custom ads, or offer specialized benefits for riders who frequent these top stations.
My work in this project has been complete. By analyzing the number of rides and ride time and breaking them down by different categories, I was able to find some very useful insights. I have also given 3 suggestions to the marketing team based on my findings.
Suggestion for further work: In this project, I could only analyze data on a trip level. If I could access customer data, my analysis could be more insightful (ie: How does each customer uses Cyclistic's services?). Moreover, additional information such as Cyclistic's membership pricing and more details on customer's behavior could also help producing more valuable recommendations.
end_time <- Sys.time()
elapsed_time <- end_time - start_time
elapsed_time