Author: Giang Son Nguyen.
Date: 26.05.2021
In this project, I've been given 10-day sales data (from September 2018) of an undisclosed ecommerce website in Vietnam.
My objective is to perform several data analysis tasks (including one data wrangling task and one case study - more details on the tasks below) to gain key business insights. I'll also provide commentary/explanations wherever necessary. The data is a bit limited, but the large volume of interactions still enables some interesting inferences.
Note: The product category names (merchant column) are in Vietnamese, so I will translate them when making interpretations about the data. Really sorry for this inconsistency 😥.
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from itertools import combinations, permutations
import time
prices_df = pd.read_csv('prices.csv')
sales_df = pd.read_csv('sales.csv')
df = pd.read_csv('case_study.csv',low_memory=False)
start_time= time.process_time()
prices_df.head()
sales_df.head()
# Checking the data types of each column in the 2 dataframes shows that
# the *updated_at* and *ordered_at* columns are in string format.
print('price_df ' + '*'*20)
print(prices_df.dtypes)
print('sales_df ' + '*'*20)
print(sales_df.dtypes)
# Let's format them to datetime for easier processing.
prices_df['updated_at'] = pd.to_datetime(prices_df['updated_at'], infer_datetime_format = True)
sales_df['ordered_at'] = pd.to_datetime(sales_df['ordered_at'], infer_datetime_format = True)
# While we're at it, let's change the dtype for *product_id* too (from int to str).
prices_df['product_id'] = prices_df['product_id'].astype('str')
sales_df['product_id'] = sales_df['product_id'].astype('str')
# the results
print(prices_df.dtypes)
print(sales_df.dtypes)
# I also sort the prices_df by *updated_at* (descending, newest first)
# so it's easier to find the right price later on
prices_df_sorted = prices_df.sort_values('updated_at', ascending = False)
prices_df_sorted.head()
Explaining my logic:
The price for each product at the time of order will be determined in one of two ways:
1. If the price was adjusted at least once before the order time, the price is the new_price of the most recent such adjustment.
2. If there was no adjustment before the order time, the price is the old_price of the earliest adjustment after it.
So, all I have to do is find the right unit price for each order, then calculate:
revenue = price × quantity_ordered
(I prefer my code to be clear, so I decided to make the revenue calculation a separate step outside of the for loop.)
# finding the unit price
sales_df['price'] = np.zeros(sales_df.shape[0])
for row in range(sales_df.shape[0]): # loop through each row
    # find the product_id and ordered_at time
    product_id = sales_df.iloc[row]['product_id']
    ordered_at = sales_df.iloc[row]['ordered_at']
    # the conditions for adjustments before 'ordered_at' (logic 1)
    mask_new = (prices_df_sorted['product_id'] == product_id) & (prices_df_sorted['updated_at'] < ordered_at)
    # the conditions for adjustments after 'ordered_at' (logic 2)
    mask_old = (prices_df_sorted['product_id'] == product_id) & (prices_df_sorted['updated_at'] > ordered_at)
    # find the adjustments in the prices_df
    new_price = prices_df_sorted[mask_new]['new_price']
    old_price = prices_df_sorted[mask_old]['old_price']
    if len(new_price) > 0:
        # if there is an adjustment before, then the price is the new_price of the newest adjustment
        price = new_price.iloc[0]
    else:
        # if there is no adjustment before, then the price is the old_price of the oldest adjustment
        price = old_price.iloc[-1]
    sales_df['price'].iloc[row] = price
# calculate revenue
sales_df['revenue'] = sales_df['quantity_ordered'] * sales_df['price']
sales_df.head()
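As an aside, the row-by-row lookup above is easy to follow but scales poorly with the number of orders. Below is a minimal vectorized sketch of the same two-way logic using pd.merge_asof, assuming the same column names as above (one caveat: merge_asof matches on <=/>=, while the loop uses strict inequalities):
# Vectorized price lookup (sketch). Both frames must be sorted
# ascending on their time columns for merge_asof to work.
sales_asc = sales_df.sort_values('ordered_at')
prices_asc = prices_df.sort_values('updated_at')
# logic 1: most recent adjustment at/before the order -> new_price
backward = pd.merge_asof(sales_asc, prices_asc,
                         left_on='ordered_at', right_on='updated_at',
                         by='product_id', direction='backward')
# logic 2: earliest adjustment at/after the order -> old_price
forward = pd.merge_asof(sales_asc, prices_asc,
                        left_on='ordered_at', right_on='updated_at',
                        by='product_id', direction='forward')
# prefer the backward match, fall back to the forward one
sales_asc['price'] = backward['new_price'].fillna(forward['old_price']).values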
# Calculating the revenue for each product at each price point.
# I also included the quantity sold for reference
sales_df.groupby(['product_id','price']).agg({'revenue':'sum','quantity_ordered':'sum'})
df.head()
df.shape
# I change the dtypes for some columns, similar to the first task
df['interaction_time'] = pd.to_datetime(df['interaction_time'])
df['fullVisitorid'] = df['fullVisitorid'].astype('str')
df['visitId'] = df['visitId'].astype('str')
df['productSKU'] = df['productSKU'].astype('str')
df['action_type'] = df['action_type'].astype('str')
df['transactionId'] = df['transactionId'].astype('str')
df.dropna(subset = ['merchant'], inplace = True)
df.shape
# the number of rows has decreased
# b. How many fullVisitorIds added products to cart?
added_to_cart = df[df['action_type']== '3']['fullVisitorid'].nunique()
print("fullVisitorId added products to cart: ", added_to_cart)
# How many completed purchases?
completed_purchase = df[df['action_type']== '6']['fullVisitorid'].nunique()
print("fullVisitorId completed purchases: ", completed_purchase)
# Calculate these numbers by date and hour.
df['date'] = df['interaction_time'].dt.date
df['hour'] = df['interaction_time'].dt.hour
df[df['action_type']== '3'].groupby(['date','hour']).agg({'fullVisitorid': pd.Series.nunique})
df[df['action_type']== '6'].groupby(['date','hour']).agg({'fullVisitorid': pd.Series.nunique})
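As a side note, the two tables above can be combined into a single view; a minimal sketch with pivot_table (same data, just reshaped):
# one table with both metrics per date & hour (sketch)
summary = (df[df['action_type'].isin(['3', '6'])]
           .pivot_table(index=['date', 'hour'], columns='action_type',
                        values='fullVisitorid', aggfunc='nunique')
           .rename(columns={'3': 'added_to_cart', '6': 'completed_purchase'}))
summary.head()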
abandonment_rate = (added_to_cart - completed_purchase)/added_to_cart
print("The average abandonment rate is: ", round(abandonment_rate*100,2), "%")
# Create a df of added_to_cart action (action_type = 3), grouped by merchant
df_merchant_added_to_cart = df[df['action_type'] == '3'].groupby('merchant')[['merchant']].count()
df_merchant_added_to_cart.columns = ['added_to_cart']
# Create a df of completed_purchase action (action_type = 6), grouped by merchant
df_merchant_completed_purchase = df[df['action_type'] == '6'].groupby('merchant')[['merchant']].count()
df_merchant_completed_purchase.columns = ['completed_purchase']
# Merge them together, then calculate the abandonment rate for each type of merchant
df_merchant = pd.merge(df_merchant_added_to_cart,df_merchant_completed_purchase, on = 'merchant').reset_index()
df_merchant['abandonment_rate'] = (df_merchant['added_to_cart'] - df_merchant['completed_purchase'])/df_merchant['added_to_cart']
# Keeping only merchants with a positive abandonment rate, sorted descending
df_merchant = df_merchant[df_merchant['abandonment_rate'] >0].sort_values('abandonment_rate', ascending = False)
df_merchant.head(10)
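For what it's worth, the same per-merchant table can be built in one pass with pd.crosstab; a minimal sketch (note that crosstab keeps merchants with zero completed purchases, which the inner merge above drops):
# count action types per merchant in one pass (sketch)
counts = pd.crosstab(df['merchant'], df['action_type'])[['3', '6']]
counts.columns = ['added_to_cart', 'completed_purchase']
counts['abandonment_rate'] = (counts['added_to_cart'] - counts['completed_purchase']) / counts['added_to_cart']
counts.sort_values('abandonment_rate', ascending=False).head(10)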
# Setting the style for sns plot. I personally prefer big plots so figsize will be (16,10)
sns.set(style='whitegrid',
        palette="deep",
        font_scale=1.1,
        rc={"figure.figsize": [16, 10]})
# Plot the merchants whose abandonment rates are above average
ax = sns.barplot(data = df_merchant[df_merchant['abandonment_rate'] > abandonment_rate], x= 'abandonment_rate', y = 'merchant')
ax.set_title('Merchants with abandonment rate above average')
ax.set_xlabel("Abandonment rate")
ax.set_ylabel("Merchant (Product categories)")
plt.show()
Quick comment: The merchants with the highest abandonment rates are product categories that are typically high-priced (at least in Vietnam).
# In this question, we only care about transactions where
# the purchase has been completed and at least 2 types of products are involved,
# which means action_type = 6 and the transactionId must appear in at least 2 different rows
# Let's create a dataframe consisting solely of the transactions that fit our description
df_duplicate = df[(df['transactionId'].duplicated(keep=False)) & (df['action_type']== '6') & (df['merchant'] != 'Promotion')][['transactionId','merchant']]
df_duplicate.drop_duplicates(inplace = True)
df_duplicate.head()
# Define a function that finds the pairs of products
# You can find more explanation here: https://youtu.be/iDV0Zh6HDGc
def find_pairs(x):
    pairs = pd.DataFrame(list(permutations(x.values, 2)), columns = ['item_1','item_2'])
    return pairs
# Create a dataframe consisting of all possible pairs
df_combo = df_duplicate.groupby('transactionId')['merchant'].apply(find_pairs).reset_index(drop = True)
# Calculate the frequency of them being bought together
df_combo = df_combo.groupby(['item_1','item_2']).size().reset_index()
df_combo.columns = ['item_1','item_2','frequency']
df_combo.sort_values('frequency',ascending = False,inplace = True)
# Create a new column for the combo name
df_combo['combo']= df_combo['item_1'] + ' & ' + df_combo['item_2']
df_combo.head()
# You'll notice that each pair is shown twice (once as item_1 & item_2 and once as item_2 & item_1)
# so we only need to take the even rows of that dataframe
df_combo = df_combo[['combo','frequency']][::2]
df_combo.head()
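A small caveat about the [::2] trick: it relies on each mirrored pair landing on adjacent rows after the sort, which only holds as long as ties stay together. A safer sketch is to generate each unordered pair exactly once with combinations (already imported above), so no dedup step is needed:
# generate each unordered pair once (sketch)
def find_pairs_once(x):
    return pd.DataFrame(list(combinations(sorted(x.values), 2)),
                        columns=['item_1', 'item_2'])
df_combo_alt = (df_duplicate.groupby('transactionId')['merchant']
                .apply(find_pairs_once).reset_index(drop=True)
                .groupby(['item_1', 'item_2']).size()
                .reset_index(name='frequency')
                .sort_values('frequency', ascending=False))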
# Plotting the pairs and their frequency of being bought together
ax = sns.barplot(data = df_combo.head(10), y = 'combo', x = 'frequency')
ax.set_title("Top 10 product pairs most frequently sold together")
ax.set_ylabel("Pair of products")
ax.set_xlabel("Number of times bought together")
plt.show()
Quick comment: The most frequently bought-together pair is: Vietnamese books & office supplies (pens, paper, etc.)
Surprisingly, Vietnamese books appear in 7 of the top 10 pairs. However, I wouldn't read too much into this because it's very likely that customers were buying books on this site and just happened to add something else in the same order (trust me, I've been there too).
Note: in case a user has several transactionIds, only products added to cart in the time interval between two consecutive transactions are considered.
df_purchases = df[df['action_type'] == '6'][['fullVisitorid','source','interaction_time','productSKU','merchant']]
df_purchases.head()
df_to_cart = df[df['action_type'] == '3'].groupby(['fullVisitorid','productSKU'])['interaction_time'].min().reset_index()
df_to_cart.head()
df_timelag = pd.merge(df_purchases,df_to_cart, how = 'left', on = ['fullVisitorid','productSKU'])
df_timelag.rename(columns = {'interaction_time_x': 'completed_purchase_time', 'interaction_time_y': 'added_to_cart_time'}, inplace = True)
# Calculate the time lag between the first added to cart product and the time of completing purchase
df_timelag['time_lag'] = df_timelag['completed_purchase_time'] - df_timelag['added_to_cart_time']
# Some orders don't have an add to cart time (?), so I assume the purchase is made immediately after the product is added to cart
# which means the time_lag for those transactions is 0
df_timelag['time_lag'].fillna(pd.Timedelta(seconds=0), inplace = True)
# convert the time_lag to minutes
df_timelag['time_lag'] = df_timelag['time_lag'].dt.total_seconds()/60
df_timelag.head()
print("Average time lag between first add-to-cart product and purchase time is: ", round(df_timelag['time_lag'].mean(),2), ' minutes')
# find the time lag for each merchant
time_lag_by_merchant = df_timelag.groupby('merchant')['time_lag'].mean().reset_index()
time_lag_by_merchant.sort_values('time_lag', ascending = False, inplace = True)
time_lag_by_merchant.head(10)
fig = px.bar(time_lag_by_merchant.head(10), x="time_lag", y="merchant",
             title='Average time lag by merchant',
             labels = {'time_lag': 'Time lag between add-to-cart and purchase (minutes)',
                       'merchant': 'Merchant (Product category)'})
fig.update_layout(title_x = 0.5)
fig.show()
It's rather difficult to make interpretations about this question, because I have many doubts about (1) the data and (2) my calculation method. So I'll leave it to the readers to make sense of this.
# Let's examine the number of transactions completed, grouped by time of day
df_purchase_by_hr = df[df['action_type'] == '6'].groupby('hour')['transactionId'].nunique().reset_index()
df_purchase_by_hr.columns = ['hour','number_of_purchases']
df_purchase_by_hr
# Plotting the data on a line graph
fig = px.line(df_purchase_by_hr, x="hour", y="number_of_purchases",
              title='Number of purchases completed by time of day')
fig.update_layout(title_x = 0.5)
fig.show()
This graph is interactive! Try moving your mouse to interact with it.
Quick comment: A very apparent peak time for orders is the middle of the day (10 AM in particular). On another note, not many people were shopping online after 9 PM, as the number of orders dipped throughout the night. Implications? Maybe run promotions during the hours when interest peaks.
In this notebook, I have analyzed sales data from an ecommerce website and obtained some useful insights. Hopefully you have gained something interesting from this too. If you have any comments, please let me know.
end_time = time.process_time()
elapsed_time = end_time - start_time
print('This notebook was run in: ', round(elapsed_time,2), ' seconds')
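A final note on the timer: time.process_time() counts CPU time only and excludes time spent waiting (e.g. on I/O or plot rendering). If wall-clock time is what you're after, a minimal alternative is time.perf_counter():
# wall-clock timing sketch (perf_counter includes waiting time)
start = time.perf_counter()
# ... run the notebook cells ...
print('Elapsed wall-clock time: ', round(time.perf_counter() - start, 2), ' seconds')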