Analyzing Sales Data from Ecommerce Website

Author: Giang Son Nguyen.
Date: 26.05.2021

Overview

In this project, I've been given 10 days of sales data (from September 2018) for an undisclosed ecommerce website in Vietnam.

My objective is to perform several data analysis tasks (including one data wrangling task and one case study - more details on the tasks below) to gain key business insights. I'll also provide commentary/explanations wherever necessary. The data is somewhat limited, but the large volume of interactions still enables some interesting inferences.

Note: The product category names (merchant column) are in Vietnamese, so I will translate them when making interpretations about the data. Really sorry for this inconsistency 😥.

Task 0: Import necessary libraries and load data

In [1]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from itertools import combinations, permutations
import time
In [2]:
prices_df = pd.read_csv('prices.csv')
sales_df = pd.read_csv('sales.csv')
df = pd.read_csv('case_study.csv',low_memory=False)

start_time= time.process_time()

Task 1: Data Wrangling

Questions:

(The task questions were provided as an embedded image and are not reproduced here.)

Answers:

0. Exploratory Data Analysis and Preparation

In [3]:
prices_df.head()
Out[3]:
product_id old_price new_price updated_at
0 64 270000 239000 9/10/18 16:37
1 3954203 60000 64000 9/11/18 11:54
2 3954203 60500 57500 9/17/18 22:59
3 3954203 64000 60500 9/15/18 3:49
4 3954203 68800 60000 9/10/18 16:32
In [4]:
sales_df.head()
Out[4]:
product_id ordered_at quantity_ordered
0 3998909 9/18/18 17:51 1
1 3998909 9/18/18 12:52 1
2 3998909 9/18/18 11:33 1
3 3998909 9/18/18 18:47 1
4 3998909 9/18/18 17:36 1
In [5]:
# Checking the data types of each column in the 2 dataframes shows that 
# the *updated_at* and *ordered_at* columns are stored as strings.
print('price_df ' + '*'*20)
print(prices_df.dtypes)
print('sales_df ' + '*'*20)
print(sales_df.dtypes)
price_df ********************
product_id     int64
old_price      int64
new_price      int64
updated_at    object
dtype: object
sales_df ********************
product_id           int64
ordered_at          object
quantity_ordered     int64
dtype: object
In [6]:
# Let's format them to datetime for easier processing. 
prices_df['updated_at'] = pd.to_datetime(prices_df['updated_at'], infer_datetime_format = True)
sales_df['ordered_at'] = pd.to_datetime(sales_df['ordered_at'], infer_datetime_format = True)
# While we're at it, let's change the dtype for *product_id* too (from int to str).
prices_df['product_id'] = prices_df['product_id'].astype('str')
sales_df['product_id'] = sales_df['product_id'].astype('str')
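Side note: infer_datetime_format leaves pandas to guess the layout. Since the head() outputs above show timestamps like 9/10/18 16:37, an explicit format string would be safer if the files always use that layout. A sketch, assuming the m/d/yy H:MM layout holds throughout both files:

# Explicit parsing (a sketch; equivalent to the calls above if the layout is consistent)
prices_df['updated_at'] = pd.to_datetime(prices_df['updated_at'], format='%m/%d/%y %H:%M')
sales_df['ordered_at'] = pd.to_datetime(sales_df['ordered_at'], format='%m/%d/%y %H:%M')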
In [7]:
# the results
print(prices_df.dtypes)
print(sales_df.dtypes)
product_id            object
old_price              int64
new_price              int64
updated_at    datetime64[ns]
dtype: object
product_id                  object
ordered_at          datetime64[ns]
quantity_ordered             int64
dtype: object
In [8]:
# I also sort the prices_df by *updated_at* in descending order
# so it's easier to find the right price later on
prices_df_sorted = prices_df.sort_values('updated_at', ascending = False)
prices_df_sorted.head()
Out[8]:
product_id old_price new_price updated_at
2 3954203 60500 57500 2018-09-17 22:59:00
8 4085861 53500 52000 2018-09-17 22:59:00
10 4085861 58000 53500 2018-09-17 03:35:00
5 3998909 15500 16500 2018-09-16 05:09:00
12 4085861 62500 58000 2018-09-15 03:51:00

1. Finding the newest prices for each product at each timestamp

Explaining my logic:
The price for each product at the time of order is determined in one of two ways:

  1. If an adjustment occurred right before the ordered_at time, the price is the new_price of that adjustment
  2. If no adjustment occurred before, the price is the old_price of the adjustment that occurred right after the ordered_at time

So, all I have to do is:

  1. Loop through each row in the sales_df and find the right price for the product at the specific order time using the above logic. The for loop is not the most efficient method, but it gets the job done (a vectorized alternative is sketched after the code cell below).
  2. Calculate a new column revenue. Obviously, revenue = price × quantity_ordered. (I prefer my code to be clear, so I made this a separate step outside the for loop.)
  3. Sum the revenue column by each product_id and price.
In [9]:
# finding the unit price.
sales_df['price'] = np.zeros(sales_df.shape[0])
for row in range(sales_df.shape[0]): # loop through each row
    # find the product_id and order_at time
    product_id = sales_df.iloc[row]['product_id']
    ordered_at = sales_df.iloc[row]['ordered_at']
    # the conditions for adjustments before 'ordered_at' (logic 1) 
    mask_new = (prices_df_sorted['product_id'] == product_id) & (prices_df_sorted['updated_at'] < ordered_at)
    # the conditions for adjustments after 'ordered_at' (logic 2) 
    mask_old = (prices_df_sorted['product_id'] == product_id) & (prices_df_sorted['updated_at'] > ordered_at)
    
    # find the adjustments in the prices_df
    new_price = prices_df_sorted[mask_new]['new_price']
    old_price = prices_df_sorted[mask_old]['old_price']
    
    if len(new_price) > 0 : 
        # if there is an adjustment before, then the price will be the new_price of the newest adjustment
        price = new_price.iloc[0]
    else: 
        # if there is no adjustment before, then the price will be the old_price of the oldest adjustment
        price = old_price.iloc[-1]
    sales_df['price'].iloc[row] = price
    
# calculate revenue
sales_df['revenue'] = sales_df['quantity_ordered'] * sales_df['price']
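For reference, the same matching can be done without the Python-level loop. This is only a sketch of an alternative, not the method used above: pd.merge_asof pairs each order with the latest adjustment strictly before ordered_at (allow_exact_matches=False mirrors the loop's '<' comparison). Orders with no earlier adjustment fall back to the product's earliest old_price, which, by logic 2 above, equals the old_price of the first adjustment after the order (if nothing precedes the order, the first later adjustment is also the earliest adjustment overall for that product). The resulting price_asof column should agree with the loop's price column.

# Vectorized sketch with merge_asof (both inputs must be sorted ascending on the time key)
prices_asc = prices_df.sort_values('updated_at')
orders_asc = sales_df.sort_values('ordered_at')
matched = pd.merge_asof(orders_asc, prices_asc,
                        left_on='ordered_at', right_on='updated_at',
                        by='product_id', direction='backward',
                        allow_exact_matches=False)
# Orders with no adjustment before them get NaN; fill with the earliest old_price per product
earliest_old = prices_asc.groupby('product_id')['old_price'].first()
matched['price_asof'] = matched['new_price'].fillna(matched['product_id'].map(earliest_old))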
In [10]:
sales_df.head()
Out[10]:
product_id ordered_at quantity_ordered price revenue
0 3998909 2018-09-18 17:51:00 1 16500.0 16500.0
1 3998909 2018-09-18 12:52:00 1 16500.0 16500.0
2 3998909 2018-09-18 11:33:00 1 16500.0 16500.0
3 3998909 2018-09-18 18:47:00 1 16500.0 16500.0
4 3998909 2018-09-18 17:36:00 1 16500.0 16500.0
In [11]:
# Calculating the revenue for each product at each price point.
# I also included the total quantity ordered for reference
sales_df.groupby(['product_id','price']).agg({'revenue':'sum','quantity_ordered':'sum'})
Out[11]:
revenue quantity_ordered
product_id price
3954203 57500.0 57500.0 1
60000.0 180000.0 3
64000.0 640000.0 10
3998909 15500.0 15500.0 1
16500.0 231000.0 14
17000.0 34000.0 2
4085861 52000.0 1040000.0 20
53500.0 2140000.0 40
58000.0 2204000.0 38
60000.0 180000.0 3
62500.0 1812500.0 29
67000.0 871000.0 13
64 239000.0 956000.0 4

Task 2: Case Study

Questions:

(The case study questions were provided as an embedded image and are not reproduced here.)

Answers:

0. Data Preparation

In [12]:
df.head()
Out[12]:
fullVisitorid visitId source interaction_time productSKU action_type transactionId merchant
0 10796433617570240226 1537231206 FB 2018-09-18 08:05:10 276921 3 NaN Sách Tiếng Việt
1 3420023669829788288 1537339917 FB 2018-09-19 13:53:09 352156 3 NaN Sách Tiếng Việt
2 260086599438307401 1536730102 FB 2018-09-12 12:49:53 392955 3 NaN Làm đẹp - Sức khỏe
3 3156776568682643722 1536885003 FB 2018-09-14 07:30:17 393079 3 NaN Làm đẹp - Sức khỏe
4 610658073725040936 1536882984 FB 2018-09-14 06:58:34 393079 3 NaN Làm đẹp - Sức khỏe
In [13]:
df.shape
Out[13]:
(300000, 8)
In [14]:
# I change the dtypes for some columns, similar to the first task
df['interaction_time'] = pd.to_datetime(df['interaction_time'])

df['fullVisitorid'] = df['fullVisitorid'].astype('str')
df['visitId'] = df['visitId'].astype('str')
df['productSKU'] = df['productSKU'].astype('str')
df['action_type'] = df['action_type'].astype('str')
df['transactionId'] = df['transactionId'].astype('str')

a. Remove all rows with null/other merchant

In [15]:
df.dropna(subset = ['merchant'], inplace = True)
df.shape
# the number of rows has decreased
Out[15]:
(279621, 8)

b. How many fullVisitorId added products to cart? How many completed purchases? Calculate these numbers in total by date and hour.

In [16]:
# b. How many fullVisitorId added products to cart? 
added_to_cart = df[df['action_type']== '3']['fullVisitorid'].nunique()
print("fullVisitorId added products to cart: ", added_to_cart)
fullVisitorId added products to cart:  149572
In [17]:
# How many completed purchases?
completed_purchase = df[df['action_type']== '6']['fullVisitorid'].nunique()
print("fullVisitorId completed purchases: ", completed_purchase)
fullVisitorId completed purchases:  32185
In [18]:
# Calculate these number in total by date and hour.
df['date'] = df['interaction_time'].dt.date
df['hour'] = df['interaction_time'].dt.hour
In [19]:
df[df['action_type']== '3'].groupby(['date','hour']).agg({'fullVisitorid': pd.Series.nunique})
Out[19]:
fullVisitorid
date hour
2018-09-10 0 549
1 228
2 116
3 73
4 81
... ... ...
2018-09-19 19 1009
20 1274
21 1346
22 1224
23 790

240 rows × 1 columns

In [20]:
df[df['action_type']== '6'].groupby(['date','hour']).agg({'fullVisitorid': pd.Series.nunique})
Out[20]:
fullVisitorid
date hour
2018-09-10 0 79
1 40
2 17
3 10
4 10
... ... ...
2018-09-19 19 157
20 174
21 212
22 191
23 114

239 rows × 1 columns
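
The two tables above can also be built in one pass. A sketch (using the date and hour columns created earlier): count unique visitors per date/hour for both action types at once, pivoting action_type into columns.

# Unique visitors per date/hour for both actions, side by side
hourly = (df[df['action_type'].isin(['3', '6'])]
          .groupby(['date', 'hour', 'action_type'])['fullVisitorid']
          .nunique()
          .unstack('action_type')
          .rename(columns={'3': 'added_to_cart', '6': 'completed_purchase'}))
hourly.head()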

c. Calculate cart abandonment rate

In [21]:
abandonment_rate = (added_to_cart - completed_purchase)/added_to_cart
print("The average abandonment rate is: ", round(abandonment_rate*100,2), "%")
The average abandonment rate is:  78.48 %

d. Is there any relationship between the merchant (product category) added to cart and the cart abandonment rate?

In [22]:
# Create a df of added_to_cart action (action_type = 3), grouped by merchant
df_merchant_added_to_cart = df[df['action_type'] == '3'].groupby('merchant')[['merchant']].count()
df_merchant_added_to_cart.columns = ['added_to_cart']

# Create a df of completed_purchase action (action_type = 6), grouped by merchant
df_merchant_completed_purchase = df[df['action_type'] == '6'].groupby('merchant')[['merchant']].count()
df_merchant_completed_purchase.columns = ['completed_purchase']
In [23]:
# Merge them together, then calculate the abandonment rate for each merchant
df_merchant = pd.merge(df_merchant_added_to_cart,df_merchant_completed_purchase, on = 'merchant').reset_index()
df_merchant['abandonment_rate'] = (df_merchant['added_to_cart'] - df_merchant['completed_purchase'])/df_merchant['added_to_cart']
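The same per-merchant table can be produced in one step. A sketch: pd.crosstab counts rows per merchant and action_type directly, so the two intermediate dataframes and the merge aren't needed.

# One-step alternative: cross-tabulate merchant against action_type
ct = pd.crosstab(df['merchant'], df['action_type'])[['3', '6']]
ct.columns = ['added_to_cart', 'completed_purchase']
ct['abandonment_rate'] = 1 - ct['completed_purchase'] / ct['added_to_cart']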
In [24]:
# Keeping only merchants with a positive abandonment rate, sorted from highest to lowest
df_merchant = df_merchant[df_merchant['abandonment_rate'] >0].sort_values('abandonment_rate', ascending = False)
df_merchant.head(10)
Out[24]:
merchant added_to_cart completed_purchase abandonment_rate
3 Laptop & PC 997 56 0.943831
21 Điện thoại - Máy tính bảng 7577 567 0.925168
19 Xe máy, ô tô, xe đạp 2979 229 0.923129
20 Điện lạnh 1302 123 0.905530
16 Tivi - Audio 1837 192 0.895482
15 Thời trang 15097 1580 0.895343
14 Thể Thao & Dã Ngoại 5385 629 0.883194
0 Bách Hóa Online 14306 2098 0.853348
5 Máy ảnh & Phụ kiện 1451 218 0.849759
4 Làm đẹp - Sức khỏe 21737 3497 0.839122
In [25]:
# Setting the style for sns plot. I personally prefer big plots so figsize will be (16,10)
sns.set(style='whitegrid',
       palette="deep", 
       font_scale=1.1, 
       rc={"figure.figsize": [16, 10]}
      )
In [26]:
# Plot the merchants whose abandonment rates are above average
ax = sns.barplot(data = df_merchant[df_merchant['abandonment_rate'] > abandonment_rate], x= 'abandonment_rate', y = 'merchant')
ax.set_title('Merchants with abandonment rate above average')
ax.set_xlabel("Abandonment rate")
ax.set_ylabel("Merchant (Product categories)")
plt.show()

Quick comment: The merchants with the highest abandonment rates are product categories that are typically high-priced (at least in Vietnam):

  1. Laptop & PC
  2. Smartphones - Tablets
  3. Motorbikes, Cars, Bicycles
  4. Household Appliances
  5. TV - Audio

e. What products/merchants are frequently bought together?

In [27]:
# In this question, we only care about transactions where 
# the purchase has been completed and at least 2 types of products are involved,
# i.e. action_type = 6 and the transactionId appears in at least 2 different rows

# Let's create a dataframe consisting solely of the transactions that fit our description
df_duplicate = df[(df['transactionId'].duplicated(keep=False)) & (df['action_type']== '6') & (df['merchant'] != 'Promotion')][['transactionId','merchant']]
df_duplicate.drop_duplicates(inplace = True)
df_duplicate.head()
Out[27]:
transactionId merchant
120 754835353 Sách Tiếng Việt
123 754835353 Làm đẹp - Sức khỏe
182 246993893 Xe máy, ô tô, xe đạp
396 538738164 Sách Tiếng Việt
398 200325610 Sách Tiếng Việt
In [28]:
# Define a function that finds the pairs of products
# You can find more explanation here: https://youtu.be/iDV0Zh6HDGc
def find_pairs(x):
    pairs = pd.DataFrame(list(permutations(x.values,2)), columns = ['item_1','item_2'])
    return pairs
In [29]:
# Create a dataframe consisting of all the possible combos
df_combo = df_duplicate.groupby('transactionId')['merchant'].apply(find_pairs).reset_index(drop = True)

# Calculate the frequency of them being bought together
df_combo = df_combo.groupby(['item_1','item_2']).size().reset_index()
df_combo.columns = ['item_1','item_2','frequency']
df_combo.sort_values('frequency',ascending = False,inplace = True)

# Create a new column for the combo name
df_combo['combo']= df_combo['item_1'] + ' & ' + df_combo['item_2'] 
df_combo.head()
Out[29]:
item_1 item_2 frequency combo
135 Sách Tiếng Việt Văn phòng phẩm 185 Sách Tiếng Việt & Văn phòng phẩm
211 Văn phòng phẩm Sách Tiếng Việt 185 Văn phòng phẩm & Sách Tiếng Việt
54 Làm đẹp - Sức khỏe Sách Tiếng Việt 129 Làm đẹp - Sức khỏe & Sách Tiếng Việt
124 Sách Tiếng Việt Làm đẹp - Sức khỏe 129 Sách Tiếng Việt & Làm đẹp - Sức khỏe
76 Mẹ và Bé Sách Tiếng Việt 115 Mẹ và Bé & Sách Tiếng Việt
In [30]:
# You'll notice that each pair is shown twice (once as item_1 & item_2 and once as item_2 & item_1),
# so we only need to take the even rows of that dataframe
# (a pair and its reverse share the same frequency, so they land next to each other after sorting)
df_combo = df_combo[['combo','frequency']][::2]
df_combo.head()
Out[30]:
combo frequency
135 Sách Tiếng Việt & Văn phòng phẩm 185
54 Làm đẹp - Sức khỏe & Sách Tiếng Việt 129
76 Mẹ và Bé & Sách Tiếng Việt 115
129 Sách Tiếng Việt & Sách Tiếng Anh 88
131 Sách Tiếng Việt & Thiết bị số - Phụ kiện số 74
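As an aside, the double counting can be avoided from the start. A sketch of an alternative: itertools.combinations (already imported above) yields each unordered pair once, and sorting the merchants within each transaction makes the pair key canonical, so no even-row trick is needed.

# Each unordered pair counted once per transaction
def find_unordered_pairs(x):
    return pd.DataFrame(list(combinations(sorted(x.values), 2)),
                        columns=['item_1', 'item_2'])

pair_counts = (df_duplicate.groupby('transactionId')['merchant']
               .apply(find_unordered_pairs)
               .reset_index(drop=True)
               .groupby(['item_1', 'item_2']).size()
               .sort_values(ascending=False))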
In [31]:
# Plotting the pairs and their frequency of being bought together
ax = sns.barplot(data = df_combo.head(10), y = 'combo', x = 'frequency')
ax.set_title("Top 10 product pairs most frequently sold together")
ax.set_ylabel("Pair of products")
ax.set_xlabel("Number of times bought together")
plt.show()

Quick comment: The most frequently bought-together pair is Vietnamese books & office supplies (pens, paper, etc.).

Surprisingly, Vietnamese books appear in 7 of the top 10 pairs. However, I wouldn't read too much into this because it's very likely that customers were buying books on this site and just happened to add something else in the same order (trust me, I've been there too).

f. Calculate the average time lag from when a product is first added to cart until its purchase time

In case a user has several transactionIds, consider only products added in the time interval between two consecutive transactions.

In [32]:
df_purchases = df[df['action_type'] == '6'][['fullVisitorid','source','interaction_time','productSKU','merchant']]
df_purchases.head()
Out[32]:
fullVisitorid source interaction_time productSKU merchant
30 1917931473058109470 FB 2018-09-13 21:41:06 888949 Làm đẹp - Sức khỏe
38 806808667484303094 FB 2018-09-13 17:10:54 1320851 Làm đẹp - Sức khỏe
41 6875896961917464162 FB 2018-09-14 12:05:20 1320851 Làm đẹp - Sức khỏe
120 3278199159587501495 FB 2018-09-14 17:40:11 1694077 Sách Tiếng Việt
123 3278199159587501495 FB 2018-09-14 17:40:11 2738499 Làm đẹp - Sức khỏe
In [33]:
df_to_cart = df[df['action_type'] == '3'].groupby(['fullVisitorid','productSKU'])['interaction_time'].min().reset_index()
df_to_cart.head()
Out[33]:
fullVisitorid productSKU interaction_time
0 10000047349280622264 3630829 2018-09-16 15:25:17
1 10000047349280622264 4310671 2018-09-16 15:24:35
2 10000428164449421463 1745163 2018-09-19 08:57:01
3 10000777959432528570 3953405 2018-09-15 02:33:26
4 10000784981573917907 1408705 2018-09-11 19:42:47
In [34]:
df_timelag = pd.merge(df_purchases,df_to_cart, how = 'left', on = ['fullVisitorid','productSKU'])
df_timelag.rename(columns = {'interaction_time_x': 'completed_purchase_time', 'interaction_time_y': 'added_to_cart_time'}, inplace = True)

# Calculate the time lag between the first added to cart product and the time of completing purchase
df_timelag['time_lag'] = df_timelag['completed_purchase_time'] - df_timelag['added_to_cart_time']

# Some orders don't have an add-to-cart time (?), so I assume the purchase was made immediately after the product was added to cart,
# which means the time_lag for those transactions is 0
df_timelag['time_lag'].fillna(pd.Timedelta(seconds=0), inplace = True)
# convert the time_lag to minutes
df_timelag['time_lag'] = df_timelag['time_lag'].dt.total_seconds()/60
df_timelag.head()
Out[34]:
fullVisitorid source completed_purchase_time productSKU merchant added_to_cart_time time_lag
0 1917931473058109470 FB 2018-09-13 21:41:06 888949 Làm đẹp - Sức khỏe NaT 0.000000
1 806808667484303094 FB 2018-09-13 17:10:54 1320851 Làm đẹp - Sức khỏe 2018-09-13 17:06:12 4.700000
2 6875896961917464162 FB 2018-09-14 12:05:20 1320851 Làm đẹp - Sức khỏe 2018-09-14 12:04:51 0.483333
3 3278199159587501495 FB 2018-09-14 17:40:11 1694077 Sách Tiếng Việt 2018-09-14 17:38:25 1.766667
4 3278199159587501495 FB 2018-09-14 17:40:11 2738499 Làm đẹp - Sức khỏe 2018-09-14 17:39:48 0.383333
In [35]:
print("Average time lag between first add-to-cart product and purchase time is: ", round(df_timelag['time_lag'].mean(),2), ' minutes')
Average time lag between first add-to-cart product and purchase time is:  86.09  minutes
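A mean of time lags can be pulled up by a handful of very long gaps, so a quick robustness check is worth a look. A sketch:

# Median and quantiles are less sensitive to outliers than the mean
print('Median time lag (minutes): ', round(df_timelag['time_lag'].median(), 2))
print(df_timelag['time_lag'].quantile([0.25, 0.5, 0.75, 0.95]))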
In [36]:
# Find the average time lag for each merchant
time_lag_by_merchant = df_timelag.groupby('merchant')['time_lag'].mean().reset_index()
time_lag_by_merchant.sort_values('time_lag', ascending = False, inplace = True)
time_lag_by_merchant.head(10)
Out[36]:
merchant time_lag
8 Mẹ và Bé 270.320141
23 Điện lạnh 201.059621
20 Voucher - Dịch vụ 180.520200
24 Điện thoại - Máy tính bảng 127.247443
22 Xe máy, ô tô, xe đạp 120.698836
13 Sách Tiếng Việt 93.567515
4 Gia dụng 92.332197
21 Văn phòng phẩm 91.076305
16 Thể Thao & Dã Ngoại 85.154849
11 Quà Lưu Niệm 76.715783
In [37]:
fig = px.bar(time_lag_by_merchant.head(10), x="time_lag", y="merchant", 
             title='Average time lag by merchant',
             labels = {'time_lag': 'Time lag between add-to-cart and purchase (minutes)',
                      'merchant': 'Merchant (Product category)'}) 
fig.update_layout(title_x = 0.5)
fig.show()

It's rather difficult to draw interpretations from this question, because I have many doubts about (1) the data and (2) my calculation method. So I'll leave it to the reader to make sense of this.

g. Is there any other insight that you think is interesting?

In [38]:
# Let's examine the number of transactions completed, grouped by time of day
df_purchase_by_hr = df[df['action_type'] == '6'].groupby('hour')['transactionId'].nunique().reset_index()
df_purchase_by_hr.columns = ['hour','number_of_purchases']
In [39]:
df_purchase_by_hr
Out[39]:
hour number_of_purchases
0 0 611
1 1 303
2 2 137
3 3 78
4 4 99
5 5 139
6 6 286
7 7 663
8 8 1714
9 9 2617
10 10 3295
11 11 2996
12 12 2231
13 13 2644
14 14 2800
15 15 2619
16 16 2394
17 17 1791
18 18 1340
19 19 1308
20 20 1676
21 21 1885
22 22 1725
23 23 1222
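The peak hour can also be read off programmatically rather than by eye; per the table above, it is 10:00 with 3,295 purchases. A sketch:

# Locate the row with the largest purchase count
peak = df_purchase_by_hr.loc[df_purchase_by_hr['number_of_purchases'].idxmax()]
print('Peak hour:', int(peak['hour']), '- purchases:', int(peak['number_of_purchases']))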
In [40]:
# Plotting the data on a line graph
fig = px.line(df_purchase_by_hr, x="hour", y="number_of_purchases", 
             title='Number of purchases completed by time of day')
fig.update_layout(title_x = 0.5)
fig.show()

This graph is interactive! Try moving your mouse over it to interact.

Quick comment: Orders clearly peak in the middle of the day (10 AM in particular). On another note, not many people shop online late at night: the number of orders dips steadily after 9 PM. Implications? Perhaps schedule promotions for the hours when interest peaks.

Conclusion

In this notebook, I have analyzed sales data from an ecommerce website and obtained some useful insights. Hopefully you have gained something interesting from this too. If you have any comments, please let me know.

Thank you for reading.


In [41]:
end_time = time.process_time()
elapsed_time = end_time - start_time
print('This notebook was run in: ', round(elapsed_time,2), ' seconds')
This notebook was run in:  17.47  seconds