Analyzing Sales Data from Ecommerce Website

Author: Giang Son Nguyen.
Date: 26.05.2021

Overview

In this project, I've been given 10 days of sales data (from September 2018) for an undisclosed ecommerce website in Vietnam.

My objective is to perform several data analysis tasks (including one data wrangling task and one case study - more details on the tasks below) to gain key business insights. I'll also provide commentary/explanations wherever necessary. The data is somewhat limited, but the large volume of interactions still enables some interesting inferences.

Note: The product category names (merchant column) are in Vietnamese, so I will translate them when making interpretations about the data. Really sorry for this inconsistency 😥.

Task 0: Import necessary libraries and load data

In [1]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from itertools import combinations, permutations
import time
In [2]:
prices_df = pd.read_csv('prices.csv')
sales_df = pd.read_csv('sales.csv')
df = pd.read_csv('case_study.csv',low_memory=False)

start_time= time.process_time()

Task 1: Data Wrangling

Questions:

(The task questions were provided as an embedded image and are not reproduced here.)

Answers:

0. Exploratory Data Analysis and Preparation

In [3]:
prices_df.head()
Out[3]:
product_id old_price new_price updated_at
0 64 270000 239000 9/10/18 16:37
1 3954203 60000 64000 9/11/18 11:54
2 3954203 60500 57500 9/17/18 22:59
3 3954203 64000 60500 9/15/18 3:49
4 3954203 68800 60000 9/10/18 16:32
In [4]:
sales_df.head()
Out[4]:
product_id ordered_at quantity_ordered
0 3998909 9/18/18 17:51 1
1 3998909 9/18/18 12:52 1
2 3998909 9/18/18 11:33 1
3 3998909 9/18/18 18:47 1
4 3998909 9/18/18 17:36 1
In [5]:
# Checking the data types of each column in the 2 dataframes shows that 
# the *updated_at* and *ordered_at* columns are stored as strings.
print('price_df ' + '*'*20)
print(prices_df.dtypes)
print('sales_df ' + '*'*20)
print(sales_df.dtypes)
price_df ********************
product_id     int64
old_price      int64
new_price      int64
updated_at    object
dtype: object
sales_df ********************
product_id           int64
ordered_at          object
quantity_ordered     int64
dtype: object
In [6]:
# Let's format them to datetime for easier processing. 
prices_df['updated_at'] = pd.to_datetime(prices_df['updated_at'], infer_datetime_format = True)
sales_df['ordered_at'] = pd.to_datetime(sales_df['ordered_at'], infer_datetime_format = True)
# While we're at it, let's change the dtype for *product_id* too (from int to str).
prices_df['product_id'] = prices_df['product_id'].astype('str')
sales_df['product_id'] = sales_df['product_id'].astype('str')
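Side note: infer_datetime_format leaves pandas to guess the layout. Since the head() outputs above show timestamps like 9/10/18 16:37, an explicit format string would be safer if the files always use that layout. A sketch, assuming the m/d/yy H:MM layout holds throughout both files:

# Explicit parsing (a sketch; equivalent to the calls above if the layout is consistent)
prices_df['updated_at'] = pd.to_datetime(prices_df['updated_at'], format='%m/%d/%y %H:%M')
sales_df['ordered_at'] = pd.to_datetime(sales_df['ordered_at'], format='%m/%d/%y %H:%M')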
In [7]:
# the results
print(prices_df.dtypes)
print(sales_df.dtypes)
product_id            object
old_price              int64
new_price              int64
updated_at    datetime64[ns]
dtype: object
product_id                  object
ordered_at          datetime64[ns]
quantity_ordered             int64
dtype: object
In [8]:
# I also sort the prices_df by *updated_at* in descending order
# so it's easier to find the right price later on
prices_df_sorted = prices_df.sort_values('updated_at', ascending = False)
prices_df_sorted.head()
Out[8]:
product_id old_price new_price updated_at
2 3954203 60500 57500 2018-09-17 22:59:00
8 4085861 53500 52000 2018-09-17 22:59:00
10 4085861 58000 53500 2018-09-17 03:35:00
5 3998909 15500 16500 2018-09-16 05:09:00
12 4085861 62500 58000 2018-09-15 03:51:00

1. Finding the newest prices for each product at each timestamp

Explaining my logic:
The price for each product at the time of order is determined in one of two ways:

  1. If an adjustment occurred right before the ordered_at time, the price is the new_price of that adjustment
  2. If no adjustment occurred before, the price is the old_price of the adjustment that occurred right after the ordered_at time

So, all I have to do is:

  1. Loop through each row in the sales_df and find the right price for the product at the specific order time using the above logic. The for loop is not the most efficient method, but it gets the job done (a vectorized alternative is sketched after the code cell below).
  2. Calculate a new column revenue. Obviously, revenue = price × quantity_ordered. (I prefer my code to be clear, so I made this a separate step outside the for loop.)
  3. Sum the revenue column by each product_id and price.
In [9]:
# finding the unit price.
sales_df['price'] = np.zeros(sales_df.shape[0])
for row in range(sales_df.shape[0]): # loop through each row
    # find the product_id and order_at time
    product_id = sales_df.iloc[row]['product_id']
    ordered_at = sales_df.iloc[row]['ordered_at']
    # the conditions for adjustments before 'ordered_at' (logic 1) 
    mask_new = (prices_df_sorted['product_id'] == product_id) & (prices_df_sorted['updated_at'] < ordered_at)
    # the conditions for adjustments after 'ordered_at' (logic 2) 
    mask_old = (prices_df_sorted['product_id'] == product_id) & (prices_df_sorted['updated_at'] > ordered_at)
    
    # find the adjustments in the prices_df
    new_price = prices_df_sorted[mask_new]['new_price']
    old_price = prices_df_sorted[mask_old]['old_price']
    
    if len(new_price) > 0 : 
        # if there is an adjustment before, then the price will be the new_price of the newest adjustment
        price = new_price.iloc[0]
    else: 
        # if there is no adjustment before, then the price will be the old_price of the oldest adjustment
        price = old_price.iloc[-1]
    sales_df['price'].iloc[row] = price
    
# calculate revenue
sales_df['revenue'] = sales_df['quantity_ordered'] * sales_df['price']
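For reference, the same matching can be done without the Python-level loop. This is only a sketch of an alternative, not the method used above: pd.merge_asof pairs each order with the latest adjustment strictly before ordered_at (allow_exact_matches=False mirrors the loop's '<' comparison). Orders with no earlier adjustment fall back to the product's earliest old_price, which, by logic 2 above, equals the old_price of the first adjustment after the order (if nothing precedes the order, the first later adjustment is also the earliest adjustment overall for that product). The resulting price_asof column should agree with the loop's price column.

# Vectorized sketch with merge_asof (both inputs must be sorted ascending on the time key)
prices_asc = prices_df.sort_values('updated_at')
orders_asc = sales_df.sort_values('ordered_at')
matched = pd.merge_asof(orders_asc, prices_asc,
                        left_on='ordered_at', right_on='updated_at',
                        by='product_id', direction='backward',
                        allow_exact_matches=False)
# Orders with no adjustment before them get NaN; fill with the earliest old_price per product
earliest_old = prices_asc.groupby('product_id')['old_price'].first()
matched['price_asof'] = matched['new_price'].fillna(matched['product_id'].map(earliest_old))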
In [10]:
sales_df.head()
Out[10]:
product_id ordered_at quantity_ordered price revenue
0 3998909 2018-09-18 17:51:00 1 16500.0 16500.0
1 3998909 2018-09-18 12:52:00 1 16500.0 16500.0
2 3998909 2018-09-18 11:33:00 1 16500.0 16500.0
3 3998909 2018-09-18 18:47:00 1 16500.0 16500.0
4 3998909 2018-09-18 17:36:00 1 16500.0 16500.0
In [11]:
# Calculating the revenue for each product at each price point.
# I also included the total quantity ordered for reference
sales_df.groupby(['product_id','price']).agg({'revenue':'sum','quantity_ordered':'sum'})
Out[11]:
revenue quantity_ordered
product_id price
3954203 57500.0 57500.0 1
60000.0 180000.0 3
64000.0 640000.0 10
3998909 15500.0 15500.0 1
16500.0 231000.0 14
17000.0 34000.0 2
4085861 52000.0 1040000.0 20
53500.0 2140000.0 40
58000.0 2204000.0 38
60000.0 180000.0 3
62500.0 1812500.0 29
67000.0 871000.0 13
64 239000.0 956000.0 4

Task 2: Case Study

Questions:

(The case study questions were provided as an embedded image and are not reproduced here.)

Answers:

0. Data Preparation

In [12]:
df.head()
Out[12]:
fullVisitorid visitId source interaction_time productSKU action_type transactionId merchant
0 10796433617570240226 1537231206 FB 2018-09-18 08:05:10 276921 3 NaN Sách Tiếng Việt
1 3420023669829788288 1537339917 FB 2018-09-19 13:53:09 352156 3 NaN Sách Tiếng Việt
2 260086599438307401 1536730102 FB 2018-09-12 12:49:53 392955 3 NaN Làm đẹp - Sức khỏe
3 3156776568682643722 1536885003 FB 2018-09-14 07:30:17 393079 3 NaN Làm đẹp - Sức khỏe
4 610658073725040936 1536882984 FB 2018-09-14 06:58:34 393079 3 NaN Làm đẹp - Sức khỏe
In [13]:
df.shape
Out[13]:
(300000, 8)
In [14]:
# I change the dtypes for some columns, similar to the first task
df['interaction_time'] = pd.to_datetime(df['interaction_time'])

df['fullVisitorid'] = df['fullVisitorid'].astype('str')
df['visitId'] = df['visitId'].astype('str')
df['productSKU'] = df['productSKU'].astype('str')
df['action_type'] = df['action_type'].astype('str')
df['transactionId'] = df['transactionId'].astype('str')

a. Remove all rows with null/other merchant

In [15]:
df.dropna(subset = ['merchant'], inplace = True)
df.shape
# the number of rows has decreased
Out[15]:
(279621, 8)

b. How many fullVisitorId added products to cart? How many completed purchases? Calculate these numbers in total by date and hour.

In [16]:
# b. How many fullVisitorId added products to cart? 
added_to_cart = df[df['action_type']== '3']['fullVisitorid'].nunique()
print("fullVisitorId added products to cart: ", added_to_cart)
fullVisitorId added products to cart:  149572
In [17]:
# How many completed purchases?
completed_purchase = df[df['action_type']== '6']['fullVisitorid'].nunique()
print("fullVisitorId completed purchases: ", completed_purchase)
fullVisitorId completed purchases:  32185
In [18]:
# Calculate these number in total by date and hour.
df['date'] = df['interaction_time'].dt.date
df['hour'] = df['interaction_time'].dt.hour
In [19]:
df[df['action_type']== '3'].groupby(['date','hour']).agg({'fullVisitorid': pd.Series.nunique})
Out[19]:
fullVisitorid
date hour
2018-09-10 0 549
1 228
2 116
3 73
4 81
... ... ...
2018-09-19 19 1009
20 1274
21 1346
22 1224
23 790

240 rows × 1 columns

In [20]:
df[df['action_type']== '6'].groupby(['date','hour']).agg({'fullVisitorid': pd.Series.nunique})
Out[20]:
fullVisitorid
date hour
2018-09-10 0 79
1 40
2 17
3 10
4 10
... ... ...
2018-09-19 19 157
20 174
21 212
22 191
23 114

239 rows × 1 columns
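
The two tables above can also be built in one pass. A sketch (using the date and hour columns created earlier): count unique visitors per date/hour for both action types at once, pivoting action_type into columns.

# Unique visitors per date/hour for both actions, side by side
hourly = (df[df['action_type'].isin(['3', '6'])]
          .groupby(['date', 'hour', 'action_type'])['fullVisitorid']
          .nunique()
          .unstack('action_type')
          .rename(columns={'3': 'added_to_cart', '6': 'completed_purchase'}))
hourly.head()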

c. Calculate cart abandonment rate

In [21]:
abandonment_rate = (added_to_cart - completed_purchase)/added_to_cart
print("The average abandonment rate is: ", round(abandonment_rate*100,2), "%")
The average abandonment rate is:  78.48 %

d. Is there any relationship between the merchant (product category) added to cart and the cart abandonment rate?

In [22]:
# Create a df of added_to_cart action (action_type = 3), grouped by merchant
df_merchant_added_to_cart = df[df['action_type'] == '3'].groupby('merchant')[['merchant']].count()
df_merchant_added_to_cart.columns = ['added_to_cart']

# Create a df of completed_purchase action (action_type = 6), grouped by merchant
df_merchant_completed_purchase = df[df['action_type'] == '6'].groupby('merchant')[['merchant']].count()
df_merchant_completed_purchase.columns = ['completed_purchase']
In [23]:
# Merge them together, then calculate the abandonment rate for each merchant
df_merchant = pd.merge(df_merchant_added_to_cart,df_merchant_completed_purchase, on = 'merchant').reset_index()
df_merchant['abandonment_rate'] = (df_merchant['added_to_cart'] - df_merchant['completed_purchase'])/df_merchant['added_to_cart']
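The same per-merchant table can be produced in one step. A sketch: pd.crosstab counts rows per merchant and action_type directly, so the two intermediate dataframes and the merge aren't needed.

# One-step alternative: cross-tabulate merchant against action_type
ct = pd.crosstab(df['merchant'], df['action_type'])[['3', '6']]
ct.columns = ['added_to_cart', 'completed_purchase']
ct['abandonment_rate'] = 1 - ct['completed_purchase'] / ct['added_to_cart']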
In [24]:
# Keeping only merchants with a positive abandonment rate, sorted from highest to lowest
df_merchant = df_merchant[df_merchant['abandonment_rate'] >0].sort_values('abandonment_rate', ascending = False)
df_merchant.head(10)
Out[24]:
merchant added_to_cart completed_purchase abandonment_rate
3 Laptop & PC 997 56 0.943831
21 Điện thoại - Máy tính bảng 7577 567 0.925168
19 Xe máy, ô tô, xe đạp 2979 229 0.923129
20 Điện lạnh 1302 123 0.905530
16 Tivi - Audio 1837 192 0.895482
15 Thời trang 15097 1580 0.895343
14 Thể Thao & Dã Ngoại 5385 629 0.883194
0 Bách Hóa Online 14306 2098 0.853348
5 Máy ảnh & Phụ kiện 1451 218 0.849759
4 Làm đẹp - Sức khỏe 21737 3497 0.839122
In [25]:
# Setting the style for sns plot. I personally prefer big plots so figsize will be (16,10)
sns.set(style='whitegrid',
       palette="deep", 
       font_scale=1.1, 
       rc={"figure.figsize": [16, 10]}
      )
In [26]:
# Plot the merchants whose abandonment rates are above average
ax = sns.barplot(data = df_merchant[df_merchant['abandonment_rate'] > abandonment_rate], x= 'abandonment_rate', y = 'merchant')
ax.set_title('Merchants with abandonment rate above average')
ax.set_xlabel("Abandonment rate")
ax.set_ylabel("Merchant (Product categories)")
plt.show()

Quick comment: The merchants with the highest abandonment rates are product categories that are typically high-priced (at least in Vietnam):

  1. Laptop & PC
  2. Smartphones - Tablets
  3. Motorbikes, Cars, Bicycles
  4. Household Appliances
  5. TV - Audio

e. What products/merchants are frequently bought together?

In [27]:
# In this question, we only care about transactions where 
# the purchase has been completed and at least 2 types of products are involved,
# i.e. action_type = 6 and the transactionId appears in at least 2 different rows

# Let's create a dataframe consisting solely of the transactions that fit our description
df_duplicate = df[(df['transactionId'].duplicated(keep=False)) & (df['action_type']== '6') & (df['merchant'] != 'Promotion')][['transactionId','merchant']]
df_duplicate.drop_duplicates(inplace = True)
df_duplicate.head()
Out[27]:
transactionId merchant
120 754835353 Sách Tiếng Việt
123 754835353 Làm đẹp - Sức khỏe
182 246993893 Xe máy, ô tô, xe đạp
396 538738164 Sách Tiếng Việt
398 200325610 Sách Tiếng Việt
In [28]:
# Define a function that finds the pairs of products
# You can find more explanation here: https://youtu.be/iDV0Zh6HDGc
def find_pairs(x):
    pairs = pd.DataFrame(list(permutations(x.values,2)), columns = ['item_1','item_2'])
    return pairs
In [29]:
# Create a dataframe consisting of all the possible combos
df_combo = df_duplicate.groupby('transactionId')['merchant'].apply(find_pairs).reset_index(drop = True)

# Calculate the frequency of them being bought together
df_combo = df_combo.groupby(['item_1','item_2']).size().reset_index()
df_combo.columns = ['item_1','item_2','frequency']
df_combo.sort_values('frequency',ascending = False,inplace = True)

# Create a new column for the combo name
df_combo['combo']= df_combo['item_1'] + ' & ' + df_combo['item_2'] 
df_combo.head()
Out[29]:
item_1 item_2 frequency combo
135 Sách Tiếng Việt Văn phòng phẩm 185 Sách Tiếng Việt & Văn phòng phẩm
211 Văn phòng phẩm Sách Tiếng Việt 185 Văn phòng phẩm & Sách Tiếng Việt
54 Làm đẹp - Sức khỏe Sách Tiếng Việt 129 Làm đẹp - Sức khỏe & Sách Tiếng Việt
124 Sách Tiếng Việt Làm đẹp - Sức khỏe 129 Sách Tiếng Việt & Làm đẹp - Sức khỏe
76 Mẹ và Bé Sách Tiếng Việt 115 Mẹ và Bé & Sách Tiếng Việt
In [30]:
# You'll notice that each pair is shown twice (once as item_1 & item_2 and once as item_2 & item_1),
# so we only need to take the even rows of that dataframe
# (a pair and its reverse share the same frequency, so they land next to each other after sorting)
df_combo = df_combo[['combo','frequency']][::2]
df_combo.head()
Out[30]:
combo frequency
135 Sách Tiếng Việt & Văn phòng phẩm 185
54 Làm đẹp - Sức khỏe & Sách Tiếng Việt 129
76 Mẹ và Bé & Sách Tiếng Việt 115
129 Sách Tiếng Việt & Sách Tiếng Anh 88
131 Sách Tiếng Việt & Thiết bị số - Phụ kiện số 74
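As an aside, the double counting can be avoided from the start. A sketch of an alternative: itertools.combinations (already imported above) yields each unordered pair once, and sorting the merchants within each transaction makes the pair key canonical, so no even-row trick is needed.

# Each unordered pair counted once per transaction
def find_unordered_pairs(x):
    return pd.DataFrame(list(combinations(sorted(x.values), 2)),
                        columns=['item_1', 'item_2'])

pair_counts = (df_duplicate.groupby('transactionId')['merchant']
               .apply(find_unordered_pairs)
               .reset_index(drop=True)
               .groupby(['item_1', 'item_2']).size()
               .sort_values(ascending=False))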
In [31]:
# Plotting the pairs and their frequency of being bought together
ax = sns.barplot(data = df_combo.head(10), y = 'combo', x = 'frequency')
ax.set_title("Top 10 product pairs most frequently sold together")
ax.set_ylabel("Pair of products")
ax.set_xlabel("Number of times bought together")
plt.show()

Quick comment: The most frequently bought-together pair is Vietnamese books & office supplies (pens, paper, etc.).

Surprisingly, Vietnamese books appear in 7 of the top 10 pairs. However, I wouldn't read too much into this because it's very likely that customers were buying books on this site and just happened to add something else in the same order (trust me, I've been there too).

f. Calculate the average time lag from when a product is first added to cart until its purchase time

In case a user has several transactionIds, consider only products added in the time interval between two consecutive transactions.

In [32]:
df_purchases = df[df['action_type'] == '6'][['fullVisitorid','source','interaction_time','productSKU','merchant']]
df_purchases.head()
Out[32]:
fullVisitorid source interaction_time productSKU merchant
30 1917931473058109470 FB 2018-09-13 21:41:06 888949 Làm đẹp - Sức khỏe
38 806808667484303094 FB 2018-09-13 17:10:54 1320851 Làm đẹp - Sức khỏe
41 6875896961917464162 FB 2018-09-14 12:05:20 1320851 Làm đẹp - Sức khỏe
120 3278199159587501495 FB 2018-09-14 17:40:11 1694077 Sách Tiếng Việt
123 3278199159587501495 FB 2018-09-14 17:40:11 2738499 Làm đẹp - Sức khỏe
In [33]:
df_to_cart = df[df['action_type'] == '3'].groupby(['fullVisitorid','productSKU'])['interaction_time'].min().reset_index()
df_to_cart.head()
Out[33]:
fullVisitorid productSKU interaction_time
0 10000047349280622264 3630829 2018-09-16 15:25:17
1 10000047349280622264 4310671 2018-09-16 15:24:35
2 10000428164449421463 1745163 2018-09-19 08:57:01
3 10000777959432528570 3953405 2018-09-15 02:33:26
4 10000784981573917907 1408705 2018-09-11 19:42:47
In [34]:
df_timelag = pd.merge(df_purchases,df_to_cart, how = 'left', on = ['fullVisitorid','productSKU'])
df_timelag.rename(columns = {'interaction_time_x': 'completed_purchase_time', 'interaction_time_y': 'added_to_cart_time'}, inplace = True)

# Calculate the time lag between the first added to cart product and the time of completing purchase
df_timelag['time_lag'] = df_timelag['completed_purchase_time'] - df_timelag['added_to_cart_time']

# Some orders don't have an add-to-cart time (?), so I assume the purchase was made immediately after the product was added to cart,
# which means the time_lag for those transactions is 0
df_timelag['time_lag'].fillna(pd.Timedelta(seconds=0), inplace = True)
# convert the time_lag to minutes
df_timelag['time_lag'] = df_timelag['time_lag'].dt.total_seconds()/60
df_timelag.head()
Out[34]:
fullVisitorid source completed_purchase_time productSKU merchant added_to_cart_time time_lag
0 1917931473058109470 FB 2018-09-13 21:41:06 888949 Làm đẹp - Sức khỏe NaT 0.000000
1 806808667484303094 FB 2018-09-13 17:10:54 1320851 Làm đẹp - Sức khỏe 2018-09-13 17:06:12 4.700000
2 6875896961917464162 FB 2018-09-14 12:05:20 1320851 Làm đẹp - Sức khỏe 2018-09-14 12:04:51 0.483333
3 3278199159587501495 FB 2018-09-14 17:40:11 1694077 Sách Tiếng Việt 2018-09-14 17:38:25 1.766667
4 3278199159587501495 FB 2018-09-14 17:40:11 2738499 Làm đẹp - Sức khỏe 2018-09-14 17:39:48 0.383333
In [35]:
print("Average time lag between first add-to-cart product and purchase time is: ", round(df_timelag['time_lag'].mean(),2), ' minutes')
Average time lag between first add-to-cart product and purchase time is:  86.09  minutes
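A mean of time lags can be pulled up by a handful of very long gaps, so a quick robustness check is worth a look. A sketch:

# Median and quantiles are less sensitive to outliers than the mean
print('Median time lag (minutes): ', round(df_timelag['time_lag'].median(), 2))
print(df_timelag['time_lag'].quantile([0.25, 0.5, 0.75, 0.95]))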
In [36]:
# Find the average time lag for each merchant
time_lag_by_merchant = df_timelag.groupby('merchant')['time_lag'].mean().reset_index()
time_lag_by_merchant.sort_values('time_lag', ascending = False, inplace = True)
time_lag_by_merchant.head(10)
Out[36]:
merchant time_lag
8 Mẹ và Bé 270.320141
23 Điện lạnh 201.059621
20 Voucher - Dịch vụ 180.520200
24 Điện thoại - Máy tính bảng 127.247443
22 Xe máy, ô tô, xe đạp 120.698836
13 Sách Tiếng Việt 93.567515
4 Gia dụng 92.332197
21 Văn phòng phẩm 91.076305
16 Thể Thao & Dã Ngoại 85.154849
11 Quà Lưu Niệm 76.715783
In [37]:
fig = px.bar(time_lag_by_merchant.head(10), x="time_lag", y="merchant", 
             title='Average time lag by merchant',
             labels = {'time_lag': 'Time lag between add-to-cart and purchase (minutes)',
                      'merchant': 'Merchant (Product category)'}) 
fig.update_layout(title_x = 0.5)
fig.show()

It's rather difficult to draw interpretations from this question, because I have many doubts about (1) the data and (2) my calculation method. So I'll leave it to the reader to make sense of this.

g. Is there any other insight that you think is interesting?

In [38]:
# Let's examine the number of transactions completed, grouped by time of day
df_purchase_by_hr = df[df['action_type'] == '6'].groupby('hour')['transactionId'].nunique().reset_index()
df_purchase_by_hr.columns = ['hour','number_of_purchases']
In [39]:
df_purchase_by_hr
Out[39]:
hour number_of_purchases
0 0 611
1 1 303
2 2 137
3 3 78
4 4 99
5 5 139
6 6 286
7 7 663
8 8 1714
9 9 2617
10 10 3295
11 11 2996
12 12 2231
13 13 2644
14 14 2800
15 15 2619
16 16 2394
17 17 1791
18 18 1340
19 19 1308
20 20 1676
21 21 1885
22 22 1725
23 23 1222
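The peak hour can also be read off programmatically rather than by eye; per the table above, it is 10:00 with 3,295 purchases. A sketch:

# Locate the row with the largest purchase count
peak = df_purchase_by_hr.loc[df_purchase_by_hr['number_of_purchases'].idxmax()]
print('Peak hour:', int(peak['hour']), '- purchases:', int(peak['number_of_purchases']))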
In [40]:
# Plotting the data on a line graph
fig = px.line(df_purchase_by_hr, x="hour", y="number_of_purchases", 
             title='Number of purchases completed by time of day')
fig.update_layout(title_x = 0.5)
fig.show()

This graph is interactive! Try moving your mouse over it to interact.

Quick comment: Orders clearly peak in the middle of the day (10 AM in particular). On another note, not many people shop online late at night: the number of orders dips steadily after 9 PM. Implications? Perhaps schedule promotions for the hours when interest peaks.

Conclusion

In this notebook, I have analyzed sales data from an ecommerce website and obtained some useful insights. Hopefully you have gained something interesting from this too. If you have any comments, please let me know.

Thank you for reading.


In [41]:
end_time = time.process_time()
elapsed_time = end_time - start_time
print('This notebook was run in: ', round(elapsed_time,2), ' seconds')
This notebook was run in:  17.47  seconds