Author: Giang Son Nguyen.
Date: 30.05.2021
In this project, I'm using the Online Retail dataset from here. This dataset contains the transactions between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. It's also a classic dataset for this type of task.
The description of each column is as below:
- InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
- UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
- CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.

My task is to segment the customers of this retail store based on their RFM scores (Recency, Frequency, Monetary), a very popular framework for customer segmentation. For example, a customer whose last purchase was 10 days ago and who has placed 5 orders totalling £200 has R = 10 days, F = 5, and M = £200. For a better understanding of the RFM method, I suggest you read further here or here.
To cluster the R, F, and M scores of each customer, I'll use the k-means clustering algorithm. As the name suggests, this algorithm divides the data points into k clusters, and it does so in a way that ensures the data points within the same cluster are as similar to each other as possible. Although this method has several limitations, it's simple and powerful enough to achieve our purpose efficiently.
For a visual representation of how this algorithm works, you can check out this video. For a deeper dive into k-means and other clustering methods, I recommend this article.
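To see the mechanics in miniature, here's a quick sketch (my own toy example, not part of this project's data) that clusters six made-up 1-D points into two groups:
# A minimal k-means sketch on made-up 1-D data with two well-separated groups
import numpy as np
from sklearn.cluster import KMeans

toy = np.array([[1], [2], [3], [50], [51], [52]])
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy)
print(km.labels_)           # two groups, e.g. [0 0 0 1 1 1] - the label numbers themselves are arbitrary
print(km.cluster_centers_)  # the cluster means: 2.0 and 51.0 (in some order)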
At the end, our customers will be neatly segmented based on their purchase behavior. As a business, we can use this information to devise appropriate actions, e.g. creating reward systems for high-paying customers, or offering discounts to customers who haven't purchased in a long time.
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import seaborn as sns
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import time
start_time = time.process_time()
df_all = pd.read_csv('OnlineRetail.csv',encoding= 'unicode_escape')
print('Data has been loaded!')
print("This dataset has", df_all.shape[0], "rows and", df_all.shape[1], 'columns.')
df_all.head()
Selecting UK Customers:
As we know, this dataset comes from a retailer in the UK, and the majority (>90%) of its customers are from the UK as well. For interpretation's sake, we'll only analyze the data for UK customers.
print("UK customers:", df_all[df_all['Country'] == 'United Kingdom']['InvoiceNo'].count())
print("Non-UK customers:", df_all[df_all['Country'] != 'United Kingdom']['InvoiceNo'].count())
print('*'*50)
df = df_all[df_all['Country'] == 'United Kingdom']
print("UK Customers selected.")
print("This dataframe now has", df.shape[0], "rows.")
df.head()
Check for missing values:
We see that there are 1454 null entries in the Description column and 133600 missing values in the CustomerID column.
- CustomerID: rows missing this value will sadly be useless and shall be dropped.
- Description: we'll leave its missing values alone, because the column is of little significance for our task.
# Before
print('Before:')
print(np.sum(df.isna()))
print('*'*50)
# Remove rows with nan
df.dropna(subset = ['CustomerID'], inplace = True)
# After
print('After:')
print(np.sum(df.isna()))
Check for negative values:
Sometimes, columns that are supposed to be positive (Quantity and UnitPrice in this case) contain negative or zero values (which represent returned items or cancelled orders). Let's check if there are any, and filter them out to avoid interference with our process.
# Before
print('Before:')
print('Non-positive quantities:', np.sum(df['Quantity'] <= 0))
print('Non-positive unit prices:', np.sum(df['UnitPrice'] <= 0))
print('*' *50)
# Remove negative values
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
# After
print('After:')
print('Non-positive quantities:', np.sum(df['Quantity'] <= 0))
print('Non-positive unit prices:', np.sum(df['UnitPrice'] <= 0))
Inspecting data types:
We can see some minor issues:
- InvoiceDate is type 'object'. We shall change it to type 'datetime' for easier calculation.
- CustomerID is type 'float64'. Although the ID is numerical, its role is actually a label. We shall change it to type 'object' (a string) to avoid confusion.
# Before
print('Before:')
print(df.dtypes)
print('*'*50)
# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], dayfirst = True)
# Convert CustomerID to str
df['CustomerID'] = df['CustomerID'].astype('int').astype('str')
# After:
print("After:")
print(df.dtypes)
That's it for now! The data is ready for calculations and analyses. Let's carry on to the next part.
In the dataset, the RFM metrics have not been calculated, so we have to do that ourselves. Before we move on to calculating each segmentation metric (R, F, M), let's create a dataframe that contains all the unique customers (CustomerID).
Later we'll add the R, F, M metrics and the respective clusters to this dataframe.
df_user = pd.DataFrame(df['CustomerID'].unique(), columns = ['CustomerID'])
print('The total number of customers is:', len(df_user))
df_user.head()
The Recency for each customer is the difference in days between their last purchase and the present day (or a specific ending day of our analysis). Since this set contains data up to 09/12/2011, that will be our ending day.
# setting the end date
end_date = df['InvoiceDate'].max()
end_date
# finding the latest purchases by each customer
df_recency = df.groupby('CustomerID')['InvoiceDate'].max().reset_index()
# Calculating Recency
df_recency['Recency'] = (end_date - df_recency['InvoiceDate']).dt.days
# Add this score to the user dataframe
df_user = pd.merge(df_user, df_recency[['CustomerID', 'Recency']], on = 'CustomerID')
df_user.head()
The Frequency for each customer is the total number of orders they have made.
# Calculating Frequency - the number of unique InvoiceNo values per customer
df_frequency = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
df_frequency.columns = ['CustomerID','Frequency']
# Adding Frequency metric to our user df
df_user = pd.merge(df_user, df_frequency, on = 'CustomerID')
df_user.head()
The Monetary value for each customer is the total value of their orders.
# Calculate revenue for each transaction
df['Monetary'] = df['UnitPrice'] * df['Quantity']
# Calculate revenue for each customer
df_monetary = df.groupby('CustomerID')['Monetary'].sum().reset_index()
# Merge that to our user df
df_user = pd.merge(df_user,df_monetary, on = 'CustomerID')
df_user.head()
Let's explore the R, F, M values.
# Take a quick look at the distribution of R, F, M values
print("This plot is interactive! Move your mouse to interact with it. ๐")
px.histogram(df_user,x = 'Recency', template = 'seaborn',width=600, height=300).show()
px.histogram(df_user,x = 'Frequency', template = 'seaborn',width=600, height=300).show()
px.histogram(df_user,x = 'Monetary', template = 'seaborn',width=600, height=300).show()
# Using boxplot to see some outliers
print("This plot is interactive! Move your mouse to interact with it. ๐")
px.box(df_user,x = 'Recency', template = 'seaborn',width=600, height=300).show()
px.box(df_user,x = 'Frequency', template = 'seaborn',width=600, height=300).show()
px.box(df_user,x = 'Monetary', template = 'seaborn',width=600, height=300).show()
Quick observations: As you can see above, the histograms for Frequency and Monetary are highly skewed due to some major outliers (likely wholesalers). They can potentially interfere with the outcome of the k-means algorithm, so we need a way to deal with them.
For real businesses, I'd recommend taking a closer look at these special customers who made 200 orders or paid £250k a year, or at least handling their numbers in a more elegant way. But for this project, I'm just going to set a reasonable threshold and drop a small number of outliers to smooth things out for the k-means results.
While scouring the Internet, I found a really handy method to determine which outliers to drop. Basically, we compute the values at specific percentiles of the distributions of Frequency and Monetary, and plot them in a line graph to see where the values spike up. This way, the number of customers to be dropped will be far fewer than with the traditional definition of 'outliers' (those outside the bounds $Q1 - 1.5\,\mathrm{IQR}$ and $Q3 + 1.5\,\mathrm{IQR}$).
And voila: we can see the obviously unusual increase after the 99th percentile.
# setting the percentiles ([4:] drops the count/mean/std/min rows, keeping the percentiles and max)
freq_percentile = df_user[['Frequency',]].describe(percentiles=[0.01,0.02,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.98,0.99])[4:]
money_percentile = df_user[['Monetary',]].describe(percentiles=[0.01,0.02,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.98,0.99])[4:]
# visualizing on line graphs
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Scatter(name = 'Frequency',x=freq_percentile.index, y=freq_percentile.iloc[:,0]), row=1, col=1)
fig.add_trace(go.Scatter(name = 'Monetary',x=money_percentile.index, y=money_percentile.iloc[:,0]), row=1, col=2)
fig.update_layout(height=440, width=980, title_text="Detecting outliers",
legend=dict(yanchor="top", y=1.2, xanchor="right",x=0.99),template = 'seaborn')
fig.show()
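For comparison, here's a quick sketch of my own (using the df_user built above) counting how many customers the traditional IQR rule would flag; on distributions this skewed, it's a lot more than the 99th-percentile cut-off we'll use below:
# Counting what the classic IQR rule (Q1 - 1.5*IQR, Q3 + 1.5*IQR) would flag, for comparison
for col in ['Frequency', 'Monetary']:
    q1, q3 = df_user[col].quantile(0.25), df_user[col].quantile(0.75)
    iqr = q3 - q1
    flagged = ((df_user[col] < q1 - 1.5 * iqr) | (df_user[col] > q3 + 1.5 * iqr)).sum()
    print(col, '- customers flagged by the IQR rule:', flagged)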
Let's see how many customers lie outside the 99th percentile and carry out the dropping. Fortunately, we won't have to drop too many (only 1-2%); otherwise, the segmentation wouldn't have much meaning.
print(df_user[(df_user['Frequency'] > df_user['Frequency'].quantile(0.99))]['CustomerID'].count(), "frequency outliers")
print(df_user[(df_user['Monetary'] > df_user['Monetary'].quantile(0.99))]['CustomerID'].count(), "monetary outliers")
print('*'*50)
df_user = df_user[(df_user['Frequency'] <= df_user['Frequency'].quantile(0.99)) &
(df_user['Monetary'] <= df_user['Monetary'].quantile(0.99))]
print(df_user.shape[0], 'customers remain after outliers were removed.')
df_user.head()
That's it! Now we're done calculating RFM values for each customer and also cleared out some annoying outliers. The next step will be to apply k-means clustering to divide them into different segments.
Now, before initializing k-means clustering on our dataset, we have to decide the value of $k$, or more specifically: into how many segments shall we divide our customers? Existing materials suggest that $k = 4$ (4 clusters) is the magic number. Just to be sure, I'll apply the elbow method in this case. For more details about this method, see this article.
Basically:
Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of k, and choose the k for which WSS first starts to diminish. In the plot of WSS-versus-k, this is visible as an elbow.
sse = {}
recency = df_user[['Recency']]
for k in range(1, 10):
    # fit k-means on the Recency column and record the inertia (WSS) for this k
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(recency)
    sse[k] = kmeans.inertia_
px.line(x = list(sse.keys()), y = list(sse.values()), template = 'seaborn',
labels = {'x': 'Number of clusters (k)','y': 'WSS'}, height = 400, width = 800)
Note that this method is only a heuristic and does not always produce an obvious answer. For our graph, $k=2$ seems more appropriate, but I'm going to set $k=4$ anyway for easier interpretation.
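As a cross-check (my own addition, not part of the original analysis), the mean silhouette score is another common heuristic for picking $k$; higher generally means better-separated clusters:
# Silhouette scores for a few candidate k values, computed on the Recency column
from sklearn.metrics import silhouette_score
for k in range(2, 7):
    labels = KMeans(n_clusters=k, max_iter=1000).fit_predict(df_user[['Recency']])
    print('k =', k, '-> silhouette score:', round(silhouette_score(df_user[['Recency']], labels), 3))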
# setting k = 4
k = 4
After doing this for the first time, I realized that the cluster labels k-means assigns are arbitrary: cluster 3 isn't necessarily 'better' than cluster 1. To put the clusters in either ascending or descending order, we need a function that reorders them logically. Thankfully, there's one just like that on the internet that I can borrow. I modified the function a bit so it takes one fewer argument than the original.
def order_cluster(target_field_name, df, ascending):
    # name of the existing cluster column, e.g. 'RecencyCluster'
    cluster_field_name = target_field_name + 'Cluster'
    # mean of the target metric within each cluster
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    # sort the clusters by that mean; their new position becomes the new label
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    # map each row's old cluster label to the new, ordered one
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index": cluster_field_name})
    return df_final
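To make the reordering concrete, here's a toy run (my own illustration, not real data) where arbitrary k-means labels get remapped so that higher numbers mean more recent customers:
# Toy illustration of order_cluster: arbitrary labels in, logically ordered labels out
toy = pd.DataFrame({'Recency':        [5, 7, 90, 95, 40, 45],
                    'RecencyCluster': [0, 0, 1, 1, 2, 2]})  # labels as k-means might assign them
print(order_cluster('Recency', toy, ascending=False))
# With ascending=False, the cluster with the highest mean Recency becomes 0,
# so the most recent customers (Recency 5 and 7) end up with the highest label.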
Now let's write the kmeans_predict function to apply k-means to our dataframe and return a dataframe with the clusters correctly ordered.
def kmeans_predict(target_field_name, df, k, ascending):
    # initializing k-means with k clusters
    kmeans = KMeans(n_clusters = k)
    # fitting k-means to the target column
    kmeans.fit(df[[target_field_name]])
    # assigning a cluster label to each customer
    cluster_field_name = target_field_name + 'Cluster'
    df[cluster_field_name] = kmeans.predict(df[[target_field_name]])
    # set the clusters to a logical order, then shift labels to start from 1
    df = order_cluster(target_field_name, df, ascending)
    df[cluster_field_name] += 1
    return df
# Applying the kmeans_predict function
df_user = kmeans_predict('Recency',df_user, k, False)
# Examining the characteristics of each cluster
df_user.groupby('RecencyCluster')['Recency'].describe()
# Applying the kmeans_predict function
df_user = kmeans_predict('Frequency',df_user,k, True)
# Examining the characteristics of each cluster
df_user.groupby('FrequencyCluster')['Frequency'].describe()
# Applying the kmeans_predict function
df_user = kmeans_predict('Monetary', df_user, k, True)
# Examining the characteristics of each cluster
df_user.groupby('MonetaryCluster')['Monetary'].describe()
With each facet (Recency, Frequency, Monetary) clustered nice and tidy, it's time for us to use those clusters to actually separate our customers into different segments. There are actually 2 ways we can do this.
We can simply add the three R, F, M cluster numbers to make one single score, which will range from 3 to 12, and then classify our customers into 3 segments: Low-Value (score below 6), Mid-Value (6-8), and High-Value (9 and above).
(These are just arbitrary thresholds that I set up. You can set them to whatever you like.)
# Calculating score
df_user['RFMScore'] = df_user['RecencyCluster'] + df_user['FrequencyCluster'] + df_user['MonetaryCluster']
# Dividing into segments
df_user['SimpleSegment'] = 'Low-Value'
df_user.loc[df_user['RFMScore'] >= 6, 'SimpleSegment'] = 'Mid-Value'
df_user.loc[df_user['RFMScore'] >= 9, 'SimpleSegment'] = 'High-Value'
df_user.head()
# Visualizing our results
print(df_user.groupby('SimpleSegment')['CustomerID'].count().reset_index(),'\n'+'*'*50)
fig = px.scatter(df_user,x = 'Recency', y='Monetary', color = 'SimpleSegment',
size='Frequency', hover_data=['CustomerID'], template = 'seaborn')
fig.update_layout(title = 'Simple Segmentation of Customers')
print("This plot is interactive! Move your mouse to interact with it. ๐")
fig.show()
Our customer base has been classified into 3 distinct categories: Low-Value, Mid-Value, and High-Value.
This method makes use of the popular RFM grid to place customers into different categories based on their specific R, F, M scores (instead of summing them all together). The category names (from 'Hibernating' up to 'Champions') are defined in the mapping below.
For more details on this approach, you can see this article here. Since this is not my domain, I'll just replicate the author's approach with my own analysis to illustrate what the results would look like.
# Concatenate the R, F, and M cluster labels to produce the segment score
df_user['RFMScore2'] = df_user['RecencyCluster'].astype('str') + df_user['FrequencyCluster'].astype('str') + df_user['MonetaryCluster'].astype('str')
# Create human friendly RFM labels
segt_map = {
r'[1-2][1-2]': 'Hibernating',
r'[1-2][2-3]': 'At risk',
r'[1-2]4': 'Can\'t lose them',
r'2[1-2]': 'About to sleep',
r'22': 'Need attention',
r'[2-3][3-4]': 'Loyal customers',
r'31': 'Promising',
r'41': 'New customers',
r'[3-4][1-2]': 'Potential loyalists',
r'4[3-4]': 'Champions'
}
df_user['DetailedSegment'] = df_user['RecencyCluster'].map(str) + df_user['FrequencyCluster'].map(str)
df_user['DetailedSegment'] = df_user['DetailedSegment'].replace(segt_map, regex=True)
df_user.head()
# Visualizing our results using a Treemap
segments = df_user.groupby('DetailedSegment')['CustomerID'].count().reset_index()
segments.columns = ['Segment','Number of Customers']
print(segments)
print('*'*50)
fig = px.treemap(segments, path=['Segment'], values='Number of Customers', template = 'seaborn')
fig.update_layout(title = 'Detailed Segmentation of Customers')
print("This plot is interactive! Move your mouse to interact with it. ๐")
fig.show()
Hmm, that doesn't look very satisfying. I suspect it has to do with the fact that the RFM grid usually divides each facet (R, F, M) into 5 clusters instead of 4. But at least we know what the final results would look like if we applied this method.
In this project, I have segmented the customer base of an online retail business in the UK based on RFM values, using the k-means clustering algorithm. Although the methods I have used are relatively simple (and have some limitations), the outcome is a useful categorization of the customers, on which practical decisions could be based. This goes to show how much data can contribute to business decision-making. Moreover, the code I've written is highly reusable (with a few modifications) and should save a bunch of time if I ever need to do a similar project.
That said, there's a lot of room for improvement since I'm not an expert in this field, so if there's anything more I can do, you can let me know.
end_time = time.process_time()
elapsed_time = end_time - start_time
print('This notebook was run in:', round(elapsed_time, 2), 'seconds')