Author: Giang Son Nguyen.
Date: 30.05.2021
In this project, I'm using the Online Retail dataset from here. This dataset contains the transactions between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. It's also a classic dataset for this type of task.
The description of each column is as below:
- InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
- UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
- CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.

My task is to segment the customers of this retail store based on their RFM scores (Recency, Frequency, Monetary), a very popular framework for customer segmentation. For example, a customer whose last purchase was 10 days ago and who has placed 5 orders totalling £200 has R = 10 days, F = 5, and M = £200. For a better understanding of the RFM method, I suggest you read further here or here.
To cluster the R, F, and M scores of each customer, I'll use the k-means clustering algorithm. As the name suggests, this algorithm divides the data points into k clusters, and it does so in a way that ensures the data points within the same cluster are as similar to each other as possible. Although this method has several limitations, it's simple and powerful enough to achieve our purpose efficiently.
For a visual representation of how this algorithm works, you can check out this video. For a deeper dive into k-means and other clustering methods, I recommend this article.
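To see the mechanics in miniature, here's a quick sketch (my own toy example, not part of this project's data) that clusters six made-up 1-D points into two groups:
# A minimal k-means sketch on made-up 1-D data with two well-separated groups
import numpy as np
from sklearn.cluster import KMeans

toy = np.array([[1], [2], [3], [50], [51], [52]])
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy)
print(km.labels_)           # two groups, e.g. [0 0 0 1 1 1] - the label numbers themselves are arbitrary
print(km.cluster_centers_)  # the cluster means: 2.0 and 51.0 (in some order)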
At the end, our customers will be neatly segmented based on their purchase behavior. As a business, we can use this information to devise appropriate actions, e.g. creating reward systems for high-paying customers, or offering discounts to customers who haven't purchased in a long time.
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import seaborn as sns
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import time
start_time = time.process_time()
df_all = pd.read_csv('OnlineRetail.csv',encoding= 'unicode_escape')
print('Data has been loaded!')
print("This dataset has", df_all.shape[0], "rows and", df_all.shape[1], 'columns.')
df_all.head()
Selecting UK Customers:
As we know, this dataset comes from a retailer in the UK, and the majority (>90%) of its customers are from the UK as well. For interpretation's sake, we'll only analyze the data for UK customers.
print("UK customers:", df_all[df_all['Country'] == 'United Kingdom']['InvoiceNo'].count())
print("Non-UK customers:", df_all[df_all['Country'] != 'United Kingdom']['InvoiceNo'].count())
print('*'*50)
df = df_all[df_all['Country'] == 'United Kingdom']
print("UK Customers selected.")
print("This dataframe now has", df.shape[0], "rows.")
df.head()
Check for missing values:
We see that there are 1454 null entries in the Description column and 133600 missing values in the CustomerID column.
- CustomerID: rows missing this value will sadly be useless and shall be dropped.
- Description: we'll leave its missing values alone, because the column is of little significance for our task.
# Before
print('Before:')
print(np.sum(df.isna()))
print('*'*50)
# Remove rows with nan
df.dropna(subset = ['CustomerID'], inplace = True)
# After
print('After:')
print(np.sum(df.isna()))
Check for negative values:
Sometimes, columns that are supposed to be positive (Quantity and UnitPrice in this case) contain negative or zero values (which represent returned items or cancelled orders). Let's check if there are any, and filter them out to avoid interference with our process.
# Before
print('Before:')
print('Non-positive quantities:', np.sum(df['Quantity'] <= 0))
print('Non-positive unit prices:', np.sum(df['UnitPrice'] <= 0))
print('*' *50)
# Remove negative values
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
# After
print('After:')
print('Non-positive quantities:', np.sum(df['Quantity'] <= 0))
print('Non-positive unit prices:', np.sum(df['UnitPrice'] <= 0))
Inspecting data types:
We can see some minor issues:
- InvoiceDate is type 'object'. We shall change it to type 'datetime' for easier calculation.
- CustomerID is type 'float64'. Although the ID is numerical, its role is actually a label. We shall change it to type 'object' (a string) to avoid confusion.
# Before
print('Before:')
print(df.dtypes)
print('*'*50)
# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], dayfirst = True)
# Convert CustomerID to str
df['CustomerID'] = df['CustomerID'].astype('int').astype('str')
# After:
print("After:")
print(df.dtypes)
That's it for now! The data is ready for calculations and analyses. Let's carry on to the next part.
In the dataset, the RFM metrics have not been calculated, so we have to do that ourselves. Before we move on to calculating each segmentation metric (R, F, M), let's create a dataframe that contains all the unique customers (CustomerID).
Later we'll add the R, F, M metrics and the respective clusters to this dataframe.
df_user = pd.DataFrame(df['CustomerID'].unique(), columns = ['CustomerID'])
print('The total number of customers is:', len(df_user))
df_user.head()
The Recency for each customer is the difference in days between their last purchase and the present day (or a specific ending day of our analysis). Since this set contains data up to 09/12/2011, that will be our ending day.
# setting the end date
end_date = df['InvoiceDate'].max()
end_date
# finding the latest purchases by each customer
df_recency = df.groupby('CustomerID')['InvoiceDate'].max().reset_index()
# Calculating Recency
df_recency['Recency'] = (end_date - df_recency['InvoiceDate']).dt.days
# Add this score to the user dataframe
df_user = pd.merge(df_user, df_recency[['CustomerID', 'Recency']], on = 'CustomerID')
df_user.head()
The Frequency for each customer is the total number of orders they have made.
# Calculating Frequency - the number of unique InvoiceNo values per customer
df_frequency = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
df_frequency.columns = ['CustomerID','Frequency']
# Adding Frequency metric to our user df
df_user = pd.merge(df_user, df_frequency, on = 'CustomerID')
df_user.head()
The Monetary value for each customer is the total value of their orders.
# Calculate revenue for each transaction
df['Monetary'] = df['UnitPrice'] * df['Quantity']
# Calculate revenue for each customer
df_monetary = df.groupby('CustomerID')['Monetary'].sum().reset_index()
# Merge that to our user df
df_user = pd.merge(df_user,df_monetary, on = 'CustomerID')
df_user.head()
Let's explore the R, F, M values.
# Take a quick look at the distribution of R, F, M values
print("This plot is interactive! Move your mouse to interact with it. ๐")
px.histogram(df_user,x = 'Recency', template = 'seaborn',width=600, height=300).show()
px.histogram(df_user,x = 'Frequency', template = 'seaborn',width=600, height=300).show()
px.histogram(df_user,x = 'Monetary', template = 'seaborn',width=600, height=300).show()
# Using boxplot to see some outliers
print("This plot is interactive! Move your mouse to interact with it. ๐")
px.box(df_user,x = 'Recency', template = 'seaborn',width=600, height=300).show()
px.box(df_user,x = 'Frequency', template = 'seaborn',width=600, height=300).show()
px.box(df_user,x = 'Monetary', template = 'seaborn',width=600, height=300).show()
Quick observations: As you can see above, the histograms for Frequency and Monetary are highly skewed due to some major outliers (likely wholesalers). They can potentially interfere with the outcome of the k-means algorithm, so we need a way to deal with them.
For real businesses, I'd recommend taking a closer look at these special customers who made 200 orders or paid £250k a year, or at least handling their numbers in a more elegant way. But for this project, I'm just going to set a reasonable threshold and drop a small number of outliers to smooth things out for the k-means results.
While scouring the Internet, I found a really handy method to determine which outliers to drop. Basically, we compute the values at specific percentiles of the distributions of Frequency and Monetary, and plot them in a line graph to see where the values spike up. This way, the number of customers to be dropped will be far fewer than with the traditional definition of 'outliers' (those outside the bounds $Q1 - 1.5\,\mathrm{IQR}$ and $Q3 + 1.5\,\mathrm{IQR}$).
And voila: we can see the obviously unusual increase after the 99th percentile.
# setting the percentiles ([4:] drops the count/mean/std/min rows, keeping the percentiles and max)
freq_percentile = df_user[['Frequency',]].describe(percentiles=[0.01,0.02,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.98,0.99])[4:]
money_percentile = df_user[['Monetary',]].describe(percentiles=[0.01,0.02,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.98,0.99])[4:]
# visualizing on line graphs
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Scatter(name = 'Frequency',x=freq_percentile.index, y=freq_percentile.iloc[:,0]), row=1, col=1)
fig.add_trace(go.Scatter(name = 'Monetary',x=money_percentile.index, y=money_percentile.iloc[:,0]), row=1, col=2)
fig.update_layout(height=440, width=980, title_text="Detecting outliers",
legend=dict(yanchor="top", y=1.2, xanchor="right",x=0.99),template = 'seaborn')
fig.show()
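For comparison, here's a quick sketch of my own (using the df_user built above) counting how many customers the traditional IQR rule would flag; on distributions this skewed, it's a lot more than the 99th-percentile cut-off we'll use below:
# Counting what the classic IQR rule (Q1 - 1.5*IQR, Q3 + 1.5*IQR) would flag, for comparison
for col in ['Frequency', 'Monetary']:
    q1, q3 = df_user[col].quantile(0.25), df_user[col].quantile(0.75)
    iqr = q3 - q1
    flagged = ((df_user[col] < q1 - 1.5 * iqr) | (df_user[col] > q3 + 1.5 * iqr)).sum()
    print(col, '- customers flagged by the IQR rule:', flagged)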
Let's see how many customers lie outside the 99th percentile and carry out the dropping. Fortunately, we won't have to drop too many (only 1-2%); otherwise, the segmentation wouldn't have much meaning.
print(df_user[(df_user['Frequency'] > df_user['Frequency'].quantile(0.99))]['CustomerID'].count(), "frequency outliers")
print(df_user[(df_user['Monetary'] > df_user['Monetary'].quantile(0.99))]['CustomerID'].count(), "monetary outliers")
print('*'*50)
df_user = df_user[(df_user['Frequency'] <= df_user['Frequency'].quantile(0.99)) &
(df_user['Monetary'] <= df_user['Monetary'].quantile(0.99))]
print(df_user.shape[0], 'customers remain after outliers were removed.')
df_user.head()
That's it! Now we're done calculating RFM values for each customer and also cleared out some annoying outliers. The next step will be to apply k-means clustering to divide them into different segments.
Now, before initializing k-means clustering on our dataset, we have to decide the value of $k$, or more specifically: into how many segments shall we divide our customers? Existing materials suggest that $k = 4$ (4 clusters) is the magic number. Just to be sure, I'll apply the elbow method in this case. For more details about this method, see this article.
Basically:
Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of k, and choose the k for which WSS first starts to diminish. In the plot of WSS-versus-k, this is visible as an elbow.
sse = {}
recency = df_user[['Recency']]
for k in range(1, 10):
    # fit k-means on the Recency column and record the inertia (WSS) for this k
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(recency)
    sse[k] = kmeans.inertia_
px.line(x = list(sse.keys()), y = list(sse.values()), template = 'seaborn',
labels = {'x': 'Number of clusters (k)','y': 'WSS'}, height = 400, width = 800)
Note that this method is only a heuristic and does not always produce an obvious answer. For our graph, $k=2$ seems more appropriate, but I'm going to set $k=4$ anyway for easier interpretation.
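As a cross-check (my own addition, not part of the original analysis), the mean silhouette score is another common heuristic for picking $k$; higher generally means better-separated clusters:
# Silhouette scores for a few candidate k values, computed on the Recency column
from sklearn.metrics import silhouette_score
for k in range(2, 7):
    labels = KMeans(n_clusters=k, max_iter=1000).fit_predict(df_user[['Recency']])
    print('k =', k, '-> silhouette score:', round(silhouette_score(df_user[['Recency']], labels), 3))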
# setting k = 4
k = 4
After doing this for the first time, I realized that the cluster labels k-means assigns are arbitrary: cluster 3 isn't necessarily 'better' than cluster 1. To put the clusters in either ascending or descending order, we need a function that reorders them logically. Thankfully, there's one just like that on the internet that I can borrow. I modified the function a bit so it takes one fewer argument than the original.
def order_cluster(target_field_name, df, ascending):
    # name of the existing cluster column, e.g. 'RecencyCluster'
    cluster_field_name = target_field_name + 'Cluster'
    # mean of the target metric within each cluster
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    # sort the clusters by that mean; their new position becomes the new label
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    # map each row's old cluster label to the new, ordered one
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index": cluster_field_name})
    return df_final
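To make the reordering concrete, here's a toy run (my own illustration, not real data) where arbitrary k-means labels get remapped so that higher numbers mean more recent customers:
# Toy illustration of order_cluster: arbitrary labels in, logically ordered labels out
toy = pd.DataFrame({'Recency':        [5, 7, 90, 95, 40, 45],
                    'RecencyCluster': [0, 0, 1, 1, 2, 2]})  # labels as k-means might assign them
print(order_cluster('Recency', toy, ascending=False))
# With ascending=False, the cluster with the highest mean Recency becomes 0,
# so the most recent customers (Recency 5 and 7) end up with the highest label.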
Now let's write the kmeans_predict function to apply k-means to our dataframe and return a dataframe with the clusters correctly ordered.
def kmeans_predict(target_field_name, df, k, ascending):
    # initializing k-means with k clusters
    kmeans = KMeans(n_clusters = k)
    # fitting k-means to the target column
    kmeans.fit(df[[target_field_name]])
    # assigning a cluster label to each customer
    cluster_field_name = target_field_name + 'Cluster'
    df[cluster_field_name] = kmeans.predict(df[[target_field_name]])
    # set the clusters to a logical order, then shift labels to start from 1
    df = order_cluster(target_field_name, df, ascending)
    df[cluster_field_name] += 1
    return df
# Applying the kmeans_predict function
df_user = kmeans_predict('Recency',df_user, k, False)
# Examining the characteristics of each cluster
df_user.groupby('RecencyCluster')['Recency'].describe()
# Applying the kmeans_predict function
df_user = kmeans_predict('Frequency',df_user,k, True)
# Examining the characteristics of each cluster
df_user.groupby('FrequencyCluster')['Frequency'].describe()
# Applying the kmeans_predict function
df_user = kmeans_predict('Monetary', df_user, k, True)
# Examining the characteristics of each cluster
df_user.groupby('MonetaryCluster')['Monetary'].describe()
With each facet (Recency, Frequency, Monetary) clustered nice and tidy, it's time for us to use those clusters to actually separate our customers into different segments. There are actually 2 ways we can do this.
We can simply add the three R, F, M cluster numbers to make one single score, which will range from 3 to 12, and then classify our customers into 3 segments: Low-Value (score below 6), Mid-Value (6-8), and High-Value (9 and above).
(These are just arbitrary thresholds that I set up. You can set them to whatever you like.)
# Calculating score
df_user['RFMScore'] = df_user['RecencyCluster'] + df_user['FrequencyCluster'] + df_user['MonetaryCluster']
# Dividing into segments
df_user['SimpleSegment'] = 'Low-Value'
df_user.loc[df_user['RFMScore'] >= 6, 'SimpleSegment'] = 'Mid-Value'
df_user.loc[df_user['RFMScore'] >= 9, 'SimpleSegment'] = 'High-Value'
df_user.head()
# Visualizing our results
print(df_user.groupby('SimpleSegment')['CustomerID'].count().reset_index(),'\n'+'*'*50)
fig = px.scatter(df_user,x = 'Recency', y='Monetary', color = 'SimpleSegment',
size='Frequency', hover_data=['CustomerID'], template = 'seaborn')
fig.update_layout(title = 'Simple Segmentation of Customers')
print("This plot is interactive! Move your mouse to interact with it. ๐")
fig.show()
Our customer base has been classified into 3 distinct categories: Low-Value, Mid-Value, and High-Value.
This method makes use of the popular RFM grid to place customers into different categories based on their specific R, F, M scores (instead of summing them all together). The category names (from 'Hibernating' up to 'Champions') are defined in the mapping below.
For more details on this approach, you can see this article here. Since this is not my domain, I'll just replicate the author's approach with my own analysis to illustrate what the results would look like.
# Concatenate the R, F, and M cluster labels to produce the segment score
df_user['RFMScore2'] = df_user['RecencyCluster'].astype('str') + df_user['FrequencyCluster'].astype('str') + df_user['MonetaryCluster'].astype('str')
# Create human friendly RFM labels
segt_map = {
r'[1-2][1-2]': 'Hibernating',
r'[1-2][2-3]': 'At risk',
r'[1-2]4': 'Can\'t lose them',
r'2[1-2]': 'About to sleep',
r'22': 'Need attention',
r'[2-3][3-4]': 'Loyal customers',
r'31': 'Promising',
r'41': 'New customers',
r'[3-4][1-2]': 'Potential loyalists',
r'4[3-4]': 'Champions'
}
df_user['DetailedSegment'] = df_user['RecencyCluster'].map(str) + df_user['FrequencyCluster'].map(str)
df_user['DetailedSegment'] = df_user['DetailedSegment'].replace(segt_map, regex=True)
df_user.head()
# Visualizing our results using a Treemap
segments = df_user.groupby('DetailedSegment')['CustomerID'].count().reset_index()
segments.columns = ['Segment','Number of Customers']
print(segments)
print('*'*50)
fig = px.treemap(segments, path=['Segment'], values='Number of Customers', template = 'seaborn')
fig.update_layout(title = 'Detailed Segmentation of Customers')
print("This plot is interactive! Move your mouse to interact with it. ๐")
fig.show()
Hmm, that doesn't look very satisfying. I suspect it has to do with the fact that the RFM grid usually divides each facet (R, F, M) into 5 clusters instead of 4. But at least we know what the final results would look like if we applied this method.
In this project, I have segmented the customer base of an online retail business in the UK based on RFM values, using the k-means clustering algorithm. Although the methods I have used are relatively simple (and have some limitations), the outcome is a useful categorization of the customers, on which practical decisions could be based. This goes to show how much data can contribute to business decision-making. Moreover, the code I've written is highly reusable (with a few modifications) and should save a bunch of time if I ever need to do a similar project.
That said, there's a lot of room for improvement since I'm not an expert in this field, so if there's anything more I can do, you can let me know.
end_time = time.process_time()
elapsed_time = end_time - start_time
print('This notebook was run in:', round(elapsed_time, 2), 'seconds')