RFM Segmentation with K-Means Clustering

Author: Giang Son Nguyen.
Date: 30.05.2021

A. OVERVIEW

1. Dataset

In this project, I'm using the Online Retail dataset from here. This dataset contains the transactions between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. It's also a classic dataset for this type of task.

The description of each column is as follows:

  • InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
  • StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
  • Description: Product (item) name. Nominal.
  • Quantity: The quantities of each product (item) per transaction. Numeric.
  • InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
  • UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
  • CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
  • Country: Country name. Nominal. The name of the country where a customer resides.

2. Objectives

My task is to perform segmentation for the customers of this retail store based on RFM scores (Recency, Frequency, Monetary) - a very popular framework for customer segmentation. For a better understanding of the RFM method, I suggest you read further here or here.

To classify the R, F, M scores of each customer, I'll use the k-means clustering algorithm. As the name suggests, this algorithm divides the data points into different clusters, and it does so in a way that ensures the data points in the same cluster are as similar to each other as possible. Although this method has several limitations, it's simple and powerful enough to achieve our purpose in an efficient way.
For a visual representation of how this algorithm works, you can check out this video. For a deeper dive into k-means and other clustering methods, I recommend this article.
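To make that concrete, here's a tiny self-contained sketch (toy one-dimensional data I made up, not the retail dataset) of how scikit-learn's KMeans assigns cluster labels:

import numpy as np
from sklearn.cluster import KMeans

# two obviously separated groups of one-dimensional points
toy = np.array([[1.0], [2.0], [3.0], [100.0], [101.0], [102.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy)
print(labels)  # e.g. [1 1 1 0 0 0] - the small and the large values each form a cluster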

At the end, our customers will be properly segmented based on their purchase behaviors. As a business, we can use this information to devise appropriate actions, e.g. creating reward systems for high-paying customers, or offering discounts to customers who haven't purchased in a long time.

B. DATA PREPARATION

1. Importing dataset

In [1]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # silence SettingWithCopy warnings
import seaborn as sns
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

import time
start_time = time.process_time()  # start timing the notebook run
In [2]:
df_all = pd.read_csv('OnlineRetail.csv', encoding='unicode_escape')
print('Data has been loaded!')
print("This dataset has", df_all.shape[0], "rows and", df_all.shape[1], 'columns.')
df_all.head()
Data has been loaded!
This dataset has 541909 rows and 8 columns.
Out[2]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 01-12-2010 08:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 01-12-2010 08:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 01-12-2010 08:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 01-12-2010 08:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 01-12-2010 08:26 3.39 17850.0 United Kingdom

2. Data Wrangling

Selecting UK Customers:
As we know, this dataset comes from a retailer in the UK, and the majority (>90%) of transactions are from UK customers as well. For interpretation's sake, we'll only analyze the data for UK customers.

In [3]:
print("UK customers:", df_all[df_all['Country'] == 'United Kingdom']['InvoiceNo'].count())
print("Non-UK customers:", df_all[df_all['Country'] != 'United Kingdom']['InvoiceNo'].count())
print('*'*50)
df = df_all[df_all['Country'] == 'United Kingdom']
print("UK Customers selected.")
print("This dataframe now has", df.shape[0], "rows.")
df.head()
UK transactions: 495478
Non-UK transactions: 46431
**************************************************
UK Customers selected.
This dataframe now has 495478 rows.
Out[3]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 01-12-2010 08:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 01-12-2010 08:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 01-12-2010 08:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 01-12-2010 08:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 01-12-2010 08:26 3.39 17850.0 United Kingdom

Check for missing values:
We see that there are 1,454 null entries in the Description column and 133,600 missing values in the CustomerID column.

  • Since our task is to segment customers, transactions without CustomerID will sadly be useless and shall be dropped.
  • Meanwhile, we can safely ignore the missing values in Description because that column is of little significance to this task.
In [4]:
# Before
print('Before:')
print(np.sum(df.isna()))
print('*'*50)

# Remove rows with nan
df.dropna(subset = ['CustomerID'], inplace = True)

# After
print('After:')
print(np.sum(df.isna()))
Before:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     133600
Country             0
dtype: int64
**************************************************
After:
InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

Check for negative values:
Sometimes, columns that are supposed to be positive (Quantity, UnitPrice in this case) contain some negative values (which represent returned items or cancelled orders). Let's check if there's any, and filter them out to avoid interference with our process.

In [5]:
# Before
print('Before:')
print('Negative quantities:', np.sum(df['Quantity'] <= 0))
print('Negative unit prices:', np.sum(df['UnitPrice'] <= 0))
print('*' *50)

# Remove negative values
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]

# After
print('After:')
print('Negative quantities:', np.sum(df['Quantity'] <= 0))
print('Negative unit prices:', np.sum(df['UnitPrice'] <= 0))
Before:
Negative quantities: 7533
Negative unit prices: 24
**************************************************
After:
Negative quantities: 0
Negative unit prices: 0

Inspecting data types:
We can see some minor issues:

  • The InvoiceDate column is type "object". We shall change it to type "datetime" for easier calculations.
  • The CustomerID column is type "float64". Although the ID is numerical, its role is actually a label. We shall change it to type "object" to avoid confusion.
In [6]:
# Before
print(df.dtypes)
InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object
In [7]:
# Before
print('Before:')
print(df.dtypes)
print('*'*50)

# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], dayfirst = True)

# Convert CustomerID to str
df['CustomerID'] = df['CustomerID'].astype('int').astype('str')

# After:
print("After:")
print(df.dtypes)
Before:
InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object
**************************************************
After:
InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID             object
Country                object
dtype: object

That's it for now! The data is ready for calculations and analyses. Let's carry on to the next part.

C. CALCULATING RFM

In the dataset, the RFM metrics have not been calculated, so we have to do that ourselves 🧮. Before calculating each segmentation metric (R, F, M), let's create a dataframe that contains all the unique customers (CustomerID).

Later we'll add the R, F, M metrics and the respective clusters to this dataframe.

In [8]:
df_user = pd.DataFrame(df['CustomerID'].unique(), columns = ['CustomerID'])
print('The total number of customers is:', len(df_user))
df_user.head()
The total number of customers is: 3920
Out[8]:
CustomerID
0 17850
1 13047
2 13748
3 15100
4 15291

1. Recency

The Recency of each customer is the difference in days between their last purchase and the present day (or a specific ending day of our analysis). Since this set contains data up to 09/12/2011, that will be our ending day.

In [9]:
# setting the end date
end_date = df['InvoiceDate'].max()
end_date
Out[9]:
Timestamp('2011-12-09 12:49:00')
In [10]:
# finding the latest purchases by each customer
df_recency = df.groupby('CustomerID')['InvoiceDate'].max().reset_index()

# Calculating Recency
df_recency['Recency'] = (end_date - df_recency['InvoiceDate']).dt.days

# Add this score to the user dataframe
df_user = pd.merge(df_user, df_recency[['CustomerID', 'Recency']], on = 'CustomerID')
df_user.head()
Out[10]:
CustomerID Recency
0 17850 371
1 13047 31
2 13748 95
3 15100 333
4 15291 25

2. Frequency

The Frequency of each customer is the total number of orders they have made.

In [11]:
# Calculating Frequency - the number of unique invoices (InvoiceNo) per customer
df_frequency = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
df_frequency.columns = ['CustomerID','Frequency']

# Adding Frequency metric to our user df
df_user = pd.merge(df_user, df_frequency, on = 'CustomerID')
df_user.head()
Out[11]:
CustomerID Recency Frequency
0 17850 371 34
1 13047 31 10
2 13748 95 5
3 15100 333 3
4 15291 25 15

3. Monetary

The Monetary value of each customer is the total value of their orders.

In [12]:
# Calculate revenue for each transaction
df['Monetary'] = df['UnitPrice'] * df['Quantity']

# Calculate revenue for each customer
df_monetary = df.groupby('CustomerID')['Monetary'].sum().reset_index()


# Merge that to our user df
df_user = pd.merge(df_user,df_monetary, on = 'CustomerID')
df_user.head()
Out[12]:
CustomerID Recency Frequency Monetary
0 17850 371 34 5391.21
1 13047 31 10 3237.54
2 13748 95 5 948.25
3 15100 333 3 876.00
4 15291 25 15 4668.30

4. Exploratory Data Analysis

Let's explore the R, F, M values.

In [13]:
# Take a quick look at the distribution of R, F, M values
print("This plot is interactive! Move your mouse to interact with it. ๐Ÿ‘†")
px.histogram(df_user,x = 'Recency', template = 'seaborn',width=600, height=300).show()
px.histogram(df_user,x = 'Frequency', template = 'seaborn',width=600, height=300).show()
px.histogram(df_user,x = 'Monetary', template = 'seaborn',width=600, height=300).show()
This plot is interactive! Move your mouse to interact with it. 👆
In [14]:
# Using boxplot to see some outliers
print("This plot is interactive! Move your mouse to interact with it. ๐Ÿ‘†")
px.box(df_user,x = 'Recency', template = 'seaborn',width=600, height=300).show()
px.box(df_user,x = 'Frequency', template = 'seaborn',width=600, height=300).show()
px.box(df_user,x = 'Monetary', template = 'seaborn',width=600, height=300).show()
This plot is interactive! Move your mouse to interact with it. 👆

Quick observations: As you can see above, the histograms for Frequency and Monetary are highly skewed due to some major outliers 🤷‍♀️ (likely wholesalers). They can potentially interfere with the outcome of the k-means algorithm, so we need a way to deal with them.

Dealing with outliers

For real businesses, I recommend taking a closer look at these special customers who made 200 orders or paid £250k a year, or at least dealing with their numbers in a more elegant way. But for this project, I'm just going to set a reasonable threshold and drop a small number of outliers to smooth things out for the k-means results.
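One gentler option (just a sketch of an alternative, not what I do below) would be to cluster on log-transformed values, which compresses the heavy right tail without dropping anyone:

# log(1 + x) keeps small values valid and tames the tail
df_log = df_user.copy()
df_log['Frequency'] = np.log1p(df_log['Frequency'])
df_log['Monetary'] = np.log1p(df_log['Monetary'])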

While scouring the Internet I found this really handy method to determine which outliers to drop. Basically, we compute the values at specific percentiles of the distributions of Frequency and Monetary, and plot them in a line graph to see where the values spike up. This way the number of customers to be dropped will be far fewer than with the traditional definition of 'outliers' (those outside the bounds $Q1 - 1.5 \cdot IQR$ and $Q3 + 1.5 \cdot IQR$).
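For comparison, here's roughly how many customers that traditional rule would flag on Monetary alone (my own quick check, not part of the original pipeline):

# customers above the upper IQR fence of the Monetary distribution
q1, q3 = df_user['Monetary'].quantile([0.25, 0.75])
iqr = q3 - q1
print((df_user['Monetary'] > q3 + 1.5 * iqr).sum(), "customers above the upper fence")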

And voilà: we can see the obviously unusual increase after the 99th percentile.

In [15]:
# setting the percentiles
freq_percentile = df_user[['Frequency',]].describe(percentiles=[0.01,0.02,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.98,0.99])[4:]
money_percentile = df_user[['Monetary',]].describe(percentiles=[0.01,0.02,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.98,0.99])[4:]

# visualizing on line graphs
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Scatter(name = 'Frequency',x=freq_percentile.index, y=freq_percentile.iloc[:,0]), row=1, col=1)
fig.add_trace(go.Scatter(name = 'Monetary',x=money_percentile.index, y=money_percentile.iloc[:,0]), row=1, col=2)
fig.update_layout(height=440, width=980, title_text="Detecting outliers",
                  legend=dict(yanchor="top", y=1.2, xanchor="right",x=0.99),template = 'seaborn')
fig.show()

Let's see how many outliers lie outside the 99th percentile and carry out the dropping. Fortunately we won't have to drop too many customers (only 1-2%); otherwise the segmentation wouldn't have much meaning.

In [16]:
print(df_user[(df_user['Frequency'] > df_user['Frequency'].quantile(0.99))]['CustomerID'].count(), "frequency outliers")
print(df_user[(df_user['Monetary'] > df_user['Monetary'].quantile(0.99))]['CustomerID'].count(), "monetary outliers")
print('*'*50)
df_user = df_user[(df_user['Frequency'] <= df_user['Frequency'].quantile(0.99)) & 
        (df_user['Monetary'] <= df_user['Monetary'].quantile(0.99))]

print(df_user.shape[0], 'customers remain after outliers were removed.')
df_user.head()
40 frequency outliers
40 monetary outliers
**************************************************
3861 customers remain after outliers were removed.
Out[16]:
CustomerID Recency Frequency Monetary
1 13047 31 10 3237.54
2 13748 95 5 948.25
3 15100 333 3 876.00
4 15291 25 15 4668.30
5 14688 7 21 5630.87

That's it! Now we're done calculating RFM values for each customer and also cleared out some annoying outliers. The next step will be to apply k-means clustering to divide them into different segments.

D. K-MEANS CLUSTERING

1. Choosing the number of clusters

Now, before initializing k-means clustering on our dataset, we have to decide the value of $k$, or more specifically: how many segments shall we divide our customers into? Existing materials suggest that $k = 4$ (4 clusters) is the magic number. Just to be sure, I'll apply the elbow method in this case. For more details about this method, see this article.

Basically:

Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of k, and choose the k for which WSS first starts to diminish. In the plot of WSS-versus-k, this is visible as an elbow.
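In symbols (standard notation, added here for clarity): for clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$, $WSS(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$, which is exactly what scikit-learn exposes as the fitted model's inertia_ attribute.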

In [17]:
sse = {}
recency = df_user[['Recency']]
for k in range(1, 10):
    # fit k-means for each candidate k and record the inertia (i.e. the WSS)
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(recency)
    sse[k] = kmeans.inertia_
px.line(x = list(sse.keys()), y = list(sse.values()), template = 'seaborn',
       labels = {'x': 'Number of clusters (k)','y': 'WSS'}, height = 400, width = 800)

Note that this method is only a heuristic and does not always produce the most obvious outcome. For our graph, $k=2$ seems more appropriate, but I'm going to set $k=4$ anyway for easier interpretation.
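As a sanity check, one could also compare silhouette scores for a few values of $k$ (a common alternative heuristic; this sketch is my addition, not part of the original analysis):

from sklearn.metrics import silhouette_score

for k in range(2, 6):
    labels = KMeans(n_clusters=k, max_iter=1000).fit_predict(df_user[['Recency']])
    # values closer to 1 indicate better-separated clusters
    print(k, round(silhouette_score(df_user[['Recency']], labels), 3))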

In [18]:
# setting k = 4
k = 4 

2. Applying K-means clustering

After doing this for the first time, I realized 2 things:

  1. If I only apply k-means without further adjustments, the clusters won't be in a logical order (image-2.png shows one such unordered run), whereas ideally the clusters should be: 1: 0-52, 2: 53-138, 3: 139-249, 4: 250-373.
  2. I was repeating the same steps several times. So let's be DRY (Don't Repeat Yourself) instead of WET (Write Everything Twice) and write ourselves a function to take care of this task.

To put the clusters in either ascending or descending order, we need a function that orders the clusters logically. Thankfully, there's one just like that on the internet that I can borrow. I modified the function a bit so it takes one fewer argument compared to the original.

In [19]:
def order_cluster(target_field_name, df, ascending):
    # name of the (unordered) cluster column, e.g. 'RecencyCluster'
    cluster_field_name = target_field_name + 'Cluster'
    # compute the mean of the target metric within each cluster...
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    # ...then sort the clusters by that mean and use the new position as the ordered label
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    # map every customer from the old cluster label to the ordered one
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index": cluster_field_name})
    return df_final
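A quick toy run (my own example) shows what the function does: clusters are relabeled according to the mean of the target metric.

# with ascending=False, the cluster with the highest mean Recency becomes 0:
# Recency 300 -> cluster 0, 150 -> cluster 1, 10 -> cluster 2
toy = pd.DataFrame({'Recency': [300, 10, 150], 'RecencyCluster': [2, 0, 1]})
print(order_cluster('Recency', toy, False))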

Now let's write the kmeans_predict function to apply k-means clustering to our dataframe and return a dataframe with correctly ordered clusters.

In [20]:
def kmeans_predict(target_field_name, df, k, ascending):
    # initializing k-means with k clusters
    kmeans = KMeans(n_clusters = k)
    # fitting k-means to the target column (Recency, Frequency or Monetary)
    kmeans.fit(df[[target_field_name]])
    # assigning the cluster label to each customer
    cluster_field_name = target_field_name + 'Cluster'
    df[cluster_field_name] = kmeans.predict(df[[target_field_name]])
    # set the clusters to a logical order, then shift labels to start from 1
    df = order_cluster(target_field_name, df, ascending)
    df[cluster_field_name] += 1
    return df

a. Recency Cluster

In [21]:
# Applying the kmeans_predict function
df_user = kmeans_predict('Recency',df_user, k, False)

# Examining the characteristics of each cluster
df_user.groupby('RecencyCluster')['Recency'].describe()
Out[21]:
count mean std min 25% 50% 75% max
RecencyCluster
1 440.0 306.950000 39.554723 250.0 272.00 302.0 336.0 373.0
2 548.0 192.374088 31.359399 139.0 167.75 190.0 217.0 249.0
3 891.0 84.542088 23.985978 53.0 64.00 78.0 103.0 138.0
4 1982.0 20.355197 14.874673 0.0 8.00 18.0 31.0 52.0

b. Frequency Cluster

In [22]:
# Applying the kmeans_predict function
df_user = kmeans_predict('Frequency',df_user,k, True)

# Examining the characteristics of each cluster
df_user.groupby('FrequencyCluster')['Frequency'].describe()
Out[22]:
count mean std min 25% 50% 75% max
FrequencyCluster
1 2094.0 1.355778 0.478863 1.0 1.0 1.0 2.0 2.0
2 1189.0 4.058032 1.048887 3.0 3.0 4.0 5.0 6.0
3 439.0 9.054670 1.935423 7.0 7.0 9.0 11.0 13.0
4 139.0 18.604317 3.986606 14.0 15.0 18.0 21.0 28.0

c. Monetary Cluster

In [23]:
# Applying the kmeans_predict function
df_user = kmeans_predict('Monetary', df_user, k, True)

# Examining the characteristics of each cluster
df_user.groupby('MonetaryCluster')['Monetary'].describe()
Out[23]:
count mean std min 25% 50% 75% max
MonetaryCluster
1 2708.0 472.493760 312.332094 3.75 215.0725 386.29 680.6125 1253.36
2 819.0 2039.739940 585.153248 1255.00 1535.6500 1912.67 2476.6400 3432.80
3 269.0 4840.561450 1146.651840 3435.76 3844.2200 4559.15 5653.8200 7700.23
4 65.0 10729.037538 2456.751529 7792.36 8694.2600 10464.85 12245.9600 17256.85

E. SEGMENTATION

With each facet (Recency, Frequency, Monetary) clustered nice and tidy, it's time for us to use those clusters to actually separate our customers into different segments. There are actually two ways we can do this.

1. Simple Segments

We can simply add the three R, F, M cluster numbers to make one single score, which will range from 3 to 12. We can then classify our customers into 3 segments as follows:

  • 3 to 5: Low Value 😃
  • 6 to 8: Mid Value 😐
  • 9+: High Value 🤑

(These are just arbitrary values that I set. You can set these scores to whatever you like.)
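As an aside, the same binning could be written in one line with pd.cut (a variant noted for reference; the cell below uses explicit .loc assignments instead):

# bins (2, 5], (5, 8], (8, 12] correspond to Low, Mid, and High Value
df_user['SimpleSegment'] = pd.cut(df_user['RFMScore'], bins=[2, 5, 8, 12],
                                  labels=['Low-Value', 'Mid-Value', 'High-Value'])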

In [24]:
# Calculating score
df_user['RFMScore'] = df_user['RecencyCluster'] + df_user['FrequencyCluster'] + df_user['MonetaryCluster']

# Dividing into segments
df_user['SimpleSegment'] = 'Low-Value'
df_user.loc[df_user['RFMScore'] >= 6, 'SimpleSegment'] = 'Mid-Value'
df_user.loc[df_user['RFMScore'] >= 9, 'SimpleSegment'] = 'High-Value'

df_user.head()
Out[24]:
CustomerID Recency Frequency Monetary RecencyCluster FrequencyCluster MonetaryCluster RFMScore SimpleSegment
0 13047 31 10 3237.54 4 3 2 9 High-Value
1 17924 11 7 2962.50 4 3 2 9 High-Value
2 16218 29 7 3084.68 4 3 2 9 High-Value
3 13758 11 7 3190.55 4 3 2 9 High-Value
4 18144 7 12 2888.75 4 3 2 9 High-Value
In [25]:
# Visualizing our results
print(df_user.groupby('SimpleSegment')['CustomerID'].count().reset_index(),'\n'+'*'*50)
fig = px.scatter(df_user,x = 'Recency', y='Monetary', color = 'SimpleSegment',
                 size='Frequency', hover_data=['CustomerID'], template = 'seaborn')
fig.update_layout(title = 'Simple Segmentation of Customers')
print("This plot is interactive! Move your mouse to interact with it. ๐Ÿ‘†")
fig.show()
  SimpleSegment  CustomerID
0    High-Value         559
1     Low-Value        1456
2     Mid-Value        1846 
**************************************************
This plot is interactive! Move your mouse to interact with it. ๐Ÿ‘†

Our customer base has been classified into 3 distinct categories:

  • High-value customers are located on the left side (bought recently), toward the top (paid large amounts of money), and are depicted by large bubbles (made a lot of purchases).
  • Low-value customers are far to the right (haven't purchased anything for a while), toward the bottom (low monetary values), and have small bubbles (rarely shopped).
  • Mid-value customers are the ones in-between.

2. Detailed Segments

This method makes use of the popular RFM grid to assign customers to different categories based on their specific R, F, M scores (instead of summing them all together). The categories are reproduced in the segt_map dictionary below (the original grid graphic is omitted here).

For more details on this approach, you can see this article. Since this is not my domain, I'll just replicate the author's approach with my own analysis to illustrate what the results would look like.

In [26]:
# Concatenate R, F and M to produce the Segment
df_user['RFMScore2'] = df_user['RecencyCluster'].astype('str') + df_user['FrequencyCluster'].astype('str') + df_user['MonetaryCluster'].astype('str')

# Create human friendly RFM labels
segt_map = {
    r'[1-2][1-2]': 'Hibernating',
    r'[1-2][2-3]': 'At risk',
    r'[1-2]4': 'Can\'t lose them',
    r'2[1-2]': 'About to sleep',
    r'22': 'Need attention',
    r'[2-3][3-4]': 'Loyal customers',
    r'31': 'Promising',
    r'41': 'New customers',
    r'[3-4][1-2]': 'Potential loyalists',
    r'4[3-4]': 'Champions'
}
df_user['DetailedSegment'] = df_user['RecencyCluster'].map(str) + df_user['FrequencyCluster'].map(str)
df_user['DetailedSegment'] = df_user['DetailedSegment'].replace(segt_map, regex=True)

df_user.head()
Out[26]:
CustomerID Recency Frequency Monetary RecencyCluster FrequencyCluster MonetaryCluster RFMScore SimpleSegment RFMScore2 DetailedSegment
0 13047 31 10 3237.54 4 3 2 9 High-Value 432 Champions
1 17924 11 7 2962.50 4 3 2 9 High-Value 432 Champions
2 16218 29 7 3084.68 4 3 2 9 High-Value 432 Champions
3 13758 11 7 3190.55 4 3 2 9 High-Value 432 Champions
4 18144 7 12 2888.75 4 3 2 9 High-Value 432 Champions
In [27]:
# Visualizing our results using a Treemap
segments = df_user.groupby('DetailedSegment')['CustomerID'].count().reset_index()
segments.columns = ['Segment','Number of Customers']
print(segments)
print('*'*50)
fig = px.treemap(segments, path=['Segment'], values='Number of Customers', template = 'seaborn')
fig.update_layout(title = 'Detailed Segmentation of Customers')
print("This plot is interactive! Move your mouse to interact with it. ๐Ÿ‘†")
fig.show()
               Segment  Number of Customers
0              At risk                   10
1            Champions                  517
2          Hibernating                  978
3      Loyal customers                   51
4        New customers                  700
5  Potential loyalists                 1071
6            Promising                  534
**************************************************
This plot is interactive! Move your mouse to interact with it. ๐Ÿ‘†

Hmm, that doesn't look very satisfying 🤨. I suspect it has to do with the fact that the RFM grid usually divides each facet R, F, M into 5 clusters instead of 4. But at least we know what the final results would look like if we apply this method.
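For reference, the classic grid scores each facet by quintiles, e.g. with pd.qcut (a sketch of that standard approach, not something run in this notebook):

# hypothetical quintile scores from 1 to 5; rank() breaks the many ties in Frequency
r_score = pd.qcut(df_user['Recency'], 5, labels=[5, 4, 3, 2, 1])   # more recent -> higher score
f_score = pd.qcut(df_user['Frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
m_score = pd.qcut(df_user['Monetary'], 5, labels=[1, 2, 3, 4, 5])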

F. CONCLUSION

In this project, I have segmented the customer base of an online retail business in the UK based on RFM values, using the k-means clustering algorithm. Although the methods I have used are relatively simple (and have some limitations), the outcome is a useful categorization of the customers, based on which practical decisions can be made. This goes to show how much data can contribute to business decision-making. Moreover, the code I've written is highly reusable (with a few modifications) and should save a bunch of time if I ever need to do a similar project.

That said, there's a lot of room for improvement since I'm not an expert in this field, so if there's anything more I can do, you can let me know 😉.

In [28]:
end_time = time.process_time()
elapsed_time = end_time - start_time
print('This notebook was run in: ', round(elapsed_time,2), ' seconds')
This notebook was run in:  19.31  seconds