Customer Segmentation Report and Mailout Campaign Prediction

Todor Mishinev · Published in Nerd For Tech · Apr 28, 2021

Project Overview

The project is divided into two parts. The first part uses unsupervised learning techniques to identify common clusters between the general population of Germany and the company’s current customers. The goal is to extract the parts of the general population that are most likely to become customers.

The second part uses supervised learning to train a model and predict responses on data collected from a mailout campaign.

The idea is to target customers more efficiently, which lowers expenses for the company and reduces unwanted mail for the customers.

As a final step, the results from the supervised part are submitted to a Kaggle competition and evaluated against the other entrants’ models.

Data

The data for this project is real-life data provided by Arvato Financial Solutions as part of the final Udacity Data Science Nanodegree project.

  • AZDIAS.csv: General population data of Germany; 891,221 persons (rows) x 366 features (columns).
  • CUSTOMERS.csv: Data for customers of a mail-order company; 191,652 persons (rows) x 369 features (columns).
  • MAILOUT_TRAIN.csv: Data for individuals who were targets of a marketing campaign; 42,982 persons (rows) x 366 features (columns), plus one additional column with labels indicating the RESPONSE of the subjects.
  • MAILOUT_TEST.csv: Data for individuals who were targets of a marketing campaign; 42,833 persons (rows) x 366 (columns). Used later for prediction and submission to the Kaggle competition.

Two additional files with metadata are provided to support the analysis and feature definitions.

Data Preparation

All datasets contain the same feature set except CUSTOMERS, which has three additional columns: ‘PRODUCT_GROUP’, ‘CUSTOMER_GROUP’ and ‘ONLINE_PURCHASE’. For now we’ll use the common 366 features and create a unified way to clean the data and impute missing values.

Datasets shape:

azdias dimensions: (891221, 367)
customers dimensions: (191652, 367)

Azdias dataset:

Customers dataset:

The ‘Unnamed’ and ‘LNR’ columns appear to duplicate the dataset index and the person ID, so they can be dropped.

Object type variables:

CAMEO_DEU_2015                45
CAMEO_DEUG_2015               19
CAMEO_INTL_2015               43
D19_LETZTER_KAUF_BRANCHE      35
EINGEFUEGT_AM               5162
OST_WEST_KZ                    2

There are a few object-type variables that should be treated differently.

The CAMEO features contain placeholder codes (‘X’, ‘XX’) that should be removed:

[nan '8A' '4C' '2A' '6B' '8C' '4A' '2D' '1A' '1E' '9D' '5C' '8B' '7A' '5D'
'9E' '9B' '1B' '3D' '4E' '4B' '3C' '5A' '7B' '9A' '6D' '6E' '2C' '7C'
'9C' '7D' '5E' '1D' '8D' '6C' '6A' '5B' '4D' '3A' '2B' '7E' '3B' '6F'
'5F' '1C' 'XX']
[nan 8.0 4.0 2.0 6.0 1.0 9.0 5.0 7.0 3.0 '4' '3' '7' '2' '8' '9' '6' '5'
'1' 'X']
[nan 51.0 24.0 12.0 43.0 54.0 22.0 14.0 13.0 15.0 33.0 41.0 34.0 55.0 25.0
23.0 31.0 52.0 35.0 45.0 44.0 32.0 '22' '24' '41' '12' '54' '51' '44'
'35' '23' '25' '14' '34' '52' '55' '31' '32' '15' '13' '43' '33' '45'
'XX']

EINGEFUEGT_AM is a date variable and will be reduced to the year.

OST_WEST_KZ indicates whether the person is from the eastern or western part of Germany and will be converted to a binary flag.

I will drop D19_LETZTER_KAUF_BRANCHE and CAMEO_DEU_2015 for now because they would need to be encoded and are highly segmented (many distinct categories). We may later experiment to see whether label-encoding them leads to a higher score in the competition. A sketch of the whole object-column cleanup follows.
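A minimal sketch of this cleanup in pandas (the clean_object_columns helper is my own naming; the ‘O’/‘W’ codes for OST_WEST_KZ are assumed from the feature description):

import numpy as np
import pandas as pd

def clean_object_columns(df):
    """Clean the object-type columns and return a new frame."""
    df = df.copy()

    # CAMEO features: 'X'/'XX' are placeholder codes -> treat as missing,
    # then cast the mixed string/float columns to numeric.
    df['CAMEO_DEUG_2015'] = pd.to_numeric(df['CAMEO_DEUG_2015'].replace('X', np.nan))
    df['CAMEO_INTL_2015'] = pd.to_numeric(df['CAMEO_INTL_2015'].replace('XX', np.nan))

    # EINGEFUEGT_AM is a timestamp -> keep only the year.
    df['EINGEFUEGT_AM'] = pd.to_datetime(df['EINGEFUEGT_AM']).dt.year

    # OST_WEST_KZ: assumed 'O' (east) / 'W' (west) -> binary flag.
    df['OST_WEST_KZ'] = df['OST_WEST_KZ'].map({'O': 0, 'W': 1})

    # Highly segmented categorical columns dropped for now.
    return df.drop(columns=['D19_LETZTER_KAUF_BRANCHE', 'CAMEO_DEU_2015'])

azdias = clean_object_columns(azdias)
customers = clean_object_columns(customers)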

Missing values:

First, we will use the Attribute.xls file with the feature value descriptions to unify the feature-specific codes for unknown values (e.g. 0, -1, 9) and replace them with a constant value of -1.
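A sketch of that unification step. The mapping from feature name to its documented ‘unknown’ codes has to be built from the Attribute.xls layout, which isn’t shown here, so the dictionary below is purely illustrative:

# Illustrative mapping built from the Attribute.xls metadata:
# feature name -> codes documented as "unknown" for that feature.
unknown_codes = {
    'ALTERSKATEGORIE_GROB': [-1, 0],  # illustrative entry
    # ... one entry per documented feature
}

def unify_unknowns(df, unknown_codes, fill=-1):
    """Replace each feature's documented 'unknown' codes with a constant -1."""
    df = df.copy()
    for col, codes in unknown_codes.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, fill)
    return df

azdias = unify_unknowns(azdias, unknown_codes)
customers = unify_unknowns(customers, unknown_codes)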

Fig. 1 Percentage missing values per feature
Fig.2 Missing values per row histogram

Let’s take a look at the features with the most missing values:

The features with the most missing data, ALTER_KIND1–4, seem to mark the ages of the first to fourth child. We could remove those features, but they might be quite significant if the company offers products related to households with children, so I will keep all of them. (Experimenting later with the Kaggle competition showed that removing some of those features leads to a lower overall score.)

For all other features, which have fewer missing values, I will use a SimpleImputer with the ‘constant’ strategy and a fill value of -1.

Let’s take a look at how the distributions of some features differ between the Azdias and Customers datasets:

Fig. 3 Decade feature distribution

The customer segment is heavily concentrated around values 2 and 3 (people born in the ’50s and ’60s).

Fig. 4 Income feature distribution

Customers are most represented by the value 2: very high income.

Fig. 5 Vacation habits feature distribution

GFK_URLAUBERTYP describes people’s vacation habits. There is a notably high representation of value 5, which corresponds to ‘nature fans’.

Part 1: Unsupervised model

In this part we will first use a dimensionality-reduction technique called PCA to shrink the feature set while keeping most of the variance in the data. This helps improve the clustering efficiency.

Second, KMeans (with k-means++ initialization) will be trained on the general population data and used to predict the cluster segments.

I will use a scikit-learn Pipeline to perform the data preprocessing (imputing and scaling the data).

After preprocessing the data, we’ll keep 90% of the variance for the clustering (around 130 PCA components).
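A minimal sketch of that pipeline, assuming a SimpleImputer with the constant -1 strategy from above and a StandardScaler; scikit-learn’s PCA accepts a fractional n_components to keep a given share of variance:

from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocess = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value=-1)),
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.9, svd_solver='full')),  # keep 90% of the variance
])

# Fit on the general population, then apply the same transform to the customers.
azdias_pca = preprocess.fit_transform(azdias)
customers_pca = preprocess.transform(customers)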

Fig. 6 PCA components variance ratio

Before training the clustering algorithm we should identify the optimal number of clusters. We will use the so-called elbow method (sketched after the figure below):

Fig. 7 Elbow Method Curve
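A sketch of how such a curve can be produced: KMeans (k-means++ initialization is scikit-learn’s default) is fitted for a range of cluster counts and the inertia of each fit is plotted (the range of k values is illustrative):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(2, 16)  # illustrative range of cluster counts
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(azdias_pca)  # PCA-transformed general population
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()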

According to the curve, 8 clusters look like a good starting point. After fitting and predicting the clusters we get the final distribution plot:

Fig. 8 Data distribution between Azdias and Customers clusters
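A sketch of how the comparison in Fig. 8 can be computed, assuming azdias_pca and customers_pca are the PCA-transformed datasets from the pipeline above:

import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=8, init='k-means++', n_init=10, random_state=42)
kmeans.fit(azdias_pca)

# Share of each cluster within the two datasets.
azdias_share = np.bincount(kmeans.predict(azdias_pca), minlength=8) / len(azdias_pca)
customer_share = np.bincount(kmeans.predict(customers_pca), minlength=8) / len(customers_pca)

# Ratios above 1 mark clusters where customers are over-represented,
# i.e. promising targets for the campaign.
print(customer_share / azdias_share)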

Let’s take a look at the feature differences between cluster 4 (fewer customers) and cluster 0:

CAMEO_DEUG_2015: customers are more likely to be established middle class, whereas cluster 4 is dominated by the low-consumption middle class.

CAMEO_INTL_2015: Prosperous households are more likely to be customers.

ANZ_PERSONEN: households with more adults are more likely to use the company’s products.

ALTER_HH: the typical age within customer households is lower than in the underrepresented cluster.

Part 2: Supervised model

In this part we will compare a few models and select one of them to optimize and use for the final challenge. Data preparation is the same as in Part 1.

Let’s see the RESPONSE label distribution:

Fig.9 RESPONSE Distribution

The training data is highly imbalanced, so plain accuracy would be misleading. That’s why the ROC AUC metric, which is based on the true positive and false positive rates, is used for evaluation.

Let’s test four models with default parameters and plot the results:

Fig. 10 Initial test models ROC Curve
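A sketch of such a comparison using cross-validated ROC AUC. Only three of the four candidates are named in the text, so LogisticRegression is assumed as the fourth; X_train and y_train stand for the prepared MAILOUT_TRAIN features and RESPONSE labels:

from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
    'LGBMClassifier': LGBMClassifier(random_state=42),
    'XGBClassifier': XGBClassifier(random_state=42),
}

for name, model in models.items():
    # Cross-validated ROC AUC is robust to the heavy class imbalance.
    scores = cross_val_score(model, X_train, y_train, scoring='roc_auc', cv=5)
    print(f'{name}: ROC AUC = {scores.mean():.3f} +/- {scores.std():.3f}')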

It looks like the best performer is LGBMClassifier and the worst one is the RandomForestClassifier. After some submissions, the best results were actually provided by XGBClassifier, so I will continue with Bayesian optimization to tune its hyperparameters.

The model is then tuned with BayesSearchCV over a hyperparameter search space.
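A sketch of such a setup; the exact grid isn’t reproduced here, so the ranges below are purely illustrative. BayesSearchCV comes from the scikit-optimize (skopt) package, and n_iter=10 matches the ‘10 steps’ mentioned below:

from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier

# Illustrative search space -- the article's actual grid is not shown here.
search_space = {
    'n_estimators': Integer(100, 1000),
    'max_depth': Integer(2, 8),
    'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
    'subsample': Real(0.5, 1.0),
    'colsample_bytree': Real(0.5, 1.0),
}

opt = BayesSearchCV(
    XGBClassifier(random_state=42),
    search_space,
    n_iter=10,  # the '10 steps' of optimization mentioned below
    scoring='roc_auc',
    cv=5,
    random_state=42,
)
opt.fit(X_train, y_train)
print(opt.best_score_, opt.best_params_)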

We can see quite an improvement after only 10 steps of the optimization:

Fig. 11 Optimized XGB model ROC Curve

Conclusion

The unsupervised analysis provides an overall picture of the similar groups between the general population and the customers. It could still be tested with a slightly different number of clusters.

The supervised (mailout campaign) model achieved a good score (4th on the leaderboard).

Ideas for improvement of the model performance:

  • Use a different kind of data imputation, e.g. multiple imputation. It may provide better results but is quite computationally expensive.
  • Use SMOTE or other methods to oversample the minority class.
  • Use XGBoost feature importance or statistical methods like chi2 for feature selection.
  • Add some of the customer data that resembles the minority-class samples to the training data.

Link to the GitHub repository
