SPARKIFY Customer Churn Prediction

Patel Parth
Nov 8, 2020
Sparkify Customer Churn Prediction Project — Udacity

Welcome to my blog on the Sparkify Customer Churn Prediction project, the capstone project of Udacity’s Data Scientist program. Let me put down a summary of this project and my approach to it.

Project Overview

The Sparkify project is a way to learn to manipulate large, realistic datasets with Spark and to engineer relevant features for predicting customer churn, i.e. whether a customer changes the state of business with the firm. I learned how to use Spark MLlib to build machine learning models on datasets large enough that they cannot be handled with non-distributed technologies like scikit-learn.

Project Definition / Problem Statement

Sparkify: It is a fictitious music app for which two months of user behavior log data has been provided. The end goal of this project is to study customer behavior, predict user churn, and try to retain users. The technologies used in this project include Python, SQL, PySpark, Spark SQL, machine learning, and more. The main task was to predict which users will churn so that lost customers can be prevented or regained. The project also included data analysis and visualization, feature engineering, and model tuning.

Data Preprocessing
The data provided by Sparkify comes as a JSON file. The work starts by creating a Spark session and loading the heavy JSON file with it. The original dataset is 12 GB in size, but for this project a subset of around 120 MB has been used. This data consists of around 300k records, each capturing one aspect of a customer’s activity. It also requires some cleaning of the dates and their format: the log time is stored as an epoch timestamp, so we need to convert it to a normal date format.

The name of the columns and their meaning or description are mentioned below:

artist: composer of the song

auth: login status of the user

firstName: first name of the user

gender: gender of the user

itemInSession: operation sequence number within the session, ordered by time

lastName: surname of the user

length: length of the song

level: subscription level of the user (paid/free)

location: location of the user

method: HTTP method of the page request (PUT/GET)

page: the page visited in the Sparkify application

registration: timestamp of the user’s registration

sessionId: id used to identify a single login session

song: song name

status: page return code (200/307/404)

ts: timestamp of the log entry

userAgent: browser used by the user, along with its version

userId: unique id allotted to each user for identification purposes

Metrics

Identifying churned users is part of the project, but when we make predictions with ML models we need metrics to measure them and finally select the best one. In this project, we have focused on precision, recall, and F1 score. While precision helps us understand the model by identifying relevant churned customers, it is not enough on its own to give us the full picture, so we include recall and F1 score as measurement standards for training and evaluating our models.

Precision and recall originate in information retrieval. The F-score is a measure of a classification’s usefulness: it can vary from 0 to 1, where 0 indicates that the classification had no predictive power (considered useless) and 1 indicates a perfect classification. The F1 score, or F-score, is derived by combining precision and recall and weighing them with a harmonic mean.

The main reason to use the F1 score over accuracy is that in a real-life classification problem (e.g. predicting customer churn), an imbalanced class distribution exists, and thus the F1 score is a better metric to evaluate our model on. Accuracy only measures how often the model predicts correctly; it does not focus on our target segment. The F1 score, by contrast, balances recall and precision and focuses on predicting the target segment. In this project, we specifically want to focus on predicting customer churn, not on how correctly the model predicts any outcome. We could have used recall alone, but that would ignore the precision rate (which is, intuitively, the ability of the classifier not to label as positive a sample that is negative). We needed a balanced metric that weighs both situations and helps us make the decision, so we consider the F1 score to determine which classifier is better.
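To make the relationship between the three metrics concrete, here is a plain-Python sketch computed from hypothetical confusion-matrix counts (the numbers below are made up for illustration):

```python
def precision(tp, fp):
    """Fraction of predicted churners that actually churned."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual churners the model caught."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```

For example, with 40 churners caught, 10 false alarms, and 20 churners missed, precision is 0.80, recall is about 0.67, and the F1 score is about 0.73.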

Exploratory Analysis & Visualizations

This part of the project explores the data: looking at the different types of pages in the app, the different authorization levels provided by Sparkify, how many users have opted for each authorization level, the gender spread among users, and many more aspects mentioned below.

Let’s start by exploring the data using visualizations.

Gender Split in the data

Sparkify data by gender

From the visual, we can deduce that there are around 120 unique male users and roughly 100 unique female users. It looks like a fairly balanced gender spread across Sparkify’s users.

Count of Unique Users by Location (City)

The above visual represents how the users are spread across different cities. Most users are located in the two metropolitan areas of Los Angeles and New York, which is expected considering that people in such places like to use such apps. A visual showing the users by state is also attached below.

Users by usage during Weekdays and days of Month

The graph tells us that users are more likely to use the app on weekdays as part of their routine, e.g. while commuting on the subway or in their cars. On weekends, since most people like to stay home with family or loved ones, they tend to drift away from such platforms. The spread of users across days of the month can also be found below.

Count of User by Browser used and Platform of OS

(i) Number of Users by browser (ii) Number of Users by Platform

The visual shows that most users access the Sparkify app through the Chrome browser; their number is significantly greater than for any other browser.

Outlier Detection

Outliers: They are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Resisting the temptation to remove outliers inappropriately can be difficult. Outliers increase the variability in your data, which decreases statistical power; consequently, excluding outliers can cause your results to become statistically significant.
In our project, we are not sure whether the outliers are natural user behavior or abnormalities in the recorded observations. Hence, I have decided not to remove them. Also, this project uses a subset of the original data, so the results are already not going to be fully accurate, and the cost incurred in terms of processing and manpower is not worth the effort (I learned to make trade-offs like this in my Intro to Data Science class at university).

Missing Values

This chunk of data contains a lot of missing values across many different columns. There is no sure-shot way to decide which records to remove because of missing values in any given column; this is partly due to my limited knowledge of the data and of what the missing values mean, so I have decided to go ahead without removing them.

I accept that the handling of outliers and missing values plays an important role in the analysis, and I have gone ahead without addressing these issues here. I have given concrete reasoning for why and how they should be handled in all the other projects where I had a better understanding of the data.

Exploring Churn Data

The churn of a user can be defined using Cancellation Confirmation events. We identify the users who visited the page called Cancellation Confirmation, make a list of those unique userIds, and add a column indicating that the user has churned.

This visual shows the number of males and females who have churned. We can say that males have a greater tendency to churn, which can also be attributed to the fact that a bigger chunk of Sparkify users are male.

Churn data by Levels of Users

It can be observed that more users at the paid level churned.

Churn data by hour of the day

The graph says that most of the users churned around 14:00 and 20:00, which can be explained by noting that when most people are frustrated with work, even a small level of dissatisfaction can add to the frustration and lead to churn.

The plot to the left is an important step in checking our theory about the hours with the most users in a day: seeing at which hours most users do not churn, alongside the hours when most users churn, gives us a fair idea of the total number of users per hour throughout the day.

Feature Engineering

Like all other machine learning projects, we need features that we can use to get predictions for our business problem. A similar approach was taken here: the data was used to create custom features for predicting user churn. There are 6 specific features that I used to create a dataframe for prediction, listed below:
i) Number of days since registration
ii) Gender
iii) # of songs per session
iv) # of sessions
v) Frequency of use of pages
vi) # of artists the user has explored/listened to

Once this dataframe was ready to use, the project moved to modeling phase.

Modeling

In this step, just like in other ML projects, the data was split into 3 sets, i.e. train, test, and validation sets. The training set was used to train the model, the testing set to test the model, and the validation set to check whether the model simply memorized the data during training (i.e. overfit).

The business problem at hand is to identify the users that will churn, which makes it a binary classification problem. It is essential to recognize this and select models accordingly, to keep ourselves from applying the wrong machine learning algorithms. I have implemented a few binary classification algorithms, namely the Decision Tree Classifier, the Logistic Regression Classifier, and the GBT Classifier.

Decision Tree Classifier: The Decision Tree algorithm belongs to the family of supervised learning algorithms. The goal of using a decision tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior (training) data. To predict a class label for a record, we start from the root of the tree, compare the value of the root attribute with the record’s attribute, and follow the branch corresponding to that value to jump to the next node.

Sample decision tree visual representation

The reason for using this algorithm is that it is suited to classifying values, and so is our business problem: we want to predict whether a user has churned or not.

Types of Decision Trees

Types of decision trees are based on the type of target variable we have. It can be of two types:

  1. Categorical Variable Decision Tree: a decision tree with a categorical target variable.
  2. Continuous Variable Decision Tree: a decision tree with a continuous target variable.

Logistic Regression/Classifier: It is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of the target or dependent variable is dichotomous, meaning there are only two possible classes. In simple words, the dependent variable is binary in nature, with data coded as either 1 (success/yes) or 0 (failure/no). Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be used for various classification problems such as spam detection, diabetes prediction, cancer detection, etc.

Types of Logistic Regression

Generally, logistic regression means binary logistic regression with a binary target variable, but there are two more categories of target variables that it can predict. Based on the number of categories, logistic regression can be divided into the following types:

Binary or Binomial

In such a kind of classification, the dependent variable has only two possible types, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss, etc.

Multinomial

In such a kind of classification, the dependent variable can have 3 or more possible unordered types or the types having no quantitative significance. For example, these variables may represent “Type A” or “Type B” or “Type C”.

Ordinal

In such a kind of classification, the dependent variable can have 3 or more possible ordered types, or types with quantitative significance. For example, these variables may represent “poor”, “good”, “very good”, and “excellent”, and each category can have a score like 0, 1, 2, 3.

In this project, we have used binomial logistic regression, since we have two classes of output: whether a customer churns or not.

GBT Classifier: Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. (A detailed explanation of the GBT algorithm is beyond the scope of this blog)

Improvements

Grid search is needed to select the best hyperparameters, and thereby the best model and the best predictions for our business problem.

Grid Search: A technique that is used to find the optimal hyperparameters of a model which results in the most ‘accurate’ predictions.

Hyper-Parameters: In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process; by contrast, the values of other parameters (typically node weights) are learned. The hyperparameters used in this project include regParam, elasticNetParam, maxIter, maxDepth, and impurity.

Different hyperparameters used in the final model are explained below:

regParam: This is one of the parameters often used in machine learning to provide additional details to aid regularization, which is the process of adding information in order to solve an ill-posed problem or to prevent overfitting.

elasticNetParam: This parameter specifies the extent to which our regression behaves as a Lasso regression. If a linear regression model is trained with the elastic net parameter α (alpha) set to 1, it is equivalent to a Lasso model. On the other hand, if α is set to 0, the trained model reduces to a ridge regression model.

The Logistic Regression/Classifier turned out to be the best model in terms of the selected metrics. We need to keep in mind that these results depend on the features created during the feature engineering step and on the hyperparameters selected and tuned; with different implementation choices at different phases of this project, the results (including the best model) will vary. Different tuning leads to different models and hence different performance. For example, increasing maxIter leads to more iterations during the training cycle, and can also overfit the data if set to a very high value.

The detailed view of my result and how I got them are available in the jupyter notebook on Github.

Results

The selection of the best model can sometimes be tricky because we have to trade off the time a model takes to generate output against metrics like precision, recall, and F1 score. In my case, I focused on the F1 score, and the metrics can be explained as given below:

F1 Score: The F1 score is 2 × ((precision × recall) / (precision + recall)). It is also called the F-score or the F-measure. Put another way, the F1 score conveys the balance between precision and recall.

Recall Score: The ability of a model to find all the relevant cases within a dataset. The precise definition of recall is the number of true positives divided by the number of true positives plus the number of false negatives.

Precision: Precision is defined as the number of true positives divided by the number of true positives plus the number of false positives.

The two models giving the best F1 scores are the Logistic Regression models, namely lr and lrs. The lr model gave an F1 score of 0.85, but if we look closely at the other outputs we can see overfitting, so we should select the lrs model, which gives an F1 score of 0.736. This score may look low, but in a professional environment any score above 0.7 is considered a good one. For the above-mentioned reasons, I believe the Logistic Regression model fits this project best with the approach that I took.

Conclusion

This project made us use machine learning along with spark, which helped us to process a large amount of data, gain insights, and develop actions from the result in a scalable manner.

End-to-End Summary: This project starts with the business case of predicting whether a customer churns or not. It provides a big chunk of data and the flexibility of using Spark to solve the problem. The work starts with ingesting the data and cleaning it using Spark. Exploratory data analysis, on both the raw data and the generated churn labels, is carried out with many visuals to understand the data. Feature engineering was performed to build relevant features for modeling. For modeling, we use 3 types of algorithms, i.e. Decision Tree, Logistic Regression, and GBT (Gradient Boosting); all of them work as classification algorithms. One thing to notice here is that hyperparameter tuning was performed to change the nature of the underlying algorithm and the outcome of the process.

How can the project be improved further?
Feasibility and cost become very important in real applications. Such business models require the process to be carried out on a monthly or weekly basis, depending on business requirements, scale, data latency, and operations. Other important factors when carrying out such projects at large scale include operational cost, result validation, and A/B testing; metrics and KPIs should also be carefully selected.

