Learn to build a predictive model to track churn with Alteryx (part 2)

In our previous article we prepared our data so that we can use it in a predictive model meant to estimate the risk of churn. To this aim we did some data cleansing and standardisation of categorical and numeric variables. In this article we test a predictive model that is available in Alteryx Designer: a random forest.

[Figure: churn prediction workflow in Alteryx]

A random forest combines hundreds or thousands of decision trees. Each tree is built on a random subset of the observations and considers only a random subset of the features.

This approach ensures diversity among the trees and mitigates the overfitting risk inherent in a single decision tree. The final prediction of the random forest is obtained by averaging the predictions of the individual trees. The method can be compared to decision-making in a human assembly: each member (a decision tree) forms an opinion from the facts at their disposal (a subset of observations) and the aspects they choose to consider (a subset of features), and the assembly settles on the majority view.
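The aggregation step can be sketched in a few lines of Python; the function name and the 0.5 threshold are illustrative, not part of Alteryx:

```python
# Minimal sketch of how a forest combines its trees' votes for one
# customer (0 = loyal, 1 = churner). Names here are illustrative.

def forest_predict(tree_votes):
    """Average the individual tree votes and apply a 0.5 threshold."""
    avg = sum(tree_votes) / len(tree_votes)
    return 1 if avg >= 0.5 else 0

votes = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]  # ten hypothetical trees
print(forest_predict(votes))  # -> 1: most trees predict churn
```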

Select interesting variables

First of all we use a Select tool to deselect the variables ending with the suffixes “_No”, “_No_internet_service” or “_No_phone_service”: they are redundant with their positive counterparts and add no information. The customerID will not be used for the prediction and can also be deselected. Make sure that the “Churn” variable has a “String” type.
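As a quick sanity check outside Alteryx, the same filtering logic can be expressed in Python; the column list is an illustrative subset of Telco-style columns, not the full dataset:

```python
# Example columns after the dummy encoding of part 1 (illustrative subset).
columns = ["customerID", "Churn", "OnlineSecurity_Yes",
           "OnlineSecurity_No", "MultipleLines_No_phone_service"]

# Drop the ID plus every redundant negative dummy, as the Select tool does.
suffixes = ("_No", "_No_internet_service", "_No_phone_service")
kept = [c for c in columns
        if c != "customerID" and not c.endswith(suffixes)]
print(kept)  # -> ['Churn', 'OnlineSecurity_Yes']
```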

Set aside 50% of the dataset for training

In order to validate our model afterwards, we need to train it on part of the dataset and keep the rest for validation. We do this with a Random Sample tool.

In the configuration we indicate that we want to select 50% of the entries. To make sure the tool always selects the same entries, we enable the “Deterministic output” option.
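The effect of “Deterministic output” is the same as seeding a random sampler: the same seed always reproduces the same 50% sample. A stdlib-only sketch (the seed value 42 and the tiny ID list are arbitrary):

```python
import random

customer_ids = list(range(10))      # stand-in for the customer rows
rng = random.Random(42)             # fixed seed = "Deterministic output"
train = set(rng.sample(customer_ids, k=len(customer_ids) // 2))
validation = [c for c in customer_ids if c not in train]
print(len(train), len(validation))  # -> 5 5
```

Re-running the sketch with the same seed always yields the same split, which is exactly what lets us compare models on a stable validation set later.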

Configuration of the Random Forest tool

In the “Predictive Toolbox” we choose the Random Forest Model tool. We then select the binary variable churn as the target variable and all the other variables as predictors. In this case we keep the default configuration of the tool.
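As a rough, hedged equivalent of that configuration (one binary target, every remaining column as a predictor), the same setup would look like this in scikit-learn, on synthetic data rather than the actual churn table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared churn table; in the workflow the
# target is the binary "Churn" column and every other column a predictor.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X, y)
predictions = model.predict(X[:5])
print(predictions)
```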

Understand the results

By connecting a Browse tool to the Reports (R) anchor of the Random Forest tool, we can see the results of the model after running the workflow.

These figures illustrate the quality of the resulting model for the given data. They do not represent the quality of the model for predicting churn on another dataset. This would require a thorough process of validation on an independent dataset.

The graph “Percentage Error for Different Numbers of Trees” shows how the error varies with the number of trees when predicting loyal clients (curve “0”) and churners (curve “1”). The out-of-bag (OOB) error allows you to measure the quality of the prediction.

The OOB error is computed on observations that were not used to train the individual trees (remember that each tree is built on a subset of the observations). This gives a good estimate of the error rate. In this case the error no longer decreases noticeably beyond roughly 100 trees, which indicates that the model is sufficiently trained.
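The same flattening of the OOB error can be reproduced with scikit-learn's `oob_score_` attribute (synthetic data again; the tree counts are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=1)

# OOB error for growing forests; it typically stops improving well
# before the largest forest, as in the Alteryx error plot.
oob_errors = {}
for n in (25, 100, 500):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=1).fit(X, y)
    oob_errors[n] = 1 - rf.oob_score_
print({n: round(e, 3) for n, e in oob_errors.items()})
```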

The confusion matrix compares the outcome of the prediction (churner or loyal) in columns with the observed status in rows.

The model identifies 47% of the churners. This may seem inefficient, but on the other hand it misclassifies very few loyal customers as churners (10%). For a company it is already valuable to reach about half of the potential churners without contacting too many loyal customers.
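Those two percentages are just ratios read off the confusion matrix. The counts below are invented to land near the reported figures, purely to show the arithmetic:

```python
# Confusion matrix counts (illustrative, not the actual report values):
tn, fp = 2340, 260   # loyal customers: correctly kept vs. wrongly flagged
fn, tp = 495, 440    # churners: missed vs. identified

recall = tp / (tp + fn)        # share of churners the model catches
false_alarm = fp / (fp + tn)   # loyal customers flagged as churners
print(round(recall, 2), round(false_alarm, 2))  # -> 0.47 0.1
```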

The variable importance plot indicates how much each predictor variable improves the individual trees when it is selected during the building process. It appears that charges are an important variable for separating churners from loyal customers, whereas payment methods are not.
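A comparable readout is scikit-learn's `feature_importances_`; the feature names below are invented placeholders standing in for the workflow's actual columns:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           random_state=0)
names = ["TotalCharges", "MonthlyCharges", "Tenure", "PaymentMethod"]

# Rank the (placeholder) variables by their importance in the forest.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = sorted(zip(names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```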

Using a random forest is a good option for predicting churn. The different steps are easy to understand, as is the weight the model gives to each variable. It is nevertheless worth trying different types of models and comparing their performance.

Finally, even if the Random Forest method has its own error estimator (the OOB error), it is important to validate a model on a data set that was not used for training. A model can perform well on the training data yet prove unstable on other data. In the next article you will learn how to validate and save predictive models.