Learn to build a predictive model to track churn with Alteryx (part 2)
In our previous article we prepared our data for use in a predictive model that estimates the risk of churn. To that end we cleansed the data and standardised the categorical and numeric variables. In this article we test a predictive model that is available in Alteryx Designer: a random forest.
Select interesting variables
First of all we use a Select tool to deselect the variables ending with the suffixes “_No”, “_No_internet_service” or “_No_phone_service”. These are redundant with their positive counterparts and thus useless. The customerID will not be used for the prediction and can also be deselected. Finally, make sure that the “Churn” target variable has a “String” type.
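Outside Alteryx, the same clean-up step can be sketched in Python with pandas. The small DataFrame below is an illustrative stand-in for the real prepared data, not the actual Telco table:

```python
import pandas as pd

# Tiny synthetic frame standing in for the prepared churn data
# (column names are illustrative; the real table has many more dummies).
df = pd.DataFrame({
    "customerID": ["0001", "0002"],
    "Churn": ["Yes", "No"],
    "OnlineSecurity_Yes": [1, 0],
    "OnlineSecurity_No": [0, 1],
    "MultipleLines_No_phone_service": [0, 1],
    "MonthlyCharges": [70.5, 20.1],
})

# Drop the ID plus every redundant negative dummy column.
redundant = [c for c in df.columns
             if c.endswith(("_No", "_No_internet_service", "_No_phone_service"))]
df = df.drop(columns=["customerID"] + redundant)

# Keep the target as a string type, as the Select tool step requires.
df["Churn"] = df["Churn"].astype(str)
```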
Set aside 50% of the dataset for training
In order to validate our model afterwards we need to train it on part of the dataset and keep the rest for validation. We will do this by using a Random Sample tool.
In the configuration we indicate that we want to select 50% of the entries. To ensure the tool always selects the same entries, we enable the “Deterministic output” option.
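For readers who want to reproduce this split in code, a deterministic 50% sample can be mimicked in pandas by fixing the random seed. This is a rough equivalent of the “Deterministic output” option, not a description of what Alteryx does internally:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})  # stand-in for the prepared data

# A fixed random_state plays the role of "Deterministic output":
# the same rows are sampled on every run.
train = df.sample(frac=0.5, random_state=1)

# Everything not sampled is kept aside for validation.
holdout = df.drop(train.index)
```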
Configuration of the Random Forest tool
In the “Predictive Toolbox” we choose the Random Forest Model tool. We select the binary variable Churn as the target variable and all other variables as predictor variables. In this case we keep the default configuration of the tool.
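As a point of comparison, here is roughly what the tool does under the hood, sketched with scikit-learn’s RandomForestClassifier on synthetic data. Alteryx’s tool actually wraps R’s randomForest package, so the defaults are not identical; this is an illustration, not the tool’s implementation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared churn table:
# X holds the predictor columns, y the binary Churn target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default configuration, analogous to keeping the tool's defaults.
model = RandomForestClassifier(random_state=0)
model.fit(X, y)
```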
Understand the results
By connecting a Browse tool to the Reports (R) anchor of the Random Forest tool we can see results of the model after we have run the workflow.
These figures illustrate the quality of the resulting model for the given data. They do not represent the quality of the model for predicting churn on another dataset. This would require a thorough process of validation on an independent dataset.
The graph “Percentage Error for Different Numbers of Trees” shows how the error varies with the number of trees for predicting loyal clients (curve “0”) and churners (curve “1”). The out-of-bag (OOB) error allows you to measure the quality of the prediction.
The OOB error uses observations that were not used to train the individual trees (remember that each tree is built from a bootstrap subset of the observations). This approach gives a good estimation of the error rate. In this case we see that the error no longer diminishes appreciably beyond roughly 100 trees, which indicates that the model is sufficiently trained.
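The OOB idea can be tried out in scikit-learn as well: setting `oob_score=True` scores each observation only with the trees that did not see it during bootstrap sampling. Again, this is a sketch on synthetic data, not the tool’s own computation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True evaluates each observation using only the trees
# that did NOT see it in their bootstrap sample.
model = RandomForestClassifier(n_estimators=200, oob_score=True,
                               random_state=0).fit(X, y)

oob_error = 1 - model.oob_score_  # the out-of-bag error rate
```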
The confusion matrix compares the outcome of the prediction (churner or loyal) in columns with the observed status in rows.
The model identifies 47% of the churners. This may seem inefficient, but on the other hand it misidentifies very few loyal customers as churners (10%). For a company it is already valuable to be able to reach about half of the potential churners without contacting too many loyal customers.
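These two percentages can be read straight off a confusion matrix. The sketch below uses toy predictions (the numbers are purely illustrative, not the article’s results) and the same orientation as the report: observed status in rows, predicted status in columns:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = loyal, 1 = churner (illustrative values only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = observed, cols = predicted
tn, fp, fn, tp = cm.ravel()

churners_found = tp / (tp + fn)   # share of churners the model identifies
loyal_flagged = fp / (fp + tn)    # loyal customers wrongly flagged as churners
```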
The variable importance plot indicates how much each predictor variable improves the trees when it is selected during the building process. It appears that the charges are an important variable for separating churners from loyal customers, whereas the payment method is not.
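The same kind of importance scores are exposed in scikit-learn as `feature_importances_`. In the synthetic sketch below, a couple of informative features dominate the rest, much like the charges stand out on the tool’s plot:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Two informative features among six, mimicking a plot where a few
# variables (like the charges) clearly dominate.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Higher values = the feature contributed more to the trees' splits.
importances = model.feature_importances_
```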