Learn how to estimate the risk of churn in a customer database by using the power of Alteryx

This article is the first in a new series. Over this article and the two that follow, we are going to develop a predictive model in Alteryx. The model will estimate the risk of churn for telco customers and can be used to set up a retention policy.

If you download the dataset available on Kaggle, you will be able to follow along and reproduce each step. The dataset contains a set of customers, each labelled as loyal or churner, together with variables that could explain this behaviour. We will predict the churn status using these variables. In this article we prepare the data so that we can develop a predictive model in the next edition; the third edition will address model validation.

Data cleansing

We start with the downloaded file. Because it is in CSV format, all fields are “variable length string” (V_String) types. We use the Auto Field tool to assign each column the correct data type and the smallest possible field size, which can significantly speed up workflows on large data sets. Two manual changes are needed. The field SeniorCitizen contains 0 and 1, so we convert it to a string type; this way we will be able to treat it as a binary categorical variable.

The field TotalCharges was automatically set to V_String because it contains blank cells. We set its type to Double and replace blank cells with 0 using a Data Cleansing tool.
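For readers who want to follow along outside Alteryx, the same two manual fixes can be sketched in pandas. The three-row DataFrame below is made-up sample data; the column names come from the Kaggle telco dataset.

```python
import pandas as pd

# Made-up three-row sample with the relevant columns from the telco dataset.
df = pd.DataFrame({
    "customerID": ["0001", "0002", "0003"],
    "SeniorCitizen": [0, 1, 0],
    "TotalCharges": ["29.85", " ", "108.15"],  # one blank cell, as in the CSV
})

# Treat SeniorCitizen as a binary categorical variable, not a number.
df["SeniorCitizen"] = df["SeniorCitizen"].astype(str)

# TotalCharges: blanks become NaN under coercion; replace them with 0.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0.0)
```

After this, TotalCharges is a numeric column and SeniorCitizen can be handled alongside the other categorical fields.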

We separate predictor variables in two groups: continuous numeric variables and categorical variables. Each kind of variable will be prepared in a specific way.

Prepare numeric variables

We first select all numeric fields using a Dynamic Select tool with all numeric types checked. We also want to keep the customerID field, so that in a later step we can join the flow of numeric variables with the flows of categorical variables on customerID. Therefore, we use a Select tool to extract the customerID field and join it by position to the output of the Dynamic Select tool.
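A minimal pandas sketch of this selection step, on made-up sample data: keep all numeric columns, then put customerID back alongside them for the later joins.

```python
import pandas as pd

# Made-up sample frame mixing numeric and string columns.
df = pd.DataFrame({
    "customerID": ["0001", "0002"],
    "tenure": [1, 24],
    "MonthlyCharges": [29.85, 56.95],
    "Contract": ["Month-to-month", "Two year"],
})

# Dynamic Select equivalent: keep all numeric fields.
numeric = df.select_dtypes(include="number").copy()

# Re-attach customerID in first position (the "join by position" step).
numeric.insert(0, "customerID", df["customerID"])
```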

Some modelling approaches require standardised numeric predictors. Since we do not yet know which approach we will use, we standardise all numeric fields. To do this we transpose the table into long format, then group by field name, calculate the mean and the standard deviation, and join them back to each predictor variable. This lets us apply the standardisation formula z = (x − mean) / standard deviation. Finally, we use the Cross Tab tool with Name as the new column headers and Value as the values, and we get a table with one column per predictor variable. That completes the preparation of the numeric variables; we can now focus on the categorical ones.
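The transpose → group-by → join → cross-tab pipeline can be mirrored in pandas. The sketch below uses a small made-up numeric frame; the column names Name and Value match the Alteryx Transpose output.

```python
import pandas as pd

# Made-up numeric data for four customers.
df = pd.DataFrame({
    "customerID": ["0001", "0002", "0003", "0004"],
    "tenure": [1.0, 3.0, 5.0, 7.0],
    "MonthlyCharges": [20.0, 40.0, 60.0, 80.0],
})

# Transpose to long format: one row per customer and per variable.
long = df.melt(id_vars="customerID", var_name="Name", value_name="Value")

# Per-variable mean and standard deviation, joined back on the field name.
stats = long.groupby("Name")["Value"].agg(["mean", "std"]).reset_index()
long = long.merge(stats, on="Name")

# The standardisation formula: z = (x - mean) / std.
long["Value"] = (long["Value"] - long["mean"]) / long["std"]

# Cross-tab back to wide format: one column per predictor variable.
wide = long.pivot(index="customerID", columns="Name", values="Value").reset_index()
```

Each standardised column now has mean 0 and standard deviation 1.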

Prepare categorical variables

We use a Dynamic Select tool again, but this time with the four different string types checked. We then use a Transpose tool with customerID as the key field and all other fields, including Churn, as data fields. This puts the table in long format: one row per customer and per variable. We rename the field “Value” to “Key”. A Summarize tool then counts the number of distinct modalities for each variable, which lets us distinguish binary variables (exactly two modalities) from other categorical variables (three or more modalities). In both cases we use a Join tool on the long-format data to select the relevant variables.
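The modality count that splits binary from multi-category variables can be sketched as follows, on a made-up long-format sample with two variables (Partner and Contract):

```python
import pandas as pd

# Made-up long-format sample: one row per customer and per variable.
long = pd.DataFrame({
    "customerID": ["0001", "0001", "0002", "0002", "0003", "0003"],
    "Name": ["Partner", "Contract"] * 3,
    "Key":  ["Yes", "Month-to-month", "No", "Two year", "Yes", "One year"],
})

# Summarize equivalent: number of distinct modalities per variable.
counts = long.groupby("Name")["Key"].nunique()

# Binary variables have exactly two modalities; the rest have three or more.
binary_vars = counts[counts == 2].index.tolist()
multi_vars = counts[counts >= 3].index.tolist()
```

Here Partner (Yes/No) lands in the binary group and Contract (three contract types) in the multi-category group; a join on Name then routes each row of the long table to the right branch.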

For binary variables, we replace the two modalities so that the only values are 0 and 1, which lets us convert the field to the (numeric) Byte type.
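A sketch of this recoding in pandas, on made-up sample rows. The exact mapping (Yes/No, Male/Female) is an assumption about which modalities appear in the binary fields; the article only specifies that the result must be 0 and 1.

```python
import pandas as pd

# Made-up binary rows in long format.
binary_long = pd.DataFrame({
    "customerID": ["0001", "0002", "0003"],
    "Name": ["Partner"] * 3,
    "Key":  ["Yes", "No", "Yes"],
})

# Assumed replacements: map each binary modality to 0 or 1,
# then store the result in a small integer type (Byte equivalent).
mapping = {"Yes": 1, "No": 0, "Male": 1, "Female": 0}
binary_long["Key"] = binary_long["Key"].map(mapping).astype("int8")
```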

For non-binary categorical variables, we want to create one column per category, filled with 1 if the customer belongs to this category and with 0 otherwise. To do this, we use a Formula tool to add the category (field “Key”) as a suffix to the name of the variable (“Name”). Then we create a dummy variable Value = 1. We put the table back in wide format with a Cross Tab and replace NULLs by 0 with a Data Cleansing tool.
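The suffix → dummy → cross-tab → fill-with-0 sequence can be sketched like this, on a made-up Contract sample:

```python
import pandas as pd

# Made-up multi-category rows in long format.
multi_long = pd.DataFrame({
    "customerID": ["0001", "0002", "0003"],
    "Name": ["Contract"] * 3,
    "Key":  ["Month-to-month", "Two year", "Month-to-month"],
})

# Formula tool equivalent: append the category as a suffix to the variable
# name, then create a dummy variable Value = 1.
multi_long["Name"] = multi_long["Name"] + "_" + multi_long["Key"]
multi_long["Value"] = 1

# Cross Tab back to wide format, then replace NULLs with 0.
dummies = (multi_long.pivot(index="customerID", columns="Name", values="Value")
           .fillna(0).astype("int8").reset_index())
```

Each customer now has a 0/1 column per category, e.g. Contract_Month-to-month and Contract_Two year.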

Finally, we use a Join Multiple tool to join the three resulting flows (numeric variables, binary variables, and categorical variables with more than two categories) on customerID. And of course, we save the output.
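The final join can be sketched as chained merges on customerID; the three tiny frames below are made-up stand-ins for the prepared flows, and the output file name is only illustrative.

```python
import pandas as pd

# Made-up stand-ins for the three prepared flows.
numeric = pd.DataFrame({"customerID": ["0001", "0002"], "tenure_z": [-1.0, 1.0]})
binary = pd.DataFrame({"customerID": ["0001", "0002"], "Partner": [1, 0]})
multi = pd.DataFrame({"customerID": ["0001", "0002"],
                      "Contract_Two year": [0, 1]})

# Join Multiple equivalent: merge all three flows on customerID.
final = numeric.merge(binary, on="customerID").merge(multi, on="customerID")

# Save the output (illustrative file name).
final.to_csv("churn_prepared.csv", index=False)
```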

Were you able to perform all these steps? Congratulations! We now have a clean, standardised, ready-to-use dataset. In our next article, we will develop a predictive model with this data.