Load, check & join your data: how to use these great tools

In this article we want to show you how to use the different Alteryx join tools in order to analyse your data and obtain great insights. In this case we have data regarding customers and their purchases. In order to make interesting analysis, we have to join both files. This will allow us to estimate in a next article the market share of a company in different regions and its local performance.

Customer and purchase data are loaded, let’s get started!

Before joining both lists, we want to be sure that the customers in the customer file are unique. To check this, we use the unique tool to verify the customer ID. This makes it possible to easily remove doubles. We do not place a unique after the purchase table because one client can make multiple purchases.

Watch out for the impact of double entries

If we have a double entry for a customer, then all the rows corresponding to this customer in the list of purchases will be doubled. The result is that the total spending of this customer will be overestimated by a factor of two. The presence of doubles in tables is a typical source of pernicious errors, but easily avoidable in Alteryx workflows!

join tool
configuration to join data

To join our lists we use a “classic” join tool and we connect the customer table to the “Left input” anchor and the purchase table to the “Right input” anchor. Then, we select one or multiple fields on which we want to join the data. This will only work if the fields are present in both inputs, but they do not have to be named the same way.

When joining fields, they must have the same general type (string, date/time or numeric), otherwise the process will result in an error message. This avoids any irrelevant and potentially risky matching between data sources.

Smart tip

Joining on decimal numbers (float or doubles) is possible but not recommended. The join may be hampered due to precision and rounding issues. Doing this will result in an Alteryx warning message!

(Un)matched data

Results join data

As you can see, not all the entries of both inputs are systematically matched. This is one of the huge advantages of the Alteryx Join tool. All entries are traced and none are lost silently.

  • Behind the middle output we find our new list of customers and expenses.
  • Entries in the left output are customers that did not match with the purchases list. These are probably customers who did not purchase anything in the observed period of time.
  • The entries in the right output are purchases that did not match with any customer. We might have a data quality problem here!

If we look closer, we observe that before 2017 the customer ID was miscoded in the list of purchases (the “ID” characters in the two first positions are missing).
We can now correct the format of the customer ID and join these entries again to the customer list.

We use a union tool to combine the output of the two join tools. The union tool with the option “Auto config by name” takes two or more tables with the same fields and combines them vertically based on field names.

Be careful with the options of the join tool

In this example we have used the default option which is “joining by field”, but you can also make the join according to the position of records. This is not recommended as the risk that the position of records is modified accidentally is large, for example if another analyst works on the workflow without knowing that this order is critical.

In this use case we have joined a file with customer information and purchases data. You have seen that the join tool is an intuitive, versatile and efficient way to blend, control and correct your data in a reproducible way.

The outcome is a new table with all information in which we are now able to, for example, aggregate all spending by postal code. By comparing these to the market potential we would obtain a view of the local performance. But this will be a topic for a next article.