Validating non-UK addresses

Recently, I examined a sizeable set of UK application data containing previous addresses and found that around 30% of those addresses could be classed as non-UK. At other organisations, I have seen figures ranging from 15% to 30%, depending on the product and demographic*.

Non-UK addresses are not an issue in themselves. However, if you want to achieve anything meaningful with them, a missing country can be a problem:

  • Validation of non-UK identity (i.e. name and foreign address) – without knowing which country the address relates to, you cannot access databases and services that may improve electronic identity pass rates
  • Identification of countries – e.g. sanctioned countries or others that require additional scrutiny

Some organisations will ask directly for a country to populate a dedicated field, others will have a free-form address, and others might try to capture the addresses in a UK-style format. If the organisation uses any address standardisation or lookups, these may be challenging without a country identifier.

Looking at the addresses themselves, approximately 50% can have their country identified through the clever use of lookups (assuming we have already identified UK addresses through a valid UK postcode format):

  • Is there a country name in the field? – this can create false positives if you have patterns such as “DENMARK STREET” in the data. The solution here could be a pre-filter which looks for these patterns and removes them before checking against lists. However, the pre-filter also needs to consider global standards and address features.
  • Is there an International Organization for Standardization (ISO) country code in the field? – again, this can cause false positives if the lookup is not restricted to the last word in the string. For instance, “DE” at the end of a string is different to “DE LA” in the middle.
  • Are there any country-specific features in the data? – some examples could be “Ul.”, “Ulica”, “Via”, “Rue”, “Rua” and so on. These may also help to narrow down to a specific language style, such as Portugal vs Brazil vs South India.
  • Are there towns, counties or other named features in the data? – again, similar false positives can apply here. Care also needs to be taken with the data used for the lookup, as many towns and counties are replicated in different countries and these should be removed. It is common to see towns in Australia, Canada and the US with the same name as UK towns.
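To make the first two checks more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the postcode regular expression is a simplified shape check rather than the full UK rule set, the country and street-word lists are tiny invented samples, and the name guess_country is hypothetical.

```python
import re

# Assumption: a simplified UK postcode shape, not the full official rules.
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

# Assumption: tiny illustrative lookups; a real list would cover every country.
COUNTRY_NAMES = {"DENMARK": "DK", "GERMANY": "DE", "FRANCE": "FR", "POLAND": "PL"}
ISO_CODES = {"DK", "DE", "FR", "PL"}

# Pre-filter for country names used as street names, e.g. "DENMARK STREET".
STREET_WORDS = "STREET|ST|ROAD|RD|AVENUE|AVE|LANE|CLOSE|HILL"
STREET_PATTERN = re.compile(
    rf"\b(?:{'|'.join(COUNTRY_NAMES)})\s+(?:{STREET_WORDS})\b"
)


def guess_country(address):
    """Return a two-letter country code, or None if no lookup matches."""
    text = address.upper()

    # A valid-looking UK postcode means we treat the address as UK.
    if UK_POSTCODE.search(text):
        return "GB"

    # Remove "country name + street word" patterns before the name lookup.
    cleaned = STREET_PATTERN.sub(" ", text)
    tokens = re.findall(r"[A-Z]+", cleaned)

    # Check 1: a country name anywhere in the remaining text.
    for token in tokens:
        if token in COUNTRY_NAMES:
            return COUNTRY_NAMES[token]

    # Check 2: an ISO code, but only as the last word of the string,
    # so that "DE" in "CALLE DE LA PAZ" is not read as Germany.
    if tokens and tokens[-1] in ISO_CODES:
        return tokens[-1]

    return None


print(guess_country("12 Denmark Street, London WC2H 8LS"))
print(guess_country("Hauptstrasse 5, 10115 Berlin, Germany"))
print(guess_country("Calle de la Paz 3, Madrid"))
```

The order of the checks matters here: the postcode test runs first, the street pre-filter runs before the country-name lookup, and the ISO code is only read from the final word. Feature and town lookups would slot in after these in the same way.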

But what of the remainder? This is where a machine learning process could be used to predict the country from the address data, given a dataset containing a variety of addresses and their countries. A person who has been trained to identify patterns in addresses, and to consider names, can identify a country with a high degree of accuracy, particularly if they have local information, experience or knowledge. For instance, they may know that “Riga” is in Latvia, recognise the patterns of letters that are characteristic of Romanian names and addresses, or notice that some USA addresses have a two-character state code. However, this takes time and uses considerable resources.

This makes the manual, long-winded task a prime candidate for a machine learning experiment: “Given data, can we classify addresses by their likely country of origin?”
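As a rough illustration of what such an experiment might look like, the sketch below trains a small character trigram Naive Bayes classifier in plain Python. This is my assumption of one reasonable starting point, not a tested or recommended method: the class name NgramCountryClassifier is hypothetical, the six training addresses are invented, and a real experiment would need a far larger labelled dataset and a proper evaluation.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split an address into overlapping character n-grams."""
    padded = " " + text.upper() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramCountryClassifier:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # country -> n-gram counts
        self.doc_counts = Counter()               # country -> address count
        self.vocabulary = set()

    def fit(self, addresses, countries):
        for address, country in zip(addresses, countries):
            grams = char_ngrams(address, self.n)
            self.ngram_counts[country].update(grams)
            self.doc_counts[country] += 1
            self.vocabulary.update(grams)
        return self

    def predict(self, address):
        grams = char_ngrams(address, self.n)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_country, best_score = None, -math.inf
        for country, counts in self.ngram_counts.items():
            # Log prior plus log likelihoods with add-one smoothing.
            score = math.log(self.doc_counts[country] / total_docs)
            total = sum(counts.values())
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_country, best_score = country, score
        return best_country


# Invented training examples, purely for illustration.
training = [
    ("Ulica Dluga 12, Gdansk", "PL"),
    ("Ulica Marszalkowska 8, Warszawa", "PL"),
    ("Rue de Rivoli 10, Paris", "FR"),
    ("Rue de la Paix 3, Lyon", "FR"),
    ("Via Roma 5, Milano", "IT"),
    ("Via Garibaldi 21, Torino", "IT"),
]
model = NgramCountryClassifier().fit(
    [address for address, _ in training],
    [country for _, country in training],
)

print(model.predict("Ulica Krotka 4, Krakow"))
print(model.predict("Rue de la Gare 9, Nantes"))
```

Character n-grams are used because they pick up the same cues a trained person relies on, such as street prefixes and characteristic letter sequences, without needing a separate lookup for each one.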

*The presence of a foreign previous address may be symptomatic of the product being offered. For instance, if the applicant is looking for a basic bank account, which should be available to any EU citizen or to those without a credit footprint, then the bank would expect no (or very little) credit activity on the bureau, regardless of whether they have supplied 3 years or 6 months of address history.
