Dedupe with Metaphone and Google Maps

Litmus Test to Identify Duplicates

Have you bought a tool to identify, remove, or prevent duplicate contacts, leads, and so on?
Here is how you can test whether it is worth its money.

Enter two contacts:

1. Marc Foreman, 139 E. Lancaster Ave. Radnor PA.

and

2. Mark Phoreman, 139 East Lancaster Avenue, Wayne Pennsylvania.

Now set the duplicate criteria to a fuzzy match on Name with an edit distance of less than 3, and an exact match on Address.

Does your tool identify these two contacts as probable duplicates? If yes, congratulations: you have bought the right tool.

If not, try two free but very powerful tools: Metaphone and the Google Maps API.

But why are these records duplicates? The names sound similar, but the addresses look different. Or are they?

Only a map API such as Google Maps can answer that question.

First, let us see how the names are identified as similar.

Metaphone

Metaphone is a phonetic algorithm that generates a phonetic code for English words. In my previous blog post I explained how the Soundex algorithm can be used to perform phonetic search or matching by indexing on the Soundex code. Metaphone can be used in the same way to index and match words phonetically. Soundex keeps the first letter of the word as the first character of its code, which is a major impediment. In this case, Phoreman has the Soundex code P655 while Foreman has F655, so Soundex will never identify them as phonetically alike. Metaphone, on the other hand, correctly generates MRK for both Mark and Marc, and FRMN for both Foreman and Phoreman.
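To make the contrast concrete, here is a small Python sketch. The `soundex` function follows the usual American Soundex rules. The `rough_phonetic_key` function is a hypothetical helper, not the real Metaphone algorithm: it applies only a toy subset of Metaphone-style rules (PH sounds like F, C is treated as a hard K, non-leading vowels are dropped), which happens to be enough for these two names. For real use, a complete Metaphone implementation would be needed.

```python
# American Soundex digit table.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with equal codes
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeat after a vowel counts
    return (first + "".join(digits) + "000")[:4]


def rough_phonetic_key(word: str) -> str:
    """Toy subset of Metaphone-style rules (hypothetical, illustration only)."""
    w = "".join(c for c in word.upper() if c.isalpha())
    w = w.replace("PH", "F").replace("C", "K")  # assumes every C is hard
    key = []
    for i, c in enumerate(w):
        if c in "AEIOU" and i > 0:
            continue  # keep a vowel only when it starts the word
        if key and key[-1] == c:
            continue  # collapse doubled letters
        key.append(c)
    return "".join(key)


if __name__ == "__main__":
    for a, b in [("Marc", "Mark"), ("Foreman", "Phoreman")]:
        print(a, b, "soundex equal:", soundex(a) == soundex(b))
        print(a, b, "rough key equal:",
              rough_phonetic_key(a) == rough_phonetic_key(b))
```

Because Soundex always keeps the first letter, the two last names can never share a code, whereas a key built from sounds treats PH and F alike.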

We can also use the Apex getLevenshteinDistance string method to calculate the edit distance between the two strings. In our case it is 2 for the last name and 1 for the first name.
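For readers outside Apex, the same distance can be sketched in Python with the standard dynamic-programming approach. The function names `levenshtein` and `names_fuzzy_match` are illustrative only, and the threshold of 3 is taken from the litmus test earlier in this post.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character inserts, deletes, or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def names_fuzzy_match(name_a: str, name_b: str, max_distance: int = 3) -> bool:
    """True when the edit distance is strictly below max_distance."""
    return levenshtein(name_a, name_b) < max_distance


if __name__ == "__main__":
    print(levenshtein("Marc", "Mark"))
    print(levenshtein("Foreman", "Phoreman"))
    print(names_fuzzy_match("Foreman", "Phoreman"))
```

Note that the comparison here is case-sensitive. Lowercasing both inputs first is a reasonable choice when the data has inconsistent capitalization.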

Google Maps API

Now let us perform address matching with the Google Maps API.

At the time of writing, the Google Maps API is free as long as it is also used to display the records on a map. It can also be used for address validation and correction, geocoding, and reverse geocoding.

Let us normalize the address 139 E. Lancaster Ave., Radnor PA with the following request:

http://maps.googleapis.com/maps/api/geocode/xml?address=139+Lancaster+Ave,+radnor+PA&sensor=false
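A request URL like the one above can be built rather than typed by hand. The sketch below uses only the Python standard library and sends nothing over the network. The helper name `geocode_url` is hypothetical, and the endpoint and `sensor` parameter are copied from the URL in this post. Current versions of the service generally expect HTTPS and an API key, so treat the exact URL as historical.

```python
from urllib.parse import urlencode

# Endpoint as it appears in this post (assumption: it may have changed since).
GEOCODE_ENDPOINT = "http://maps.googleapis.com/maps/api/geocode/xml"


def geocode_url(address: str) -> str:
    """Build the geocoding request URL for a free-form address string."""
    # safe="," leaves commas unescaped; spaces are encoded as "+".
    query = urlencode({"address": address, "sensor": "false"}, safe=",")
    return f"{GEOCODE_ENDPOINT}?{query}"


if __name__ == "__main__":
    print(geocode_url("139 Lancaster Ave, radnor PA"))
    print(geocode_url("139 East Lancaster Avenue, Wayne Pennsylvania"))
```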

It returns the following XML response:

<?xml version="1.0" encoding="UTF-8"?>
<GeocodeResponse>
 <status>OK</status>
 <result>
  <type>street_address</type>
  <formatted_address>139 East Lancaster Avenue, Wayne, PA 19087, USA</formatted_address>
  <address_component>
   <long_name>139</long_name>
   <short_name>139</short_name>
   <type>street_number</type>
  </address_component>
  <address_component>
   <long_name>East Lancaster Avenue</long_name>
   <short_name>E Lancaster Ave</short_name>
   <type>route</type>
  </address_component>
  <address_component>
   <long_name>Wayne</long_name>
   <short_name>Wayne</short_name>
   <type>locality</type>
   <type>political</type>
  </address_component>
  <address_component>
   <long_name>Radnor</long_name>
   <short_name>Radnor</short_name>
   <type>administrative_area_level_3</type>
   <type>political</type>
  </address_component>
  <address_component>
   <long_name>Delaware</long_name>
   <short_name>Delaware</short_name>
   <type>administrative_area_level_2</type>
   <type>political</type>
  </address_component>
  <address_component>
   <long_name>Pennsylvania</long_name>
   <short_name>PA</short_name>
   <type>administrative_area_level_1</type>
   <type>political</type>
  </address_component>
  <address_component>
   <long_name>United States</long_name>
   <short_name>US</short_name>
   <type>country</type>
   <type>political</type>
  </address_component>
  <address_component>
   <long_name>19087</long_name>
   <short_name>19087</short_name>
   <type>postal_code</type>
  </address_component>
  <geometry>
   <location>
    <lat>40.0440030</lat>
    <lng>-75.3864640</lng>
   </location>
   <location_type>ROOFTOP</location_type>
   <viewport>
    <southwest>
     <lat>40.0426540</lat>
     <lng>-75.3878130</lng>
    </southwest>
    <northeast>
     <lat>40.0453520</lat>
     <lng>-75.3851150</lng>
    </northeast>
   </viewport>
  </geometry>
 </result>
</GeocodeResponse>

Both addresses get exactly the same response.

Google has not only corrected the address but also provided a lot more information, such as latitude, longitude, street number, and zip code. This information is valuable and can be used for location analytics. For now, though, we are going to use the formatted address, which is 139 East Lancaster Avenue, Wayne, PA 19087, USA for both records. Thus, by using the Google Maps API and the Metaphone algorithm in tandem, we can deduce that these two records are most probably duplicates and need to be resolved. We have also managed to validate and clean our addresses.
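As a rough illustration of the address step, the Python sketch below parses a trimmed copy of the response shown earlier and compares two records by their formatted address. The names `parse_geocode` and `same_place` are hypothetical, and the sample keeps only the elements the code reads.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from this post (most components omitted).
SAMPLE_RESPONSE = """<GeocodeResponse>
 <status>OK</status>
 <result>
  <type>street_address</type>
  <formatted_address>139 East Lancaster Avenue, Wayne, PA 19087, USA</formatted_address>
  <geometry>
   <location>
    <lat>40.0440030</lat>
    <lng>-75.3864640</lng>
   </location>
   <location_type>ROOFTOP</location_type>
  </geometry>
 </result>
</GeocodeResponse>"""


def parse_geocode(xml_text: str):
    """Return formatted address and coordinates of the first result,
    or None when the response status is not OK or has no result."""
    root = ET.fromstring(xml_text)
    if root.findtext("status") != "OK":
        return None
    result = root.find("result")
    if result is None:
        return None
    return {
        "formatted_address": result.findtext("formatted_address"),
        "lat": float(result.findtext("geometry/location/lat")),
        "lng": float(result.findtext("geometry/location/lng")),
    }


def same_place(a, b) -> bool:
    """Exact match on the normalized (formatted) address."""
    return (a is not None and b is not None
            and a["formatted_address"] == b["formatted_address"])


if __name__ == "__main__":
    first = parse_geocode(SAMPLE_RESPONSE)   # response for contact 1
    second = parse_geocode(SAMPLE_RESPONSE)  # contact 2 got the same response
    print(first["formatted_address"])
    print(same_place(first, second))
```

Comparing the formatted address keeps the "Address exact match" rule from the litmus test, while the latitude and longitude remain available if a distance-based comparison is preferred.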

Conclusion

As the old saying goes: the best things in life are free.
