Sounds Like a Duck

Category: Programming
Tags: soundex, php, metaphone

I’ve been doing a lot of reading lately about soundex and other algorithms for matching words by sound. My iOS application has a simple alliteration function that basically just looks for matching starting characters. In the next version, I want to expand the alliteration and add a sounds-like function.

Soundex is a nice simple function, primarily designed to work with surnames, that converts a name to its first letter followed by digits representing the remaining consonant sounds. It’s a coding system that’s been around since the early 1900s. It works, and it’s easy to use, but it is limited to English.
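To make the encoding concrete, here is a minimal sketch using PHP's built-in soundex() function. The sample names are only illustrations; the point is that differently spelled names can share one code.

```php
<?php
// soundex() keeps the first letter and encodes the consonant sounds that
// follow as digits, padding with zeros to four characters.
$names = ['Bill', 'Louis', 'William', 'Jimmy', 'Carter'];

foreach ($names as $name) {
    printf("%-10s %s\n", $name, soundex($name));
}

// Names that sound alike share a code even when spelled differently.
var_dump(soundex('Bill') === soundex('Bella')); // bool(true)
```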

So, I set up a simple PHP test script to look at soundex results. The script creates a soundex code of the input, takes the first three characters of the soundex code, and returns the results of an SQL call:

$soundex = soundex($soundslike);
$soundexPrefix = substr($soundex, 0, 3);
$sql = "SELECT name from names WHERE SOUNDEX(name) LIKE '$soundexPrefix%' order by rand() limit 10";
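For anything beyond a throwaway test script, the same query could be built with a bound parameter rather than string interpolation. This is only a sketch: buildSoundexQuery() and the :pattern placeholder are names I made up for illustration, and the PDO connection is assumed rather than shown.

```php
<?php
// Hypothetical helper: builds the same query with a bound parameter
// instead of interpolating the value into the SQL string.
function buildSoundexQuery(string $soundsLike, int $prefixLength = 3): array
{
    $prefix = substr(soundex($soundsLike), 0, $prefixLength);
    $sql = 'SELECT name FROM names WHERE SOUNDEX(name) LIKE :pattern '
         . 'ORDER BY RAND() LIMIT 10';

    return [$sql, [':pattern' => $prefix . '%']];
}

[$sql, $params] = buildSoundexQuery('Bill');
echo $sql, "\n";
print_r($params);

// With a PDO connection in $pdo (assumed, not shown):
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
// $rows = $stmt->fetchAll(PDO::FETCH_COLUMN);
```

Passing a prefix length of 4 uses the full soundex code, which corresponds to the "no prefix" case.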

This works OK. Here are a couple of examples:

input      name            sybcount   soundex
Bill       Bloi            1          B400
Bill       Biley           2          B400
Bill       Ballou          2          B400
Bill       Boele           1          B400
Bill       Balow           2          B400
Bill       Builly          3          B400
Bill       Billiou         2          B400
Bill       Bayol           2          B400
Bill       Bella           2          B400
Bill       Ballie          2          B400

input      name            sybcount   soundex
Louis      Laux            1          L200
Louis      Laske           1          L200
Louis      Leigha          2          L200
Louis      Luis            2          L200
Louis      Lasse           1          L200
Louis      Lucija          3          L220
Louis      Lezo            2          L200
Louis      Laessig         2          L220
Louis      Luiz            2          L200
Louis      Lucek           2          L220

Using the full soundex code instead of a prefix narrows the results, for example:

input      name            sybcount   soundex
William    Willaim         2          W450
William    Wollin          2          W450
William    Willman         2          W455
William    Wallin          2          W450
William    Wahlman         2          W455
William    Whelan          2          W450
William    Wallman         2          W455
William    William*        3          W450
William    Woolen          2          W450
William    Willam          2          W450

sybcount is a rough function I’m working on to count the syllables in each returned name. Since I want sounds-like matching for full names, I’m going to expand the script a bit to look up given names and surnames separately and return a full name. For sounds like, I want the user to have the flexibility of options like:

  • Bob Newhart
  • % New%
  • Bob New%

Here, % acts as a wildcard. With the first option, the first name should sound like Bob and the last name should sound like Newhart.
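One possible way to interpret those three input forms is sketched below. The function names and the three modes (any, prefix, soundslike) are my own assumptions for illustration, not a settled design.

```php
<?php
// Hypothetical classifier for one part of the input:
//   any        - a bare "%", no restriction on that part
//   prefix     - text ending in "%", match names starting with that text
//   soundslike - plain text, to be matched phonetically
function classifyPart(string $part): array
{
    if ($part === '' || $part === '%') {
        return ['mode' => 'any', 'text' => ''];
    }
    if (substr($part, -1) === '%') {
        return ['mode' => 'prefix', 'text' => rtrim($part, '%')];
    }
    return ['mode' => 'soundslike', 'text' => $part];
}

// Splits "given surname" on the first run of whitespace; a missing
// surname is treated as a wildcard.
function parseSoundsLike(string $input): array
{
    $parts = preg_split('/\s+/', trim($input), 2);

    return [
        'given'   => classifyPart($parts[0] ?? '%'),
        'surname' => classifyPart($parts[1] ?? '%'),
    ];
}

print_r(parseSoundsLike('Bob New%'));
```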

Looking at option 1, if I set no prefix for the soundex, meaning I use the full soundex code, Bob returns nothing. If I set the prefix length to 1, it returns any given name that starts with B and any surname that starts with N. The results are not impressive:

soundslike: Bob, Newhart
soundex, B, N
SELECT name from names WHERE givenname = 1 and SOUNDEX(name) LIKE 'B%' order by rand() limit 10
SELECT name from names WHERE surname = 1 and SOUNDEX(name) LIKE 'N%' order by rand() limit 10

inputs               name                      sybcount
Bob Newhart          Babette Nicolette         6
Bob Newhart          Bibi Nowland              4
Bob Newhart          Bijan Nuffer              4
Bob Newhart          Breanne Nasri             4
Bob Newhart          Bahnbeamter Nedina        6
Bob Newhart          Baldassare Nagengast      7
Bob Newhart          Beggue Nolette            4
Bob Newhart          Baptiste Nabavian         7
Bob Newhart          Byron Najera              5
Bob Newhart          Barnardus Nickless        5

No prefix works with some names, but still, the results are not great:

soundslike: Jimmy, Carter
soundex, J500, C636
SELECT name from names WHERE givenname = 1 and SOUNDEX(name) LIKE 'J500%' order by rand() limit 10
SELECT name from names WHERE surname = 1 and SOUNDEX(name) LIKE 'C636%' order by rand() limit 10

inputs               name                      sybcount
Jimmy Carter         Joannah Cryder            4
Jimmy Carter         Jenna Courtright          4
Jimmy Carter         Jemina Carrethers         6
Jimmy Carter         Jamey Caruthers           5
Jimmy Carter         Jana Carrothers           5
Jimmy Carter         Jannie Crowther           4
Jimmy Carter         Jane Cordeiro             5
Jimmy Carter         Janie Carruthers          5
Jimmy Carter         Jesenia Cardeiro          7
Jimmy Carter         Jasna Cardera             5

Metaphone is a popular sounds-like alternative. So, again, I want to convert the incoming sounds-like names into Metaphone codes, then restrict the result set to names that match part of the code.

$metaphonef = metaphone($soundslikef, $prefix);
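Here is a small sketch of how the second argument to metaphone() changes the code, and how a shortened code can be used as a prefix match. metaphonePrefixMatches() is a hypothetical helper written for this illustration, not part of the test script.

```php
<?php
// metaphone() takes an optional second argument that caps the number of
// phonemes returned; 0 (the default) means no limit.
$full  = metaphone('Newhart');     // full code
$short = metaphone('Newhart', 2);  // at most two phonemes

echo "$full $short\n";

// Hypothetical helper: a candidate matches when its full code starts with
// the (possibly shortened) code of the input. A shorter code widens the net.
function metaphonePrefixMatches(string $input, string $candidate, int $maxPhonemes = 0): bool
{
    $code = metaphone($input, $maxPhonemes);

    return $code !== '' && strpos(metaphone($candidate), $code) === 0;
}

var_dump(metaphonePrefixMatches('Newhart', 'Newhall'));    // full code
var_dump(metaphonePrefixMatches('Newhart', 'Newhall', 2)); // two phonemes
```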

Metaphone with unlimited phonemes returned better results for the Bob Newhart use case:

soundslike: Bob, Newhart
metaphone: BB, NHRT

inputs               name                      sybcount
Bob Newhart          Bibi Neuhart              4
Bob Newhart          Baby Neyhart              4
Bob Newhart          Bebe Newhard              4
Bob Newhart          Bobbie Nihart             4
Bob Newhart          Bobbi Newhart             4
Bob Newhart           Neihart                  2
Bob Newhart                                    0
Bob Newhart                                    0
Bob Newhart                                    0
Bob Newhart                                    0

If I cut the phonemes down to 2, the results are more promising:

soundslike: Bob, Newhart
metaphone: BB, NH

inputs               name                      sybcount
Bob Newhart          Babette Nyhan             5
Bob Newhart          Bobbye Nihart             4
Bob Newhart          Bobbie Nyhus              4
Bob Newhart          Babatunde Neihart         6
Bob Newhart          Bibi Nhep                 3
Bob Newhart          Bebe Newhook              4
Bob Newhart          Bobak Newhall             4
Bob Newhart          Bobbi Neehouse            4
Bob Newhart          Boban Nohe                3
Bob Newhart          Babak Newham              4

Many folks suggest combining Metaphone with Levenshtein distance, using the distance to set an error margin. Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, so two strings with a distance of 1 differ by a single edit. The idea is to calculate the Metaphone code for the input and for each name, then use the distance between the codes as a percent difference to look for matches. This is an interesting idea, and probably more useful for narrowing searches or matching fuzzy input, but it is not yielding good results in this case.
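A rough sketch of that combination, using PHP's built-in metaphone() and levenshtein(), might look like the following. The helper names and the 0.25 threshold are assumptions picked for illustration, and this is not necessarily how my test script computes the percent difference.

```php
<?php
// Sketch: compare two names by the edit distance between their metaphone
// codes, expressed as a fraction of the longer code (0.0 = identical codes).
function metaphoneDifference(string $a, string $b): float
{
    $codeA = metaphone($a);
    $codeB = metaphone($b);
    $longest = max(strlen($codeA), strlen($codeB));
    if ($longest === 0) {
        return 0.0;
    }

    return levenshtein($codeA, $codeB) / $longest;
}

// The 0.25 threshold is an arbitrary value chosen for illustration.
function soundsClose(string $a, string $b, float $threshold = 0.25): bool
{
    return metaphoneDifference($a, $b) <= $threshold;
}

var_dump(levenshtein('KRTR', 'KRT')); // int(1): one deletion
var_dump(soundsClose('Jimmy', 'Jimmy'));
var_dump(soundsClose('Bob', 'Newhart'));
```

Because Metaphone codes are short, a single edit is already a large percent difference, so the threshold needs tuning against real data.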

Looks like the Metaphone search with no limit is giving me the best results. Here are a couple of name examples:

soundslike: Jimmy, Carter
metaphone: JM, KRTR

inputs               name                      sybcount
Jimmy Carter         Jaimee Grider             4
Jimmy Carter         Gema Graeter              4
Jimmy Carter         Jomo Kreider              4
Jimmy Carter         Jaymie Gorter             4
Jimmy Carter         Jem Greider               3
Jimmy Carter         Jamee Cardeiro            5
Jimmy Carter         Jimmie Cordrey            4
Jimmy Carter         Jammie Credeur            4
Jimmy Carter         Jamey Cartier             4
Jimmy Carter         Jimmy Crader              4

Metaphone Levenshtein Distance
metaphone: JM, KRTR

inputs               name                      sybcount
Jimmy Carter         Gentien Grentmesnil       5
Jimmy Carter         Shaunta Aristizabal       7
Jimmy Carter         Ivarson Brightharp        5
Jimmy Carter         Ersan Hilderbrandt        5
Jimmy Carter         Thiphaine Macnaughton     6
Jimmy Carter         Katheleen Policastri      7
Jimmy Carter         Dorothee Vonderhaar       7
Jimmy Carter         Rebeccah                  3
Jimmy Carter         Ederson                   3
Jimmy Carter         Svaer                     1

soundex, J500, C636

inputs               name                      sybcount
Jimmy Carter         Jazmine Cryderman         6
Jimmy Carter         Jaymie Czartoryski        6
Jimmy Carter         Jonn Cardero              4
Jimmy Carter         Jón Cordray               2
Jimmy Carter         Jannine Charter           5
Jimmy Carter         Jane Crouter              4
Jimmy Carter         Jesenia Cruthird          6
Jimmy Carter         Joon-ho Cortright         4
Jimmy Carter         Jimmy Courter             4
Jimmy Carter         Joannin Carrethers        5

soundslike: Mary, Sue
metaphone: MR, S

inputs               name                      sybcount
Mary Sue             Meir Sooy                 3
Mary Sue             Meri Zou                  3
Mary Sue             Miri Sy                   3
Mary Sue             Mr Sey                    1
Mary Sue             Mary Hsu                  3
Mary Sue             Maura Zeh                 3
Mary Sue             Merri Su                  3
Mary Sue             Maihri Hsieh              3
Mary Sue             Mairi Sou                 3
Mary Sue             Mária Sea                 3

Metaphone Levenshtein Distance
metaphone: MR, S

inputs               name                      sybcount
Mary Sue             Jerimas Gabrielsen        6
Mary Sue             Evariste Nett             5
Mary Sue             Norfolkeast Patsula       6
Mary Sue             Rebecka Suggs             4
Mary Sue             Denisha Klemetz           5
Mary Sue             Tressie Halgas            4
Mary Sue             Carita Robida             6
Mary Sue             Wesley Mish               3
Mary Sue             Cezary Gosson             5
Mary Sue             Hupert Puelo              4

soundex, M600, S000

inputs               name                      sybcount
Mary Sue             Muammer Saka              5
Mary Sue             Munir Shyu                3
Mary Sue             Marya Secky               4
Mary Sue             Maira Schuh               3
Mary Sue             Mireya Saia               5
Mary Sue             Meri Sheeks               3
Mary Sue             Mara Sow                  3
Mary Sue             Mhairi Schewe             3
Mary Sue             Munira Schei              4
Mary Sue             Maria Sockey              5

OK, decision made. I’m going with the simple Metaphone approach. On the server side of the application, this should be pretty easy, since PHP has a built-in metaphone() function. Objective-C will be a little more challenging, however. The algorithms for Metaphone and Double Metaphone are publicly available, so I’ll probably just end up writing my own function.

One thing I need to sleep on: should I precalculate all of the Metaphone codes and change the data structure to accommodate the new column? This would greatly reduce the time to find matches. Before making the data structure change, I will probably mock it up and see if it works as expected.
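A quick mock-up of the precalculation idea could be as simple as the sketch below, which groups names by their Metaphone code in memory. The function names are hypothetical, and in the database this would correspond to a stored code column rather than a PHP array.

```php
<?php
// Sketch: compute each name's metaphone code once and group names by code,
// so a lookup becomes a single key access instead of encoding every name.
function buildMetaphoneIndex(array $names): array
{
    $index = [];
    foreach ($names as $name) {
        $index[metaphone($name)][] = $name;
    }

    return $index;
}

// Returns every indexed name whose code equals the code of the input.
function lookupSoundsLike(array $index, string $input): array
{
    return $index[metaphone($input)] ?? [];
}

$index = buildMetaphoneIndex(['Jimmy', 'Jimmie', 'Mary', 'Carter']);
print_r(lookupSoundsLike($index, 'Jimmy'));
```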

That should be enough nerd talk for today.


