I’ve been doing a lot of reading lately about soundex and other algorithms for matching words. In my iOS application, I have a simple alliteration function, basically just looking for matching starting characters. in the next version, I want to expand the alliteration and have a sounds like function.
Soundex is a nice simple function, primarily designed to work with surnames, that converts a name to a starting consonant sound and digits representing the remaining sounds. It’s a coding system that’s been around since the early 1900’s. It works, and it’s easy to use, but it is limited to english.
So, I setup a simple php test script to look at soundex results. The script creates a soundex code of the input, takes the first three characters of the soudex code and returns the results of an sql call:
$soundex = soundex($soundslike); $soundexPrefix = substr($soundex, 0, 3); $sql = "SELECT name from names WHERE SOUNDEX(name) LIKE '$soundexPrefix%' order by rand() limit 10";
This works ok. Here are a couple of examples:
input name sybcount soundex Bill Bloi 1 B400 Bill Biley 2 B400 Bill Ballou 2 B400 Bill Boele 1 B400 Bill Balow 2 B400 Bill Builly 3 B400 Bill Billiou 2 B400 Bill Bayol 2 B400 Bill Bella 2 B400 Bill Ballie 2 B400 input name sybcount soundex Louis Laux 1 L200 Louis Laske 1 L200 Louis Leigha 2 L200 Louis Luis 2 L200 Louis Lasse 1 L200 Louis Lucija 3 L220 Louis Lezo 2 L200 Louis Laessig 2 L220 Louis Luiz 2 L200 Louis Lucek 2 L220
Without prefixing the soundex return, the results are more narrow, for example:
input name sybcount soundex William Willaim 2 W450 William Wollin 2 W450 William Willman 2 W455 William Wallin 2 W450 William Wahlman 2 W455 William Whelan 2 W450 William Wallman 2 W455 William William* 3 W450 William Woolen 2 W450 William Willam 2 W450
sybcount is a rough function I’m working on to count syllables in the name returned. Since I’m looking to do sounds like for full names, I’m going to expand the script a bit to get given and surnames and return full name. For sounds like, I want the user to have the flexibility of options like:
- Bob Newhart
- % New%
- Bob New%
In this case, using % as a wild card. The first choice, the first name would sound like Bob, the last would sound like Newhart.
Looking at option 1, if I set no prefix for the soundex, meaning use the full soundex code, Bob returns nothing. If I set the prefix to 1, then it returns anything that starts with B or N. The results are not impressive:
soundslike: Bob, Newhart soundex, B, N SELECT name from names WHERE givenname = 1 and SOUNDEX(name) LIKE 'B%' order by rand() limit 10 SELECT name from names WHERE surname = 1 and SOUNDEX(name) LIKE 'N%' order by rand() limit 10 inputs name sybcount Bob Newhart Babette Nicolette 6 Bob Newhart Bibi Nowland 4 Bob Newhart Bijan Nuffer 4 Bob Newhart Breanne Nasri 4 Bob Newhart Bahnbeamter Nedina 6 Bob Newhart Baldassare Nagengast 7 Bob Newhart Beggue Nolette 4 Bob Newhart Baptiste Nabavian 7 Bob Newhart Byron Najera 5 Bob Newhart Barnardus Nickless 5
No prefix works with some names, but still, the results are not great:
soundslike: Jimmy, Carter soundex, J500, C636 SELECT name from names WHERE givenname = 1 and SOUNDEX(name) LIKE 'J500%' order by rand() limit 10 SELECT name from names WHERE surname = 1 and SOUNDEX(name) LIKE 'C636%' order by rand() limit 10 inputs name sybcount Jimmy Carter Joannah Cryder 4 Jimmy Carter Jenna Courtright 4 Jimmy Carter Jemina Carrethers 6 Jimmy Carter Jamey Caruthers 5 Jimmy Carter Jana Carrothers 5 Jimmy Carter Jannie Crowther 4 Jimmy Carter Jane Cordeiro 5 Jimmy Carter Janie Carruthers 5 Jimmy Carter Jesenia Cardeiro 7 Jimmy Carter Jasna Cardera 5
Metaphone is a popular sounds like alternative. So, again, I want to convert the incoming sounds like names into metaphone code, the restrict the result set to names that match part of the code.
$metaphonef = metaphone($soundslikef, $prefix);
Metaphone with an unlimited phonemes returned better results for the Bob Newhart use case:
soundslike: Bob, Newhart metaphone: BB, NHRT inputs name sybcount Bob Newhart Bibi Neuhart 4 Bob Newhart Baby Neyhart 4 Bob Newhart Bebe Newhard 4 Bob Newhart Bobbie Nihart 4 Bob Newhart Bobbi Newhart 4 Bob Newhart Neihart 2 Bob Newhart 0 Bob Newhart 0 Bob Newhart 0 Bob Newhart 0
If I cut the phonemes down to 2, the results are more promising:
soundslike: Bob, Newhart metaphone: BB, NH inputs name sybcount Bob Newhart Babette Nyhan 5 Bob Newhart Bobbye Nihart 4 Bob Newhart Bobbie Nyhus 4 Bob Newhart Babatunde Neihart 6 Bob Newhart Bibi Nhep 3 Bob Newhart Bebe Newhook 4 Bob Newhart Bobak Newhall 4 Bob Newhart Bobbi Neehouse 4 Bob Newhart Boban Nohe 3 Bob Newhart Babak Newham 4
Many folks suggest a combination of Metaphone and Levenshtein distance calculations. Use the Levenshtein distance to set an error margin. Levenshtein distance calculates the variance between two strings. So two strings with a distance of 1, there is one variance between the two. So, the idea is to calculate the Metaphone code for the input variable, and each name, then determine a percent difference to look for matches. This is an interesting idea, and probably more useful for narrowing searches, or fuzzy input matches, but it is not yielding good results in this case.
Looks like the Metaphone search with no limit is giving me the best results. Here are a couple of name examples:
soundslike: Jimmy, Carter Metaphone metaphone: JM, KRTR inputs name sybcount Jimmy Carter Jaimee Grider 4 Jimmy Carter Gema Graeter 4 Jimmy Carter Jomo Kreider 4 Jimmy Carter Jaymie Gorter 4 Jimmy Carter Jem Greider 3 Jimmy Carter Jamee Cardeiro 5 Jimmy Carter Jimmie Cordrey 4 Jimmy Carter Jammie Credeur 4 Jimmy Carter Jamey Cartier 4 Jimmy Carter Jimmy Crader 4 Metaphone Levenshtein Distance metaphone: JM, KRTR inputs name sybcount Jimmy Carter Gentien Grentmesnil 5 Jimmy Carter Shaunta Aristizabal 7 Jimmy Carter Ivarson Brightharp 5 Jimmy Carter Ersan Hilderbrandt 5 Jimmy Carter Thiphaine Macnaughton 6 Jimmy Carter Katheleen Policastri 7 Jimmy Carter Dorothee Vonderhaar 7 Jimmy Carter Rebeccah 3 Jimmy Carter Ederson 3 Jimmy Carter Svaer 1 Soundex soundex, J500, C636 inputs name sybcount Jimmy Carter Jazmine Cryderman 6 Jimmy Carter Jaymie Czartoryski 6 Jimmy Carter Jonn Cardero 4 Jimmy Carter Jón Cordray 2 Jimmy Carter Jannine Charter 5 Jimmy Carter Jane Crouter 4 Jimmy Carter Jesenia Cruthird 6 Jimmy Carter Joon-ho Cortright 4 Jimmy Carter Jimmy Courter 4 Jimmy Carter Joannin Carrethers 5 soundslike: Mary, Sue Metaphone metaphone: MR, S inputs name sybcount Mary Sue Meir Sooy 3 Mary Sue Meri Zou 3 Mary Sue Miri Sy 3 Mary Sue Mr Sey 1 Mary Sue Mary Hsu 3 Mary Sue Maura Zeh 3 Mary Sue Merri Su 3 Mary Sue Maihri Hsieh 3 Mary Sue Mairi Sou 3 Mary Sue Mária Sea 3 Metaphone Levenshtein Distance metaphone: MR, S inputs name sybcount Mary Sue Jerimas Gabrielsen 6 Mary Sue Evariste Nett 5 Mary Sue Norfolkeast Patsula 6 Mary Sue Rebecka Suggs 4 Mary Sue Denisha Klemetz 5 Mary Sue Tressie Halgas 4 Mary Sue Carita Robida 6 Mary Sue Wesley Mish 3 Mary Sue Cezary Gosson 5 Mary Sue Hupert Puelo 4 Soundex soundex, M600, S000 inputs name sybcount Mary Sue Muammer Saka 5 Mary Sue Munir Shyu 3 Mary Sue Marya Secky 4 Mary Sue Maira Schuh 3 Mary Sue Mireya Saia 5 Mary Sue Meri Sheeks 3 Mary Sue Mara Sow 3 Mary Sue Mhairi Schewe 3 Mary Sue Munira Schei 4 Mary Sue Maria Sockey 5
Ok. Decision made. I’m going with the simple Metaphone approach. On the server side of the application, this should be pretty easy, php has a built in metaphone() function. However, Object-c will be a little more challenging. The algorithm for the metaphone and double metaphone are publicly available, so I’ll probably just end up writing my own function.
One thing I need to sleep on, should I pre calculate all of the metaphone codes and make a data structure change to accommodate the new column. This would allow me to greatly reduce the time to find matches. Before making the data structure change, I will probably mock it up and see if it will work as expected.
That should be enough nerd talk for today.