I’ve been doing a lot of reading lately about soundex and other algorithms for matching words. In my iOS application, I have a simple alliteration function, basically just looking for matching starting characters. in the next version, I want to expand the alliteration and have a sounds like function.
Soundex is a nice simple function, primarily designed to work with surnames, that converts a name to a starting consonant sound and digits representing the remaining sounds. It’s a coding system that’s been around since the early 1900’s. It works, and it’s easy to use, but it is limited to english.
So, I setup a simple php test script to look at soundex results. The script creates a soundex code of the input, takes the first three characters of the soudex code and returns the results of an sql call:
$soundex = soundex($soundslike);
$soundexPrefix = substr($soundex, 0, 3);
$sql = "SELECT name from names WHERE SOUNDEX(name) LIKE '$soundexPrefix%' order by rand() limit 10";
This works ok. Here are a couple of examples:
input name sybcount soundex
Bill Bloi 1 B400
Bill Biley 2 B400
Bill Ballou 2 B400
Bill Boele 1 B400
Bill Balow 2 B400
Bill Builly 3 B400
Bill Billiou 2 B400
Bill Bayol 2 B400
Bill Bella 2 B400
Bill Ballie 2 B400
input name sybcount soundex
Louis Laux 1 L200
Louis Laske 1 L200
Louis Leigha 2 L200
Louis Luis 2 L200
Louis Lasse 1 L200
Louis Lucija 3 L220
Louis Lezo 2 L200
Louis Laessig 2 L220
Louis Luiz 2 L200
Louis Lucek 2 L220
Without prefixing the soundex return, the results are more narrow, for example:
input name sybcount soundex
William Willaim 2 W450
William Wollin 2 W450
William Willman 2 W455
William Wallin 2 W450
William Wahlman 2 W455
William Whelan 2 W450
William Wallman 2 W455
William William* 3 W450
William Woolen 2 W450
William Willam 2 W450
sybcount is a rough function I’m working on to count syllables in the name returned. Since I’m looking to do sounds like for full names, I’m going to expand the script a bit to get given and surnames and return full name. For sounds like, I want the user to have the flexibility of options like:
- Bob Newhart
- % New%
- Bob New%
In this case, using % as a wild card. The first choice, the first name would sound like Bob, the last would sound like Newhart.
Looking at option 1, if I set no prefix for the soundex, meaning use the full soundex code, Bob returns nothing. If I set the prefix to 1, then it returns anything that starts with B or N. The results are not impressive:
soundslike: Bob, Newhart
soundex, B, N
SELECT name from names WHERE givenname = 1 and SOUNDEX(name) LIKE 'B%' order by rand() limit 10
SELECT name from names WHERE surname = 1 and SOUNDEX(name) LIKE 'N%' order by rand() limit 10
inputs name sybcount
Bob Newhart Babette Nicolette 6
Bob Newhart Bibi Nowland 4
Bob Newhart Bijan Nuffer 4
Bob Newhart Breanne Nasri 4
Bob Newhart Bahnbeamter Nedina 6
Bob Newhart Baldassare Nagengast 7
Bob Newhart Beggue Nolette 4
Bob Newhart Baptiste Nabavian 7
Bob Newhart Byron Najera 5
Bob Newhart Barnardus Nickless 5
No prefix works with some names, but still, the results are not great:
soundslike: Jimmy, Carter
soundex, J500, C636
SELECT name from names WHERE givenname = 1 and SOUNDEX(name) LIKE 'J500%' order by rand() limit 10
SELECT name from names WHERE surname = 1 and SOUNDEX(name) LIKE 'C636%' order by rand() limit 10
inputs name sybcount
Jimmy Carter Joannah Cryder 4
Jimmy Carter Jenna Courtright 4
Jimmy Carter Jemina Carrethers 6
Jimmy Carter Jamey Caruthers 5
Jimmy Carter Jana Carrothers 5
Jimmy Carter Jannie Crowther 4
Jimmy Carter Jane Cordeiro 5
Jimmy Carter Janie Carruthers 5
Jimmy Carter Jesenia Cardeiro 7
Jimmy Carter Jasna Cardera 5
Metaphone is a popular sounds like alternative. So, again, I want to convert the incoming sounds like names into metaphone code, the restrict the result set to names that match part of the code.
$metaphonef = metaphone($soundslikef, $prefix);
Metaphone with an unlimited phonemes returned better results for the Bob Newhart use case:
soundslike: Bob, Newhart
metaphone: BB, NHRT
inputs name sybcount
Bob Newhart Bibi Neuhart 4
Bob Newhart Baby Neyhart 4
Bob Newhart Bebe Newhard 4
Bob Newhart Bobbie Nihart 4
Bob Newhart Bobbi Newhart 4
Bob Newhart Neihart 2
Bob Newhart 0
Bob Newhart 0
Bob Newhart 0
Bob Newhart 0
If I cut the phonemes down to 2, the results are more promising:
soundslike: Bob, Newhart
metaphone: BB, NH
inputs name sybcount
Bob Newhart Babette Nyhan 5
Bob Newhart Bobbye Nihart 4
Bob Newhart Bobbie Nyhus 4
Bob Newhart Babatunde Neihart 6
Bob Newhart Bibi Nhep 3
Bob Newhart Bebe Newhook 4
Bob Newhart Bobak Newhall 4
Bob Newhart Bobbi Neehouse 4
Bob Newhart Boban Nohe 3
Bob Newhart Babak Newham 4
Many folks suggest a combination of Metaphone and Levenshtein distance calculations. Use the Levenshtein distance to set an error margin. Levenshtein distance calculates the variance between two strings. So two strings with a distance of 1, there is one variance between the two. So, the idea is to calculate the Metaphone code for the input variable, and each name, then determine a percent difference to look for matches. This is an interesting idea, and probably more useful for narrowing searches, or fuzzy input matches, but it is not yielding good results in this case.
Looks like the Metaphone search with no limit is giving me the best results. Here are a couple of name examples:
soundslike: Jimmy, Carter
Metaphone
metaphone: JM, KRTR
inputs name sybcount
Jimmy Carter Jaimee Grider 4
Jimmy Carter Gema Graeter 4
Jimmy Carter Jomo Kreider 4
Jimmy Carter Jaymie Gorter 4
Jimmy Carter Jem Greider 3
Jimmy Carter Jamee Cardeiro 5
Jimmy Carter Jimmie Cordrey 4
Jimmy Carter Jammie Credeur 4
Jimmy Carter Jamey Cartier 4
Jimmy Carter Jimmy Crader 4
Metaphone Levenshtein Distance
metaphone: JM, KRTR
inputs name sybcount
Jimmy Carter Gentien Grentmesnil 5
Jimmy Carter Shaunta Aristizabal 7
Jimmy Carter Ivarson Brightharp 5
Jimmy Carter Ersan Hilderbrandt 5
Jimmy Carter Thiphaine Macnaughton 6
Jimmy Carter Katheleen Policastri 7
Jimmy Carter Dorothee Vonderhaar 7
Jimmy Carter Rebeccah 3
Jimmy Carter Ederson 3
Jimmy Carter Svaer 1
Soundex
soundex, J500, C636
inputs name sybcount
Jimmy Carter Jazmine Cryderman 6
Jimmy Carter Jaymie Czartoryski 6
Jimmy Carter Jonn Cardero 4
Jimmy Carter Jón Cordray 2
Jimmy Carter Jannine Charter 5
Jimmy Carter Jane Crouter 4
Jimmy Carter Jesenia Cruthird 6
Jimmy Carter Joon-ho Cortright 4
Jimmy Carter Jimmy Courter 4
Jimmy Carter Joannin Carrethers 5
soundslike: Mary, Sue
Metaphone
metaphone: MR, S
inputs name sybcount
Mary Sue Meir Sooy 3
Mary Sue Meri Zou 3
Mary Sue Miri Sy 3
Mary Sue Mr Sey 1
Mary Sue Mary Hsu 3
Mary Sue Maura Zeh 3
Mary Sue Merri Su 3
Mary Sue Maihri Hsieh 3
Mary Sue Mairi Sou 3
Mary Sue Mária Sea 3
Metaphone Levenshtein Distance
metaphone: MR, S
inputs name sybcount
Mary Sue Jerimas Gabrielsen 6
Mary Sue Evariste Nett 5
Mary Sue Norfolkeast Patsula 6
Mary Sue Rebecka Suggs 4
Mary Sue Denisha Klemetz 5
Mary Sue Tressie Halgas 4
Mary Sue Carita Robida 6
Mary Sue Wesley Mish 3
Mary Sue Cezary Gosson 5
Mary Sue Hupert Puelo 4
Soundex
soundex, M600, S000
inputs name sybcount
Mary Sue Muammer Saka 5
Mary Sue Munir Shyu 3
Mary Sue Marya Secky 4
Mary Sue Maira Schuh 3
Mary Sue Mireya Saia 5
Mary Sue Meri Sheeks 3
Mary Sue Mara Sow 3
Mary Sue Mhairi Schewe 3
Mary Sue Munira Schei 4
Mary Sue Maria Sockey 5
Ok. Decision made. I’m going with the simple Metaphone approach. On the server side of the application, this should be pretty easy, php has a built in metaphone() function. However, Object-c will be a little more challenging. The algorithm for the metaphone and double metaphone are publicly available, so I’ll probably just end up writing my own function.
One thing I need to sleep on, should I pre calculate all of the metaphone codes and make a data structure change to accommodate the new column. This would allow me to greatly reduce the time to find matches. Before making the data structure change, I will probably mock it up and see if it will work as expected.
That should be enough nerd talk for today.
Enjoy!