August 16, 2003

Soundex Surname Code

This technique was originally developed by Margaret K. Odell and Robert C. Russell for use in encoding surnames for the U.S. Census Bureau. It is designed to bring together the many spellings of a surname that can arise from spelling errors during data entry or inquiry.

The code has four positions. The first position holds the first letter of the surname; the remaining three positions hold digits assigned to the following letters according to this list:

  1. B, P, F, V
  2. C, G, J, K, Q, S, X, Z
  3. D, T
  4. L
  5. M, N
  6. R
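
As a rough illustration, the list above can be turned into a lookup table. This is only a sketch in Python; the names GROUPS, LETTER_TO_DIGIT, and letter_code are made up for this example and are not part of any standard library.

```python
# Digit groups copied from the list above; every other letter has no code.
GROUPS = {
    "1": "BPFV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the groups into a letter -> digit lookup table.
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in GROUPS.items() for letter in letters
}


def letter_code(letter: str) -> str:
    """Return the digit for a letter, or "" if the letter is not on the list."""
    return LETTER_TO_DIGIT.get(letter.upper(), "")


print(letter_code("t"))        # 3
print(repr(letter_code("a")))  # '' because vowels are not on the list
```

Storing the groups in the same shape as the list keeps the table easy to check by eye against the text.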

Letters that do not appear on this list are ignored. If the name has enough letters to generate a code longer than four positions, the remaining letters are discarded. Surnames that do not generate a code of sufficient length have the remaining positions filled with zeroes.

Code numbers are never repeated in succession; only the first occurrence is used. This rule also applies to the first letter of the surname: if the next coded letter has the same code that the first letter would have had, it is discarded. In Scott, for example, the C is dropped because S and C both belong to group 2.

Examples of correct Soundex codes are as follows: Smith–S530, Scott–S300, Miller–M460, Schmit–S530, Williams–W452, McDougall–M232, and Bilbo–B410. Because several surnames can share one code, the correct name is selected from the list of similar-sounding names by examining the other criteria presented along with the surnames.
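
The rules above can be sketched as a short Python function and checked against the example codes. This is a minimal sketch, not a reference implementation: the names CODES and soundex are chosen for this example, and it assumes that ignored letters do not separate repeated code numbers, which is one reading of the rule above. Some Soundex variants handle vowels differently.

```python
# Letter -> digit table built from the six groups listed earlier.
CODES = {}
for digit, letters in enumerate(["BPFV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(surname: str) -> str:
    """Return a four-position code: first letter plus three digits."""
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        return ""  # assumption: an empty name gets an empty code

    first = letters[0]
    digits = []
    # Start from the first letter's own code so an immediate repeat is dropped.
    previous = CODES.get(first, "")

    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if not digit:
            continue  # letter not on the list: ignored (assumed not to reset repeats)
        if digit != previous:
            digits.append(digit)
            previous = digit
        if len(digits) == 3:
            break  # four positions filled; discard the rest of the name

    # Pad short codes with zeroes.
    return (first + "".join(digits)).ljust(4, "0")


examples = {
    "Smith": "S530", "Scott": "S300", "Miller": "M460", "Schmit": "S530",
    "Williams": "W452", "McDougall": "M232", "Bilbo": "B410",
}
for name, expected in examples.items():
    assert soundex(name) == expected, (name, soundex(name))
print("all examples match")
```

Note that Smith and Schmit produce the same code, which is the point of the technique: a lookup by code returns both, and the other criteria decide between them.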

Posted by pscott at August 16, 2003 11:03 PM