Soundex Fuzzy logic

Improve your searches with Soundex!

Looking to create a 'Did you mean' feature for your searches?

There's a very simple way to implement fuzzy logic that I have been using for years.

It requires...

  • A field capable of storing 4 characters. I use a CHAR(4) column in my SQL Server database.
  • A couple of functions to encode strings to Soundex and to compare the resulting codes, supplied below for C#; see the page source for a client-side JavaScript version.

The Soundex algorithm

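It works by reducing each word to a 4-character code. The first character is always the first letter of the word. No surprises there.

The remaining letters are then grouped by sound and each group is given a digit; the vowels and Y (AEIOUY), for instance, form a group that is never coded. See the functions below.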

  • Soundex codes always start with the first letter of the word, followed by three digits. The digits represent the first three remaining consonants in the surname. If there are not enough letters in the surname, zeros are added until there are three digits. If the surname is very long, the code is truncated to three digits. No matter how long or how short the surname, a Soundex code is always one letter followed by three digits.
  • Soundex Coding Guide (Consonants that sound alike have the same code)
    1 - B,P,F,V
    2 - C,S,G,J,K,Q,X,Z
    3 - D,T
    4 - L
    5 - M,N
    6 - R
  • The letters A, E, I, O, U, Y, H and W are not coded.
  • Adjacent letters that share the same number are coded once, as a single digit.
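
The rules above can be sketched compactly in C#. This is a minimal illustration under a few assumptions, not the class used on this page: the names SoundexSketch and Codes are made up for the example, only A-Z count as letters, and the adjacent-letter rule is applied to the first letter as well, which is how standard Soundex arrives at P236 for Pfister.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class SoundexSketch
{
    // One entry per letter A..Z, taken from the coding guide above.
    // '0' marks A, E, I, O, U, Y (never coded, but they separate repeats);
    // '-' marks H and W (ignored completely).
    private const string Codes = "0123012-02245501262301-202";

    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word)) return "";

        var result = new StringBuilder(4);
        char prev = '0'; // code of the previous letter that was not H or W

        foreach (char raw in word.ToUpperInvariant())
        {
            if (raw < 'A' || raw > 'Z') continue; // skip anything that is not A-Z

            char code = Codes[raw - 'A'];

            if (result.Length == 0)
            {
                result.Append(raw); // the first letter is kept as-is
                prev = code;
                continue;
            }

            if (code == '-') continue; // H and W do not break up a repeat

            // Append a digit unless it repeats the previous letter's digit.
            if (code != '0' && code != prev) result.Append(code);

            prev = code; // a vowel resets prev, so a later repeat is coded again
            if (result.Length == 4) break;
        }

        // Pad short codes with zeros; no letters at all gives an empty code.
        return result.Length == 0 ? "" : result.ToString().PadRight(4, '0');
    }
}
```

With this sketch, SoundexSketch.Encode("Robert") and SoundexSketch.Encode("Rupert") both return "R163", "Tymczak" returns "T522" (the C and Z collapse into one 2, and the K after the vowel is coded again), and "Lee" returns "L000".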

The C# Soundex Class

             
    using System;

    public class Soundex
    {
        /// <summary>
        /// Create a Soundex lookup string
        /// </summary>
        /// <param name="strIn">input text to encode</param>
        /// <returns>strIn, converted to Soundex format</returns>
        /// <remarks>For 'fuzzy' comparison of strings</remarks>
        public static string Encode(string strIn)
        {
            string EncodeRet = default;
            try
            {
                string strOut;
                int IntI;
                int intPrev = 0;
                Char strChar;
                int intChar;
                bool fPrevSeparator;
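                strIn = strIn.ToUpperInvariant();
                strOut = strIn.Substring(0, 1);
                // Seed the duplicate check with the first letter's own code, so that
                // a second letter sharing that code (the F in "Pfister") is skipped.
                intPrev = CharCode(strIn[0]);
                fPrevSeparator = intPrev == 0;
                var loopTo = strIn.Length;
                for (IntI = 2; IntI <= loopTo; IntI++)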
                {
                    // If the output string is full, quit now.
                    if (strOut.Length >= 4)
                    {
                        break;
                    }
                    // Get each character, in turn. If the
                    // character's a letter, handle it.
                    strChar = strIn[IntI - 1];
                    if (Char.IsLetter(strChar))
                    {
                        // Look up the Soundex digit for this letter.
                        intChar = CharCode(strChar);

                        // If the character's not empty, and if it's not
                        // the same as the previous character, tack it
                        // onto the end of the string.
                        if (intChar > 0)
                        {
                            if (fPrevSeparator || intChar != intPrev)
                            {
                                strOut = strOut + intChar;
                                intPrev = intChar;
                            }
                        }

                        fPrevSeparator = intChar == 0;
                    }
                }   // IntI
                // Return the string, right-padded with 0's.
                EncodeRet = strOut.PadRight(4, '0');
            }
            catch
            {
                return "";
            }

            return EncodeRet;
        }

        /// <summary>
        /// Get the soundex character code
        /// </summary>
        /// <param name="strChar">input</param>
        /// <returns>soundex character code</returns>
        /// <remarks>used by Encode()</remarks>
        private static int CharCode(char strChar)
        {
            int CharCodeRet = default;
            try
            {
                switch (strChar)
                {
                    case 'A':
                    case 'E':
                    case 'I':
                    case 'O':
                    case 'U':
                    case 'Y':
                        {
                            CharCodeRet = 0;
                            break;
                        }

                    case 'C':
                    case 'G':
                    case 'J':
                    case 'K':
                    case 'Q':
                    case 'S':
                    case 'X':
                    case 'Z':
                        {
                            CharCodeRet = 2;
                            break;
                        }

                    case 'D':
                    case 'T':
                        {
                            CharCodeRet = 3;
                            break;
                        }

                    case 'M':
                    case 'N':
                        {
                            CharCodeRet = 5;
                            break;
                        }

                    case 'B':
                    case 'F':
                    case 'P':
                    case 'V':
                        {
                            CharCodeRet = 1;
                            break;
                        }

                    case 'L':
                        {
                            CharCodeRet = 4;
                            break;
                        }

                    case 'R':
                        {
                            CharCodeRet = 6;
                            break;
                        }

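                    default:
                        {
                            // H, W and anything that is not A-Z: ignored, and
                            // (unlike the vowels) not treated as a separator.
                            CharCodeRet = -1;
                            break;
                        }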
                }
            }
            catch (Exception)
            {
            }
            return CharCodeRet;
        }

        /// <summary>
        /// Return a number between 0 and 4 (4 being the best) indicating the similarity between the Soundex
        /// representations of two strings
        /// </summary>
        /// <param name="strItem1">First string to compare</param>
        /// <param name="strItem2">Second string to compare</param>
        /// <param name="fIsSoundex">Are the strings already in Soundex format?</param>
        /// <returns>
        /// Integer between 0 (not similar) and 4 (very similar) indicating
        /// the similarity in the Soundex representation of the two strings.
        /// </returns>
        /// <remarks>Requires:   Encode</remarks>
        public static int dhSoundsLike(string strItem1, string strItem2, bool fIsSoundex)
        {
            int dhSoundsLikeRet = default;
            // Note:
            // This code is extremely low-tech. Don't laugh! It just compares
            // the two Soundex strings until it doesn't find a match, and returns
            // the position where the two diverged.
            // 
            // Remember, two Soundex strings are completely different if the
            // original words start with different characters. That is, this
            // function always returns 0 unless the two words begin with the
            // same character.

            try
            {
                int IntI;
                if (!fIsSoundex)
                {
                    strItem1 = Encode(strItem1);
                    strItem2 = Encode(strItem2);
                }

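                // C# strings are zero-based: compare positions 0..3 and stop at the
                // first mismatch (or at the end of a short or empty code).
                for (IntI = 0; IntI < 4 && IntI < strItem1.Length && IntI < strItem2.Length; IntI++)
                {
                    if (strItem1[IntI] != strItem2[IntI])
                    {
                        break;
                    }
                }   // IntI

                // IntI is now the number of leading characters that matched.
                dhSoundsLikeRet = IntI;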
            }
            catch (Exception)
            {
                // Treat any failure (for example a null argument) as "not similar";
                // dhSoundsLikeRet keeps its default of 0.
            }

            return dhSoundsLikeRet;
        }
    }
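
To show how the pieces fit into a 'Did you mean' feature, here is a rough in-memory sketch. Everything in it is hypothetical: SuggestionIndex and DidYouMean are names invented for this example, and the dictionary merely stands in for the indexed 4-character column in the database. The encoder is passed in as a delegate, on the assumption that you would hand it Soundex.Encode from the class above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: groups known words by their 4-character code,
// much as an indexed CHAR(4) column would in a database table.
public sealed class SuggestionIndex
{
    private readonly Func<string, string> _encode;
    private readonly Dictionary<string, List<string>> _byCode =
        new Dictionary<string, List<string>>(StringComparer.Ordinal);

    public SuggestionIndex(Func<string, string> encode, IEnumerable<string> knownWords)
    {
        _encode = encode ?? throw new ArgumentNullException(nameof(encode));

        foreach (string word in knownWords)
        {
            string code = _encode(word);
            if (code.Length == 0) continue; // the encoder could not handle this word

            if (!_byCode.TryGetValue(code, out var bucket))
            {
                bucket = new List<string>();
                _byCode[code] = bucket;
            }
            bucket.Add(word);
        }
    }

    // Returns the known words that share the query's code, minus the query itself.
    public IReadOnlyList<string> DidYouMean(string query)
    {
        string code = _encode(query);
        if (!_byCode.TryGetValue(code, out var bucket))
        {
            return Array.Empty<string>();
        }

        return bucket
            .Where(w => !string.Equals(w, query, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

A call such as new SuggestionIndex(Soundex.Encode, words).DidYouMean(userInput) would then return every stored word whose code equals the code of the search term. Passing the encoder in keeps the lookup independent of any one Soundex variant.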
        
    

Conclusion

It performs exceptionally well for something so deceptively simple. Your users' searches will be rewarded!