THE-SAS-MOM

Fuzzy Matching using Fuzzywuzzy

25/1/2019

Have you ever wanted to determine just how similar two strings are? The equality operator (==) only tells you whether two strings are identical or not. It gives no measure of how similar or dissimilar the two strings are.
Python’s Fuzzywuzzy library provides several ways to compute a similarity measure for two strings. Its fuzz module contains methods that compare two strings and return a value from 0 to 100 as a measure of similarity.
Strings that are completely different have a similarity score of 0, whereas strings that are completely identical have a similarity score of 100. Strings with some level of similarity fall between 0 and 100.

Fuzzywuzzy uses Levenshtein Distance to calculate the differences between sequences. Under the hood, its calculations rely on SequenceMatcher from Python's difflib library; when the optional python-Levenshtein package is installed, Fuzzywuzzy uses it instead for speed.
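
To make these two ideas concrete, here is a minimal sketch that uses only the standard library. The function names levenshtein and similarity_score are illustrative, not part of Fuzzywuzzy. The first counts single-character edits. The second scales difflib's SequenceMatcher ratio to the 0 to 100 range, which is roughly how a basic similarity score can be derived when python-Levenshtein is not installed.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity_score(a: str, b: str) -> int:
    """Scale difflib's ratio (0.0 to 1.0) to an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(levenshtein("kitten", "sitting"))                         # 3
print(similarity_score("this is a test", "this is a test!"))    # 97
print(similarity_score("abc", "xyz"))                           # 0
```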

The string comparison methods within the fuzz module of the Fuzzywuzzy library are:
  • QRatio
  • UQRatio
  • UWRatio
  • WRatio
  • partial_ratio
  • partial_token_set_ratio
  • partial_token_sort_ratio
  • ratio
  • token_set_ratio
  • token_sort_ratio

​The following code demonstrates how some of these methods work on strings of varying similarity.
Fuzzywuzzy also contains a module called process which contains methods for determining how similar one string is to a list of other strings.

The methods in the process module are as follows:
  • extract
  • extractBests
  • extractOne
  • extractWithoutOrder
  • dedupe

The following code demonstrates how some of these methods work.
​
The Fuzzywuzzy library has many use cases. Imagine trying to determine how similar names, addresses, telephone numbers and dates are between two datasets. This library can be used to find matching records between datasets where a similarity score makes more sense than a definitive True or False regarding equality.
​

Happy Learning!
Vanaudel Analytix

    Author

    My name is Vanessa Afolabi also known as @TheSASMom. I am a Data Scientist fluent in SAS, R, Python and SQL with a passion for Machine Learning and Research.