Issue 16504

fuzzymatch failure

Reporter: feedback bot
Assignee: mdoering
Type: Bug
Summary: fuzzymatch failure
Resolution: Fixed
Status: Closed
Created: 2014-09-26 19:30:35.051
Updated: 2015-05-12 10:47:18.182
Resolved: 2015-04-01 14:41:02.269
        
        
Description: http://www.gbif.org/occurrence/436266419
In this particular case, the fuzzy match replaces the wasp name (Eirone neocaledonica Williams) with a spider name (Erigone neocaledonica Kritscher, 1966). I know it looks similar enough, but the two have completely different classifications, authors and years of publication. GBIF does not take this into account and replaces one species name with another.
What is worse, the original data are not visible. It took 3 people looking at the page several times before we could understand what was going on and find the tiny "verbatim content" hyperlink to see the difference.
This is not a very good implementation of the fuzzy match, and not a very intuitive representation of the replaced data. I do like the idea of cleaning the data, but please do not hide the originals; put them on the same screen so that people can see the difference.

*Reporter*: Dmitry Dmitriev
*E-mail*: [mailto:arboridia@gmail.com]

Attachment lookup difference.xls


Author: mdoering@gbif.org
Created: 2014-09-29 10:28:09.095
Updated: 2014-09-29 10:28:09.095
        
Indeed a wrong match, but not that obvious. The wasp and spider classifications share the same kingdom and phylum, and the names are very much alike. Together with the fact that our backbone lacks the wasp species name, this leads to the false fuzzy match:

http://api.gbif.org/v1/species/match?verbose=true&kingdom=Animalia&phylum=Arthropoda&class=Insecta&order=Hymenoptera&family=Tiphiidae&genus=Eirone&name=Eirone%20neocaledonica%20Williams

If you match the genus name alone (which exists in our backbone) we hit the correct genus:
http://api.gbif.org/v1/species/match?verbose=true&kingdom=Animalia&phylum=Arthropoda&class=Insecta&order=Hymenoptera&family=Tiphiidae&genus=Eirone&name=Eirone
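
For anyone wanting to reproduce the two lookups above, here is a minimal sketch that only builds the query URLs. The endpoint and parameter names are taken from the links in this comment; the helper name `match_url` is made up for illustration, and the sketch does not call the service.

```python
from urllib.parse import urlencode, quote

# Endpoint as it appears in the links above.
BASE = "http://api.gbif.org/v1/species/match"

# Classification of the wasp record, as used in both queries above.
WASP_CLASSIFICATION = {
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Hymenoptera",
    "family": "Tiphiidae",
    "genus": "Eirone",
}


def match_url(name, classification):
    """Build a verbose match URL (hypothetical helper, no network access)."""
    params = {"verbose": "true", **classification, "name": name}
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return BASE + "?" + urlencode(params, quote_via=quote)


species_url = match_url("Eirone neocaledonica Williams", WASP_CLASSIFICATION)
genus_url = match_url("Eirone", WASP_CLASSIFICATION)
```

The resulting URLs can be opened in a browser or fetched with any HTTP client to compare the species-level and genus-level matches.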

Maybe a solution to reduce false matches is to query for the genus first (as we nearly always have the genus) and only then limit the species matches to that genus?
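
A rough sketch of what such a genus-first lookup might look like. This is not the checklistbank code: the `backbone` structure, the function names and the threshold are assumptions for illustration, and `difflib.SequenceMatcher` is only a stand-in for the real similarity measure.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the real matcher uses its own measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_genus_first(name, genus, backbone, threshold=0.9):
    """Resolve the genus first, then only consider species inside that genus.

    backbone: hypothetical dict mapping a genus name to the species names
    known for it. Returns the best species name, the genus name if no
    species is close enough, or None if the genus itself is unknown.
    """
    if genus not in backbone:
        return None
    best = max(backbone[genus], key=lambda c: similarity(name, c), default=None)
    if best is not None and similarity(name, best) >= threshold:
        return best
    return genus  # fall back to the genus instead of jumping to another genus


# Toy backbone mirroring this issue: the wasp genus exists but has no species.
backbone = {
    "Eirone": [],
    "Erigone": ["Erigone neocaledonica"],
}

wasp = match_genus_first("Eirone neocaledonica", "Eirone", backbone)
spider = match_genus_first("Erigone neocaledonica", "Erigone", backbone)
```

With this restriction the wasp record would fall back to the genus Eirone rather than being matched to the spider species.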


Author: mdoering@gbif.org
Comment: Added clb test https://github.com/gbif/checklistbank/blob/bf491583e34ed506bf6c5b895f42b15b187b97aa/checklistbank-nub/src/test/java/org/gbif/nub/lookup/NubMatchingServiceImplIT.java#L344
Created: 2015-03-20 14:18:49.89
Updated: 2015-03-20 14:18:49.89


Author: mdoering@gbif.org
Created: 2015-03-24 10:02:46.194
Updated: 2015-03-24 10:02:46.194
        
Instead of writing a whole new matching algorithm I swapped out the Jaro Winkler distance in favor of a classic modified Damerau Levenshtein distance and tightened up the classification match too. The Jaro Winkler distance becomes too fuzzy for long name strings and sometimes produces far too similar values for rather distant name strings (even though it is a preferred distance measure for matching people's names):

https://github.com/gbif/checklistbank/commit/252ae5810ad8e13fe25cfecfd0296ae574895337
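
To illustrate the kind of distance meant here, below is a minimal sketch of the optimal string alignment variant of the Damerau Levenshtein distance. It is a textbook version and not the modified one from the commit above, whose details are not reproduced here.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and transpositions of adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ir" <-> "ri".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# The two genus names from this issue are two edits apart:
# one transposition ("ir" -> "ri") plus one inserted "g".
genus_distance = osa_distance("Eirone", "Erigone")
```

An absolute edit count like this stays meaningful for long strings, whereas a ratio-style score can look deceptively high when most of a long name is shared.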

With these changes all new tests match correctly. I will do a new dev lookup now and compare results with the existing lookup.

Author: rdmpage
Created: 2015-03-24 10:24:24.973
Updated: 2015-03-24 10:24:24.973

Hi Markus,

In my experience part of the problem is a lack of names, that is, the name being matched is a real name, but is not in GBIF, hence fuzzy match is employed and it sometimes gets the wrong hit. One way to tackle this, obviously, is to grab more names.

Another might be to understand what the expected distribution of fuzzy matches would look like. Given existing names in GBIF, do we have a sense of what is the likelihood of two "real" names differing by, say, two edits? Is this partly a function of the characters in the strings (some misspellings are more likely than others)? Maybe the work involved in doing this would outweigh any benefits, but it might be interesting to be able to assign some sort of probability that two names are alternative spellings rather than likely to be different names. I guess Tony Rees would have something to say on this.

I gather that Google, despite Peter Norvig's elegant essay (http://norvig.com/spell-correct.html), actually relies a lot on analysing people's responses to query results. E.g., if they type in a query and Google suggests an alternative spelling, do people click on that? If people type in a query, then another variation on that query and get more hits or better results (indicated by them clicking on the results), then that suggests an alternative spelling. Again, work would be needed, but there's potentially a lot of data in what GBIF's users are doing that might help.

Rod

Author: mdoering@gbif.org
Created: 2015-03-24 10:48:35.787
Updated: 2015-03-24 10:48:35.787

Hi Rod,
Yes, I agree it would be really useful to better understand how alike our names, together with their classification, actually are. This is a little difficult though, as we do not have a good reference dataset in which we actually know whether a name is a real one or just a variation.

I will nevertheless try to do some analysis of the use of the different name strings and their similarities.
Basic metrics like occurrence count by string length, monomial vs binomial vs trinomial, and hopefully for each name also the shortest edit distance (Levenshtein or Damerau Levenshtein) to the next most similar name. That distance would be great to get for all names and also just for names within the same kingdom.
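
A small sketch of the nearest-neighbour metric described above, assuming plain Levenshtein distance and a brute-force comparison of all pairs. The function names are made up for illustration, and the quadratic loop would need blocking or an index to run over a full backbone.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def nearest_neighbour_distances(names):
    """Map each name to its edit distance from the most similar other name."""
    result = {}
    for name in names:
        others = [levenshtein(name, other) for other in names if other != name]
        result[name] = min(others) if others else None
    return result


# Toy input; a real run would use all backbone names, or one kingdom at a time.
distances = nearest_neighbour_distances(["Eirone", "Erigone", "Puma concolor"])
```

A histogram of these values, split by kingdom or by name length, would show how often two distinct accepted names sit within one or two edits of each other.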

As for Tony Rees, I have ported the character transpositions found in his Taxon Match algorithm to a Lucene filter which has been in use by GBIF for some time already:
https://github.com/gbif/checklistbank/blob/master/checklistbank-nub/src/main/java/org/gbif/nub/lookup/ScientificNameSoundAlikeFilter.java

But I would like to get more clever in the future by not only translating some characters into others, but also using different distance scores for different character classes, e.g. a transposition of i, e or y should be less distant than one between consonants.
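
One way this idea could be sketched is an edit distance whose substitution cost depends on the character class. The cost values and the set of "cheap" letters below are assumptions chosen for illustration (reading "transposition" as swapping one of these letters for another), not anything implemented in checklistbank.

```python
# Letters that are often interchanged in name variants (assumed set).
VOWEL_LIKE = set("iey")


def substitution_cost(x, y):
    """Cheaper substitution within the vowel-like class (assumed weights)."""
    if x == y:
        return 0.0
    if x in VOWEL_LIKE and y in VOWEL_LIKE:
        return 0.5
    return 1.0


def weighted_distance(a, b):
    """Levenshtein distance with class-dependent substitution costs."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1.0,                          # deletion
                cur[j - 1] + 1.0,                       # insertion
                prev[j - 1] + substitution_cost(ca, cb),
            ))
        prev = cur
    return prev[-1]


vowel_variant = weighted_distance("sylvestris", "silvestris")  # y vs i
consonant_variant = weighted_distance("alba", "alma")          # b vs m
```

Here the y/i spelling variant scores as closer than a consonant change of the same length, which is the behaviour described above.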

We also normalize names to some degree before we apply the matching by piping them through our name parser first. From the parsed name we construct a canonical name without authorship or rank marker and decompose Unicode letters, in particular ligatures:
https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/model/checklistbank/ParsedName.java#L327
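
As a rough illustration of the Unicode step only (not of the name parser, and not the Java code linked above), here is a sketch using Python's standard `unicodedata` module. The ligature table and the function name are assumptions; it is needed because letters such as æ and œ have no Unicode decomposition of their own.

```python
import unicodedata

# Ligatures with no Unicode decomposition, mapped by hand (assumed table).
LIGATURES = {"æ": "ae", "Æ": "Ae", "œ": "oe", "Œ": "Oe"}


def normalize_name(name):
    """Expand ligatures, decompose letters and drop combining marks."""
    for ligature, replacement in LIGATURES.items():
        name = name.replace(ligature, replacement)
    # NFKD splits accented letters into base letter + combining mark and
    # expands compatibility ligatures such as U+FB01 (fi).
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


ligature_example = normalize_name("Hæmatopus")
diaeresis_example = normalize_name("Isoëtes")
```

Applying the same normalization to both the query and the indexed names means such spelling variants no longer count as edits at all.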


Author: mdoering@gbif.org
Comment: Trying the modified new lookup against 56,000 specimens from the NHM London, there are 3030 (6%) different results; see the attached lookup difference.xls. Many are names with undetermined species of the form "Genus sp.", which was a bug that is now fixed. All results scanned so far seem to be improvements.
Created: 2015-03-24 23:04:10.208
Updated: 2015-03-24 23:07:47.537


Author: mdoering@gbif.org
Created: 2015-04-01 14:40:14.141
Updated: 2015-04-01 14:40:14.141
        
Version 2.13 of clb now contains the new lookup code and is deployed to prod. Reinterpretation of occurrences with changed match results is on its way (5.3 million records).

All commits related to the new nub index:
https://github.com/gbif/checklistbank/commit/252ae5810ad8e13fe25cfecfd0296ae574895337

tests:
https://github.com/gbif/checklistbank/commit/ab8ddd7ba974394b6a9869dc9109bc330a4df777
https://github.com/gbif/checklistbank/commit/7d4d56c1fd414a2a1505217743177ac676f52c05
https://github.com/gbif/checklistbank/commit/a6f77b66df1f0eaf0c0f3d5d334176ad2474a14b
https://github.com/gbif/checklistbank/commit/bf491583e34ed506bf6c5b895f42b15b187b97aa


Author: mdoering@gbif.org
Comment: Changes documented at http://gbif.blogspot.de/2015/03/improving-gbif-backbone-matching.html
Created: 2015-04-01 14:40:52.082
Updated: 2015-04-01 14:40:52.082


Author: mdoering@gbif.org
Comment: correctly interpreted to the genus now: http://www.gbif.org/occurrence/924390052/verbatim
Created: 2015-04-23 09:25:44.812
Updated: 2015-04-23 09:26:01.212