Issue 17751

Try to detect and merge nub species spelling variations

Reporter: mdoering
Type: Improvement
Summary: Try to detect and merge nub species spelling variations
Priority: Critical
Resolution: Fixed
Status: Closed
Created: 2015-08-06 00:41:32.729
Updated: 2016-07-22 16:28:58.525
Resolved: 2016-07-13 15:29:27.171
        
Description: The nub currently contains many very closely spelled names that immediately look like they must be the same because of their "soundex" similarity:

Within the Apiaceae children alone:

 - Tamamschjania rhizomatica
 - Tamamschjanella rhizomatica

 - Szovitsia
 - Szowitsia

 - Taenioplehrum
 - Taeniopleurum

 - Strebanthus
 - Streblanthus

 - Stenocaelium
 - Stenocoelium

http://www.gbif.org/species/3637597
http://www.gbif.org/species/6027022
    


Author: mblissett
Comment: POR-2824 is related to this.
Created: 2015-12-17 12:00:17.58
Updated: 2015-12-17 12:00:17.58


Author: rdmpage
Comment: [~mdoering@gbif.org] I'm facing the same issue in BioNames, as ION has a lot of names that are obviously spelling variants. I suspect some clever "blocking" (e.g., compare names within a genus, compare generic names that start with the same characters) followed by string comparison would help. There are also some fast approximate string comparison algorithms (agrep and simhash) that could be used.
Created: 2015-12-17 12:18:32.847
Updated: 2015-12-17 12:18:32.847
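
A minimal sketch of the "blocking followed by string comparison" idea from the comment above, using only the Python standard library. The function names, the three-letter prefix block and the 0.85 threshold are assumptions chosen for illustration; they are not taken from BioNames, ION or the GBIF code base. difflib's SequenceMatcher stands in for agrep or simhash, which are not used here.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(name, prefix_len=3):
    """Hypothetical blocking key: the first few letters of the name, lowercased."""
    return name.lower()[:prefix_len]


def candidate_variants(names, threshold=0.85, prefix_len=3):
    """Return (a, b, score) for name pairs in the same block that look alike."""
    # Step 1: blocking. Only names sharing a prefix are ever compared,
    # which avoids the full all-against-all comparison.
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name, prefix_len)].append(name)

    # Step 2: pairwise string comparison inside each block.
    pairs = []
    for members in blocks.values():
        for a, b in combinations(sorted(set(members)), 2):
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 3)))
    return pairs


# Generic names quoted in the issue description (Apiaceae children).
genera = [
    "Szovitsia", "Szowitsia",
    "Taenioplehrum", "Taeniopleurum",
    "Strebanthus", "Streblanthus",
    "Stenocaelium", "Stenocoelium",
]

for a, b, score in candidate_variants(genera):
    print(a, b, score)
```

With this input each of the four pairs from the description is reported as a candidate. A prefix block will miss variants that differ in their first letters, and a similarity threshold alone can overmatch, so the output is best treated as a list of candidates for review rather than as automatic merges.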


Author: mblissett
Created: 2015-12-17 13:32:29.535
Updated: 2015-12-17 13:32:29.535
        
I will also see if some of the ideas behind the Reconciliation Framework I was working on at Kew can be used: https://github.com/RBGKew/Reconciliation-and-Matching-Framework

Some taxonomists/botanists put a lot of work into those rules in an effort to avoid overmatching, but it's likely that the data Kew was cleaning is a lot tidier than what GBIF needs to handle.

    


Author: mdoering@gbif.org
Comment: ... and we deal with animal, plant, fungal and bacterial names. Maybe it is time to apply more specific rules to the individual kingdoms or codes.
Created: 2015-12-17 13:39:11.012
Updated: 2015-12-17 13:39:11.012


Author: mdoering@gbif.org
Created: 2016-05-12 11:15:34.997
Updated: 2016-05-12 11:15:34.997
        
Some more examples discovered by Scott in the new April 2016 backbone:
-----
1. Macrozamia platyrachis (http://www.gbif.org/species/4928834) vs. Macrozamia platyrhachis (http://www.gbif.org/species/2683551)

Here, the two spellings (with/without h) are both accepted, and both are exact matches. The sci. authority seems to differ only in spacing: F. M. Bailey vs. F.M.Bailey. The first is from GRIN taxonomy and the second from COL.

-----
2. Cycas circinalis (http://www.gbif.org/species/2683264 ) vs. Cycas circinnalis (http://www.gbif.org/species/3594916 )

Here, the two spellings (with 1 or 2 "n"'s) are accepted, and exact matches. The sci. authorities here are exactly the same. The first is from COL and the second from IPNI taxonomy.

-----
3. Isolona perrieri (http://www.gbif.org/species/3648546 ) vs Isolona perrierii (http://www.gbif.org/species/6308376 )

Here, the two spellings (with 1 or 2 "i"'s) are accepted, and exact matches. The sci. authorities here are exactly the same. The first is from TPL and the second from COL
    


Author: mdoering@gbif.org
Created: 2016-07-13 15:29:27.297
Updated: 2016-07-13 15:29:27.297
        
Implemented in https://github.com/gbif/checklistbank/commit/f7e9ea4d4669cd9b4489902b4111f1fac5d0fe00 using the SciNameNormalizer to merge similar canonical names. Different authors are still recognized as different names.

See https://github.com/gbif/checklistbank/blob/f7e9ea4d4669cd9b4489902b4111f1fac5d0fe00/checklistbank-common/src/main/java/org/gbif/checklistbank/utils/SciNameNormalizer.java#L21 and https://github.com/gbif/checklistbank/blob/master/checklistbank-common/src/test/java/org/gbif/checklistbank/utils/SciNameNormalizerTest.java#L11
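
For readers who do not want to open the linked source, here is a rough sketch of the general approach: reduce each canonical name to a spelling-insensitive key, and merge only when both that key and the author agree. This is not the SciNameNormalizer code. The two rules shown (collapse doubled letters, drop the "h" in "rh") are assumptions chosen to cover the examples reported earlier in this thread, and all function names and the "Author A" / "Author B" placeholders are hypothetical.

```python
import re
from collections import defaultdict


def normalize_canonical(name):
    """Hypothetical spelling-insensitive key for a canonical name."""
    key = name.strip().lower()
    key = re.sub(r"([a-z])\1+", r"\1", key)  # circinnalis -> circinalis, perrierii -> perieri
    key = key.replace("rh", "r")             # platyrhachis -> platyrachis
    return key


def normalize_author(author):
    """Ignore spacing and punctuation, e.g. 'F. M. Bailey' vs 'F.M.Bailey'."""
    return re.sub(r"[\s.,]+", "", author.lower())


def merge_variants(records):
    """Group (canonical, author) records by normalized name AND normalized author."""
    groups = defaultdict(list)
    for canonical, author in records:
        key = (normalize_canonical(canonical), normalize_author(author))
        groups[key].append(canonical)
    return dict(groups)


records = [
    ("Macrozamia platyrachis", "F. M. Bailey"),
    ("Macrozamia platyrhachis", "F.M.Bailey"),
    ("Cycas circinalis", "Author A"),   # placeholder author, not from the issue
    ("Cycas circinnalis", "Author A"),  # same placeholder author: merged
    ("Cycas circinalis", "Author B"),   # different placeholder author: kept apart
]

for key, names in merge_variants(records).items():
    print(key, names)
```

The five records fall into three groups: the two Macrozamia spellings, the two Cycas spellings sharing an author, and the Cycas record with a different author on its own. Rules this aggressive can overmatch on names that legitimately differ by a doubled letter, which is the risk raised earlier in the thread, so a real rule set would need to be tested against known distinct names.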