Portal / POR-2812

Try to detect and merge nub species spelling variations

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Checklistbank
    • Labels:

      Description

      The nub currently contains many very closely spelled names that immediately look like they must be the same name, due to their "soundex similarity":

      Within the Apiaceae children alone:

      • Tamamschjania rhizomatica
      • Tamamschjanella rhizomatica
      • Szovitsia
      • Szowitsia
      • Taenioplehrum
      • Taeniopleurum
      • Strebanthus
      • Streblanthus
      • Stenocaelium
      • Stenocoelium

      http://www.gbif.org/species/3637597
      http://www.gbif.org/species/6027022

          Activity

          Matthew Blissett added a comment -

          POR-2824 is related to this.

          Roderic D. M. Page added a comment -

          Markus Döring, I'm facing the same issue in BioNames, as ION has a lot of names that are obviously spelling variants. I suspect some clever "blocking" (e.g., compare names only within a genus, or compare generic names that start with the same characters) followed by string comparison would help. There are also some fast approximate string comparison algorithms (agrep and simhash) that could be used.
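A minimal sketch of the "blocking, then string comparison" idea from the comment above. It is an illustration only, not what BioNames or ChecklistBank do: the three-character genus prefix, the 0.85 threshold, and the helper names `block_key` and `candidate_variants` are all assumptions, and Python's standard `difflib.SequenceMatcher` stands in for agrep or simhash.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(name: str, prefix_len: int = 3) -> str:
    """Hypothetical blocking key: first characters of the genus, lower-cased."""
    genus = name.split()[0]
    return genus[:prefix_len].lower()


def candidate_variants(names, threshold=0.85):
    """Return (a, b, ratio) for names in the same block that look alike.

    Only names sharing a block key are compared, which avoids the full
    all-against-all comparison. The threshold is an assumed value.
    """
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)

    pairs = []
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= threshold:
                pairs.append((a, b, round(ratio, 3)))
    return pairs


# Generic names taken from the issue description.
names = [
    "Szovitsia", "Szowitsia",
    "Taenioplehrum", "Taeniopleurum",
    "Strebanthus", "Streblanthus",
    "Stenocaelium", "Stenocoelium",
]
pairs = candidate_variants(names)
for a, b, ratio in pairs:
    print(f"{a} ~ {b} ({ratio})")
```

A short prefix keeps blocks small but misses variants that differ in the first characters, so the prefix length trades recall against the number of comparisons.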

          Matthew Blissett added a comment -

          I will also see if some of the ideas behind the Reconciliation Framework I was working on at Kew can be used: https://github.com/RBGKew/Reconciliation-and-Matching-Framework

          Some taxonomists/botanists put a lot of work into those rules in an effort to avoid overmatching, but it's likely that the data Kew was cleaning is a lot tidier than what GBIF needs to handle.

          Markus Döring added a comment -

          ... and we deal with animal, plant, fungal and bacterial names. Maybe it is time to apply more specific rules to the individual kingdoms or nomenclatural codes.

          Markus Döring added a comment -

          Some more examples discovered by Scott in the new April 2016 backbone:


          1. Macrozamia platyrachis (http://www.gbif.org/species/4928834) vs. Macrozamia platyrhachis (http://www.gbif.org/species/2683551)

          Here, the two spellings (with/without h) are accepted, and exact matches. The sci. authority seems to differ with F. M. Bailey vs. F.M.Bailey. The first is from GRIN taxonomy and the second from COL.


          2. Cycas circinalis (http://www.gbif.org/species/2683264) vs. Cycas circinnalis (http://www.gbif.org/species/3594916)

          Here, the two spellings (with 1 or 2 "n"'s) are accepted, and exact matches. The sci. authorities here are exactly the same. The first is from COL and the second from IPNI taxonomy.


          3. Isolona perrieri (http://www.gbif.org/species/3648546) vs. Isolona perrierii (http://www.gbif.org/species/6308376)

          Here, the two spellings (with 1 or 2 "i"'s) are accepted, and exact matches. The sci. authorities here are exactly the same. The first is from TPL and the second from COL.
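The first example above differs in authorship only by spacing ("F. M. Bailey" vs. "F.M.Bailey"). As a rough sketch, and not a description of how GBIF compares authorships, such strings can be made comparable by dropping whitespace and case before comparing. The helper name `normalize_authorship` is hypothetical.

```python
import re


def normalize_authorship(author: str) -> str:
    """Hypothetical helper: remove all whitespace and lower-case the string."""
    return re.sub(r"\s+", "", author).lower()


a = normalize_authorship("F. M. Bailey")
b = normalize_authorship("F.M.Bailey")
print(a == b)
```

This handles spacing only. Abbreviated versus spelled-out author names would need separate rules.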

          Markus Döring added a comment -

          Implemented in https://github.com/gbif/checklistbank/commit/f7e9ea4d4669cd9b4489902b4111f1fac5d0fe00 using the SciNameNormalizer to merge similar canonical names. Different authors are still recognized as different names.

          See https://github.com/gbif/checklistbank/blob/f7e9ea4d4669cd9b4489902b4111f1fac5d0fe00/checklistbank-common/src/main/java/org/gbif/checklistbank/utils/SciNameNormalizer.java#L21 and https://github.com/gbif/checklistbank/blob/master/checklistbank-common/src/test/java/org/gbif/checklistbank/utils/SciNameNormalizerTest.java#L11
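To illustrate the general approach (merge names whose normalized canonical form is equal, while keeping different authors apart), here is a small sketch. It is not the SciNameNormalizer code linked above. The two rules used here ("rh" becomes "r", doubled letters collapse) are assumptions chosen only because they cover the three examples in this issue, and the authorships "Author A" and "Other Author" are placeholders, since the issue does not quote them.

```python
import re
from collections import defaultdict


def normalize_canonical(name: str) -> str:
    """Hypothetical normalizer: lower-case, 'rh' -> 'r', collapse doubled letters."""
    key = name.strip().lower()
    key = key.replace("rh", "r")
    key = re.sub(r"([a-z])\1+", r"\1", key)
    return key


def merge_candidates(records):
    """Group (canonical name, authorship) records by normalized name and author.

    Authorship is part of the key, so the same spelling variant under a
    different author is kept as a separate name.
    """
    groups = defaultdict(list)
    for canonical, author in records:
        author_key = re.sub(r"\s+", "", author).lower()
        groups[(normalize_canonical(canonical), author_key)].append(canonical)
    return dict(groups)


records = [
    ("Macrozamia platyrachis", "F. M. Bailey"),
    ("Macrozamia platyrhachis", "F.M.Bailey"),
    ("Cycas circinalis", "Author A"),       # placeholder authorship
    ("Cycas circinnalis", "Author A"),      # placeholder authorship
    ("Cycas circinalis", "Other Author"),   # invented record: different author
    ("Isolona perrieri", "Author A"),       # placeholder authorship
    ("Isolona perrierii", "Author A"),      # placeholder authorship
]
groups = merge_candidates(records)
for key, members in groups.items():
    print(key, members)
```

Rules this aggressive can overmatch (two genuinely different epithets may share a normalized form), which is the risk raised earlier in the thread about avoiding overmatching.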

            People

            • Assignee:
              Unassigned
              Reporter:
              Markus Döring
            • Votes:
              1
              Watchers:
              3
