Issue 18370

Cyrillic names confuse species matching

Reporter: mdoering
Type: Feedback
Summary: Cyrillic names confuse species matching
Priority: Minor
Status: Open
Created: 2016-04-06 11:08:00.584
Updated: 2016-07-25 16:55:25.25
        
        
Description: Even though the species name is given in perfect Latin, there is no species match for occurrences where all the higher classification is given in Cyrillic. The species matching should probably be modified to ignore non-Latin names, and we should at least try to look up kingdoms in non-Latin alphabets. Maybe a simple transliteration even works?
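
As a rough illustration of the "ignore non-Latin names" idea, the sketch below blanks out higher-rank values that contain non-Latin letters before they would be passed to the matcher. The helper names is_latin_script and drop_non_latin_ranks are hypothetical and not part of the GBIF matching service; the script check relies only on Unicode character names from Python's standard library.

```python
import unicodedata


def is_latin_script(text):
    """Return True if every alphabetic character in text is a Latin letter.

    Strings without any letters (including the empty string) count as Latin.
    """
    return all(
        unicodedata.name(ch, "").startswith("LATIN")
        for ch in text
        if ch.isalpha()
    )


def drop_non_latin_ranks(classification):
    """Blank out higher-rank values that are not written in Latin script."""
    return {
        rank: (value if is_latin_script(value) else "")
        for rank, value in classification.items()
    }


# Example values taken from this issue.
ranks = {
    "kingdom": "Растения",
    "family": "Amaryllidaceae Jaume",
}
cleaned = drop_non_latin_ranks(ranks)
```

With these inputs the Cyrillic kingdom is blanked and the Latin family is kept, so the matcher would only see evidence it can interpret.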

A species match with just the name is fine:
http://api.gbif.org/v1/species/match?verbose=true&kingdom=&phylum=&class=&order=&family=&name=Allium%20arkitense%20R.M.%20Fritsch

But once the Cyrillic higher taxa are added, the matching service thinks this is an entirely different classification; most importantly, the kingdom and family do not match up. The match still works, though, when the Latin family is given:

http://api.gbif.org/v1/species/match?verbose=true&kingdom=%D0%A0%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D1%8F&phylum=%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D0%B5&class=%D0%BE%D0%B4%D0%BD%D0%BE%D0%B4%D0%BE%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5&order=%D0%A1%D0%BF%D0%B0%D1%80%D0%B6%D0%B0%D1%86%D0%B2%D0%B5%D1%82%D0%BD%D1%8B%D0%B5&family=Amaryllidaceae%20Jaume&name=Allium%20arkitense%20R.M.%20Fritsch
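
For anyone wanting to reproduce these requests, here is a minimal sketch that builds the same query strings with Python's standard library. The function name build_match_url and the constants are made up for illustration; only the endpoint and parameter names come from the URLs above. No request is sent.

```python
from urllib.parse import quote, urlencode

MATCH_ENDPOINT = "http://api.gbif.org/v1/species/match"
RANKS = ("kingdom", "phylum", "class", "order", "family")


def build_match_url(name, classification=None):
    """Build a species match URL; ranks that are not given are sent empty."""
    classification = classification or {}
    params = {"verbose": "true"}
    for rank in RANKS:
        params[rank] = classification.get(rank, "")
    params["name"] = name
    # quote_via=quote encodes spaces as %20 rather than "+", and
    # non-ASCII text as percent-encoded UTF-8 bytes.
    return MATCH_ENDPOINT + "?" + urlencode(params, quote_via=quote)


name_only_url = build_match_url("Allium arkitense R.M. Fritsch")
cyrillic_url = build_match_url(
    "Allium arkitense R.M. Fritsch",
    {"kingdom": "Растения", "family": "Amaryllidaceae Jaume"},
)
```

The long %D0... sequences in the query are simply the UTF-8 bytes of the Cyrillic rank names, percent-encoded.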

I assume the match was not working when we did not strip authorships from higher taxa, see https://github.com/gbif/checklistbank/commit/36e9ac7f6a1e044fb8985f70dfca29a0ca282da9. Reindexing this dataset should solve it.
    


Author: mdoering@gbif.org
Created: 2016-04-06 13:30:29.721
Updated: 2016-04-06 13:30:29.721
        
Reprocessing matched all species, so this is a minor issue now:
http://www.gbif.org/dataset/1962b370-4fd5-4b22-a30e-0adb8570688d/stats
    


Author: mblissett
Created: 2016-07-25 16:55:25.25
Updated: 2016-07-25 16:55:25.25
        
kingdom=Растения
phylum=цветковые
class=однодольные
order=Спаржацветные

Transliterating these with an online transliterator gives:

kingdom=Rastenija (means plants)
phylum=cvetkovye (means flowering plants)
class=odnodol'nye (means monocots)
order=Sparzhacvetnye (means Asparagales)

So the Russian terms are vernacular names, not simple transliterations of the Latin ones.
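
Since transliteration alone does not produce Latin names, looking up higher taxa in non-Latin alphabets would need some kind of dictionary. A minimal sketch of that idea follows, assuming a hand-maintained table; VERNACULAR_TO_LATIN and to_latin_taxon are hypothetical names, and the table only contains the terms from this issue whose Latin equivalent is unambiguous (mapping "plants" to Plantae is my assumption).

```python
# Hypothetical lookup table keyed by case-folded vernacular name.
VERNACULAR_TO_LATIN = {
    "растения": "Plantae",            # "plants"; Plantae is assumed here
    "спаржацветные": "Asparagales",   # per the translation given above
}


def to_latin_taxon(value):
    """Return the Latin name for a known vernacular term, else None."""
    return VERNACULAR_TO_LATIN.get(value.strip().casefold())


kingdom = to_latin_taxon("Растения")
unknown = to_latin_taxon("цветковые")
```

Terms missing from the table come back as None, so a caller could fall back to ignoring that rank rather than sending an unmatched value.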