Issue 18370
Cyrillic names confuse species matching
18370
Reporter: mdoering
Type: Feedback
Summary: Cyrillic names confuse species matching
Priority: Minor
Status: Open
Created: 2016-04-06 11:08:00.584
Updated: 2016-07-25 16:55:25.25
Description: Even though the species name is given in perfect Latin, there is no species match for occurrences where all the higher classification is given in Cyrillic. The species matching should probably be modified to ignore non-Latin names, and we should at least try to look up kingdoms in non-Latin alphabets. Maybe a simple transliteration even works?
A species match with just the name is fine:
http://api.gbif.org/v1/species/match?verbose=true&kingdom=&phylum=&class=&order=&family=&name=Allium%20arkitense%20R.M.%20Fritsch
But once the Cyrillic higher taxa are added, the matching service thinks this is an entirely different classification; most importantly, the kingdom and family do not match up. The match still works, though, when the Latin family is given:
http://api.gbif.org/v1/species/match?verbose=true&kingdom=%D0%A0%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D1%8F&phylum=%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D0%B5&class=%D0%BE%D0%B4%D0%BD%D0%BE%D0%B4%D0%BE%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5&order=%D0%A1%D0%BF%D0%B0%D1%80%D0%B6%D0%B0%D1%86%D0%B2%D0%B5%D1%82%D0%BD%D1%8B%D0%B5&family=Amaryllidaceae%20Jaume&name=Allium%20arkitense%20R.M.%20Fritsch
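The suggestion to ignore non-Latin names could be tried on the client side before the matching service itself is changed. The following is only a minimal sketch under stated assumptions: `is_latin` and `build_match_url` are hypothetical helper names, not part of any GBIF library, and the script check simply relies on Unicode character names starting with "LATIN". It drops higher-rank values that contain non-Latin letters and builds a match URL from the rest.

```python
import unicodedata
from urllib.parse import urlencode

# Endpoint taken from the URLs in this issue.
MATCH_URL = "http://api.gbif.org/v1/species/match"


def is_latin(text):
    """Return True if every alphabetic character in text is a Latin-script letter.

    Assumption: a letter counts as Latin when its Unicode name starts with "LATIN".
    """
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True


def build_match_url(name, higher_taxa):
    """Build a species match URL, skipping empty or non-Latin higher taxa.

    higher_taxa is a dict such as {"kingdom": ..., "family": ...}.
    (Hypothetical helper, for illustration only.)
    """
    params = {"verbose": "true"}
    for rank, value in higher_taxa.items():
        if value and is_latin(value):
            params[rank] = value
    params["name"] = name
    return MATCH_URL + "?" + urlencode(params)


url = build_match_url(
    "Allium arkitense R.M. Fritsch",
    {"kingdom": "Растения", "family": "Amaryllidaceae Jaume"},
)
print(url)
```

With the values from this issue, the Cyrillic kingdom is dropped while the Latin family and the scientific name are kept, which mirrors the query without higher taxa that is reported above to match fine.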
I assume the match was not working when we did not strip authorships from higher taxa, see https://github.com/gbif/checklistbank/commit/36e9ac7f6a1e044fb8985f70dfca29a0ca282da9. Reindexing this dataset should solve it.
Author: mdoering@gbif.org
Created: 2016-04-06 13:30:29.721
Updated: 2016-04-06 13:30:29.721
Reprocessing matched all species, so this is a minor issue now:
http://www.gbif.org/dataset/1962b370-4fd5-4b22-a30e-0adb8570688d/stats