Issue 18187

Russian text breaks taxonomic and geographic parsing

Reporter: rdmpage
Type: Feedback
Summary: Russian text breaks taxonomic and geographic parsing
Priority: Major
Status: Open
Created: 2016-01-31 12:16:06.147
Updated: 2016-02-15 13:45:38.375
        
Description: The dataset "monocotyledonous_geophytes_of_fergana_valley" (http://www.gbif.org/dataset/1962b370-4fd5-4b22-a30e-0adb8570688d) is a mixture of English and Russian, which causes a number of parsing issues. For example, http://www.gbif.org/occurrence/1233581579 is flagged as "country invalid" even though the country is given as "Киргизия" (= Kyrgyzstan).

Likewise, the scientific name is given as "Allium cisferganense R.M. Fritsch", which is fine (see http://www.ipni.org/ipni/idPlantNameSearch.do?id=20007588-1 and http://www.zobodat.at/pdf/STAPFIA_0080_0381-0393.pdf; this occurrence is the type specimen for this name). The verbatim record has a mixture of English and Russian versions of the taxonomic names at different levels. I'm curious as to why the scientific name has not been parsed: does the presence of mismatches higher up the taxonomic tree cause problems?

This dataset clearly poses some issues for the GBIF parser, but at the moment most of its information is being lost because the parser doesn't handle Russian.


Author: mdoering@gbif.org
Created: 2016-02-01 11:55:36.542
Updated: 2016-02-01 11:55:36.542
        
The species match works if just the name is given:

http://api.gbif.org/v1/species/match?verbose=true&name=Allium%20cisferganense%20R.M.%20Fritsch

It still works if the higher classification down to family is given in Cyrillic, but with a substantial confidence penalty:
http://api.gbif.org/v1/species/match?verbose=true&name=Allium%20cisferganense%20R.M.%20Fritsch&kingdom=%D0%A0%D0%B0%D1%81%D1%82%D0%B5%D0%BD%D0%B8%D1%8F&phylum=%D1%86%D0%B2%D0%B5%D1%82%D0%BA%D0%BE%D0%B2%D1%8B%D0%B5&order=%D0%A1%D0%BF%D0%B0%D1%80%D0%B6%D0%B0%D1%86%D0%B2%D0%B5%D1%82%D0%BD%D1%8B%D0%B5&family=Amaryllidaceae%20Jaume
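
For anyone reproducing the two requests above, here is a minimal Python sketch that only builds the match URL and makes no network call. The endpoint and parameter names are taken from the URLs in this comment; `build_match_url` is a hypothetical helper, not part of any GBIF client library.

```python
from urllib.parse import quote, urlencode

# Endpoint as it appears in the URLs quoted in this comment.
MATCH_ENDPOINT = "http://api.gbif.org/v1/species/match"


def build_match_url(name, **classification):
    """Build a species-match URL; higher ranks are passed as keyword arguments."""
    params = {"verbose": "true", "name": name}
    params.update(classification)
    # quote_via=quote encodes spaces as %20 and non-ASCII text as UTF-8 percent-escapes.
    return MATCH_ENDPOINT + "?" + urlencode(params, quote_via=quote)


url = build_match_url(
    "Allium cisferganense R.M. Fritsch",
    kingdom="Растения",
    phylum="цветковые",
    order="Спаржацветные",
    family="Amaryllidaceae Jaume",
)
print(url)
```

Calling `build_match_url` with only the name yields the first request, so the two responses can be compared to see the confidence penalty.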


Author: mdoering@gbif.org
Created: 2016-02-01 11:56:21.656
Updated: 2016-02-01 11:56:21.656
        
I assume this comes from the authorship in the explicitly given term differing from the authorship string in scientificName:

SCIENTIFICNAME Allium cisferganense R.M. Fritsch
SCIENTIFICNAMEAUTHORSHIP Reinhard M. Fritsch


Author: mdoering@gbif.org
Comment: Looks like there is a bug in our occurrence interpretation code. It hardcodes Rank.SPECIES when matching names.
Created: 2016-02-01 12:56:36.033
Updated: 2016-02-01 12:56:36.033


Author: mblissett
Created: 2016-02-01 13:27:11.498
Updated: 2016-02-01 13:27:11.498
        
Country names:

At the moment, the country parser ignores every character that is not an ASCII letter. This was probably to handle cases like Zaïre / Za€Äre, but it means Киргизия won't work.

I'll have a look at all the occurrence data to see how often a country is given in a non-Latin script without an ISO code.
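
The effect described above can be illustrated with a small Python sketch. This is not the GBIF parser code: `ascii_only_key`, `unicode_key`, `lookup_country`, and the tiny `COUNTRY_NAMES` table are all hypothetical, and the Unicode-aware variant is just one possible alternative.

```python
import re
import unicodedata

# Hypothetical lookup table for illustration; a real parser has far more entries.
COUNTRY_NAMES = {"kyrgyzstan": "KG", "киргизия": "KG", "대한민국": "KR"}


def ascii_only_key(raw):
    """Keep ASCII letters only, mimicking the behaviour described above."""
    return re.sub(r"[^A-Za-z]", "", raw).upper()


def unicode_key(raw):
    """Keep letters from any script, normalised and case-folded."""
    normalised = unicodedata.normalize("NFKC", raw)
    return "".join(ch for ch in normalised if ch.isalpha()).casefold()


def lookup_country(raw):
    """Return an ISO code from the hypothetical table, or None."""
    return COUNTRY_NAMES.get(unicode_key(raw))


print(repr(ascii_only_key("Za\u00efre")))  # the accented letter is dropped
print(repr(ascii_only_key("Киргизия")))    # nothing is left to match on
print(lookup_country("Киргизия"))
```

With ASCII-only filtering a Cyrillic name collapses to an empty string, so no dictionary lookup can succeed; keeping letters from all scripts preserves something to match against.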


Author: mblissett
Created: 2016-02-01 17:12:51.137
Updated: 2016-02-01 17:12:51.137
        
That dataset is new, and is the only one with Russian in the country field and no country code. The provider is working on improving it, and hopefully they can add a country code fairly easily; then the country field won't matter.

There's also one dataset with "대한민국" (South Korea), but everything else uses Latin script or else gives a country code.

I won't do any more on this. However, in finding these cases I came across about 400,000 Latin-script names that we ought to handle, so I've added them to the parsing library (Russija, Ellás, Suomi, DDR, Nederland, Spanien...)
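
Why a country code makes the free-text field irrelevant can be sketched as follows, assuming an interpretation step that trusts a valid code first and falls back to the name only when no code is given. `interpret_country` and both tables are hypothetical and are not taken from the GBIF code base.

```python
# Hypothetical, deliberately tiny tables; a real parser needs full ISO 3166-1 coverage.
KNOWN_CODES = {"KG", "KR"}
KNOWN_NAMES = {"kyrgyzstan": "KG", "south korea": "KR"}


def interpret_country(country_code, country):
    """Prefer a recognised ISO code; otherwise try the free-text country name."""
    code = (country_code or "").strip().upper()
    if code in KNOWN_CODES:
        return code
    return KNOWN_NAMES.get((country or "").strip().casefold())


print(interpret_country("KG", "Киргизия"))   # the code wins; the name is never parsed
print(interpret_country(None, "Киргизия"))   # no code and an unknown name
```

Under this assumption, a record with a country code resolves regardless of the script used in the country field.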


Author: rdmpage
Comment: [~mblissett@gbif.org] Thanks for the update. I'm guessing we may have the occasional issue like this as GBIF adds more data providers from countries where Latin script is not the norm. Identifiers for the win ;)
Created: 2016-02-01 17:50:43.638
Updated: 2016-02-01 17:50:43.638