Issue 18273

Broken links and unrecognised values in Brazilian dataset

Reporter: rdmpage
Type: Feedback
Summary: Broken links and unrecognised values in Brazilian dataset
Priority: Unassessed
Status: Open
Created: 2016-02-28 21:09:54.754
Updated: 2016-02-29 13:35:32.353
        
Description: The Brazilian Flora dataset http://www.gbif.org/dataset/aacd816d-662c-49d2-ad1a-97e66e2a2908 has some problems.

It looks like every link shown in the "Overview" section for each taxon is broken, e.g. for http://www.gbif.org/species/114811353 the link http://reflora.jbrj.gov.br/jabot/listaBrasil/FichaPublicaTaxonUC/FichaPublicaTaxonUC.do?id=FB115031 leads to a 404. I'm guessing that the site has been reorganised since these links were stored in the database.

The values for some standard fields are in Portuguese, and hence aren't recognised. For example, 63,932 have nomenclatural status "unknown" http://www.gbif.org/species/search?dataset_key=aacd816d-662c-49d2-ad1a-97e66e2a2908&issue=NOMENCLATURAL_STATUS_INVALID and 48,401 have taxonomic status "unknown" http://www.gbif.org/species/search?dataset_key=aacd816d-662c-49d2-ad1a-97e66e2a2908&issue=TAXONOMIC_STATUS_INVALID

This is likely because the source has values such as "NOME_ACEITO" for the field "DWC:TAXONOMICSTATUS".
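To illustrate why a Portuguese value ends up as "unknown", here is a minimal sketch of a status lookup. It is not the GBIF parser code: the table name, the function name, and the "ACCEPTED" target value are assumptions made for illustration, and "NOME_ACEITO" is the only source value taken from this report.

```python
# Hypothetical lookup table. Only "NOME_ACEITO" comes from the issue itself;
# mapping it to "ACCEPTED" is an assumption based on its literal meaning
# ("accepted name").
STATUS_MAP = {
    "NOME_ACEITO": "ACCEPTED",
}


def parse_taxonomic_status(raw):
    """Return a normalised status, or None when the value is not recognised."""
    if raw is None:
        return None
    # Normalise case and spacing so "nome aceito" and "NOME_ACEITO" match.
    key = raw.strip().upper().replace(" ", "_")
    return STATUS_MAP.get(key)


print(parse_taxonomic_status("NOME_ACEITO"))
print(parse_taxonomic_status("some other value"))
```

Any value missing from the table falls through to None, which is how an unmapped Portuguese term would surface as an invalid or unknown status.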
    


Author: mdoering@gbif.org
Created: 2016-02-29 10:36:28.287
Updated: 2016-02-29 10:36:28.287
        
This is strange. I indexed some just last week, and the few links I tested were fine then. I also updated our parsers, adding the missing Portuguese status and rank values:
https://github.com/gbif/parsers/commit/cd17665b11e521b9e4adcfdac441afb1a06ecd59#diff-12a813829089b209f064d22b079aa438R10

This is a link of a currently working species:
http://floradobrasil.jbrj.gov.br/reflora/listaBrasil/FichaPublicaTaxonUC/FichaPublicaTaxonUC.do?id=FB4248

Using this format the link works:
http://floradobrasil.jbrj.gov.br/reflora/listaBrasil/FichaPublicaTaxonUC/FichaPublicaTaxonUC.do?id=FB171
The one from the dwca does not:
http://reflora.jbrj.gov.br/jabot/listaBrasil/FichaPublicaTaxonUC/FichaPublicaTaxonUC.do?id=FB171

Publisher contacted.
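The difference between the two URL formats above is only the host and first path segment, so a rewrite could be sketched as below. This is an assumption about a possible workaround, not something GBIF or the publisher has implemented; the constant and function names are hypothetical.

```python
# Prefixes taken from the two FB171 links quoted above.
BROKEN_PREFIX = "http://reflora.jbrj.gov.br/jabot/"
WORKING_PREFIX = "http://floradobrasil.jbrj.gov.br/reflora/"


def rewrite_link(url):
    """Swap the old prefix for the working one; leave other URLs untouched."""
    if url.startswith(BROKEN_PREFIX):
        return WORKING_PREFIX + url[len(BROKEN_PREFIX):]
    return url


old = ("http://reflora.jbrj.gov.br/jabot/listaBrasil/"
       "FichaPublicaTaxonUC/FichaPublicaTaxonUC.do?id=FB171")
print(rewrite_link(old))
```

Whether every taxon id resolves under the new prefix has only been checked for the few links mentioned here.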
    


Author: mdoering@gbif.org
Created: 2016-02-29 13:14:56.227
Updated: 2016-02-29 13:15:32.561
        
Rod, I have reindexed the flora and most parsing problems have vanished:
http://www.gbif.org/dataset/aacd816d-662c-49d2-ad1a-97e66e2a2908/stats

The vernacular name issues are also caused by bad code, which I have just fixed but haven't deployed yet.
    


Author: rdmpage
Created: 2016-02-29 13:25:42.407
Updated: 2016-02-29 13:25:42.407
        
Hi Markus,

Hmmm, but the stats page shows a massive failure to match the backbone?!

http://api.gbif.org/v1/dataset/aacd816d-662c-49d2-ad1a-97e66e2a2908/metrics

{"key":8801,"datasetKey":"aacd816d-662c-49d2-ad1a-97e66e2a2908","usagesCount":116529,"synonymsCount":55666,"distinctNamesCount":114751,"nubMatchingCount":6435,"colMatchingCount":5733,"nubCoveragePct":5,"colCoveragePct":4,"countByConstituent":{},"countByKingdom":{"FUNGI":6018,"INCERTAE_SEDIS":9},"countByRank":{"SPECIES":48932,"GENUS":6318,"VARIETY":4137,"SUBSPECIES":863,"FAMILY":401,"ORDER":103,"CLASS":31,"TRIBE":28,"FORM":25,"PHYLUM":14,"SUBFAMILY":11},"countNamesByLanguage":{"PORTUGUESE":7467,"SPANISH":162,"ENGLISH":28,"DUTCH":3,"FRENCH":3},"countExtRecordsByExtension":{"DISTRIBUTION":180009,"REFERENCE":74590,"SPECIES_PROFILE":55491,"VERNACULAR_NAME":7803,"DESCRIPTION":0,"IDENTIFIER":0,"MULTIMEDIA":0,"TYPES_AND_SPECIMEN":0},"countByOrigin":{"SOURCE":114708,"MISSING_ACCEPTED":1821},"countByIssue":{"BACKBONE_MATCH_NONE":110094,"NOMENCLATURAL_STATUS_INVALID":72256,"VERNACULAR_NAME_INVALID":3013,"ACCEPTED_NAME_MISSING":1765,"DISTRIBUTION_INVALID":966,"ORIGINAL_NAME_USAGE_ID_INVALID":106,"ACCEPTED_NAME_USAGE_ID_INVALID":56,"CHAINED_SYNOYM":1},"otherCount":{},"created":"2016-02-29T11:38:24.308+0000","downloaded":"2016-02-29T11:21:16.124+0000"}
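A quick way to read the backbone figures out of that response is sketched below. The JSON here is a hand-trimmed subset of the metrics above (only the fields used), and the function name and output keys are hypothetical.

```python
import json

# Subset of the metrics response quoted above; all other fields omitted.
metrics_json = (
    '{"usagesCount":116529,"nubMatchingCount":6435,'
    '"countByIssue":{"BACKBONE_MATCH_NONE":110094}}'
)


def backbone_summary(metrics):
    """Compute matched and unmatched shares of all usages in the dataset."""
    total = metrics["usagesCount"]
    matched = metrics["nubMatchingCount"]
    unmatched = metrics["countByIssue"].get("BACKBONE_MATCH_NONE", 0)
    return {
        "matched_pct": 100.0 * matched / total,
        "unmatched_pct": 100.0 * unmatched / total,
        # True when matched and unmatched together cover every usage.
        "accounted_for": matched + unmatched == total,
    }


summary = backbone_summary(json.loads(metrics_json))
print(summary)
```

With these numbers the matched share is about 5.5%, in line with the reported nubCoveragePct of 5, and the matched and BACKBONE_MATCH_NONE counts add up exactly to usagesCount.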
    


Author: mdoering@gbif.org
Created: 2016-02-29 13:35:00.008
Updated: 2016-02-29 13:35:32.348
        
Yes, something is wrong in our indexing code: http://dev.gbif.org/issues/browse/POR-3057
No idea yet what that is ... previously there were 17000 names or so genuinely not found in our backbone.