Issue 18384

Incorrect Hyperlink

18384
Reporter: feedback bot
Type: Feedback
Summary: Incorrect Hyperlink
Status: Reopened
Created: 2016-04-07 20:33:48.231
Updated: 2016-04-11 10:46:46.719
        
        
Description: Please take a look at the hyperlink for complete classification for this record, http://webcache.googleusercontent.com/search?q=cache:d0bjBgZx1FAJ:www.gbif.org/species/103050321+&cd=10&hl=en&ct=clnk&gl=ca

The hyperlink takes one to a different organism. The original hyperlink is for a scarab beetle; the complete classification link takes one to a species of dragonfly.
    


Author: mdoering@gbif.org
Comment: Indeed, the id is now pointing to a different record. But this is for the NCBI checklist dataset, not the GBIF Backbone. If a publisher does not keep its local identifiers stable, the GBIF ids will also change, as in this case. There is not much we can do about this: dwc:taxonID needs to be stable in the published source for GBIF to keep the non-backbone ids stable.
Created: 2016-04-07 21:55:13.912
Updated: 2016-04-07 21:55:13.912


Author: rdmpage
Created: 2016-04-08 09:10:30.333
Updated: 2016-04-08 09:10:30.333
        
[~mdoering@gbif.org] I think there's something a little more complicated going on here. NCBI taxon ids are pretty stable, so I'm surprised there is an issue like this. However, looking at the Darwin Core Archive for the NCBI taxonomy, it looks like there are two sets of ids: one is the NCBI tax_id (stable) and the other is a sequential number with the prefix "e" (e1, e2, etc.). The "e" ids are used for the synonyms, e.g.

e328876	Tetrathemis corduleformis	753953		misspelling
753953	Tetrathemis corduliformis		species		333463
e328877	Tetrathemis corduliformis Longfield, 1936	753953		authority

753953 is the tax_id. NCBI doesn't have distinct identifiers for synonyms (every name is linked to the same tax_id), so arbitrary ones have been created (e328876 and e328877, the 328876th and 328877th synonyms in this archive). Every time this file is generated (who does this?) the tax_ids are likely to be stable (NCBI does merge some occasionally, but not many), but the "e" ids will likely be different each time :(

GBIF generates taxa from each row in the NCBI Darwin Core archive, so some will have a tax_id as their taxonID and hence keep the same GBIF nub id, but those with an "e" prefix will likely change. This is what happened here: the id "e327195" ends up being assigned the same GBIF nub id even though the taxon is different. Hence, the user is rightly confused about why this GBIF page has changed (if they'd picked http://www.gbif.org/species/104648044 they wouldn't have seen any changes).

Maybe we can reopen this issue, as there's a problem with how the NCBI data is generated and parsed, and at the moment it pretty much guarantees that many NCBI taxa will have different nub ids with each update.
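
To illustrate the instability described above, here is a minimal Python sketch. It is not the actual archive generator; `assign_sequential_ids` and the extra row in `harvest_2` are hypothetical, and only the two Tetrathemis rows come from the archive excerpt. It assumes "e" ids are handed out purely by row position.

```python
def assign_sequential_ids(synonyms):
    """Give each (tax_id, name) synonym row an 'e<N>' id by its position."""
    return {f"e{i}": row for i, row in enumerate(synonyms, start=1)}

# First harvest: the two synonym rows quoted in this thread.
harvest_1 = [
    ("753953", "Tetrathemis corduleformis"),
    ("753953", "Tetrathemis corduliformis Longfield, 1936"),
]

# Second harvest: a hypothetical new synonym shows up earlier in the dump.
harvest_2 = [("9999999", "Hypothetical name")] + harvest_1

ids_1 = assign_sequential_ids(harvest_1)
ids_2 = assign_sequential_ids(harvest_2)

# The same "e1" now refers to a different name, so anything keyed on it
# (such as a GBIF id) silently points at a different taxon.
print(ids_1["e1"])
print(ids_2["e1"])
```

A single inserted or removed synonym shifts every later "e" id, which would be enough to produce the kind of mismatch reported here.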

    


Author: mdoering@gbif.org
Created: 2016-04-08 10:16:04.318
Updated: 2016-04-08 10:16:04.318
        
The NCBI dwc archive is actually generated by us via some old php code Mike Giddens was contracted to write years ago:
https://github.com/gbif/dwca-adapters

Could you think of another way to generate stable ids for the NCBI synonyms? If the name is unique we could use that, but we are likely creating a few non-unique records, which would render the archive invalid. Maybe use the accepted NCBI id and append the name?
    


Author: rdmpage
Comment: I think the accepted id and appended name makes sense as a simple fix that should keep the ids stable between harvests. The chance of someone stumbling across this bug is pretty remote, but clearly it happens.
Created: 2016-04-08 15:59:00.321
Updated: 2016-04-08 15:59:00.321
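
A minimal Python sketch of the "accepted id plus appended name" idea agreed on above. This is an illustration only, not the code in dwca-adapters; the function name `synonym_id`, the underscore separator, and the character normalisation are all assumptions.

```python
import re

def synonym_id(accepted_tax_id: str, name: str) -> str:
    """Build a synonym id from the accepted tax_id and the name string.

    The id depends only on its inputs, not on the row's position in the
    dump, so it stays the same between harvests as long as the accepted
    tax_id and the name are unchanged.
    """
    # Collapse anything that is not a letter or digit into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", name.strip()).strip("_")
    return f"{accepted_tax_id}_{slug}"

print(synonym_id("753953", "Tetrathemis corduleformis"))
print(synonym_id("753953", "Tetrathemis corduliformis Longfield, 1936"))
```

Two rows with the same accepted tax_id and an identical name string would still collide under this scheme, so a generator would need an additional check for that case before writing the archive.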