Issue 10904
Improve visibility of GenBank identifiers
Reporter: kbraak
Type: Improvement
Summary: Improve visibility of GenBank identifiers
Priority: Major
Status: Open
Created: 2012-02-21 18:02:35.32
Updated: 2013-12-17 15:12:34.942
Description: A GenBank ID can come in via DwC 1.21 / 1.4 as GenBankNum, or in the new DwC as associatedSequences.
It is provided in varying forms, such as "GenBank: U34853.1", "U34853.1", or AF081981.
For an example of how they are currently displayed in our Portal, see the following records:
http://data.gbif.org/occurrences/147897204
http://data.gbif.org/occurrences/212403199
What should we do?
We should extract the unique identifier for the sequence (the accession number). There are many sequence banks, so I reckon the identifier must clearly indicate that it is in fact a GenBank ID, like so:
GenBank: U34853.1
If the provider gives us something ambiguous, we should flag it and report it to them so that they can fix it at the source. We should then be able to link out to the actual record in GenBank and rely on the fact that their identifiers are persistent. For example,
http://www.ncbi.nlm.nih.gov/nucleotide/U34853.1
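The extract, label, and link steps described above could be sketched roughly as follows. This is only an illustration, not existing Portal code: the function name `normalize_genbank` is hypothetical, and the accession pattern (one or two letters, five to eight digits, optional version suffix) is a simplifying assumption that would need checking against the accession formats GenBank actually issues. Anything that does not match is returned as `None` so it can be flagged back to the provider.

```python
import re

# Assumed, simplified accession pattern: optional "GenBank:" prefix,
# then 1-2 letters, 5-8 digits and an optional ".version" suffix.
ACCESSION = re.compile(
    r"^(?:genbank\s*:?\s*)?([A-Za-z]{1,2}\d{5,8}(?:\.\d+)?)$",
    re.IGNORECASE,
)

# Link-out prefix taken from the example URL in this issue.
LINK_PREFIX = "http://www.ncbi.nlm.nih.gov/nucleotide/"


def normalize_genbank(raw):
    """Return a display label and link for a raw value, or None if ambiguous."""
    # Drop surrounding whitespace and stray double quotes from the provider value.
    cleaned = raw.strip(' \t"')
    match = ACCESSION.match(cleaned)
    if match is None:
        return None  # ambiguous: flag and report to the provider
    accession = match.group(1).upper()
    return {
        "label": "GenBank: " + accession,
        "url": LINK_PREFIX + accession,
    }


if __name__ == "__main__":
    for value in ['"GenBank: U34853.1"', "U34853.1", "AF081981", "see label"]:
        print(repr(value), "->", normalize_genbank(value))
```

A bare value such as AF081981 is accepted here on the assumption that the source field (GenBankNum) already implies GenBank; for associatedSequences, which may reference other sequence banks, a stricter rule that requires the prefix may be more appropriate.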
Author: mdoering@gbif.org
Created: 2012-02-21 20:11:16.68
Updated: 2012-02-21 20:11:16.68
I don't think we have an Identifier parser yet, but it may be worthwhile.
Extracting a link from an anchor tag and even parsing out the id from some well known URL formats can be done.
For example, the main DOI resolver, GenBank Entrez links, etc.
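A minimal sketch of such an identifier parser, using only the Python standard library. The names `extract_link` and `parse_identifier` are hypothetical, and the recognised hosts and path prefixes (`dx.doi.org`, `doi.org`, and `/nucleotide/` or `/nuccore/` on `www.ncbi.nlm.nih.gov`) are assumptions about which URL formats count as well known; the DOI in the usage example is a made-up placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every anchor tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_link(text):
    """Return the first href found in an anchor tag, else the stripped text."""
    collector = _HrefCollector()
    collector.feed(text)
    collector.close()
    return collector.hrefs[0] if collector.hrefs else text.strip()


def parse_identifier(text):
    """Return (scheme, identifier) for recognised URL formats, else None."""
    url = urlparse(extract_link(text))
    host = (url.hostname or "").lower()
    path = unquote(url.path).strip("/")
    # Assumed DOI resolver hosts; DOIs themselves start with "10.".
    if host in ("dx.doi.org", "doi.org") and path.startswith("10."):
        return ("DOI", path)
    # Assumed GenBank Entrez link shapes: /nucleotide/<id> or /nuccore/<id>.
    if host in ("www.ncbi.nlm.nih.gov", "ncbi.nlm.nih.gov"):
        parts = path.split("/")
        if len(parts) == 2 and parts[0] in ("nucleotide", "nuccore"):
            return ("GenBank", parts[1])
    return None


if __name__ == "__main__":
    print(parse_identifier('<a href="http://www.ncbi.nlm.nih.gov/nucleotide/U34853.1">seq</a>'))
    print(parse_identifier("http://dx.doi.org/10.1234/example"))  # placeholder DOI
```

Values that are neither an anchor tag nor a recognised URL yield `None`, so a caller could fall back to pattern-based parsing of bare identifiers.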