Issue 10904

Improve visibility of GenBank identifiers

Reporter: kbraak
Type: Improvement
Summary: Improve visibility of GenBank identifiers
Priority: Major
Status: Open
Created: 2012-02-21 18:02:35.32
Updated: 2013-12-17 15:12:34.942
        
Description: A GenBank ID can arrive via DwC 1.21 / 1.4 as GenBankNum, or in the new DwC as associatedSequences.

Sometimes it is provided in forms such as "GenBank: U34853.1", "U34853.1", or AF081981.

For an example of how they are currently displayed in our Portal, see the following records:

http://data.gbif.org/occurrences/147897204
http://data.gbif.org/occurrences/212403199

What should we do?

We should extract the unique identifier for the sequence (the accession number). There are many sequence banks, so I reckon the displayed identifier must clearly indicate that it is in fact a GenBank ID, like so:

GenBank: U34853.1
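
A minimal sketch of that normalisation, to make the idea concrete. The function name `normalize_genbank` is hypothetical, and the accession pattern is an assumption: it covers only the one-letter-plus-five-digits and two-letters-plus-six-digits forms seen in the examples above, with an optional version suffix. Other accession formats exist and would need extra patterns.

```python
import re

# Assumed pattern (not exhaustive): one letter + 5 digits, or two letters
# + 6 digits, optionally followed by ".<version>".
ACCESSION = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?$")

# Optional "GenBank" label, with or without a colon, in any letter case.
PREFIX = re.compile(r"^\s*genbank\s*:?\s*", re.IGNORECASE)


def normalize_genbank(raw):
    """Return 'GenBank: <accession>', or None if the value is not recognised."""
    # Drop surrounding whitespace and stray double quotes from the raw value.
    value = raw.strip().strip('"')
    # Remove a leading "GenBank:" label if the provider supplied one.
    value = PREFIX.sub("", value).strip().upper()
    if ACCESSION.match(value):
        return "GenBank: " + value
    return None
```

With this sketch, `"GenBank: U34853.1"`, `U34853.1` and `AF081981` all come out in the labelled form, while free text such as a collector's note returns `None` and can be handled separately.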

If the provider gives us something ambiguous, we should flag it and report it to them so that they can fix it at the source. We should then be able to link out to the actual record in GenBank and rely on the fact that their identifiers are persistent. For example,

http://www.ncbi.nlm.nih.gov/nucleotide/U34853.1
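
The flag-or-link step could look roughly like the sketch below. The function name `link_or_flag` and the issue messages are hypothetical. The accession pattern is an assumption that covers only the formats in the examples above, and the base URL is the one quoted in this description.

```python
import re
from urllib.parse import quote

# Base URL taken from the example link in this issue.
NUCLEOTIDE_BASE = "http://www.ncbi.nlm.nih.gov/nucleotide/"

# Assumed pattern (not exhaustive): one letter + 5 digits, or two letters
# + 6 digits, optionally followed by ".<version>".
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?\b")


def link_or_flag(raw):
    """Return (url, issue): exactly one of the two is None.

    A URL is built only when a single distinct accession is found;
    anything else is flagged so it can be reported to the provider.
    """
    # De-duplicate while keeping the original order of appearance.
    found = list(dict.fromkeys(ACCESSION.findall(raw.upper())))
    if len(found) == 1:
        return NUCLEOTIDE_BASE + quote(found[0]), None
    if not found:
        return None, "no accession number recognised"
    return None, "ambiguous value, multiple candidates: " + ", ".join(found)
```

Returning the issue text alongside the link keeps the reporting decision with the caller, which can collect flagged values per provider.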
    


Author: mdoering@gbif.org
Created: 2012-02-21 20:11:16.68
Updated: 2012-02-21 20:11:16.68
        
I don't think we have an Identifier parser yet, but it may be worthwhile.
Extracting a link from an anchor tag, and even parsing the id out of some well-known URL formats, can be done.
For example, the main DOI resolver, GenBank Entrez links, etc.
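
A rough sketch of what such a parser might do, not an existing GBIF API. The function name `parse_identifier`, the type labels, the list of recognised hosts and paths, and the DOI in the usage note are all assumptions for illustration.

```python
import re
from urllib.parse import urlparse, unquote

# Pulls the href value out of a simple <a ...> tag (illustrative only;
# a real HTML parser would be more robust).
HREF = re.compile(r'<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

# Assumed resolver hosts and NCBI path prefixes.
DOI_HOSTS = ("doi.org", "dx.doi.org")
NCBI_HOST = "www.ncbi.nlm.nih.gov"
NCBI_PATHS = ("nucleotide", "nuccore")


def parse_identifier(raw):
    """Return (type, id) for a recognised URL, else ("UNKNOWN", text)."""
    text = raw.strip()
    # If the value is an anchor tag, continue with its link target.
    match = HREF.search(text)
    if match:
        text = match.group(1)
    url = urlparse(text)
    host = (url.hostname or "").lower()
    path = unquote(url.path).strip("/")
    if host in DOI_HOSTS and path:
        return "DOI", path
    if host == NCBI_HOST:
        parts = path.split("/")
        if len(parts) == 2 and parts[0] in NCBI_PATHS and parts[1]:
            return "GENBANK", parts[1]
    return "UNKNOWN", text
```

For instance, `parse_identifier('<a href="http://dx.doi.org/10.1000/182">paper</a>')` yields the DOI without the resolver prefix, and the NCBI link from the description yields its accession number.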
    


Author: mdoering@gbif.org
Created: 2012-02-21 20:13:40.544
Updated: 2012-02-21 20:13:40.544
        
LSID resolvers too, see IdentifierType.guessType
http://code.google.com/p/gbif-common-resources/source/browse/gbif-common-api/trunk/src/main/java/org/gbif/api/model/vocabulary/IdentifierType.java