Issue 16500

EBI dataset 'Geographically tagged INSDC sequences' has massive duplication of samples

Reporter: rdmpage
Type: Bug
Summary: EBI dataset 'Geographically tagged INSDC sequences' has massive duplication of samples
Priority: Major
Status: Open
Created: 2014-09-25 14:24:26.421
Updated: 2014-11-27 08:32:29.166
        
Description: The EBI dataset 'Geographically tagged INSDC sequences' has some fairly spectacular examples of duplication of occurrences. This results from treating individual sequences as occurrences. For example, there are 551,919 shotgun sequences from a single plant of Betula nana in Scotland (representing 92% of all Betula nana records in GBIF)!

I don't know how many other genome projects are in the EBI dataset, but this one example of Betula nana already represents 10% of the entire dataset.

I suggest we treat genomic data differently, and use BioSample or BioProject IDs for these, e.g. http://www.ncbi.nlm.nih.gov/bioproject/PRJEB576, so that a single organism that has its genome sequenced represents only a single occurrence. Unfortunately this means that a large chunk of newly minted occurrence URLs should be junked (or redirected to the single BioSample occurrence).
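
A minimal sketch of the grouping suggested above, assuming each sequence record carries a sample identifier. The field names `accession` and `sample_id` and the placeholder values are hypothetical, not the actual INSDC or GBIF schema:

```python
from collections import defaultdict


def collapse_by_sample(sequence_records):
    """Group per-sequence records into one occurrence per sample ID."""
    grouped = defaultdict(list)
    for record in sequence_records:
        # Assumption: fall back to the sequence accession when no sample ID
        # is known, so such sequences stay as individual occurrences.
        key = record.get("sample_id") or record["accession"]
        grouped[key].append(record["accession"])
    return [
        {"occurrence_id": key, "sequence_accessions": accessions}
        for key, accessions in grouped.items()
    ]


# Placeholder records: two sequences from one sample, one with no sample ID.
records = [
    {"accession": "SEQ0001", "sample_id": "SAMPLE_A"},
    {"accession": "SEQ0002", "sample_id": "SAMPLE_A"},
    {"accession": "SEQ0003", "sample_id": None},
]
occurrences = collapse_by_sample(records)
```

Under this scheme the two sequences from `SAMPLE_A` become one occurrence that lists both accessions, which is roughly what would happen to the half million Betula nana sequences.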

I've made this a "blocker" simply because if we are going to make a public announcement about this dataset it would be nice if the first thing people discovered wasn't half a million records in the Scottish Highlands.


Author: jlegind@gbif.org
Created: 2014-11-26 14:26:16.686
Updated: 2014-11-26 14:26:16.686
        
Issue downgraded from blocker.

We recognize the validity of this concern, but at the moment the way GBIF indexes data is not structured to make these distinctions.
The individual sequences could be mapped to an extension that points back to a single occurrence, as suggested, but that mapping would have to be done on the publisher side, and there needs to be a discussion on how to handle this.


Author: rdmpage
Comment: Thanks Jan, I realise that this is an issue that will need some work on the publishing side. The mapping between sequence databases and GBIF is not straightforward, and also raises the more general issue of clustering duplicates within GBIF.
Created: 2014-11-27 08:32:29.166
Updated: 2014-11-27 08:32:29.166