Issue 18507

Duplicated CHECKLIST datasets from GBIF Backbone Taxonomy

Reporter: cgendreau
Assignee: jlegind
Type: Bug
Summary: Duplicated CHECKLIST datasets from GBIF Backbone Taxonomy
Priority: Critical
Resolution: Fixed
Status: Closed
Created: 2016-05-30 16:43:13.582
Updated: 2016-07-04 12:42:56.395
Resolved: 2016-07-04 12:42:56.355
        
Description: Not sure what the workflow is, or even whether this is supposed to happen, but the duplicated datasets from the GBIF Backbone Taxonomy may introduce an issue from a DOI perspective when the DOI is assigned by GBIF.

e.g.
http://www.gbif.org/dataset/62740d38-7baa-4811-a622-fdd73d9e7140
http://www.gbif.org/dataset/e8ae399c-693f-49b2-b508-cbaceea819ad

To which page should the DOI resolve? If we need duplicated datasets for the GBIF Backbone Taxonomy, we should probably assign them different DOIs.

    


Author: cgendreau
Created: 2016-06-09 10:36:54.671
Updated: 2016-06-09 10:36:54.671
        
As of Registry 2.51 (currently running on production), the constituents of the Backbone Taxonomy are excluded from DOI handling (per configuration).

This means that changes to dataset e8ae399c-693f-49b2-b508-cbaceea819ad (the GBIF version) will not affect the DOI entry at DataCite, while changes to 62740d38-7baa-4811-a622-fdd73d9e7140 (the original one from Plazi) will update the metadata at DataCite.

This still needs to be validated by [~mdoering@gbif.org].
    


Author: mdoering@gbif.org
Comment: I am surprised to see Backbone dataset constituents in our registry at all. It was never planned for these to exist; rather, the backbone should reference the original datasets directly from ChecklistBank. Something strange apparently happened at nub build import time, as all constituents are from 2016-05-13 11:43.
Created: 2016-07-01 15:26:56.946
Updated: 2016-07-01 15:26:56.946


Author: mdoering@gbif.org
Created: 2016-07-01 15:40:35.0
Updated: 2016-07-01 15:40:49.679
        
None of these constituent datasets have an endpoint, so none can be indexed. I would suggest removing them all from the registry.

As the backbone has a DwC-A endpoint with constituent EML files inside, I suppose we indexed the backbone and synced its metadata. AFAIK the nub is only blocked at the checklist normalization stage. [~jlegind@gbif.org] can we add it to the list of rejected datasets?
    


Author: mdoering@gbif.org
Comment: blocked in crawler https://github.com/gbif/crawler/commit/4809b42adbaec15d6b2f955aa52cca678173817b
Created: 2016-07-01 15:52:22.828
Updated: 2016-07-01 15:52:22.828


Author: mdoering@gbif.org
Created: 2016-07-01 16:25:10.682
Updated: 2016-07-01 16:25:10.682
        
There are 3978 redundant datasets created in the registry as nub constituents:
select key from dataset where parent_dataset_key='d7dddbf4-2cf0-4f39-9b2a-bb099caae36c';
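The same set can also be enumerated without direct database access, via the registry web service. A minimal sketch, assuming the public `GET /v1/dataset/{key}/constituents` endpoint with standard offset/limit paging (the base URL and paging fields are assumptions about the current API, not taken from this issue):

```python
import json
import urllib.request

API = "https://api.gbif.org/v1"  # assumption: public GBIF registry API base URL
BACKBONE_KEY = "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c"  # parent key from the SQL above

def constituents_url(parent_key: str, offset: int = 0, limit: int = 100) -> str:
    """Build a paged URL for a dataset's constituent datasets."""
    return f"{API}/dataset/{parent_key}/constituents?offset={offset}&limit={limit}"

def list_constituent_keys(parent_key: str):
    """Yield the UUID keys of all constituent datasets, page by page."""
    offset, limit = 0, 100
    while True:
        with urllib.request.urlopen(constituents_url(parent_key, offset, limit)) as resp:
            page = json.load(resp)
        for record in page.get("results", []):
            yield record["key"]
        if page.get("endOfRecords", True):
            break
        offset += limit

# Usage (performs network I/O, so not executed here):
# keys = list(list_constituent_keys(BACKBONE_KEY))
```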

    


Author: mdoering@gbif.org
Comment: Deleted them all via the registry webservice and added the backbone UUID to the UAT & PROD crawl scheduler configs.
Created: 2016-07-04 12:42:34.672
Updated: 2016-07-04 12:42:34.672
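Deleting via the registry web service amounts to one authenticated DELETE per dataset key. A hedged sketch, assuming `DELETE /v1/dataset/{key}` with HTTP Basic auth (endpoint shape and auth scheme are assumptions; registry deletes are soft deletes, not physical removal):

```python
import base64
import urllib.request

API = "https://api.gbif.org/v1"  # assumption: production registry base URL

def delete_request(dataset_key: str, user: str, password: str) -> urllib.request.Request:
    """Build an authenticated DELETE for one dataset key."""
    req = urllib.request.Request(f"{API}/dataset/{dataset_key}", method="DELETE")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

# Usage (requires an account with registry editor rights, so not executed here):
# for key in constituent_keys:
#     urllib.request.urlopen(delete_request(key, "user", "pass"))
```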