Issue 14916

Metasync the world resurrected xml datasets that were migrated to DwC archive

Reporter: jlegind
Type: Bug
Summary: Metasync the world resurrected xml datasets that were migrated to DwC archive
Priority: Major
Status: Open
Created: 2014-01-23 10:31:41.796
Updated: 2016-08-09 12:26:13.608
        
Description: On 17-12-2013 at least 56 datasets of type DIGIR_MANIS and a large number of DIGIR datasets were created in the registry, even though most (if not all) of them had already been migrated to DwC-A.

I think that once I have done the logical deletion of these illegal datasets, the deleted timestamp should prevent them from appearing again.
Kyle suggested setting the "Duplicate of dataset" reference instead, but that is extremely time-consuming.
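
To make the proposal above concrete, here is a minimal sketch of the check, assuming the sync step can also see logically deleted records. The `Dataset` fields, the `remote_id` identifier and the `sync_action` function are hypothetical names for illustration, not the registry's actual model or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Dataset:
    key: str
    installation_key: str
    remote_id: str  # hypothetical: the identifier the endpoint reports
    deleted: Optional[datetime] = None  # set on logical deletion


def sync_action(existing: List[Dataset], installation_key: str, remote_id: str) -> str:
    """Decide what a sync should do with a dataset advertised by an endpoint."""
    for d in existing:
        if (d.installation_key, d.remote_id) == (installation_key, remote_id):
            # A deleted timestamp means "stay deleted": do not recreate it.
            return "skip" if d.deleted is not None else "update"
    return "create"


registry = [
    Dataset("ds-1", "inst-a", "birds", deleted=datetime(2014, 1, 23)),
    Dataset("ds-2", "inst-a", "mammals"),
]
print(sync_action(registry, "inst-a", "birds"))    # deleted record found
print(sync_action(registry, "inst-a", "mammals"))  # live record found
print(sync_action(registry, "inst-a", "fishes"))   # never seen before
```

The check only helps if the lookup includes deleted records; a lookup that sees live datasets only would still report "create" for the first case.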

Here is an example:

Resurrected DiGIR_MANIS dataset created 2013-12-17T14:31:50.016+0000 http://registry.gbif.org/web/index.html#/dataset/5d22334e-7b09-43df-af00-f737b09b2e84

The valid DwC dataset created 2009-07-29T12:51:54.000+0000 http://registry.gbif.org/web/index.html#/dataset/96c93a1e-f762-11e1-a439-00145eb45e9a

This issue blocks efforts to create statistics.
    


Author: omeyn@gbif.org
Comment: This needs fixing before the next "metasync the world" run, but it isn't blocking stats anymore.
Created: 2014-02-10 10:39:40.34
Updated: 2014-02-10 10:39:40.34


Author: mblissett
Created: 2016-02-02 11:38:38.94
Updated: 2016-02-02 11:38:38.94
        
I've added a comment to the metasync-everything script referring to this issue, so we don't accidentally create the obsolete datasets a third time...

    


Author: kbraak@gbif.org
Created: 2016-08-09 12:26:13.608
Updated: 2016-08-09 12:26:13.608
        
[~jlegind@gbif.org] [~cgendreau]

Please note that I have made an [improvement|https://github.com/gbif/crawler/commit/a9dc806001ffca3e3c97c6f1e43c384b9d8848c5] to the [metasync-everything cli|https://github.com/gbif/gbif-configuration/blob/master/cli/common/util/metasync-everything], which now skips syncing installations that don't serve any datasets.

This will avoid resurrecting deleted datasets from installations that have had ALL their datasets migrated to another installation (e.g. from DiGIR to IPT). Unfortunately, it will not avoid resurrecting deleted datasets from installations that have had only some of their datasets migrated to another installation.
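
As a rough illustration of the skip rule and its limitation (this is not the code in the linked commit), the sketch below filters a list of installations down to those that still serve at least one dataset. The `Installation` class, its fields and the sample keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Installation:
    key: str
    kind: str  # hypothetical label, e.g. "DIGIR" or "IPT"
    dataset_keys: List[str] = field(default_factory=list)  # datasets still served


def installations_to_sync(installations: List[Installation]) -> List[Installation]:
    """Keep only installations that still serve at least one dataset."""
    return [i for i in installations if i.dataset_keys]


installations = [
    Installation("digir-1", "DIGIR", []),                # fully migrated: skipped
    Installation("digir-2", "DIGIR", ["ds-a"]),          # partially migrated: still synced
    Installation("ipt-1", "IPT", ["ds-b", "ds-c"]),
]
print([i.key for i in installations_to_sync(installations)])
```

Here `digir-2` is still synced, so any of its datasets that were migrated elsewhere and then deleted could come back. That is the partially migrated case the filter cannot catch.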

Running this script locally, 152 installations that don't serve any datasets were skipped. This includes the DiGIR MANIS installation referenced in the issue description above.

As long as we always remember to set the "Duplicate of dataset" flag on migrated datasets, it should be fine to run this script in production.

By running this script locally, I hope we can make a list of resurrected datasets and then set the "Duplicate of dataset" flag on as many of them as we can.