Issue 11496

Entries marked as deleted in the current HIT: make sure the information is transferred to the new version

Reporter: ahahn
Assignee: kbraak
Type: Task
Summary: Entries marked as deleted in the current HIT: make sure the information is transferred to the new version
Description: The HIT allows operators to be flagged as deleted (available filter: http://hit.gbif.org/datasource/list.html?filter=deleted). The purpose is to maintain a history, so that operators once deleted are not created again, and also to allow operators to be reinstated if a deletion turns out to have been accidental or erroneous. Deleted operators include those of datasets that were removed (e.g. in migrating from one publishing package to another), as well as flags for resources that continue to be served in duplicate by a single publisher or across the network. The latter type will not disappear from the registry, but will eventually be cross-referenced. We need to ensure that past decisions on which of the two duplicates to remove are honored, and that operators and datasets do not reappear after migration to the dataset-aware registry. As the deleted list in the HIT is the only existing documentation of these decisions, the list needs to be ported to the new HIT version: it is next to impossible to reproduce the history of such deletions otherwise.
Priority: Critical
Resolution: WontFix
Status: Closed
Created: 2012-06-26 14:09:12.864
Updated: 2012-10-15 10:37:45.231
Resolved: 2012-10-15 10:37:45.204

Attachment ignored-registrations-june30-2012.xlsx
Attachment ignored_sep19_2012.xlsx

Author: kbraak@gbif.org
Created: 2012-06-30 21:12:37.916
Updated: 2012-06-30 21:12:37.916

I wanted to clear this issue up before leaving.

I refer to such deleted entries as "ignored": they can still exist in the Registry, but we never want to harvest or index them.

In total, I found 99 ignored datasets, and 4 technical installations that should be ignored. Please see the spreadsheet attached, with 2 sheets: 1 for the technical installations, and the other for the datasets.

This information was generated from a script ( http://code.google.com/p/gbif-indexingtoolkit/source/browse/trunk/harvest-service/src/main/java/org/gbif/harvest/registry/IgnoredBioDatasourceSynchronizer.java ) and then manually curated to remove duplicates. The script works by iterating over the OLD HIT's 1104 deleted BioDatasources and checking whether each one's related technical installation or dataset still exists in the Dataset-Aware Registry. If it does, it is written to an output file.
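
The check described above can be sketched roughly as follows. This is a minimal Python illustration, not the linked Java class: the `DeletedBioDatasource` record, the `find_ignored` and `write_report` functions, and the key sets standing in for the Registry are all hypothetical names introduced here. Unlike the original workflow, where duplicates were removed by hand, this sketch drops repeated entries in code.

```python
import csv
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeletedBioDatasource:
    """Hypothetical stand-in for one deleted entry in the OLD HIT."""
    name: str
    dataset_key: Optional[str] = None
    installation_key: Optional[str] = None


def find_ignored(deleted, registry_datasets, registry_installations):
    """Return (kind, key, name) rows for deleted entries still in the registry.

    registry_datasets and registry_installations are assumed to be sets of
    keys currently present in the Dataset-Aware Registry.
    """
    seen = set()
    rows = []
    for entry in deleted:
        candidates = (
            ("dataset", entry.dataset_key, registry_datasets),
            ("installation", entry.installation_key, registry_installations),
        )
        for kind, key, registry_keys in candidates:
            # Skip entries with no related key or whose key has left the registry.
            if key is None or key not in registry_keys:
                continue
            # Report each dataset or installation only once.
            if (kind, key) in seen:
                continue
            seen.add((kind, key))
            rows.append((kind, key, entry.name))
    return rows


def write_report(rows, out):
    """Write the rows as CSV to an open text file-like object."""
    writer = csv.writer(out)
    writer.writerow(["kind", "key", "name"])
    writer.writerows(rows)
```

Splitting the rows by `kind` would give the two sheets of the attached spreadsheet, one for technical installations and one for datasets.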

Of course the OLD HIT's list of deleted BioDatasources may change, and so too might the Dataset-Aware Registry, so the script might have to be re-run later.

I would recommend going over the spreadsheet and double checking that its entries indeed should be ignored in the Super HIT. I did some validation on a dozen or so, and I am fairly confident in the script. If changes need to be made to the script, just let me know.

For anyone else interested in running it, just configure the oldHit params in the application.properties file and run IgnoredBioDatasourceSynchronizerTest.testSync().

I am going to mark the issue as resolved, assuming that, if the Super HIT goes into production, these entries can be manually flagged as deleted. I could write something that does this automatically later if you want; please just open another issue. Thanks


Author: kbraak@gbif.org
Comment: Ignored list generated from OLD HIT on June 30, 2012. Closing issue.
Created: 2012-06-30 21:14:00.668
Updated: 2012-06-30 21:14:00.668


Author: ahahn@gbif.org
Comment: This issue is not closed yet - the deleted list from the old HIT is not available at http://hit.gbif.org/datasource/list.html?filter=deleted. One previously deleted dataset has now already been re-added to the index db. Flagging these deletions manually is going to take a lot of time, as we have to go through a metadata update for each publisher, then go through and try to identify the correct datasets (uuid does not help there). There is a lot of potential to mess up the portal this way if something gets overlooked in this week-long process. Is there really no way to import this?
Created: 2012-09-12 17:04:35.64
Updated: 2012-09-12 17:10:41.652


Author: kbraak@gbif.org
Comment: Once the tag predicate has been added to the GBIF-API, I will work on tagging the datasets and technical installations in the Registry that we have chosen to ignore.
Created: 2012-09-13 15:37:04.438
Updated: 2012-09-13 15:37:04.438


Author: kbraak@gbif.org
Comment: The tag predicate has now been added to the GBIF-API, and the Super-HIT has been adapted to recognize it and not create BioDatasources for entries that have been tagged as ignore. The last step now is to tag the datasets in the registry that should be ignored.
Created: 2012-09-17 17:06:33.481
Updated: 2012-09-17 17:06:33.481
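
For illustration, the filtering behaviour described in the comment above might look something like the sketch below. It is not the Super-HIT's actual code or the GBIF-API's tag model: the `IGNORE_TAG` value, the `tags` field, and the dictionary layout of a registry entry are assumptions made only for this example.

```python
IGNORE_TAG = "ignore"  # assumed tag value; the real predicate/value may differ


def is_ignored(entry):
    """True if a registry entry (dataset or installation) carries the ignore tag."""
    return IGNORE_TAG in entry.get("tags", ())


def entries_to_harvest(registry_entries):
    """Keep only the entries for which a BioDatasource should be created."""
    return [entry for entry in registry_entries if not is_ignored(entry)]


# Example with made-up registry entries.
entries = [
    {"key": "d1", "title": "Kept dataset", "tags": []},
    {"key": "d2", "title": "Duplicate dataset", "tags": ["ignore"]},
    {"key": "d3", "title": "Untagged dataset"},
]
kept_keys = [entry["key"] for entry in entries_to_harvest(entries)]
```

With a filter of this kind, an ignored entry stays in the Registry but never reaches harvesting or indexing, which matches the intent stated earlier in this issue.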


Author: kbraak@gbif.org
Created: 2012-10-12 14:52:36.327
Updated: 2012-10-12 14:52:36.327
        
Attached is the same 'ignored list' that I supplied back in September. It is attached here as a backup and for easy reference.

Remember that this version included the endpoint type to help determine whether the datasets or installations should in fact be ignored.

I think we agreed that the list would have to be gone through manually rather than updated programmatically. Can we therefore close this issue?


Author: ahahn@gbif.org
Comment: Thanks - closed as "won't fix", to be taken over by DM-27 instead (manual fix).
Created: 2012-10-15 10:37:45.229
Updated: 2012-10-15 10:37:45.229