Entries marked as deleted in the current HIT: make sure the information is transferred to the new version
Summary: Entries marked as deleted in the current HIT: make sure the information is transferred to the new version
Description: The HIT allows operators to be flagged as deleted (available filter: http://hit.gbif.org/datasource/list.html?filter=deleted). The purpose is to maintain a history, so that operators once deleted are not created again, and also to allow operators to be reinstated if a deletion turns out to have been accidental or erroneous. Deleted operators include those of datasets that were removed (e.g. in migrating from one publishing package to another), as well as flags for resources that continue to be served in duplicate by a single publisher or across the network. Resources of the latter type will not disappear from the registry, but will eventually be cross-referenced. We need to ensure that past decisions on which of the two duplicates to remove are honored, and that operators and datasets do not reappear after migration to the dataset-aware registry. As the deleted list in the HIT is the only existing documentation of these decisions, it needs to be ported to the new HIT version; otherwise it is next to impossible to reproduce the history of such deletions.
Created: 2012-06-26 14:09:12.864
Updated: 2012-10-15 10:37:45.231
Resolved: 2012-10-15 10:37:45.204
Created: 2012-06-30 21:12:37.916
Updated: 2012-06-30 21:12:37.916
I wanted to clear this issue up before leaving.
I refer to such deleted entries as ignored: they can still exist in the Registry, but we never want to harvest or index them.
In total, I found 99 datasets and 4 technical installations that should be ignored. Please see the attached spreadsheet, which has 2 sheets: one for the technical installations and one for the datasets.
This information was generated by a script ( http://code.google.com/p/gbif-indexingtoolkit/source/browse/trunk/harvest-service/src/main/java/org/gbif/harvest/registry/IgnoredBioDatasourceSynchronizer.java ) and then manually curated to remove duplicates. The script iterates over the OLD HIT's 1104 deleted BioDatasources and checks whether the related technical installation or dataset of each one still exists in the Dataset-Aware Registry. If it does, it is written to an output file.
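As a rough illustration of the check described above, here is a minimal sketch. It is not the actual IgnoredBioDatasourceSynchronizer code: the class, field, and method names are hypothetical, the Registry is reduced to two in-memory sets of keys, and output lines are returned as a list instead of being written to a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from the real script.
final class IgnoredEntrySketch {

  // Assumed minimal shape of a deleted BioDatasource in the old HIT.
  static final class DeletedBioDatasource {
    final String name;
    final String installationKey; // may be null if unknown
    final String datasetKey;      // may be null if unknown

    DeletedBioDatasource(String name, String installationKey, String datasetKey) {
      this.name = name;
      this.installationKey = installationKey;
      this.datasetKey = datasetKey;
    }
  }

  // For each deleted entry, emit one tab-separated line per related
  // installation or dataset that still exists in the registry.
  static List<String> findIgnored(List<DeletedBioDatasource> deleted,
                                  Set<String> registryInstallationKeys,
                                  Set<String> registryDatasetKeys) {
    List<String> lines = new ArrayList<>();
    for (DeletedBioDatasource d : deleted) {
      if (d.installationKey != null && registryInstallationKeys.contains(d.installationKey)) {
        lines.add("installation\t" + d.installationKey + "\t" + d.name);
      }
      if (d.datasetKey != null && registryDatasetKeys.contains(d.datasetKey)) {
        lines.add("dataset\t" + d.datasetKey + "\t" + d.name);
      }
    }
    return lines;
  }
}
```

Calling `IgnoredEntrySketch.findIgnored(...)` with the old HIT's deleted entries and the current registry keys would yield the candidate rows for the two sheets. Duplicates are not removed here, matching the manual curation step mentioned above.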
Of course the OLD HIT's list of deleted BioDatasources may change, and so too might the Dataset-Aware Registry, so the script might have to be re-run later.
I would recommend going over the spreadsheet and double-checking that its entries should indeed be ignored in the Super HIT. I validated a dozen or so entries, and I am fairly confident in the script. If changes need to be made to the script, just let me know.
For anyone else interested in running it, just configure the oldHit params in the application.properties file and run IgnoredBioDatasourceSynchronizerTest.testSync().
I am going to mark the issue as resolved, on the assumption that once the Super HIT goes into production, these entries can be manually flagged as deleted. I could write something that does this automatically later if you want; please just open another issue. Thanks