Issue 14013

Handling of duplicate datasets: need workflow clarification for marking and indexing

Reporter: ahahn
Assignee: trobertson
Type: Feedback
Summary: Handling of duplicate datasets: need workflow clarification for marking and indexing
Priority: Major
Status: Open
Created: 2013-09-20 11:54:07.253
Updated: 2013-12-17 15:45:09.77
        
Description: Registry-side:
a) Registry migration: the current registry contains tags named "isIgnored" that mark datasets which should possibly remain registered but not be indexed (e.g. 839e3958-f762-11e1-a439-00145eb45e9a). Tags are migrated into machine tags during registry migration.
b) The new registry model contains dataset.duplicate_of_dataset_key to explicitly connect duplicate datasets. As this information does not exist in the old registry in any structured form, it cannot be migrated automatically.

Indexing:
So far, the HIT has evaluated the "isIgnored" tag to skip indexing of registered resources. If the new crawler is to replicate this behaviour, it would currently have to check both the migrated machine tag and the dataset.duplicate_of_dataset_key attribute. For duplicates spotted in future, registry editors will not be able to add machine tags (I assume?), but will instead reference duplicate datasets through the attribute.
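To make the "look in both places" case concrete, here is a minimal sketch of the check the crawler would need if both markers stay in use. The Dataset class, its field names and the function name are assumptions for illustration only and do not reflect the actual registry or crawler API; the sketch also assumes that the mere presence of an "isIgnored" machine tag means "do not index", regardless of its value.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Dataset:
    # Hypothetical stand-in for a registry dataset record.
    key: str
    # Machine tags modelled as (name, value) pairs; the real model may differ.
    machine_tags: List[Tuple[str, str]] = field(default_factory=list)
    duplicate_of_dataset_key: Optional[str] = None


def should_skip_indexing(dataset: Dataset) -> bool:
    """Return True if either exclusion marker is present on the dataset."""
    # Legacy marker: the migrated "isIgnored" machine tag (value is not inspected).
    has_legacy_tag = any(name == "isIgnored" for name, _ in dataset.machine_tags)
    # New marker: an explicit link to the dataset this one duplicates.
    is_marked_duplicate = dataset.duplicate_of_dataset_key is not None
    return has_legacy_tag or is_marked_duplicate
```

If the legacy tags were translated into duplicate_of_dataset_key entries, the first condition could be dropped and the check would reduce to the attribute test alone.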

Questions:
- Should the crawler have to look in both places? Will such "isIgnored" machine tags ever be added in future, or are they purely a migration legacy?
- If not, should isIgnored tags be translated into duplicate_of_dataset_key entries for the appropriate datasets? Proposal: do this manually or through a few update statements in the migration script, since it only concerns a handful of datasets. The crawler would then only look at this attribute to decide which datasets to exclude.
- Does the new registry admin interface already allow duplicate datasets to be linked to each other, i.e. setting duplicate_of_dataset_key? If not, could this be added?
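
The "few update statements" proposal above could look roughly like the following sketch. It runs against an in-memory SQLite table so that it is self-contained; the table layout (dataset.key, dataset.duplicate_of_dataset_key), the helper names and the target key are all assumptions, and the target key is a placeholder rather than a real registry entry. Because the old registry does not record which dataset an ignored one duplicates, the mapping has to be curated by hand.

```python
import sqlite3

IGNORED_KEY = "839e3958-f762-11e1-a439-00145eb45e9a"  # example from this issue
PLACEHOLDER_TARGET = "00000000-0000-0000-0000-000000000000"  # not a real dataset

# Manually curated: ignored dataset key -> key of the dataset it duplicates.
DUPLICATE_OF = {IGNORED_KEY: PLACEHOLDER_TARGET}


def demo_connection() -> sqlite3.Connection:
    """Create a throwaway table with the hypothetical layout used here."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dataset (key TEXT PRIMARY KEY, duplicate_of_dataset_key TEXT)"
    )
    conn.executemany(
        "INSERT INTO dataset (key) VALUES (?)",
        [(IGNORED_KEY,), (PLACEHOLDER_TARGET,)],
    )
    return conn


def apply_duplicate_links(conn: sqlite3.Connection, mapping: dict) -> int:
    """Set duplicate_of_dataset_key for each mapped dataset; return rows updated."""
    updated = 0
    for ignored_key, target_key in mapping.items():
        cur = conn.execute(
            "UPDATE dataset SET duplicate_of_dataset_key = ? WHERE key = ?",
            (target_key, ignored_key),
        )
        updated += cur.rowcount
    conn.commit()
    return updated
```

Calling apply_duplicate_links(demo_connection(), DUPLICATE_OF) updates the single mapped row; in the real migration script the same statements would be issued against the new registry database instead.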


Author: ahahn@gbif.org
Comment: select t.name,t.value,a.id,a.name,a.uuid from tag t join agent a on t.agent_id=a.id where t.name = 'isIgnored' and deleted is null;
Created: 2013-09-20 12:18:27.292
Updated: 2013-09-20 12:18:27.292