Issue 11400

registry-sync: avoid deleting Datasets during synchronization

11400
Reporter: kbraak
Assignee: fmendez
Type: Bug
Summary: registry-sync: avoid deleting Datasets during synchronization
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-06-08 18:18:28.481
Updated: 2013-12-16 17:50:49.138
Resolved: 2012-06-11 16:50:27.82
TimeEstimate: 0
TimeSpent: 7200
        
Description: I ran a metadata synchronization on the BioCASE Technical Installation (f861f398-abee-11e1-9773-0024e8565763) for BGBM using my setup on kyle.gbif.org (b7g4), connecting to registry db (mogo.registry_uat), exposed via kyle.gbif.org:8080/registry-ws

I noticed the following logs in the middle of the running operation:

    '> 17:56:46.669 o.g.r.s.s.MetadataSynchronizerBase Deleting dataset 417b15ca-b179-11e1-9773-0024e8565763
    '> 17:56:46.678 o.gbif.messaging.amqp.MessagingNode Channel created: exchange.declare-ok
    '> 17:56:46.699 o.g.r.s.s.MetadataSynchronizerBase Deleting dataset 45f8cc00-b179-11e1-9773-0024e8565763
    '> 17:56:46.706 o.gbif.messaging.amqp.MessagingNode Channel created: exchange.declare-ok
    '> 17:56:46.746 o.g.r.s.s.MetadataSynchronizerBase Deleting dataset 7eecaf06-b181-11e1-9773-0024e8565763
    '> 17:56:46.753 o.gbif.messaging.amqp.MessagingNode Channel created: exchange.declare-ok
    '> 17:56:46.772 o.g.r.s.s.MetadataSynchronizerBase Deleting dataset 7f30c40c-b181-11e1-9773-0024e8565763
    '> 17:56:46.779 o.gbif.messaging.amqp.MessagingNode Channel created: exchange.declare-ok
    '> 17:56:46.798 o.g.r.s.s.MetadataSynchronizerBase Deleting dataset 7fabf7d0-b181-11e1-9773-0024e8565763
    '> 17:56:46.805 o.gbif.messaging.amqp.MessagingNode Channel created: exchange.declare-ok
    '> 17:56:46.824 o.g.r.s.s.MetadataSynchronizerBase Deleting dataset 7fd09176-b181-11e1-9773-0024e8565763
    '> 17:56:46.832 o.gbif.messaging.amqp.MessagingNode Channel created: exchange.declare-ok

Looking at one in particular, I saw that the same Dataset got created in an earlier run at 2012-06-08 16:50:14, was deleted for some reason during this run at 2012-06-08 17:56:46, and then recreated during the same run at 2012-06-08 17:58:01

This particular dataset hasn't been successfully indexed yet. You can see it here on the HIT: http://hit.gbif.org/datasource/list.html?filter=datasource&value=51095c5f

I have a feeling that your minimum requirements for identifying whether a dataset already exists are too strict. If we can get the dataset title, great; any future updates can then operate on that existing dataset.

Furthermore, we should follow this rule: if the URL responds at all, and especially with some recognisable protocol response, the dataset is clearly not intended to be deleted. Don't remove it, but make it easy for someone to follow up. Instead of deleting, it is better to just log such instances and have someone look at them.
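
To make the proposed rule concrete, here is a minimal sketch of the decision it implies. It is not the actual registry-sync code: the class name `SyncDeletionPolicy`, the `Action` values, and the two inputs are all hypothetical, and treating a non-responding URL as a deletion candidate is an assumption rather than something stated above.

```java
import java.util.Optional;

/** Hypothetical sketch of the rule proposed above; not the real MetadataSynchronizerBase logic. */
final class SyncDeletionPolicy {

    /** Possible outcomes for a registered dataset encountered during a sync run (names are assumptions). */
    enum Action { KEEP_AND_UPDATE, LOG_FOR_REVIEW, DELETE_CANDIDATE }

    /**
     * @param title            dataset title read from the endpoint's metadata, if any
     * @param endpointResponds true if the access URL answered at all
     */
    static Action decide(Optional<String> title, boolean endpointResponds) {
        // A non-blank title is treated as enough to identify an existing dataset.
        if (title.isPresent() && !title.get().trim().isEmpty()) {
            return Action.KEEP_AND_UPDATE;
        }
        // A responding URL means the dataset is not meant to go away: log it, don't delete it.
        if (endpointResponds) {
            return Action.LOG_FOR_REVIEW;
        }
        // Assumption: only a silent endpoint with no usable metadata is considered for deletion.
        return Action.DELETE_CANDIDATE;
    }

    public static void main(String[] args) {
        System.out.println(decide(Optional.of("Example dataset"), true));
        System.out.println(decide(Optional.empty(), true));
        System.out.println(decide(Optional.empty(), false));
    }
}
```

The point of the ordering is that deletion is only ever the last resort, reached when neither a title nor any response is available.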





Author: fmendez@gbif.org
Created: 2012-06-11 16:50:12.084
Updated: 2012-06-11 16:50:12.084
        
Some deletions could happen because the test databases are not synchronized, i.e. datasets were created in the portal db that don't have a related service or technical installation in the registry db (or vice versa). The following query can help to spot those cases:

    select a.id from agent a
    left outer join identifier i on i.agent_id = a.id and i.identifier_type = 2009
    where i.id is null and a.deleted is null and a.type = 14020

2009 = GBIF_PORTAL
14020 = DATASET entity type

In general, we have to validate those datasets which don't have a portal identifier.
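
As a rough illustration of that validation step, the same filter as the query above can be applied to records already loaded in memory. This is only a sketch: the `Agent` class and its fields are hypothetical stand-ins for the `agent` and `identifier` tables, and only the two numeric codes come from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory equivalent of the query above; not part of the registry code base. */
final class PortalIdentifierCheck {
    static final int GBIF_PORTAL = 2009;  // identifier_type code quoted above
    static final int DATASET = 14020;     // agent type code quoted above

    /** Simplified stand-in for one agent row plus the types of its identifiers. */
    static final class Agent {
        final int id;
        final int type;
        final boolean deleted;
        final List<Integer> identifierTypes;

        Agent(int id, int type, boolean deleted, List<Integer> identifierTypes) {
            this.id = id;
            this.type = type;
            this.deleted = deleted;
            this.identifierTypes = identifierTypes;
        }
    }

    /** Returns ids of non-deleted datasets that carry no GBIF_PORTAL identifier. */
    static List<Integer> datasetsWithoutPortalId(List<Agent> agents) {
        List<Integer> ids = new ArrayList<>();
        for (Agent a : agents) {
            boolean hasPortalId = a.identifierTypes.contains(Integer.valueOf(GBIF_PORTAL));
            if (a.type == DATASET && !a.deleted && !hasPortalId) {
                ids.add(a.id);
            }
        }
        return ids;
    }
}
```

Datasets returned by such a check are the ones to validate by hand before any synchronization run is allowed to touch them.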