Issue 17214

Registry updates duplicating the endpoints

17214
Reporter: trobertson
Type: Bug
Summary: Registry updates duplicating the endpoints
Priority: Blocker
Status: Open
Created: 2015-02-12 11:30:50.925
Updated: 2015-02-24 16:22:05.405
        
Description: Please see http://api.gbif.org/v1/dataset/854f70cc-55e3-4af2-9417-0f47d6c7902d

Everything was correct until the IPT republished the dataset; now all endpoints are duplicated.  Here is the snippet:

{code}
endpoints: [
{
key: 61694,
type: "DWC_ARCHIVE",
url: "http://ipt.vertnet.org:8080/ipt/archive.do?r=ttu_mammals",
createdBy: "acac73b0-055d-11d8-b84f-b8a03c50a862",
modifiedBy: "acac73b0-055d-11d8-b84f-b8a03c50a862",
created: "2015-02-11T19:29:42.135+0000",
modified: "2015-02-11T19:29:42.135+0000",
machineTags: [ ]
},
{
key: 61693,
type: "DWC_ARCHIVE",
url: "http://ipt.vertnet.org:8080/ipt/archive.do?r=ttu_mammals",
createdBy: "acac73b0-055d-11d8-b84f-b8a03c50a862",
modifiedBy: "acac73b0-055d-11d8-b84f-b8a03c50a862",
created: "2015-02-11T19:29:41.996+0000",
modified: "2015-02-11T19:29:41.996+0000",
machineTags: [ ]
},
{
key: 61692,
type: "EML",
url: "http://ipt.vertnet.org:8080/ipt/eml.do?r=ttu_mammals",
createdBy: "acac73b0-055d-11d8-b84f-b8a03c50a862",
modifiedBy: "acac73b0-055d-11d8-b84f-b8a03c50a862",
created: "2015-02-11T19:29:41.932+0000",
modified: "2015-02-11T19:29:41.932+0000",
machineTags: [ ]
},
{
key: 61691,
type: "EML",
url: "http://ipt.vertnet.org:8080/ipt/eml.do?r=ttu_mammals",
createdBy: "acac73b0-055d-11d8-b84f-b8a03c50a862",
modifiedBy: "acac73b0-055d-11d8-b84f-b8a03c50a862",
created: "2015-02-11T19:29:41.763+0000",
modified: "2015-02-11T19:29:41.763+0000",
machineTags: [ ]
}
],
{code}

Note the timestamps: each duplicate was created within 200 ms of its twin, indicating that they are *unlikely* to be the result of 2 HTTP requests, but rather something more like a method being executed twice.
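To make the timestamp observation easy to repeat on other datasets, here is a minimal sketch that groups endpoints by (type, url) and reports the creation-time gap within each duplicated group. The records are copied from the snippet above, trimmed to the fields used; the names `ENDPOINTS`, `parse_ts` and `find_duplicates` are made up for this illustration and are not part of the registry code.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Endpoint records copied from the snippet in this issue (only the fields used here).
ENDPOINTS = [
    {"key": 61694, "type": "DWC_ARCHIVE",
     "url": "http://ipt.vertnet.org:8080/ipt/archive.do?r=ttu_mammals",
     "created": "2015-02-11T19:29:42.135+0000"},
    {"key": 61693, "type": "DWC_ARCHIVE",
     "url": "http://ipt.vertnet.org:8080/ipt/archive.do?r=ttu_mammals",
     "created": "2015-02-11T19:29:41.996+0000"},
    {"key": 61692, "type": "EML",
     "url": "http://ipt.vertnet.org:8080/ipt/eml.do?r=ttu_mammals",
     "created": "2015-02-11T19:29:41.932+0000"},
    {"key": 61691, "type": "EML",
     "url": "http://ipt.vertnet.org:8080/ipt/eml.do?r=ttu_mammals",
     "created": "2015-02-11T19:29:41.763+0000"},
]


def parse_ts(value):
    """Parse a timestamp in the format shown in the snippet, e.g. 2015-02-11T19:29:42.135+0000."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")


def find_duplicates(endpoints):
    """Return {(type, url): (sorted keys, gap in ms between first and last creation)}
    for every (type, url) pair that occurs more than once."""
    groups = defaultdict(list)
    for ep in endpoints:
        groups[(ep["type"], ep["url"])].append(ep)

    report = {}
    for ident, members in groups.items():
        if len(members) > 1:
            times = sorted(parse_ts(m["created"]) for m in members)
            gap_ms = (times[-1] - times[0]) // timedelta(milliseconds=1)
            report[ident] = (sorted(m["key"] for m in members), gap_ms)
    return report


if __name__ == "__main__":
    for (ep_type, url), (keys, gap_ms) in find_duplicates(ENDPOINTS).items():
        print(f"{ep_type}: keys {keys} created {gap_ms} ms apart ({url})")
```

For the data above this reports a gap of 139 ms for the DWC_ARCHIVE pair and 169 ms for the EML pair.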
    


Author: trobertson@gbif.org
Created: 2015-02-12 16:55:21.331
Updated: 2015-02-12 16:55:21.331
        
There are no registry-ws logs at 2015-02-11T19:29:41 (the created timestamp above) to indicate any issues.
There is an error at 18:02 and then nothing until 20:29.
    


Author: fmendez@gbif.org
Created: 2015-02-12 20:27:37.993
Updated: 2015-02-12 20:27:37.993
        
Reviewing the code that creates/updates the endpoints, I found that if a dataset doesn't have a primary contact, the service could potentially duplicate information.
See:
https://github.com/gbif/registry/blob/master/registry-ws/src/main/java/org/gbif/registry/ws/resources/legacy/LegacyDatasetResource.java#L117
and:
https://github.com/gbif/registry/blob/master/registry-ws/src/main/java/org/gbif/registry/ws/model/LegacyDataset.java#L591

[~kbraak@gbif.org] please review whether what I'm saying makes sense.
    


Author: larussell
Comment: Just wanted to let you know that while republishing all the VN resources, I found a few that already had a duplicated endpoint before I republished them.  Republishing clears up the duplicated endpoint, though: MVZ Birds was one of these last night, and it now appears fine.  Also last night, I republished MVZ Mammals and the endpoint duplicated (only 101 records).  I waited a bit, then republished again, and the duplicate endpoint was removed.
Created: 2015-02-20 18:56:16.726
Updated: 2015-02-20 18:56:16.726


Author: trobertson@gbif.org
Created: 2015-02-24 16:22:05.405
Updated: 2015-02-24 16:22:05.405
        
Thanks for that [~larussell] - the republishing initiates the following procedure:
a) update in the registry
b) DOI issuing if applicable
c) metadata sync of the registry with the EML (after a 60 sec embargo), which sets the "published date" to that in the EML
d) crawling (after 60 sec embargo)

Step a) should clear duplicates, so it is expected you'd see that, but we have yet to diagnose the cause of the race condition that causes the initial problem.