Issue 17105

Populate dataset DOI from citation identifier in EML fails

Reporter: kbraak
Type: Bug
Summary: Populate dataset DOI from citation identifier in EML fails
Priority: Blocker
Resolution: Fixed
Status: Closed
Created: 2015-02-06 10:19:07.143
Updated: 2015-02-06 17:28:45.058
Resolved: 2015-02-06 17:28:45.031
        
Description: The parse EML logic is detailed in POR-2545

Apparently it is not being followed, because this dataset has a DOI in its citation identifier, yet GBIF assigned it a new DOI anyway:

http://data.canadensys.net/ipt/eml.do?r=northern-beetle-specimens
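For illustration only, here is a minimal Python sketch of pulling a DOI out of a citation identifier. It is not the registry's actual parser and does not reproduce the rules in POR-2545. The element layout, the function name `extract_citation_doi`, the loose DOI pattern and the sample DOI `10.1234/example-doi` are all assumptions.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Loose DOI pattern (an assumption): "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"(10\.\d{4,9}/\S+)")


def extract_citation_doi(eml_xml: str) -> Optional[str]:
    """Return the DOI found in a citation identifier attribute, or None."""
    root = ET.fromstring(eml_xml)
    for element in root.iter():
        # Drop any XML namespace from the tag before comparing names.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name != "citation":
            continue
        # Assumption: the DOI sits in an "identifier" attribute.
        match = DOI_PATTERN.search(element.get("identifier", ""))
        if match:
            return match.group(1).lower()
    return None


# Made-up EML fragment; the DOI is a placeholder, not a real one.
SAMPLE_EML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
  <additionalMetadata><metadata><gbif>
    <citation identifier="doi:10.1234/example-doi">Example citation</citation>
  </gbif></metadata></additionalMetadata>
</eml:eml>"""

print(extract_citation_doi(SAMPLE_EML))  # prints 10.1234/example-doi
```

Because the pattern searches inside the attribute value, an identifier written with a `doi:` prefix or as a resolver URL would also match in this sketch.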



Unfortunately, it was actually assigned multiple DOIs:

"10.15468/ccyqgt"
"10.15468/ana2y3"
"10.15468/yh8mxf"
"10.15468/zjdqrd"
"10.15468/jk9du3"
"10.15468/lpm4hu"
"10.15468/spywd2"
"10.15468/1q6wil"
"10.15468/xuyrwl"
"10.15468/k8ya2a"
"10.15468/vfdk3d"
"10.15468/6gt9ed"

Please fix by:

* Updating the GBIF Registry to use their DOI instead of our DOI
* Deactivating the GBIF DOIs in DataCite that were mistakenly assigned to this dataset
* Writing back to the publisher to tell them what actions were taken to fix the problem
    


Author: kbraak@gbif.org
Comment: All GBIF DOIs assigned have now been deactivated in DataCite through the DataCite Metadata Store interface.
Created: 2015-02-06 12:17:46.822
Updated: 2015-02-06 12:17:46.822


Author: mdoering@gbif.org
Created: 2015-02-06 13:25:25.938
Updated: 2015-02-06 13:25:25.938
        
As we track all GBIF DOIs in the registry table gbif_doi, we need to deactivate them there too!
We want an admin interface for changing a DOI's status that takes care of updating it in both our registry and DataCite.
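A rough sketch of what such an admin operation might look like, with plain dictionaries standing in for the gbif_doi table and for DataCite. The function name `change_doi_status`, the status strings and the rollback behaviour are hypothetical and are not taken from the registry code or the DataCite API.

```python
from typing import Dict


class DoiStatusError(Exception):
    """Raised when a DOI status cannot be changed in both places."""


def change_doi_status(
    doi: str,
    new_status: str,
    registry_table: Dict[str, str],  # stand-in for the gbif_doi table
    datacite: Dict[str, str],        # stand-in for a DataCite client
) -> None:
    """Update the status in the registry and in DataCite, or in neither."""
    if doi not in registry_table:
        raise DoiStatusError(f"{doi} is not tracked in the registry")
    previous = registry_table[doi]
    registry_table[doi] = new_status
    if doi not in datacite:
        # Roll the registry back so the two stores do not drift apart.
        registry_table[doi] = previous
        raise DoiStatusError(f"{doi} is unknown to DataCite")
    datacite[doi] = new_status


# One of the DOIs listed in this issue; the status values are made up.
registry_table = {"10.15468/ccyqgt": "registered"}
datacite = {"10.15468/ccyqgt": "registered"}
change_doi_status("10.15468/ccyqgt", "deactivated", registry_table, datacite)
print(registry_table, datacite)
```

The point of routing both updates through one call is that an admin cannot change DataCite and forget the registry table, which is what happened here.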
    


Author: mdoering@gbif.org
Comment: Does anyone understand why this dataset had so many DOIs assigned?
Created: 2015-02-06 13:26:01.606
Updated: 2015-02-06 13:26:01.606


Author: jlegind@gbif.org
Created: 2015-02-06 13:36:58.432
Updated: 2015-02-06 13:36:58.432
        
Yesterday I noticed that this dataset appeared 11 times in the registry (each entry having its own DOI). Later David Shorthouse mentioned that he wanted all entries removed, which I did.

I don't know why this resource appears as distinct datasets.
    


Author: mdoering@gbif.org
Created: 2015-02-06 13:40:56.64
Updated: 2015-02-06 13:40:56.64
        
The dataset was published 4 times, all during this week, after we released the first DOI code:

http://data.canadensys.net/ipt/resource.do?r=northern-beetle-specimens&v=4
http://data.canadensys.net/ipt/resource.do?r=northern-beetle-specimens&v=3
http://data.canadensys.net/ipt/resource.do?r=northern-beetle-specimens&v=2
http://data.canadensys.net/ipt/resource.do?r=northern-beetle-specimens&v=1

As it is an old IPT, it does not submit the DOI at dataset creation time, so the dataset should get an initial DOI assigned by GBIF.
The EML is only actually read when the data is indexed, and only then could the source DOI have been extracted.

I suspect this is an issue with the legacy IPT services...
    


Author: mdoering@gbif.org
Created: 2015-02-06 13:58:01.421
Updated: 2015-02-06 13:58:01.421
        
Any EML is only read when we index during metasync. Before that, at registration time, we issue a GBIF DOI even if the source has one in its EML. This is only different with the new IPT (still to be released), as it pushes the DOI into our registry with the register call.

So if crawling is off or blocked, it will take a while for the source DOI to get through. This is expected behavior!
That we have issued lots of DOIs is not expected…
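The two-step flow described above can be sketched roughly as follows. This is a toy model, not the registry implementation: the `Dataset` class, the `register` and `metasync` functions, the sequential DOI suffixes and the `replaced_dois` list are assumptions for illustration. In particular, what happens to a replaced GBIF DOI is not specified in this thread.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_suffix = count(1)


@dataclass
class Dataset:
    title: str
    doi: str
    replaced_dois: List[str] = field(default_factory=list)


def mint_gbif_doi() -> str:
    # Hypothetical minting: sequential demo suffixes under the GBIF prefix.
    return f"10.15468/demo{next(_suffix)}"


def register(title: str, source_doi: Optional[str] = None) -> Dataset:
    """Registration: use a DOI pushed with the call, else mint a GBIF DOI."""
    return Dataset(title=title, doi=source_doi or mint_gbif_doi())


def metasync(dataset: Dataset, eml_doi: Optional[str]) -> None:
    """Indexing: once the EML is read, a source DOI replaces the GBIF DOI."""
    if eml_doi and eml_doi != dataset.doi:
        dataset.replaced_dois.append(dataset.doi)
        dataset.doi = eml_doi


# A legacy IPT registers without a DOI, so a GBIF DOI is minted first.
legacy = register("northern-beetle-specimens")
print(legacy.doi)
# Later the EML is read during metasync (placeholder source DOI).
metasync(legacy, "10.1234/source-doi")
print(legacy.doi, legacy.replaced_dois)
```

In this model a single registration yields at most one GBIF DOI, so the many DOIs seen in this issue have to come from repeated registrations rather than from the metasync step.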
    


Author: mdoering@gbif.org
Created: 2015-02-06 14:37:28.975
Updated: 2015-02-06 14:37:28.975
        
Our code actually seems fine. I now think this might be related to the HTTP client bug I fixed yesterday, which only got into prod at 8pm.
Before that he would have seen timeouts, so he may have tried several times without ever getting a UUID key back, and we kept creating new datasets.
    


Author: mdoering@gbif.org
Created: 2015-02-06 16:54:38.889
Updated: 2015-02-06 16:54:38.889
        
I have verified that when an IPT tries to register against an old registry with the HTTP client bug, a) the IPT gets a timeout, so apparently nothing happens on the IPT side, and b) the call eventually gets through to our registry, thus creating a new dataset every time!

When I run the IPT against the latest registry, it registers fine on the first call.
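The failure mode can be reproduced in miniature with a small simulation in which every register call reaches the registry but the response is lost. Everything here is hypothetical (the `Registry` class, the `deduplicate` flag, the installation name `example-ipt`). The deduplication branch is only one possible guard and is not the fix that was applied, which was in the HTTP client.

```python
import uuid
from typing import Dict, Tuple


class Registry:
    """Toy registry keyed by UUID; optionally reuses an existing entry."""

    def __init__(self, deduplicate: bool) -> None:
        self.deduplicate = deduplicate
        # key -> (installation, resource short name)
        self.datasets: Dict[str, Tuple[str, str]] = {}

    def register(self, installation: str, resource: str) -> str:
        if self.deduplicate:
            for key, owner in self.datasets.items():
                if owner == (installation, resource):
                    return key  # same resource: hand back the existing key
        key = str(uuid.uuid4())
        self.datasets[key] = (installation, resource)
        return key


def register_with_lost_responses(registry: Registry, attempts: int) -> int:
    """Each call succeeds server-side, but the client only sees a timeout."""
    for _ in range(attempts):
        registry.register("example-ipt", "northern-beetle-specimens")
        # The returned key is discarded, as if the client had timed out,
        # so the next attempt cannot refer to the dataset already created.
    return len(registry.datasets)


print(register_with_lost_responses(Registry(deduplicate=False), 3))  # prints 3
print(register_with_lost_responses(Registry(deduplicate=True), 3))   # prints 1
```

Without a key coming back, each retry looks like a brand-new dataset to the registry, which matches the behaviour verified above.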