Issue 16671

Populate dataset DOI from pushed EML

Reporter: trobertson
Assignee: mdoering
Type: Task
Summary: Populate dataset DOI from pushed EML
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2014-11-26 11:48:26.025
Updated: 2015-12-15 15:56:44.914
Resolved: 2014-12-09 14:23:43.21
        
Description: For any EML associated with a dataset, populate the DOI on the dataset table.

EML parsing rules:

If the dataset element has an id attribute and it is a DOI, use that.
Else, if there is a citation element with an identifier element and it is a DOI, use that.
Else, check the alternative identifiers:
  if there is exactly one DOI, use that.
  if there are multiple DOIs, choose none.
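
The parsing rules above can be sketched roughly as follows. This is only an illustration, not the registry code: the names `parse_doi` and `choose_doi` are hypothetical, the inputs are assumed to have been read from the EML already, and the DOI pattern (a `10.<registrant>/<suffix>` name with an optional `doi:` or doi.org resolver prefix) is an assumption.

```python
import re

# Assumption: a DOI looks like "10.<registrant>/<suffix>", optionally
# prefixed with "doi:" or a doi.org resolver URL.
DOI_PATTERN = re.compile(
    r"^(?:doi:|https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def parse_doi(value):
    """Return the bare DOI contained in value, or None if it is not a DOI."""
    match = DOI_PATTERN.match(value.strip()) if value else None
    return match.group(1) if match else None


def choose_doi(dataset_id, citation_identifier, alternate_identifiers):
    """Apply the three rules in order and return the chosen DOI or None."""
    # Rule 1: the id attribute of the dataset element.
    doi = parse_doi(dataset_id)
    if doi:
        return doi
    # Rule 2: the identifier of the citation.
    doi = parse_doi(citation_identifier)
    if doi:
        return doi
    # Rule 3: alternative identifiers, accepted only if exactly one is a DOI.
    dois = [d for d in map(parse_doi, alternate_identifiers) if d]
    return dois[0] if len(dois) == 1 else None
```

With two DOIs among the alternative identifiers and no DOI elsewhere, `choose_doi` returns `None`, which matches the "choose none" rule.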

*DOI update logic*:
 - If this is a new dataset and no DOI is found in the EML, the metasync should use the DOI client library to create a new gbif:DOI.
 - If the dataset already had a different DOI, that DOI should be moved to the identifiers extension and the dataset updated with the newly found one.
 - If the existing and the found DOI are the same, nothing needs to happen.
 - If the dataset had an existing DOI and none is found in the EML, nothing needs to happen.
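
The four update cases above could be expressed as a small decision function. This is a sketch under stated assumptions: `reconcile_doi` is a hypothetical name, `mint_doi` is a hypothetical callable standing in for the DOI client library, and a dataset without a stored DOI is treated as new. The case "no stored DOI, but one found in the EML" is not spelled out in the list; the sketch assumes the found DOI is simply used.

```python
def reconcile_doi(existing_doi, found_doi, mint_doi):
    """Return (doi_to_store, doi_to_archive).

    doi_to_archive is the previous DOI that should be moved to the
    identifiers extension, or None if nothing needs archiving.
    """
    if found_doi is None:
        if existing_doi is None:
            # New dataset, nothing in the EML: mint a fresh DOI.
            return mint_doi(), None
        # Existing DOI, nothing in the EML: nothing needs to happen.
        return existing_doi, None
    if existing_doi is None:
        # Assumption: with no stored DOI, the found one is adopted as-is.
        return found_doi, None
    # DOI names are case-insensitive, so compare them that way.
    if existing_doi.lower() == found_doi.lower():
        return existing_doi, None
    # A different DOI was found: archive the old one and use the new one.
    return found_doi, existing_doi
```

Returning the DOI to archive, rather than writing to a table directly, keeps the decision separate from the persistence step.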
    


Author: mdoering@gbif.org
Comment: The primary place to keep a dataset DOI still needs to be discussed, including on the EML mailing list.
Created: 2014-11-27 10:43:58.109
Updated: 2014-11-27 10:43:58.109


Author: trobertson@gbif.org
Created: 2014-12-04 09:50:45.133
Updated: 2014-12-04 09:52:34.206
        
The decision on this is that GBIF will continue to use the current identifiers in the packageId and will put the DOI in an alternateIdentifier.

There is no perfect way to do this, but what we learnt was:
 - while not documented as such, the dataset "id" attribute is intended by the EML authors to have document scope only (Matt Jones)
 - alternateIdentifier is intended to be an alternative identifier for the packageId, but not the exclusive one
 - packageId is intended to identify an immutable "package" in system scope

The IPT manages immutable versions, whereas GBIF's DOI strategy is to support a DOI for a living dataset so that it can be cited consistently, which makes it easier to track use. Thus, the DOI is not suitable for the packageId: it will not reference a specific version, nor will it uniquely identify an immutable package in system scope.

The GBIF dataset APIs (legacy and v1) will support explicitly declaring the DOI at creation time.

When reading extended metadata about a dataset (e.g. EML), GBIF will put rules in place to recognise a DOI in:
 - the packageId (even though we won't populate it in the IPT)
 - the first DOI found in the alternateIdentifiers
 - the "identifier" attribute of the eml/additionalMetadata/citation
Should one be found, the current DOI on the dataset will be archived (put into the identifiers table as an alternate) and the one extracted from the EML will be used.
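
As a rough illustration of these recognition rules, the sketch below reads an EML document with Python's standard `xml.etree.ElementTree`. It is not the registry implementation. Assumptions: the three locations are checked in the order listed, the function name `doi_from_eml` and the helpers are hypothetical, elements are matched by local name regardless of namespace, and the DOI pattern and the sample identifiers are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: "10.<registrant>/<suffix>", optionally with a "doi:" or
# doi.org resolver prefix.
DOI_RE = re.compile(
    r"^(?:doi:|https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)$", re.IGNORECASE
)


def as_doi(value):
    """Return the bare DOI in value, or None if value is not a DOI."""
    match = DOI_RE.match(value.strip()) if value else None
    return match.group(1) if match else None


def local_name(tag):
    """Strip an '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def doi_from_eml(xml_text):
    root = ET.fromstring(xml_text)
    # 1. packageId attribute on the root eml element.
    doi = as_doi(root.get("packageId"))
    if doi:
        return doi
    # 2. First alternateIdentifier that is a DOI.
    for element in root.iter():
        if local_name(element.tag) == "alternateIdentifier":
            doi = as_doi(element.text)
            if doi:
                return doi
    # 3. identifier attribute of a citation under additionalMetadata.
    for meta in root.iter():
        if local_name(meta.tag) == "additionalMetadata":
            for element in meta.iter():
                if local_name(element.tag) == "citation":
                    doi = as_doi(element.get("identifier"))
                    if doi:
                        return doi
    return None


# Made-up sample document; the identifiers are illustrative only.
SAMPLE = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    packageId="619a4b95-1a82-4006-be6a-7dbe3c9b33c5/v7">
  <dataset>
    <alternateIdentifier>619a4b95-1a82-4006-be6a-7dbe3c9b33c5</alternateIdentifier>
    <alternateIdentifier>doi:10.5072/example1</alternateIdentifier>
    <alternateIdentifier>doi:10.5072/example2</alternateIdentifier>
  </dataset>
</eml:eml>"""

print(doi_from_eml(SAMPLE))  # prints 10.5072/example1
```

Unlike the rule in the issue description, which chooses none when several DOIs are present, this version takes the first DOI among the alternateIdentifiers, as the decision above states.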


    


Author: mdoering@gbif.org
Created: 2014-12-08 12:29:05.207
Updated: 2014-12-08 12:29:05.207
        
Nothing needs to be done in the crawler, which pushes a new EML document to the registry AFTER a dataset has already been created.

The registry method that receives an EML metadata document needs to be updated to extract the DOIs; this is the work to be done under this issue.
    


Author: mdoering@gbif.org
Comment: https://github.com/gbif/registry/commit/b251f0c0c7d27593d8397cc56d57d0caaf0b596f#diff-8da43f195de37e9dc6a3c89236a12885R464
Created: 2014-12-09 14:23:43.273
Updated: 2014-12-09 14:23:43.273