Issue 11427

Create new EML harvester to enable augmenting DwC-Archive datasets

11427
Reporter: kbraak
Type: Improvement
Summary: Create new EML harvester to enable augmenting DwC-Archive datasets
Priority: Critical
Resolution: Invalid
Status: Closed
Created: 2012-06-15 12:04:26.45
Updated: 2014-06-11 16:21:48.168
Resolved: 2012-11-05 17:16:14.314
        
Description: We currently have an OAI-PMH harvester service. We use it, for example, to harvest KNB's OAI-PMH service and populate the registry's metadata directory. The registry-metadata-service reads that directory to augment datasets with extra metadata or to serve non-GBIF/metadata-only datasets.

Existing DwC-Archive GBIF datasets currently never have their metadata augmented, because their eml.xml files are never persisted to the registry's metadata directory. Something is therefore needed to harvest all the DwC-Archives registered in the GBIF Registry and persist their eml.xml files to the registry's metadata directory.

This harvester should probably become a separate service (and separate project) that can run at the same time as the OAI-PMH harvester. It should listen for new registrations and poll EML access points for last-modified dates, harvesting anew when changes are detected. It can probably borrow some of the existing code from the DwCArhiveMetadataHandler in the harvest-service project, which is currently responsible for harvesting eml.xml files and parsing them using the libraries from the DwCA-reader project.
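
To illustrate the polling idea, here is a rough Python sketch of the change check. It assumes the EML access point answers HTTP HEAD requests with a Last-Modified header, which may not hold for every publisher. The function names (fetch_last_modified, parse_last_modified, needs_harvest) are made up for this example and are not part of any existing GBIF project.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
import urllib.request


def parse_last_modified(header: Optional[str]) -> Optional[datetime]:
    """Parse an HTTP Last-Modified header value; return None if absent or invalid."""
    if not header:
        return None
    try:
        return parsedate_to_datetime(header)
    except (TypeError, ValueError):
        return None


def fetch_last_modified(url: str) -> Optional[datetime]:
    """Send a HEAD request to a (hypothetical) EML access point and read Last-Modified."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_last_modified(response.headers.get("Last-Modified"))


def needs_harvest(previous: Optional[datetime], current: Optional[datetime]) -> bool:
    """Decide whether to harvest again.

    Harvest when either date is unknown (new registration, or a server that
    does not report Last-Modified), or when the remote copy is newer.
    """
    if previous is None or current is None:
        return True
    return current > previous
```

Harvesting whenever a date is missing is a deliberately cautious choice: it costs extra downloads but avoids silently skipping a changed eml.xml.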

To complicate matters, not all DwC-Archive datasets have a separate EML access point. In that case the eml.xml file may exist only inside the compressed archive, meaning the archive itself must be downloaded and expanded before the eml.xml file can be harvested.
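
For the archive-only case, a minimal Python sketch of pulling eml.xml out of an already downloaded archive could look like the following. It assumes the archive is a zip file and that the metadata file is literally named eml.xml. The extract_eml function is hypothetical, and a real implementation would more likely go through the DwCA-reader libraries.

```python
import io
import zipfile
from typing import Optional


def extract_eml(archive_bytes: bytes, eml_name: str = "eml.xml") -> Optional[bytes]:
    """Return the contents of eml.xml from a zipped DwC-Archive, or None if missing.

    The file is matched by base name (case-insensitive), so it is found whether
    it sits at the archive root or inside a folder.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for member in archive.namelist():
            base_name = member.rsplit("/", 1)[-1]
            if base_name.lower() == eml_name.lower():
                return archive.read(member)
    return None
```

This reads the whole archive into memory, which is fine for a sketch but probably not for very large archives; streaming the download to disk first would be the safer route there.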

Furthermore, the harvester should clear the cache for a dataset every time a new version of the dataset's eml.xml is persisted.
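
A possible shape for the persist-then-invalidate step, again only as a Python sketch: the file layout (one <dataset key>.xml per dataset in the metadata directory) and the clear_cache callback are assumptions for illustration, not the registry's actual layout or API.

```python
from pathlib import Path
from typing import Callable


def persist_eml(metadata_dir: Path, dataset_key: str, eml: bytes,
                clear_cache: Callable[[str], None]) -> bool:
    """Write eml.xml for a dataset and clear its cache, but only if the content changed.

    Returns True when a new version was persisted, False when nothing changed.
    """
    target = metadata_dir / f"{dataset_key}.xml"  # hypothetical naming scheme
    if target.exists() and target.read_bytes() == eml:
        return False  # identical content: keep the file and the cache as they are

    metadata_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then rename, so readers never see a partial file.
    temporary = metadata_dir / f"{dataset_key}.tmp"
    temporary.write_bytes(eml)
    temporary.replace(target)

    clear_cache(dataset_key)  # hypothetical hook into the metadata service's cache
    return True
```

Clearing the cache only after the rename succeeds means a failed write never invalidates a cache entry that is still correct.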

******

So that people are aware, the Super HIT will continue to process the eml.xml files of DwC-Archive datasets, parse them, and synchronize the extra metadata with the GBIF Data Portal. That means that, for the time being, the extra metadata from the eml.xml files will exist only in the Data Portal, never in the Registry.


Author: mdoering@gbif.org
Created: 2012-06-15 12:32:49.869
Updated: 2012-06-15 12:32:49.869
        
Can we think this through in light of issue CLB-79, messaging, and how we can avoid pulling large DwC archives twice when harvesting data and metadata?
The ideal scenario that springs to mind is to split only harvesting and indexing, with a single harvester responsible for pulling both metadata and data if available. That single harvester should also be responsible for pulling all datasets, regardless of whether they are occurrence or checklist datasets.


Author: kbraak@gbif.org
Comment: Looks like CRAWLER-19 will be the authoritative issue coordinating this task.
Created: 2012-10-26 17:32:35.196
Updated: 2012-10-26 17:32:35.196


Author: mdoering@gbif.org
Comment: Superseded by CRAWLER-42
Created: 2012-11-05 17:16:14.352
Updated: 2012-11-05 17:16:14.352