Issue 11984

Add support for DwC-A

Reporter: lfrancke
Assignee: lfrancke
Type: Epic
Summary: Add support for DwC-A
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-10-05 12:08:39.101
Updated: 2013-12-17 15:46:39.112
Resolved: 2013-12-17 15:33:12.852
        
Description: Support for DwC-A is needed throughout our new Crawling and Occurrence infrastructure. To avoid duplicating work we need a few shared components.

* Downloader
** So that an archive is downloaded only once during the whole lifecycle
** We need to figure out a place (a repository) to store this, and whether it will all live on one machine or be distributed
* Validator
** One exists but according to Markus it should be mostly rewritten
* Metadata Extractor
** The metadata synchronizer currently doesn't work with DwC-A and we can easily extract metadata while we have the archive downloaded anyway
* Fragmenting
* Occurrence processing
* Cleaning of the DwC-A repository

This is an umbrella issue to keep an overview of all tasks related to DwC-A support.
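As a rough illustration of what the Downloader contract could look like (all names are hypothetical, nothing is implemented yet):

{code:java}
import java.io.File;
import java.net.URI;
import java.util.UUID;

/**
 * Hypothetical contract for the Downloader component listed above;
 * names and signatures are illustrative only.
 */
public interface DwcaDownloader {

  /**
   * Downloads the archive for the given dataset into the shared repository,
   * unless an up-to-date copy already exists there (download-once semantics).
   *
   * @return the archive file inside the repository
   */
  File download(UUID datasetKey, URI archiveUrl);
}
{code}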
    


Author: kbraak@gbif.org
Created: 2012-10-26 17:10:34.825
Updated: 2012-10-26 17:30:02.18
        
The Registry will be the authoritative source for contacts.

Currently, the registry-metadata-service augments a dataset with metadata coming from an eml.xml file, but there is nothing downloading these files yet - hence the need for a Downloader as mentioned in the issue description.

The registry metadata synchronizer and registry-metadata-service will have to be coordinated. For example, the synchronizer populates the registry with metadata from non-external DwC-Archives only, and the registry-metadata-service will be responsible for augmenting the metadata of external datasets only. 
    


Author: lfrancke@gbif.org
Created: 2012-10-31 19:43:55.684
Updated: 2012-10-31 19:43:55.684
        
Open questions for me:

* Will this all be tied to a crawling workflow, or do we want to use some of these things separately? (e.g. a metadata update without processing the occurrences, or validation only)
* Related to previous question: The DwC-A Downloader could either announce every newly available archive on a public exchange, or reply on a private reply queue (or both)
* Should we save DwC archives in an extracted form?
* Should we only do Metadata extraction and/or fragmenting when the validation result is positive?
* Is there any reason for us to keep the DwC archives after we've processed them or do we want to clean up immediately?
* I currently don't understand what the registry-metadata-service does at all, so I need to figure that out, especially how it relates to this processing
* What kind of things does the metadata synchronizer actually synchronize, and is it (or the dwca-reader project) already capable of doing the relevant updates when given a DwC-A file or a metadata/eml file?
    


Author: kbraak@gbif.org
Created: 2012-11-01 12:25:17.072
Updated: 2012-11-01 12:25:17.072
        
1) Will this all be tied to a crawling workflow, or do we want to use some of these things separately? (e.g. a metadata update without processing the occurrences, or validation only)

Ideally, a downloader will download newly registered datasets, and also routinely iterate through all registered datasets and download those whose last modified date indicates our local copy needs to be replaced.

Such a downloader must also respond to individual requests. For example, if contact metadata has changed in an archive from the UK NBN, Andrea should be able to trigger a metadata update.

When Andrea triggers a metadata update through the harvesting GUI:

1) the request triggers the downloader service, requesting a new download of only the eml.xml file if possible
2) when 1) is finished, the request triggers the metadata-synchronizer, which updates the metadata in the registry. At no point are the dataset's records reindexed.

I'd like to point out that it is hard to tell whether only the metadata has changed. At least in the IPT, there is no way to publish only an update of the metadata (eml.xml). Therefore everything gets downloaded, and that's just the way it is right now.
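A rough sketch of that two-step flow (both collaborator interfaces are hypothetical stand-ins, not existing GBIF APIs):

{code:java}
import java.io.File;
import java.net.URI;
import java.util.UUID;

// Hypothetical orchestration of the metadata-only update described above.
public class MetadataUpdateHandler {

  /** Illustrative stand-in for the Downloader component. */
  public interface Downloader {
    File download(UUID datasetKey, URI archiveUrl);
  }

  /** Illustrative stand-in for the metadata-synchronizer. */
  public interface Synchronizer {
    void synchronize(UUID datasetKey, File archive);
  }

  private final Downloader downloader;
  private final Synchronizer synchronizer;

  public MetadataUpdateHandler(Downloader downloader, Synchronizer synchronizer) {
    this.downloader = downloader;
    this.synchronizer = synchronizer;
  }

  /** Triggered from the harvesting GUI; the dataset's records are never reindexed. */
  public void updateMetadata(UUID datasetKey, URI archiveUrl) {
    // 1) fetch the archive (ideally just the eml.xml, if the publisher supports that)
    File archive = downloader.download(datasetKey, archiveUrl);
    // 2) push the metadata into the registry; occurrence records stay untouched
    synchronizer.synchronize(datasetKey, archive);
  }
}
{code}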

2) Related to previous question: The DwC-A Downloader could either announce every newly available archive on a public exchange, or reply on a private reply queue (or both)

I think once the Downloader detects a new registered resource, it

1) broadcasts that information on a public exchange as you suggest, and
2) the metadata-synchronizer is listening to that and updates the metadata in the registry, and
3) the crawler/parser is listening to that and indexes the records
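As a sketch of step 1, assuming a RabbitMQ-style fanout exchange (the exchange name and message format are made up; the metadata-synchronizer and the crawler/parser would each bind their own queue to it):

{code:java}
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;

// Sketch: the Downloader announces a finished download on a fanout exchange;
// every interested consumer binds its own queue and reacts independently.
public class DownloadAnnouncer {

  public void announce(String datasetKey, String archivePath) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    try (Connection connection = factory.newConnection();
         Channel channel = connection.createChannel()) {
      channel.exchangeDeclare("dwca.downloaded", "fanout");
      String message = datasetKey + "|" + archivePath;
      channel.basicPublish("dwca.downloaded", "", null,
          message.getBytes(StandardCharsets.UTF_8));
    }
  }
}
{code}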

3) Should we save DwC archives in an extracted form?

Disk space is always an issue. I think that, just as we only store the .gz XML responses from harvesting, we should store DwC-As only in their compressed form. It is then the job of the metadata-synchronizer or crawler/parser to know where to look, uncompress the archive, do what it needs to do, and then delete the uncompressed content.
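A sketch of that pattern with a plain zip extraction (helper names are hypothetical; in practice gbif-httputils/dwca-reader would handle the decompression):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Sketch: keep only the compressed archive on disk, extract to a temp
// directory when a consumer needs it, and delete the extracted copy after use.
public class ArchiveWorkspace {

  /** Hypothetical callback for whoever needs the extracted archive. */
  public interface ArchiveConsumer {
    void consume(Path extractedDir) throws IOException;
  }

  public void process(Path compressedArchive, ArchiveConsumer consumer) throws IOException {
    Path workDir = Files.createTempDirectory("dwca-");
    try {
      unzip(compressedArchive, workDir);
      consumer.consume(workDir); // e.g. metadata-synchronizer or crawler/parser
    } finally {
      // remove the uncompressed content, keeping only the compressed archive
      try (var paths = Files.walk(workDir)) {
        paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
      }
    }
  }

  private void unzip(Path zip, Path target) throws IOException {
    try (ZipFile zipFile = new ZipFile(zip.toFile())) {
      Enumeration<? extends ZipEntry> entries = zipFile.entries();
      while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        Path out = target.resolve(entry.getName()).normalize();
        if (!out.startsWith(target)) continue; // guard against zip slip
        if (entry.isDirectory()) {
          Files.createDirectories(out);
        } else {
          Files.createDirectories(out.getParent());
          try (InputStream in = zipFile.getInputStream(entry)) {
            Files.copy(in, out);
          }
        }
      }
    }
  }
}
{code}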

4) Should we only do Metadata extraction and/or fragmenting when the validation result is positive?

The Super HIT requires the eml.xml to be parseable, for example to extract information like the dataset title. If the DwC-A is invalid in the sense that its eml.xml can't be parsed, the metadata in the registry can't be augmented, and the problem needs to be logged. With regard to the occurrence data, I think we need to get better at validating the archive before indexing, something we don't do much of right now. That's how we got into problems with eBird, having to reindex them after finding out their lat/long columns were swapped.
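Even a cheap range check before indexing would have caught the eBird case; a sketch of the idea (not GBIF's actual validation code):

{code:java}
// Sketch of a cheap pre-indexing sanity check, not GBIF's actual validator.
// A latitude outside [-90, 90] that would be valid as a longitude (and vice
// versa) hints that the two columns are swapped.
public final class CoordinateSanityCheck {

  public enum Result { OK, OUT_OF_RANGE, POSSIBLY_SWAPPED }

  public static Result check(double lat, double lng) {
    boolean latOk = lat >= -90 && lat <= 90;
    boolean lngOk = lng >= -180 && lng <= 180;
    if (latOk && lngOk) {
      return Result.OK;
    }
    if (!latOk && lat >= -180 && lat <= 180 && lng >= -90 && lng <= 90) {
      return Result.POSSIBLY_SWAPPED;
    }
    return Result.OUT_OF_RANGE;
  }
}
{code}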

5) Is there any reason for us to keep the DwC archives after we've processed them or do we want to clean up immediately?

I would say keep them. Maybe we improve our metadata-synchronizer later so that it can parse an additional field from the metadata; it would be convenient not to have to download all the archives again just to add that field.

6) I currently don't understand what the registry-metadata-service does at all, so I need to figure that out, especially how it relates to this processing.

As we said, the metadata synchronizer populates the metadata in the registry for DiGIR, BioCASE, and TAPIR datasets; it will have to handle DwC-A as well. As I mentioned, I don't think it should handle external datasets, which should be the job of the registry-metadata-service.

7) What kind of things does the metadata synchronizer actually synchronize, and is it (or the dwca-reader project) already capable of doing the relevant updates when given a DwC-A file or a metadata/eml file?

It needs to be extended to do DwC-As. For everybody's reference, the Super HIT parses the DwC-A metadata in this class: http://code.google.com/p/gbif-indexingtoolkit/source/browse/trunk/harvest-service/src/main/java/org/gbif/harvest/dwcarchive/DwcArchiveMetadataHandler.java
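For a sense of scale, extracting even just the dataset title from an eml.xml needs nothing more than a standard XML parse; a minimal sketch (the real handler linked above does far more):

{code:java}
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Minimal sketch: pull the dataset title out of an archive's eml.xml.
public class EmlTitleReader {

  public static String readTitle(File emlFile) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(emlFile);
    // EML places the title at eml:eml/dataset/title
    NodeList titles = doc.getElementsByTagName("title");
    return titles.getLength() > 0 ? titles.item(0).getTextContent().trim() : null;
  }
}
{code}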


    


Author: lfrancke@gbif.org
Created: 2012-11-01 14:17:50.833
Updated: 2012-11-01 14:17:50.833
        
Thanks Kyle for your comments.

# So that means we are interested in doing metadata updates without any further processing. A metadata update doesn't tell us whether there is any new data in the file though, does it? Something could take a hash of just the content files of a DwC-A and check whether those have changed, to decide whether we want to process it or not (see the sketch after this list). So I think the "crawler" could always run for each download.
# Agreed
# Yeah, disk space is an issue, but things like the validator and the metadata updater could all work on a single DwC-A at the same time, and they all need to extract it (in memory or on disk) to work with it. It might make sense to just save the extracted form once; I'm not sure how well the dwca-reader handles this. Relating to your answer to 5, we could keep all DwC-As but delete the extracted form after we've processed them.
# Things like lat/long swapped couldn't be easily detected by just looking at metadata though. That's something that Oliver's occurrence processing would handle. So this question is still open for me.
# Okay
# Okay, I'll ignore the registry-metadata-service for now because it does not seem related to this. The metadata-synchronizer needs some rework; I'm not sure if that's worth it or if we want to start a new project instead. I'll leave this open for now.
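Regarding point 1, a sketch of the content-hash idea (directory layout and names are illustrative): hash only the data files, skipping eml.xml and meta.xml, and compare the result against the hash recorded at the previous crawl.

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;
import java.util.stream.Stream;

// Sketch: hash only the data files of an extracted DwC-A (skipping eml.xml
// and meta.xml) so a metadata-only republish does not trigger reprocessing
// of the occurrence records.
public class ContentHasher {

  private static final List<String> METADATA_FILES = List.of("eml.xml", "meta.xml");

  public static String contentHash(Path extractedArchiveDir) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    try (Stream<Path> files = Files.walk(extractedArchiveDir)) {
      for (Path file : files.filter(Files::isRegularFile)
          .filter(f -> !METADATA_FILES.contains(f.getFileName().toString()))
          .sorted() // stable order so the hash is reproducible
          .toList()) {
        try (InputStream in = Files.newInputStream(file)) {
          byte[] buffer = new byte[8192];
          int read;
          while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
          }
        }
      }
    }
    return HexFormat.of().formatHex(digest.digest());
  }
}
{code}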
    


Author: kbraak@gbif.org
Created: 2012-11-01 14:27:20.639
Updated: 2012-11-01 14:27:20.639
        
Regarding your points:

1. I like this suggestion.
2. OK
3. gbif-httputils is used to download and uncompress the archive. dwca-reader can understand the extracted form only, last time I checked. Right now the Super HIT isn't very conscientious about disk space, saving both the compressed and uncompressed forms.
4. I addressed validating both the metadata and the core file in my earlier answer. The point is that two different types of validation are needed.
5. OK
6. Talk to you soon about this! :)
    


Author: omeyn@gbif.org
Comment: dwca harvesting working in production
Created: 2013-12-17 15:33:12.881
Updated: 2013-12-17 15:33:12.881