Issue 13541

Make the fragment dwca identifier rules concrete

13541
Reporter: omeyn
Assignee: omeyn
Type: Bug
Summary: Make the fragment dwca identifier rules concrete
Description: At the moment we accept triplets and occurrence ids if the validator says they're perfect. But we need to be 100% clear on whether we accept occurrence ids when the triplet is bad (e.g. no institution code), and likewise whether we accept triplets when the occurrence id is bad. Basically, get http://dev.gbif.org/wiki/display/INT/Identifier+problems+and+how+to+solve+them up to date.
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2013-07-31 09:37:02.602
Updated: 2013-12-17 15:16:43.538
Resolved: 2013-12-17 14:49:13.405


Author: mdoering@gbif.org
Comment: I was just reading a dwc comment about institutionID vs institutionCode - do we also consider a triplet of institutionID, collectionID and catalogNumber as a valid occurrence identifier?
Created: 2013-07-31 09:45:31.975
Updated: 2013-07-31 09:45:31.975


Author: kbraak@gbif.org
Created: 2013-08-02 10:14:22.732
Updated: 2013-08-02 10:14:22.732
        
[~mdoering@gbif.org], for historical knowledge: the HIT has always accepted collectionID and datasetID as alternatives for collectionCode, when collectionCode is absent during indexing.

Furthermore, OccurrenceID has been used as an alternative for catalogNumber, when catalogNumber is absent during indexing. There is no alternative identifier for institutionCode.

Of course these alternative terms only exist for Darwin Core Archives, based on the new Darwin Core terms. For reference, here is the HIT's Darwin Core terms mapping file: https://code.google.com/p/gbif-indexingtoolkit/source/browse/trunk/harvest-service/src/main/resources/org/gbif/harvest/dwcarchive/mapping/indexMapping_dwc.properties
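
The fallback behaviour described in this comment can be sketched roughly as follows. This is a minimal illustration, not the HIT's actual code: the function names are hypothetical, the order in which {{collectionID}} and {{datasetID}} are tried is an assumption (the comment only says both are accepted), and returning None for an incomplete triplet is also an assumption.

```python
from typing import Mapping, Optional, Sequence, Tuple


def first_present(record: Mapping[str, str], terms: Sequence[str]) -> Optional[str]:
    """Return the first non-blank value among the given Darwin Core terms."""
    for term in terms:
        value = (record.get(term) or "").strip()
        if value:
            return value
    return None


def hit_style_triplet(record: Mapping[str, str]) -> Optional[Tuple[str, str, str]]:
    """Build (institution, collection, catalog) using the fallbacks described above."""
    # institutionCode has no alternative term.
    institution = first_present(record, ["institutionCode"])
    # collectionID and datasetID stand in for a missing collectionCode
    # (the order between the two alternatives is an assumption).
    collection = first_present(record, ["collectionCode", "collectionID", "datasetID"])
    # occurrenceID stands in for a missing catalogNumber.
    catalog = first_present(record, ["catalogNumber", "occurrenceID"])
    if institution is None or collection is None or catalog is None:
        return None  # assumption: an incomplete triplet is not usable as an identifier
    return (institution, collection, catalog)
```

For a record with only {{institutionCode}}, {{collectionID}} and {{occurrenceID}} filled in, this sketch would still produce a triplet; a record without {{institutionCode}} yields None.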


    


Author: mdoering@gbif.org
Created: 2013-08-02 10:32:53.147
Updated: 2013-08-02 10:33:38.139
        
There is an institutionID: http://rs.tdwg.org/dwc/terms/index.htm#institutionID

As I understand Darwin Core there are the following terms that could be involved in identifying an occurrence record.
I am not saying we need to support all of them, but if we could I think stability of GBIF record ids would surely increase. We might even be able to detect true duplicates.

1) *occurrenceID*
{{occurrenceID}} as a standalone id within a dataset. It might make sense to also take {{datasetID}} into account to allow non-GUIDs as occurrenceID identifiers. Imagine aggregators like GBIF returning mixed records, as we do in our download service, but using the original occurrenceID as John W suggests. Without the dataset context, non-GUID ids are unlikely to be unique.

2) *Triplet*
{{catalogNumber}} in combination with:
{{collectionID}} / {{collectionCode}}
{{institutionID}} / {{institutionCode}}
If the collectionID actually is a true GUID, the institutionID/Code should be irrelevant.

3) *recordNumber*
{{recordNumber}} & {{recordedBy}}: I am not aware of anyone using these, but the number applied to a record in the field plus the identity of the recorder should also be a pretty good identifier. If this is still doubtful, one could also add the collecting year for greater confidence.
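
To make the three options above concrete, here is a rough sketch of how candidate keys could be derived from a single record. It is only an illustration under stated assumptions: the function name and key layout are hypothetical, {{dataset_id}} is passed in by the caller to scope the occurrenceID, preferring the ID terms over the Code terms is an assumption, and no attempt is made to detect whether a collectionID is a true GUID.

```python
from typing import List, Mapping, Tuple


def _value(record: Mapping[str, str], term: str) -> str:
    """Return the trimmed value of a term, or an empty string if absent."""
    return (record.get(term) or "").strip()


def candidate_keys(dataset_id: str, record: Mapping[str, str]) -> List[Tuple[str, ...]]:
    """List every identifier candidate the record supports (hypothetical layout)."""
    keys: List[Tuple[str, ...]] = []

    # 1) occurrenceID, scoped by dataset so that non-GUID ids stay unique.
    occurrence_id = _value(record, "occurrenceID")
    if occurrence_id:
        keys.append(("occurrenceID", dataset_id, occurrence_id))

    # 2) Triplet: catalogNumber with collectionID/Code and institutionID/Code.
    #    Preferring the ID over the Code is an assumption of this sketch.
    catalog = _value(record, "catalogNumber")
    collection = _value(record, "collectionID") or _value(record, "collectionCode")
    institution = _value(record, "institutionID") or _value(record, "institutionCode")
    if catalog and collection and institution:
        keys.append(("triplet", institution, collection, catalog))

    # 3) recordNumber together with recordedBy.
    record_number = _value(record, "recordNumber")
    recorded_by = _value(record, "recordedBy")
    if record_number and recorded_by:
        keys.append(("recordNumber", recorded_by, record_number))

    return keys
```

Returning all candidates rather than a single winner leaves the question raised in this issue (which identifier to trust when another is bad) to the caller.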
    


Author: omeyn@gbif.org
Comment: The wiki document is up to date, and these id discussions are ongoing.
Created: 2013-12-17 14:49:13.443
Updated: 2013-12-17 14:49:13.443