Issue 16550

Duplication of UAM records due to collectionCode change

Reporter: rdmpage
Assignee: trobertson
Type: Bug
Summary: Duplication of UAM records due to collectionCode change
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2014-10-25 09:00:22.9
Updated: 2015-03-02 15:15:07.336
Resolved: 2015-03-02 15:15:07.313
        
Description: The UAM Mammal Collection (Arctos) http://www.gbif.org/dataset/377be098-626f-4cc2-b4b5-35700050669a has been duplicated. GBIF reports 237,666 records, but the DWCA download has 118,980.

For example, UAM:Mamm:52179 appears twice:

http://gbif.org/occurrence/897060683
"lastCrawled": "2014-06-24T20:45:10.496+0000",
"lastParsed": "2014-05-29T21:07:31.242+0000",
"identifier": "urn:occurrence:Arctos:UAM:Mamm:52179:83967",

"collectionCode": "UAM Mammals",

http://gbif.org/occurrence/784746686
"lastCrawled": "2014-03-07T08:05:10.346+0000",
"lastParsed": "2014-05-29T17:08:36.371+0000",
"identifier": "urn:occurrence:Arctos:UAM:Mamm:52179:83967",

"collectionCode": "Mamm",

Both records have the same "identifier" field, but the "collectionCode" is different. If GBIF is basing identity on DwC triplets and not the identifier, then that would explain the problem. Is there a reason not to defer to the identifier?
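To make the suspected cause concrete, here is a minimal Python sketch (not GBIF's actual code). The identifier and collectionCode values are copied from the two records above; the institutionCode and catalogNumber values and the `triplet_key` helper are assumptions made only for illustration.

```python
# Two versions of the same specimen, as seen in successive crawls.
# identifier and collectionCode come from the records above;
# institutionCode and catalogNumber are assumed values.
old = {
    "identifier": "urn:occurrence:Arctos:UAM:Mamm:52179:83967",
    "institutionCode": "UAM",   # assumed
    "collectionCode": "Mamm",
    "catalogNumber": "52179",   # assumed
}
new = dict(old, collectionCode="UAM Mammals")


def triplet_key(record):
    """Hypothetical identity key built from the DwC triplet."""
    return (
        record["institutionCode"],
        record["collectionCode"],
        record["catalogNumber"],
    )


same_by_triplet = triplet_key(old) == triplet_key(new)
same_by_identifier = old["identifier"] == new["identifier"]
print(same_by_triplet, same_by_identifier)
```

Under these assumptions the triplet keys differ while the identifiers match, which would be enough to index the second crawl as a new record.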
    


Author: trobertson@gbif.org
Comment: Thanks Rod.  This will likely be picked up in the coming days
Created: 2014-10-25 15:25:05.103
Updated: 2014-10-25 15:25:05.103


Author: jlegind@gbif.org
Comment: The dataset has been cleaned of stale records. Content represents what we received during the latest crawl.
Created: 2014-10-29 09:40:03.241
Updated: 2014-10-29 09:40:03.241


Author: rdmpage
Comment: As a matter of interest, what happens to the "old" ids? Do we retain a record of their existence and what the records held, or do they just vanish?
Created: 2014-10-29 20:54:20.488
Updated: 2014-10-29 20:54:20.488


Author: trobertson@gbif.org
Comment: The old ids won't be reused, but the records are deleted 
Created: 2014-10-29 22:49:40.806
Updated: 2014-10-29 22:49:40.806


Author: trobertson@gbif.org
Comment: Reopening this, as I believe it might indicate a bug in identifier matching.  [~omeyn@gbif.org] can you please digest the above and comment, accept, or close as appropriate?
Created: 2014-10-31 12:20:29.984
Updated: 2014-10-31 12:20:29.984


Author: omeyn@gbif.org
Created: 2014-10-31 12:51:14.835
Updated: 2014-10-31 12:51:14.835
        
The problem here is that the publisher used dc:identifier and not dwc:occurrenceID. You can see this in the verbatim view of the record, and compare with eg this one: http://www.gbif.org/occurrence/925277090/verbatim which does use occurrenceID. Because we don't use dc:identifier as a lookup we have to resort to the triplet in this case, which means when CC changed it's as if it was a newly published record (no way to find our way to the original - the problem dwc:occID is meant to solve). [~jlegind@gbif.org] please contact the publisher and ask them to populate occId.

For deletions it's as Tim said: we delete the "physical" record and retire its id.
    


Author: rdmpage
Created: 2014-10-31 17:30:03.197
Updated: 2014-10-31 17:30:03.197
        
Huh?! So if a provider supplies an entry in the first column that they regard as the unique id of the row, you ignore it? That seems crazy (do providers know that you do this?).

As an aside, how does a provider assign a term to the id column (column 0)? Looking at the Darwin Core documentation http://www.gbif.org/resources/2557 there's nothing about the type of the <id> tag. Is it always assumed to be dc:identifier? If so, then you are basically preventing people from doing the obvious, sane thing, namely having stable, unique identifiers as the <id> column, and have those treated as the identifier for the record.
    


Author: trobertson@gbif.org
Created: 2014-11-01 12:45:02.541
Updated: 2014-11-01 12:45:02.541
        
Thanks Rod - it is useful to get a frank perspective on this as it has stemmed from the collaborative approach of TDWG standards development which results in a design by committee, with an attempt to accommodate everything.  This inevitably leads to more options and complexity, but the process is very necessary as those are the requests from the data holders and unless their requests are supported they simply disengage.  It's a fairly delicate balance which I think has been managed pretty well given conflicting opinions.  All that said, perhaps we have arrived at a time when a more dictatorial approach will be tolerated, as we can now reference the issues that need to be fixed.  I'll bear this in mind as the W3C work pans out [1] and more expressive models become possible.

I'll try and recall the outcomes that have led to the current status for background information:

The <id> in the DwC-A is an internal (i.e. within the DwC-A) identifier used as the foreign key between extension rows and the core row.  In almost all cases it is the same as dwc:occurrenceID but there was strong opposition to our forcing that, as many people wanted e.g. an integer internal key, but to expose a more complicated occurrenceID for external use, such as an LSID.  Some just wanted to reserve the possibility to do so in the future.
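As a rough illustration of that foreign-key role, the Python sketch below joins extension rows to core rows through the internal id alone, while the externally visible occurrenceID is carried as ordinary data. All values, the `coreid` key name, and the example.org URLs are invented for the example and are not taken from any real archive.

```python
# Core rows: an internal integer-like id plus a separately mapped occurrenceID.
core_rows = [
    {"id": "1", "occurrenceID": "urn:lsid:example.org:specimen:1"},
    {"id": "2", "occurrenceID": "urn:lsid:example.org:specimen:2"},
]

# Extension rows point back to their core row through the internal id only.
extension_rows = [
    {"coreid": "1", "identifier": "http://example.org/images/1a.jpg"},
    {"coreid": "1", "identifier": "http://example.org/images/1b.jpg"},
    {"coreid": "2", "identifier": "http://example.org/images/2a.jpg"},
]

# Group extension rows by the core id they reference.
by_core_id = {}
for row in extension_rows:
    by_core_id.setdefault(row["coreid"], []).append(row)

# Attach each group to its core row; the occurrenceID plays no part in the join.
for core in core_rows:
    core["extensions"] = by_core_id.get(core["id"], [])

for core in core_rows:
    print(core["occurrenceID"], len(core["extensions"]))
```

The join works regardless of what the occurrenceID looks like, which is the freedom the opposing publishers wanted to keep.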

The DwC-A format allows you to map generic terms, and the GBIF profile of the DwC-A core record for occurrence [2] does not mention anything about dc:identifier.  It is possible to share more than the GBIF terms, but only the terms in the GBIF definition will be interpreted.  Thus, while we pass on (verbatim) the dc:identifier, it does not currently play a part in record identification and I'm not aware of any documentation that would suggest that people use it for publishing to GBIF.

The GBIF IPT is the recommended tool for publishing DwC-A robustly to the GBIF network, as it is well controlled and integrates with the GBIF infrastructure - e.g. triggers indexing, and uses the GBIF profile of the DwC-A standards etc.  The DwC-A is effectively the text guidelines in the TDWG standard, and other tools implement this (e.g. the ALA tools), but because it is a fairly flexible format, it can result in the issues you are reporting.

[~omeyn@gbif.org] I think we should consider supporting dc:identifier as a fallback so that:
  - if present, dwc:occurrenceID takes precedence over all others
  - else, if present, dc:identifier takes precedence over the triplet
  - else, resort to triplet of IC+CC+CN
  - When an existing record is found to receive the addition of a dwc:occurrenceID or dc:identifier not already present, it is added to the lookup for that record.
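The proposed order of precedence could be sketched roughly as follows. This is a hypothetical Python illustration of the proposal, not the existing GBIF lookup: `candidate_keys`, `resolve`, the in-memory `lookup` dict, and the integer ids are all invented, dataset scoping is ignored, and the institutionCode and catalogNumber values are assumed.

```python
TRIPLET_FIELDS = ("institutionCode", "collectionCode", "catalogNumber")


def candidate_keys(record):
    """Return lookup keys in the proposed order of precedence."""
    keys = []
    if record.get("occurrenceID"):
        keys.append(("occurrenceID", record["occurrenceID"]))
    if record.get("identifier"):
        keys.append(("identifier", record["identifier"]))
    triplet = tuple(record.get(field) for field in TRIPLET_FIELDS)
    if all(triplet):
        keys.append(("triplet", triplet))
    return keys


def resolve(record, lookup, next_id):
    """Find the existing id for a record, or allocate a new one.

    Returns (record_id, next_id). Every key the record carries is then
    registered, so a newly supplied occurrenceID or identifier becomes
    usable for later lookups of the same record.
    """
    keys = candidate_keys(record)
    record_id = None
    for key in keys:
        if key in lookup:
            record_id = lookup[key]
            break
    if record_id is None:
        record_id = next_id
        next_id += 1
    for key in keys:
        lookup.setdefault(key, record_id)
    return record_id, next_id


lookup = {}
old = {
    "identifier": "urn:occurrence:Arctos:UAM:Mamm:52179:83967",
    "institutionCode": "UAM",   # assumed
    "collectionCode": "Mamm",
    "catalogNumber": "52179",   # assumed
}
new = dict(old, collectionCode="UAM Mammals")

first, next_id = resolve(old, lookup, 1)
second, next_id = resolve(new, lookup, next_id)
print(first == second)
```

In this sketch the changed collectionCode no longer creates a second record, because the dc:identifier match is tried before the triplet. Note that the sketch also registers the new triplet against the existing record, which goes slightly beyond the last bullet above.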

[1] http://w3c.github.io/csvw/use-cases-and-requirements/ (see use case 21)
[2] http://rs.gbif.org/core/dwc_occurrence.xml
    


Author: rdmpage
Created: 2014-11-01 13:04:53.905
Updated: 2014-11-01 13:04:53.905
        
Thanks for the clarification. I think adding the <id> as a fallback would help, especially if some providers are using it as the globally unique id for the record. Having just been at the latest TDWG identifiers workshop, our community really seems determined to make this stuff as hard as we possibly can (sigh).

One thing I'm still not clear on is what term is associated with the <id>, and how. For example, every meta.xml I've seen defines external terms for each column of the data EXCEPT for the first (i.e., we always see 






{code}


In retrospect I now tend to think the use cases presented were not strong enough to warrant being accepted as requirements, which then defined the standard.