Issue 18250

ROM vertebrate specimens in triplicate

18250
Reporter: rdmpage
Assignee: jlegind
Type: Task
Summary: ROM vertebrate specimens in triplicate
Priority: Unassessed
Status: Open
Created: 2016-02-20 16:46:30.012
Updated: 2016-02-22 10:21:11.38
Resolved: 2016-02-20 16:46:30.002
        
Description: Vertebrate specimens from ROM are present in THREE copies. For example, for mammals there are two datasets, "Mammal Specimens" http://www.gbif.org/dataset/84bac07c-f762-11e1-a439-00145eb45e9a and "Mammalogy Collection - Royal Ontario Museum" http://www.gbif.org/dataset/c5c4a23e-2035-4416-ab64-032d6df52ddb. This accounts for one case of duplication (e.g., ROM 46127), but then within "Mammalogy Collection - Royal Ontario Museum" we have a further two "ROM 46127" records, one with occurrenceID "URI:catalog:ROM:Mammals:46127" (why "URI" instead of "URN" is anybody's guess).

So, I think we have:

a) an old data set that should be retired (1 copy of record)

b) a new data set whose occurrence IDs were changed or added between indexing runs, so the re-indexed records have been treated as new records (2 copies of record).
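
To make the pattern above concrete, here is a rough sketch of how such copies could be spotted: group records by institution code and catalogue number and report any key that occurs more than once. The record layout, the "datasetKey" field name, and the sample rows are assumptions for illustration only; the one value taken from the actual data is the occurrenceID "URI:catalog:ROM:Mammals:46127".

```python
from collections import defaultdict


def find_duplicates(records):
    """Return {(institutionCode, catalogNumber): [records]} for keys seen more than once."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["institutionCode"], rec["catalogNumber"])
        groups[key].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


# Illustrative rows only: one copy in the old dataset, two in the new one.
# The missing occurrenceIDs (None) are an assumption about how the copies differ.
sample = [
    {"datasetKey": "84bac07c-f762-11e1-a439-00145eb45e9a",
     "institutionCode": "ROM", "catalogNumber": "46127", "occurrenceID": None},
    {"datasetKey": "c5c4a23e-2035-4416-ab64-032d6df52ddb",
     "institutionCode": "ROM", "catalogNumber": "46127", "occurrenceID": None},
    {"datasetKey": "c5c4a23e-2035-4416-ab64-032d6df52ddb",
     "institutionCode": "ROM", "catalogNumber": "46127",
     "occurrenceID": "URI:catalog:ROM:Mammals:46127"},
]

duplicates = find_duplicates(sample)
for (institution, catalog_number), recs in duplicates.items():
    print(institution, catalog_number, "appears", len(recs), "times")
```

A real check would also need to take the collection into account, since catalogue numbers can repeat across collections within one institution.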

This sort of thing drives me nuts. It would be nice if data providers checked that they have only one version of the data in GBIF, and it would be great if GBIF could be a little more cautious about how it interprets occurrenceIDs. If a dataset being indexed has them whereas the last version didn't, chances are it's the same dataset. Likewise, if there's a discrepancy between the number of occurrences reported by GBIF and the data provider (e.g., in the provided Darwin Core archive) then that should be a red flag that something is up (ROM says it has 120,000 records, GBIF says over 200,000).
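
As a sketch of the two sanity checks suggested above (not a description of how GBIF indexing actually works), something like the following could run before a re-indexed dataset is accepted. The function name, its parameters, and the 5% tolerance are all hypothetical; the counts in the usage line are the approximate figures quoted above.

```python
def indexing_red_flags(previous_had_ids, current_has_ids,
                       provider_count, indexed_count, tolerance=0.05):
    """Return a list of warnings for a dataset about to be re-indexed.

    tolerance is an assumed threshold: the relative difference between the
    provider's record count and the indexed count that is still acceptable.
    """
    flags = []
    # occurrenceIDs appearing for the first time probably describe the same
    # records as before, so they should not automatically create new ones.
    if current_has_ids and not previous_had_ids:
        flags.append("occurrenceIDs newly added since last indexing")
    # A large gap between provider and indexed counts suggests duplication
    # (or loss) somewhere in the pipeline.
    if provider_count > 0:
        relative_gap = abs(indexed_count - provider_count) / provider_count
        if relative_gap > tolerance:
            flags.append(
                f"record count mismatch: provider {provider_count}, "
                f"indexed {indexed_count}"
            )
    return flags


# Approximate ROM figures from this report: 120,000 provided, over 200,000 indexed.
for warning in indexing_red_flags(False, True, 120_000, 200_000):
    print("RED FLAG:", warning)
```

Either warning on its own would be reason enough to hold the update for a manual look rather than publish the extra records.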

This problem affects other ROM datasets such as fish.