Wholesale duplication of Australian Museum provider for OZCAM
18172
Reporter: rdmpage
Assignee: jlegind
Type: Feedback
Summary: Wholesale duplication of Australian Museum provider for OZCAM
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2016-01-25 13:58:38.422
Updated: 2016-06-07 10:54:03.567
Resolved: 2016-06-07 10:54:03.435
Description: This dataset http://www.gbif.org/dataset/dce8feb0-6c89-11de-8225-b8a03c50a862 is twice the size it should be, suggesting massive duplication. GBIF reports 2,283,117 records, whereas OZCAM reports 1,226,190 records: http://collections.ala.org.au/public/show/dr340
I downloaded the DwC archive, and running {{wc -l < occurrence.csv}} gives 1,226,190 rows.
So GBIF has this data twice over. Given the issue [PF-2314], what I think has happened is that between indexing runs OZCAM added UUIDs (http://gbif.org/occurrence/1100616447 has a UUID, http://gbif.org/occurrence/485281448 doesn't), and GBIF has therefore treated the records with UUIDs as new rather than as replacements for the previous records without them.
[Breathes deeply, and counts to ten]. We really should do something about this; it's crazy that we can so easily end up with millions of duplicate records!
Author: rdmpage
Comment: Discovered this issue while looking at PF-2314
Created: 2016-01-25 13:59:23.64
Updated: 2016-01-25 13:59:23.64
Author: mdoering@gbif.org
Created: 2016-01-25 14:19:40.644
Updated: 2016-01-25 14:19:40.644
Rod, I completely agree. This ends up being a huge waste of effort for everyone.
At least for DwC archives, which contain all the data of a published dataset, we should delete all previous records if they do not appear in the latest archive. Historic ids should be kept in the index (we do logical deletes) so we can revive a GBIF identifier in case the record shows up again at a later stage. I think we first need to convince Tim though, see http://dev.gbif.org/issues/browse/PF-2279
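A minimal sketch of that logical-delete idea (hypothetical Python, not GBIF's actual indexing code; the in-memory {{index}} map and {{sync_dataset}} are made up for illustration):
{code:python}
import itertools

_gbif_ids = itertools.count(1)  # stand-in for the real GBIF identifier sequence

def sync_dataset(index, latest_archive_ids):
    """index maps occurrenceID -> {'gbif_id': int, 'deleted': bool}."""
    # Logically delete anything that vanished from the latest archive;
    # the GBIF identifier stays in the index.
    for occ_id, record in index.items():
        if occ_id not in latest_archive_ids:
            record['deleted'] = True
    # Revive returning records under their old GBIF identifier, and only
    # mint new identifiers for genuinely new occurrenceIDs.
    for occ_id in latest_archive_ids:
        if occ_id in index:
            index[occ_id]['deleted'] = False
        else:
            index[occ_id] = {'gbif_id': next(_gbif_ids), 'deleted': False}
{code}
A record that drops out of one crawl and reappears in the next would keep its original GBIF identifier, which is exactly the stability being asked for here.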
Author: rdmpage
Comment: I guess we can work on him when he gets back. What's becoming clear to me is that different providers manage their data very differently: some "get" the importance of stable ids for records, others don't, and this has consequences for GBIF. I'm beginning to think that GBIF needs some people working full time on data compliance/quality/indexing errors, and/or we open it up, put the data on GitHub, and let people flag this stuff. At a minimum it would be great to have much more visible quality/indexing flags on GBIF pages (for example, why don't we display links to JIRA issues for each record that's been flagged, and aggregate those for each dataset?)
Created: 2016-01-25 14:37:38.309
Updated: 2016-01-25 14:37:38.309
Author: jlegind@gbif.org
Created: 2016-01-26 09:43:40.896
Updated: 2016-01-26 09:43:40.896
Hi Rod, we are going to decide on rules for the auto-deletion of stale records in the near future and implement them thereafter.
One of the things we are going to look for is cases of exact duplication of datasets, which would suggest that record identifiers were changed (perhaps for no good reason).
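One way such a duplication check could look (a hedged sketch only; {{content_hash}} and {{looks_like_id_churn}} are hypothetical helpers, not part of the GBIF pipeline):
{code:python}
import hashlib

def content_hash(record, id_fields=('occurrenceID',)):
    # Hash the record with its identifier fields stripped, so two records
    # that differ only in their ids hash identically.
    items = sorted((k, str(v)) for k, v in record.items() if k not in id_fields)
    return hashlib.sha256(repr(items).encode('utf-8')).hexdigest()

def looks_like_id_churn(indexed_records, archive_records, threshold=0.9):
    # True if most records in the incoming archive duplicate already-indexed
    # content but arrive under identifiers we have never seen before.
    if not archive_records:
        return False
    indexed_hashes = {content_hash(r) for r in indexed_records}
    indexed_ids = {r.get('occurrenceID') for r in indexed_records}
    churned = sum(1 for r in archive_records
                  if content_hash(r) in indexed_hashes
                  and r.get('occurrenceID') not in indexed_ids)
    return churned / len(archive_records) >= threshold
{code}
A crawl that trips a check like this is probably an identifier change rather than new data, and could be held for review instead of being indexed as millions of new records.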
Author: rdmpage
Created: 2016-01-26 16:31:55.866
Updated: 2016-01-26 16:31:55.866
[~jlegind@gbif.org] Thanks for the update. Quick question, are duplicates physically deleted or only logically deleted? Removing duplicates has the problem that people (like me) may have linked to occurrences that subsequently get removed.
Perhaps we could have a system where "deleted" occurrences are still accessible. Say, a web user would see a page saying "this record has been superseded" (or whatever) with a link to the replacement (if one exists). An API user might get a 301 ("moved permanently"), which would give them the option of simply following the redirect and ignoring the 301, or of retrieving the original record.
As an example, GenBank has superseded a lot of DNA barcode records, but you can still see them on their web site.
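A sketch of how a client might use the proposed behaviour (assuming the 301 responses described above; the real http://api.gbif.org/v1/occurrence/{id} endpoint does not do this today, and {{fetch_occurrence}} is hypothetical):
{code:python}
import requests

OCC_URL = 'http://api.gbif.org/v1/occurrence/{}'  # real endpoint; the 301 behaviour below is only the proposal

def fetch_occurrence(occ_id, follow_supersedes=True):
    resp = requests.get(OCC_URL.format(occ_id),
                        allow_redirects=follow_supersedes)
    if resp.status_code == 301 and not follow_supersedes:
        # Caller opted out of following the redirect and just wants to
        # know that the record was superseded, and by what.
        return {'superseded_by': resp.headers.get('Location')}
    resp.raise_for_status()
    return resp.json()
{code}
Callers that don't care would follow the redirect transparently; callers that do care get the supersession made explicit rather than a dead link.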
Author: jlegind@gbif.org
Created: 2016-01-28 12:26:49.027
Updated: 2016-01-28 12:26:49.027
[~rdmpage] We do hard deletions on records in the GBIF occurrence store. The only way to retrieve deleted records is to look in the quarterly snapshots that we take of the database, which contain only a subset of the columns.
I am not aware of any plans to implement logical deletions. But that could change.
Author: rdmpage
Comment: [~jlegind@gbif.org] Yeah, I was afraid of that. I guess there are potential data storage issues, but breaking links by design is a Bad Thing™. It's really hard to encourage people to use identifiers if GBIF itself routinely breaks links. I've encountered datasets with links to GBIF (such as the infamous chameleon Red List dataset) where almost all the links are broken (and not just because of the death of http://data.gbif.org). It would also be good to think about what happens if and when GBIF starts clustering occurrences that are the same due to duplication ACROSS datasets (e.g., museum specimens that are also in DNA barcode databases).
Created: 2016-01-28 12:33:24.991
Updated: 2016-01-28 12:33:24.991
Author: mdoering@gbif.org
Created: 2016-01-28 13:12:58.176
Updated: 2016-01-28 13:12:58.176
[~jlegind@gbif.org] maybe we should consider logical deletion sooner rather than later. It would at least avoid implementing complicated deletion rules and should not be hard to implement. I'm just not sure about the growing HBase storage size.
Rod is right that in order to encourage the use of GBIF identifiers they really need to stick and resolve to something. We do logical deletions for datasets, and also for species in the GBIF backbone now.
Author: kbraak@gbif.org
Created: 2016-06-03 09:48:39.931
Updated: 2016-06-03 09:48:39.931
[~jlegind@gbif.org]
OZCAM now reports 1,307,688 records
http://collections.ala.org.au/public/show/dr340
GBIF now reports 1,302,896 records
http://www.gbif.org/dataset/dce8feb0-6c89-11de-8225-b8a03c50a862
The massive duplication appears to be gone, but about 5,000 records were dropped or need to be re-crawled.
Author: jlegind@gbif.org
Comment: Dataset recrawled and redundant records deleted
Created: 2016-06-07 10:54:03.512
Updated: 2016-06-07 10:54:03.512