Issue 18103

Auto-delete occurrences that are no longer in a source DwC-A

Reporter: peterdesmet
Assignee: cgendreau
Type: Improvement
Summary: Auto-delete occurrences that are no longer in a source DwC-A
Resolution: Fixed
Status: Closed
Created: 2015-12-18 16:34:23.766
Updated: 2018-05-31 16:16:11.401
Resolved: 2018-05-31 16:16:04.541
        
        
Description: On republishing our datasets, we often encounter discrepancies between the number of occurrences in our source dataset and the number of occurrences displayed on GBIF. GBIF overestimates the number of occurrences because it does not auto-delete occurrences that were deleted from the source dataset (see also http://dev.gbif.org/issues/browse/POR-2343).

This is annoying, because:

1. We broadcast the DOI of our datasets wherever we can, so the GBIF dataset page is the landing page for our datasets, with an incorrect number of occurrences.
2. To solve this, we have to periodically check for discrepancies, report these (e.g. https://github.com/LifeWatchINBO/data-publication/issues/14 or https://github.com/LifeWatchINBO/data-publication/issues/90) and email portal@gbif.org to ask for their manual removal.
3. We cannot easily verify whether all records are indexed by GBIF. We are currently trying to solve an issue of missing occurrences, but because the deleted occurrences are still there, we cannot compare numbers.
4. Publishers might have good reasons to delete records from a dataset: they were erroneous, they were published before an embargo period ended, etc. Having them still on GBIF doesn't breed confidence in organizations that are still a bit reluctant to publish data.

I understand that auto-delete is not used for DiGIR providers, where GBIF almost acts as a repository in case the provider is no longer responsive, but I think the behaviour should be different for Darwin Core archives, especially on republication. There the publisher actively triggers a republication, and this should trigger an auto-delete for records that are no longer present in that dataset.
    
Attachment stats.tsv


Author: trobertson@gbif.org
Created: 2015-12-18 16:47:25.377
Updated: 2015-12-18 16:47:25.377
        
The reason this was not done originally is that record identifiers were often unstable, and in many cases people accidentally made mistakes.
How about a solution that deletes automatically, provided that the total number of records being deleted does not breach a threshold, e.g. 25% of the total records in the dataset?

This would catch the majority of issues, but still leave room for human intervention in the worst cases, e.g. where a user accidentally publishes a dataset that duplicates the whole set, and would rather delete the new records and fix the old ones, preserving existing IDs.
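
A minimal sketch of such a guard (hypothetical names, in Python; nothing here is the actual crawler code):

DELETE_THRESHOLD = 0.25  # e.g. 25% of the records currently indexed

def records_to_delete(indexed_ids, archive_ids):
    # IDs in the index but no longer in the newly published archive.
    missing = set(indexed_ids) - set(archive_ids)
    # Refuse to auto-delete past the threshold, so a human can confirm
    # it is not an accidental republication mistake.
    if indexed_ids and len(missing) > DELETE_THRESHOLD * len(indexed_ids):
        raise RuntimeError(
            f"{len(missing)} of {len(indexed_ids)} records would be "
            "deleted; over threshold, manual confirmation needed")
    return missing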
    


Author: peterdesmet
Comment: @tim: yes, that seems like a good suggestion. As you can see here, our current situation would be covered with 25%: https://github.com/LifeWatchINBO/data-publication/issues/90
Created: 2015-12-18 16:52:32.044
Updated: 2015-12-18 16:52:32.044


Author: mdoering@gbif.org
Created: 2016-02-22 10:20:09.573
Updated: 2016-02-22 10:20:09.573
        
I would favour an even more automated solution that always deletes all missing records. In order to keep GBIF IDs stable, we would need to never fully delete at least the terms responsible for mapping to the GBIF ID, i.e. occurrenceID and the triplet. But we could also consider keeping the entire record and just logically deleting it, as we do in the registry and for backbone usage records. We have so many datasets in a bad state, and they all need continuous manual work to keep the data good, so I think the amount of extra coding is well spent here.
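
A rough sketch of that idea (names are illustrative, not the actual data model), in Python:

from dataclasses import dataclass, field

@dataclass
class IndexedOccurrence:
    gbif_id: int
    occurrence_id: str  # kept so the GBIF ID mapping survives
    triplet: tuple      # institutionCode, collectionCode, catalogNumber
    terms: dict = field(default_factory=dict)  # all other verbatim terms
    deleted: bool = False

def logical_delete(record):
    # Clear the searchable content but keep the mapping terms, so the
    # record is hidden from searches yet reclaims its GBIF ID if it
    # ever reappears in the source archive.
    record.terms.clear()
    record.deleted = True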

    


Author: mblissett
Comment: Logical deletion, and other ways a record can be hidden from various views of the dataset, is probably useful for other kinds of data, for example the poor-quality / unreviewed citizen science observations Donald mentioned this morning.
Created: 2016-02-22 11:35:59.057
Updated: 2016-02-22 11:35:59.057


Author: mblissett
Created: 2016-02-22 15:36:25.109
Updated: 2016-02-22 15:36:25.109
        
I wrote a quick shell script to count the number of occurrences in a DWCA and compare it to the number reported by our API. I'm using the cache on the NFS server. The script assumes one header line in the occurrence file, and doesn't handle newlines in data (so it will be wrong if someone has "abc \n xyz" in many records).
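
For reference, a rough Python equivalent of the comparison (the original was a shell script; the occurrence file name and the use of the public count API are assumptions here):

import urllib.request
import zipfile

def dwca_occurrence_count(path):
    # Count data rows in the archive's occurrence file, assuming one
    # header line and no embedded newlines (same caveats as the script).
    with zipfile.ZipFile(path) as z:
        with z.open("occurrence.txt") as f:
            return sum(1 for _ in f) - 1

def indexed_count(dataset_key):
    # Number of indexed occurrences for the dataset, per the GBIF API.
    url = ("https://api.gbif.org/v1/occurrence/count?datasetKey="
           + dataset_key)
    with urllib.request.urlopen(url) as r:
        return int(r.read())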

The cache has 5843 files.

575 of them aren't valid ZIP files; they are things like 404 pages. This could be a bug, if the dataset disappeared and we lost the good DWCA we did have in the cache.

1674 DWCAs are for deleted datasets, 1637 for checklists, 5 for metadata-only datasets, and 14 for sampling-event datasets.

That leaves 1957, which is about right.

56 datasets have occurrences in their DWCA, but there are zero in the index.

25 have fewer than 75% of what's in the occurrence file in the index.

1781 have between 75% and 125%.

And 96 have over 175% in the index compared to the number in the DWCA, including 39 with exactly double.

    


Author: mblissett
Created: 2016-02-22 15:37:29.301
Updated: 2016-02-22 15:37:29.301
        
Stats as described.

    


Author: mdoering@gbif.org
Comment: Interesting. Checklists might exhibit the same behaviour when they have an attached occurrence extension.
Created: 2016-02-22 16:06:41.831
Updated: 2016-02-22 16:06:41.831


Author: mdoering@gbif.org
Comment: That's about 20 million more records in the index than in the files; quite some deletion needed.
Created: 2016-02-22 16:39:50.638
Updated: 2016-02-22 16:39:50.638


Author: peterdesmet
Created: 2016-02-25 11:06:16.504
Updated: 2016-02-25 11:06:16.504
        
So there are 1674 datasets where the original DwC-A has been removed (accidentally or on purpose)? That seems like a lot: it's almost as many as the valid, completely indexed DwC-As (1957). Or am I misreading the stats?

Also: can you check if these 8 datasets (http://dev.gbif.org/issues/browse/PF-2365) are included in the 56 datasets in your stats?
    


Author: mblissett
Created: 2016-02-25 11:24:04.181
Updated: 2016-02-25 11:24:04.181
        
It does seem odd.  It could be an error in my quick script to get this information, or perhaps some old test datasets.

The 8 are included in the 56.

    


Author: sant
Created: 2016-10-28 19:37:43.374
Updated: 2016-10-28 19:52:43.977
        
Hello, and sorry if I don't have a deep technical understanding of the problem.
Perhaps this should be posted not here but on the IPT mailing list?

My main concern with those "records pending to be deleted" is not the fact that the record counts do not match.
My concern is that people could be downloading several versions of the same record: one which is updated and the others which are not.

I want to ask if this trick could somehow solve that problem. It would be just a temporary manual workaround, which does not solve the issue of auto-deletions, but for my concerns it could be a good solution (if you say it works):

1) For some reason a record in our herbarium collection database was wrong. We decided to delete it.
    So, a record with the occurrenceID xxxxx is no longer present in the DwC-A exposed by our IPT.
    But the old record is still visible in the GBIF portal.

2) I take note of that occurrenceID and I MANUALLY INJECT A BLANK RECORD IN OUR DATABASE, which keeps exposing that occurrenceID value, but nothing else [ (*) ].
    I republish our resource. Afterwards I can delete all those blank records from our database again.

3) When GBIF reindexes, the stable occurrenceID forces an update of that record to blank values in all relevant fields [ (*) ].
    So at least it will stop being returned by any of the typical searches (country, coordinates, scientificName, ...), which is my main concern.

4) Next time I republish to IPT, the record will no longer be present in our DwC-A (because it is not in our database).
    So whenever GBIF decides to delete it (probably soon, because it is a blank record), it should finally disappear.

(*) = Perhaps IPT will not let me republish the resource if there are blanks in mandatory DwC fields?
In that case, perhaps GBIF could establish some kind of "flag" for those fields, so providers could use that flag to request deletion of those records? (That way you are sure this has not been done by mistake, and could even filter these records out so they are not shown in the GBIF data portal.)

e.g. scientificName = "AUTO-DELETE-ME"
(or create a special boolean field to be used as a "delete-me" flag)

I know it sounds dirty, but perhaps it could avoid many record-deletion email requests.
What would happen right now if I followed this strategy and tried to insert almost-blank records?
Would it be more of a problem than a solution?

Thanks

David García
SANT Herbarium


    


Author: trobertson@gbif.org
Created: 2016-10-28 21:31:20.977
Updated: 2016-10-28 21:31:20.977
        
You need an occurrenceID and I think a basisOfRecord (set it to Occurrence / Unknown), but otherwise the strategy is correct.  Please do not put deliberately dirty data like "DELETE ME" in a dataset.

I think, though, that while this issue remains open, mailing a list of GBIF IDs to delete would normally be actioned within a business day and might be the easiest option.

    


Author: sant
Created: 2016-10-29 12:15:06.883
Updated: 2016-10-29 12:15:06.883
        
Of course I wouldn't put wrong data in unless you recommend a certain dirty flag.
The original idea was to simply put blank data.

Just wondering if GBIF could take advantage of flags to mark non-mistakes
(since your first answer mentioned accidentally made mistakes).

So I will try the almost-blank records (occurrenceID + basisOfRecord).
And I will also mail you the list of their gbifids.

Thanks a lot, Tim

David García
SANT Herbarium

    


Author: sant
Created: 2016-12-21 13:45:18.188
Updated: 2016-12-21 13:45:42.453
        
Just to share the experience after trying to apply the strategy suggested above:

- For wrong records intended to be removed from the GBIF portal, we just left everything blank except occurrenceID and basisOfRecord ("Unknown"), but IPT failed to publish the dataset.

- So we ran tests on a single record (removing one field each time and trying to republish) until we found the fields that caused errors.

- It was necessary to provide a real date in the mapped EarliestDateCollected and LatestDateCollected occurrence concepts.
  Otherwise IPT raised errors when trying to publish the dataset.

- The basisOfRecord value "Unknown" was not accepted by IPT. We changed it to "Occurrence" to avoid an error.
  After making that change we got this warning instead, but the dataset was successfully published:
  "5087 line(s) use ambiguous basisOfRecord 'occurrence'. It is advised that occurrence be reserved for cases when the basisOfRecord is unknown. Otherwise, a more specific basisOfRecord should be chosen"

So, records were successfully "hidden" from searches (taxonomic, geographic).
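
To make the recipe concrete, a minimal sketch of generating such "blanking" rows in Python (column names follow this dataset's mapping and are illustrative, not a GBIF requirement):

import csv

FIELDS = ["occurrenceID", "basisOfRecord",
          "EarliestDateCollected", "LatestDateCollected"]

def write_blanking_rows(records, out_path="to_delete.csv"):
    # records: iterable of (occurrenceID, earliest_date, latest_date).
    # Everything except the ID, basisOfRecord and the real collecting
    # dates is left out, matching what IPT accepted above.
    with open(out_path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=FIELDS)
        w.writeheader()
        for occ_id, earliest, latest in records:
            w.writerow({"occurrenceID": occ_id,
                        "basisOfRecord": "Occurrence",
                        "EarliestDateCollected": earliest,
                        "LatestDateCollected": latest})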

Thanks for your help

David García
SANT Herbarium

    


Author: mblissett
Created: 2018-05-31 16:16:04.709
Updated: 2018-05-31 16:16:04.709
        
This was completed in 2016 and has been running well since.

(This is the old issue tracker, any problems should be filed on GitHub.)