Issue 12700

Missing dataset keys in occurrence records

12700
Reporter: fmendez
Assignee: omeyn
Type: Bug
Summary: Missing dataset keys in occurrence records
Priority: Blocker
Resolution: Fixed
Status: Closed
Created: 2013-02-05 11:42:45.48
Updated: 2013-12-17 15:16:42.909
Resolved: 2013-03-06 14:45:12.439
        
Description: There are 10,569,089 occurrence records without a dataset key. The list of those records can be retrieved from Solr:
http://boma.gbif.org:8080/occurrence-solr/select?q=-dataset_key:*
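A minimal sketch of how the count alone could be fetched instead of paging through the records. It assumes this select handler accepts the standard Solr `rows` and `wt` parameters; the helper names `missing_field_count_url` and `num_found` are hypothetical, not part of any GBIF code.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Solr endpoint taken from the description above.
SOLR_SELECT = "http://boma.gbif.org:8080/occurrence-solr/select"


def missing_field_count_url(field, base=SOLR_SELECT):
    """Build a select URL matching documents that have no value for `field`."""
    params = {
        "q": f"-{field}:*",  # pure negative query: the field is absent
        "rows": 0,           # return no documents, only the match count
        "wt": "json",        # ask for a JSON response
    }
    return f"{base}?{urlencode(params)}"


def num_found(solr_json):
    """Read the match count from a parsed Solr JSON response."""
    return solr_json["response"]["numFound"]
```

Calling `missing_field_count_url("dataset_key")` yields a URL whose response carries the number of affected records in `response.numFound`.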

An example of one of those records shows the absence of the dataset key field:

http://staging.gbif.org:8080/occurrence-ws/occurrence/30189966


Author: omeyn@gbif.org
Comment: This was a bug in registry-ws, which returned null data in the JSON for datasets sharing the same id as an organization.
Created: 2013-02-07 15:03:36.231
Updated: 2013-02-07 15:03:36.231

Author: fmendez@gbif.org
Created: 2013-02-11 12:16:29.289
Updated: 2013-02-11 12:16:29.289
        
After the latest fix, there are still occurrence records without a dataset_key, although fewer than before: 5,396,515. Some of those occurrence record keys are:

Author: omeyn@gbif.org
Comment: All of the records that have missing dataset keys have data resource ids that the staging registry-ws does not know about, presumably because they've been deleted from the registry between the time of the last rollover and when the latest copy was put on staging. The solution is to either delete the occurrence records in question or re-import from the latest rollover (copied from mysql feb 11). I've chosen to re-import.
Created: 2013-02-21 08:53:28.462
Updated: 2013-02-21 08:53:28.462

Author: trobertson@gbif.org
Created: 2013-02-21 10:14:02.074
Updated: 2013-02-21 10:14:02.074
        
In live operations (i.e. eventually), a change to a dataset in the registry will broadcast a message. The occurrence store will then start working through the resulting changes (e.g. deletions). During that period, systems will have to tolerate the inconsistency.

The issue shows so clearly because a deleted dataset currently returns 404 in the registry web services (e.g. [1]).  This is being addressed in registry2 so that it will actually return the record, but with the deleted timestamp.

Even following a reimport, I would not be surprised to find records with missing keys (since we know the registry is too relaxed on constraint checks).  I would expect you might need to import, and then delete.

[1] http://staging.gbif.org:8080/registry-ws/dataset/cd602780-c183-4b74-b541-61f1d42f204c
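A rough sketch of how a client might tolerate both behaviours described above: the current 404 for a deleted dataset, and the planned registry2 response that returns the record with a deleted timestamp. The function name `dataset_state` and the JSON field name `deleted` are assumptions for illustration; the actual registry2 response shape is not given in this issue.

```python
import json


def dataset_state(status_code, body=""):
    """Classify a registry dataset lookup as 'active', 'deleted' or 'unknown'."""
    if status_code == 404:
        # Current behaviour: a deleted dataset cannot be told apart
        # from one the registry never knew about.
        return "unknown"
    if status_code != 200:
        raise ValueError(f"unexpected status {status_code}")
    record = json.loads(body)
    if not isinstance(record, dict):
        # Covers a null JSON body, which also gives no usable dataset.
        return "unknown"
    # Assumed registry2 shape: a non-empty "deleted" timestamp marks deletion.
    return "deleted" if record.get("deleted") else "active"
```

Treating 404 as "unknown" rather than "deleted" keeps the caller from deleting occurrences during the window in which the registry and the occurrence store disagree.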
    

Author: omeyn@gbif.org
Created: 2013-02-22 17:43:38.308
Updated: 2013-02-22 17:43:38.308
        
A deletion service now exists as described here: http://dev.gbif.org/wiki/display/INT/Occurrence+Deletion. The cli code is uncommitted for reasons described in OCC-157, but that shouldn't take too long to fix.

I've reimported from the latest portal_rollover on mogo, and [~jcuadra@gbif.org] copied the latest registry from live to staging, but unfortunately there are still 77M records without dataset keys. That appears to be because of problems in the registry which Jose is investigating.
    

Author: omeyn@gbif.org
Comment: 72M of those were due to an eBird re-registration, which is now sorted in the staging registry. The remaining 5M or so records with no UUID have been deleted (they belonged to datasets deleted in the registry). The staging_occurrence table should therefore be OK now.
Created: 2013-03-05 15:17:30.841
Updated: 2013-03-05 15:17:30.841

Author: omeyn@gbif.org
Comment: There were a further 200k or so records that had dataset UUIDs, but the registry had those UUIDs marked as deleted. Those occurrences have now been deleted too, and the table is clean.
Created: 2013-03-06 14:45:03.134
Updated: 2013-03-06 14:45:03.134
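
The selection rule behind the two clean-up passes in the last comments (records with no dataset UUID, and records whose dataset UUID is marked as deleted in the registry) can be sketched as below. This is an illustration only, not the code that was run: the function name and the `key` and `dataset_key` field names are hypothetical, and the sample values are made up.

```python
def occurrences_to_delete(occurrences, deleted_dataset_keys):
    """Return keys of occurrences with no dataset key or with a deleted dataset."""
    doomed = []
    for occ in occurrences:
        dataset_key = occ.get("dataset_key")
        # Pass 1: no dataset key at all. Pass 2: dataset marked deleted.
        if dataset_key is None or dataset_key in deleted_dataset_keys:
            doomed.append(occ["key"])
    return doomed


# Made-up sample data for illustration.
sample = [
    {"key": 1, "dataset_key": "uuid-a"},
    {"key": 2},
    {"key": 3, "dataset_key": "uuid-b"},
]
to_delete = occurrences_to_delete(sample, {"uuid-b"})
```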