Issue 17318

Occurrence ID lookup not working

Reporter: trobertson
Assignee: omeyn
Type: Bug
Summary: Occurrence ID lookup not working
Priority: Blocker
Resolution: Fixed
Status: Closed
Created: 2015-02-25 13:37:15.354
Updated: 2015-12-14 19:44:13.371
Resolved: 2015-12-14 19:44:13.286
        
Description: We are seeing duplicate records for occurrences when the following scenario happens:

A record was indexed in 2013 with institutionCode, collectionCode and catalogNumber.  This is shown in HBase as:
{code}
 b929f23d-290f-4e85-8f17-764c55b3b284|BPBM|BISH|44941|null                      column=o:i, timestamp=1378985997941, value=\x02(\xBCy
{code}

The publisher ([~larussell] on behalf of the Bishop Museum) added occurrenceID and was very careful to ensure the triplets remained the same.  On publishing, the records in the dataset were duplicated.

In HBase we then see:
{code}
hbase(main):002:0> scan 'prod_b2_occurrence_lookup', {STARTROW => 'b929f23d-290f-4e85-8f17-764c55b3b284|94e03835-9733-495e-ae2c-0a6653126d86', ENDROW => 'b929f23d-290f-4e85-8f17-764c55b3b284|94e03835-9733-495e-ae2c-0a6653126d87'}
ROW                                                                             COLUMN+CELL
 b929f23d-290f-4e85-8f17-764c55b3b284|94e03835-9733-495e-ae2c-0a6653126d86      column=o:i, timestamp=1424838582424, value=? !\x85
 b929f23d-290f-4e85-8f17-764c55b3b284|94e03835-9733-495e-ae2c-0a6653126d86      column=o:s, timestamp=1424838582426, value=ALLOCATED
1 row(s) in 0.0090 seconds
{code}
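
The row keys in the two listings above follow a visible pattern: the dataset key, then either the triplet fields or the occurrenceID, joined with `|`. A minimal Python sketch of that layout follows. The function names are hypothetical, and reading the trailing `null` as an absent fourth identifier field (called `unit_qualifier` here) is an assumption, not something stated in this issue. The scan bounds mirror the shell command above, where ENDROW differs from STARTROW only in its last character because HBase treats the stop row as exclusive.

```python
def triplet_key(dataset_key, institution_code, collection_code,
                catalog_number, unit_qualifier=None):
    """Build a triplet-based lookup row key; missing parts render as 'null'."""
    parts = [dataset_key, institution_code, collection_code,
             catalog_number, unit_qualifier]
    return "|".join("null" if p is None else p for p in parts)


def occurrence_id_key(dataset_key, occurrence_id):
    """Build an occurrenceID-based lookup row key."""
    return f"{dataset_key}|{occurrence_id}"


def scan_bounds(row_key):
    """Return (start, stop) for a scan matching exactly one row key.

    The stop row is exclusive, so bump the last character by one.
    """
    return row_key, row_key[:-1] + chr(ord(row_key[-1]) + 1)


dataset = "b929f23d-290f-4e85-8f17-764c55b3b284"
old_row = triplet_key(dataset, "BPBM", "BISH", "44941")
new_row = occurrence_id_key(dataset, "94e03835-9733-495e-ae2c-0a6653126d86")
start, stop = scan_bounds(new_row)
```

The point of the sketch is that the 2013 row and the 2015 row live under unrelated row keys, so nothing in the table itself ties the new occurrenceID entry back to the old triplet entry.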

Looking at the code, it appears that while iterating over the possible lookup keys, a new identifier is minted unless a key's status is seen to be ALLOCATED.

The original record did not have a status, and thus this line is never executed:
https://github.com/gbif/occurrence/blob/master/occurrence-persistence/src/main/java/org/gbif/occurrence/persistence/keygen/HBaseLockingKeyService.java#L80

Consequently, this line is never executed:
https://github.com/gbif/occurrence/blob/master/occurrence-persistence/src/main/java/org/gbif/occurrence/persistence/keygen/HBaseLockingKeyService.java#L89

As a result, foundKey is always null and a new key is minted here:
https://github.com/gbif/occurrence/blob/master/occurrence-persistence/src/main/java/org/gbif/occurrence/persistence/keygen/HBaseLockingKeyService.java#L164

Please note that these records represent the original records loaded in 2013 when we migrated to HBase.  Is it possible that they are all accidentally missing a status column?
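
The suspected failure mode described above can be sketched as follows. This is a simplified Python model, not the code in HBaseLockingKeyService: the lookup table is a plain dict, the function and variable names are hypothetical, the integer occurrence keys are made up for illustration, and persisting a newly minted key is left out.

```python
import itertools

ALLOCATED = "ALLOCATED"


def resolve_key(lookup_table, candidate_keys, mint):
    """Return an existing occurrence key, or mint a new one.

    lookup_table maps a row key to a dict with 'i' (the occurrence key)
    and, optionally, 's' (the status). A row only counts as a match
    when its status is ALLOCATED.
    """
    found_key = None
    for row_key in candidate_keys:
        row = lookup_table.get(row_key)
        if row is not None and row.get("s") == ALLOCATED:
            found_key = row["i"]
            break
    if found_key is None:
        # No ALLOCATED row was seen, so the record is treated as new.
        found_key = mint()
    return found_key


triplet = "b929f23d-290f-4e85-8f17-764c55b3b284|BPBM|BISH|44941|null"
occ_id = "b929f23d-290f-4e85-8f17-764c55b3b284|94e03835-9733-495e-ae2c-0a6653126d86"
counter = itertools.count(5000)  # stand-in for the real key generator


def mint():
    return next(counter)


# Row as loaded in 2013: an identifier but no status column.
legacy_table = {triplet: {"i": 1001}}
duplicate = resolve_key(legacy_table, [triplet, occ_id], mint)

# Same row with the status present.
patched_table = {triplet: {"i": 1001, "s": ALLOCATED}}
reused = resolve_key(patched_table, [triplet, occ_id], mint)
```

In this model the legacy row yields a freshly minted key (a duplicate record), while the row carrying ALLOCATED returns the existing key 1001.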
    


Author: omeyn@gbif.org
Comment: I added the ALLOCATED status to all lookups, as its absence appears to have been an oversight. Tried the dataset again, but saw the same problem. After that I rebuilt all lookups for this dataset from the existing occurrences and reran the crawl: same problem. Investigation continues.
Created: 2015-03-11 15:16:25.329
Updated: 2015-03-11 15:16:25.329


Author: omeyn@gbif.org
Comment: Was an actual code bug. Fixed in https://github.com/gbif/occurrence/commit/76efcbac3f11c151fc94e300dc89aa41a82f26ca (note: a bad sysout was removed afterwards). Tested the Bishop dataset in UAT and it worked nicely. Pending occ release and deploy before running in prod.
Created: 2015-03-13 15:52:06.225
Updated: 2015-03-13 15:52:06.225


Author: omeyn@gbif.org
Created: 2015-03-23 09:37:23.102
Updated: 2015-03-23 09:37:23.102
        
But we weren't even hitting that bug. Here's the validation report from dwca-validator:

crawler-dwca-validator.2015-03-20.log:INFO  [2015-03-20 14:54:27,002+0100] [pool-1-thread-1] org.gbif.crawler.dwca.validator.ValidatorService: Finished validating DwC-A for dataset [b929f23d-290f-4e85-8f17-764c55b3b284], valid? is [true]. Full report [DwcaValidationReport{datasetKey=b929f23d-290f-4e85-8f17-764c55b3b284, invalidationReason=null, occurrenceReport=OccurrenceValidationReport{checkedRecords=419047, uniqueTriplets=415026, recordsWithInvalidTriplets=959, uniqueOccurrenceIds=419047, recordsMissingOccurrenceId=0, allRecordsChecked=true, invalidationReason=null, valid=true}, checklistReport=null}]

This shows both invalid triplets (missing collection code, I think) and, more importantly, duplicate triplets: only 415026 unique triplets across 419047 checked records. The fragment processor takes this report and decides, based on the duplicated triplets, that it shouldn't use triplets at all. Because the occurrenceIDs are all good (419047 unique, none missing), it uses those exclusively, and because we haven't seen them before, all records are considered new.
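
The decision described above can be illustrated with the numbers from the report. This is a minimal Python sketch under stated assumptions: the exact rule the fragment processor applies is not shown in this issue, so the comparisons below (an identifier type is usable only if every record that has it has a unique value) are an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OccurrenceValidationReport:
    checked_records: int
    unique_triplets: int
    records_with_invalid_triplets: int
    unique_occurrence_ids: int
    records_missing_occurrence_id: int


def usable_identifiers(report):
    """Return (use_triplets, use_occurrence_ids) for a validation report."""
    # Assumed rule: triplets are usable only if all valid triplets are unique.
    valid_triplets = report.checked_records - report.records_with_invalid_triplets
    use_triplets = report.unique_triplets == valid_triplets
    # Assumed rule: occurrenceIDs are usable only if all present IDs are unique.
    present_ids = report.checked_records - report.records_missing_occurrence_id
    use_occurrence_ids = report.unique_occurrence_ids == present_ids
    return use_triplets, use_occurrence_ids


# Values copied from the dwca-validator log line above.
report = OccurrenceValidationReport(
    checked_records=419047,
    unique_triplets=415026,
    records_with_invalid_triplets=959,
    unique_occurrence_ids=419047,
    records_missing_occurrence_id=0,
)
print(usable_identifiers(report))  # (False, True)
```

With triplets ruled out, only occurrenceID-based keys are looked up, and since none of those existed before this crawl, every record resolves to a new key.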
    


Author: jlegind@gbif.org
Created: 2015-03-23 11:12:54.454
Updated: 2015-03-23 11:12:54.454
        
Publisher has been contacted about crawling the dataset to get the occurrenceIDs in and deleting the old records.
This will be a 'hard' migration that creates new record IDs.
    


Author: omeyn@gbif.org
Comment: Fixed with the commits that went in at the time; now in production.
Created: 2015-12-14 19:44:13.337
Updated: 2015-12-14 19:44:13.337