Issue 11948

The processing did not delete the non-updated records after the DwC archive finished indexing

11948
Reporter: jlegind
Assignee: kbraak
Type: Task
Summary: The processing did not delete the non-updated records after the DwC archive finished indexing
Priority: Major
Resolution: WontFix
Status: Closed
Created: 2012-09-26 12:01:17.297
Updated: 2016-02-22 10:14:55.465
Resolved: 2013-12-09 15:03:29.953
        
Description: For resource ID 13674 (University of Kansas, SNOW Entomology, published through their IPT), 109979 records were not updated, and the processing did not subsequently delete them. The deletion had to be done manually.

Since 578 records were missing complete GBIF triplets (see the log below), this might have blocked the automatic deletion trigger (pure conjecture on my part).
We should discuss the conditions under which auto deletion is suspended. For example, if one element of the triplet were left out of a large part of an archive, legitimate records would be deleted. At what point do we want a deletion to be reviewed before it commences?
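
To make the question concrete, here is a minimal sketch of one possible rule for holding a deletion for review. It does not describe how the HIT currently behaves. The function name, its parameters, and both thresholds are assumptions chosen for illustration, and it assumes the incomplete-triplet count can be compared against the number of records updated or created.

```python
def deletion_decision(stale, touched, incomplete,
                      max_delete_fraction=0.10,
                      max_incomplete_fraction=0.01):
    """Decide what to do with records that were not updated by a run.

    stale      -- records of the resource that the run did not touch
    touched    -- records updated or created by the run
    incomplete -- records in the archive with an incomplete triplet
    Both thresholds are hypothetical values, not agreed policy.
    Returns "nothing", "delete", or "review".
    """
    if stale == 0:
        return "nothing"
    # Share of the resource that would disappear if we deleted now.
    delete_fraction = stale / (stale + touched)
    # Share of the archive with a broken triplet; no touched records
    # at all is treated as the worst case.
    incomplete_fraction = incomplete / touched if touched else 1.0
    if (delete_fraction > max_delete_fraction
            or incomplete_fraction > max_incomplete_fraction):
        return "review"
    return "delete"


# Counts taken from this issue: 109979 not updated, 798010 updated or
# created, 578 incomplete triplets.
print(deletion_decision(109979, 798010, 578))
```

With these illustrative thresholds the run reported here would be held for review because of the share of records to be deleted, not because of the incomplete triplets.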

Here is a summary from the log:

2012-09-24 16:27:19,113  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - start.issueMetadata
2012-09-24 16:27:19,132  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - dwcarchivemetadatahandler.download.start
2012-09-24 16:27:20,970  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - dwcarchivemetadatahandler.download.end
2012-09-24 16:27:20,974  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - Case 2: processing EML with an associated DWC-ARCHIVE either zip archive or CSV
2012-09-24 16:27:20,977  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - dwcarchivemetadatahandler.download.start
2012-09-24 16:27:38,731  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - dwcarchivemetadatahandler.download.decompress
2012-09-24 16:27:42,040  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - dwcarchivemetadatahandler.download.end
2012-09-24 16:28:13,641  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - The eml.xml validates successfully with GBIF EML profile
2012-09-24 16:28:13,645  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - First attempt to parse metadata - is it described using the GBIF Metadata profile?
2012-09-24 16:28:13,654  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - eml.xml file corresponds to EML version 7
2012-09-24 16:28:13,658  INFO [pool-1-thread-15] ContactFileUtils - writeOutputFile
2012-09-24 16:28:13,701  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - createBioDatasource.exists
2012-09-24 16:28:13,704  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - updateCount
2012-09-24 16:28:13,707  INFO [pool-1-thread-15] DwcArchiveMetadataHandler - end.issueMetadata
2012-09-24 16:28:49,351  INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.start.processHarvested
2012-09-24 16:28:49,355  INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.processHarvested.openArchive
2012-09-24 16:28:49,371  INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.processHarvested.operating
2012-09-24 16:29:11,302  INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.end.processHarvested.writeOutputFile
2012-09-24 16:29:11,508  INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.end.processHarvested.writeOutputFile
2012-09-24 16:29:11,511  INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.end.processHarvested
2012-09-24 16:29:19,501  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Start synchronisation
2012-09-24 16:29:19,505  WARN [pool-1-thread-19] GBIFPortalSynchroniser - No existing data provider ID retrieved from Registry
2012-09-24 16:29:19,594  INFO [pool-1-thread-19] DataProviderDaoImpl - Found existing data provider with id 196 and uuid b554c320-0560-11d8-b851-b8a03c50a862 - updated
2012-09-24 16:29:19,610  INFO [pool-1-thread-19] GBIFPortalSynchroniser - dataProviderId=196
2012-09-24 16:29:19,663  INFO [pool-1-thread-19] GBIFPortalSynchroniser - data provider identifier persisted to Registry
2012-09-24 16:29:19,703  INFO [pool-1-thread-19] DataResourceDaoImpl - Found existing data resource with id 13674, and data_provider_id 196 - updated
2012-09-24 16:29:19,706  INFO [pool-1-thread-19] GBIFPortalSynchroniser - dataResourceId=13674
2012-09-24 16:29:19,712  INFO [pool-1-thread-19] ResourceAccessPointDaoImpl - Found existing resource access point with id 13907, data_provider_id 196, data_resource_id 13674, and remote_id_at_url  - updated
2012-09-24 16:29:19,719  INFO [pool-1-thread-19] ResourceAccessPointDaoImpl - Delete existing namespace_mapping records and write anew for resource_access_point: 13907
2012-09-24 16:29:19,728  INFO [pool-1-thread-19] GBIFPortalSynchroniser - resourceAccessPointId=13907
2012-09-24 16:29:19,734  INFO [pool-1-thread-19] GBIFPortalSynchroniser - No agents associated to this data resource 13674 were collected.
2012-09-24 16:29:19,785  INFO [pool-1-thread-19] AgentDaoImpl - Found existing association between data_provider with id 196 and agent with id 1810 and agentType 2
2012-09-24 16:29:19,790  INFO [pool-1-thread-19] AgentDaoImpl - Found existing association between data_provider with id 196 and agent with id 4157 and agentType 2
2012-09-24 16:29:19,793  INFO [pool-1-thread-19] GBIFMessageLogger - The thread-local variable maxLogGroup = 133156847
2012-09-24 16:29:19,796  INFO [pool-1-thread-19] GBIFMessageLogger - Reading file: /mnt/fiber/super_hit/university_of_kansas_biodiversity_institute-b554c320/aae308f4-9f9c-4cdd-b4ef-c026f48be551/gbif_log_messages.txt
2012-09-24 16:29:19,822  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Number of GBIF log messages collected during 'Harvesting': 1
2012-09-24 16:29:20,827  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Synchronisation started on raw occurrence records
2012-09-24 17:32:34,563  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Synchronisation finished on raw occurrence records
2012-09-24 17:32:34,568 ERROR [pool-1-thread-19] GBIFPortalSynchroniser - The GBIF triplet (Catalogue Number & Collection Code & Institution Code) was incomplete for 578 records!
2012-09-24 17:32:34,571  INFO [pool-1-thread-19] GBIFPortalSynchroniser - 798010 raw_occurrence_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-09-24 17:32:34,575  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deleting all identifier_record records belonging to this data resource ...
2012-09-24 17:32:34,580  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deletion complete
2012-09-24 17:32:34,583  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deleting all image_record records belonging to this data resource ...
2012-09-24 17:33:51,486  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deletion complete
2012-09-24 17:33:51,491  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deleting all link_record records belonging to this data resource ...
2012-09-24 17:33:51,495  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deletion complete
2012-09-24 17:33:51,498  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deleting all typification_record records belonging to this data resource ...
2012-09-24 17:33:52,776  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deletion complete
2012-09-24 17:33:52,780  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Synchronisation started on auxiliary tables
2012-09-24 18:24:28,409  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Synchronisation finished on auxiliary tables
2012-09-24 18:24:28,413  INFO [pool-1-thread-19] GBIFPortalSynchroniser - 798010 image_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-09-24 18:24:28,416  INFO [pool-1-thread-19] GBIFPortalSynchroniser - 0 link_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-09-24 18:24:28,419  INFO [pool-1-thread-19] GBIFPortalSynchroniser - 15398 typification_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-09-24 18:24:28,423  INFO [pool-1-thread-19] GBIFPortalSynchroniser - 798176 identifier_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-09-24 18:24:28,426  INFO [pool-1-thread-19] GBIFPortalSynchroniser - 0 image_record records were updated or created from the image extension (caution: this count is irrespective of duplicate records)
2012-09-24 18:24:28,429  INFO [pool-1-thread-19] GBIFMessageLogger - The thread-local variable maxLogGroup = 133156848
2012-09-24 18:24:28,432  INFO [pool-1-thread-19] GBIFMessageLogger - Reading file: /mnt/fiber/super_hit/university_of_kansas_biodiversity_institute-b554c320/aae308f4-9f9c-4cdd-b4ef-c026f48be551/gbif_log_messages_synchronise.txt
2012-09-24 18:24:28,601  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Number of GBIF log messages collected during 'Synchronise': 63
2012-09-24 18:24:28,605  INFO [pool-1-thread-19] GBIFPortalSynchroniser - Finished synchronisation


Author: ahahn@gbif.org
Comment: Observed the same issue with http://data.gbif.org/datasets/resource/13706
Created: 2012-09-26 12:03:30.828
Updated: 2012-09-26 12:03:30.828


Author: jlegind@gbif.org
Created: 2012-09-27 09:53:04.518
Updated: 2012-09-27 09:53:04.518
        
Here is an error from the HIT logs that might cause auto-deletion to be overridden:
http://hit.gbif.org/console/list.html?datasourceId=2361
These are not real date/time value errors, but a HIT bug.

These date/time errors should not prevent records from being updated; they should merely be flagged for correction.
Here is a quote from the HIT-log:
{quote}Update/Create of Raw Occurrence Record failed: PreparedStatementCallback; SQL []; Data truncation: Incorrect datetime value: '20095009-10-14' for column 'identification_date' at row 1; nested exception is com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect datetime value: '20095009-10-14' for column 'identification_date' at row 1
{quote} 
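
As an illustration of "flag, don't fail", here is a minimal sketch that parses the date before the write and returns a flag instead of raising. The function name and the flag text are hypothetical, it assumes dates arrive as ISO `YYYY-MM-DD` strings, and it is not the HIT's actual code.

```python
from datetime import date, datetime
from typing import Optional, Tuple


def parse_identification_date(raw: Optional[str]) -> Tuple[Optional[date], Optional[str]]:
    """Return (parsed_date, flag) for a raw date string.

    An empty value yields (None, None). A value that does not parse
    yields (None, <flag message>), so the caller can still write the
    record with a NULL date and log the flag for later correction.
    """
    if raw is None or not raw.strip():
        return None, None
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d").date(), None
    except ValueError:
        return None, f"unparseable identification_date: {raw!r}"


# The value from the quoted error is flagged rather than raised.
print(parse_identification_date("20095009-10-14"))
print(parse_identification_date("2009-10-14"))
```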
    


Author: kbraak@gbif.org
Comment: Won't fix. Closing issue.
Created: 2013-12-09 15:03:30.066
Updated: 2013-12-09 15:03:30.066