Issue 11948
The processing did not delete the non-updated records after the DwC archive finished indexing
Reporter: jlegind
Assignee: kbraak
Type: Task
Summary: The processing did not delete the non-updated records after the DwC archive finished indexing
Priority: Major
Resolution: WontFix
Status: Closed
Created: 2012-09-26 12:01:17.297
Updated: 2016-02-22 10:14:55.465
Resolved: 2013-12-09 15:03:29.953
Description: For resource ID 13674, University of Kansas, SNOW Entomology (published through their IPT), 109979 records were not updated during indexing and were not subsequently deleted by the processing. They had to be deleted manually.
Since 500+ records were missing complete GBIF triplets (578 according to the log below), this might have blocked the automatic deletion trigger (pure conjecture on my part).
We should discuss the conditions under which auto-deletion is suspended. For example, if one element of the triplet were left out of a large part of the archive, legitimate records would be deleted. At what point do we want a deletion to be reviewed before it starts?
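To make the discussion concrete, here is a minimal sketch of what such a review condition could look like. It is not how HIT decides anything: the function name, the field names and both thresholds are hypothetical placeholders, and the only real numbers are the counts reported in this issue.

```python
from dataclasses import dataclass


@dataclass
class SyncStats:
    records_in_archive: int   # records updated or created in this run
    incomplete_triplets: int  # records missing part of the GBIF triplet
    stale_records: int        # existing records not touched by this run


def deletion_decision(stats, max_incomplete_ratio=0.01, max_stale_ratio=0.25):
    """Return "delete" or "review" for the stale records of one resource.

    Both thresholds are illustrative assumptions, not values used by HIT.
    """
    if stats.records_in_archive == 0:
        # An empty archive would wipe the whole resource: always review.
        return "review"
    incomplete_ratio = stats.incomplete_triplets / stats.records_in_archive
    stale_ratio = stats.stale_records / stats.records_in_archive
    if incomplete_ratio > max_incomplete_ratio:
        # Many broken triplets: stale records may be legitimate ones
        # that simply could not be matched.
        return "review"
    if stale_ratio > max_stale_ratio:
        # Unusually large deletion: ask a human first.
        return "review"
    return "delete"


# Counts taken from this issue (resource 13674).
kansas = SyncStats(records_in_archive=798010,
                   incomplete_triplets=578,
                   stale_records=109979)
print(deletion_decision(kansas))
```

With these placeholder thresholds the Kansas run would still be deleted automatically, since 578 of 798010 records is well under 1%. The point of the sketch is only that the conditions can be stated as explicit, reviewable ratios.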
Here is a summary from the log:
2012-09-24 16:27:19,113 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - start.issueMetadata
2012-09-24 16:27:19,132 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - dwcarchivemetadatahandler.download.start
2012-09-24 16:27:20,970 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - dwcarchivemetadatahandler.download.end
2012-09-24 16:27:20,974 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - Case 2: processing EML with an associated DWC-ARCHIVE either zip archive or CSV
2012-09-24 16:27:20,977 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - dwcarchivemetadatahandler.download.start
2012-09-24 16:27:38,731 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - dwcarchivemetadatahandler.download.decompress
2012-09-24 16:27:42,040 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - dwcarchivemetadatahandler.download.end
2012-09-24 16:28:13,641 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - The eml.xml validates successfully with GBIF EML profile
2012-09-24 16:28:13,645 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - First attempt to parse metadata - is it described using the GBIF Metadata profile?
2012-09-24 16:28:13,654 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - eml.xml file corresponds to EML version 7
2012-09-24 16:28:13,658 INFO [pool-1-thread-15] ContactFileUtils - writeOutputFile
2012-09-24 16:28:13,701 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - createBioDatasource.exists
2012-09-24 16:28:13,704 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - updateCount
2012-09-24 16:28:13,707 INFO [pool-1-thread-15] DwcArchiveMetadataHandler - end.issueMetadata
2012-09-24 16:28:49,351 INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.start.processHarvested
2012-09-24 16:28:49,355 INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.processHarvested.openArchive
2012-09-24 16:28:49,371 INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.processHarvested.operating
2012-09-24 16:29:11,302 INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.end.processHarvested.writeOutputFile
2012-09-24 16:29:11,508 INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.end.processHarvested.writeOutputFile
2012-09-24 16:29:11,511 INFO [pool-1-thread-18] DwcArchiveHarvester - dwcarchive.end.processHarvested
2012-09-24 16:29:19,501 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Start synchronisation
2012-09-24 16:29:19,505 WARN [pool-1-thread-19] GBIFPortalSynchroniser - No existing data provider ID retrieved from Registry
2012-09-24 16:29:19,594 INFO [pool-1-thread-19] DataProviderDaoImpl - Found existing data provider with id 196 and uuid b554c320-0560-11d8-b851-b8a03c50a862 - updated
2012-09-24 16:29:19,610 INFO [pool-1-thread-19] GBIFPortalSynchroniser - dataProviderId=196
2012-09-24 16:29:19,663 INFO [pool-1-thread-19] GBIFPortalSynchroniser - data provider identifier persisted to Registry
2012-09-24 16:29:19,703 INFO [pool-1-thread-19] DataResourceDaoImpl - Found existing data resource with id 13674, and data_provider_id 196 - updated
2012-09-24 16:29:19,706 INFO [pool-1-thread-19] GBIFPortalSynchroniser - dataResourceId=13674
2012-09-24 16:29:19,712 INFO [pool-1-thread-19] ResourceAccessPointDaoImpl - Found existing resource access point with id 13907, data_provider_id 196, data_resource_id 13674, and remote_id_at_url - updated
2012-09-24 16:29:19,719 INFO [pool-1-thread-19] ResourceAccessPointDaoImpl - Delete existing namespace_mapping records and write anew for resource_access_point: 13907
2012-09-24 16:29:19,728 INFO [pool-1-thread-19] GBIFPortalSynchroniser - resourceAccessPointId=13907
2012-09-24 16:29:19,734 INFO [pool-1-thread-19] GBIFPortalSynchroniser - No agents associated to this data resource 13674 were collected.
2012-09-24 16:29:19,785 INFO [pool-1-thread-19] AgentDaoImpl - Found existing association between data_provider with id 196 and agent with id 1810 and agentType 2
2012-09-24 16:29:19,790 INFO [pool-1-thread-19] AgentDaoImpl - Found existing association between data_provider with id 196 and agent with id 4157 and agentType 2
2012-09-24 16:29:19,793 INFO [pool-1-thread-19] GBIFMessageLogger - The thread-local variable maxLogGroup = 133156847
2012-09-24 16:29:19,796 INFO [pool-1-thread-19] GBIFMessageLogger - Reading file: /mnt/fiber/super_hit/university_of_kansas_biodiversity_institute-b554c320/aae308f4-9f9c-4cdd-b4ef-c026f48be551/gbif_log_messages.txt
2012-09-24 16:29:19,822 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Number of GBIF log messages collected during 'Harvesting': 1
2012-09-24 16:29:20,827 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Synchronisation started on raw occurrence records
2012-09-24 17:32:34,563 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Synchronisation finished on raw occurrence records
2012-09-24 17:32:34,568 ERROR [pool-1-thread-19] GBIFPortalSynchroniser - The GBIF triplet (Catalogue Number & Collection Code & Institution Code) was incomplete for 578 records!
2012-09-24 17:32:34,571 INFO [pool-1-thread-19] GBIFPortalSynchroniser - 798010 raw_occurrence_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-09-24 17:32:34,575 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deleting all identifier_record records belonging to this data resource ...
2012-09-24 17:32:34,580 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deletion complete
2012-09-24 17:32:34,583 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deleting all image_record records belonging to this data resource ...
2012-09-24 17:33:51,486 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deletion complete
2012-09-24 17:33:51,491 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deleting all link_record records belonging to this data resource ...
2012-09-24 17:33:51,495 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deletion complete
2012-09-24 17:33:51,498 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deleting all typification_record records belonging to this data resource ...
2012-09-24 17:33:52,776 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Deletion complete
2012-09-24 17:33:52,780 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Synchronisation started on auxiliary tables
2012-09-24 18:24:28,409 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Synchronisation finished on auxiliary tables
2012-09-24 18:24:28,413 INFO [pool-1-thread-19] GBIFPortalSynchroniser - 798010 image_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-09-24 18:24:28,416 INFO [pool-1-thread-19] GBIFPortalSynchroniser - 0 link_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-09-24 18:24:28,419 INFO [pool-1-thread-19] GBIFPortalSynchroniser - 15398 typification_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-09-24 18:24:28,423 INFO [pool-1-thread-19] GBIFPortalSynchroniser - 798176 identifier_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-09-24 18:24:28,426 INFO [pool-1-thread-19] GBIFPortalSynchroniser - 0 image_record records were updated or created from the image extension (caution: this count is irrespective of duplicate records)
2012-09-24 18:24:28,429 INFO [pool-1-thread-19] GBIFMessageLogger - The thread-local variable maxLogGroup = 133156848
2012-09-24 18:24:28,432 INFO [pool-1-thread-19] GBIFMessageLogger - Reading file: /mnt/fiber/super_hit/university_of_kansas_biodiversity_institute-b554c320/aae308f4-9f9c-4cdd-b4ef-c026f48be551/gbif_log_messages_synchronise.txt
2012-09-24 18:24:28,601 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Number of GBIF log messages collected during 'Synchronise': 63
2012-09-24 18:24:28,605 INFO [pool-1-thread-19] GBIFPortalSynchroniser - Finished synchronisation
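A rough way to check a log like the one above is to list which tables have a "Deleting all ... records" message and which lines are not INFO. The sketch below assumes only the line layout visible in this log; the function name is hypothetical and the sample lines are copied from the log.

```python
import re

# Message wording as it appears in the log above.
DELETE_RE = re.compile(
    r"Deleting all (\w+) records belonging to this data resource")
# "<date> <time> <LEVEL> [thread] ..." as seen in the log above.
LEVEL_RE = re.compile(r"^\S+ \S+ (INFO|WARN|ERROR) ")


def summarise(log_lines):
    """Return (tables with a deletion message, non-INFO lines)."""
    deleted_tables = []
    problems = []
    for line in log_lines:
        match = DELETE_RE.search(line)
        if match:
            deleted_tables.append(match.group(1))
        level = LEVEL_RE.match(line)
        if level and level.group(1) != "INFO":
            problems.append(line)
    return deleted_tables, problems


sample = [
    "2012-09-24 17:32:34,568 ERROR [pool-1-thread-19] GBIFPortalSynchroniser - "
    "The GBIF triplet (Catalogue Number & Collection Code & Institution Code) "
    "was incomplete for 578 records!",
    "2012-09-24 17:32:34,575 INFO [pool-1-thread-19] GBIFPortalSynchroniser - "
    "Deleting all identifier_record records belonging to this data resource ...",
    "2012-09-24 17:32:34,583 INFO [pool-1-thread-19] GBIFPortalSynchroniser - "
    "Deleting all image_record records belonging to this data resource ...",
]

tables, problems = summarise(sample)
print(tables)
print(len(problems))
```

In the full log, deletion messages appear for identifier_record, image_record, link_record and typification_record. None of them mentions raw_occurrence_record.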
Author: ahahn@gbif.org
Comment: Observed the same issue with http://data.gbif.org/datasets/resource/13706
Created: 2012-09-26 12:03:30.828
Updated: 2012-09-26 12:03:30.828
Author: jlegind@gbif.org
Created: 2012-09-27 09:53:04.518
Updated: 2012-09-27 09:53:04.518
Here is an error from the HIT logs that might cause auto-deletion to be overridden:
http://hit.gbif.org/console/list.html?datasourceId=2361
These are not real date/time value errors but a HIT bug.
Date/time errors like these should not prevent records from being updated; they should merely be flagged for correction.
Here is a quote from the HIT-log:
{quote}Update/Create of Raw Occurrence Record failed: PreparedStatementCallback; SQL []; Data truncation: Incorrect datetime value: '20095009-10-14' for column 'identification_date' at row 1; nested exception is com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect datetime value: '20095009-10-14' for column 'identification_date' at row 1
{quote}
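The "flag, don't fail" behaviour suggested above could look roughly like this. It is a sketch under assumptions, not HIT code: the function name and the flag list are hypothetical, and it assumes dates are expected as YYYY-MM-DD.

```python
from datetime import datetime


def parse_date_or_flag(raw, flags, field="identification_date"):
    """Return a date, or None plus a flag entry if the value is unusable.

    The record itself can still be written with a NULL date; the flag
    is what gets reported back for correction.
    """
    if raw is None or raw.strip() == "":
        return None
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d").date()
    except ValueError:
        flags.append(f"{field}: unparseable value {raw!r}")
        return None


flags = []
bad = parse_date_or_flag("20095009-10-14", flags)   # value from the HIT log
good = parse_date_or_flag("2009-10-14", flags)
print(bad, good, flags)
```

Validating before the insert means a malformed value such as '20095009-10-14' never reaches the database, so the MysqlDataTruncation error quoted above would not abort the update of that record.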