Issue 12334

"Too many open files" error interrupts synchronisation

Reporter: ahahn
Assignee: kbraak
Type: Bug
Summary: "Too many open files" error interrupts synchronisation
Priority: Critical
Resolution: CantReproduce
Status: Closed
Created: 2012-11-20 12:17:22.387
Updated: 2013-12-09 15:01:41.851
Resolved: 2013-12-09 15:01:41.816
        
Description: Symptom observed: after rollover, previously indexed image records for the "Herbarium Berolinense" dataset (DR 1095) disappeared. This was spotted by coincidence (an alert from the owning institution). Nothing changed on the side of the data publisher.

HIT logs (extract at the bottom): during synchronisation, the process is interrupted with the message "Synchronisation was interrupted due to an error: /mnt/fiber/super_hit/botanic_garden_and_botanical_museum_berlin-dahlem-57254bd0/85714c48-f762-11e1-a439-00145eb45e9a/rorid_to_line_number.txt (Too many open files)".

In the regular process, prior to this, all auxiliary tables are flushed, to be re-written after synchronisation of the main records. Because the process is interrupted, the auxiliary records never get re-written, even though they are contained in the harvested and processed data files, and even though the main records still exist (they just do not get updated).

On its last appearance, the symptoms were overcome by a server restart. This can only be a temporary work-around, though: the consequences (random interruption of synchronisation processes, irreproducible loss of records from auxiliary tables) are too fundamental, and there is no easy way to spot and restart failed synchronisations afterwards.
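
To make such descriptor leaks visible before they escalate into a hard failure, the open-descriptor count of the HIT process could be sampled periodically. A minimal sketch, assuming a Linux host; FdMonitor is a hypothetical helper, not part of HIT, and simply counts the entries under /proc/self/fd:

import java.io.File;

/**
 * Hypothetical diagnostic helper (not part of HIT): reports how many
 * file descriptors the current JVM holds open, by counting the entries
 * under /proc/self/fd. Linux-only; list() returns null elsewhere.
 */
public class FdMonitor {

    public static int openFileDescriptors() {
        String[] entries = new File("/proc/self/fd").list();
        return entries == null ? -1 : entries.length;
    }

    public static void main(String[] args) {
        System.out.println("Open file descriptors: " + openFileDescriptors());
    }
}

A count that climbs steadily across synchronisation runs, instead of returning to a baseline, would point at the leaking code path.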

Things to consider:
- do we need to review the sequence of events around flushing and regenerating the tables? The current flow is designed under the assumption that all steps go smoothly, but this cannot be guaranteed. Should the flushing of the auxiliary tables happen _after_ a successful update of raw_occurrence_record, so that an interruption of that update would not lose the content of these tables? (A sketch of this reordering follows the list.)
- we need to be able to catch the "Too many open files" offender. I seem to remember a very similar issue early in HIT development, caused by file closing being handled in a place that got bypassed in certain situations. (A sketch of guaranteed file closing also follows below.)
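
A minimal sketch of the reordering suggested in the first point. The method and type names here are illustrative only, not the actual GBIFPortalSynchroniser API:

/**
 * Illustrative restructuring of the synchronisation flow (names are
 * hypothetical, not the real HIT API). The auxiliary tables are only
 * flushed once the main-record update has completed, so an interruption
 * cannot leave them emptied but never re-written.
 */
public class ReorderedSynchroniser {

    public void synchronise(String dataResourceId) throws Exception {
        // 1. Update raw_occurrence_record first; if this throws,
        //    the auxiliary tables are still intact.
        synchroniseRawOccurrenceRecords(dataResourceId);

        // 2. Only now flush the auxiliary tables ...
        deleteAuxiliaryRecords(dataResourceId);

        // 3. ... and immediately rebuild them from the processed files.
        rebuildAuxiliaryRecords(dataResourceId);
    }

    private void synchroniseRawOccurrenceRecords(String id) throws Exception { /* stub */ }

    private void deleteAuxiliaryRecords(String id) { /* stub */ }

    private void rebuildAuxiliaryRecords(String id) { /* stub */ }
}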
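
On the second point: the stack trace below ends in RandomAccessFile construction, and a close that is skipped on exceptional paths would produce exactly this descriptor exhaustion. A minimal sketch of leak-proof handling (illustrative only, not the actual synchroniseAbcd code): closing in a finally block guarantees the descriptor is released on every exit path.

import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Illustrative example (not the actual HIT code): the finally block
 * ensures the file descriptor is released even when processing throws,
 * closing the kind of bypassed-close path suspected above.
 */
public class SafeRandomAccess {

    public static long readFirstOffset(String path) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        try {
            return raf.readLong(); // any processing that might throw
        } finally {
            raf.close(); // runs on both normal and exceptional exits
        }
    }
}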

-- HIT log context from http://hit.gbif.org/console/list.html?datasourceId=20:

2012-10-18 01:59:13.0 	Finished synchronisation
2012-10-18 01:59:13.0 	No GBIF log messages were collected during 'Synchronise' for data resource 1095
2012-10-18 01:59:13.0 	Reading file: /mnt/fiber/super_hit/botanic_garden_and_botanical_museum_berlin-dahlem-57254bd0/85714c48-f762-11e1-a439-00145eb45e9a/gbif_log_messages_synchronise.txt
2012-10-18 01:59:13.0 	The thread-local variable maxLogGroup = 133158866
2012-10-18 01:59:13.0 	Synchronisation was interrupted due to an error: /mnt/fiber/super_hit/botanic_garden_and_botanical_museum_berlin-dahlem-57254bd0/85714c48-f762-11e1-a439-00145eb45e9a/rorid_to_line_number.txt (Too many open files)
2012-10-18 01:59:13.0 	The file: /mnt/fiber/super_hit/botanic_garden_and_botanical_museum_berlin-dahlem-57254bd0/85714c48-f762-11e1-a439-00145eb45e9a/link_records.txt contains no records
2012-10-18 01:59:13.0 	Deletion complete
2012-10-18 01:59:07.0 	Deleting all typification_record records belonging to this data resource ...
2012-10-18 01:59:07.0 	Deletion complete
2012-10-18 01:59:07.0 	Deleting all link_record records belonging to this data resource ...
2012-10-18 01:59:07.0 	Deletion complete
2012-10-18 01:58:39.0 	Deleting all image_record records belonging to this data resource ...
2012-10-18 01:58:39.0 	Deletion complete
2012-10-18 01:58:27.0 	Deleting all identifier_record records belonging to this data resource ...
2012-10-18 01:58:27.0 	Synchronisation started on auxiliary tables
2012-10-18 01:58:27.0 	70 raw_occurrence_record records were updated or created (caution: this count is irrespective of duplicate records)
2012-10-18 01:58:27.0 	Synchronisation finished on raw occurrence records
2012-10-18 01:58:26.0 	Synchronisation started on raw occurrence records


-- stack trace behind the error line:
2012-10-18 01:59:13.0 	Synchronisation was interrupted due to an error: /mnt/fiber/super_hit/botanic_garden_and_botanical_museum_berlin-dahlem-57254bd0/85714c48-f762-11e1-a439-00145eb45e9a/rorid_to_line_number.txt (Too many open files)
java.io.RandomAccessFile.open(RandomAccessFile.java:-2)
java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
org.gbif.harvest.portal.synchronise.GBIFPortalSynchroniser.synchroniseAbcd(GBIFPortalSynchroniser.java:2303)
org.gbif.harvest.portal.synchronise.GBIFPortalSynchroniser.synchronise(GBIFPortalSynchroniser.java:1641)
org.gbif.harvest.portal.synchronise.GBIFPortalSynchroniser.synchronise(GBIFPortalSynchroniser.java:1451)
sun.reflect.GeneratedMethodAccessor553.invoke(:-1)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.apache.commons.beanutils.MethodUtils.invokeExactMethod(MethodUtils.java:404)
org.gbif.harvest.scheduler.SimpleScheduler.triggerMethodInvocation(SimpleScheduler.java:1090)
org.gbif.harvest.scheduler.SimpleScheduler.access$500(SimpleScheduler.java:57)
org.gbif.harvest.scheduler.SimpleScheduler$RunnableJob.run(SimpleScheduler.java:334)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
    


Author: kbraak@gbif.org
Comment: Can't reproduce. Closing issue.
Created: 2013-12-09 15:01:41.847
Updated: 2013-12-09 15:01:41.847