Issue 11837

Tapir: inventory interrupts with "Failure writing inventory request", flushing downstream jobs

Reporter: ahahn
Assignee: kbraak
Type: Bug
Summary: Tapir: inventory interrupts with "Failure writing inventory request", flushing downstream jobs
Priority: Major
Resolution: WontFix
Status: Closed
Created: 2012-09-12 16:19:53.586
Updated: 2013-01-15 14:40:04.413
Resolved: 2012-09-17 09:47:10.269
Description: Harvesting of datasourceId=354 (HIT id) fails during inventory requests. The last inventory response received (attached) seems to cause a parsing error on a whitespace after the name "Eupelmus ". The following error message however refers to failure in "writing inventory request", not in "parsing inventory response. After this, the inventory operation stops without further messages, and all subsequent steps are flushed. A date is not set in the HIT user interface.

Expected behaviour:
- after a broken response, proceed to the next range and continue as normal
- display an error message pointing at the reason for the failure

Dataset: "Iziko at
Scheduled: full harvesting and synchronisation
End of log entries (note the message):

2012-09-12 15:27:44.0 	Failure writing inventory request (full version below)
2012-09-12 15:27:44.0 	 Writing to file: /mnt/fiber/super_hit//south_african_national_biodiversity_institute-c5f7ef70/7b468209-8d5c-477b-b118-cdf04dde9912/inventory_response.047.gz
2012-09-12 15:27:43.0 	Executing get request...
2012-09-12 15:27:43.0 	 Writing to file: /mnt/fiber/super_hit/south_african_national_biodiversity_institute-c5f7ef70/7b468209-8d5c-477b-b118-cdf04dde9912/inventory_request.047.gz
2012-09-12 15:27:43.0 	Requesting next set of records, starting at: 9400
2012-09-12 15:27:40.0 	Setting Tapir next to [9400]
2012-09-12 15:27:40.0 	 Writing to file: /mnt/fiber/super_hit//south_african_national_biodiversity_institute-c5f7ef70/7b468209-8d5c-477b-b118-cdf04dde9912/inventory_response.046.gz
2012-09-12 15:27:39.0 	Executing get request...

2012-09-12 15:27:44.0 	Failure writing inventory request
org.gbif.harvest.util.tapir.TapirDataResponseHandler. handleResponse ( 64 )
org.gbif.harvest.util.tapir.TapirDataResponseHandler. handleResponse ( 18 )
org.apache.http.impl.client.AbstractHttpClient. execute ( 945 )
org.apache.http.impl.client.AbstractHttpClient. execute ( 919 )
org.apache.http.impl.client.AbstractHttpClient. execute ( 910 )
org.gbif.harvest.util.RequestUtils. executeGetRequestAndReturnDiagnostics ( 288 )
org.gbif.harvest.tapir.TapirHarvester. fireInventory ( 267 )
org.gbif.harvest.tapir.TapirHarvester. inventory ( 498 )
org.gbif.harvest.tapir.TapirHarvester. inventory ( 436 )
sun.reflect.NativeMethodAccessorImpl. invoke0 ( -2 )
sun.reflect.NativeMethodAccessorImpl. invoke ( 39 )
sun.reflect.DelegatingMethodAccessorImpl. invoke ( 25 )
java.lang.reflect.Method. invoke ( 597 )
org.apache.commons.beanutils.MethodUtils. invokeExactMethod ( 404 )
org.gbif.harvest.scheduler.SimpleScheduler. triggerMethodInvocation ( 1015 )
org.gbif.harvest.scheduler.SimpleScheduler. access$400 ( 54 )
org.gbif.harvest.scheduler.SimpleScheduler$RunnableJob. run ( 270 )
java.util.concurrent.ThreadPoolExecutor$Worker. runTask ( 886 )
java.util.concurrent.ThreadPoolExecutor$Worker. run ( 908 )
java.lang.Thread. run ( 662 )]]>
Attachment IzikoInventoryFailingResponse.htm

Created: 2012-09-14 16:39:02.643
Updated: 2012-09-14 16:39:02.643
The expected behavior during harvesting (search) is indeed "after a broken response, proceed to the next range and continue as normal"

But in inventory requests, we're iterating over pages of responses and the start parameter for the next request is dependent on a count parsed from the last response. Currently DiGIR, BioCASE, or TAPIR inventory requests that fail, kill all downstream jobs and it is on the Helpdesk to then follow up with the provider to fix their data.

I agree a more informative error message should be given, however, and I will do so.   

Comment: Thanks - in addition to that, I would propose a similar mechanism to the one we are using for broken harvesting responses: increase the setoff based on the previous one, and check whether a step-wise increase (+20? +50?) for the start value gets the process back on track again. Until fatal errors are propagated to an integrated overview that makes it easy to find them, it is not possible to follow up on these issues in an efficient way, as we have to continuously read through log files containing a lot of other things. 
Created: 2012-09-14 17:05:38.45
Updated: 2012-09-14 17:05:38.45

Created: 2012-09-14 17:18:50.356
Updated: 2012-09-14 17:18:50.356
I agree it could be nice, but currently it would be a workaround for a single dataset right? I'd be in favor of enforcing the basic requirement that providers should be able to respond with a list of their names. When this breaks off completely at least, seeing a dataset with 0 records harvested is a good flag for follow up, and the informative error message will immediately highlight why the response couldn't be processed.

Please keep in mind we are switching to the artificial name range harvesting in the next half year or so. If you'd like to keep this issue open, it will take a few weeks to attend to at least. 

Created: 2012-09-14 17:19:57.891
Updated: 2012-09-14 17:19:57.891
Ok, agreed - let's focus on bringing the error messages up to a visible level instead