Issue 11837

Tapir: inventory interrupts with "Failure writing inventory request", flushing downstream jobs

11837
Reporter: ahahn
Assignee: kbraak
Type: Bug
Summary: Tapir: inventory interrupts with "Failure writing inventory request", flushing downstream jobs
Priority: Major
Resolution: WontFix
Status: Closed
Created: 2012-09-12 16:19:53.586
Updated: 2013-01-15 14:40:04.413
Resolved: 2012-09-17 09:47:10.269
        
Description: Harvesting of datasourceId=354 (HIT id) fails during inventory requests. The last inventory response received (attached) seems to cause a parsing error on a trailing whitespace after the name "Eupelmus ". The subsequent error message, however, refers to a failure in "writing inventory request", not in "parsing inventory response". After this, the inventory operation stops without further messages, and all subsequent steps are flushed. A date is not set in the HIT user interface.

Expected behaviour:
- after a broken response, proceed to the next range and continue as normal
- display an error message pointing at the reason for the failure (see the sketch below)
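
A minimal sketch of this expected skip-and-report behaviour, for illustration only; class and method names such as InventoryPage, fetchInventoryPage and PAGE_SIZE are made up here and are not the actual HIT API:

import java.util.ArrayList;
import java.util.List;

public class InventorySkipSketch {

    static final int PAGE_SIZE = 200; // assumed page size, for illustration only

    /** Pretend page of an inventory response. */
    static class InventoryPage {
        final List<String> names;
        InventoryPage(List<String> names) { this.names = names; }
    }

    /** Stand-in for the real request/parse step; throws on a broken response. */
    static InventoryPage fetchInventoryPage(int start) throws Exception {
        if (start == 9400) {
            throw new Exception("unparsable response at start=" + start);
        }
        List<String> names = new ArrayList<>();
        if (start < 10000) {
            names.add("name-" + start);
        }
        return new InventoryPage(names);
    }

    public static void main(String[] args) {
        int start = 9000;
        List<String> inventory = new ArrayList<>();
        while (true) {
            try {
                InventoryPage page = fetchInventoryPage(start);
                if (page.names.isEmpty()) {
                    break; // no more records
                }
                inventory.addAll(page.names);
            } catch (Exception e) {
                // Requested behaviour: report the real cause, then move on to
                // the next range instead of aborting the whole inventory.
                System.err.println("Skipping range starting at " + start + ": " + e.getMessage());
            }
            start += PAGE_SIZE;
        }
        System.out.println("Collected " + inventory.size() + " names");
    }
}

The point is only that a single unparsable range should be reported with its actual cause and skipped, rather than silently aborting the whole inventory.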

Example:
Dataset: "Iziko" at http://hit.gbif.org/datasource/list.html?filter=provider&value=south+africa
Scheduled: full harvesting and synchronisation
End of log entries (note the message):

2012-09-12 15:27:44.0 	Failure writing inventory request (full version below)
2012-09-12 15:27:44.0 	 Writing to file: /mnt/fiber/super_hit//south_african_national_biodiversity_institute-c5f7ef70/7b468209-8d5c-477b-b118-cdf04dde9912/inventory_response.047.gz
2012-09-12 15:27:43.0 	Executing get request...
2012-09-12 15:27:43.0 	 Writing to file: /mnt/fiber/super_hit/south_african_national_biodiversity_institute-c5f7ef70/7b468209-8d5c-477b-b118-cdf04dde9912/inventory_request.047.gz
2012-09-12 15:27:43.0 	Requesting next set of records, starting at: 9400
2012-09-12 15:27:40.0 	Setting Tapir next to [9400]
2012-09-12 15:27:40.0 	 Writing to file: /mnt/fiber/super_hit//south_african_national_biodiversity_institute-c5f7ef70/7b468209-8d5c-477b-b118-cdf04dde9912/inventory_response.046.gz
2012-09-12 15:27:39.0 	Executing get request...
(...)

stacktrace:
2012-09-12 15:27:44.0 	Failure writing inventory request
org.gbif.harvest.util.tapir.TapirDataResponseHandler.handleResponse(TapirDataResponseHandler.java:64)
org.gbif.harvest.util.tapir.TapirDataResponseHandler.handleResponse(TapirDataResponseHandler.java:18)
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:945)
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:919)
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:910)
org.gbif.harvest.util.RequestUtils.executeGetRequestAndReturnDiagnostics(RequestUtils.java:288)
org.gbif.harvest.tapir.TapirHarvester.fireInventory(TapirHarvester.java:267)
org.gbif.harvest.tapir.TapirHarvester.inventory(TapirHarvester.java:498)
org.gbif.harvest.tapir.TapirHarvester.inventory(TapirHarvester.java:436)
sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.apache.commons.beanutils.MethodUtils.invokeExactMethod(MethodUtils.java:404)
org.gbif.harvest.scheduler.SimpleScheduler.triggerMethodInvocation(SimpleScheduler.java:1015)
org.gbif.harvest.scheduler.SimpleScheduler.access$400(SimpleScheduler.java:54)
org.gbif.harvest.scheduler.SimpleScheduler$RunnableJob.run(SimpleScheduler.java:270)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
    
Attachment IzikoInventoryFailingResponse.htm


Author: kbraak@gbif.org
Created: 2012-09-14 16:39:02.643
Updated: 2012-09-14 16:39:02.643
        
The expected behavior during harvesting (search) is indeed "after a broken response, proceed to the next range and continue as normal"

But in inventory requests we are iterating over pages of responses, and the start parameter for the next request depends on a count parsed from the last response. Currently, DiGIR, BioCASE, and TAPIR inventory requests that fail kill all downstream jobs, and it is then up to the Helpdesk to follow up with the provider to fix their data.
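
As a minimal sketch of that dependency (the names are illustrative, not the HIT code): the next start value can only be computed from the count parsed out of the previous response, so an unparsable page leaves nothing to continue from:

public class InventoryPagingSketch {

    /** Stand-in parse step: returns the count the provider reported,
     *  or throws when the response cannot be parsed. */
    static int parseReturnedCount(String response) throws Exception {
        if (response.isEmpty()) {
            throw new Exception("broken inventory response");
        }
        return Integer.parseInt(response.trim());
    }

    public static void main(String[] args) {
        String[] responses = {"200", "200", ""}; // third page is broken
        int start = 0;
        for (String response : responses) {
            try {
                // Unlike search paging, there is no fixed page size to fall back on:
                // the next start is whatever count the previous response reported.
                start += parseReturnedCount(response);
                System.out.println("next inventory request starts at " + start);
            } catch (Exception e) {
                // With the count missing, the next start cannot be computed,
                // so the inventory (and everything scheduled after it) stops here.
                System.err.println("inventory aborted at start=" + start + ": " + e.getMessage());
                return;
            }
        }
    }
}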

I agree, however, that a more informative error message should be given, and I will add one.
    


Author: ahahn@gbif.org
Comment: Thanks - in addition to that, I would propose a mechanism similar to the one we use for broken harvesting responses: increase the offset based on the previous one, and check whether a step-wise increase (+20? +50?) of the start value gets the process back on track (see the sketch below). Until fatal errors are propagated to an integrated overview that makes them easy to find, we cannot follow up on these issues efficiently, as we have to continuously read through log files containing a lot of other material.
Created: 2012-09-14 17:05:38.45
Updated: 2012-09-14 17:05:38.45
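
A rough sketch of the recovery proposed above, assuming made-up names and step sizes rather than the actual HIT implementation: after a failed range, retry with step-wise increased start values, and only raise a visible fatal error once the retry budget is exhausted:

public class OffsetRecoverySketch {

    static final int[] STEPS = {20, 50, 100}; // candidate step-wise increases

    /** Stand-in fetch+parse; pretend everything between 9400 and 9460 is broken. */
    static int fetchCount(int start) throws Exception {
        if (start >= 9400 && start < 9460) {
            throw new Exception("unparsable response at start=" + start);
        }
        return 200;
    }

    /** Retry the failed range with increasing offsets; return the first start that works, or -1. */
    static int recover(int failedStart) {
        int start = failedStart;
        for (int step : STEPS) {
            start += step;
            try {
                fetchCount(start);
                return start; // back on track
            } catch (Exception e) {
                System.err.println("still failing at start=" + start + ": " + e.getMessage());
            }
        }
        return -1; // give up and surface a fatal, visible error instead
    }

    public static void main(String[] args) {
        System.out.println("resumed at start=" + recover(9400));
    }
}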


Author: kbraak@gbif.org
Created: 2012-09-14 17:18:50.356
Updated: 2012-09-14 17:18:50.356
        
I agree it could be nice, but currently it would be a workaround for a single dataset, right? I'd be in favor of enforcing the basic requirement that providers should be able to respond with a list of their names. At least when this breaks off completely, seeing a dataset with 0 records harvested is a good flag for follow-up, and the informative error message will immediately highlight why the response couldn't be processed.

Please keep in mind that we are switching to artificial name range harvesting in the next half year or so. If you'd like to keep this issue open, it will take at least a few weeks to attend to.
    


Author: ahahn@gbif.org
Created: 2012-09-14 17:19:57.891
Updated: 2012-09-14 17:19:57.891
        
Ok, agreed - let's focus on bringing the error messages up to a visible level instead.