Issue 12012

Redo response handling

Reporter: lfrancke
Assignee: lfrancke
Type: Bug
Summary: Redo response handling
Priority: Critical
Resolution: Fixed
Status: Closed
Created: 2012-10-11 17:05:28.317
Updated: 2013-12-17 15:46:34.365
Resolved: 2012-11-15 16:32:34.039
        
Description: The current response handling is too restrictive and fails in some situations. This issue reworks it so that all possible failure cases are handled.

See the attached PDF for all possible cases. This also adds handling of speculative requests, content hashing, and crawl aborts.


The old description, which covered only the hashing part:
The problem has two parts. We hash the content of a protocol-specific _content_ element ({{content}} for BioCASe and DiGIR, {{search}} for TAPIR).

The first bug is that the current implementation includes those elements themselves in the hash calculation. This is a problem for BioCASe because its {{content}} element carries attributes that are likely to change with every request, which defeats the purpose of hashing.

The second problem is with TAPIR: when paging is used (which we do at the moment), the content of the {{search}} element includes a {{summary}} element, so that would have to be stripped out somehow.

Tim suggested a different strategy in which we hash everything that is not from a specific XML namespace.
    
Attachment Crawler Error handling - Sheet1.pdf


Author: lfrancke@gbif.org
Created: 2012-10-25 21:47:27.206
Updated: 2012-10-25 21:47:27.206
        
* For BioCASe we can hash everything inside the {{content}} element, making sure that it really is only what is _inside_ it and not the element itself.
* DiGIR works as it is.
* TAPIR should hash everything inside the {{search}} element that is _not_ from a TAPIR namespace.
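
A minimal sketch of the rules listed above, written in Python for brevity; it is not the crawler's actual code. The function names, the choice of SHA-1, and the TAPIR namespace URI are assumptions made for illustration. The sketch filters only direct children of the container element by namespace.

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumed TAPIR namespace URI; adjust to whatever the responses actually declare.
TAPIR_NS = "http://rs.tdwg.org/tapir/1.0"


def namespace_of(tag):
    """Return the namespace URI of an ElementTree tag such as '{uri}local'."""
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""


def local_name(tag):
    """Return the tag name without its '{uri}' namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def content_hash(xml_text, container_name, skip_namespace=None):
    """Hash what is inside the first element named container_name.

    The container's own tag and attributes are never hashed. Direct
    children whose namespace equals skip_namespace are left out.
    Returns a hex digest, or None if no such container exists.
    """
    root = ET.fromstring(xml_text)
    container = next(
        (el for el in root.iter() if local_name(el.tag) == container_name),
        None,
    )
    if container is None:
        return None

    digest = hashlib.sha1()  # hash algorithm is an assumption
    # Text directly inside the container, before its first child.
    digest.update((container.text or "").encode("utf-8"))
    for child in container:
        if skip_namespace is not None and namespace_of(child.tag) == skip_namespace:
            continue  # e.g. TAPIR's own summary element
        # Serializes the child's subtree including its trailing text.
        digest.update(ET.tostring(child, encoding="utf-8"))
    return digest.hexdigest()
```

For BioCASe and DiGIR this would be called as `content_hash(body, "content")`, and for TAPIR as `content_hash(body, "search", skip_namespace=TAPIR_NS)`, so that a changing {{summary}} element or changing attributes on {{content}} do not alter the hash.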