Create something to save the verbatim result we received from crawls
Issue: 12677
Reporter: lfrancke
Assignee: lfrancke
Type: Improvement
Summary: Create something to save the verbatim result we received from crawls
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2013-01-29 14:28:53.658
Updated: 2013-12-17 15:46:39.002
Resolved: 2013-02-06 11:44:15.428
Description: This could be either a new MessageListener or a CrawlListener in the Crawler project. Every crawl result should be saved verbatim: the request we sent and the response we received (including headers).
This would live on our NAS.
/<datasetKey>/<attempt>/<range>/<try>/...
I suggest at least three files:
* Request we sent
* Response payload we received
* Response headers we received
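A minimal sketch of what the file-writing side might look like, assuming a hypothetical onResponse hook (the class and method names below are illustrative, not the actual Crawler API):
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

/**
 * Illustrative only: which hook calls this (MessageListener vs.
 * CrawlListener) is exactly what this issue is about; only the
 * file-writing part is sketched here.
 */
public class ResponseSavingListener {

  private final Path baseDirectory; // e.g. a mount point on the NAS

  public ResponseSavingListener(Path baseDirectory) {
    this.baseDirectory = baseDirectory;
  }

  /** Saves one crawl response under baseDirectory/datasetKey/attempt/. */
  public void onResponse(UUID datasetKey, int attempt, String range, int retry, byte[] payload)
      throws IOException {
    Path dir = baseDirectory.resolve(datasetKey.toString()).resolve(Integer.toString(attempt));
    Files.createDirectories(dir);
    Files.write(dir.resolve(range + "_" + retry + ".resp"), payload);
  }
}
{code}
Whether this sits behind a MessageListener or a CrawlListener only changes where onResponse gets called from.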
Author: trobertson@gbif.org
Created: 2013-01-29 17:20:21.434
Updated: 2013-01-29 17:20:21.434
+1 for keeping this within the project (possibly disabled if no path is configured). Seems like basic core functionality I would expect from the core crawling library.
The directory structure looks painful to my eyes (so many directories with tiny files). I suggest using files instead (like the prototype did):
{code}
/<datasetKey>/<attempt>/
<range>_<try>.req
<range>_<try>.resp
{code}
e.g.
{code}
/caff27c3-151a-4268-8629-4803011de3ad/12/
aaa-aab-0_0.req
aaa-aab-0_0.resp
aaa-aab-1000_0.req // second page in the range, first attempt
aaa-aab-1000_0.resp
aaa-aab-1000_1.req // second page in the range, second attempt (e.g. the first must have failed)
aaa-aab-1000_1.resp
{code}
would be more user-friendly to browse on the filesystem.
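For concreteness, filenames in this scheme could be assembled like so (the helper below is illustrative and not part of the Crawler codebase):
{code}
public class CrawlFileNames {

  /** Builds a flat file name such as "aaa-aab-1000_1.resp". */
  static String fileName(String lower, String upper, long offset, int attempt, String extension) {
    return lower + "-" + upper + "-" + offset + "_" + attempt + "." + extension;
  }

  public static void main(String[] args) {
    // second page in the range, second attempt
    System.out.println(fileName("aaa", "aab", 1000, 1, "resp")); // prints aaa-aab-1000_1.resp
  }
}
{code}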
Author: lfrancke@gbif.org
Created: 2013-02-04 17:20:53.805
Updated: 2013-02-04 17:20:53.805
Unfortunately we don't have access to the headers we received or to the exact request right now. This might warrant another issue, but those would be bigger changes. For now we only save the responses.
Author: lfrancke@gbif.org
Created: 2013-02-04 17:25:28.881
Updated: 2013-02-04 17:28:24.956
I've changed the layout to
{code}
/<datasetKey>/<attempt>/<subdir>/<lower>_<upper>_<offset>_<try>.response
{code}
Having it all in one directory per attempt could lead to tens of thousands of files.
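A sketch of how such a nested path could be built; the bucketing rule used here (grouping 10,000 offsets per subdirectory) and the /mnt/nas/crawls base path are assumptions for illustration only:
{code}
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResponsePathLayout {

  /**
   * Builds the nested path. The bucketing rule (10,000 offsets per
   * subdirectory) is an assumption made for this sketch, not the actual one.
   */
  static Path responsePath(Path base, String datasetKey, int attempt,
                           String lower, String upper, long offset, int retry) {
    String bucket = Long.toString(offset / 10_000);
    String file = lower + "_" + upper + "_" + offset + "_" + retry + ".response";
    return base.resolve(datasetKey)
        .resolve(Integer.toString(attempt))
        .resolve(bucket)
        .resolve(file);
  }

  public static void main(String[] args) {
    System.out.println(responsePath(Paths.get("/mnt/nas/crawls"),
        "caff27c3-151a-4268-8629-4803011de3ad", 12, "aaa", "aab", 1000, 1));
    // prints /mnt/nas/crawls/caff27c3-151a-4268-8629-4803011de3ad/12/0/aaa_aab_1000_1.response
  }
}
{code}
The subdirectory level keeps any single directory from accumulating tens of thousands of files while still being browsable by hand.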