Issue 12015

Handling of duplicate hashes needs to be implemented

Reporter: lfrancke
Assignee: lfrancke
Type: Bug
Summary: Handling of duplicate hashes needs to be implemented
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-10-11 17:36:55.539
Updated: 2013-12-17 15:46:35.642
Resolved: 2012-11-15 16:32:31.226
        
Description: For every response we get we calculate a hash and compare it to the hash of the previous response. Currently I only log when we get the same content twice, which is clearly wrong because duplicate content might indicate endless looping.

What do we want to do?

* skip to the next scientific name range
* skip to the next page that would follow
* skip one record at a time to see if it "gets better"
* retry a few times and then abort crawl

I don't know what the usual patterns are that we see here, so I'm not sure which option makes the most sense.
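
Purely as illustration, a minimal sketch of what the duplicate detection could look like (class and method names are hypothetical, not the actual crawler code): hash each response and count consecutive identical hashes, so a policy like "retry a few times and then abort" becomes a simple threshold check.

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Tracks consecutive duplicate responses by comparing content hashes. */
public class DuplicateResponseDetector {

  private final int maxConsecutiveDuplicates;
  private byte[] previousHash;
  private int consecutiveDuplicates;

  public DuplicateResponseDetector(int maxConsecutiveDuplicates) {
    this.maxConsecutiveDuplicates = maxConsecutiveDuplicates;
  }

  /**
   * Records the hash of the latest response and returns true once the same
   * content has been seen more than the allowed number of times in a row.
   */
  public boolean isDuplicateLimitReached(String responseContent) {
    byte[] hash = sha1(responseContent);
    if (previousHash != null && Arrays.equals(previousHash, hash)) {
      consecutiveDuplicates++;
    } else {
      consecutiveDuplicates = 0;
    }
    previousHash = hash;
    return consecutiveDuplicates >= maxConsecutiveDuplicates;
  }

  private static byte[] sha1(String content) {
    try {
      return MessageDigest.getInstance("SHA-1").digest(content.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-1 not available", e);
    }
  }
}
{code}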
    


Author: lfrancke@gbif.org
Comment: [~trobertson@gbif.org] and [~kbraak@gbif.org] any opinions on this?
Created: 2012-10-24 22:18:05.863
Updated: 2012-10-24 22:18:05.863


Author: kbraak@gbif.org
Created: 2012-10-25 14:20:18.634
Updated: 2012-10-25 14:20:18.634
        
There could be a case (1) where the wrapper tool doesn't honor the paging parameters: the next response always says end of records is false, and the crawler keeps receiving the exact same response despite the incremented start param. In this case the best you can harvest is the first response from each range, so the right action is to "skip to the next scientific name range".

There could be another case (2) where the wrapper tool doesn't honor the name range filter. Maybe it doesn't understand greater than / less than, for example, and the same page of results is returned for each name range. In this case incrementing the range won't help; the best you can harvest is all the records from a single range, so the right action is to "retry a few times and then abort crawl".
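
To make the two cases concrete, here is a hedged sketch of how they could be mapped to actions (all names are hypothetical, not the crawler's actual API; the caller is assumed to already know which symptom it observed):

{code:java}
/** Maps the two duplicate-response cases above to the actions from the issue description. */
public class DuplicateResponsePolicy {

  public enum Action { RETRY_PAGE, SKIP_TO_NEXT_NAME_RANGE, ABORT_CRAWL }

  /**
   * Case (1): the start offset was incremented but the content did not change,
   * i.e. paging is ignored - move on to the next scientific name range.
   * Case (2): the same content comes back for different name ranges,
   * i.e. the name filter is ignored - retry a bounded number of times, then abort.
   */
  public static Action decide(boolean sameContentAfterIncrementedStart,
                              boolean sameContentAcrossNameRanges,
                              int retriesSoFar,
                              int maxRetries) {
    if (sameContentAfterIncrementedStart) {
      return Action.SKIP_TO_NEXT_NAME_RANGE;
    }
    if (sameContentAcrossNameRanges) {
      return retriesSoFar < maxRetries ? Action.RETRY_PAGE : Action.ABORT_CRAWL;
    }
    return Action.RETRY_PAGE;
  }
}
{code}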

The case where we would want to "skip one record at a time to see if it gets better" is when we encounter a response with a bad character (an XML-breaking character, e.g. an unescaped &). In this case you narrow down the page size to localize the record(s) causing the bad response.
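
As a sketch only (the fetch callback and all names are hypothetical), the narrowing could be done by halving the request window until the offending record is isolated:

{code:java}
import java.util.function.BiPredicate;

/** Narrows the page size to localize a record whose response breaks the XML parser. */
public class BadRecordLocator {

  /**
   * Starts at 'offset' with 'pageSize' records and keeps halving the window
   * until a single unparseable record is isolated. 'fetchParses' is a
   * hypothetical callback: (offset, limit) -> true if the response for that
   * window can be parsed. Returns the offset of the bad record, or -1 if no
   * single bad record could be isolated.
   */
  public static long locateBadRecord(long offset, int pageSize, BiPredicate<Long, Integer> fetchParses) {
    long start = offset;
    int limit = pageSize;
    while (limit > 1) {
      int half = limit / 2;
      if (!fetchParses.test(start, half)) {
        // the bad record is in the first half of the window
        limit = half;
      } else if (!fetchParses.test(start + half, limit - half)) {
        // the bad record is in the second half of the window
        start = start + half;
        limit = limit - half;
      } else {
        return -1; // both halves parse cleanly on their own
      }
    }
    return fetchParses.test(start, 1) ? -1 : start;
  }
}
{code}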

I do wonder, though, whether you have to be committed to always running through AAA -> zzz when using artificial name ranges.

For example, knowing the number of records expected (the target count from the metadata), you could terminate the harvest once the crawler's harvested count equals the expected count. Indeed, for small datasets it is possible that a single name range contains all of the dataset's records.
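
That early-exit check would be trivial; a sketch, assuming the metadata exposes a declared record count (names hypothetical):

{code:java}
public class EarlyTermination {

  /**
   * Returns true if the crawl can stop before exhausting all name ranges:
   * the metadata declared a target count and we have already harvested that
   * many records. A declared count of 0 or less is treated as "unknown",
   * so the crawl keeps walking the remaining ranges.
   */
  public static boolean shouldTerminateEarly(long harvestedCount, long declaredTargetCount) {
    return declaredTargetCount > 0 && harvestedCount >= declaredTargetCount;
  }
}
{code}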