Issue 14121

One non-unique/incomplete GBIF triplet will abort the crawling attempt

Reporter: jlegind
Assignee: omeyn
Type: Bug
Summary: One non-unique/incomplete GBIF triplet will abort the crawling attempt
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2013-10-01 14:29:47.26
Updated: 2014-08-06 15:39:39.231
Resolved: 2014-08-06 15:39:39.154
        
Description: Even one bad triplet will fail the entire crawl.

I think the bar should be set lower: for instance, the failure threshold could be a percentage of records rather than a single record.
    


Author: kbraak@gbif.org
Created: 2013-12-12 15:24:01.64
Updated: 2013-12-12 15:24:01.64
        
Some additional background information:

For XML there are no rules, so we're only talking about DwC-A. We only check the first 2m records of a DwC-A. If those first 2m records contain a single duplicate triplet, we declare the DwC-A invalid. If it has bad triplets (i.e. incomplete ones), then we have a threshold of, I think, 25% of the checked records.
So if < 25% are incomplete, the DwC-A is valid, and the incomplete records get dropped individually in the crawl.
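
A minimal sketch of the rule described above, for illustration only. It is not the actual crawler code: the function name, parameter names, and return values are hypothetical, the 25% threshold is the value recalled ("I think") in this comment, and treating a triplet as institutionCode / collectionCode / catalogNumber is an assumption.

```python
from itertools import islice
from typing import Iterable, Optional, Tuple

# Assumed triplet layout: (institutionCode, collectionCode, catalogNumber).
Triplet = Tuple[Optional[str], Optional[str], Optional[str]]


def validate_triplets(
    records: Iterable[Triplet],
    max_checked: int = 2_000_000,        # only the first 2m records are checked
    max_incomplete_ratio: float = 0.25,  # threshold as recalled in the comment
) -> Tuple[bool, str]:
    """Return (is_valid, reason) for the checked portion of an archive."""
    seen = set()
    checked = 0
    incomplete = 0

    for triplet in islice(records, max_checked):
        checked += 1

        # A triplet is incomplete if any part is missing or blank.
        if any(part is None or not part.strip() for part in triplet):
            incomplete += 1
            continue

        # A single duplicate among the checked records invalidates the archive.
        if triplet in seen:
            return False, "duplicate triplet"
        seen.add(triplet)

    # Incomplete triplets are tolerated only below the threshold.
    if checked and incomplete / checked >= max_incomplete_ratio:
        return False, "too many incomplete triplets"

    return True, "valid"
```

With this sketch, an archive with one incomplete triplet out of five checked records (20%) passes and the incomplete record would be dropped during the crawl, while any repeated complete triplet fails the whole archive.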
    


Author: omeyn@gbif.org
Comment: See also http://dev.gbif.org/wiki/display/INT/Identifier+problems+and+how+to+solve+them
Created: 2013-12-12 15:26:24.622
Updated: 2013-12-12 15:26:24.622


Author: omeyn@gbif.org
Comment: This is a conscious decision.
Created: 2014-08-06 15:39:39.229
Updated: 2014-08-06 15:39:39.229