Issue 12972

During crawling of new datasets some records are updated or unchanged - should all be new

12972
Reporter: omeyn
Assignee: omeyn
Type: Bug
Summary: During crawling of new datasets some records are updated or unchanged - should all be new
Description: e.g. the herp demo dataset 84e823d2-f762-11e1-a439-00145eb45e9a
Priority: Critical
Resolution: Fixed
Status: Closed
Created: 2013-03-07 15:40:07.157
Updated: 2013-12-17 15:16:47.138
Resolved: 2013-04-02 10:53:29.171


Author: omeyn@gbif.org
Comment: This was a problem with HolyTriplet validation, which improperly accepted an empty string as valid. Fixing that revealed that the herp dataset records have no collectionCode and are therefore un-saveable.
Created: 2013-03-08 13:36:42.379
Updated: 2013-03-08 13:36:42.379
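
A minimal sketch of the kind of check described in the comment above, assuming a triplet made of institutionCode, collectionCode and catalogNumber. The class name TripletCheck and its methods are hypothetical and are not the actual HolyTriplet code; the point is only that an empty or blank part is rejected along with null.

```java
// Hypothetical illustration, not the real HolyTriplet implementation.
final class TripletCheck {
    private TripletCheck() {}

    // A part counts as present only if it is non-null and contains
    // at least one non-whitespace character.
    static boolean isPresent(String part) {
        return part != null && !part.trim().isEmpty();
    }

    // A triplet is valid only if all three parts are present.
    // Testing for null alone would let "" through as a valid collectionCode.
    static boolean isValid(String institutionCode, String collectionCode, String catalogNumber) {
        return isPresent(institutionCode)
            && isPresent(collectionCode)
            && isPresent(catalogNumber);
    }

    public static void main(String[] args) {
        // Placeholder values, not taken from any real dataset.
        System.out.println(isValid("INST", "COLL", "12345")); // complete triplet
        System.out.println(isValid("INST", "", "12345"));     // empty collectionCode
    }
}
```

With a check like this, records without a collectionCode fail validation up front rather than being passed on to persistence.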


Author: omeyn@gbif.org
Comment: That didn't solve everything: at least one dataset still shows its unchanged count going up when the records are definitely different (c585e6fb-fd76-426e-ae01-a32dc9de5689)
Created: 2013-03-08 14:51:22.224
Updated: 2013-03-08 14:51:22.224


Author: omeyn@gbif.org
Comment: For c585, HBase shows the correct record count in both the main table and the secondary index, suggesting the problem is in message handling and/or the ZooKeeper (zk) layer.
Created: 2013-03-08 15:44:05.205
Updated: 2013-03-08 15:44:05.205


Author: omeyn@gbif.org
Comment: Not just in DwC-A crawls: also seen in a crawl of the TAPIR endpoint c1a13bf0-0c71-11dd-84d4-b8a03c50a862
Created: 2013-03-11 16:35:02.418
Updated: 2013-03-11 16:35:02.418


Author: omeyn@gbif.org
Created: 2013-04-02 10:53:29.2
Updated: 2013-04-02 10:53:29.2
        
After deeper investigation, it appears that every dataset showing this behaviour is in fact flawed in the ways described on the crawler wiki page [1], where we've been tracking the results of crawl testing. Usually these flaws take the form of invalid or duplicate triplets.

[1] http://dev.gbif.org/wiki/display/INT/Crawler
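
To illustrate why duplicate triplets in a brand-new dataset would show up as updated or unchanged records, here is a rough sketch. It assumes records are keyed by their triplet and compared against whatever is already stored under that key. The CrawlCounter class, the in-memory map and the string keys are all hypothetical stand-ins, not the crawler's actual persistence code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of triplet-keyed persistence.
final class CrawlCounter {
    enum Outcome { NEW, UPDATED, UNCHANGED }

    // Stand-in for the occurrence store: triplet key -> record content.
    private final Map<String, String> store = new HashMap<>();

    // Classifies a record by comparing it with what is already stored
    // under the same triplet. Content is assumed to be non-null.
    Outcome persist(String triplet, String content) {
        String previous = store.put(triplet, content);
        if (previous == null) {
            return Outcome.NEW;
        }
        return previous.equals(content) ? Outcome.UNCHANGED : Outcome.UPDATED;
    }
}
```

Under these assumptions, the first record seen for a triplet is NEW, but a second record in the same crawl with the same triplet is classified as UNCHANGED (identical content) or UPDATED (different content), even though the dataset has never been crawled before.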