Issue 14343

Crawl automatically on dataset updates

Reporter: kbraak
Assignee: omeyn
Type: Improvement
Summary: Crawl automatically on dataset updates
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2013-11-08 11:49:56.924
Updated: 2013-12-17 15:46:38.143
Resolved: 2013-12-17 15:43:05.889
        
Description: I realize the crawler may not be listening for update events yet as a precautionary measure.

This issue serves as a reminder that automatic crawling on dataset updates still needs to be turned on.

It should be smart enough that, if the same dataset gets updated before its crawl has finished, a second crawl never runs in parallel. Either the first crawl is aborted and the second one started, or the first one is simply allowed to finish. Imagine a scenario where eBird is republished hourly but a crawl takes 100 hours to complete.
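
To make the two options above concrete, here is a minimal sketch of a guard that allows at most one active crawl per dataset. It is not the crawler's actual code: `CrawlRegistry`, `OnDuplicate`, and the method names are hypothetical, and the real service would presumably keep this state somewhere shared rather than in process memory.

```python
import threading
from enum import Enum


class OnDuplicate(Enum):
    """What to do when an update arrives for a dataset that is already crawling."""
    KEEP_RUNNING = "keep_running"  # let the first crawl finish, reject the new request
    RESTART = "restart"            # supersede the first crawl with a new one


class CrawlRegistry:
    """Tracks at most one active crawl per dataset key (hypothetical helper)."""

    def __init__(self, policy=OnDuplicate.KEEP_RUNNING):
        self._policy = policy
        self._lock = threading.Lock()
        self._running = {}   # dataset_key -> attempt number of the active crawl
        self._attempts = {}  # dataset_key -> number of crawls started so far

    def on_dataset_updated(self, dataset_key):
        """Return 'started', 'rejected', or 'restarted' for an update event."""
        with self._lock:
            if dataset_key in self._running:
                if self._policy is OnDuplicate.KEEP_RUNNING:
                    return "rejected"
                outcome = "restarted"  # the older attempt is superseded
            else:
                outcome = "started"
            attempt = self._attempts.get(dataset_key, 0) + 1
            self._attempts[dataset_key] = attempt
            self._running[dataset_key] = attempt
            return outcome

    def on_crawl_finished(self, dataset_key, attempt):
        """Clear the entry only if the finishing attempt is still the active one."""
        with self._lock:
            if self._running.get(dataset_key) == attempt:
                del self._running[dataset_key]
                return True
            return False

    def active_attempt(self, dataset_key):
        """Return the attempt number of the active crawl, or None if idle."""
        with self._lock:
            return self._running.get(dataset_key)
```

With the default `KEEP_RUNNING` policy, a second `on_dataset_updated("ebird")` call made while the first crawl is still active returns `"rejected"`, so hourly republishing cannot pile up crawls. With `RESTART`, the attempt number lets a superseded crawl recognise that it is stale when it reports completion.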

Automatic crawls also need to work nicely with the automatic crawl scheduling service proposed in CRAWLER-21.
    


Author: omeyn@gbif.org
Comment: In production. Multiple simultaneous crawls of the same dataset have always been disallowed: scheduling a crawl for a dataset that is already being crawled gets rejected.
Created: 2013-12-17 15:43:05.919
Updated: 2013-12-17 15:43:05.919