Issue 17227

IPT update is NOT triggering crawls

Reporter: trobertson
Assignee: mdoering
Type: Bug
Summary: IPT update is NOT triggering crawls
Priority: Blocker
Resolution: Fixed
Status: Closed
Created: 2015-02-12 21:04:44.091
Updated: 2015-02-17 11:42:32.474
Resolved: 2015-02-17 11:42:32.443
        
Description: I just had a conference call with VertNet (Laura and John W).  Here is what we observed:

2 out of 3 times when Laura hit publish we saw:
  i. the DOI was minted on the dataset page
  ii. the “publication date” was NOT set to today
  iii. the crawl was not initiated

The update button in the IPT is not triggering the desired behaviour.  I manually clicked “crawl” in the registry and it did exactly what we would expect.  1 out of 3 times this happened automatically.

I had Rabbit consoles open and the systems were all idle during this period.  The registry is not emitting "crawl me now" messages on IPT updates correctly.  Perhaps it is incorrectly looking for changes in metadata?
    


Author: trobertson@gbif.org
Created: 2015-02-13 14:50:40.642
Updated: 2015-02-13 14:50:40.642
        
This commit hopefully will go a long way to solving this.  https://github.com/gbif/registry/commit/a83b52d8070de80a5345cfc1d876ca2688388bda

The issue is a race condition.
Changes to the dataset table immediately trigger crawling by broadcasting a message through rabbit.
However, immediately after changing the dataset, the endpoints are deleted and updated (e.g. the second step of syncing the POSTed update XML from an IPT).
The crawling fires up, and finds no endpoints, as they have just been deleted.
When the endpoints are added, the dataset is scheduled for recrawl.
This second request, however, is discarded: the crawler coordinator sees that it has only just crawled the dataset, and treats the repeated request as spurious.
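The sequence above can be sketched as a small deduplication check. This is a minimal illustration in Python, not the registry's actual Java code; the class and method names (`CrawlCoordinator`, `request_crawl`) and the window length are hypothetical, but the decision rule matches the "Ignoring update of dataset ... because either no crawlable endpoints or we just sent a crawl" log line:

```python
import time

class CrawlCoordinator:
    """Sketch of the dedupe logic: a crawl request for a dataset is
    ignored if the dataset has no crawlable endpoints, or if a crawl
    message was already sent for it very recently."""

    def __init__(self, dedupe_window_secs=60.0):
        self.dedupe_window_secs = dedupe_window_secs
        self.last_crawl_sent = {}  # dataset key -> time the last crawl message was sent

    def request_crawl(self, dataset_key, has_endpoints, now=None):
        now = time.time() if now is None else now
        last = self.last_crawl_sent.get(dataset_key)
        if not has_endpoints or (last is not None and now - last < self.dedupe_window_secs):
            return "ignored"  # corresponds to the "Ignoring update..." log line
        self.last_crawl_sent[dataset_key] = now
        return "crawl sent"

# The race as observed: the dataset-table update fires first, then the
# endpoints are deleted and re-added, and the re-add lands inside the window.
c = CrawlCoordinator()
print(c.request_crawl("3ad882bb", has_endpoints=True, now=0.0))   # crawl sent, but endpoints are about to vanish
print(c.request_crawl("3ad882bb", has_endpoints=False, now=0.4))  # ignored: no crawlable endpoints
print(c.request_crawl("3ad882bb", has_endpoints=True, now=0.6))   # ignored: we just sent a crawl
```

The end result is exactly what the logs show: the only request that would actually have worked (endpoints present again) is the one that gets thrown away.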

This shows in the logs (Ignoring update...) as:
{code}
[crap@bla6 logs]$ grep 3ad882bb-cd21-4201-8b83-3684bfc6d830 *log
registry-change_stdout.log:INFO  [2015-12-02 20:13:50,524+0100] [pool-7-thread-1] org.gbif.occurrence.cli.registry.RegistryChangeListener: Sending crawl for updated dataset [3ad882bb-cd21-4201-8b83-3684bfc6d830]
registry-change_stdout.log:INFO  [2015-12-02 20:13:50,923+0100] [pool-7-thread-1] org.gbif.occurrence.cli.registry.RegistryChangeListener: Ignoring update of dataset [3ad882bb-cd21-4201-8b83-3684bfc6d830] because either no crawlable endpoints or we just sent a crawl
registry-change_stdout.log:INFO  [2015-12-02 20:13:51,126+0100] [pool-7-thread-1] org.gbif.occurrence.cli.registry.RegistryChangeListener: Ignoring update of dataset [3ad882bb-cd21-4201-8b83-3684bfc6d830] because either no crawlable endpoints or we just sent a crawl
registry-change_stdout.log:INFO  [2015-12-02 20:20:18,386+0100] [pool-7-thread-1] org.gbif.occurrence.cli.registry.RegistryChangeListener: Sending crawl for updated dataset [3ad882bb-cd21-4201-8b83-3684bfc6d830]
[crap@bla6 logs]$
{code}



    


Author: trobertson@gbif.org
Created: 2015-02-17 11:42:32.472
Updated: 2015-02-17 11:42:32.472
        
We have implemented an embargo period, and LR managed to go through 57 datasets, all of which triggered indexing.
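The idea behind the embargo can be sketched as follows. This is an assumed mechanic, not the registry's actual implementation: instead of broadcasting the crawl message the instant the dataset row changes, the trigger waits a short embargo so the endpoint delete/re-add from the IPT sync has finished before the crawl looks for endpoints. The function name and delay are hypothetical:

```python
import threading

def schedule_crawl_with_embargo(dataset_key, send_crawl, embargo_secs=5.0):
    """Delay the crawl trigger by a short embargo so that the endpoint
    sync following a dataset update has time to complete first."""
    timer = threading.Timer(embargo_secs, send_crawl, args=[dataset_key])
    timer.start()
    return timer  # can be cancelled if a newer update supersedes this one
```

With the embargo in place, the crawl fires once, after the endpoints are back, rather than racing the endpoint rewrite.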

While a rare race condition still exists, I believe this is now robust enough to be considered acceptable, unless it starts appearing frequently again.  We can re-crawl at any point, so rare occurrences can be handled.