Issue 16130

CRITICAL Crawling: Machine tags broken

16130
Reporter: trobertson
Assignee: omeyn
Type: Bug
Summary: CRITICAL Crawling: Machine tags broken 
Priority: Critical
Resolution: Fixed
Status: Closed
Created: 2014-07-18 10:41:55.749
Updated: 2014-08-08 16:01:15.901
Resolved: 2014-08-08 16:01:15.852
        
Description: This should not be possible:

https://www.dropbox.com/s/sfquj19h1ix4b6p/Screenshot%202014-07-18%2018.32.12.png

Which failed in the logs with:
org.gbif.api.exception.ServiceUnavailableException: crawl_attempt tag exists more than once, clean up first and retry [823dc56e-f987-495c-98bf-43318719e30f]

Could this be related to recent  changes to machine tags?  I am not sure the registry doesn't normalize tags just for display, so I think checks in crawler need to be case sensitive.]]>


Author: trobertson@gbif.org
Created: 2014-07-18 11:04:14.635
Updated: 2014-07-18 11:04:14.635
        
Speculation: The following sequence of events could cause this:

Live registry starts using "machineTag" instead of "machinetag" as it previously was.
Crawling used an old registry webservices client and started crawling.
The crawling method[1] would not find any datasets as it looks for the wrong value, so writes a new tag with attempt 1 [which is what we saw as the second tag]

Any subsequent crawls of this dataset will fail on this line [2].

If the above sequence didn't happen as this, something similar happened, when the crawler did not correctly get the machine tags in the first lookup call, so went and wrote a new one.

[1] https://github.com/gbif/crawler/blob/master/crawler-coordinator/src/main/java/org/gbif/crawler/CrawlerCoordinatorServiceImpl.java#L350
[2] https://github.com/gbif/crawler/blob/master/crawler-coordinator/src/main/java/org/gbif/crawler/CrawlerCoordinatorServiceImpl.java#L358


Author: omeyn@gbif.org
Created: 2014-07-23 11:41:12.628
Updated: 2014-07-23 11:41:12.628
        
I _think_ this may have been two crawler coordinators running simultaneously and somehow failing to update the same tag. All of the duplicated tags (there were about 100) had timestamps a few ms different from the other existing tag(s) for that dataset. Some of the duplicates had the same count as the "original" and some wrote a new '1'. New duplicates do not appear to get created now, so lends some weight to the simultaneous coordinators theory. The created timestamps on the duplicates were also spread out over time - some in June, some July 2014, and one even in December 2013.

I've deleted the duplicates from prod registry now, but not sure what further action can be taken. My coordinators theory should really be tested in dev as a next step.


Author: omeyn@gbif.org
Comment: On reflection, my idea of duplicate coordinators is dumb - it presupposes rabbit delivering the same message twice, which it explicitly guarantees it won't do. Testing with a 2nd coord running (in dev) confirms that this is not the cause.
Created: 2014-07-24 16:30:58.378
Updated: 2014-07-24 16:30:58.378


Author: omeyn@gbif.org
Comment: Argg, this is still happening - 8 new cases in last 24 hrs. Raising back to critical.
Created: 2014-07-24 16:48:18.559
Updated: 2014-07-24 16:48:18.559


Author: omeyn@gbif.org
Comment: appears to be related to autocrawl in response to message from a new version of dataset available from an IPT. Recreated in dev.
Created: 2014-08-07 16:26:37.364
Updated: 2014-08-07 16:26:37.364


Author: omeyn@gbif.org
Comment: Added a short-lived (5 sec) cache of recently crawled datasets to the registry change listener to stop sending a bunch of crawl messages for the same dataset at once. Stops the zk error that seems to have caused the multiple machineTag creation attempts.
Created: 2014-08-08 16:01:15.899
Updated: 2014-08-08 16:01:15.899