Issue 12003

Incorporate changes from Tim's review of Crawler Coordinator

Reporter: lfrancke
Assignee: lfrancke
Type: Task
Summary: Incorporate changes from Tim's review of Crawler Coordinator
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-10-10 15:53:39.083
Updated: 2013-12-17 16:13:00.75
Resolved: 2012-10-24 15:01:33.335
        
Description: Tim's done a review of the current state of the Coordinator which I need to work on:

{quote}
Summary:
In general I understand the logic, but have a concern.  I would expect CrawlerCoordinatorService to provide the methods for "what is crawling now", "how big the crawl queue is", "state of current crawl", "terminate crawl", "suspend crawl" etc.  I expect these are being developed in another project (José's maybe?) but that suggests scattered logic to me (and also unnecessarily exposes the low-level ZK API to more projects than needed).  As it is coded, this is really a CrawlQueueService, not a coordinating service.  It is not clear how an actual crawl is started (e.g. a thread issuing requests); I assume something is watching ZK, or something else is listening to the same message?  That may be my lack of understanding, but the docs in this project don't provide enough info to tell (which seems like a problem).

Details:
1) CrawlerCoordinatorService.java
1.1) Apparent misuse of IllegalArgumentException
     "scheduled already" is not really an illegal argument - suggest simply swallowing the request, or introducing an IneligibleToCrawlException.
1.2) Suggest renaming scheduleCrawl to initiateCrawl or addToQueue.  Schedule means "arrange or plan (an event) to take place at a particular time",
     which this does not do - it simply asks for a crawl as soon as resources allow.
1.3) I would expect CrawlerCoordinatorService to provide the methods for "what is crawling now", "how big
     the crawl queue is", "state of current crawl" etc.  I expect these are being developed in another
     project (José's maybe?) but that suggests scattered logic to me, unnecessarily exposing the ZK API to more
     projects than needed.  This is really a CrawlQueueService if it stays as is, but I think it should be expanded to do more; see the sketch below.
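For illustration, the kind of surface I would expect - a hypothetical sketch only; the method names and the CrawlStatus type are illustrative, not an actual API:
{code}
import java.util.List;
import java.util.UUID;

// Hypothetical sketch; CrawlStatus would be some status DTO.
public interface CrawlerCoordinatorService {
  void initiateCrawl(UUID datasetUuid);          // was: scheduleCrawl (see 1.2)
  List<UUID> getCurrentCrawls();                 // "what is crawling now"
  int getQueueSize();                            // "how big the crawl queue is"
  CrawlStatus getCrawlStatus(UUID datasetUuid);  // "state of current crawl"
  void suspendCrawl(UUID datasetUuid);
  void terminateCrawl(UUID datasetUuid);
}
{code}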


2) CrawlerCoordinatorServiceImpl.java
2.1) Not sure why both "/crawls" and "/crawls/" are used, and why "/crawls" is not a static final constant
2.2) An explanation in the Javadoc of the Netflix usage would be helpful.  Most people won't know what those bits provide.
  e.g. "This makes use of the Netflix Curator framework, which provides..."
  e.g. "This makes use of the Netflix DistributedPriorityQueue recipe, which provides..."
  A sketch follows.
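For example - a sketch only; the consumer/serializer helpers and the "/crawls" path are assumptions, not taken from the project:
{code}
/**
 * This makes use of the Netflix Curator DistributedPriorityQueue recipe: queue
 * items are persisted as ZooKeeper nodes under a shared path and handed to a
 * QueueConsumer in priority order, so queued crawls survive client restarts.
 */

// Construction sketch; crawlConsumer and uuidSerializer are assumed helpers.
DistributedPriorityQueue<UUID> queue = QueueBuilder
    .builder(curator, crawlConsumer, uuidSerializer, "/crawls")
    .buildPriorityQueue(0);  // minItemsBeforeRefresh
queue.start();
{code}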
2.3) Unintuitive logic (I was always taught it is bad practice to return mid-method):
{code}
    if (!endpoint.isPresent()) {
      return;
    }
    ZooKeeperCrawlStatus crawlStatus = getCrawlStatus(datasetUuid, dataset, endpoint.get());
    LOG.info("Crawling endpoint [{}] for dataset [{}]", endpoint.get().getUrl(), datasetUuid);
    byte[] dataBytes = serializeCrawlStatus(crawlStatus);
    queueCrawlStatus(datasetUuid, dataBytes);
{code}
Suggest:
{code}
    if (endpoint.isPresent()) {
      ZooKeeperCrawlStatus crawlStatus = getCrawlStatus(datasetUuid, dataset, endpoint.get());
      LOG.info("Crawling endpoint [{}] for dataset [{}]", endpoint.get().getUrl(), datasetUuid);
      byte[] dataBytes = serializeCrawlStatus(crawlStatus);
      queueCrawlStatus(datasetUuid, dataBytes);
    }
{code}
2.4) checkDataset() has lots of:
{code}
      LOG.info(message);
      throw new IllegalArgumentException(message);
{code}
    Why not catch it once in the calling method and log once there, then use Guava Preconditions for the checks?  See the sketch below.
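A sketch of what that could look like, assuming Guava Preconditions; the method and field names are guesses from context, not the actual code:
{code}
// Inside checkDataset(): validate without logging
// (com.google.common.base.Preconditions).
Preconditions.checkArgument(dataset != null,
    "Dataset [%s] does not exist", datasetUuid);
Preconditions.checkArgument(!dataset.getEndpoints().isEmpty(),
    "Dataset [%s] has no endpoints", datasetUuid);

// In the single calling method: log once, as suggested.
try {
  checkDataset(datasetUuid, dataset);
} catch (IllegalArgumentException e) {
  LOG.info(e.getMessage());
  throw e;
}
{code}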

2.5) scheduleCrawl appears to have a race condition should two coordinators be running - a problem?

2.6) Suggest renaming checkDataset to assertEligibleToCrawl(), returning void

2.7) You do a lot of log-and-throw and it is not immediately clear why:
     http://today.java.net/article/2006/04/04/exception-handling-antipatterns#logAndThrow

2.8) getCrawlStatus() would benefit from some Javadoc explaining that it is the business logic
     that unpacks various content from the registry and stores it in ZK for downstream work.  Specifically,
     the KVP "metadata" required to actually issue crawl messages over the various protocols is stored; see the sketch below.
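Something along these lines, for example (wording taken from the description above):
{code}
/**
 * Business logic that unpacks various content from the registry for this
 * dataset and stores it in ZooKeeper for downstream work. Specifically, the
 * KVP "metadata" required to actually issue crawl messages over the various
 * protocols is stored here.
 */
{code}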

3) General
Project is missing a readme, and it is not clear how ZK info is passed in, or how to start the service.  Suspect it is not functional yet.{quote}
    


Author: lfrancke@gbif.org
Created: 2012-10-22 14:30:44.855
Updated: 2012-10-22 14:30:44.855
        
{quote}2.5) scheduleCrawl appears to have a race condition should two coordinators be running - a problem?{quote}

[~trobertson@gbif.org] Could you elaborate?
    


Author: lfrancke@gbif.org
Created: 2012-10-22 14:39:22.498
Updated: 2012-10-22 14:39:22.498
        
{quote}1.1) Apparent misuse of IllegalArgumentException
"scheduled already" is not really an illegal argument - suggest simply swallowing the request, or introducing an IneligibleToCrawlException.{quote}

Any opinion if this exception should be checked or unchecked?
    


Author: lfrancke@gbif.org
Comment: About that exception: I partly agree. Most of the exceptions I throw really are illegal arguments (doesn't exist, no endpoints etc.), but "already scheduled" is a different thing, so I'll add an exception for that. We also have the case where a dataset is not synchronized (no metadata sync ever ran). I'm treating that as an illegal argument as well right now, because I think we should only allow things to be passed in that are actually ready to be crawled.
Created: 2012-10-22 14:42:38.595
Updated: 2012-10-22 14:42:38.595


Author: trobertson@gbif.org
Created: 2012-10-24 09:44:18.431
Updated: 2012-10-24 09:45:23.063
        
2.5) Race condition:

- checkDataset() does a checkExists for the node in ZK
- should the crawl not already be running, the coordinator later calls curator.create().creatingParentsIfNeeded().forPath(...

Assuming checkExists() does not actually create the node (the Curator Javadocs don't explain this) and simply checks whether it exists, two running coordinators might both check the existence of the node, see that it doesn't exist, and then both attempt to create it later on.  I am not sure if this is a valid concern or not, but please verify; a possible fix is sketched below.
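If the race is real, one option is to let the atomic create() be the check itself. A fragment, assuming the enclosing method already declares Exception; AlreadyCrawlingException is a hypothetical name:
{code}
try {
  // create() is atomic in ZooKeeper: exactly one coordinator wins the race
  curator.create().creatingParentsIfNeeded().forPath("/crawls/" + datasetUuid);
} catch (KeeperException.NodeExistsException e) {
  // the other coordinator got there first: treat as "scheduled already"
  throw new AlreadyCrawlingException("Dataset [" + datasetUuid + "] is already queued", e);
}
{code}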


1.1) I would say checked.  Justification: generally, runtime exceptions are non-recoverable, indicating things like a server being down, some illegal runtime state, or a bug; but in this instance you would want the developer to guard against it and actually decide whether to swallow it, retry later, etc.  A sketch follows.
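For reference, a checked version could look roughly like this (name from 1.1, hypothetical):
{code}
// Checked, so callers must decide: swallow, retry later, surface to the user, etc.
public class IneligibleToCrawlException extends Exception {

  public IneligibleToCrawlException(String message) {
    super(message);
  }

  public IneligibleToCrawlException(String message, Throwable cause) {
    super(message, cause);
  }
}
{code}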
    


Author: lfrancke@gbif.org
Comment: Good catch about that race condition. It'll throw an exception when that happens. It isn't a beautiful exception, but the special-case handling doesn't warrant anything better, I think. I'll add a comment in the code.
Created: 2012-10-24 14:28:34.748
Updated: 2012-10-24 14:28:34.748


Author: lfrancke@gbif.org
Comment: I disagree on the unchecked/checked thing though. There's nothing the coder can do to prevent it: the UUID is ultimately provided by the user, so it is very recoverable. If you don't feel strongly about it I'd leave it unchecked for now.
Created: 2012-10-24 14:32:22.793
Updated: 2012-10-24 14:32:22.793