Issue 18436

Show reason last crawl attempt failed on dataset page
Reporter: kbraak
Assignee: bko
Type: Improvement
Summary: Show reason last crawl attempt failed on dataset page
Priority: Unassessed
Status: Open
Created: 2016-04-27 10:30:35.964
Updated: 2017-10-10 15:21:27.976
        
Description: The crawl history maintained in the registry is insufficient. The root cause of a failure usually needs to be extracted from the Kibana logs or determined manually.

If a crawl attempt fails and the reason is unknown, the dataset page could show a status of "Failed - under investigation".

After determining the root cause of the failure, the GBIF Data Manager [~jlegind] could push a more detailed status update such as: "Failed - missing unique occurrenceIDs"
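
One possible mechanism for pushing such a status, sketched below, would be a machine tag on the dataset in the registry. This is only an illustration: the namespace and name are invented, and how crawl status should actually be modelled would need to be decided.

{code}
# Hypothetical sketch: record the failure reason as a machine tag on the
# dataset so the portal can display it. The namespace and name here are
# invented for illustration; registry writes require authentication.
curl -XPOST 'http://api.gbif.org/v1/dataset/d5162873-89e0-40c7-8472-2e735c2443fd/machineTag' \
  -u username:password -H 'Content-Type: application/json' -d '{
  "namespace": "crawler.gbif.org",
  "name": "lastCrawlFailureReason",
  "value": "Failed - missing unique occurrenceIDs"
}'
{code}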
    


Author: mblissett
Created: 2016-04-29 18:00:11.232
Updated: 2016-04-29 18:02:38.743
        
The logs could be extracted using the Elasticsearch API.  Queries can be built and exported from Kibana.

I don't know how reliable this would be: would the useful messages be the most recent ones?

{code}
curl -XGET 'http://kibana2.gbif.org/logstash-2016.04.29/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "fquery": {
                "query": {
                  "query_string": {
                    "query": "datasetKey:(\"d5162873-89e0-40c7-8472-2e735c2443fd\")"
                  }
                },
                "_cache": true
              }
            }
          ]
        }
      }
    }
  },
  "size": 10,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "ignore_unmapped": true
      }
    },
    {
      "@timestamp": {
        "order": "desc",
        "ignore_unmapped": true
      }
    }
  ]
}'
{code}

Extract from result:

{code}
      "_source" : {
        "message" : "DwC-A for dataset [d5162873-89e0-40c7-8472-2e735c2443fd] not modified. Crawl finished",
        "@version" : "1",
        "@timestamp" : "2016-04-29T15:44:19.611Z",
        "type" : "dwca-downloader",
        "host" : "130.226.238.174:50460",
        "path" : "org.gbif.crawler.dwca.downloader.DwcaCrawlConsumer",
        "priority" : "INFO",
        "logger_name" : "org.gbif.crawler.dwca.downloader.DwcaCrawlConsumer",
        "thread" : "QueueBuilder-6",
        "log_timestamp" : 1461944660575,
        "attempt" : "3",
        "datasetKey" : "d5162873-89e0-40c7-8472-2e735c2443fd"
      },
{code}
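
Since the sample entry carries a priority field, a follow-up sketch (assuming failures are logged at ERROR level, which would need checking) could narrow the same query to the single most recent error for the dataset:

{code}
curl -XGET 'http://kibana2.gbif.org/logstash-2016.04.29/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "fquery": {
                "query": {
                  "query_string": {
                    "query": "datasetKey:(\"d5162873-89e0-40c7-8472-2e735c2443fd\") AND priority:(\"ERROR\")"
                  }
                },
                "_cache": true
              }
            }
          ]
        }
      }
    }
  },
  "size": 1,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "ignore_unmapped": true
      }
    }
  ]
}'
{code}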
    


Author: bko@gbif.org
Comment: Seems feasible to work on. For now, is the "logstash-2016.04.29" in the URL something that can be determined programmatically?
Created: 2016-05-06 11:09:44.144
Updated: 2016-05-06 11:09:44.144
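
Probably yes: the daily indices follow the default Logstash naming convention logstash-YYYY.MM.dd, so the index name can be built from the crawl attempt's date. Elasticsearch also accepts wildcard index patterns such as logstash-*, at the cost of searching every daily index. A minimal sketch, assuming the default naming and UTC dates:

{code}
# Derive the daily index from the date (UTC), then query it as before.
INDEX="logstash-$(date -u +%Y.%m.%d)"
curl -XGET "http://kibana2.gbif.org/${INDEX}/_search?pretty" -d '{
  "query": {
    "query_string": {
      "query": "datasetKey:(\"d5162873-89e0-40c7-8472-2e735c2443fd\")"
    }
  },
  "size": 1,
  "sort": [ { "@timestamp": { "order": "desc", "ignore_unmapped": true } } ]
}'
{code}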


Author: hoefft
Comment: We show the last failed crawl date, but since this is apparently insufficient I will leave this open. When there is an API, or the registry is updated with more detailed information, we can expose this on the website.
Created: 2017-10-10 15:21:27.976
Updated: 2017-10-10 15:21:27.976