Issue 11608

Metrics necessary for a Dataset

Reporter: jcuadra
Assignee: lfrancke
Type: Task
Summary: Metrics necessary for a Dataset
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-07-17 12:02:22.397
Updated: 2013-12-17 16:13:01.096
Resolved: 2012-10-23 21:49:19.687
        
Description: Here we list all the metrics necessary for a Dataset that we can easily capture _during crawling and interpretation_:

- Target count
- Max Records Harvested
- Records Harvested
- Records Dropped
- Harvesting Start (date)
- Last Harvested (date)
- Occurrence IDs for all Occurrences that failed interpretation in the last crawl
-- TODO: define what is meant here (total failure, geo failure, etc.?)
- Availability/Uptime statistics & history for a Dataset (or Technical Installation?)
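
A minimal sketch of how these per-dataset metrics might be grouped, purely illustrative (the class and all field names are assumptions, not an existing GBIF type):

{code:java}
import java.util.Date;
import java.util.List;

// Illustrative only: the class and field names are assumptions, not an existing GBIF type.
public class DatasetCrawlMetrics {
  private long targetCount;           // declared record count, from metadata synchronisation
  private long maxRecordsHarvested;   // all-time maximum across crawls
  private long recordsHarvested;      // count from the last crawl
  private long recordsDropped;        // e.g. targetCount - recordsHarvested
  private Date harvestingStart;       // start of the current crawl
  private Date lastHarvested;         // completion of the last crawl
  private List<String> failedOccurrenceIds; // failed interpretation in the last crawl
}
{code}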
    


Author: trobertson@gbif.org
Comment: I would suggest separating current harvest metrics from "last successful harvest" metrics.
Created: 2012-07-17 12:31:02.252
Updated: 2012-07-17 12:31:02.252


Author: lfrancke@gbif.org
Created: 2012-07-17 12:32:26.34
Updated: 2012-07-17 12:32:26.34
        
* _Dataset Name, Provider, URL, Endorsing Node_ are not really Metrics
* _Target Count_ comes from the Metadata Synchronizer; for which protocols does this exist?
* _Max Records Harvested_ is an all-time maximum?
* _Records Harvested, Records Dropped_ are for the last crawl, I assume, keeping historical records
* How do _Harvesting Start (date)_ and _Last Harvested (date)_ differ?
    


Author: ahahn@gbif.org
Created: 2012-07-17 12:51:24.747
Updated: 2012-07-17 12:53:30.718
        
- _Dataset Name, Provider, URL and Endorsing Node_ are needed for overview, sorting and filtering
- confirm: _Max Records Harvested_ is indeed an all-time maximum, used to determine whether we ever did better than in the last run (e.g., did we manage to harvest everything previously, but now failed or got fewer records?). It would need an easier reset function, though, as very old numbers there are of limited value
- confirm: _Records Harvested_ is from the last crawl, and used for testing against the expected number. _Records Dropped_ is calculated as _Target Count - Records Harvested_; this assumes that a metadata update (to refresh the _Target Count_) has been run before the crawling, or else the calculated value may be bogus. Might consider not calculating if date crawled < date metadata updated, or date crawled > x time after metadata updated (a sketch follows this list)
- _Harvesting Start_ marks a running process, which moves to _Last Harvested_ when finished. Currently, this only refers to the crawl process. For a data publisher, it would be more meaningful if it referred to the whole process chain, i.e. "arrived in index database".
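
A sketch of the guarded calculation proposed above, with the staleness window "x" left as an illustrative constant:

{code:java}
import java.time.Duration;
import java.time.Instant;
import java.util.OptionalLong;

// Sketch only: MAX_METADATA_AGE stands in for the unspecified "x time" above.
public final class DroppedRecords {
  private static final Duration MAX_METADATA_AGE = Duration.ofDays(7); // illustrative value

  /** Returns empty when the Target Count is too stale (or newer than the crawl) to trust. */
  public static OptionalLong compute(long targetCount, long recordsHarvested,
                                     Instant metadataUpdated, Instant crawled) {
    boolean metadataBeforeCrawl = !crawled.isBefore(metadataUpdated);
    boolean fresh = Duration.between(metadataUpdated, crawled).compareTo(MAX_METADATA_AGE) <= 0;
    if (!metadataBeforeCrawl || !fresh) {
      return OptionalLong.empty(); // calculated value would be bogus, so don't calculate
    }
    return OptionalLong.of(Math.max(0, targetCount - recordsHarvested));
  }
}
{code}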

The last point would also concern other metrics: it might make sense to break counts down into "records harvested", "records synchronised", "records processed", as the combination helps us to identify at which steps records are lost, while a data publisher is mainly interested in knowing when and how their records are publicly accessible.
    


Author: lfrancke@gbif.org
Created: 2012-07-17 14:01:42.339
Updated: 2012-07-17 14:01:42.339
        
This issue is purely meant to collect metrics about a certain dataset, so it doesn't include any metadata.

{quote}Aggregate counts (historical too) for issues: Communication, Protocol and Content (BoR missing, Coordinates, ...) issues.{quote}

Which kind of aggregation do we need/want? Are three counters enough or do we want to separate this further?
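
One possible reading, sketched under the assumption that the three coarse categories from the quote are kept and each is broken down further by a free-form detail key ("BoR missing", "Coordinates", ...); the class shape is illustrative:

{code:java}
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: three coarse counters, each optionally broken down by a detail key.
public class IssueCounters {
  public enum Category { COMMUNICATION, PROTOCOL, CONTENT }

  private final Map<Category, Map<String, Long>> counts = new EnumMap<>(Category.class);

  public void increment(Category category, String detail) {
    counts.computeIfAbsent(category, c -> new HashMap<>())
          .merge(detail, 1L, Long::sum);
  }

  /** Coarse total for one category, summing all detail keys. */
  public long total(Category category) {
    return counts.getOrDefault(category, Map.of()).values().stream()
                 .mapToLong(Long::longValue).sum();
  }
}
{code}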
    


Author: mdoering@gbif.org
Created: 2012-07-17 15:55:50.25
Updated: 2012-07-17 16:05:35.468
        
Is this also meant for checklist datasets?
When I'm managing the checklist bank index it is often interesting to know:
 - the dataset subtype (it's always of type checklist in that case, of course)
 - the types of dwca extensions mapped
 - nub/catalogue of life coverage metrics, i.e. the percentage of names also found in nub / col
    


Author: jcuadra@gbif.org
Created: 2012-07-18 15:48:09.631
Updated: 2012-07-18 15:48:59.814
        
Markus, I have come up with a mockup for checklists, a rather simple one of course.
My knowledge does not go very deep into this subject, so may I ask you to suggest any other ultra-important fields to be added?

On your point "nub/catalogue of life coverage metrics": could it be worth just putting a placemark on the main checklist table (on the mock) and having an auxiliary window show all the COL-coverage metrics?

Checklist Mock: http://dev.gbif.org/wiki/display/POR/Crawler+UI+-+Checklist+Datasets

(btw, I don't really know how to format the "mapped extensions" into that table...)
    


Author: mdoering@gbif.org
Created: 2012-07-18 19:20:18.753
Updated: 2012-07-18 19:22:31.866
        
For checklists, which are always dwc archives, the most important information right now comes from the dwca validator.
For example see this report: http://tools.gbif.org/dwca-reports/200-8519773639146970007.html

Pretty much all of the information there is used to identify whether an archive is ready to be indexed/imported. Of that, the most important pieces of information are probably:

1. For every data file, most importantly for the core:
  - number of records, i.e. rows in the data file
  - number of empty rows, i.e. with no content or whitespace only
  - number of rows with different column counts than expected (from header row or meta.xml)

2. Are core ids unique? If not, a short list of sample duplicates should be accessible (see the sketch after this list)

3. Is referential integrity good, i.e. "foreign keys" pointing to the coreID? If not, a breakdown of which term has problems, with some examples of keys pointing to a non-existent core id:
  - core taxon file: parentNameUsageID, acceptedNameUsageID, originalNameUsageID
  - for every extension file the key column to the coreID

4. Is the taxonomic tree good?
  - parent taxa are always accepted, never synonyms
  - accepted taxa of synonyms are always accepted, never synonyms

5. EML exists & is parsable
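
Checks 2 and 3 boil down to set operations over the core ids; a rough sketch of both follows, which is not the actual dwca-validator code:

{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Rough sketch of checks 2 and 3 above; not the actual dwca-validator implementation.
public class CoreIdChecks {

  /** Check 2: returns a short sample of duplicate core ids (empty means all unique). */
  public static List<String> sampleDuplicates(Iterable<String> coreIds, int maxSamples) {
    Set<String> seen = new HashSet<>();
    List<String> duplicates = new ArrayList<>();
    for (String id : coreIds) {
      if (!seen.add(id) && duplicates.size() < maxSamples) {
        duplicates.add(id);
      }
    }
    return duplicates;
  }

  /** Check 3: returns a sample of "foreign keys" pointing to no existing core id. */
  public static List<String> sampleBrokenReferences(Set<String> coreIds,
                                                    Iterable<String> foreignKeys,
                                                    int maxSamples) {
    List<String> broken = new ArrayList<>();
    for (String key : foreignKeys) {
      if (!key.isEmpty() && !coreIds.contains(key) && broken.size() < maxSamples) {
        broken.add(key);
      }
    }
    return broken;
  }
}
{code}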

Something that doesn't exist yet but would be very valuable is a check of names with the name parser, to see how many of them are parsable.
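
A sketch of such a check; the NameParser interface here is hypothetical and merely stands in for whatever parser would actually be used:

{code:java}
// Hypothetical interface; the real name parser API may look different.
interface NameParser {
  boolean isParsable(String scientificName);
}

public class ParsableNameCheck {
  /** Returns the fraction of names the parser can handle, in [0, 1]. */
  public static double parsableRatio(Iterable<String> names, NameParser parser) {
    long total = 0;
    long parsable = 0;
    for (String name : names) {
      total++;
      if (parser.isParsable(name)) {
        parsable++;
      }
    }
    return total == 0 ? 0.0 : (double) parsable / total;
  }
}
{code}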
    


Author: lfrancke@gbif.org
Created: 2012-07-18 19:49:55.445
Updated: 2012-07-18 19:49:55.445
        
At the moment I'm only interested in Metrics that are easily collected during crawling and interpretation (of a single record). So that would be things like HTTP errors, bad XML, the total number of records, the number of bad interpretations, etc.

All of this only for Occurrence Datasets.

Everything else can/needs to be implemented later.
    


Author: lfrancke@gbif.org
Comment: I'll make sure to collect these things on a Wiki page and then close this issue as it's not really a suitable issue to actually work on.
Created: 2012-10-22 14:52:26.292
Updated: 2012-10-22 14:52:26.292