Issue 18191

Unindexed Irish dataset

Reporter: kylecopas
Assignee: jlegind
Type: Feedback
Summary: Unindexed Irish dataset 
Status: InProgress
Created: 2016-01-31 17:09:38.837
Updated: 2016-02-25 14:32:06.466
Description: This NBDC plant dataset looks to be identified properly as occurrences, but the 17,000+ available at source are not appearing:

Author: rdmpage
Created: 2016-01-31 19:49:45.837
Updated: 2016-01-31 19:49:45.837
This seems to be a recurring theme: data comes in and gets displayed on the portal regardless of whether it's been interpreted successfully. Nobody looking at the portal page can tell what happened, nor whether an issue (such as this one) has been reported. I'm guessing this is a dataset that doesn't define unique triples to identify occurrences (?)

It would be great if we could make the import process more transparent, flag data sets like this that haven't been parsed properly, and display issues on the portal page. There seem to be too many cases where we rely on curious people associated with GBIF to spot errors. I'm also a little alarmed that the data publishers themselves don't say something in cases like this. Do they get notified when the data is loaded? If they got a link saying "thanks for the data, it's been indexed, and has [insert number] occurrences, would you mind taking a look to see if it's been parsed OK?" I wonder if we could catch more of these cases?

Author: kylecopas
Created: 2016-01-31 20:55:15.454
Updated: 2016-01-31 20:55:15.454
I think we do catch them, Rod, and not simply from curiosity (though that doesn't hurt).

I don't think there's anything particularly typical or alarming about this dataset. It's old and hasn't been updated since 2014. It's coming from a non-IPT system that's had some ongoing bugs that NBDC knows exist and has committed to fixing (though this issue does not reflect those bugs).

It's probably not cause for despair. As I think you know, there's quite a lot of current activity and planning around the entire issue of data quality.

Author: rdmpage
Created: 2016-01-31 21:12:39.56
Updated: 2016-01-31 21:12:39.56
I guess I'm trying to avoid despair :) It's just that often when I look at a dataset (say a new one announced on Twitter) I see problems, problems that could be caught before the data was published, or at least flagged on the portal page. Instead we have the data set published as is, and if I was a data provider I might react along the lines of "how come my data has no occurrences, or you've only got 1000 of them, or all my taxa are unrecognised", etc. As a user I might look at a dataset and go "huh?, how come the MCZ has a dataset with no data?"

Many of these issues are not so much about data quality (in the sense of "is the data any good") but about how we parse data and handle cases where the publisher has made assumptions that don't match those of GBIF (e.g., taxa in Russian, lack of unique identifiers). If you compare the way people handle software development (issues flagged, visible next to the code in GitHub, tests for things being OK publicly visible as "badges") and data (here's the data, make of it what you want) then I think we've some way to go.

Created: 2016-02-01 12:03:13.076
Updated: 2016-02-01 12:03:13.076
[~rdmpage] There are plans to expose the dataset validation report for those datasets that failed this initial step of the indexing process. This will give users an indication of why there are no occurrences, or why the resource has not been updated in a while.
Such an addition to the portal could be augmented with a service that sends an email to the dataset contacts with the validation report.
There is no timetable for this yet, but the second half of 2016 looks realistic.

Reports containing a brief rundown of issues (number of species names not parsed, geo-referencing concerns, temporal issues etc.) could be mailed to the publisher as well, but we would need to have a wider discussion on this before committing to anything.  

Comment: This dataset has an issue with bad record identifiers and I have contacted them about it.
Created: 2016-02-01 15:24:23.455
Updated: 2016-02-01 15:24:23.455
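
Note on the identifier question raised in this thread: the following is a minimal Python sketch of the kind of check being discussed. It assumes, as guessed earlier in the thread and not confirmed here, that occurrence records are told apart by the Darwin Core triplet institutionCode / collectionCode / catalogNumber. The helper name check_identifiers and the sample values are hypothetical; they are not part of any GBIF tool and are not taken from the NBDC dataset.

```python
from collections import Counter

# Darwin Core terms assumed (for this sketch) to form the record "triplet".
TRIPLET_FIELDS = ("institutionCode", "collectionCode", "catalogNumber")


def check_identifiers(records):
    """Count records whose triplet is incomplete or shared with another record.

    `records` is an iterable of dicts keyed by Darwin Core term names,
    with string (or missing) values. Hypothetical helper for illustration.
    """
    incomplete = 0
    seen = Counter()
    for record in records:
        # Treat missing, None, and whitespace-only values as empty.
        triplet = tuple((record.get(field) or "").strip() for field in TRIPLET_FIELDS)
        if not all(triplet):
            incomplete += 1
        else:
            seen[triplet] += 1
    # Every record that shares its triplet with at least one other record.
    duplicated = sum(count for count in seen.values() if count > 1)
    return {
        "total": incomplete + sum(seen.values()),
        "incomplete": incomplete,
        "duplicated": duplicated,
        "distinct": len(seen),
    }


# Made-up sample records: one duplicated triplet, one empty catalogNumber.
sample = [
    {"institutionCode": "NBDC", "collectionCode": "Plants", "catalogNumber": "1"},
    {"institutionCode": "NBDC", "collectionCode": "Plants", "catalogNumber": "1"},
    {"institutionCode": "NBDC", "collectionCode": "Plants", "catalogNumber": "2"},
    {"institutionCode": "NBDC", "collectionCode": "Plants", "catalogNumber": ""},
]
print(check_identifiers(sample))
```

A summary like this, with counts of incomplete and duplicated identifiers next to the total, is the sort of figure a validation report to the publisher could carry. If most of a dataset's records fall into "incomplete" or "duplicated", that would be one plausible reason for far fewer occurrences appearing than exist at source.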