Australian National Insect Collection dataset seriously buggered, mate
16603
Reporter: rdmpage
Type: Bug
Summary: Australian National Insect Collection dataset seriously buggered, mate
Priority: Major
Resolution: Fixed
Status: Resolved
Created: 2014-10-30 19:01:56.654
Updated: 2014-10-31 12:04:58.673
Resolved: 2014-10-31 11:38:07.67
Description: Something has gone horribly wrong with the Australian National Insect Collection dataset, see http://www.gbif.org/occurrence/search?datasetKey=beed1b50-8c73-11dc-aaed-b8a03c50a862
We now have 191,793 occurrences of the taxon "Csiro Medvedev & Lawrence, 1984", which is entirely fictitious. The verbatim record for one occurrence (below) suggests the keys and values don't match. If this "verbatim" value is the result of parsing the occurrence.csv file in the Darwin Core Archive, then either that file is broken or the parser is.
Do we not have tools to catch when things like this happen?
{
"associatedSequences": null,
"basisOfRecord": "00",
"catalogNumber": "8c811258-4c0a-457e-9a11-ea54ebef2b86",
"class": "Arthropoda",
"collectionCode": "32-031532-152",
"continent": null,
"coordinatePrecision": "-110.866668701172",
"coordinateUncertaintyInMeters": null,
"country": null,
"county": "Arizona",
"dateIdentified": null,
"day": "08",
"decimalLatitude": null,
"decimalLongitude": "31.7166690826416",
"eventDate": null,
"eventID": "4",
"eventTime": "WGS84",
"family": "Hymenoptera",
"genus": "Formicidae",
"geodeticDatum": "urn:biolink.anic.ento.csiro.au:Event:022C7C68-2B37-4F5C-AFC0-F89C4C7DD905",
"id": "dr130|CSIRO|ANIC|32-031532-152",
"identificationQualifier": null,
"identifiedBy": "PreservedSpecimen",
"individualCount": null,
"infraspecificEpithet": "clarus",
"institutionCode": "ANIC",
"kingdom": "Species",
"locality": null,
"locationRemarks": null,
"maximumDepthInMeters": null,
"maximumElevationInMeters": "10000",
"minimumDepthInMeters": null,
"minimumElevationInMeters": null,
"month": "1963",
"occurrenceRemarks": null,
"order": null,
"phylum": "Animalia",
"recordNumber": null,
"recordedBy": "Odontomachus clarus Roger",
"scientificName": "CSIRO",
"specificEpithet": "Odontomachus",
"stateProvince": "United States",
"taxonRank": "Beatty,J.A.",
"vernacularName": null,
"year": "Madera Can. [Canyon], Pima Co. [County]"
}
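The shifted key/value pairings above (the phylum under "class", the collector under "recordedBy", and so on) are the classic signature of a column-offset bug in CSV parsing. A minimal sketch of how this happens, using invented field values rather than the actual ANIC file: a single stray, unquoted delimiter splits one field in two, and every subsequent value lands under the wrong header.

```python
import csv
import io

# Hypothetical rows, not the real occurrence.csv: the "broken" row has an
# unquoted comma inside catalogNumber, so the parser sees six fields where
# the header promises five, and every later value shifts one column left.
header = ["catalogNumber", "phylum", "class", "order", "recordedBy"]
clean = "32-031532-152,Arthropoda,Insecta,Hymenoptera,Beatty J.A."
broken = "32-031532,152,Arthropoda,Insecta,Hymenoptera,Beatty J.A."

for row in (clean, broken):
    fields = next(csv.reader(io.StringIO(row)))
    # zip silently truncates the extra field, hiding the damage
    print(dict(zip(header, fields)))
```

In the broken row the parser reports "152" as the phylum and "Hymenoptera" as the collector, exactly the kind of scrambling visible in the verbatim record. A field-count check against the header (rejecting any row whose field count differs) would catch this class of corruption before ingestion.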
Author: rdmpage
Created: 2014-10-30 21:45:22.058
Updated: 2014-10-30 21:45:22.058
OK, this looks like a GBIF bug. I've grabbed the Darwin Core Archive, parsed the file, and it looks fine. So I'm guessing there's a bug in the GBIF parser for comma-delimited CSV files.
Author: trobertson@gbif.org
Created: 2014-10-31 07:18:24.554
Updated: 2014-10-31 07:18:24.554
The archive was corrupt but now it is fixed.
It's crawling now, so should be online within an hour.
Author: rdmpage
Created: 2014-10-31 07:59:15.702
Updated: 2014-10-31 07:59:15.702
Thanks Tim. However, I'm puzzled: does this mean that the archive I grabbed last night (UK time) was new (and hence should have parsed OK), or was it old and "corrupted", in which case how come the parsing errors weren't caught? The files in the archive I grabbed were dated 28 September 2014.
I see the old records are still live in the portal, so now we have both good and bad records from the same dataset in GBIF. What happens to the old ones? Do they get culled? If so, between this issue and POR-2492 we've had to delete several hundred thousand records in the last few days.
Author: trobertson@gbif.org
Created: 2014-10-31 11:37:34.209
Updated: 2014-10-31 11:37:50.015
Just to reiterate what I believe you already know: the GBIF.org site is an index to the data mobilized by the participants of GBIF, not a curated database. It is analogous to the Google index in some sense. To update existing records in the index, we need to match incoming records against those already indexed, hence the requirement that records be persistently identified. When data are mangled (as in this case), or when people change record identifiers, the result is new records in the index, and the older ones are removed because their identifiers have changed. To get an idea of how often this happens: there are around 520 million records in the index, but identifiers have been issued up to roughly 1.03 billion. If data were stably identified, the number of records would be very close to the highest identifier.
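The gap Tim describes can be made concrete with a quick back-of-the-envelope calculation, using the round numbers quoted above (not exact counts):

```python
# Figures quoted in the comment above: ~520M records currently indexed,
# identifiers issued up to ~1.03 billion.
records_in_index = 520_000_000
highest_id = 1_030_000_000

# Every identifier above the record count corresponds to a record that was
# dropped and re-created under a new ID at some point.
retired = highest_id - records_in_index
churn_ratio = retired / highest_id
print(f"{retired:,} identifiers retired ({churn_ratio:.0%} of all issued)")
```

That is, roughly half of all identifiers ever issued no longer point at a live record, which is the scale of instability Tim is gesturing at.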
As far as I can see this process is all complete for ANIC [1], so I will close this issue. Thanks for reporting it. Please continue to use the feedback mechanisms in place: this Jira system and the email addresses for the data manager and technical contacts are the most constructive ways to ensure issues are addressed properly.
[1] http://www.gbif.org/dataset/beed1b50-8c73-11dc-aaed-b8a03c50a862
Author: rdmpage
Created: 2014-10-31 12:04:58.673
Updated: 2014-10-31 12:04:58.673
I take your point, but I don't think you can really claim to be a "dumb index" like Google (which, of course, isn't dumb at all given how much semantic processing it does). In this case pretty obvious crap was ingested and displayed, and all I'm suggesting is that it should have been possible to spot that something was wrong (a lat/long pair consisting of null and a float is a bit of a giveaway).
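The kind of pre-ingestion check being suggested here is cheap to sketch. The following is illustrative only, not GBIF's actual pipeline; the field names follow the verbatim record above, and the rules (coordinate pairs must be complete, numeric, and in range) are assumptions about what a sanity check might enforce:

```python
def coordinate_problems(record):
    """Return a list of problems with a record's coordinate fields."""
    problems = []
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")

    # A pair where exactly one half is missing is almost certainly a
    # column-shift or mapping error, not a genuine observation.
    if (lat is None) != (lon is None):
        problems.append("half-missing coordinate pair")

    # Present values must parse as numbers and fall in valid ranges.
    for name, value, lo, hi in [("decimalLatitude", lat, -90, 90),
                                ("decimalLongitude", lon, -180, 180)]:
        if value is None:
            continue
        try:
            if not lo <= float(value) <= hi:
                problems.append(f"{name} out of range: {value}")
        except ValueError:
            problems.append(f"{name} not numeric: {value}")
    return problems

# The verbatim record above trips the first rule immediately:
print(coordinate_problems({"decimalLatitude": None,
                           "decimalLongitude": "31.7166690826416"}))
```

Run over a whole dataset, a check like this flagging hundreds of thousands of records in one batch would be exactly the alarm bell being asked for.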
People changing IDs between indexing runs is another issue. Obviously they shouldn't do that, but in some cases (such as POR-2492) the identifier field stayed the same and only the collection code had changed. The provider had done what they were supposed to.
I get that this is challenging stuff and that resources are limited, and I'm on your side (even if you might not feel like it), but it's a continuing frustration to discover cases where there are large numbers of records (10^5 to 10^6) that are problematic. What's worse is that I seem to be the first to spot these, which is deeply worrying (is nobody else looking?).