15817
Reporter: feedback bot
Type: Improvement
Summary: No link to occurrences
Priority: Critical
Resolution: Fixed
Status: Closed
Created: 2014-06-03 16:38:31.159
Updated: 2017-10-09 17:05:22.016
Resolved: 2017-10-09 17:05:21.987
Description: This is a checklist with occurrence records that are indexed. Propose that the header show the number of occurrences and a button to "view occurrences".
Author: mdoering@gbif.org
Created: 2014-06-04 11:34:15.141
Updated: 2014-06-04 17:44:33.632
[~trobertson@gbif.org] I assume in addition to the species counts and button in the header, correct?
This might be a design/space problem.
Should we only show this when occ counts are more than zero?
Author: mdoering@gbif.org
Created: 2015-03-09 16:28:38.56
Updated: 2015-03-09 16:29:22.709
I've attached two versions where we show both the number of occurrences and taxa, based on this link:
http://www.gbif.org/dataset/62307f4a-1073-4e79-9aa0-276c454e5264
The problem is the blue "view occurrences" or "View species" button which we cannot place twice there as we do not have enough room.
[~trobertson@gbif.org], [~kylecopas] How about we remove the big blue button and just use the number and/or subtext to link to occurrences and species? If we do this I think we should do this in all datasets, not just checklists with occurrences.
Author: mdoering@gbif.org
Comment: Another recent example is this Pensoft dataset: http://www.gbif.org/dataset/3a8a3458-675f-46f1-abbc-6e089059e5e8
Created: 2015-08-24 11:16:42.245
Updated: 2015-08-24 11:16:42.245
Author: rdmpage
Created: 2015-08-26 12:38:16.749
Updated: 2015-08-26 12:38:16.749
[~trobertson@gbif.org] [~mdoering@gbif.org] I think this discussion obscures two distinct issues:
1. What to do with DwCA that have both taxa and occurrences, but have meta.xml files that specify taxa at the centre of the star
2. Problems with Pensoft DwCA (can we not have a simple way to catch these issues/errors?)
Regarding (1), if Pensoft publishes a single DwCA file with taxa as core, are the occurrences now indexed? If so, then we can have a single DwCA for both taxa and occurrences (instead of duplicated files that differ simply in what data type is regarded as core).
If both taxa and occurrences are indexed, then surely the same Pensoft dataset should appear as a source of both occurrences and checklists? For example, for Apanteles conanchetorum http://www.gbif.org/species/1268445 we should see "Streamlining the use of BOLD specimen data to record sp…" in both "Appears in" columns (it is only in the "checklists" column).
Regarding (2), looking at the Pensoft DwCA I think the problem is that while the taxa.csv file lists both taxonID and scientificName, e.g. 4153-sp2;"Apanteles conanchetorum Viereck, 1917"; the occurrences.csv file has only the taxonID column populated, not scientificName. Assuming that GBIF only indexes occurrences on scientificName, and doesn't make the connection via taxonID (i.e., when it finds a taxonID it doesn't go to taxa.csv to find the scientificName), then we have the reason these occurrences are not being picked up: Pensoft has not included the names in the occurrences.csv file.
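To make the lookup in (2) concrete, here is a minimal Python sketch of resolving a blank scientificName through taxonID. It is illustrative only and is not GBIF's indexing code: the function name `fill_names`, the occurrenceID value `occ-1`, and the assumption that both files are semicolon-delimited with a header row are all hypothetical.

```python
import csv
import io


def fill_names(taxa_csv: str, occurrences_csv: str, delimiter: str = ";") -> list[dict]:
    """Fill blank scientificName values in occurrence rows via taxonID."""
    taxa = csv.DictReader(io.StringIO(taxa_csv), delimiter=delimiter)
    # Index the taxon core: taxonID -> scientificName.
    name_by_id = {row["taxonID"]: row["scientificName"] for row in taxa}

    filled = []
    for row in csv.DictReader(io.StringIO(occurrences_csv), delimiter=delimiter):
        if not (row.get("scientificName") or "").strip():
            # No name on the occurrence: fall back to the taxon core, if present.
            row["scientificName"] = name_by_id.get(row["taxonID"], "")
        filled.append(row)
    return filled


# Hypothetical sample data modelled on the row quoted above.
TAXA = 'taxonID;scientificName\n4153-sp2;"Apanteles conanchetorum Viereck, 1917"\n'
OCCURRENCES = "occurrenceID;taxonID;scientificName\nocc-1;4153-sp2;\n"

rows = fill_names(TAXA, OCCURRENCES)
print(rows[0]["scientificName"])
```

An occurrence whose taxonID is missing from the taxon file keeps an empty name, which is the case a publisher-side check would want to flag.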
Author: trobertson@gbif.org
Created: 2015-08-26 12:43:34.227
Updated: 2015-08-26 12:43:34.227
1. Yes. That is what happens to the data (occurrences are indexed) and this issue is about how we are representing this in the UI
2. No, the indexing merges the taxon information and occurrence information to complete the record correctly.
Author: mdoering@gbif.org
Comment: I also wondered if we should not list a checklist dataset with occurrences under both "Appears in" sections. Then we can show the number of occurrences in one section and the actual name used in the checklist section.
Created: 2015-08-26 12:56:01.632
Updated: 2015-08-26 12:56:01.632
Author: rdmpage
Created: 2015-08-26 12:58:42.647
Updated: 2015-08-26 13:00:37.637
[~trobertson@gbif.org] "No, the indexing merges the taxon information and occurrence information to complete the record correctly."
*Um, no*. Take a look at http://www.gbif.org/occurrence/1058268534, which is an occurrence from the dataset in question (I found this via Viktor Senderov's email, where he explained that I can go to the dataset page, click on stats http://www.gbif.org/dataset/3a8a3458-675f-46f1-abbc-6e089059e5e8/stats and then on the occurrence metrics, which are all "unknown", so we can get them with one link; there's got to be a better way).
Note that GBIF says "name which can't be interpreted". I'm assuming that this is because there's no name supplied. So I don't think GBIF is making the connection between taxonID in the two files (if it did, the occurrences would surely have the name that is linked to the taxonID in the taxa.csv file?).
Note that I don't mind that GBIF hasn't made the link as such; ideally Pensoft would have included the scientificName field in the occurrences. What does bug me is that the Pensoft/GBIF links aren't working, and this seems to be because Pensoft and GBIF are making assumptions about what data GBIF accepts and how it is processed, and these don't seem to be true.
Author: rdmpage
Comment: [~trobertson@gbif.org] [~mdoering@gbif.org] Perhaps another way to tackle this is to have some simple tests, based on a simple parsing of the DwCA file. That is, parse the file, work out how many taxa, occurrences, references, etc. should be associated with the data file, then query GBIF via the API to see if that's what we get back for that dataset. For the example dataset at hand, we'd quickly see that zero occurrences for the dataset, and for each taxon, is not what we'd expect...
Created: 2015-08-26 13:13:14.703
Updated: 2015-08-26 13:13:14.703
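The test proposed above could look roughly like the following Python sketch. It only covers the local half: counting data rows per file in the archive and comparing them with counts supplied by the caller. Fetching the indexed counts from the GBIF API is deliberately left out. The helper names `count_rows` and `find_mismatches`, the file names, the fixed semicolon delimiter, the header-row assumption, and the sample numbers are all hypothetical; a real check would read the file layout from meta.xml.

```python
import csv
import io
import zipfile


def count_rows(archive, file_names, delimiter=";"):
    """Count data rows (header excluded) for each named file in a DwC-A zip."""
    counts = {}
    with zipfile.ZipFile(archive) as dwca:
        for name in file_names:
            with dwca.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text, delimiter=delimiter)
                next(reader, None)  # skip the header row (assumed present)
                counts[name] = sum(1 for row in reader if row)
    return counts


def find_mismatches(expected, indexed):
    """Return {file: (expected, indexed)} for every count that differs."""
    return {
        name: (n, indexed.get(name, 0))
        for name, n in expected.items()
        if indexed.get(name, 0) != n
    }


# Build a tiny in-memory archive so the sketch runs without a real DwC-A.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as dwca:
    dwca.writestr("taxa.csv", "taxonID;scientificName\n4153-sp2;Apanteles conanchetorum\n")
    dwca.writestr("occurrences.csv", "occurrenceID;taxonID\nocc-1;4153-sp2\nocc-2;4153-sp2\n")

expected = count_rows(buffer, ["taxa.csv", "occurrences.csv"])
indexed = {"taxa.csv": 1, "occurrences.csv": 0}  # hypothetical counts reported by the index
mismatches = find_mismatches(expected, indexed)
print(mismatches)
```

Keeping the comparison separate from the counting means the same check can be pointed at whatever source of indexed counts is available.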
Author: mdoering@gbif.org
Created: 2015-08-26 13:42:34.846
Updated: 2015-08-26 13:50:14.233
[~rdmpage] the unmatched Pensoft records in the stats are actually taxa, i.e. the core records. We definitely merge the information and I think that is very much right. The core record acts as base information which is inherited by every extension record.
The problem with that Pensoft dataset seems to be a temporary system outage or some other transient bug that should be fixed if we reindex that dataset. I'll reindex the dataset right now.
Indeed something wrong happens in the occurrence indexing part of that dataset. All but one name gets matched to our backbone inside checklistbank, but none of the occurrences are matched.
The Kibana logs contain lots of "Fail to parse backbone name null for occurrence 1058269020: BLACKLISTED" from org.gbif.occurrence.processor.interpreting.TaxonomyInterpreter
Author: rdmpage
Comment: [~mdoering@gbif.org] OK, in which case shortly we should see the Pensoft occurrences light up on this map http://www.gbif.org/species/1268445 ? Currently there are 30 occurrences, these should increase by 95...
Created: 2015-08-26 13:47:31.627
Updated: 2015-08-26 13:47:31.627
Author: rdmpage
Comment: [~mdoering@gbif.org] So, it looks like the parser is struggling with null values for names, and there's no logic to defer to taxonID to retrieve the name? This is what makes DarwinCore Archive so much fun, figuring out the intention of the archive's creator ;)
Created: 2015-08-26 14:03:59.254
Updated: 2015-08-26 14:03:59.254
Author: mdoering@gbif.org
Created: 2015-08-26 14:15:08.104
Updated: 2015-08-26 14:15:08.104
Think I found the cause. The archive has mapped dwc:scientificName in both the core and the extension. But only the core actually has data, all taxonomic fields in the extension are NULL as you can see here:
http://tools.gbif.org/dwca-reports/238-1674793054563698680.html
When we merge the extension data with the core we do not check for null values right now, so the core values get replaced with null:
https://github.com/gbif/crawler/blob/master/crawler-cli/src/main/java/org/gbif/crawler/dwca/fragmenter/StarRecordSerializer.java#L105
We are fixing this as I type and will reindex the dataset later today.
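For illustration, the intended merge behaviour can be sketched in a few lines of Python. This is not the StarRecordSerializer code (which is Java) and not the actual fix; the function name `merge_star_record`, the rule that None and blank strings count as null, and the sample values are assumptions.

```python
def merge_star_record(core: dict, extension: dict) -> dict:
    """Merge an extension record onto its core record, ignoring null values.

    The core record supplies the base values; an extension value only
    overrides them when it actually carries data.
    """
    merged = dict(core)
    for term, value in extension.items():
        if value is None or (isinstance(value, str) and not value.strip()):
            continue  # null or blank in the extension: keep the core value
        merged[term] = value
    return merged


# Hypothetical records: the extension maps scientificName but leaves it null.
core = {"taxonID": "4153-sp2", "scientificName": "Apanteles conanchetorum Viereck, 1917"}
extension = {"taxonID": "4153-sp2", "scientificName": None, "occurrenceID": "occ-1"}

merged = merge_star_record(core, extension)
print(merged["scientificName"])
```

Without the null check, the extension's empty scientificName would overwrite the core name, which matches the "backbone name null" failures seen in the logs.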
Author: rdmpage
Created: 2015-08-26 14:20:15.892
Updated: 2015-08-26 14:20:15.892
[~mdoering@gbif.org] Cool! So, if the NULL values are handled correctly, is the expectation that it will then use taxonID in the occurrences extension to link to the name in the core record, and voilà we get the records?
Author: trobertson@gbif.org
Created: 2015-08-26 14:46:49.407
Updated: 2015-08-26 14:46:49.407
With this commit and a release of crawler 0.20, the issue identified in this comment thread should be fixed. Please see
http://www.gbif.org/occurrence/search?DATASET_KEY=3a8a3458-675f-46f1-abbc-6e089059e5e8
and the increase of records in http://www.gbif.org/species/1268445 which now show on the map (there are still some commits pending to flush, but at the top level zooms they are there).
https://github.com/gbif/crawler/commit/5eb6e3040518099889fbef2e7bd5bb5b169b1ac7
This does not address the original issue though, which is still valid.
Author: rdmpage
Created: 2015-08-26 15:01:01.614
Updated: 2015-08-26 15:01:01.614
[~trobertson@gbif.org] [~mdoering@gbif.org] Great, thanks for getting this working!
Now, is this a good time to mention that some of these newly indexed records duplicate existing records? ;)
Seriously, do we have an issue/feature request for this anywhere? If not, I could write something up that documents the problem (e.g., http://www.gbif.org/occurrence/924981336 and http://www.gbif.org/occurrence/1058268610 ).