Issue 17858

Skip absence-data on indexing

17858
Reporter: kbraak
Type: Bug
Summary: Skip absence-data on indexing
Priority: Blocker
Resolution: Duplicate
Status: Closed
Created: 2015-10-02 21:37:06.686
Updated: 2017-03-02 15:53:05.938
Resolved: 2017-03-02 15:53:05.86
        
Description: "Our current index is not able to interpret absence-data... For the time being, a published record to us means occurrence, ergo presence, not absence" - DM-178

Publishers may be publishing occurrence records with absence-data in the following ways:
- [occurrenceStatus|http://rs.tdwg.org/dwc/terms/#occurrenceStatus]=absent
- [individualCount|http://rs.tdwg.org/dwc/terms/#individualCount]=0
- [organismQuantity|http://rs.tdwg.org/dwc/terms/#organismQuantity]=0

Until we can better interpret absence-data, we should add a rule to prevent absence-data from being indexed.
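
A minimal sketch of such a rule, assuming verbatim records arrive as plain dictionaries keyed by Darwin Core term names. The names `is_absence` and `records_to_index` are hypothetical and not taken from any GBIF codebase; negative or unparseable counts are left alone in this sketch.

```python
from typing import Iterable, List, Mapping, Optional


def _parse_count(value: Optional[str]) -> Optional[float]:
    """Return the numeric value of a verbatim count, or None if missing or not a number."""
    if value is None:
        return None
    try:
        return float(value.strip())
    except ValueError:
        return None


def is_absence(record: Mapping[str, Optional[str]]) -> bool:
    """True if the record signals absence in one of the three ways listed above."""
    status = (record.get("occurrenceStatus") or "").strip().lower()
    if status == "absent":
        return True
    # A count or quantity of exactly zero is treated as absence.
    for term in ("individualCount", "organismQuantity"):
        if _parse_count(record.get(term)) == 0:
            return True
    return False


def records_to_index(records: Iterable[Mapping[str, Optional[str]]]) -> List[Mapping[str, Optional[str]]]:
    """Keep only the records that do not signal absence."""
    return [r for r in records if not is_absence(r)]
```

Counting the records dropped by `records_to_index` would give the per-dataset number of skipped absences.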

    


Author: mdoering@gbif.org
Comment: It would be nice to keep at least a counter for these absent records at the dataset level that can be shown
Created: 2016-01-06 15:43:10.633
Updated: 2016-01-06 15:43:10.633


Author: peterdesmet
Comment: Here's a dataset to test it on: http://www.gbif.org/dataset/2b2bf993-fc91-4d29-ae0b-9940b97e3232
Created: 2016-01-22 14:24:26.132
Updated: 2016-01-22 14:24:26.132


Author: mblissett
Created: 2016-01-29 12:20:46.953
Updated: 2016-01-29 12:20:46.953
        
Another example:

The map for http://www.gbif.org/species/2479407 shows species all over Europe, when it should be limited to Spain.

The records come from the dataset http://www.gbif.org/dataset/c779b049-28f3-4daf-bbf4-0a40830819b6, which has 1339711 'occurrences'.  In fact, 983396 of them are absences.

It's obviously useful data, but until we handle it properly it needs to be hidden.  I'll look into it.

    


Author: mdoering@gbif.org
Created: 2016-01-29 13:27:56.083
Updated: 2016-01-29 13:27:56.083
        
Some food for thought:

These records are simple to spot during interpretation, but should we avoid creating a verbatim record for them in the first place? Or not even a fragment? Or should interpretation remove the verbatim record and/or fragment?

Alternatively we could keep all records including absent ones, interpret the OccurrenceStatus from the different options listed above and then avoid them in cubes & maps and offer an OccurrenceStatus filter in Solr. Maybe the cube should even just have another dimension present/absent.
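
A rough sketch of the second idea, assuming a cube is just a counter keyed by taxon and status. The `Status` enum here is a local stand-in with two values, not the GBIF API vocabulary, and `build_cube` and `map_count` are hypothetical names.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    PRESENT = "present"
    ABSENT = "absent"


def interpret_status(record):
    """Derive a status from occurrenceStatus, individualCount or organismQuantity."""
    status = (record.get("occurrenceStatus") or "").strip().lower()
    if status == "absent":
        return Status.ABSENT
    for term in ("individualCount", "organismQuantity"):
        raw = record.get(term)
        try:
            if raw is not None and float(raw) == 0:
                return Status.ABSENT
        except (TypeError, ValueError):
            pass  # unparseable counts do not change the status
    return Status.PRESENT


def build_cube(records):
    """Count records per (taxonKey, status), keeping absences as their own dimension."""
    cube = Counter()
    for record in records:
        cube[(record["taxonKey"], interpret_status(record))] += 1
    return cube


def map_count(cube, taxon_key):
    """Maps would read only the presence slice of the cube."""
    return cube[(taxon_key, Status.PRESENT)]
```

With this shape all records stay in the store, and maps simply ignore the absent slice.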
    


Author: peterdesmet
Created: 2016-02-05 10:12:25.454
Updated: 2016-02-05 10:12:25.454
        
I would strongly prefer the second option, for 4 reasons:

1) Those records are useful (e.g. for modelling).

2) This is a representation issue. One could have them in the occurrence store, but remove them from maps (like how non-georeferenced records are not shown), or as you suggest: create a filter/dimension for them and maybe show them on maps in a different colour.

3) If absent records are removed during interpretation, it becomes much more difficult for the publisher to assess why the number of records in their source is different from the number of records presented on GBIF. That difference should indicate an issue in the process (e.g. records removed at the source are not deleted by GBIF, ...), not the fact that valid absent records were ignored.

4) Absent records are going to become more ubiquitous, especially if GBIF can handle them nicely.
    


Author: kbraak@gbif.org
Created: 2016-02-05 19:15:09.391
Updated: 2016-02-05 19:15:09.391
        
Short-term, GBIF has decided to go with the first option meaning we will exclude absence data from the index. Ideally, we'd still be able to show the number of absence records skipped though (to account for the difference in the number of records presented on GBIF).

Long-term we'll be aiming to implement the second option, and handle absence data nicely. 
    


Author: mdoering@gbif.org
Created: 2016-02-08 11:10:39.585
Updated: 2016-02-08 11:10:39.585
        
[~kbraak@gbif.org], I know we aimed for that in the short term, but looking into the implementation details I seriously wonder whether it is much less work than doing it the other way. We should have a detailed discussion about this before we start any implementation.

But we should get something done rather soon; this is a pretty prominent problem.
    


Author: mdoering@gbif.org
Created: 2016-02-08 12:18:23.907
Updated: 2016-02-08 12:18:36.07
        
As for counts shown in the portal I think we should:

 - show all counts on http://www.gbif.org/
 - show all & georeferenced presence count on http://www.gbif.org/occurrence/
 - use presence only counts for all the metrics below the main search block on http://www.gbif.org/occurrence/
 - show all counts in dataset details
 - show detailed breakdown by presence & absence on dataset stats page
 - include all records in a default search, but allow an occurrenceStatus enum based filter
 - occurrenceStatus enum currently has 7 values which we might want to reduce to just present, absent, doubtful: https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/vocabulary/OccurrenceStatus.java#L23

    


Author: rdmpage
Comment: Just added links to two recent feedback issues that are due to absence being interpreted as presence [PF-2349] and [PF-2350]. Was about to go WTF!? when I recalled this discussion. Users are noticing that pretty much any European bird distribution in GBIF is now gibberish.
Created: 2016-02-10 00:16:24.912
Updated: 2016-02-10 00:16:24.912


Author: rdmpage
Comment: Oh, and I think the Atlantic seabird record issue with dataset http://www.gbif.org/dataset/5a12d9c3-9465-4107-bb39-2cb451e6cef6 [DM-281] is another example of this. The "occurrenceStatus" field has values such as "null", "absent", and "present", all of which seem to be interpreted as presence, and so the Atlantic is now full of erroneous seabird records :(
Created: 2016-02-10 00:28:52.366
Updated: 2016-02-10 00:28:52.366


Author: mblissett
Created: 2016-02-10 12:34:53.892
Updated: 2016-02-10 12:34:53.892
        
The reporter of the problem with Limicola falcinellus has replied to me, and noticed that since they're using GBIF maps they have the wrong data on their portal: https://www.beachexplorer.org/arten/recurvirostra-avosetta/verbreitung

Broken / confusing counts seem much less of a problem than thousands of absence records, especially on the big bird datasets.  Maybe we should throw them out really soon.
    


Author: mblissett
Created: 2016-02-10 18:10:15.597
Updated: 2016-02-10 18:10:39.991
        
Adding some numbers to this, so we can decide what short-term thing we should do:

Four datasets use occurrenceStatus=absent:
c779b049-28f3-4daf-bbf4-0a40830819b6
2b2bf993-fc91-4d29-ae0b-9940b97e3232
5a12d9c3-9465-4107-bb39-2cb451e6cef6
ba0c046d-52bb-4262-a495-652988c9f3f7
accounting for 1 million records.  983k of these are from the first (European breeding birds), 15k from the second (Zomerganzen - Summering geese...)

Irregular, doubtful and rare are used by four datasets, for just 1723 occurrences, 1617 of them doubtful in d6cc311c-c5ab-4f23-9a20-10514f9eb9c4.

There are a few other values, but they don't account for many occurrences.

For individualCount, there are about 10k values of -1 in 169fa761-2fb9-4022-93bd-e22b7a062efd, and less than 250 other negative values.

181 datasets use a value of 0, with Artdata 38b4c89f-584c-41bb-bd8f-cd1def33e92f having 13M, the European breeding birds having its 983k, and c779b049-28f3-4daf-bbf4-0a40830819b6 (863k), 197908d0-5565-11d8-b290-b8a03c50a862 (472k), e5080aa7-f479-41f0-bb89-278feafd3cda (118k). It's probably 13M + 983k plus about 2 million from the other datasets.


    


Author: rdmpage
Comment: Another issue is what do people do with the data when it's downloaded? Even if, say, GBIF hides absences on the maps, if that data is in the download then we are assuming that users are sufficiently savvy to handle absences correctly. If not, this could have dire consequences for naive efforts to model distributions (e.g., seabirds in the Alps).
Created: 2016-02-10 19:39:58.973
Updated: 2016-02-10 19:39:58.973


Author: mdoering@gbif.org
Comment: For downloads and similarly for API search/download calls for example via the R package we should simply default to exclude absence data, i.e. set an occurrenceStatus filter by default to include only presence. This way someone will only see absence data when they actively request it by changing the filter.
Created: 2016-02-10 21:16:21.99
Updated: 2016-02-10 21:16:21.99
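
The default-to-presence behaviour suggested in the comment above could look roughly like this. It is a sketch only: it assumes each record already carries an interpreted `status` field, and `with_default_status` and `search` are hypothetical helpers, not real GBIF API calls.

```python
from typing import Dict, List, Optional


def with_default_status(params: Optional[Dict[str, List[str]]] = None) -> Dict[str, List[str]]:
    """Copy the query parameters and add occurrenceStatus=present unless the caller set it."""
    merged = {key: list(values) for key, values in (params or {}).items()}
    merged.setdefault("occurrenceStatus", ["present"])
    return merged


def search(records, params=None):
    """Return the records whose interpreted status matches the (possibly defaulted) filter."""
    query = with_default_status(params)
    wanted = {status.lower() for status in query["occurrenceStatus"]}
    # Records without an interpreted status are assumed to be presences.
    return [r for r in records if r.get("status", "present") in wanted]
```

A caller who passes no filter gets presences only; absences appear only when `occurrenceStatus` is set explicitly, for example to `["present", "absent"]`.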


Author: thirsch@gbif.org
Comment: This is repeating points already made in this thread, but just to demonstrate that the issue is being noticed, a prominent user of GBIF has pointed out the misleading occurrence points given for the Bearded Vulture (Gypaetus barbatus) due to absence records being included from http://www.gbif.org/dataset/c779b049-28f3-4daf-bbf4-0a40830819b6. I would therefore second Matthew's view that a short term fix for this would be highly desirable.
Created: 2016-02-12 09:51:14.681
Updated: 2016-02-12 09:51:14.681


Author: mdoering@gbif.org
Created: 2016-02-12 10:21:11.564
Updated: 2016-02-12 10:21:11.564
        
I was told by the BGBM folks that ABCD does not have a standard way of publishing absence data. There is a chance the generic measurements section in ABCD contains counts=0 in abcd:DataSets/DataSet/Units/Unit/Gathering/SiteMeasurementsOrFacts/SiteMeasurementOrFact/MeasurementOrFactAtomised/LowerValue

Nothing about counts in the current concept survey: http://www.biocase.org/whats_biocase/concept_survey.cgi
    


Author: rdmpage
Created: 2016-02-12 11:09:14.391
Updated: 2016-02-12 11:09:14.391
        
Once this problem is resolved it would be nice to think about ways to automatically catch these sorts of problems, rather than wait for users to scratch their heads and wonder why GBIF is producing strange maps.

The user reports are of the form "species x occurs here, GBIF has records outside that range, so there's a problem". Surely we could automate this by having a set of test cases (e.g., well-known species) with polygons representing their range, and every so often (say whenever new data sets are added) we run those tests checking for problems. Given that most absence data is likely to come from well-sampled regions (e.g., Europe) and include well-known taxa for which we have good distributional data (e.g., red list taxa) I'd imagine we could catch a lot of these sorts of problems quickly. The data sets we're discussing here contain high visibility taxa (e.g., birds) that would make simple test cases.
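
A sketch of what one such test could look like, assuming the range is a single simple polygon of (longitude, latitude) vertices and using a plain planar ray-casting check. The function names and the 5% tolerance are made up for illustration; real ranges would need proper geodetic handling (multipolygons, the antimeridian and so on).

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray going east from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def fraction_outside_range(points, range_polygon):
    """Share of occurrence points that fall outside the expected range."""
    if not points:
        return 0.0
    outside = sum(1 for lon, lat in points if not point_in_polygon(lon, lat, range_polygon))
    return outside / len(points)


def check_species(points, range_polygon, max_outside=0.05):
    """Pass if at most max_outside of the points lie outside the range polygon."""
    return fraction_outside_range(points, range_polygon) <= max_outside
```

Run after each ingestion for a handful of well-known taxa, a failing `check_species` would flag a dataset for a closer look before users notice the map.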

More generally, I'm struck by how the testing culture in software development hasn't made it into the data handling world. Obviously there are different challenges, but I think we could do a lot more in this area. See for example this talk https://speakerdeck.com/arfon/predicting-the-future-of-publishing by Arfon Smith
    


Author: thirsch@gbif.org
Comment: Having discussed this with [~Donald Hobern], we feel that until we can deploy even the shorter-term solution on dealing with absences, we should immediately hide the records from http://www.gbif.org/dataset/c779b049-28f3-4daf-bbf4-0a40830819b6 , since this single dataset seems primarily responsible for the misleading maps relating to bird occurrences in Europe. Then once the fix is in place we can reinstate that dataset so that only the presence records show up. 
Created: 2016-02-12 13:56:38.89
Updated: 2016-02-12 13:56:38.89


Author: mblissett
Created: 2016-02-12 16:43:15.782
Updated: 2016-02-12 16:43:15.782
        
Rather than removing the whole EBBA dataset, I removed all absence records from it.

The map for *Gypaetus barbatus* is now reasonable: http://www.gbif.org/species/2480649

(I haven't deleted the 13M absence records in Artdata.)
    


Author: thirsch@gbif.org
Comment: Wow that was quick! Great thanks.
Created: 2016-02-12 17:07:24.781
Updated: 2016-02-12 17:07:24.781


Author: kbraak@gbif.org
Comment: At the [8th EU Nodes workshop/meeting in Lisbon 18-21 April 2016|http://www.gbif.pt/EuropeanNodesMeeting] several participants reported that they noticed absence data being shown on GBIF.org. In the context of publishing sample event data, I instructed them to advise publishers in their network not to publish absence data through GBIF.org until we can handle it properly.
Created: 2016-05-02 16:11:22.878
Updated: 2016-05-02 16:11:22.878


Author: kbraak@gbif.org
Comment: Transferred to https://github.com/gbif/portal16/issues/308 Closing issue. 
Created: 2017-03-02 15:53:05.914
Updated: 2017-03-02 15:53:05.914