Issue 15503

Occurrence data from synonyms is not merged correctly

Reporter: rdmpage
Type: Bug
Summary: Occurrence data from synonyms is not merged correctly
Priority: Critical
Status: Open
Created: 2014-04-07 19:21:14.832
Updated: 2017-03-15 09:48:05.297
        
Description: I'm finding cases where the occurrence data linked to different names that GBIF recognises as synonyms is not merged together. Regardless of name used for a taxon, surely I should get the same occurrence data?

For example, consider these two frog names:
http://www.gbif.org/species/2423028 Amietophrynus funereus (Bocage, 1866)
http://www.gbif.org/species/5217104 Bufo funereus Bocage, 1866

These names are objective synonyms, as the species pages state. But the number of occurrences on the two pages is different! If I download the Amietophrynus funereus data I get 159 georeferenced occurrences; if I download those for Bufo funereus I get only 153. The difference is the six occurrences labelled "Amietophrynus funereus" (ids 70123925, 70125415, 70125468, 242065371, 242065377, 242065408), which are listed in the results for Amietophrynus funereus alongside the 153 labelled "Bufo funereus". For Bufo funereus I get only the 153 labelled with that name, not the six additional occurrences labelled with the other name.

I'm assuming that this is a bug, because regardless of the name used the number of occurrences for a taxon should be the same.
    


Author: mdoering@gbif.org
Created: 2014-11-19 11:49:25.437
Updated: 2014-11-19 11:49:25.437
        
This is a long-standing issue we never resolved. Currently we match occurrences to synonym records in our backbone and allow occurrence searches to hit the expanded synonymy. So if you are on an accepted taxon page you get a link to see and download all occurrences, including those for all known synonyms. But if you are on a synonym's page you get only the occurrences directly linked to that synonym name. This is also true for metrics and the maps.

I never felt good about this and would have liked to see us just using the accepted taxon everywhere. And on an occurrence details page we should show the exact verbatim name as it was identified plus the accepted name it matches in the GBIF Backbone.

Here are more examples, based on Aiphanes horrida and its synonyms.
You can search for each synonym name separately, without hitting the accepted name or the other synonyms:
 - http://www.gbif.org/occurrence/search?taxon_key=5294694
 - http://www.gbif.org/occurrence/search?taxon_key=2738649
 - http://www.gbif.org/occurrence/search?taxon_key=2738639 (accepted)

Maps and occurrence metrics exist for the synonym alone:
 - http://www.gbif.org/species/5294694
 - http://www.gbif.org/species/2738649
 - http://www.gbif.org/species/2738639 (accepted)
    


Author: mdoering@gbif.org
Comment: If all occurrences are only ever matched to accepted backbone taxa, a Solr search on the verbatim scientific name as supplied by the publisher would be highly desirable. I have therefore linked the existing issue POR-2069.
Created: 2014-12-01 12:32:33.443
Updated: 2014-12-01 12:32:33.443


Author: trobertson@gbif.org
Created: 2015-08-26 12:37:02.432
Updated: 2015-08-26 12:37:02.432
        
Doing so would completely remove the ability to find any occurrences that are identified by what we consider non-accepted names (we have no full text search on records).

I am not disputing the issue, but is losing the ability to find, e.g., records identified as "Bufo funereus" acceptable?

CC [~Donald Hobern] and [~ahahn@gbif.org] for comments as the decision for current behaviour dates back to design discussions in 2006
    


Author: mdoering@gbif.org
Created: 2015-08-26 12:40:37.656
Updated: 2015-08-26 12:40:37.656
        
I think searching occurrences by a synonym name (sensu GBIF backbone) should return the same result as searching for the accepted name. The same goes for maps: they should not differ.

That does not necessarily mean we always need to match an occurrence record to the accepted backbone usage. We could still match it to the synonym. But for searches and maps we should treat the names as synonymous.
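
A minimal sketch of that idea, assuming a hypothetical `ACCEPTED_OF` lookup from synonym usage key to accepted usage key (the two keys are taken from the species URLs in the description; the record layout is invented for illustration). Records stay matched to whichever usage they were identified with, and the synonymy is expanded only at query time:

```python
# Hypothetical backbone lookup: synonym usage key -> accepted usage key.
# Accepted usages do not appear as keys in this mapping.
ACCEPTED_OF = {5217104: 2423028}  # Bufo funereus -> Amietophrynus funereus


def expand_taxon_key(taxon_key, accepted_of=ACCEPTED_OF):
    """Return the accepted key plus every synonym key that points at it."""
    accepted = accepted_of.get(taxon_key, taxon_key)
    synonyms = {key for key, target in accepted_of.items() if target == accepted}
    return {accepted} | synonyms


def search(records, taxon_key):
    """records: iterable of (occurrence_id, matched_usage_key) pairs."""
    keys = expand_taxon_key(taxon_key)
    return sorted(occ_id for occ_id, usage_key in records if usage_key in keys)


# Illustrative records only: two matched to the synonym, one to the accepted name.
records = [(1, 5217104), (2, 5217104), (3, 2423028)]

by_synonym = search(records, 5217104)
by_accepted = search(records, 2423028)
print(by_synonym == by_accepted)  # True
```

Under these assumptions either key yields the same result set, while the stored match still records which name was actually used.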
    


Author: rdmpage
Comment: [~trobertson@gbif.org] I don't know enough about GBIF's internal design, but isn't [~mdoering@gbif.org] suggesting adding the verbatim scientific name to the search index, so we'd still be able to search on "Bufo funereus" and get hits?
Created: 2015-08-26 12:41:14.975
Updated: 2015-08-26 12:41:14.975


Author: mdoering@gbif.org
Comment: A simpler solution would be to prompt users strongly, whenever they search for a synonym, to search for the accepted name X instead in order to get all results. This gets tricky when searching for many names, especially via the API with no human interaction.
Created: 2015-08-26 12:43:15.351
Updated: 2015-08-26 12:43:15.351


Author: trobertson@gbif.org
Created: 2015-08-26 12:53:39.14
Updated: 2015-08-26 12:53:39.14
        
To explain the behaviour at the moment, a record of Bufo funereus gets identified as:

[I used names here for readability, but in fact we store IDs]

scientificName: Bufo funereus (stored as nubKey)
species: Amietophrynus funereus (stored as speciesKey)

We also know Bufo funereus to be a synonym of Amietophrynus funereus

Thus a search for Amietophrynus funereus includes records identified using synonymous names, because those records have species=Amietophrynus funereus, but a search for Bufo funereus does not return the records identified as Amietophrynus funereus.

It comes down to the fact that our search effectively does
{code}
  WHERE kingdomKey=? OR phylumKey=? OR .... speciesKey=? OR nubKey=?
{code}

If you do an explicit search of nubKey="Amietophrynus funereus" it would not include the synonyms either.
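
The asymmetry can be reproduced with a small model. This is a sketch only: the two keys come from the species URLs in the description, the `Occurrence` layout is a simplified assumption rather than the real index schema, and the higher-rank keys from the WHERE clause are left out.

```python
from dataclasses import dataclass

AMIETOPHRYNUS_FUNEREUS = 2423028  # accepted name
BUFO_FUNEREUS = 5217104           # synonym


@dataclass(frozen=True)
class Occurrence:
    occurrence_id: int
    nub_key: int      # backbone usage the verbatim name was matched to
    species_key: int  # accepted species the record rolls up to


def search(records, taxon_key):
    """Mimic `speciesKey=? OR nubKey=?` for a single taxon key."""
    return [r for r in records if taxon_key in (r.species_key, r.nub_key)]


# Illustrative records: two identified as the synonym, one as the accepted name.
records = [
    Occurrence(1, BUFO_FUNEREUS, AMIETOPHRYNUS_FUNEREUS),
    Occurrence(2, BUFO_FUNEREUS, AMIETOPHRYNUS_FUNEREUS),
    Occurrence(3, AMIETOPHRYNUS_FUNEREUS, AMIETOPHRYNUS_FUNEREUS),
]

by_accepted = search(records, AMIETOPHRYNUS_FUNEREUS)
by_synonym = search(records, BUFO_FUNEREUS)
print(len(by_accepted), len(by_synonym))  # 3 2
```

The accepted key matches every record through `species_key`, whereas the synonym key can only match through `nub_key`, which is why the synonym search misses record 3.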



    


Author: donald hobern
Created: 2015-08-26 13:11:55.453
Updated: 2015-08-26 13:11:55.453
        
Issue is that with CoL model of "Name --> Accepted Name", we don't know whether the relationship is as bidirectional as in this case.  We know that Felis concolor == Puma concolor (at least as far as the names themselves go), but this is not true where two species are combined.  All we know then is that anything associated with any of the old species is included in the new species, NOT the other way round.
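
A minimal sketch of that one-way inclusion, using made-up keys (10 for the accepted species, 20 and 30 for two formerly separate species now listed as its synonyms) and an invented record layout:

```python
# Hypothetical keys: species 20 and 30 were merged into accepted species 10.
ACCEPTED_OF = {20: 10, 30: 10}

# (occurrence_id, usage key the record was identified with)
records = [(1, 10), (2, 20), (3, 30)]


def labelled_with(key):
    """Occurrences identified with exactly this usage."""
    return {occ_id for occ_id, usage in records if usage == key}


def accepted_concept(key):
    """Occurrences whose usage resolves to the same accepted taxon as `key`."""
    accepted = ACCEPTED_OF.get(key, key)
    return {occ_id for occ_id, usage in records
            if ACCEPTED_OF.get(usage, usage) == accepted}


old_species = labelled_with(20)
merged = accepted_concept(20)
print(old_species <= merged)  # True: the old species is included in the new one
print(merged - old_species)   # {1, 3}: records never identified as species 20
```

Whether a search for key 20 should also return records 1 and 3 is exactly the question that the plain "Name --> Accepted Name" link cannot answer by itself.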

To do what Rod wants, we would need much more semantic information around each synonym.
    


Author: mdoering@gbif.org
Created: 2015-08-26 13:15:47.935
Updated: 2015-08-26 13:16:32.638
        
As we rely on speciesKey to also hit synonyms of an accepted species, this is broken for infraspecific synonyms that have an infraspecific accepted name.

For example for the bird synonym "Acanthis flammea holboellii" we have a single occurrence record in Asia:
http://www.gbif.org/species/5231633

There are over 56,000 occurrences for the accepted subspecies "Carduelis flammea flammea", so it's hard to verify whether they include the synonym record:
http://www.gbif.org/species/6175375

If you zoom into the area of the synonym and search for the accepted name, you see that the search does not return that record:
http://www.gbif.org/occurrence/search?TAXON_KEY=6175375&HAS_GEOSPATIAL_ISSUE=false&GEOMETRY=76+44%2C76+66%2C132+66%2C132+44%2C76+44

The same geometry with the synonym taxonKey does return the record:
http://www.gbif.org/occurrence/search?TAXON_KEY=5231633&HAS_GEOSPATIAL_ISSUE=false&GEOMETRY=76+44%2C76+66%2C132+66%2C132+44%2C76+44
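
The gap can be sketched as follows. The two subspecies keys come from the URLs above; `PARENT_SPECIES` is a made-up placeholder key, and the `accepted_key` field is a hypothetical addition for illustration, not something the current index is stated to store.

```python
from dataclasses import dataclass

SYNONYM_SUBSP = 5231633   # "Acanthis flammea holboellii" (synonym)
ACCEPTED_SUBSP = 6175375  # "Carduelis flammea flammea" (accepted)
PARENT_SPECIES = 999      # placeholder key for the parent species (made up)


@dataclass(frozen=True)
class Occurrence:
    occurrence_id: int
    nub_key: int       # usage the verbatim name was matched to
    species_key: int   # accepted species the record rolls up to
    accepted_key: int  # hypothetical: accepted usage at the record's own rank


def search_current(records, key):
    """Match on the species-level key or the matched usage only."""
    return [r.occurrence_id for r in records
            if key in (r.species_key, r.nub_key)]


def search_with_accepted(records, key):
    """Same search, additionally matching the hypothetical accepted_key."""
    return [r.occurrence_id for r in records
            if key in (r.species_key, r.nub_key, r.accepted_key)]


records = [
    Occurrence(1, SYNONYM_SUBSP, PARENT_SPECIES, ACCEPTED_SUBSP),
    Occurrence(2, ACCEPTED_SUBSP, PARENT_SPECIES, ACCEPTED_SUBSP),
]

print(search_current(records, ACCEPTED_SUBSP))        # [2]
print(search_with_accepted(records, ACCEPTED_SUBSP))  # [1, 2]
```

Because the species-level key sits one rank above both subspecies, it cannot link the synonym record to the accepted subspecies; some rank-independent link to the accepted usage would be needed.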


    


Author: mdoering@gbif.org
Created: 2015-08-26 13:36:46.21
Updated: 2015-08-26 13:36:46.21
        
[~Donald Hobern] your example suggests we need to know taxon concepts and how they relate. But I'm not convinced this is what everyone would expect. If the formerly accepted and now merged name is now a synonym in our backbone, why should a search based on that name return results according to the former accepted concept? It is tricky for pro parte synonyms (names that got split, not merged), but luckily they are still rare and I believe we should not worry about them too much right now. For plain synonyms we should treat them simply as a different label to the currently accepted concept.

Just for reference, the GBIF backbone knows about unspecific, heterotypic, homotypic and pro parte synonyms: http://gbif.github.io/gbif-api/apidocs/org/gbif/api/vocabulary/TaxonomicStatus.html
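
One way to express "treat plain synonyms as labels, leave pro parte aside for now" is sketched here. `SynonymKind` is a local stand-in for the four synonym kinds named above, not the GBIF API enum itself, and all usage keys are made up.

```python
from enum import Enum


class SynonymKind(Enum):
    # Local stand-in for the synonym kinds named above (not the GBIF API enum).
    UNSPECIFIC = "synonym"
    HETEROTYPIC = "heterotypic synonym"
    HOMOTYPIC = "homotypic synonym"
    PRO_PARTE = "pro parte synonym"


def is_plain_synonym(kind):
    """True when the synonym is simply another label for the accepted concept."""
    return kind is not SynonymKind.PRO_PARTE


def searchable_keys(accepted_key, synonyms):
    """synonyms: mapping of synonym usage key -> SynonymKind."""
    keys = {accepted_key}
    keys.update(key for key, kind in synonyms.items() if is_plain_synonym(kind))
    return keys


# Made-up keys: 101 is a homotypic synonym of 100, 102 a pro parte synonym.
synonyms = {101: SynonymKind.HOMOTYPIC, 102: SynonymKind.PRO_PARTE}
print(sorted(searchable_keys(100, synonyms)))  # [100, 101]
```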

Obviously we do not track taxon concepts properly, and we do not distinguish between uses of the same name before and after it was split, for example. It would also be practically impossible to match the simple occurrence information we receive to such concepts. That needs much richer occurrence data and will be a long journey.
    


Author: rdmpage
Created: 2015-08-26 13:36:49.11
Updated: 2015-08-26 13:36:49.11
        
Maybe one approach is to separate names and taxa, in the sense that a search for a given name (text string) returns two sets of occurrences: one being those occurrences expressly labelled with that name, the other being how GBIF interprets that name (i.e., the taxon concept according to the GBIF backbone). If these sets are not the same, then the map for that search could show both sets using different colours. I could imagine all sorts of interesting possibilities for discovery that might result from this, such as subspecies that GBIF regards as synonyms of a species appearing in a subset of the species range.

Perhaps for a search on an accepted name, the default is to show the range of accepted name and all synonyms in same colour, but then have an option to colour occurrences by verbatim name on occurrence. This could actually be pretty cool and useful.
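
A rough sketch of the two result sets and the colour-by-name grouping, assuming records are simple `(occurrence_id, verbatim_name)` pairs and a hypothetical `accepted_name_of` lookup (both invented for illustration):

```python
from collections import defaultdict


def split_by_label(records, name, accepted_name_of):
    """Return (ids labelled with `name`, ids in the taxon `name` resolves to)."""
    accepted = accepted_name_of.get(name, name)
    labelled = {occ_id for occ_id, verbatim in records if verbatim == name}
    interpreted = {occ_id for occ_id, verbatim in records
                   if accepted_name_of.get(verbatim, verbatim) == accepted}
    return labelled, interpreted


def group_by_verbatim(records, ids):
    """Group the selected ids by verbatim name, e.g. one map colour per name."""
    groups = defaultdict(list)
    for occ_id, verbatim in records:
        if occ_id in ids:
            groups[verbatim].append(occ_id)
    return dict(groups)


accepted_name_of = {"Bufo funereus": "Amietophrynus funereus"}
records = [
    (1, "Bufo funereus"),
    (2, "Bufo funereus"),
    (3, "Amietophrynus funereus"),
]

labelled, interpreted = split_by_label(records, "Bufo funereus", accepted_name_of)
print(sorted(labelled), sorted(interpreted))  # [1, 2] [1, 2, 3]
print(group_by_verbatim(records, interpreted))
```

The difference `interpreted - labelled` is the set a map could draw in a second colour.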
    


Author: donald hobern
Created: 2015-08-26 14:00:42.747
Updated: 2015-08-26 14:00:42.747
        
Don't think I was suggesting returning names according to the old narrow concept.  Splits are a much bigger problem for us.

I recognise that we have a vocabulary for status, but at this stage it is impossible to be certain of the status in the majority of cases.