Issue 13584

Merge species by their epithet and authorship

Reporter: mdoering
Assignee: mdoering
Type: Improvement
Summary: Merge species by their epithet and authorship
Priority: Critical
Resolution: Fixed
Status: Resolved
Created: 2013-08-14 17:17:27.849
Updated: 2016-07-04 14:16:41.284
Resolved: 2015-09-18 20:30:56.094
        
Description: To avoid creating multiple accepted taxa based on the same holotype we should try to detect original name relations based on the authorship and the species epithet. Try to spot recombinations of the same original name within a family by looking at the epithet and authorship & year.

See http://iphylo.blogspot.de/2013/08/cluster-maps-papaya-plots-and-trouble.html#disqus_thread
and http://gist.neo4j.org/?a91e351279438d2ec1e6
    

Attachment CDM_Agoseris_apargioides_synonymy.png


Attachment CDM_Agoseris_grandiflora_synonymy.png


Attachment CDM_Lactuca_aurea_synonymy.png

Attachment asteraceae.txt
Attachment aves.txt
Attachment curculionidae.txt
Attachment molossidae.txt
Attachment muridae.txt


Author: mdoering@gbif.org
Created: 2015-08-25 22:08:44.366
Updated: 2015-08-25 22:33:47.254
        
I am exploring the largest families in plants, Asteraceae, and animals, Curculionidae. A file for each family is attached which lists all species or infraspecific names in the current GBIF backbone that share the same terminal epithet. At first glance it appears safe to assert a basionym relation if the basionym author of a recombination is the same as the primary author of a name with the same epithet.
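A minimal sketch of that rule in Python, for illustration only. The helper names (`split_authorship`, `is_recombination_of`) and the naive regular expression are assumptions and not taken from the backbone code; the caller is assumed to have already matched the two names on their terminal epithet.

```python
import re
from typing import NamedTuple, Optional


class Authorship(NamedTuple):
    bracket: Optional[str]  # author(s) in parentheses, i.e. the basionym author
    primary: Optional[str]  # author(s) outside parentheses


# Naive pattern: the first parenthesised part is taken as the basionym authorship.
_BRACKET = re.compile(r"\(([^()]+)\)")


def split_authorship(authorship: str) -> Authorship:
    """Split an authorship string into its bracket and primary parts."""
    m = _BRACKET.search(authorship)
    if m:
        rest = (authorship[:m.start()] + authorship[m.end():]).strip()
        return Authorship(m.group(1).strip(), rest or None)
    return Authorship(None, authorship.strip() or None)


def is_recombination_of(recomb_authorship: str, original_authorship: str) -> bool:
    """True if the bracket author of one name equals the primary author of the other.

    Plain string equality is used here; real data needs fuzzier author matching.
    """
    recomb = split_authorship(recomb_authorship)
    original = split_authorship(original_authorship)
    return (
        recomb.bracket is not None
        and original.bracket is None
        and recomb.bracket == original.primary
    )
```

For example, `is_recombination_of("(Thomas, 1909)", "Thomas, 1909")` holds, while two names that both lack a bracket authorship never qualify.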

In addition I am attaching a family file for two mammal families, the largest rodent family Muridae and the bat family used by Rod in his gists above, Molossidae:
http://www.gbif-uat.org/species/5510
http://www.gbif-uat.org/species/5719

For reference the SQL executed:
{noformat}
\copy (select coalesce(infra_specific_epithet,specific_epithet) as epithet, string_agg(scientific_name, '|' order by bracket_authorship, scientific_name) from name_usage u join name n on name_fk=n.id where dataset_key='d7dddbf4-2cf0-4f39-9b2a-bb099caae36c' and specific_epithet is not null and family_fk=5719 group by epithet having count(*) > 1) to 'molossidae.txt';
{noformat}
    


Author: mdoering@gbif.org
Created: 2015-08-25 22:24:55.268
Updated: 2015-08-25 22:24:55.268
        
There are many cases in zoological names where the actual basionym (protonym) is not known. But we can assert that several names all refer to the same unknown protonym, and therefore share the type specimen and should be synonymous.

For example:
{quote}
Chaerephon bemmeleni (Jentink, 1879)
Tadarida bemmeleni (Jentink, 1879)
Chaerophon bemmeleni
{quote}

{quote}
Chaerephon bivittata (Heuglin, 1861)
Tadarida bivittata (Heuglin, 1861)
Chaerophon bivittata
{quote}

In this case we will create a temporary placeholder basionym during the nub build, which will be removed before it is exported to postgres (because we do not know the name).
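The placeholder idea could be sketched roughly as follows. This is a hedged Python illustration, not the nub build code: the `Name` class, its fields, the `"? epithet author"` label and both function names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Name:
    scientific_name: str
    epithet: str
    bracket_author: Optional[str] = None  # basionym authorship in parentheses
    primary_author: Optional[str] = None  # authorship outside parentheses
    placeholder: bool = False             # True for temporary, unnamed basionyms


def find_or_create_basionym(recombinations: List[Name], candidates: List[Name]) -> Name:
    """Return the known original name, or a temporary placeholder if none exists."""
    epithet = recombinations[0].epithet
    author = recombinations[0].bracket_author
    for c in candidates:
        if c.epithet == epithet and c.bracket_author is None and c.primary_author == author:
            return c
    # The protonym is unknown: invent a stand-in so the recombinations can be linked.
    return Name(f"? {epithet} {author}", epithet, primary_author=author, placeholder=True)


def drop_placeholders(names: List[Name]) -> List[Name]:
    """Remove temporary basionyms again, e.g. before an export."""
    return [n for n in names if not n.placeholder]


recombs = [
    Name("Chaerephon bemmeleni (Jentink, 1879)", "bemmeleni", bracket_author="Jentink, 1879"),
    Name("Tadarida bemmeleni (Jentink, 1879)", "bemmeleni", bracket_author="Jentink, 1879"),
]
others = [Name("Chaerophon bemmeleni", "bemmeleni")]  # no authorship, cannot be the original
basionym = find_or_create_basionym(recombs, others)
```

Here `basionym.placeholder` is true, and `drop_placeholders(recombs + [basionym])` returns only the two real names.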


Other cases are crystal clear:
{quote}
Zyzomys woodwardi (Thomas, 1909)
Laomys woodwardi Thomas, 1909
{quote}

{quote}
Mesocricetus raddei (Nehring, 1894)
Cricetus nigricans subsp. raddei
Cricetus raddei Nehring, 1894
{quote}

{quote}
Peromyscus polionotus (Wagner, 1843)
Mus polionotus Wagner, 1843
{quote}

In the following group only the Eversmann group should be created, i.e. Microtus obscurus and Mus obscurus:
{quote}
Microtus obscurus (Eversmann, 1841)
Cricetulus obscurus (Milne-Edwards, 1867)
Bolomys obscurus (Waterhouse, 1837)
Mus obscurus Eversmann, 1841
Praomys obscurus Hutterer & Dieterlen, 1992
{quote}
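To make the expected outcome for the obscurus names concrete, here is a small Python sketch that buckets names sharing an epithet by their original authorship and keeps only buckets with more than one member. It assumes zoological-style authorship strings and exact string matches. The function names are illustrative, and the sketch does not guard against two distinct original names that happen to carry identical authorship.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def original_authorship(authorship: str) -> str:
    """Return the bracket authorship if present, else the authorship itself."""
    m = re.search(r"\(([^()]+)\)", authorship)
    return (m.group(1) if m else authorship).strip()


def basionym_groups(names: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (canonical name, authorship) pairs that share one epithet."""
    groups = defaultdict(list)
    for canonical, authorship in names:
        if authorship:  # names without authorship cannot be placed safely
            groups[original_authorship(authorship)].append(canonical)
    # A group only makes sense with at least two names in it.
    return {author: members for author, members in groups.items() if len(members) > 1}


OBSCURUS = [
    ("Microtus obscurus", "(Eversmann, 1841)"),
    ("Cricetulus obscurus", "(Milne-Edwards, 1867)"),
    ("Bolomys obscurus", "(Waterhouse, 1837)"),
    ("Mus obscurus", "Eversmann, 1841"),
    ("Praomys obscurus", "Hutterer & Dieterlen, 1992"),
]

print(basionym_groups(OBSCURUS))
```

Only the Eversmann bucket survives the size filter; the other three authorships each occur once and form no group.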

    


Author: mdoering@gbif.org
Created: 2015-08-25 22:31:44.203
Updated: 2015-08-25 22:31:53.343
        
An example from PESI which contains a basionym group with different (infraspecific) ranks:
http://www.eu-nomen.eu/portal/taxon.php?GUID=74D5F715-C3BB-4273-A7EE-88B9967C912C

{quote}
Centaurea phrygia subsp. abbreviata (K. Koch) Dostál   [ACCEPTED]
Centaurea salicifolia subsp. abbreviata K. Koch   [BASIONYM]
Centaurea abbreviata (K. Koch) Hand.-Mazz.
Jacea abbreviata (K. Koch) Soják
Centaurea phrygia subsp. abbreviata (K. Koch) Dostál
{quote}

    


Author: mdoering@gbif.org
Created: 2015-08-25 22:51:20.149
Updated: 2015-08-25 22:51:54.892
        
Looking at all names in class Aves (which has seen many name changes, and for which the GBIF backbone is built on competing source classifications), one can see that basionyms do occur across families. See the Ridgway, 1893 names:

{noformat}
abbotti

Muscicapidae
 Copsychus malabaricus subsp. abbotti
 Luscinia svecica subsp. abbotti

Nectariniidae
 Cinnyris abbotti
 Cinnyris souimanga subsp. abbotti

Psittacidae
 Cacatua sulphurea abbotti (Oberholser, 1917)
 Psittacula alexandri abbotti (Oberholser, 1919)
 Psittinus cyanurus abbotti Richmond, 1902

Sulidae
 Papasula abbotti (Ridgway, 1893)
 Sula abbotti Ridgway, 1893

Sylviidae
 Malacocincla abbotti Blyth, 1845
 Malacocincla abbotti subsp. abbotti

Threskiornithidae
 Threskiornis bernieri abbotti (Ridgway, 1893)
 Threskiornis aethiopicus subsp. abbotti
{noformat}

But as a new feature we prefer to stay on the safe side and only group names within the same family.
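One way to express that restriction is to make the family part of the grouping key. The Python sketch below is an illustration under assumptions: the record layout (family, canonical name, original authorship) and the function name are invented, and the names are assumed to share the epithet abbotti already.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_within_family(
    records: List[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], List[str]]:
    """Group names by (family, original authorship) instead of authorship alone."""
    groups = defaultdict(list)
    for family, name, authorship in records:
        groups[(family, authorship)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


ABBOTTI = [
    ("Sulidae", "Papasula abbotti", "Ridgway, 1893"),
    ("Sulidae", "Sula abbotti", "Ridgway, 1893"),
    ("Threskiornithidae", "Threskiornis bernieri abbotti", "Ridgway, 1893"),
]

print(group_within_family(ABBOTTI))
```

With the family in the key, the Threskiornithidae name stays on its own even though it carries the same authorship as the two Sulidae names.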
    


Author: mdoering@gbif.org
Created: 2015-08-26 09:36:35.202
Updated: 2015-08-26 09:36:35.202
        
A bit more troublesome are these cases found in Aves families, where the year differs slightly (and the order of the authors differs too, which should be handled gracefully):

{quote}
aequatorialis: Thraupidae
Tangara arthus aequatorialis (Taczanowski & Berlepsch, 1885)
Dacnis lineata aequatorialis Berlepsch & Taczanowski, 1884
{quote}
The above subspecies appear to be distinct taxa with different types:
http://avibase.bsc-eoc.org/species.jsp?avibaseid=BBE57B94BF75B977
http://avibase.bsc-eoc.org/species.jsp?avibaseid=303530B0EEC98857

{quote}
aequatorialis: Trochilidae
Heliodoxa rubinoides aequatorialis (Gould, 1860)
Androdon aequatorialis Gould, 1863
Campylopterus largipennis aequatorialis Gould, 1861
{quote}

We will leave those cases unresolved for now, so as not to create synonyms programmatically too eagerly.
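A rough Python sketch of such a comparison, assuming zoological-style authorships of the form "Author & Author, year": the author team is compared as a set so that order does not matter, and any year mismatch is reported as unresolved rather than as a match. The function names and the three result labels are made up for this example.

```python
import re
from typing import FrozenSet, Optional, Tuple


def parse_authorship(authorship: str) -> Tuple[FrozenSet[str], Optional[int]]:
    """Return (set of author names, year or None); surrounding brackets are ignored."""
    text = authorship.strip().strip("()")
    year = None
    m = re.search(r"\b(\d{4})\b", text)
    if m:
        year = int(m.group(1))
        text = text[:m.start()]
    parts = re.split(r"&|,|\band\b", text)
    return frozenset(p.strip() for p in parts if p.strip()), year


def compare_authorship(a: str, b: str) -> str:
    """Compare two authorships; a missing year compares like any other value."""
    authors_a, year_a = parse_authorship(a)
    authors_b, year_b = parse_authorship(b)
    if authors_a != authors_b:
        return "DIFFERENT"
    if year_a == year_b:
        return "EQUAL"
    return "UNRESOLVED"  # same team, different year: leave it to a human
```

For the Thraupidae pair above, `compare_authorship("(Taczanowski & Berlepsch, 1885)", "Berlepsch & Taczanowski, 1884")` yields `"UNRESOLVED"`: the swapped author order is tolerated, the one-year difference is not.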
    


Author: rdmpage
Created: 2015-08-26 10:36:09.458
Updated: 2015-08-26 10:36:24.574
        
[~mdoering@gbif.org] Great to see progress on this, it's clearly not easy. The birds Tangara arthus aequatorialis and Dacnis lineata aequatorialis are different, their descriptions are in http://biostor.org/reference/108278 and http://biostor.org/reference/99650, respectively.

For the plant name Centaurea phrygia subsp. abbreviata it's interesting that IPNI doesn't have basionym links :(
    


Author: mdoering@gbif.org
Created: 2015-08-26 16:05:46.094
Updated: 2015-08-26 16:05:46.094
        
New BasionymSorter class created to group basionyms from a list of names with tests covering most of the above:

https://github.com/gbif/checklistbank/blob/master/checklistbank-common/src/test/java/org/gbif/checklistbank/authorship/BasionymSorterTest.java#L31

https://github.com/gbif/checklistbank/commit/9ae8f809de671d632be4596e17df64614dcd4051#diff-c92a56042d4c37f7e598e599fdf2ea49R26
    


Author: mdoering@gbif.org
Created: 2015-08-27 17:23:53.866
Updated: 2015-08-27 17:23:53.866
        
Assuming we find several accepted species names in a basionym group of the GBIF backbone and we identified a primary accepted name by using the most trusted source, what needs to happen with the other accepted name(s) in that group?

For now the GBIF backbone will change their status to Doubtful and raise an issue flag STATUS_DERIVED.
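As an illustration of this behaviour, the Python sketch below picks the accepted usage from the most trusted source and demotes the remaining accepted usages in the group. Only the Doubtful status and the STATUS_DERIVED flag come from this ticket; the `Usage` class, the numeric `source_priority` (lower meaning more trusted) and the priorities in the example are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class Status(Enum):
    ACCEPTED = "ACCEPTED"
    DOUBTFUL = "DOUBTFUL"
    SYNONYM = "SYNONYM"


@dataclass
class Usage:
    name: str
    status: Status
    source_priority: int  # assumption: lower number = more trusted source
    issues: Set[str] = field(default_factory=set)


def resolve_accepted(group: List[Usage]) -> Optional[Usage]:
    """Keep one accepted usage per basionym group and demote the others."""
    accepted = [u for u in group if u.status is Status.ACCEPTED]
    if not accepted:
        return None
    primary = min(accepted, key=lambda u: u.source_priority)
    for usage in accepted:
        if usage is not primary:
            usage.status = Status.DOUBTFUL
            usage.issues.add("STATUS_DERIVED")
    return primary


# Priorities below are invented purely for the example.
group = [
    Usage("Microtus obscurus (Eversmann, 1841)", Status.ACCEPTED, source_priority=1),
    Usage("Mus obscurus Eversmann, 1841", Status.ACCEPTED, source_priority=2),
]
primary = resolve_accepted(group)
```

After the call, the first usage remains accepted and the second one is doubtful with the STATUS_DERIVED flag. Usages that were not accepted to begin with are left untouched.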

Alternatively we could try to automatically convert it into a homotypical synonym. But that would lead to subsequent problems, primarily what to do with the potential child species or infraspecies. If we relink the children to the primary accepted name they might need to be recombined into a new genus or species, and we might see all sorts of nomenclatural issues only a human can properly resolve.
    


Author: rdmpage
Created: 2015-08-27 18:44:53.228
Updated: 2015-08-27 18:44:53.228
        
In an ideal world the nomenclatural issues would be computable. If you know the types, and the dates of publication, then the names to use follow automatically.

Maybe another way to tackle this is to have clusters of names that are in some sense related, and if a user searches for one of these they arrive at that cluster. I need to think this through a bit more, but I envisage something like a suffix tree of names which could be used to generate all the possible names one might encounter (e.g., species names with different generic names, inclusion of subgenera, suspicious, etc.). I think if we decouple recognising associated sets of names from assertions about which one is accepted, we could avoid some of these problems.
    


Author: mdoering@gbif.org
Created: 2015-08-27 18:53:47.628
Updated: 2015-08-27 18:53:47.628
        
What would happen to subspecies of a species which should be a synonym because it belongs to the basionym group (which we called "nomenclatural group" in the days with Dave: https://code.google.com/p/gbif-ecat/wiki/ChecklistBank#Nomenclatural_Group) ?

Should these subspecies also be synonymized? I guess not, as they should be based on a different type.
    


Author: rdmpage
Created: 2015-08-27 19:01:42.21
Updated: 2015-08-27 19:01:42.21

I'd need to play with an example, but I wonder if there's any way to postpone making a decision? Can we not simply say that these names are associated in some way, so that a user discovers information associated with related names but avoids the strong assertion that a name is accepted or not? If we separated names and taxa, this would be a fairly easy thing to do I suspect...


Author: mdoering@gbif.org
Created: 2015-08-27 20:27:24.249
Updated: 2015-08-27 20:27:24.249
        
We will basically create this discoverable cluster of names by having the basionym relation established. Then one can see the list of all names in such a group from the basionym and potentially vice versa.

BUT in order to include these names in occurrence searches, on maps or in statistics, they need to be synonyms in our backbone. That's how the system works and I think it makes sense that way. It actually gives the fuzzy term synonym some concrete meaning in GBIF.

PS: Note that I keep calling it basionym because I'm coming from the botanical world. Think of it as protonyms, chresonyms if you prefer that.
    


Author: mdoering@gbif.org
Created: 2015-08-28 10:46:59.318
Updated: 2015-08-28 10:55:05.257
        
A good visualization of "homotypical groups" within a synonym list is often found in botanical literature. The CDM software of the BBM does a pretty nice job of showing which names are all based on the same type. I have attached a few screenshots of extensive synonymies:
http://cichorieae.e-taxonomy.net/portal/cdm_dataportal/taxon/469b48a7-a2c9-4769-bd69-49b68674ba72/synonymy
http://cichorieae.e-taxonomy.net/portal/cdm_dataportal/taxon/209399b6-0d3c-4f5a-9f0d-b49ebe0f9403/synonymy
http://cichorieae.e-taxonomy.net/portal/cdm_dataportal/taxon/ccd1ceaf-c100-44a4-ba36-3f83bfed86e6/synonymy

This synonymy list is gigantic: http://cichorieae.e-taxonomy.net/portal/cdm_dataportal/taxon/7b3f0f40-63f2-44a4-a72b-6a8f49dd430f/synonymy

    


Author: mdoering@gbif.org
Created: 2015-09-10 15:06:08.61
Updated: 2015-09-10 15:06:08.61
        
Suppose there is an accepted name Cichorium intybus L. with a synonym Cichorium glabratum C. Presl, and another source has an accepted recombination Cichorium intybus subsp. glabratum (C. Presl) Arcang.

Should the subspecies be accepted in the backbone, or should it become a synonym because the primary source, which we trust more, does not accept C. glabratum, the basionym based on the same type as the subspecies? I would think so.

    


Author: mdoering@gbif.org
Created: 2015-09-18 20:30:36.157
Updated: 2015-09-18 20:30:36.157
        
Implemented here: https://github.com/gbif/checklistbank/blob/master/checklistbank-cli/src/main/java/org/gbif/checklistbank/nub/NubBuilder.java#L737

test:
https://github.com/gbif/checklistbank/blob/master/checklistbank-cli/src/test/java/org/gbif/checklistbank/nub/NubBuilderTest.java#L106

based on these 2 sources:
https://github.com/gbif/checklistbank/blob/master/checklistbank-cli/src/test/resources/nub-sources/dataset25.txt
https://github.com/gbif/checklistbank/blob/master/checklistbank-cli/src/test/resources/nub-sources/dataset26.txt