Taxon match fuzzy considered harmful
Summary: Taxon match fuzzy considered harmful
Created: 2014-09-02 15:55:15.251
Updated: 2015-04-02 11:43:55.502
Resolved: 2015-04-02 11:43:55.472
Description: It might be time to consider NOT using fuzzy matching for taxa, because the possibility for error is large, especially if we just rely on dumb string matching.
For example, http://www.gbif.org/species/2702715 is listed as "Monerma P. Beauv.", a genus of grass regarded as a synonym of "Hainardia". All 14,006 occurrences for this "grass" are in the sea and come from the NODC WOD01 Plankton Database! WTF!?
Turns out these occurrences have the scientificName "Monera (bacteria)" and no other taxonomic information. These are unidentified bacteria. Yet somehow these have been matched to "Monerma" (I guess because that's one character edit away from "Monera"), and now we have 14,006 bacteria pretending to be flowering plants!
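To make the "one character edit" guess concrete, here is a minimal sketch of a plain Levenshtein distance. It is only an illustration of why naive fuzzy string matching treats the two names as near-identical; it is not taken from the GBIF code base, and the function name `edit_distance` is made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts,
    deletes or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


print(edit_distance("Monera", "Monerma"))  # 1
```

A single inserted "m" separates the bacterial name from the grass genus, so any matcher that accepts distance-1 hits without further evidence will conflate them.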
Can we avoid doing this? Can we require more than a string match between scientific names, say by also looking at the higher classification, and if that is lacking, simply not attempt a match? These sorts of errors tend to have a trivial cause, but the effects for users can be pretty major.
Created: 2015-04-02 11:42:44.156
Updated: 2015-04-02 11:42:44.156
The new species match algorithm is much stricter and now returns no match in those cases:
The species matching basically works by creating an overall matching confidence score based on the following parts:
- name string match: only straight matches return a good confidence. As soon as fuzzy matching is involved it requires further matching confidence (mostly classification based, see next)
- classification match: string matching of the different higher taxon names. We apply this maintained synonym map to all ranks to get better coverage of alternative names whose concepts at least partly overlap: http://rs.gbif.org/dictionaries/synonyms/
- rank match: this has only minimal impact, mostly to match higher "homonym" taxa properly
- status match: we slightly favor accepted names and punish doubtful taxa a little.
- next best match: in cases when there is no other close match we increase the confidence
The whole procedure is a little complex and pretty sensitive to weighting the different parts.
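As a rough illustration of how such parts could combine, here is a minimal sketch. Every number, field name and the threshold in it is an assumption invented for this example; the real weights live in the code base and, as noted, keep changing.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    # All fields are hypothetical stand-ins for what a matcher might track.
    name: str
    exact_name_match: bool     # False means the name only matched fuzzily
    classification_score: int  # >0 higher taxa agree, <0 conflict, 0 unknown
    rank_matches: bool
    status: str                # e.g. "ACCEPTED", "SYNONYM", "DOUBTFUL"


def base_score(c: Candidate) -> int:
    """Sum the per-part contributions (made-up weights)."""
    score = 100 if c.exact_name_match else 60      # fuzzy hits start low
    score += c.classification_score                # the main extra evidence
    score += 5 if c.rank_matches else 0            # minimal impact
    score += {"ACCEPTED": 2, "DOUBTFUL": -5}.get(c.status, 0)
    return score


def best_match(candidates: List[Candidate],
               threshold: int = 80) -> Optional[Tuple[Candidate, int]]:
    """Return (candidate, confidence) or None when nothing is convincing."""
    scored = sorted(((base_score(c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    if not scored:
        return None
    top_score, top = scored[0]
    # "next best match": reward a clear winner with no close rival.
    if len(scored) == 1 or top_score - scored[1][0] >= 20:
        top_score += 10
    top_score = max(0, min(100, top_score))
    return (top, top_score) if top_score >= threshold else None


# "Monera (bacteria)" with no classification only hits "Monerma" fuzzily.
monerma = Candidate("Monerma P. Beauv.", False, 0, False, "SYNONYM")
print(best_match([monerma]))  # None
```

Under these assumed weights a fuzzy name hit cannot clear the threshold on its own; it needs supporting classification evidence, which is the behaviour the comment describes.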
Here are some important entry points into the code base:
name matching method:
calculate overall matching score for all considered matches (max 50, retrieved from a very fuzzy lucene query):
The list of considered matches shows up as "alternatives" in the matching response if you set verbose=true:
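A minimal sketch of requesting the alternatives from a script follows. The endpoint URL and the response field names (`scientificName`, `confidence`, `alternatives`) are assumptions based on the public GBIF API as I understand it, not something stated in this ticket, so check them against the current API documentation. The helper names are made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public endpoint for the species match service.
MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def build_match_url(name: str, verbose: bool = True, **classification: str) -> str:
    """Build the request URL; extra keyword arguments (for example
    kingdom="Bacteria") are passed through as query parameters."""
    params = {"name": name, "verbose": str(verbose).lower()}
    params.update(classification)
    return f"{MATCH_ENDPOINT}?{urlencode(params)}"


def summarize(response: dict) -> list:
    """List (name, confidence) for the main result and each alternative.
    Missing keys yield None, e.g. when the service reports no match."""
    rows = [(response.get("scientificName"), response.get("confidence"))]
    rows += [(alt.get("scientificName"), alt.get("confidence"))
             for alt in response.get("alternatives", [])]
    return rows


def fetch_match(name: str, **classification: str) -> dict:
    """Call the service over the network and decode the JSON body."""
    with urlopen(build_match_url(name, **classification), timeout=30) as resp:
        return json.load(resp)


print(build_match_url("Monera (bacteria)"))
```

Calling `summarize(fetch_match("Monera (bacteria)"))` would then show which candidates were considered and how each one scored, which is handy when a match looks suspicious.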
I will try hard to document the matching procedure better in the future, though the fine details keep changing.