Issue 10180

Search for "Aves" shows no aves but plant species

Reporter: mdoering
Assignee: mdoering
Type: Bug
Summary: Search for "Aves" shows no aves but plant species
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2011-11-08 18:42:38.736
Updated: 2013-12-09 14:01:15.505
Resolved: 2011-11-17 15:50:37.984
        
Description: A search for birds with "Aves" should return the nub record for Aves as the first result:
Search:
http://staging.gbif.org:8080/portal-web-dynamic/species/search?q=Aves
Aves:
http://staging.gbif.org:8080/portal-web-dynamic/species/212

For example look at the ecat dev portal search results:
http://ecat-dev.gbif.org/search?q=Aves&rkey=1

Full matches must rate highest and nub before any other checklist.
    


Author: fmendez@gbif.org
Comment: This can be easily fixed using "query elevation"; in some way this is related to another issue that requests that the Nub should be the default checklist.
Created: 2011-11-09 09:20:55.269
Updated: 2011-11-09 09:20:55.269


Author: mdoering@gbif.org
Created: 2011-11-09 18:33:19.343
Updated: 2011-11-09 18:33:19.343
        
From what I read, query elevation boosts only manually selected documents, not one field over another.
I would have thought we want to boost the document scoring, so a combination of these comes to mind:

1) Boost Document Fields at index time - by calling field.setBoost() before adding a field to the document. We should boost scientificName over vernacularName over the higher ranks. See http://lucene.apache.org/java/3_4_0/scoring
For example, something like this:
scientificName=10
vernacularName=8
species=9
genus=8
subgenus=7
family=6
order=4
class=3
phylum=2
kingdom=1

2) lengthNorm - matches on a smaller field score higher than matches on a larger field
=> this should boost the full matches. "Aves" should score higher than "Arminia aves" for a search "aves"

3) a FunctionQuery to boost accepted over synonyms and higher ranks over lower ones? see http://wiki.apache.org/solr/FunctionQuery

4) we add some popularity boosting by keeping track of clicked species. See http://stackoverflow.com/questions/2944158/solr-lucene-user-click-based-ranking and http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_change_the_score_of_a_document_based_on_the_.2Avalue.2A_of_a_field_.28say.2C_.22popularity.22.29
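
The effect described in point 2 can be illustrated with a little arithmetic. This is only a sketch, assuming Lucene's default similarity, where the length norm of a field is 1/sqrt(number of terms). The function name `length_norm` and the whitespace tokenisation are illustrative assumptions, and the real index stores the norm in a single lossy byte, so actual values are only approximately these.

```python
import math

def length_norm(field_value: str) -> float:
    """Approximate the default Lucene length norm: 1 / sqrt(number of terms)."""
    # Assumption: plain whitespace tokenisation stands in for the real analyzer.
    num_terms = len(field_value.split())
    return 1.0 / math.sqrt(num_terms)

for name in ("Aves", "Arminia aves"):
    print(name, round(length_norm(name), 3))
```

The one-term name "Aves" gets a norm of 1.0 against roughly 0.707 for the two-term "Arminia aves", so with otherwise equal term matches the full match should rank first.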



    


Author: mdoering@gbif.org
Created: 2011-11-09 18:38:05.905
Updated: 2011-11-09 18:38:05.905
        
And boosting all nub documents would also be simple by doing document.setBoost() at index time for the nub records.
Using this, we could also set a slightly different boost for the different checklists, with Catalogue of Life, the IUCN Red Lists and IPNI scored higher than others, but not as high as the nub. If we could maintain these boost settings in the Checklist model object itself, that would be awesome.
    


Author: fmendez@gbif.org
Created: 2011-11-10 09:40:58.366
Updated: 2011-11-10 09:40:58.366
        
1) The scoring numbers are a very good idea. The only problem is that we would have to change the way we index documents: at the moment I am using a Data Import Handler, and applying index-time boosting with it is a bit problematic.
2) lengthNorm: I tried to apply that in the past but it didn't affect the results much. I will explore this option further.
3) We should explore this more deeply.
4) That option requires the dismax query handler. I am not sure, but I think dismax doesn't support wildcards; there is another query handler, edismax, that does. In both cases the pattern is applied to a set of fields, and there is no way to boost specific search patterns, for example to boost "aves" over "aves*".

In any case, all the ideas require trial and error. At this moment we can split the problems we have in two: exact match scoring and general field scoring. I am working on the first issue, and if you want you can start working on general field scoring (point 1): this can be added to the Solr schema without changing the source code.
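
Field weighting can also be expressed at query time rather than at index time, which avoids touching the import. A minimal sketch, assuming the edismax parser is available and that the index has fields named exactly as in the boost list proposed earlier in this issue. The weights are copied from that proposal, not from the actual schema, and `build_params` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Weights copied from the proposal earlier in this issue; the field names
# are assumed to exist in the Solr schema under exactly these names.
FIELD_BOOSTS = {
    "scientificName": 10,
    "species": 9,
    "vernacularName": 8,
    "genus": 8,
    "subgenus": 7,
    "family": 6,
    "order": 4,
    "class": 3,
    "phylum": 2,
    "kingdom": 1,
}

def build_params(user_query: str) -> dict:
    """Build request parameters for an edismax query with per-field boosts."""
    qf = " ".join(f"{field}^{boost}" for field, boost in FIELD_BOOSTS.items())
    return {"defType": "edismax", "q": user_query, "qf": qf}

params = build_params("Aves")
print(urlencode(params))
```

This is a query-time alternative to the index-time `field.setBoost()` idea, not the same mechanism, so the resulting ranking would still need the trial and error mentioned above.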
    


Author: mdoering@gbif.org
Comment: Keep this issue open until the staging portal really shows it's fixed, i.e. we have to rebuild the index first. We should also set up a selenium test to make sure it stays fixed!
Created: 2011-11-11 21:12:03.199
Updated: 2011-11-11 21:12:03.199


Author: mdoering@gbif.org
Created: 2011-11-16 00:01:56.238
Updated: 2011-11-16 00:01:56.238
        
I've added a selenium test to make sure the nub Aves comes first, which it does now!
But the next records are a mix of all kinds of usages within the Aves class.
This is ok, but I would expect the higher ranks to show before the lower ones.
Currently we get these initial results:

 Aves class
 Pleurothallis aves-seriales Luer & R.Escobar species synonym
 Curaeus forbesi (P. L. Sclater, 1886) species
 Pyrrhula murina Godman, 1866 species
 Diomedeidae Gray, 1840 family
 Anseranatidae Sclater, 1880 family
 Charadriidae family
 Phoebetria Reichenbach, 1853 genus
 Phoebetria immutabilis species
 Cladornithidae family

The higher the rank, the slightly higher the score should be.
Setting a boost on the doc based on the rank at index time comes to mind as a solution, or using a function query.

I have just tried to add a function query to the solr query manually based on the number of species of a usage.
This works pretty well and I think this is even more useful than sorting by rank only.
It returns the taxa with the most species first, which are also likely to be the more popular taxa.
The addition I used was the log of num_species:  +_val_:"log(num_species)"
Using num_species directly caused Plantae and Animalia to show up first even though there is no aves hit.
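
Why the logarithm helps can be seen with a simplified additive model, in which the function value is simply added to the text-match score. All counts and text scores below are invented for illustration and are not values from the index. Solr's `log` function is base 10, which `math.log10` mirrors here.

```python
import math

# Hypothetical documents: (name, text-match score, num_species).
# All numbers are invented for illustration only.
docs = [
    ("Aves", 5.0, 10_000),
    ("Plantae", 0.0, 300_000),     # no text hit on "aves"
    ("Animalia", 0.0, 1_000_000),  # no text hit on "aves"
]

def rank(boost):
    """Order names by text score plus the boost function of num_species."""
    scored = [(text + boost(n), name) for name, text, n in docs]
    return [name for _, name in sorted(scored, reverse=True)]

print(rank(lambda n: n))   # raw count swamps the text score
print(rank(math.log10))    # log10 keeps the text match decisive
```

With the raw count the two kingdoms outrank Aves purely on size. With the log the boost shrinks to a few points, so the text match decides the order.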

The example query I used:

http://jawa.gbif.org:8080/solr/select/?&fl=*,score&version=2.2&start=0&rows=10&indent=on&q=%28checklist_title%3A%22GBIF+Taxonomic+Backbone%22+AND+%28canonical_name%3AAves%5E100.0+OR+class%3AAves%29%29+_val_%3A%22log(num_species)%22
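
For readability, the same request can be assembled from its decoded parts. This is a sketch only: the host and field names are taken from the URL above, `build_query` is a hypothetical helper, and the search term is inserted without any Solr query escaping.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR_SELECT = "http://jawa.gbif.org:8080/solr/select/"  # host from the example URL

def build_query(term: str) -> str:
    """Restrict to the backbone, boost exact canonical-name hits, add a function query."""
    # Assumption: `term` is a single plain word; no escaping is applied.
    main = (
        'checklist_title:"GBIF Taxonomic Backbone" '
        f"AND (canonical_name:{term}^100.0 OR class:{term})"
    )
    return f'({main}) _val_:"log(num_species)"'

params = {
    "fl": "*,score",
    "version": "2.2",
    "start": 0,
    "rows": 10,
    "indent": "on",
    "q": build_query("Aves"),
}
url = f"{SOLR_SELECT}?{urlencode(params)}"
print(url)
```

Building the `q` string unencoded and letting `urlencode` handle the percent-encoding makes the three parts easier to see: the checklist filter, the boosted name match, and the `_val_` function query.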

We could also think about making use of num_occurrences. A combination of both would be useful, but it's going to be difficult to find the right balance, as the numbers are widely different and occurrences can be really high and distort the results.
    


Author: mdoering@gbif.org
Created: 2011-11-16 00:26:26.448
Updated: 2011-11-16 00:26:26.448
        
A combination of both species and occurrences can be done using the scale function like this:

+_val_:"sum(scale(num_species,1,5),scale(num_occurrences,1,3))"
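
What this combination does can be sketched outside Solr. The example assumes `scale(x, min, max)` maps the observed values linearly onto the target range, which is how the Solr function is documented. The `scale` helper below is a stand-in written for illustration, and the counts are invented.

```python
def scale(values, target_min, target_max):
    """Linearly map values so the smallest becomes target_min and the largest target_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: nothing to spread, so pin everything to the minimum.
        return [float(target_min)] * len(values)
    span = target_max - target_min
    return [target_min + (v - lo) * span / (hi - lo) for v in values]

# Hypothetical counts for three taxa; invented for illustration only.
num_species = [1, 500, 10_000]
num_occurrences = [20, 2_000_000, 150_000]

boost = [
    s + o
    for s, o in zip(scale(num_species, 1, 5), scale(num_occurrences, 1, 3))
]
print(boost)
```

Whatever the raw magnitudes, the combined boost stays between 2 and 8, with species counts contributing up to 5 and occurrences up to 3, so very large occurrence numbers can no longer dominate.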
    


Author: mdoering@gbif.org
Comment: Ended up using a sort by, but it works fine now!
Created: 2011-11-17 15:50:32.467
Updated: 2011-11-17 15:50:32.467