autocomplete for names shows informal names as top hits
10918
Reporter: mdoering
Assignee: fmendez
Type: Improvement
Summary: autocomplete for names shows informal names as top hits
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-03-02 10:52:13.094
Updated: 2013-12-09 13:41:01.715
Resolved: 2012-07-09 14:42:04.062
Description: when entering only one letter "k" into the search forms the autocomplete comes up with informal spec. names, see attached screenshot.
Would be good to avoid autocomplete on these informal names by filtering/sorting by name type
Author: fmendez@gbif.org
Created: 2012-07-06 13:36:04.55
Updated: 2012-07-06 13:36:04.55
There are several implications for this improvement:
- The "suggest" service is implemented using facets by default, and Solr supports only 2 sort options for facets: by count or by index (lexicographic order).
- Facets are a good option because they show the most relevant results in terms of how much data we have for the input parameter. If we change this to order the results by "name_type" the following scenario is possible: imagine we have the informal canonical_name "kokuria spec." repeated 10 times with name_type = 3 (i.e. facet.count = 10), and we also have "kokuria" 1 time with name_type = 0. The question is: which one is more relevant, the name with more counts or the preferred name_type?
- One option is to avoid facets, but there's no way of implementing a "SELECT DISTINCT" in Solr without them. Additionally, we are using the field canonical_name for the suggest service; that field is not stored, which means we would have to change the schema and re-create the index.
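For reference, a minimal sketch of what a facet-based suggest request could look like. The Solr parameters (facet.field, facet.prefix, facet.sort, facet.limit, facet.mincount) are standard; the function name and the defaults are only illustrative and not taken from the actual service code.

```python
from urllib.parse import urlencode


def suggest_params(prefix, limit=10, sort="count"):
    """Build Solr query parameters for a facet-based suggest lookup.

    Hypothetical helper: the real service may build its request differently.
    """
    # Solr offers only these two orderings for field facets.
    if sort not in ("count", "index"):
        raise ValueError("facet.sort must be 'count' or 'index'")
    return [
        ("q", "*:*"),
        ("rows", "0"),                       # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "canonical_name"),   # field used by the suggest service
        ("facet.prefix", prefix),            # the characters typed so far
        ("facet.sort", sort),
        ("facet.limit", str(limit)),
        ("facet.mincount", "1"),
    ]


query_string = urlencode(suggest_params("k"))
```

Because facet.sort accepts nothing but count or index, there is no place in such a request to express "order by name_type first".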
Author: mdoering@gbif.org
Created: 2012-07-06 13:57:51.844
Updated: 2012-07-06 13:57:51.844
I would think the sorting for autocomplete should be based on primarily these:
* monomials (1 word) > bi- > trinomials
* secondary sorting by name type: WELLFORMED > SCINAME > VIRUS > DOUBTFUL > HYBRID > CULTIVAR > INFORMAL >>> ignore BLACKLISTED
* number of species for monomials, maybe also number of occurrences
A combined value of all these would be great and not a strict sorting over the first, then name type etc.
Is there a chance we add a field in solr for that? Hm, no chance for using facets I suppose. Needs more thinking
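A rough sketch of how such a combined value might be computed. The three factors and the name type order come from the list above; the multipliers (3, 2, 1), the log scaling and the function name are made-up assumptions that would need tuning.

```python
import math

# Name types from most to least preferred, as listed above.
NAME_TYPE_ORDER = [
    "WELLFORMED", "SCINAME", "VIRUS", "DOUBTFUL",
    "HYBRID", "CULTIVAR", "INFORMAL",
]


def suggest_weight(canonical_name, name_type, usage_count=0):
    """Return a combined ordering weight, or None for names to ignore.

    Hypothetical scoring: higher is better; the multipliers are arbitrary.
    """
    if name_type == "BLACKLISTED":
        return None  # ignored entirely
    words = max(len(canonical_name.split()), 1)
    word_score = 1.0 / words  # monomial 1.0, binomial 0.5, trinomial 0.33
    rank = NAME_TYPE_ORDER.index(name_type)
    type_score = 1.0 - rank / len(NAME_TYPE_ORDER)
    count_score = math.log10(1 + usage_count)  # dampen very large counts
    return 3 * word_score + 2 * type_score + count_score


# The scenario from the first comment, assuming "kokuria" is well-formed.
monomial = suggest_weight("kokuria", "WELLFORMED", usage_count=1)
informal = suggest_weight("kokuria spec.", "INFORMAL", usage_count=10)
```

With these particular multipliers the single well-formed "kokuria" outranks the informal name that occurs 10 times, but a large enough count could still lift an informal name, which is the point of a combined value rather than a strict sort.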
Author: mdoering@gbif.org
Created: 2012-07-06 14:27:30.576
Updated: 2012-07-06 14:27:30.576
As a start I suggest to simply exclude all blacklisted and informal names from the autocomplete, that'll at least remove the nasty names from the UI :)
Let's create another issue for the sort order though, or rename this one and keep it open
Author: fmendez@gbif.org
Created: 2012-07-06 15:47:44.842
Updated: 2012-07-06 15:47:44.842
ok, we can easily add a filter to exclude the informal names. A more complicated scenario like the one described in the second comment can't be accomplished with facets; the real problem with avoiding facets for this is, again, implementing "SELECT DISTINCT" to get distinct values. 2 more options:
- Try with facet pivots, in this way we can get facets like name_type,canonical_name [field:name_type,value:0,count:22,pivot:{field:canonical_name,value:kikso spec,count: 89}....]. The problem here is that we need a lexicographic order for name_type (0,1,2,3...) and a count-based order for canonical_name
- Create a new "small" index containing only canonical_names, this index can live in the same solr instance; for this we should create 2 solr cores: 1 for the current index and another for the canonical_name's index.
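To illustrate the ordering problem with the pivot option, here is a sketch that re-sorts a pivot response on the client: name_type in index order, then canonical_name by count. The response shape (field, value, count, pivot) follows Solr's facet_pivot output; the sample values are invented and the function name is hypothetical.

```python
def flatten_pivot(pivot):
    """Flatten a name_type,canonical_name pivot into suggestion order.

    Outer buckets are ordered by name_type value (ascending), inner
    buckets by count (descending), with the name as a tie-breaker.
    """
    ordered = []
    for type_bucket in sorted(pivot, key=lambda b: b["value"]):
        names = sorted(
            type_bucket.get("pivot", []),
            key=lambda b: (-b["count"], b["value"]),
        )
        for name_bucket in names:
            ordered.append(
                (name_bucket["value"], type_bucket["value"], name_bucket["count"])
            )
    return ordered


# Invented sample shaped like facet_counts.facet_pivot["name_type,canonical_name"].
sample = [
    {"field": "name_type", "value": 3, "count": 10, "pivot": [
        {"field": "canonical_name", "value": "kokuria spec.", "count": 10},
    ]},
    {"field": "name_type", "value": 0, "count": 3, "pivot": [
        {"field": "canonical_name", "value": "kokuria", "count": 1},
        {"field": "canonical_name", "value": "kokia", "count": 2},
    ]},
]

suggestions = flatten_pivot(sample)
```

This only works if Solr returns enough buckets at both levels to re-sort, so it is a workaround for small result sets rather than a general fix.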
Author: mdoering@gbif.org
Created: 2012-07-06 16:22:38.632
Updated: 2012-07-06 16:22:38.632
A separate index with only distinct canonical names, a multi value for datasetKey and another ordering weight which we can populate any way we like sounds ideal.
I'd suggest this becomes a new issue for a later 0.3 release
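A sketch of how the documents for such a separate index might be assembled: one document per distinct canonical name, a multi-valued datasetKey and a single ordering weight. The input record layout, the sample keys and the rule of keeping the highest weight per name are assumptions for illustration only.

```python
def build_suggest_docs(usages, weight_fn):
    """Collapse name usages into one document per distinct canonical name.

    `usages` is an iterable of dicts with "canonical_name" and "datasetKey";
    `weight_fn` maps a usage to a number (how to weight is left open).
    """
    docs = {}
    for usage in usages:
        name = usage["canonical_name"]
        doc = docs.setdefault(
            name, {"canonical_name": name, "datasetKey": set(), "weight": 0.0}
        )
        doc["datasetKey"].add(usage["datasetKey"])
        # Assumption: keep the best weight seen for this name.
        doc["weight"] = max(doc["weight"], weight_fn(usage))
    # Highest weight first; datasetKey becomes a sorted list for indexing.
    ranked = sorted(docs.values(), key=lambda d: (-d["weight"], d["canonical_name"]))
    return [{**d, "datasetKey": sorted(d["datasetKey"])} for d in ranked]


# Invented sample data; dataset keys are placeholders.
usages = [
    {"canonical_name": "kokuria", "datasetKey": "ds-1", "score": 5.0},
    {"canonical_name": "kokuria", "datasetKey": "ds-2", "score": 4.0},
    {"canonical_name": "kokia", "datasetKey": "ds-1", "score": 6.0},
]
docs = build_suggest_docs(usages, weight_fn=lambda u: u["score"])
```

Since each name appears exactly once, a plain query sorted by weight would give the distinct values without facets.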
Author: fmendez@gbif.org
Comment: ok, for the moment I'll exclude BLACKLISTED and INFORMAL name types
Created: 2012-07-06 17:06:40.865
Updated: 2012-07-06 17:06:40.865