Issue 12601

Dataset search doesn't find all datasets that are suggested

Reporter: mdoering
Assignee: fmendez
Type: Bug
Summary: Dataset search doesn't find all datasets that are suggested
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2013-01-16 18:27:48.76
Updated: 2013-08-29 14:44:49.29
Resolved: 2013-03-06 15:12:28.292
Description: The dataset autosuggest returns more datasets than the search actually finds.
Autosuggest hits 3 more pontaurus datasets that are not part of the search result:

Compare with the attached screenshot.

Attachment Screen Shot 2013-01-16 at 18.26.31.png

Attachment Screen Shot 2013-02-04 at 5.48.30 PM.png

Created: 2013-01-28 16:37:48.904
Updated: 2013-01-28 16:37:48.904
Autosuggest only searches the dataset title, using a field with type="text_auto_ngram" that uses solr.EdgeNGramFilterFactory, which is useful for matching prefix substrings in the index at query time. That's why it returns all datasets with "pon" in the title.

Dataset search, on the other hand, does not use the same EdgeNGramFilterFactory. Actually, it searches on a dataset title that is configured as type="text" which uses solr.WordDelimiterFilterFactory. This causes words to be split into subwords on case transitions, meaning that PonTaurus gets broken down and indexed into "pon taurus".

This explains why dataset search matches the dataset with title PonTaurus, but not the other 3 datasets with titles "Pontaurus", since their titles aren't split into subwords. For this reason, only an exact search will return all datasets with title "pontaurus".
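The difference can be reproduced with a toy sketch in plain Java. It does not use Solr: splitOnCaseTransitions and edgeNGrams are hypothetical helpers that only imitate the behaviour described above for solr.WordDelimiterFilterFactory (splitting on case transitions) and solr.EdgeNGramFilterFactory (prefix grams). The real filters have more options, and the maximum gram size of 25 is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TitleTokenSketch {

  // Imitates case-transition splitting: a lower-to-upper change starts a new subword.
  static List<String> splitOnCaseTransitions(String word) {
    List<String> parts = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < word.length(); i++) {
      char c = word.charAt(i);
      boolean lowerToUpper = i > 0
          && Character.isLowerCase(word.charAt(i - 1))
          && Character.isUpperCase(c);
      if (lowerToUpper) {
        parts.add(current.toString().toLowerCase(Locale.ROOT));
        current.setLength(0);
      }
      current.append(c);
    }
    if (current.length() > 0) {
      parts.add(current.toString().toLowerCase(Locale.ROOT));
    }
    return parts;
  }

  // Imitates edge n-grams: every prefix of the lower-cased word, up to maxGram characters.
  static List<String> edgeNGrams(String word, int maxGram) {
    List<String> grams = new ArrayList<>();
    String lower = word.toLowerCase(Locale.ROOT);
    for (int n = 1; n <= Math.min(maxGram, lower.length()); n++) {
      grams.add(lower.substring(0, n));
    }
    return grams;
  }

  public static void main(String[] args) {
    String query = "pon";
    for (String title : Arrays.asList("PonTaurus", "Pontaurus")) {
      boolean subwordMatch = splitOnCaseTransitions(title).contains(query);
      boolean edgeMatch = edgeNGrams(title, 25).contains(query);
      // PonTaurus: subword match=true, edge n-gram match=true
      // Pontaurus: subword match=false, edge n-gram match=true
      System.out.println(title + ": subword match=" + subwordMatch
          + ", edge n-gram match=" + edgeMatch);
    }
  }
}
```

In this sketch only the title with the case transition yields the token "pon", while the prefix grams contain "pon" for both spellings, which mirrors the mismatch between search and autosuggest.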

It is therefore proposed that dataset search use the same Solr field for the dataset title as autosuggest, which uses EdgeNGramFilterFactory.

I am testing this locally now.

Created: 2013-01-28 21:43:58.868
Updated: 2013-01-28 21:43:58.868
Good find Kyle. Using ngrams in the regular search would make sure we also find those datasets there. I am just not too excited about the ranking of the autosuggest. If you type _cape_ you get:

Southern Cape herbarium
RSPB - Capercaillie national surveys (SCARABBS)
Zoobenthos in surface sediments off Cape Lucul

I would much rather see the full matches come first. To do this, I think we also need to match against the title as it is and _additionally_ use the ngrams. Or turn off ngrams in both, restricting search to full word matches? Did we ever try the Lucene 4 fuzzy search here, which is much faster?


Created: 2013-01-29 16:26:59.882
Updated: 2013-01-29 16:26:59.882
Agreed, full matches should come first, followed by partial matches.

Let me clarify: for phrase queries autosuggest uses solr.EdgeNGramFilterFactory, and for single-term queries it uses solr.NGramFilterFactory. Each filter suits its own use case (phrases vs. single terms).
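As a rough illustration of how the two gram types differ, here is a plain Java sketch that does not use Solr. The helpers nGrams and edgeNGrams are hypothetical and only approximate the idea behind solr.NGramFilterFactory and solr.EdgeNGramFilterFactory. The gram sizes 1 to 25 and the sample word "landscape" are assumptions chosen for the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class NGramSketch {

  // All substrings of length minGram..maxGram, starting at any position.
  static Set<String> nGrams(String term, int minGram, int maxGram) {
    Set<String> grams = new LinkedHashSet<>();
    String lower = term.toLowerCase(Locale.ROOT);
    for (int start = 0; start < lower.length(); start++) {
      for (int len = minGram; len <= maxGram && start + len <= lower.length(); len++) {
        grams.add(lower.substring(start, start + len));
      }
    }
    return grams;
  }

  // Only the substrings anchored at the start of the term.
  static Set<String> edgeNGrams(String term, int minGram, int maxGram) {
    Set<String> grams = new LinkedHashSet<>();
    String lower = term.toLowerCase(Locale.ROOT);
    for (int len = minGram; len <= maxGram && len <= lower.length(); len++) {
      grams.add(lower.substring(0, len));
    }
    return grams;
  }

  public static void main(String[] args) {
    // "cape" is a prefix of "capercaillie" but sits at the end of "landscape".
    for (String term : new String[] {"Capercaillie", "landscape"}) {
      System.out.println(term
          + ": edge=" + edgeNGrams(term, 1, 25).contains("cape")
          + ", ngram=" + nGrams(term, 1, 25).contains("cape"));
    }
  }
}
```

Under these assumptions the edge grams only match when the typed text starts a word, while the plain n-grams also match text in the middle or at the end of a word.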

Therefore, in order to 1) keep full matches first and 2) bring back partial matches (similar to what appears in auto suggest), [] and I decided to try adding 2 more full text search fields for dataset_title that use solr.EdgeNGramFilterFactory and solr.NGramFilterFactory:


  fulltextFields = {
    @FullTextSearchField(field = "dataset_title", highlightField = "dataset_title", exactMatchScore = 20.0d,
      partialMatchScore = 1.0d),
    @FullTextSearchField(field = "dataset_title_ngram", highlightField = "dataset_title", exactMatchScore = 12.0d,
      partialMatchScore = 1.0d),
    @FullTextSearchField(field = "dataset_title_nedge", highlightField = "dataset_title", exactMatchScore = 10.0d,
      partialMatchScore = 1.0d),

To explain,

- field = "dataset_title" has type="text", meaning that the whole title will be tokenized. An exactMatchScore of 20.0d means exact matches on any whole word from the title will have a higher score than partial matches.
- field = "dataset_title_ngram" has type="text_auto_ngram", meaning solr.NGramFilterFactory is used with gram sizes from 1 to a maximum of 25 characters.
- field = "dataset_title_nedge" has type="text_auto_edge", meaning solr.EdgeNGramFilterFactory is used with gram sizes from 1 to a maximum of 25 characters.
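The intended effect of the three boosts (full matches first, partial matches after) can be sketched with a toy ranker in plain Java. This is only an assumption about how the scores combine: each field is taken to add its exactMatchScore when the query equals one of the terms that field would index (a whole word, any substring of a word, or a prefix of a word). It ignores partialMatchScore and the 25-character gram limit, and is not the actual behaviour of @FullTextSearchField or of Solr scoring. All names in the sketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class TitleScoreSketch {

  // Boost values copied from the annotation; how they combine is assumed.
  static final double WHOLE_WORD = 20.0d; // dataset_title
  static final double NGRAM = 12.0d;      // dataset_title_ngram
  static final double EDGE = 10.0d;       // dataset_title_nedge

  static double score(String title, String query) {
    String q = query.toLowerCase(Locale.ROOT);
    boolean wholeWord = false;
    boolean substring = false;
    boolean prefix = false;
    // Split the title into words on anything that is not a letter or digit.
    for (String word : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      wholeWord |= word.equals(q);
      substring |= word.contains(q);
      prefix |= word.startsWith(q);
    }
    return (wholeWord ? WHOLE_WORD : 0) + (substring ? NGRAM : 0) + (prefix ? EDGE : 0);
  }

  // Highest score first; the sort is stable, so ties keep their input order.
  static List<String> rank(List<String> titles, String query) {
    List<String> ranked = new ArrayList<>(titles);
    ranked.sort(Comparator.comparingDouble((String t) -> score(t, query)).reversed());
    return ranked;
  }

  public static void main(String[] args) {
    List<String> titles = Arrays.asList(
        "RSPB - Capercaillie national surveys (SCARABBS)",
        "Southern Cape herbarium",
        "Zoobenthos in surface sediments off Cape Lucul");
    for (String title : rank(titles, "cape")) {
      System.out.println(score(title, "cape") + "  " + title);
    }
  }
}
```

With these assumed rules the two titles containing the whole word "Cape" score 42.0 and the "Capercaillie" title scores 22.0, so the whole-word matches are listed first.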

Running a test locally, the results were promising.

A search for "pon" came back with 39 results, with a similar list to the one from AutoSuggest:

PonTaurus collection
Pond Laboratory Meteorological Station
Pond Conservation - National Pond Monitoring Network collated pond survey data for Great Britain 1972 to 2007
Pond Area Estimates: Nine Study Regions in Alaska for 3 time periods (1950s, 1978-1982, 1999-2001) using remotely sensed images

Before, a search for "pon" only came back with 3 results:

I also compared a search for "cape". With the new configuration there were 2146 results, with the top results being:

Cape Floristic Region Environmental Data
Cape mountain zebra population in the Mountain Zebra National Park (1937-1995).
Habitat suitability indices for cape mountain zebra in three plant communities inthe mountain zebra national park

Before, a search for "cape" came back with 107 results:

One thing I'm not certain about is why this change produced so many more matches on External datasets. Otherwise, like I said, it seems promising. [] what do you think?


Created: 2013-02-04 18:27:24.609
Updated: 2013-02-04 18:27:24.609
After committing the latest change ( ) to the SolrAnnotatedDataset, there is a much closer correspondence between the auto suggest and the full text search (see screenshot).

The highlighting is not working perfectly yet, though. I would expect all instances of "pon" in the title to be highlighted, would you []? I just tried changing the schema.xml, setting stored="true" for field name="dataset_title_nedge" and field name="dataset_title_ngram", but this didn't work.
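To make the expected result concrete, here is a minimal plain Java sketch of highlighting every instance of the query in a title. It does not use Solr highlighting; highlightAll is a hypothetical helper, and wrapping matches in <em> tags is an assumption made for the example.

```java
import java.util.regex.Pattern;

class HighlightSketch {

  // Wraps every case-insensitive occurrence of the query in <em>...</em>.
  static String highlightAll(String title, String query) {
    return Pattern.compile(Pattern.quote(query), Pattern.CASE_INSENSITIVE)
        .matcher(title)
        .replaceAll("<em>$0</em>"); // $0 is the matched text, original case kept
  }

  public static void main(String[] args) {
    // Prints: <em>Pon</em>Taurus collection
    System.out.println(highlightAll("PonTaurus collection", "pon"));
    // Prints: <em>Pon</em>d Conservation - National <em>Pon</em>d Monitoring Network
    System.out.println(
        highlightAll("Pond Conservation - National Pond Monitoring Network", "pon"));
  }
}
```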

Created: 2013-02-05 18:26:33.389
Updated: 2013-02-05 18:26:33.389
The problem was that, when a partial match occurred, the dataset title that was set did not include the highlighted text. This change tries to correct that problem:

The registry solr index now needs to be rebuilt again.

Comment: The search results are accurate enough for the moment. The auto-suggest and full text search results cannot always match closely, since they use different scoring mechanisms: auto-suggest uses the dataset title only, while full text search uses several fields that can affect the overall score of a result.
Created: 2013-03-06 15:12:28.326
Updated: 2013-03-06 15:12:28.326