Issue 10302

Vernacular name search fails for non-ASCII characters

Reporter: mdoering
Assignee: mdoering
Type: Bug
Summary: Vernacular name search fails for non-ASCII characters
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2011-11-14 11:45:16.378
Updated: 2013-08-29 14:45:01.897
Resolved: 2011-11-17 15:47:33.569
Description: Try a search with non-ASCII characters such as "Közönséges jegenyefenyo", which is a common name for Abies alba:

When doing an expanded search we should see the Catalogue of Life record, but we get no hits at all:özönséges+jegenyefenyo&initDefault=false

Something is probably going wrong with the character encoding here!

Created: 2011-11-15 11:51:32.339
Updated: 2011-11-15 11:51:32.339
Using the asciifolding filter in Solr and the new index with real nub vernacular names didn't solve the issue.
Searching for "Spanische Tanne" now hits the nub usage for Abies pinsapo, but searching for the name "španjolska jela" fails:španjolska+jela

Created: 2011-11-15 14:51:02.291
Updated: 2011-11-15 14:51:02.291
This now seems to be a web portal issue, not a Solr issue.
The direct Solr search with umlauts works fine:örnchen%5E10.0+OR+scientific_name_ft%3A*Eichhörnchen*%5E0.2+OR+vernacular_name_ft%3AEichhörnchen%29&version=2.2&start=0&rows=10&indent=on

Created: 2011-11-15 15:31:09.761
Updated: 2011-11-15 15:31:09.761
Monitoring the HTTP requests shows that the request does not specify any character encoding.
According to the specs, a servlet container has to default to latin1 in that case:

Strangely enough, the application runs fine under Jetty; only Tomcat breaks the encoding.
I will try to force Tomcat to use a UTF-8 default. Otherwise, others have successfully used a servlet filter to enforce UTF-8:

Created: 2011-11-15 15:33:19.513
Updated: 2011-11-15 15:33:19.513
Even Tomcat recommends using a filter, so I will add one.

Created: 2011-11-16 22:00:42.622
Updated: 2011-11-16 22:00:42.622
Setting the request character encoding does not help at all.
It's a servlet container issue: Jetty has no problems, but Tomcat does.

It boils down to two different URL encoding conventions being in use, the older one based on latin1 and the newer one on UTF-8.
Modern browsers seem to encode URLs using the encoding of the web page, which is UTF-8 here. Tomcat then decodes the query string as latin1 and produces garbage.
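
The mismatch can be reproduced outside any servlet container. The following is only a minimal sketch using the JDK's own URLEncoder and URLDecoder (Java 10 or later); the class and method names are made up for illustration and are not part of the portal code. It percent-encodes a search term as UTF-8, the way the browser does, and then decodes it once as latin1 and once as UTF-8:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the portal code base.
public class EncodingMismatchDemo {

    // Browser side on a UTF-8 page: percent-encode the UTF-8 bytes of the term.
    public static String encodeAsBrowser(String term) {
        return URLEncoder.encode(term, StandardCharsets.UTF_8);
    }

    // Container side when it falls back to latin1 for the query string.
    public static String decodeAsLatin1(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1);
    }

    // Container side when it decodes with the same charset the browser used.
    public static String decodeAsUtf8(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Eichhörnchen", written with an escape so the source file encoding does not matter.
        String term = "Eichh\u00f6rnchen";
        String encoded = encodeAsBrowser(term);

        System.out.println(encoded);                 // Eichh%C3%B6rnchen
        System.out.println(decodeAsLatin1(encoded)); // two garbage characters replace the umlaut
        System.out.println(decodeAsUtf8(encoded));   // the original term
    }
}
```

The umlaut travels as two bytes (%C3%B6). A latin1 decoder turns each byte into its own character, so the term that reaches the search backend no longer matches anything in the index.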
See tomcat faq:

The simplest solution, applied for now, is to use POST instead of GET to avoid this.
The most correct solution is to configure Tomcat to respect the request encoding:

"""Set the useBodyEncodingForURI attribute on the <Connector> element in server.xml to true. This will cause the Connector to use the request body's encoding for GET parameters."""
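
For illustration, the setting quoted above might look roughly like this in server.xml. This is only a sketch: the port, protocol and timeout values are the usual sample values from a stock Tomcat configuration and are assumptions here, not the portal's actual deployment settings. Only the useBodyEncodingForURI attribute comes from the FAQ text.

```xml
<!-- Assumed example connector; only useBodyEncodingForURI is the setting discussed above. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           useBodyEncodingForURI="true" />
```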

Comment: We cannot use POST forms, as that would break paging and would not allow linking to queries.
Created: 2011-11-17 14:34:10.22
Updated: 2011-11-17 14:34:10.22