Issue 10302

Vernacular name search fails for non ascii characters

10302
Reporter: mdoering
Assignee: mdoering
Type: Bug
Summary: Vernacular name search fails for non ascii characters
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2011-11-14 11:45:16.378
Updated: 2013-08-29 14:45:01.897
Resolved: 2011-11-17 15:47:33.569
        
Description: Try a search with non-ASCII characters such as "Közönséges jegenyefenyo", which is a common name for Abies alba:
http://staging.gbif.org:8080/portal-web-dynamic/species/105340718/vernaculars

An expanded search should return the Catalogue of Life record, but it returns no hits:
http://staging.gbif.org:8080/portal-web-dynamic/species/search?q=Közönséges+jegenyefenyo&initDefault=false

Something is probably going wrong with the character encoding here!
    


Author: mdoering@gbif.org
Created: 2011-11-15 11:51:32.339
Updated: 2011-11-15 11:51:32.339
        
Using the asciifolding filter in Solr and the new index with real nub vernacular names didn't solve the issue.
Searching for "Spanische Tanne" now hits the nub usage for Abies pinsapo, but searching for "španjolska jela" fails:

http://staging.gbif.org:8080/portal-web-dynamic/species/search?q=Abies+pinsapo
http://staging.gbif.org:8080/portal-web-dynamic/species/search?q=Spanische+Tanne
http://staging.gbif.org:8080/portal-web-dynamic/species/search?q=španjolska+jela
    


Author: mdoering@gbif.org
Created: 2011-11-15 14:51:02.291
Updated: 2011-11-15 14:51:02.291
        
This now seems to be a web portal issue, not a Solr one.
A direct Solr search with umlauts works fine:

http://jawa.gbif.org:8080/solr/select/?q=%28scientific_name_ft%3AEichhörnchen%5E10.0+OR+scientific_name_ft%3A*Eichhörnchen*%5E0.2+OR+vernacular_name_ft%3AEichhörnchen%29&version=2.2&start=0&rows=10&indent=on
    


Author: mdoering@gbif.org
Created: 2011-11-15 15:31:09.761
Updated: 2011-11-15 15:31:09.761
        
Monitoring the HTTP requests shows that the request does not specify any character encoding.
According to the spec, a servlet container has to default to Latin-1 (ISO-8859-1) in that case:
http://tomcat.apache.org/tomcat-7.0-doc/config/filter.html#Add_Default_Character_Set_Filter

Strangely enough, the application runs fine under Jetty; only Tomcat breaks the encoding.
I will try to force Tomcat to use a UTF-8 default. Otherwise, others have successfully used a servlet filter to enforce UTF-8:
http://stackoverflow.com/questions/1958797/struts2-request-character-encoding
http://stackoverflow.com/questions/2381891/parameters-charset-conversion-in-struts2
    


Author: mdoering@gbif.org
Created: 2011-11-15 15:33:19.513
Updated: 2011-11-15 15:33:19.513
        
Even Tomcat recommends using a filter, so I will add one:
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8
    


Author: mdoering@gbif.org
Created: 2011-11-16 22:00:42.622
Updated: 2011-11-16 22:00:42.622
        
Setting the request character encoding does not help at all.
It's a servlet container issue: Jetty has no problems, but Tomcat does.

It boils down to two different URL encoding conventions being in use, the older one based on Latin-1 and the newer one on UTF-8.
Modern browsers seem to encode the URL using the web page's encoding, thus UTF-8. Tomcat then decodes it as Latin-1 and produces garbage.
See the Tomcat FAQ: http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q2
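
The mismatch can be reproduced outside the container. The following is only a minimal sketch using the JDK's URLEncoder and URLDecoder, not the portal's or Tomcat's actual code; the class name UrlEncodingDemo and its helper methods are made up for illustration. It percent-encodes the failing query as UTF-8, the way a browser on a UTF-8 page would, and then decodes it once as Latin-1 and once as UTF-8:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; it only illustrates the charset mismatch.
public class UrlEncodingDemo {

    // Percent-encode a query value as UTF-8 (what a browser on a UTF-8 page sends).
    public static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every JVM must support, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Decode a percent-encoded value with an explicitly chosen charset.
    public static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u0161" is the letter "š"; the escape keeps this file independent of its own encoding.
        String query = "\u0161panjolska jela";
        String encoded = encodeUtf8(query);

        System.out.println(encoded); // %C5%A1panjolska+jela
        // Latin-1 decoding turns the two UTF-8 bytes C5 A1 into two separate characters.
        System.out.println(decode(encoded, "ISO-8859-1").equals(query)); // false
        // UTF-8 decoding restores the original query.
        System.out.println(decode(encoded, "UTF-8").equals(query)); // true
    }
}
```

The Latin-1 result begins with the two characters U+00C5 and U+00A1 instead of "š", so a search for it cannot match the indexed vernacular name.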

The simplest solution, applied for now, is to use POST instead of GET to avoid this.
The most correct solution is to configure Tomcat to respect the request encoding:

"""Set the useBodyEncodingForURI attribute on the <Connector> element in server.xml to true. This will cause the Connector to use the request body's encoding for GET parameters."""
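
As a rough sketch of what that could look like in server.xml: the port is taken from the staging URLs above, the protocol value is an assumption, and a real Connector element will carry further attributes that should be kept as they are.

```xml
<!-- Sketch only: add useBodyEncodingForURI to the existing HTTP connector. -->
<!-- port 8080 matches the staging URLs; protocol is an assumed value. -->
<Connector port="8080"
           protocol="HTTP/1.1"
           useBodyEncodingForURI="true" />
```

As far as I understand the Tomcat documentation, this setting only takes effect when the request actually declares an encoding, either in its content type or via a filter that calls setCharacterEncoding, so it would probably have to be combined with the UTF-8 filter mentioned earlier.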
    


Author: mdoering@gbif.org
Comment: POST forms cannot be used, as they break paging and do not allow queries to be linked.
Created: 2011-11-17 14:34:10.22
Updated: 2011-11-17 14:34:10.22