Issue 12577

gbif-common-search: Strip HTML from fields when building lucene index

Reporter: kbraak
Assignee: mdoering
Type: Improvement
Summary: gbif-common-search: Strip HTML from fields when building lucene index 
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2013-01-08 18:18:51.399
Updated: 2013-12-06 13:16:02.789
Resolved: 2013-03-05 12:08:06.673
        
Description: Take a look at the attached screenshot.

The 2nd search result has a description containing HTML (see http://api.gbif.org/dev/name_usage/7152139/descriptions ). This isn't a portal problem, though: such HTML should be stripped when building the Lucene indices. Ideally, something like http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer should be added to gbif-common-search and used by checklistbank-index-builder and registry-index-builder when building the checklistbank and registry indices, respectively.

Attachment Screen Shot 2013-01-08 at 5.53.54 PM.png



Author: mdoering@gbif.org
Comment: As we don't have any common indexer classes, this can currently only be applied to the individual indexers. Many fields also do not contain any HTML. I will apply the Solr HTMLStripReader within the NameUsageDocConverter to the description field, which contains the most HTML and prompted this issue.
Created: 2013-03-04 17:38:17.815
Updated: 2013-03-04 17:38:17.815
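
For illustration only, the idea in the comment above (strip markup from the description before it goes into the index document) could look roughly like the sketch below. It is not the actual NameUsageDocConverter code: the class name HtmlStripSketch and the helper stripHtml are hypothetical. It also swaps in a plain regular expression for Solr's HTMLStripReader so that the example runs with the JDK alone. A real HTML stripper handles comments, scripts and many more entities than this does.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of gbif-common-search or checklistbank.
class HtmlStripSketch {
    // Naive stand-in for Solr's HTMLStripReader: matches anything that looks like a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the text with tags removed and a handful of common entities decoded. */
    static String stripHtml(String raw) {
        if (raw == null) {
            return null;
        }
        // Replace tags with a space so adjacent words do not run together.
        String noTags = TAG.matcher(raw).replaceAll(" ");
        // Decode &amp; last so that "&amp;lt;" is not decoded twice.
        String decoded = noTags
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        // Collapse the whitespace left behind by removed tags.
        return WHITESPACE.matcher(decoded).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // A converter would call this on the description before adding it to the index document.
        System.out.println(stripHtml("<p>A <b>large</b> tree.</p>"));
    }
}
```

Applying the helper only to the description field matches the observation that most other fields contain no HTML.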


Author: mdoering@gbif.org
Created: 2013-03-04 17:57:37.15
Updated: 2013-03-04 17:57:37.15
        
Since Lucene 4, this newer class is recommended: http://lucene.apache.org/core/4_1_0/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterFactory.html

We can simply declare a Solr field to use it and don't need to touch our code.
    


Author: mdoering@gbif.org
Created: 2013-03-05 12:08:06.747
Updated: 2013-03-05 12:08:06.747
        
Only fixed for checklistbank!

https://code.google.com/p/gbif-ecat/source/detail?r=5380
https://code.google.com/p/gbif-ecat/source/detail?r=5381