Issue 18238

OccurrenceParser badly detects character encodings

18238
Reporter: mdoering
Type: Bug
Summary: OccurrenceParser badly detects character encodings
Description: OccurrenceParser tries to detect the character encoding directly from the gzip file. That is surely wrong, as the detection method expects a text file, not something compressed. It also prefers UTF-8 or Latin-1 whenever no exception is thrown, ignoring any user-specified or detected encoding, which might lead to wrong characters.
Priority: Major
Status: Open
Created: 2016-02-17 15:02:58.387
Updated: 2016-02-17 15:43:20.168
Resolved: 2016-02-17 15:02:58.377
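
A minimal sketch of the ordering the description argues for: decompress first, then pick a charset, trying the user-declared one before any fallback. This is not the OccurrenceParser code. The class and method names (CharsetSketch, gunzip, gzip, decodesCleanly, chooseCharset) are hypothetical, and the fallback order (declared, then UTF-8, then ISO-8859-1) is an assumption for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of any GBIF library.
public class CharsetSketch {

    /** Decompress fully so that detection sees text bytes, not compressed bytes. */
    static byte[] gunzip(InputStream compressed) {
        try (GZIPInputStream in = new GZIPInputStream(compressed);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo-only helper that gzips a byte array in memory. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** True if the bytes decode under cs with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Assumed order: declared charset first, then UTF-8, then ISO-8859-1 as a last resort. */
    static Charset chooseCharset(byte[] data, Charset declared) {
        if (declared != null && decodesCleanly(data, declared)) {
            return declared;
        }
        if (decodesCleanly(data, StandardCharsets.UTF_8)) {
            return StandardCharsets.UTF_8;
        }
        // ISO-8859-1 maps every byte value, so it never fails to decode.
        return StandardCharsets.ISO_8859_1;
    }
}
```

A caller would run something like `chooseCharset(gunzip(stream), declaredCharset)` and only then build a reader. Note that a clean decode does not prove the charset is right: ISO-8859-1 accepts any byte sequence, which is one reason a declared encoding is checked first here rather than overridden.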


Author: mblissett
Created: 2016-02-17 15:05:05.084
Updated: 2016-02-17 15:05:05.084
        
"charsets are a nightmare and users can't be trusted"

We could probably run that detection on all the DWCAs we have downloaded and, unless it would cause problems, use the user's specified charset first.

It seems unlikely that ignorant users are using UTF-8.
    


Author: mdoering@gbif.org
Created: 2016-02-17 15:43:20.168
Updated: 2016-02-17 15:43:20.168
        
This is for XML only, so BioCASe, TAPIR & DiGIR.
dwca-io does its own detection and there are lots of tests