Issue 18238
OccurrenceParser badly detects character encodings
Reporter: mdoering
Type: Bug
Summary: OccurrenceParser badly detects character encodings
Description: OccurrenceParser tries to detect the character encoding directly from the gzip file. That is surely wrong, as the detection method expects a text file, not something compressed. It also prefers UTF-8 or Latin-1 whenever no exception is thrown, ignoring any user-specified or detected encoding, which can lead to wrong characters.
Priority: Major
Status: Open
Created: 2016-02-17 15:02:58.387
Updated: 2016-02-17 15:43:20.168
Resolved: 2016-02-17 15:02:58.377
Author: mblissett
Created: 2016-02-17 15:05:05.084
Updated: 2016-02-17 15:05:05.084
"charsets are a nightmare and users can't be trusted"
We could probably run that detection on all the DWCAs we have downloaded and, unless that would cause problems, use the user's specified charset first.
It seems unlikely that ignorant users are using UTF-8.
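A minimal sketch of the order of operations discussed above: decompress first, then try the user's charset before any fallback. This is not the OccurrenceParser code; the class `CharsetSketch` and the methods `gunzip`, `decodesCleanly`, `chooseCharset` and `gzip` are hypothetical names made up for illustration, and only standard JDK classes are used. Strict decoding is a weak test (ISO-8859-1 accepts any byte sequence), so treat the result as a hint rather than proof.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CharsetSketch {

  /** Decompresses a gzip stream fully into memory (fine for a sketch, not for huge files). */
  static byte[] gunzip(InputStream compressed) throws IOException {
    try (GZIPInputStream in = new GZIPInputStream(compressed);
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }

  /** Returns true if the bytes decode in the given charset without malformed or unmappable input. */
  static boolean decodesCleanly(byte[] text, Charset cs) {
    try {
      cs.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(text));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  /**
   * Picks a charset for already decompressed text: the user's charset first (if given),
   * then UTF-8, and ISO-8859-1 as the last resort because it can decode any byte sequence.
   */
  static Charset chooseCharset(byte[] text, Charset userCharset) {
    List<Charset> candidates = new ArrayList<>();
    if (userCharset != null) {
      candidates.add(userCharset);
    }
    candidates.add(StandardCharsets.UTF_8);
    for (Charset cs : candidates) {
      if (decodesCleanly(text, cs)) {
        return cs;
      }
    }
    return StandardCharsets.ISO_8859_1;
  }

  /** Demo helper: gzips a string in the given charset to simulate a compressed data file. */
  static byte[] gzip(String s, Charset cs) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(s.getBytes(cs));
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    // "Müller" stored as Latin-1, then gzipped, as a stand-in for a real archive file.
    byte[] compressed = gzip("M\u00fcller", StandardCharsets.ISO_8859_1);
    // Detection runs on the decompressed bytes, never on the gzip container itself.
    byte[] text = gunzip(new ByteArrayInputStream(compressed));
    System.out.println(chooseCharset(text, null));
    System.out.println(chooseCharset(text, StandardCharsets.UTF_8));
  }
}
```

Putting the user's charset at the head of the candidate list means a declared encoding wins whenever the data is at least consistent with it; the fallbacks only apply when the declaration is missing or demonstrably wrong.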