The encoding of characters with accents is not correct
15408
Reporter: atalavan
Assignee: jlegind
Type: Task
Summary: The encoding of characters with accents is not correct
Status: InProgress
Created: 2014-03-20 15:59:03.918
Updated: 2014-04-23 17:29:47.104
Description: 'Aldeadávila' and 'georreferenciación' are not correctly displayed.
*Reporter*: Alberto González
*E-mail*: [mailto:atalavan]
Author: mdoering@gbif.org
Created: 2014-03-20 16:02:13.895
Updated: 2014-03-20 16:02:13.895
Indeed, but this is very, very likely a problem with the source, as it's working in many, many other datasets and it's a typical configuration error.
Author: atalavan
Comment: Is there something we can do about it? Is it IPT related? Can we improve the documentation? Is it something that we can ask the Node managers to check before endorsing a dataset?
Created: 2014-03-20 16:06:20.573
Updated: 2014-03-20 16:06:20.573
Author: kbraak@gbif.org
Created: 2014-03-20 16:46:17.933
Updated: 2014-03-20 16:46:17.933
The IPT based publisher should verify the encoding of their source data, and make sure the source is configured correctly. See https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes?tm=6#Source_Data
Indeed, the problem lies at the publisher's end.
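As an illustration of the kind of check a publisher could run before configuring a source, here is a minimal Python sketch that tests whether a file's bytes decode cleanly as UTF-8. The function name `is_valid_utf8` is made up for this example; it is not part of the IPT.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same word saved in two different encodings.
good = "Aldeadávila".encode("utf-8")
bad = "Aldeadávila".encode("latin-1")  # a lone 0xE1 byte is not valid UTF-8

print(is_valid_utf8(good))
print(is_valid_utf8(bad))
```

For a real file, the bytes would come from `open(path, "rb").read()`. Note that this only catches files that are not UTF-8 at all; a file that was decoded with the wrong encoding and then saved again as UTF-8 is still valid UTF-8 and would pass.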
Author: mdoering@gbif.org
Comment: I've verified the archive's text files and they are wrongly encoded UTF-8 files. See the attached screenshot of a good UTF-8 file and then the archive in question.
Created: 2014-03-20 21:12:53.697
Updated: 2014-03-20 21:12:53.697
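The thread does not say exactly how the archive's files went wrong. One common cause of "wrongly encoded" UTF-8 is that UTF-8 bytes were read as Latin-1 and then saved again as UTF-8. The Python sketch below reproduces that effect on the two words from the description, under that assumption only; the helper names are hypothetical.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate reading UTF-8 bytes with the wrong (Latin-1) decoder."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(text: str) -> str:
    """Try to undo one round of UTF-8-read-as-Latin-1 damage.

    Returns the input unchanged if it does not look like that kind of damage.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


for word in ("Aldeadávila", "georreferenciación"):
    broken = misread_as_latin1(word)
    print(word, "->", broken, "->", repair_mojibake(broken))
```

A repair like this is a heuristic and is best applied at the source by the publisher, since short or unusual strings can be ambiguous.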
Author: atalavan
Created: 2014-03-21 09:29:11.072
Updated: 2014-03-21 09:29:11.072
I think my point remains. In the GBIF data portal some information is displayed incorrectly and we should be able to do something about it. Instead of closing this issue, this should be reassigned to some other jira issue where this problem can be addressed. We have many Work Programme and service related Jiras, so there must be something we can do.
Maybe Andrea could have a look at this kind of workflow?
Author: kbraak@gbif.org
Comment: Issue moved to Data Management, and assigned to [~jlegind@gbif.org] to follow up with the publisher directly.
Created: 2014-03-21 09:47:50.267
Updated: 2014-03-21 09:47:50.267
Author: jlegind@gbif.org
Comment: [~atalavan] The publisher has been contacted about the character encoding issue.
Created: 2014-03-27 10:45:24.12
Updated: 2014-03-27 10:45:24.12
Author: atalavan
Created: 2014-04-08 11:01:19.481
Updated: 2014-04-08 11:01:19.481
More about encoding:
https://twitter.com/BlasMBenito/status/452772867093303296
Tweet from one of the main trainers of ENM courses in Spain, who collaborates frequently with the node.
Author: atalavan
Comment: The Node Manager of GBIF Portugal has recently highlighted issues with encoding, as they affect all the Nodes involved in the mentoring project between Portugal, France and Spain. It seems that the topic is raising interest these days.
Created: 2014-04-23 16:42:52.818
Updated: 2014-04-23 16:42:52.818
Author: ahahn@gbif.org
Created: 2014-04-23 17:04:23.403
Updated: 2014-04-23 17:15:24.253
Reading the mail you are referring to in your latest comment (Node Manager GBIF Portugal, 11.2.2014), the issue does not seem to be on the side of GBIF tools or index any longer, but mostly on the side of data input and data digestion. Some issues mentioned there are:
- downloads: when imported into Excel, special characters get malformed (client UI issue, also applies to some db clients - any client application will need to be "told" that the source data come in utf-8)
- data entry: editors with unsuitable keyboards, or who are not familiar with accented characters, fall back on the "plain" version during data entry
- portal "fuzzy search": does not appear to be lenient enough looking for possible variants with/without accented characters (need to verify)
- indexing workflow (?): are we too generous letting non-utf-8 characters through? (need to verify)
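To make the "with/without accented characters" point concrete, here is a minimal Python sketch of accent-insensitive matching using the standard `unicodedata` module. It is only an illustration of the idea, not how the portal search works; `fold_accents` and `loosely_equal` are hypothetical names.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks and case so 'Aldeadávila' matches 'aldeadavila'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring accents and case."""
    return fold_accents(a) == fold_accents(b)


print(fold_accents("Aldeadávila"))
print(loosely_equal("georreferenciacion", "georreferenciación"))
```

Folding of this kind is lossy (for example, "ñ" becomes "n"), so it suits lenient search keys rather than stored or displayed values.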
The main point in the mail is that users and data editors need to be better educated in dealing with special characters in data digestion and data entry.
As such, it is not a data management issue at all, apart from the specific dataset at hand. [~atalavan], [~jlegind@gbif.org]: I'd suggest opening a new sub-story under the documentation WP epic (http://dev.gbif.org/issues/browse/GBIF-24) and, if required, a new issue each to check on a) data exports and b) indexing for non-utf-8 characters, and closing the item here as soon as the encoding issue at source has been fixed.