Issue 14611
Bad character encodings
14611
Reporter: feedback bot
Assignee: jlegind
Type: Task
Summary: Bad character encodings
Status: Open
Created: 2014-01-10 16:33:36.667
Updated: 2014-01-13 11:49:00.877
Description: This is a follow-up to http://dev.gbif.org/issues/browse/DM-181: I realized these sorts of problems could be searched for more systematically. We have a local image of the old GBIF, so I tried:
select institution_code, collection_code, catalogue_number, locality from raw_occurrence_record where locality like _utf8 '%Ã%' collate utf8_bin and institution_code != 'nrm' limit 30;
If the A-tilde gets corrupted, try this:
select institution_code, collection_code, catalogue_number, locality from raw_occurrence_record where locality like _utf8 x'25C38325' collate utf8_bin and institution_code != 'nrm' limit 30;
(I wasn't able to find any of the bad records from NRM in the current GBIF, so that's why it's excluded.)
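A note on why the second query is equivalent: the hex literal x'25C38325' is simply the UTF-8 byte sequence for '%Ã%', so it still works when a client or terminal mangles the literal character. The Python sketch below is an illustration added to this report, not something that was run against the GBIF database. It decodes the literal and shows how a correctly encoded "å" becomes "Ã¥" when its UTF-8 bytes are read as Latin-1. The Latin-1 step is an assumption; the publishers' data may have passed through a different code page.

```python
# Decode the hex literal from the second query as UTF-8.
pattern = bytes.fromhex("25C38325").decode("utf-8")
print(pattern)  # percent sign, A-tilde (U+00C3), percent sign

# Assumed corruption path: UTF-8 bytes misread as Latin-1.
original = "fr\u00e5n"  # the Swedish word with a-ring (U+00E5)
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # a-ring has become A-tilde followed by the yen sign (U+00A5)
```

This is why searching for A-tilde finds these records: the UTF-8 lead byte 0xC3 used for most accented Latin letters is displayed as "Ã" once the bytes are misinterpreted.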
Some of these are Portuguese and are fine, but others are clearly problematic, for example,
| OHN | OHN | OHN 100457 | Ãstersjövägen 78. KvarstÃ¥ende frÃ¥n plantskola.
... and for this one, the record in the live GBIF has the bad encoding, too: http://www.gbif.org/occurrence/search?CATALOG_NUMBER=OHN+100457
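To separate the legitimate Portuguese hits from the damaged ones, and to see what a repaired value would look like, something along these lines might help. It is a minimal sketch under the assumption that the damage is UTF-8 read as Latin-1; the names `looks_double_encoded` and `repair` are hypothetical helpers, not part of any GBIF tool. Values that do not round-trip cleanly are returned unchanged.

```python
import re

# A-tilde followed by a character in U+0080..U+00BF, i.e. a UTF-8
# continuation byte that was displayed as a Latin-1 character.
SUSPECT = re.compile("\u00c3[\u0080-\u00bf]")


def looks_double_encoded(text: str) -> bool:
    """Heuristic check for UTF-8 text that was misread as Latin-1."""
    return SUSPECT.search(text) is not None


def repair(text: str) -> str:
    """Reverse the assumed misreading; return the input unchanged on failure."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Part of the OHN 100457 locality quoted above, written with escapes.
fragment = "Kvarst\u00c3\u00a5ende fr\u00c3\u00a5n plantskola."
print(looks_double_encoded(fragment), repair(fragment))

# Upper-case Portuguese contains a real A-tilde and should not be flagged.
portuguese = "S\u00c3O PAULO"
print(looks_double_encoded(portuguese), repair(portuguese))
```

The check is only a heuristic, so flagged records would still need a manual look before anyone asks a publisher to change source data.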
Here are a couple more examples:
http://www.gbif.org/occurrence/search?CATALOG_NUMBER=NC083353
http://www.gbif.org/occurrence/2245560
Sorry if this is overkill.
*E-mail*: [mailto:cmccallum, at fas dot harvard dot edu]
Author: kbraak@gbif.org
Created: 2014-01-13 11:48:55.416
Updated: 2014-01-13 11:48:55.416
Jan, can you please confirm the following publishers have encoding problems in their source data, and then help them resolve these problems?
1. See locality in http://www.gbif.org/occurrence/120137 (TAPIR dataset: http://www.gbif.org/dataset/41b59050-0a12-11dd-953d-b8a03c50a862)
2. See locality in http://www.gbif.org/occurrence/230999006 (BioCASE dataset: http://www.gbif.org/dataset/865df020-f762-11e1-a439-00145eb45e9a)
3. See locality in http://www.gbif.org/occurrence/2245560 (TAPIR dataset: http://www.gbif.org/dataset/50b31640-0c6f-11dd-84d3-b8a03c50a862)