Issue 14611

Bad character encodings

Reporter: feedback bot
Assignee: jlegind
Type: Task
Summary: Bad character encodings
Status: Open
Created: 2014-01-10 16:33:36.667
Updated: 2014-01-13 11:49:00.877
        
        
Description: This is a follow-up to http://dev.gbif.org/issues/browse/DM-181. I realized these sorts of problems could be searched for more systematically. We have a local image of the old GBIF database, so I tried:

select institution_code, collection_code, catalogue_number, locality from raw_occurrence_record where locality like _utf8 '%Ã%'  collate utf8_bin and institution_code != 'nrm' limit 30;

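If the A-tilde in the query above gets corrupted (for example, when copied out of this message), try this equivalent query, which spells out the same pattern as hex bytes: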

select institution_code, collection_code, catalogue_number, locality from raw_occurrence_record where locality like _utf8 x'25C38325'  collate utf8_bin and institution_code != 'nrm' limit 30;

(I wasn't able to find any of the bad records from NRM in the current GBIF, so that's why it's excluded.)
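
To make the link between the two queries explicit, here is a minimal Python sketch. It assumes the damage comes from UTF-8 bytes being read as Latin-1 (or a similar single-byte encoding), which the ticket does not confirm. It shows that the hex literal x'25C38325' is just the UTF-8 encoding of the pattern '%Ã%', and why a correctly encoded "ö" ends up containing an A-tilde.

```python
# The hex literal from the second query, decoded as UTF-8.
pattern = bytes.fromhex("25C38325").decode("utf-8")
print(pattern)  # the LIKE pattern: percent, A-tilde, percent

# Assumed cause of the bad records: UTF-8 bytes misread as Latin-1.
# "ö" is the two bytes C3 B6 in UTF-8; read one byte at a time,
# C3 becomes A-tilde and B6 becomes the pilcrow-like sign.
garbled_o_umlaut = "ö".encode("utf-8").decode("latin-1")
print(garbled_o_umlaut)

# Every character in U+00C0..U+00FF has C3 as its first UTF-8 byte,
# so searching for A-tilde catches most damaged Western European text.
lead_bytes = {chr(cp).encode("utf-8")[0] for cp in range(0xC0, 0x100)}
print(lead_bytes == {0xC3})
```

This is also why the query returns legitimate Portuguese records: an A-tilde that was meant to be there matches the same pattern.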

Some of these are Portuguese and are fine, but others are clearly problematic, for example,

| OHN              | OHN             | OHN 100457       | Östersjövägen 78. Kvarstående från plantskola.

... and for this one, the record in the live GBIF has the bad encoding, too: http://www.gbif.org/occurrence/search?CATALOG_NUMBER=OHN+100457
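
One way to separate the legitimate Portuguese hits from the damaged ones is to try reversing the suspected mis-decoding and see whether the result is valid UTF-8. The sketch below is a heuristic under that assumption, not a confirmed description of how these records were damaged. The function name `repair_mojibake` is hypothetical, and the intended locality string is inferred from the example above.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8-read-as-Latin-1 damage; return text unchanged if it does not apply."""
    try:
        # Turn each character back into the single byte it came from,
        # then read those bytes as the UTF-8 they were meant to be.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards:
        # the text was probably fine to begin with.
        return text


# Assumed intended value of the OHN 100457 locality.
correct = "Östersjövägen 78. Kvarstående från plantskola."
# Reproduce the damage rather than pasting it, so no bytes are lost in transit.
garbled = correct.encode("utf-8").decode("latin-1")

print(repair_mojibake(garbled) == correct)
# A genuine A-tilde followed by a plain letter is not valid UTF-8
# after the round trip, so the text is left alone.
print(repair_mojibake("SÃO PAULO"))
```

If the source system used Windows-1252 rather than Latin-1, the same idea applies with `"cp1252"` as the encoding name, although a few byte values are undefined there and would need separate handling.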

Here are a couple more examples:

http://www.gbif.org/occurrence/search?CATALOG_NUMBER=NC083353
http://www.gbif.org/occurrence/2245560

Sorry if this is overkill.
*E-mail*: [mailto:cmccallum, at fas dot harvard dot edu]
    


Author: kbraak@gbif.org
Created: 2014-01-13 11:48:55.416
Updated: 2014-01-13 11:48:55.416
        
Jan, can you please confirm the following publishers have encoding problems in their source data, and then help them resolve these problems?

1. See locality in http://www.gbif.org/occurrence/120137 (TAPIR dataset: http://www.gbif.org/dataset/41b59050-0a12-11dd-953d-b8a03c50a862)

2. See locality in http://www.gbif.org/occurrence/230999006 (BioCASE dataset: http://www.gbif.org/dataset/865df020-f762-11e1-a439-00145eb45e9a)

3. See locality in http://www.gbif.org/occurrence/2245560 (TAPIR dataset: http://www.gbif.org/dataset/50b31640-0c6f-11dd-84d3-b8a03c50a862)