Issue 15550

Registry WS not returning UTF-8

15550
Reporter: trobertson
Assignee: kbraak
Type: Bug
Summary: Registry WS not returning UTF-8
Priority: Blocker
Resolution: WontFix
Status: Resolved
Created: 2014-04-24 09:40:52.525
Updated: 2016-04-04 11:40:17.045
Resolved: 2014-04-24 14:08:08.154
        
Description: This might need investigating from the IPT right through, but from Dave Martin:
"http://api.gbif.org/v0.9/dataset?type=OCCURRENCE&country=SPAIN
isn't returning UTF-8"

Assigning to [~kbraak] since it probably needs to be traced back to the IPT itself to find the cause. Once we know the cause, we should consider sharing how this happened so we can watch for it again.
    


Author: mdoering@gbif.org
Created: 2014-04-24 10:39:07.459
Updated: 2014-04-24 10:39:28.636
        
We do not return the encoding in a header, just the JSON content type: Content-Type: application/json
JSON defaults to UTF-8 but allows other Unicode encodings too, so maybe we should add the encoding to the Content-Type header:
{{Content-Type: application/json; charset=utf-8}}

From http://www.ietf.org/rfc/rfc4627.txt:

3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8
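
To make the table above concrete, here is a minimal Python sketch of the null-pattern check. The function name {{detect_json_encoding}} is made up for illustration and is not part of the registry code; it assumes the payload has no byte order mark and falls back to UTF-8 when fewer than four octets are available.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON text from its first four octets.

    Follows the null-pattern table in RFC 4627, section 3.
    Assumes there is no byte order mark at the start of the data.
    """
    if len(data) < 4:
        # Too short to inspect; fall back to the JSON default.
        return "utf-8"

    b0, b1, b2, b3 = data[:4]

    if b0 == 0 and b1 == 0 and b2 == 0 and b3 != 0:
        return "utf-32-be"  # 00 00 00 xx
    if b0 == 0 and b1 != 0 and b2 == 0 and b3 != 0:
        return "utf-16-be"  # 00 xx 00 xx
    if b0 != 0 and b1 == 0 and b2 == 0 and b3 == 0:
        return "utf-32-le"  # xx 00 00 00
    if b0 != 0 and b1 == 0 and b2 != 0 and b3 == 0:
        return "utf-16-le"  # xx 00 xx 00
    return "utf-8"          # xx xx xx xx


if __name__ == "__main__":
    sample = '{"country": "ES"}'
    for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        print(codec, "->", detect_json_encoding(sample.encode(codec)))
```

A client that runs a check like this does not need the charset parameter at all, which is why the header is optional for JSON consumers. Browsers rendering the raw response do not necessarily do this.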

    


Author: mdoering@gbif.org
Comment: In the API call above I cannot find any bad characters. Can someone point out the exact problem?
Created: 2014-04-24 10:43:44.183
Updated: 2014-04-24 10:43:44.183


Author: kbraak@gbif.org
Created: 2014-04-24 11:14:17.671
Updated: 2014-04-24 11:14:17.671
        
The problem could be restricted to the display in browsers.

On first load in my browsers, the API call above displays with the wrong encoding. Selecting the right Unicode encoding in the browser naturally fixes the bad characters.

One of those datasets (e.g. http://www.gbif.es:8080/ipt/resource.do?r=leb-lichen) displays fine in the portal:

http://www.gbif.org/dataset/2a89fac8-e079-419e-8883-70a4cd9c25e1

Markus' suggestion to add the charset to the Content-Type header would be needed to fix it. That said, we already honor the default JSON encoding, so the header would only be stating the obvious.
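
For anyone who wants to reproduce the display problem outside a browser, the following Python sketch shows the likely mechanism. It is an illustration only: the dataset title is invented, {{decode_body}} is a hypothetical helper, and the Latin-1 fallback is an assumption about what a browser might pick when no charset is declared.

```python
from email.message import Message


def charset_from_content_type(content_type: str):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(raw: bytes, content_type: str, fallback: str) -> str:
    """Decode a response body the way a simple client might.

    Uses the declared charset when present; otherwise uses `fallback`,
    which stands in for whatever the client guesses (an assumption).
    """
    charset = charset_from_content_type(content_type) or fallback
    return raw.decode(charset)


if __name__ == "__main__":
    # Invented payload; the server sends correct UTF-8 bytes either way.
    raw = '{"title": "Líquenes de España"}'.encode("utf-8")

    # No charset declared and the client guesses Latin-1: mojibake.
    print(decode_body(raw, "application/json", fallback="latin-1"))

    # Charset declared: the fallback is never consulted.
    print(decode_body(raw, "application/json; charset=utf-8", fallback="latin-1"))
```

The bytes are identical in both calls, which matches what we see here: the data is valid UTF-8 and only the client-side guess differs.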

 
    


Author: kbraak@gbif.org
Comment: Closing as Won't Fix.
Created: 2014-04-24 14:08:08.174
Updated: 2014-04-24 14:08:08.174