15550
Reporter: trobertson
Assignee: kbraak
Type: Bug
Summary: Registry WS not returning UTF-8
Priority: Blocker
Resolution: WontFix
Status: Resolved
Created: 2014-04-24 09:40:52.525
Updated: 2016-04-04 11:40:17.045
Resolved: 2014-04-24 14:08:08.154
Description: This might need investigating from the IPT right through, but from Dave Martin:
" http://api.gbif.org/v0.9/dataset?type=OCCURRENCE&country=SPAIN
isnt returning UTF-8"
Assigning to [~kbraak] since it probably needs to be traced to the IPT itself to find the cause. When we know it, we should consider sharing the information about how this happened, so we can watch for it again.
Author: mdoering@gbif.org
Created: 2014-04-24 10:39:07.459
Updated: 2014-04-24 10:39:28.636
We do not return the encoding in a header, only the JSON content type: {{Content-Type: application/json}}
JSON defaults to UTF-8 but allows other Unicode encodings too, so maybe we should add the encoding to the Content-Type header:
{{Content-type: application/json; charset=utf-8}}
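As a rough illustration of the proposed header change (not the registry's actual code; the helper name {{with_charset}} is hypothetical), a minimal Python sketch that appends the charset parameter only when the header value does not already carry one:

```python
from email.message import Message


def with_charset(content_type: str, charset: str = "utf-8") -> str:
    """Return the Content-Type value with a charset parameter added if missing.

    Hypothetical helper for illustration; the default of UTF-8 matches the
    JSON default discussed above.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    if msg.get_content_charset() is not None:
        return content_type
    return f"{content_type}; charset={charset}"


if __name__ == "__main__":
    print(with_charset("application/json"))
    print(with_charset("application/json; charset=utf-8"))
```

A value that already declares a charset is returned unchanged, so applying the helper twice is harmless.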
From http://www.ietf.org/rfc/rfc4627.txt:
3. Encoding
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
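The null-pattern table quoted above can be sketched in Python roughly as follows. This is an illustrative reading of the RFC 4627 rule, assuming the payload has no byte order mark and starts with two ASCII characters; the function name {{detect_json_encoding}} is made up for this example.

```python
import json


def detect_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON text from its first four octets."""
    if len(data) < 4:
        # Too short for UTF-16/UTF-32 detection; fall back to the default.
        return "utf-8"
    # True marks a null octet, mirroring the 00 / xx columns of the table.
    nulls = tuple(b == 0 for b in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    return patterns.get(nulls, "utf-8")           # xx xx xx xx


if __name__ == "__main__":
    text = '{"country": "España"}'
    for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        payload = text.encode(codec)
        guessed = detect_json_encoding(payload)
        print(codec, "->", guessed, json.loads(payload.decode(guessed)))
```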
Author: mdoering@gbif.org
Comment: In the api call above I cannot find any bad character, can someone point out the exact problem?
Created: 2014-04-24 10:43:44.183
Updated: 2014-04-24 10:43:44.183
Author: kbraak@gbif.org
Created: 2014-04-24 11:14:17.671
Updated: 2014-04-24 11:14:17.671
The problem could be restricted to the display in browsers.
On first load in my browsers, the API call above displays with the wrong encoding. Selecting the correct Unicode encoding in the browser fixes the bad characters.
Looking at one of those datasets (e.g. http://www.gbif.es:8080/ipt/resource.do?r=leb-lichen), it displays fine in the portal:
http://www.gbif.org/dataset/2a89fac8-e079-419e-8883-70a4cd9c25e1
Markus' suggestion to add the charset to the Content-Type header is what is needed to fix it. That said, we already honor the default JSON encoding, so the header would just be stating the obvious.
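To illustrate the display problem, here is a small Python sketch of a client that decodes the response body with the declared charset when there is one and otherwise falls back to a legacy single-byte encoding. The fallback to cp1252 is an assumption made for the example (the browsers' actual fallback was not recorded here), and {{decode_body}} is a hypothetical name.

```python
from typing import Optional


def decode_body(body: bytes, charset: Optional[str] = None,
                fallback: str = "cp1252") -> str:
    """Decode a response body, using the fallback when no charset is declared.

    The cp1252 fallback is an assumed stand-in for a browser's legacy default.
    """
    return body.decode(charset or fallback)


if __name__ == "__main__":
    body = '{"country": "España"}'.encode("utf-8")  # what the service sends
    print(decode_body(body))            # no charset declared: garbled text
    print(decode_body(body, "utf-8"))   # charset declared: correct text
```

Without a declared charset the two UTF-8 octets of "ñ" are rendered as two separate characters ("Ã±"), which matches the kind of bad characters seen on first load. The bytes on the wire are identical in both cases.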