Issue 10659
registry-metadata-service: Convert 3 letter language ISO code into 2 letter language ISO code during eml parsing
Reporter: kbraak
Assignee: kbraak
Type: Bug
Summary: registry-metadata-service: Convert 3 letter language ISO code into 2 letter language ISO code during eml parsing
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-01-20 18:57:28.791
Updated: 2013-12-16 17:50:34.028
Resolved: 2012-01-27 11:25:06.16
Description: Dataset language and metadata language are both fields of type Locale (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/Locale.html). The language must be entered as a 2-letter ISO 639-1 string. During parsing of metadata in registry-metadata-service, 3-letter codes should be converted to their 2-letter equivalents before these fields are set.
Supplementary info:
CLB uses the IPT vocab:
http://rs.gbif.org/vocabulary/iso/639-1.xml
This has 2- and 3-letter codes plus English titles
http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-checklistbank/src/main/java/org/gbif/checklistbank/service/impl/ThesaurusImpl.java
Attachment iso639.txt
Attachment iso-codes debian.zip
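Editorial sketch of the conversion requested in the description, for illustration only. It is not the code in registry-metadata-service: the class name Iso639Converter and the method toTwoLetter are hypothetical, and the lookup table is derived from the JDK's own Locale data rather than from the IPT vocabulary linked above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

// Hypothetical helper; names are assumptions, not the service's real API.
final class Iso639Converter {
    // 3-letter code (as reported by the JDK) -> 2-letter ISO 639-1 code.
    private static final Map<String, String> THREE_TO_TWO = new HashMap<>();

    static {
        for (String two : Locale.getISOLanguages()) {
            try {
                THREE_TO_TWO.put(new Locale(two).getISO3Language(), two);
            } catch (MissingResourceException e) {
                // The JDK has no 3-letter code for this language; skip it.
            }
        }
    }

    /** Returns a 2-letter code, or null if a 3-letter input is unknown. */
    static String toTwoLetter(String code) {
        if (code == null) {
            return null;
        }
        String c = code.trim().toLowerCase(Locale.ROOT);
        if (c.length() == 2) {
            return c; // already 2 letters; not validated here
        }
        return THREE_TO_TWO.get(c);
    }

    public static void main(String[] args) {
        System.out.println(toTwoLetter("dan"));
        System.out.println(toTwoLetter("deu"));
    }
}
```

The JDK reports the terminology form of the 3-letter codes (for example "deu" for German), so bibliographic variants would need extra entries.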
Author: mdoering@gbif.org
Comment: Is Locale the right datatype if we only have, and only want, the language?
Created: 2012-01-20 21:41:14.428
Updated: 2012-01-20 21:41:14.428
Author: mdoering@gbif.org
Comment: ISO code translations into many languages, based on Debian pot files
Created: 2012-01-20 23:39:23.189
Updated: 2012-01-20 23:39:23.189
Author: kbraak@gbif.org
Comment: To quote the Locale documentation, the locale is just a mechanism for identifying objects, not a container for the objects themselves. The payoff of using a Locale is that you can query it for information about itself. For language, you can use getDisplayLanguage to get the name of the language in a form suitable for displaying to the user. This makes it an appropriate datatype even if we only want the language. However, a Locale object is just an identifier for a region, and no validity check is performed when you construct one, so we have to be careful to use a valid 2-letter ISO language code on construction. For example, you must use "da" for Danish. If you use "dan", it isn't understood, and getDisplayLanguage just returns the string "dan" instead of Danish.
Created: 2012-01-23 11:09:19.613
Updated: 2012-01-23 11:09:19.613
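Editorial sketch of the validity check that the Locale constructor does not perform, for illustration only. The class LanguageCodeCheck and its methods are hypothetical names; the check assumes that membership in Locale.getISOLanguages() is an acceptable definition of a valid 2-letter code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical guard around Locale construction.
final class LanguageCodeCheck {
    // All 2-letter language codes known to the running JDK.
    private static final Set<String> ISO_639_1 =
            new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    /** True only for lowercase 2-letter codes the JDK lists. */
    static boolean isIso6391(String code) {
        return code != null && ISO_639_1.contains(code);
    }

    /** Builds a Locale, rejecting codes such as "dan" up front. */
    static Locale safeLocale(String code) {
        if (!isIso6391(code)) {
            throw new IllegalArgumentException("Not a 2-letter ISO 639-1 code: " + code);
        }
        return new Locale(code);
    }
}
```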
Author: mdoering@gbif.org
Comment: What about maintaining a language (and country?) enum as part of the common API? At some point we need to maintain a list of accepted codes somewhere, so why not do this in Java if it helps type safety?
Created: 2012-01-23 11:29:34.415
Updated: 2012-01-23 11:29:34.415
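Editorial sketch of what such an enum might look like, for illustration only. The enum name Language, its five entries and the lookup method fromIsoCode are all assumptions; a real list would cover every accepted code.

```java
import java.util.Locale;

// Hypothetical enum; only a handful of languages are listed.
enum Language {
    DANISH("da", "dan"),
    GERMAN("de", "deu"),
    ENGLISH("en", "eng"),
    SPANISH("es", "spa"),
    FRENCH("fr", "fra");

    private final String iso2;
    private final String iso3;

    Language(String iso2, String iso3) {
        this.iso2 = iso2;
        this.iso3 = iso3;
    }

    String getIso2LetterCode() {
        return iso2;
    }

    String getIso3LetterCode() {
        return iso3;
    }

    /** Converts to a Locale using the 2-letter code. */
    Locale toLocale() {
        return new Locale(iso2);
    }

    /** Case-insensitive lookup by 2- or 3-letter code; null if unknown. */
    static Language fromIsoCode(String code) {
        if (code == null) {
            return null;
        }
        String c = code.trim().toLowerCase(Locale.ROOT);
        for (Language language : values()) {
            if (language.iso2.equals(c) || language.iso3.equals(c)) {
                return language;
            }
        }
        return null;
    }
}
```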
Author: mdoering@gbif.org
Comment: A list of all ISO 639-1 language codes mapped to the English display name, the native name and a semicolon-concatenated list of unique names in all ISO languages, generated using the Java Locale class
Created: 2012-01-23 13:39:02.222
Updated: 2012-01-23 13:39:02.222
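Editorial sketch of how a list of this shape could be generated with the Locale class, for illustration only. It is not the script used to produce the attachment: the class name Iso639List and the tab-separated layout are assumptions, and the names printed depend on the locale data shipped with the running JDK.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical generator; column layout is an assumption.
final class Iso639List {
    /** One row: code, English name, native name, all unique names. */
    static String row(String code) {
        Locale locale = new Locale(code);
        // Collect the language's name as displayed in every ISO language.
        Set<String> names = new LinkedHashSet<>();
        for (String other : Locale.getISOLanguages()) {
            names.add(locale.getDisplayLanguage(new Locale(other)));
        }
        return String.join("\t",
                code,
                locale.getDisplayLanguage(Locale.ENGLISH),
                locale.getDisplayLanguage(locale),
                String.join(";", names));
    }

    public static void main(String[] args) {
        for (String code : Locale.getISOLanguages()) {
            System.out.println(row(code));
        }
    }
}
```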
Author: mdoering@gbif.org
Comment: Updated iso639.txt list, now including the 3-letter code in the 2nd column
Created: 2012-01-23 13:44:14.422
Updated: 2012-01-23 13:44:14.422
Author: kbraak@gbif.org
Comment: The ISO 639-1 file you have created includes the ISO 639-2/T (terminology) codes. Should it also include the ISO 639-2/B (bibliographic) codes, which are derived from the English name, e.g. ger for German? There are 21 languages that have alternative codes for bibliographic or terminology purposes - http://www.loc.gov/standards/iso639-2/php/code_list.php
Created: 2012-01-24 16:10:55.491
Updated: 2012-01-24 16:10:55.491
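Editorial sketch of how bibliographic codes could be folded into the terminology form before any further lookup, for illustration only. The class BibliographicCodes is a hypothetical name and the table is deliberately partial: it lists six well-known pairs, not every language with two codes.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical alias table; partial on purpose.
final class BibliographicCodes {
    // ISO 639-2/B code -> ISO 639-2/T code.
    private static final Map<String, String> B_TO_T = new HashMap<>();

    static {
        B_TO_T.put("ger", "deu"); // German
        B_TO_T.put("fre", "fra"); // French
        B_TO_T.put("dut", "nld"); // Dutch
        B_TO_T.put("chi", "zho"); // Chinese
        B_TO_T.put("cze", "ces"); // Czech
        B_TO_T.put("gre", "ell"); // Modern Greek
    }

    /** Returns the terminology code, or the normalised input if no alias exists. */
    static String toTerminology(String code) {
        String c = code.trim().toLowerCase(Locale.ROOT);
        return B_TO_T.getOrDefault(c, c);
    }
}
```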
Author: mdoering@gbif.org
Comment: I've added those entries to the parser file so they should be understood now:
http://code.google.com/p/gbif-common-resources/source/diff?spec=svn523&r=523&format=side&path=/gbif-parsers/trunk/src/main/resources/dictionaries/parse/iso-639-1.txt&old_path=/gbif-parsers/trunk/src/main/resources/dictionaries/parse/iso-639-1.txt&old=520
Created: 2012-01-25 12:25:02.666
Updated: 2012-01-25 12:25:02.666