Issue 10659

registry-metadata-service: Convert 3 letter language ISO code into 2 letter language ISO code during eml parsing

Reporter: kbraak
Assignee: kbraak
Type: Bug
Summary: registry-metadata-service: Convert 3 letter language ISO code into 2 letter language ISO code during eml parsing
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-01-20 18:57:28.791
Updated: 2013-12-16 17:50:34.028
Resolved: 2012-01-27 11:25:06.16
        
Description: Dataset language and metadata language are both fields of type Locale (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/Locale.html). The language must be entered as a 2-letter ISO 639-1 code. When registry-metadata-service parses metadata, it should convert a 3-letter code to its 2-letter equivalent, if needed, before setting these fields.
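
A minimal sketch of the conversion described above, assuming the JDK's own language tables are an acceptable source for the mapping. The class name IsoLanguageCodes and the method toTwoLetter are hypothetical and are not taken from registry-metadata-service. It builds a lookup from the 3-letter code the JDK reports for each 2-letter language to that 2-letter code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

// Hypothetical helper for illustration; not code from registry-metadata-service.
final class IsoLanguageCodes {

  // 3-letter code as reported by the JDK -> 2-letter ISO 639-1 code.
  private static final Map<String, String> ISO3_TO_ISO2 = new HashMap<>();

  static {
    for (String iso2 : Locale.getISOLanguages()) {
      try {
        String iso3 = Locale.forLanguageTag(iso2).getISO3Language();
        if (!iso3.isEmpty()) {
          // Keep the first 2-letter code seen if two of them share a 3-letter code.
          ISO3_TO_ISO2.putIfAbsent(iso3, iso2);
        }
      } catch (MissingResourceException e) {
        // The JDK has no 3-letter code for this language; skip it.
      }
    }
  }

  private IsoLanguageCodes() {}

  // Returns the 2-letter code, or null if the input cannot be converted.
  static String toTwoLetter(String code) {
    if (code == null) {
      return null;
    }
    String normalized = code.trim().toLowerCase(Locale.ROOT);
    if (normalized.length() == 2) {
      return normalized; // already 2 letters; passed through without validation
    }
    if (normalized.length() == 3) {
      return ISO3_TO_ISO2.get(normalized);
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(toTwoLetter("dan")); // da
  }
}
```

Returning null for an unknown code leaves the caller to decide whether to reject the metadata or keep the raw string; that choice is an assumption of this sketch, not something the issue specifies.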

Supplementary info:

CLB uses the IPT vocab:
http://rs.gbif.org/vocabulary/iso/639-1.xml
This has 2- and 3-letter codes plus English titles
http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-checklistbank/src/main/java/org/gbif/checklistbank/service/impl/ThesaurusImpl.java

Attachment iso639.txt
Attachment iso-codes debian.zip


Author: mdoering@gbif.org
Comment: is Locale the right datatype if we only have and want the language?
Created: 2012-01-20 21:41:14.428
Updated: 2012-01-20 21:41:14.428


Author: mdoering@gbif.org
Comment: ISO code translations into many languages, based on Debian POT files
Created: 2012-01-20 23:39:23.189
Updated: 2012-01-20 23:39:23.189


Author: kbraak@gbif.org
Comment: To quote the Locale documentation, the locale is just a mechanism for identifying objects, not a container for the objects themselves. The payoff of using a Locale is that you can query it for information about itself. For language, you can use getDisplayLanguage to get the name of the language in a form suitable for displaying to the user. This makes it an appropriate datatype even if we only want the language. We do have to be careful, though: a Locale object is just an identifier for a region, and no validity check is performed when you construct one. We therefore have to make sure we construct it with a valid 2-letter ISO language code. For example, you must use "da" for Danish. If you use "dan", it isn't understood, and getDisplayLanguage just returns the string "dan" instead of "Danish".
Created: 2012-01-23 11:09:19.613
Updated: 2012-01-23 11:09:19.613
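
The behaviour described in the comment above can be checked with a short sketch. The class and method names are hypothetical. It uses the single-argument Locale constructor the comment refers to (deprecated in recent JDKs, hence the annotation) and adds a membership test against Locale.getISOLanguages() as one possible validity check. What getDisplayLanguage returns for "dan" depends on the JDK's locale data, so no output is shown for that call.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; not part of any GBIF project.
final class LocaleDisplayDemo {

  private LocaleDisplayDemo() {}

  // The constructor accepts any string; it does not check that the code is assigned.
  @SuppressWarnings("deprecation")
  static String englishName(String languageCode) {
    return new Locale(languageCode).getDisplayLanguage(Locale.ENGLISH);
  }

  // True only for 2-letter codes the JDK lists for ISO 639.
  static boolean isKnownTwoLetterCode(String languageCode) {
    return Arrays.asList(Locale.getISOLanguages()).contains(languageCode);
  }

  public static void main(String[] args) {
    System.out.println(englishName("da"));  // Danish
    System.out.println(englishName("dan")); // result depends on the JDK's locale data
    System.out.println(isKnownTwoLetterCode("da"));  // true
    System.out.println(isKnownTwoLetterCode("dan")); // false
  }
}
```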


Author: mdoering@gbif.org
Comment: What about maintaining a language (and country?) enum as part of the common API? At some point we need to maintain a list of accepted codes somewhere, so why not do this in Java if it helps type safety?
Created: 2012-01-23 11:29:34.415
Updated: 2012-01-23 11:29:34.415
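
A rough sketch of what such an enum could look like. LanguageSketch and its methods are hypothetical names chosen for illustration, and only three languages are listed; a real enum would carry the full accepted list.

```java
import java.util.Locale;

// Hypothetical enum for illustration; deliberately lists only three languages.
enum LanguageSketch {
  DANISH("da", "dan"),
  ENGLISH("en", "eng"),
  GERMAN("de", "deu");

  private final String iso2LetterCode;
  private final String iso3LetterCode;

  LanguageSketch(String iso2LetterCode, String iso3LetterCode) {
    this.iso2LetterCode = iso2LetterCode;
    this.iso3LetterCode = iso3LetterCode;
  }

  String getIso2LetterCode() {
    return iso2LetterCode;
  }

  String getIso3LetterCode() {
    return iso3LetterCode;
  }

  // A Locale built from the 2-letter code, so display names remain available.
  Locale toLocale() {
    return Locale.forLanguageTag(iso2LetterCode);
  }

  // Accepts a 2- or 3-letter code in any case; returns null when nothing matches.
  static LanguageSketch fromIsoCode(String code) {
    if (code == null) {
      return null;
    }
    String normalized = code.trim().toLowerCase(Locale.ROOT);
    for (LanguageSketch language : values()) {
      if (language.iso2LetterCode.equals(normalized)
          || language.iso3LetterCode.equals(normalized)) {
        return language;
      }
    }
    return null;
  }
}
```

With an enum like this, a field typed as the enum can only ever hold an accepted language, and the 2-letter/3-letter question is settled once in fromIsoCode rather than at every call site.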


Author: mdoering@gbif.org
Comment: A list of all ISO 639-1 language codes mapped to the English display name, the native name, and a semicolon-concatenated list of unique names in all ISO languages, generated using the Java Locale class
Created: 2012-01-23 13:39:02.222
Updated: 2012-01-23 13:39:02.222
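
One way such a list might be generated from the Locale class is sketched below. This is not the code that produced the attached file: the class name, the tab-separated layout, and the column order are assumptions.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical generator; the tab-separated layout is an assumption.
final class IsoLanguageList {

  private IsoLanguageList() {}

  // All distinct names the JDK has for the language, joined with semicolons.
  static String allNames(String iso2) {
    Locale language = Locale.forLanguageTag(iso2);
    Set<String> names = new LinkedHashSet<>();
    for (String target : Locale.getISOLanguages()) {
      // Where the JDK has no translation it falls back, so the set removes repeats.
      names.add(language.getDisplayLanguage(Locale.forLanguageTag(target)));
    }
    return String.join(";", names);
  }

  // One line: 2-letter code, English name, native name, all names.
  static String describe(String iso2) {
    Locale language = Locale.forLanguageTag(iso2);
    return iso2
        + "\t" + language.getDisplayLanguage(Locale.ENGLISH)
        + "\t" + language.getDisplayLanguage(language)
        + "\t" + allNames(iso2);
  }

  public static void main(String[] args) {
    for (String iso2 : Locale.getISOLanguages()) {
      System.out.println(describe(iso2));
    }
  }
}
```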


Author: mdoering@gbif.org
Comment: Updated iso639.txt list including the 3 letter code in the 2nd column
Created: 2012-01-23 13:44:14.422
Updated: 2012-01-23 13:44:14.422


Author: kbraak@gbif.org
Comment: The iso-639-1 file you have created includes the ISO 639-2/T (terminology) codes. Should it also include the ISO 639-2/B (bibliographic, English-derived) codes, e.g. ger for German? There are 21 languages that have alternative codes for bibliographic or terminology purposes - http://www.loc.gov/standards/iso639-2/php/code_list.php
Created: 2012-01-24 16:10:55.491
Updated: 2012-01-24 16:10:55.491
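
To make the B/T distinction concrete: as far as I know, Locale.getISO3Language() reports the terminology code (deu for German), so a lookup derived from the JDK alone would not recognise ger. A sketch of a supplementary table follows, with a hypothetical class name and a deliberately partial set of entries; the full set is on the Library of Congress page linked above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; the table is an illustrative subset, not the full list.
final class BibliographicCodes {

  // ISO 639-2/B code -> 2-letter ISO 639-1 code.
  private static final Map<String, String> B_TO_ISO2 = new HashMap<>();

  static {
    B_TO_ISO2.put("chi", "zh"); // Chinese (T code: zho)
    B_TO_ISO2.put("cze", "cs"); // Czech   (T code: ces)
    B_TO_ISO2.put("dut", "nl"); // Dutch   (T code: nld)
    B_TO_ISO2.put("fre", "fr"); // French  (T code: fra)
    B_TO_ISO2.put("ger", "de"); // German  (T code: deu)
    B_TO_ISO2.put("gre", "el"); // Greek   (T code: ell)
  }

  private BibliographicCodes() {}

  // Returns the 2-letter code for a bibliographic code, or null if it is not in the table.
  static String toTwoLetter(String bibliographicCode) {
    if (bibliographicCode == null) {
      return null;
    }
    return B_TO_ISO2.get(bibliographicCode.trim().toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(toTwoLetter("ger")); // de
  }
}
```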


Author: mdoering@gbif.org
Comment: I've added those entries to the parser file, so they should be understood now: http://code.google.com/p/gbif-common-resources/source/diff?spec=svn523&r=523&format=side&path=/gbif-parsers/trunk/src/main/resources/dictionaries/parse/iso-639-1.txt&old_path=/gbif-parsers/trunk/src/main/resources/dictionaries/parse/iso-639-1.txt&old=520
Created: 2012-01-25 12:25:02.666
Updated: 2012-01-25 12:25:02.666


Author: kbraak@gbif.org
Comment: On Dataset, InterpretedLanguage<String, Language> is used for the fields language and metadataLanguage. I need to do the same for Organization and Node now, but I will open a separate issue for that work and mark this one as resolved.
Created: 2012-01-27 11:25:06.193
Updated: 2012-01-27 11:25:06.193