Issue 13529

Hi, I found something a little strange debatabl...

13529
Reporter: feedback bot
Assignee: fmendez
Type: Bug
Summary: Hi,    I found something a little strange debatabl...
Resolution: Fixed
Status: Closed
Created: 2013-07-23 16:47:18.303
Updated: 2016-09-28 17:04:04.844
Resolved: 2013-07-29 13:57:43.294
        
        
Description: Hi,

I found something a little strange (and debatable) while playing with the occurrences export.

As the export format seems to be a small superset of DwC-A, I'm currently adding "support" for it to https://github.com/BelgianBiodiversityPlatform/python-dwca-reader

The issue is:

In the attached file (an export from the portal), you can see that in line 189433446, one field (one of the last) contains the Unicode next line character, NEL, U+0085 (http://www.charbase.com/0085-unicode-next-line-nel).

When an external tool parses the occurrence file (the Python standard library in my case) and is set up for UTF-8 (since meta.xml reports the file as UTF-8), the line is automatically considered finished as soon as this character is encountered.

I have mixed feelings about whether this is a bug or not:

- on one side, as the metafile specifies \n in linesTerminatedBy, it is not unreasonable to ignore all other line-break characters when splitting the file into lines.
- on the other, it is still strange to encounter this character in a UTF-8 file and simply ignore it. And I guess many consumer tools will have problems like this. In the case of Python, I'll probably need to subclass the standard File class.
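
The two behaviours described above can be reproduced with a short Python 3 sketch. The sample data is made up for illustration (it is not taken from the attached export), and the function names are hypothetical. It compares `str.splitlines()`, which treats U+0085 as a line boundary, with a text wrapper opened with `newline="\n"`, which only ends a line at `\n`, the terminator declared in meta.xml.

```python
import io

# Hypothetical two-record sample: the first record embeds U+0085 (NEL) in a field.
SAMPLE = "1\tfirst part\u0085second part\n2\tplain value\n".encode("utf-8")


def split_unicode_aware(data):
    # str.splitlines() splits on every Unicode line boundary, NEL included.
    return data.decode("utf-8").splitlines()


def split_on_lf_only(data):
    # With newline="\n", iteration ends a line only at LF; NEL stays in the field.
    with io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline="\n") as fh:
        return [line.rstrip("\n") for line in fh]


print(len(split_unicode_aware(SAMPLE)))  # 3: the first record is cut in two
print(len(split_on_lf_only(SAMPLE)))     # 2: one entry per record
```

Whether a given reader hits the problem therefore depends on which of these two splitting strategies it uses internally.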

*Reporter*: Nicolas Noé
*E-mail*: [mailto:n.noe@biodiversity.be]
    
Attachment gbif-results.zip


Author: fmendez@gbif.org
Created: 2013-07-29 13:57:43.324
Updated: 2013-07-29 13:57:43.324
        
Tabs and "line break" characters are removed from uninterpreted text fields

http://code.google.com/p/gbif-occurrencestore/source/detail?r=2085
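
For consumers who want to apply the same kind of cleaning on their side, here is a minimal Python sketch of the idea. It is not the code from the revision linked above. The function name `clean_field` is hypothetical, and the set of "line break" characters is an assumption: tab plus the characters that Python's `str.splitlines()` treats as line boundaries.

```python
import re

# Assumption: tab plus every character str.splitlines() treats as a boundary
# (LF, CR, VT, FF, FS, GS, RS, NEL, LS, PS). The real fix may use another set.
_TAB_OR_BREAK = re.compile("[\t\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029]")


def clean_field(value):
    # Remove tabs and line-break characters from a single text field.
    return _TAB_OR_BREAK.sub("", value)


cleaned = clean_field("first part\u0085second\tpart")
print(len(cleaned.splitlines()))  # 1
```

Removing the characters outright joins the neighbouring words; substituting a single space instead would be an equally plausible choice if word boundaries matter.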