Created: 2013-07-23 16:47:18.303
Description: Hi,

I found something a little strange (or at least debatable) while playing with the occurrence export.

As the export format seems to be a small superset of DwC-A, I'm actually adding "support" for it into

The issue is:

Within the attached file (export from the portal), you can notice that in line 189433446, one field (one of the last) contains the Unicode new line character (

When using external tools to parse the occurrence file (the Python standard library in my case) and setting them up for UTF-8 (since the file is reported to be UTF-8 in meta.xml), the line is automatically considered finished when this character is encountered.

I have mixed feelings about whether this is a bug or not:

- On one side, as the metafile specifies \n in linesTerminatedBy, it is not unreasonable to ignore all other characters when splitting the file into lines.
- On the other, it is still strange to encounter this character in a UTF-8 file and simply ignore it, and I guess many consumer tools will have problems like this. In the case of Python, I'll probably need to subclass the standard File class.
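
To make the two positions concrete, here is a minimal Python sketch, not the actual parsing code. The sample rows are invented, and U+2028 (LINE SEPARATOR) is only an assumed stand-in for the character in the export. It compares Unicode-aware line splitting with splitting strictly on the \n declared in linesTerminatedBy:

```python
import io

def split_rows(text, terminator="\n"):
    """Split on the declared terminator only (hypothetical helper)."""
    rows = text.split(terminator)
    # A trailing terminator leaves one empty string at the end; drop it.
    if rows and rows[-1] == "":
        rows.pop()
    return rows

# Invented sample: two records, the first with U+2028 inside its last field.
# U+2028 is an assumption; the real export may contain another character.
sample = (
    "1\tAbies alba\tfirst part\u2028second part\n"
    "2\tPicea abies\tno surprises\n"
)

# str.splitlines() treats U+2028 (and several other characters) as a
# line boundary, so the first record is cut in two.
unicode_aware = sample.splitlines()

# Splitting on "\n" alone keeps each record whole.
declared_only = split_rows(sample)

# A text stream created with newline="\n" also ends lines at "\n" only.
stream = io.StringIO(sample, newline="\n")
streamed = [line.rstrip("\n") for line in stream]

print(len(unicode_aware), len(declared_only), len(streamed))
```

With this sample the Unicode-aware split reports one row more than the file really contains, while the two \n-only approaches agree with each other. Under these assumptions, opening the file through `io.open(path, encoding="utf-8", newline="\n")` might be enough, without subclassing a file class.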

*Reporter*: Nicolas Noé

Tabs and "line break" characters are removed from uninterpreted text fields
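
As an illustration only, and not the portal's actual implementation, a cleanup of this kind could look like the sketch below. The helper name `clean_field` is hypothetical, and the set of "line break" characters is assumed to be the one Python's `str.splitlines()` recognises. The characters are replaced with a single space here so that neighbouring words do not run together, which is also an assumption.

```python
import re

# Tab plus every character that str.splitlines() treats as a line boundary.
# Assumption: this is the set meant by "line break" characters.
_BREAKS = re.compile("[\t\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029]+")

def clean_field(value):
    """Replace runs of tabs and line-break characters with one space."""
    return _BREAKS.sub(" ", value).strip()

raw = "first part\u2028second\tpart\n"
cleaned = clean_field(raw)
print(repr(cleaned))
```

After this step a field can no longer split a tab-delimited record, whichever line-splitting rule the consumer applies.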