Issue 18438

Possibly malformed Darwin Core Archive on download

Reporter: feedback bot
Type: Bug
Summary: Possibly malformed Darwin Core Archive on download 
Resolution: Fixed
Status: Closed
Created: 2016-04-27 15:28:45.012
Updated: 2016-04-29 08:57:34.022
Resolved: 2016-04-29 08:57:33.988
        
        
Description: Hi,

I recently generated the following download http://www.gbif.org/occurrence/download/0013982-160311141623029 and think there's an issue with it:

The metafile refers to 235 fields in the core file, occurrence.txt (index between 0 and 234), and specifies that lines are terminated by \n, while fields are terminated by \t.

- I can find 234 instances of \t in the first header line. That means 235 fields, so that's consistent with the metafile.
- However, the following line only contains 233 instances of \t, so 234 fields.

I *think* the archive is therefore malformed, but can't be 100% sure because it's easy to get confused when playing with CSV dialects, file encodings, etc. However, the problem seems to be visible when opening the file (after shortening it with the Unix head command) with LibreOffice (the data is not aligned with the headers in the last columns, for example "protocol"=2016-03-04T11:06Z), with python-dwca-reader, and also when manually checking separators using Unix commands such as those described at: http://stackoverflow.com/questions/11035180/count-number-of-tab-characters-in-linux
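
The tab-counting check described above can also be sketched in Python. This is only an illustration: the function names and the three-column sample are made up, and it assumes that no field contains a literal tab and that fields are not quoted. For the real file you would pass the open occurrence.txt handle and expected=235.

```python
import io

def field_counts(lines, delimiter="\t"):
    """Yield (line_number, field_count) for each line of a delimited file."""
    for number, line in enumerate(lines, start=1):
        # N delimiters on a line means N + 1 fields.
        yield number, line.rstrip("\n").count(delimiter) + 1

def find_mismatches(lines, expected, delimiter="\t"):
    """Return (line_number, field_count) for lines that do not have `expected` fields."""
    return [(n, c) for n, c in field_counts(lines, delimiter) if c != expected]

# Hypothetical sample: a 3-field header, one short data line, one correct data line.
sample = io.StringIO("a\tb\tc\n1\t2\n3\t4\t5\n")
print(find_mismatches(sample, expected=3))  # [(2, 2)]
```

Any line reported here has a different number of fields than the metafile declares, which is the symptom seen in this download.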

Please tell me if I can help by providing more details!

Nicolas
    


Author: mblissett
Created: 2016-04-27 15:55:13.634
Updated: 2016-04-27 15:55:13.634
        
[~cgendreau], I think there's a coordinateAccuracy field in the TSV header, but it isn't in the data.

(Also, full marks to the reporter for shell tools to verify a bug.)
    


Author: cgendreau
Created: 2016-04-27 16:02:53.01
Updated: 2016-04-27 16:02:53.01
        
Indeed coordinateAccuracy should not be there.
But in the sample I used, it doesn't introduce a column shift.
    


Author: mblissett
Created: 2016-04-27 16:08:47.431
Updated: 2016-04-27 16:08:47.431
        
In Nicolas' download, it's one of these headings that's extra and shouldn't be there:

coordinateAccuracy	elevation	elevationAccuracy	depth	depthAccuracy	distanceAboveSurface	distanceAboveSurfaceAccuracy

    


Author: cgendreau
Created: 2016-04-27 20:13:02.902
Updated: 2016-04-27 20:13:02.902
        
I removed coordinateAccuracy from the headers since it was not in the HDFS table anymore.
This is pushed with a new test to check consistency across different Terms sets (which includes headers vs HDFS "columns").
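
For illustration only, a consistency check of this kind could look roughly like the Python sketch below. It is not the test that was pushed: the function name and both sample lists are hypothetical, and the real check presumably compares the project's actual term definitions against the HDFS table columns.

```python
def compare_term_sets(header_terms, table_columns):
    """Return the names that appear in only one of the two collections."""
    header, table = set(header_terms), set(table_columns)
    return {
        "only_in_header": sorted(header - table),
        "only_in_table": sorted(table - header),
    }

# Hypothetical sample: the header still lists a term the table no longer has.
header_terms = ["decimalLatitude", "decimalLongitude", "coordinateAccuracy", "elevation"]
table_columns = ["decimalLatitude", "decimalLongitude", "elevation"]

diff = compare_term_sets(header_terms, table_columns)
print(diff)  # {'only_in_header': ['coordinateAccuracy'], 'only_in_table': []}
```

A test built on this idea would fail whenever either list in the result is non-empty, so a stale header such as coordinateAccuracy is caught before a download is produced.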