Tab in data passed through to DWCA download, conflicts with delimiter
18752
Reporter: nickyn
Type: Feedback
Summary: Tab in data passed through to DWCA download, conflicts with delimiter
Description: This record (865658032 / http://data.rbge.org.uk/herb/E00389856) has a tab at the start of the county field. When the record is included in a download this tab character is passed through into the data files that compose the DWCA, causing the fields in the occurrence.txt file to mis-align.
Resolution: Fixed
Status: Closed
Created: 2016-09-28 16:46:33.874
Updated: 2016-10-04 17:17:33.2
Resolved: 2016-10-04 17:17:33.182
Author: cgendreau
Created: 2016-09-29 14:13:41.535
Updated: 2016-09-29 14:13:41.535
Hi Nicky,
I created a download[1] with only this record and I was not able to reproduce the issue.
There are fewer separators in the record (214) than in the header (217), which could also be an issue, but all the fields are aligned.
Please let me know if I missed something.
[1] http://www.gbif.org/occurrence/download/0014154-160910150852091
Author: mblissett
Created: 2016-09-29 15:32:14.134
Updated: 2016-09-29 15:32:14.134
I ran this (in Zsh) on all Nicky's recent downloads, to find the issues:
{code}
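# occurrence.txt: keep only tabs and newlines, so each line's length is its tab count; list counts other than 234
for i in *.zip; do echo $i; unzip -p $i occurrence.txt | tr -d -c $'\t\n' | awk '{ print length }' | grep -v 234 | sort | uniq -c; echo; done
# simple CSV ($i:r is the zip name without its extension): same check, expecting 43 tabs per line
for i in *.zip; do echo $i; unzip -p $i $i:r.csv | tr -d -c $'\t\n' | awk '{ print length }' | grep -v 43 | sort | uniq -c; echo; done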
{code}
This detects problems with these two: 0012956-160910150852091.zip 0010455-160910150852091.zip
* embedded tabs (some lines with >234 tabs)
* embedded newlines (many more lines with <234 tabs) i.e. PF-2625
However, it doesn't pick up Nicky's example (thanks for the undeniable proof), where the line with an embedded tab has the final field missing, so its tab count is unchanged.
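The loops above hard-code the expected counts (234 and 43). As a rough sketch only, the same check could take the expected count from the header line instead, which would also flag a record with fewer separators than the header. `check_tab_counts` is a hypothetical helper name, not part of any GBIF tooling, and like the loops above it cannot catch a line where an extra tab is offset by a missing field.

```shell
# Hypothetical helper (an assumption for illustration, not existing tooling).
# Reads tab-delimited text on stdin and reports every line whose number of
# tabs differs from the number of tabs in the first (header) line.
check_tab_counts() {
  awk '
    { tabs = gsub(/\t/, "&") }            # gsub returns how many tabs it matched
    NR == 1 { expected = tabs; next }     # the header line sets the expected count
    tabs != expected {
      printf "line %d: %d tabs, header has %d\n", NR, tabs, expected
    }
  '
}
```

It could be used on one archive at a time, for example `unzip -p 0012956-160910150852091.zip occurrence.txt | check_tab_counts`. A record containing an embedded newline shows up as two consecutive short lines.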
Author: cgendreau
Created: 2016-09-30 09:48:03.138
Updated: 2016-09-30 09:48:03.138
Strings are not sanitized properly in "big downloads" (around 200 000 records). This explains the different results for the same record in two different archives (they are not assembled the same way).
It is now fixed:
https://github.com/gbif/occurrence/commit/06cb96aa5f27051c16a5a989efc6b66e46493bc2
Thanks for reporting the issue.
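For illustration only, the kind of sanitization involved can be sketched as replacing each delimiter character inside a field value with a space before the value is written to a tab-delimited file. This is not the code from the commit linked above; `sanitize_field` is a hypothetical name, and replacing with a single space is an assumption.

```shell
# Hypothetical sketch, not the actual fix: replace each tab, carriage return
# and newline in a single field value with a space, so the value cannot be
# mistaken for a field or record delimiter in a tab-delimited file.
sanitize_field() {
  printf '%s' "$1" | tr '\t\r\n' '   '
}
```

A county value that starts with a tab would come out starting with a single space, leaving the number of tab-separated columns on the line unchanged.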
Author: nickyn
Comment: GBIF occurrence ID 295234012 includes a tab in a field included in the simple CSV download format (locality)
Created: 2016-10-04 13:53:23.741
Updated: 2016-10-04 13:53:23.741
Author: cgendreau
Comment: The fix is now deployed and I have tested occurrenceId 865658032 in a "big download". The field(s) are properly sanitized.
Created: 2016-10-04 16:56:22.578
Updated: 2016-10-04 16:56:22.578