Issue 18764

Cannot use downloaded files due to embedded nulls

Reporter: feedback bot
Type: Feedback
Summary: Cannot use downloaded files due to embedded nulls
Resolution: Fixed
Status: Closed
Created: 2016-10-04 22:43:01.27
Updated: 2016-10-05 16:45:09.874
Resolved: 2016-10-05 16:45:09.776
        
        
Description: I wished to use GBIF to conduct CRACLE analyses, but the downloaded occurrence.txt, especially for downloads filtered at genus level, will not read properly into R due to embedded nulls. I have attempted to specify the encoding etc., but nothing seems to fix the issue. I will attempt downloading genus by genus and building the tables.

Overall I have found the GBIF interface very user-unfriendly. It is very hard to get the information I require out of the database, even though that seems to be one of its primary functions.

*Reporter*: Tamara
*E-mail*: [mailto:tamara.fletcher@umontana.edu]
    


Author: cgendreau
Created: 2016-10-05 09:18:29.184
Updated: 2016-10-05 09:18:29.184
        
I have recreated the download:
http://api.gbif.org/v1/occurrence/download/request/0016319-160910150852091.zip

The occurrence.txt looks OK. There are some "wrong encoding" characters coming from the source (e.g. http://www.gbif.org/occurrence/295002/fragment), but nothing that would prevent reading (e.g. a NUL char) as far as I can tell. The number of separators does vary between lines (when the value of the last column is not available, the last tab is missing), but this should not be an issue if you read the file with R, for example.

Not sure what "embedded nulls" refers to.

We need to write to the user to get more information, and also point her to the rGbif package.
    


Author: mblissett
Created: 2016-10-05 09:20:25.255
Updated: 2016-10-05 09:20:25.255
        
I already wrote to the user with a workaround. There were embedded \0 values in the locality field of two records in her simple CSV download. This should be corrected, but we will still have them in DWCA downloads for verbatim fields, so documenting how to remove them may be useful for users:

Linux/Mac:

    tr -d '\0' < 0015532-160910150852091.csv > no-nulls.csv

PowerShell:

    (Get-Content "C:\0015532-160910150852091.csv") -replace "`0", "" | Set-Content "C:\no-nulls.csv"
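
A platform-independent alternative to the two commands above, as a minimal Python sketch. The function name `strip_nuls` and the chunk size are illustrative and not part of any GBIF tooling. It assumes the file is UTF-8 (or another encoding in which a 0x00 byte only ever represents NUL), so it must not be used on UTF-16 files:

```python
def strip_nuls(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path without NUL bytes; return how many were removed.

    Works on raw bytes in fixed-size chunks, so large downloads are not
    loaded into memory at once. Assumes a UTF-8 (or single-byte) encoding.
    """
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of file
                break
            removed += chunk.count(b"\x00")
            dst.write(chunk.replace(b"\x00", b""))
    return removed
```

For example, `strip_nuls("0015532-160910150852091.csv", "no-nulls.csv")` would write the cleaned copy and return the number of NUL bytes dropped; a return value of 0 means the file had none.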


    


Author: mdoering@gbif.org
Comment: Why don't we simplify everyone's life and remove all those bad chars from the verbatim view already? I can't see anyone needing to access those characters in the verbatim data.
Created: 2016-10-05 09:57:27.366
Updated: 2016-10-05 09:57:27.366


Author: cgendreau
Created: 2016-10-05 10:00:57.757
Updated: 2016-10-05 10:00:57.757
        
Perfect, thanks Matt.

Download 0015532-160910150852091 was created on 3 October 2016, so indeed it should be fixed now. But the verbatim view should not include those characters.
    


Author: cgendreau
Created: 2016-10-05 10:08:03.762
Updated: 2016-10-05 10:08:03.762
        
I've just tested an archive with the same record: http://api.gbif.org/v1/occurrence/download/request/0016335-160910150852091.zip
The verbatim.txt doesn't include the NUL char.
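
A check like this can be repeated on any downloaded archive. The following is a minimal sketch, assuming the zip has already been saved locally; the function name `nul_lines_in_member` is hypothetical, and the member name (e.g. `verbatim.txt`) has to match a file actually present in the archive:

```python
import zipfile


def nul_lines_in_member(zip_path, member):
    """Return the 1-based numbers of lines in `member` that contain a NUL byte.

    The member is streamed line by line straight from the zip archive,
    so nothing is extracted to disk.
    """
    hits = []
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as fh:
            for lineno, line in enumerate(fh, start=1):
                if b"\x00" in line:
                    hits.append(lineno)
    return hits
```

An empty list from `nul_lines_in_member("0016335-160910150852091.zip", "verbatim.txt")` would indicate that the file contains no NUL characters; otherwise the returned line numbers point to the affected records.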