Issue 11221

Use comma as default delimiter in final download

Reporter: omeyn
Assignee: omeyn
Type: Bug
Summary: Use comma as default delimiter in final download 
Description: The final download currently uses x01 (the \x01 control character) as its delimiter, which is no good.
Priority: Major
Resolution: Invalid
Status: Closed
Created: 2012-05-22 16:11:45.05
Updated: 2013-12-17 15:16:51.288
Resolved: 2012-05-24 14:21:45.9


Author: trobertson@gbif.org
Created: 2012-05-22 16:26:12.262
Updated: 2012-05-22 16:26:12.262
        
How will you escape things with commas?  E.g. what will a record with the following field look like?

dwc:locality
This is the area known as "Lar' hood", which is often surrounded by angry mobs.




    


Author: omeyn@gbif.org
Created: 2012-05-22 16:53:31.72
Updated: 2012-05-22 16:53:31.72
        
Further discussion suggests we absolutely need to quote fields containing the delimiter using ".  Also quote any fields that have internal \n or \r, and double any embedded quotes.  Propose an end result that looks like this:

1,"I have a, comma","I have two ""internal quotes"" because I'm special",nothing to see here

In other words, follow RFC 4180 (http://www.apps.ietf.org/rfc/rfc4180.html) and thereby get close to Excel's expected behaviour.
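
For illustration only, the quoting rules proposed above can be sketched with Python's standard csv module. The helper name to_csv_line is hypothetical and this is not the download code itself; it only shows that quoting fields containing the delimiter, a quote or a line break, and doubling embedded quotes, gives the proposed line.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one record as a comma-delimited line (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        doublequote=True,           # an embedded " becomes ""
        lineterminator="\r\n",      # CRLF record separator
    )
    writer.writerow(fields)
    return buf.getvalue()


record = [
    1,
    "I have a, comma",
    'I have two "internal quotes" because I\'m special',
    "nothing to see here",
]
line = to_csv_line(record)
print(line, end="")
# 1,"I have a, comma","I have two ""internal quotes"" because I'm special",nothing to see here

# The dwc:locality value asked about earlier in the thread.
locality = 'This is the area known as "Lar\' hood", which is often surrounded by angry mobs.'
locality_line = to_csv_line([locality])
print(locality_line, end="")
```

Reading the line back with csv.reader returns the original field values, with the integer 1 coming back as the string "1".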
    


Author: mdoering@gbif.org
Created: 2012-05-22 21:36:19.07
Updated: 2012-05-22 21:36:19.07
        
I never had a simple experience with CSV files - nearly all dwca CSVs I've seen had some problems.
Is there a strong reason not to use much simpler tab files? You don't need to quote data, but only replace \t and \n with a space or maybe two. 
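
A minimal sketch of the tab-file approach, assuming each tab or line break inside a value is replaced by a single space. The helper name to_tsv_line, the handling of \r, and writing None as an empty field are assumptions made for the example, not something decided in this thread.

```python
import re

# A CRLF pair counts as one line break; a lone tab, CR or LF is also replaced.
_BREAKS = re.compile(r"\r\n|[\t\r\n]")


def to_tsv_line(fields):
    """Serialize one record as a tab-delimited line without any quoting."""
    cleaned = []
    for value in fields:
        text = "" if value is None else str(value)
        cleaned.append(_BREAKS.sub(" ", text))
    return "\t".join(cleaned) + "\n"


row = ["1", "a\tb", "line1\r\nline2", None, 'commas, and "quotes" pass through']
tsv_line = to_tsv_line(row)
print(repr(tsv_line))
```

The replacement is lossy, since the original tabs and line breaks cannot be recovered, but commas and quotes need no special treatment.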
    


Author: lfrancke@gbif.org
Comment: Why are tab files simpler? Isn't it just commas replaced with tabs? You still have to deal with empty fields, quoting data etc.
Created: 2012-05-23 10:15:48.265
Updated: 2012-05-23 10:15:48.265


Author: trobertson@gbif.org
Comment: I have the same experience as Markus.  Anyone who thinks CSV is a standard would be surprised after working with a community of people who use various tools to actually produce these files.  "TAB File Tim" says +1 to tab as the delimiter, and replace tabs and newline chars with a single space.  
Created: 2012-05-23 10:16:14.153
Updated: 2012-05-23 10:16:14.153


Author: trobertson@gbif.org
Comment: The thing with commas is we cannot live with stripping them, but tabs we can normally just replace.  
Created: 2012-05-23 10:19:15.372
Updated: 2012-05-23 10:19:15.372


Author: mdoering@gbif.org
Created: 2012-05-23 10:24:53.875
Updated: 2012-05-23 10:24:53.875
        
And because commas cannot be replaced, quoting becomes necessary, and that introduces the main problems. Various tools, notably most Microsoft tools, use "optional" quotes, i.e. they only quote when there is a space or comma (don't ask me why spaces matter). Other tools always quote. Then you also have to escape the quotes, and there are again various ways of doing so (backslash, double or even triple quotes).

You can look at the CSVReader in the dwca reader and all the tests that document the hell we went through over the years.
http://code.google.com/p/darwincore/source/browse/trunk/dwca-reader/src/test/java/org/gbif/file/CSVReaderTest.java
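
To make the variants concrete, here is a small sketch using Python's csv module. The sample lines are made up for illustration and are not taken from CSVReaderTest. It shows the same record written with optional quotes, with every field quoted, and with backslash-escaped quotes, and that the last one only parses correctly when the reader is told about the escape style.

```python
import csv
import io


def parse(line, **dialect):
    """Parse a single CSV line with the given reader options (illustrative helper)."""
    return next(csv.reader(io.StringIO(line, newline=""), **dialect))


expected = ["1", 'say "hi", please', "plain"]

optional_quotes = '1,"say ""hi"", please",plain\r\n'   # quote only when needed
always_quoted = '"1","say ""hi"", please","plain"\r\n'  # quote every field
backslash_escaped = '1,"say \\"hi\\", please",plain\r\n'  # \" instead of ""

# The two doubled-quote styles read back identically with default settings.
ok_optional = parse(optional_quotes) == expected
ok_always = parse(always_quoted) == expected

# The backslash style needs matching reader options ...
ok_backslash = parse(backslash_escaped, doublequote=False, escapechar="\\") == expected
# ... and is misread when the reader assumes doubled quotes.
misread = parse(backslash_escaped) != expected

print(ok_optional, ok_always, ok_backslash, misread)
```

A reader has to know, or guess, which style the producer used, and the line itself carries no indication of that.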

    


Author: omeyn@gbif.org
Comment: Ok, marking this as invalid in favour of OCC-45
Created: 2012-05-24 14:21:34.012
Updated: 2012-05-24 14:21:34.012