Issue 18752

Tab in data passed through to DWCA download, conflicts with delimiter

18752
Reporter: nickyn
Type: Feedback
Summary: Tab in data passed through to DWCA download, conflicts with delimiter
Description: This record (865658032 / http://data.rbge.org.uk/herb/E00389856) has a tab at the start of the county field. When the record is included in a download this tab character is passed through into the data files that compose the DWCA, causing the fields in the occurrence.txt file to mis-align.
Resolution: Fixed
Status: Closed
Created: 2016-09-28 16:46:33.874
Updated: 2016-10-04 17:17:33.2
Resolved: 2016-10-04 17:17:33.182
        
    


Author: cgendreau
Created: 2016-09-29 14:13:41.535
Updated: 2016-09-29 14:13:41.535
        
Hi Nicky,

I created a download[1] with only this record and I was not able to reproduce the issue.
There are fewer separators in the record (214) than in the header (217), which could also be an issue, but all the fields are aligned.

Please let me know if I missed something.

[1] http://www.gbif.org/occurrence/download/0014154-160910150852091
    


Author: nickyn
Created: 2016-09-29 15:02:28.365
Updated: 2016-09-29 15:02:28.365
        
The record is included in download ID 0010543-160910150852091. In the occurrence.txt file within this DWCA it is on line 1943818.
I’ve pasted below the headers from occurrence.txt split into numbered lines and this occurrence data line. The tab is in the column county (column 123) and is mis-aligning the subsequent fields (an example is the field publishingCountry in column 208 which on this data line instead contains the datasetKey UUID).

$ head -n1 occurrence.txt | tr '\t' '\n' | cat -n
     1  gbifID
     2  abstract
     3  accessRights
     4  accrualMethod
     5  accrualPeriodicity
     6  accrualPolicy
     7  alternative
     8  audience
     9  available
    10  bibliographicCitation
    11  conformsTo
    12  contributor
    13  coverage
    14  created
    15  creator
    16  date
    17  dateAccepted
    18  dateCopyrighted
    19  dateSubmitted
    20  description
    21  educationLevel
    22  extent
    23  format
    24  hasFormat
    25  hasPart
    26  hasVersion
    27  identifier
    28  instructionalMethod
    29  isFormatOf
    30  isPartOf
    31  isReferencedBy
    32  isReplacedBy
    33  isRequiredBy
    34  isVersionOf
    35  issued
    36  language
    37  license
    38  mediator
    39  medium
    40  modified
    41  provenance
    42  publisher
    43  references
    44  relation
    45  replaces
    46  requires
    47  rights
    48  rightsHolder
    49  source
    50  spatial
    51  subject
    52  tableOfContents
    53  temporal
    54  title
    55  type
    56  valid
    57  institutionID
    58  collectionID
    59  datasetID
    60  institutionCode
    61  collectionCode
    62  datasetName
    63  ownerInstitutionCode
    64  basisOfRecord
    65  informationWithheld
    66  dataGeneralizations
    67  dynamicProperties
    68  occurrenceID
    69  catalogNumber
    70  recordNumber
    71  recordedBy
    72  individualCount
    73  organismQuantity
    74  organismQuantityType
    75  sex
    76  lifeStage
    77  reproductiveCondition
    78  behavior
    79  establishmentMeans
    80  occurrenceStatus
    81  preparations
    82  disposition
    83  associatedReferences
    84  associatedSequences
    85  associatedTaxa
    86  otherCatalogNumbers
    87  occurrenceRemarks
    88  organismID
    89  organismName
    90  organismScope
    91  associatedOccurrences
    92  associatedOrganisms
    93  previousIdentifications
    94  organismRemarks
    95  materialSampleID
    96  eventID
    97  parentEventID
    98  fieldNumber
    99  eventDate
   100  eventTime
   101  startDayOfYear
   102  endDayOfYear
   103  year
   104  month
   105  day
   106  verbatimEventDate
   107  habitat
   108  samplingProtocol
   109  samplingEffort
   110  sampleSizeValue
   111  sampleSizeUnit
   112  fieldNotes
   113  eventRemarks
   114  locationID
   115  higherGeographyID
   116  higherGeography
   117  continent
   118  waterBody
   119  islandGroup
   120  island
   121  countryCode
   122  stateProvince
   123  county
   124  municipality
   125  locality
   126  verbatimLocality
   127  verbatimElevation
   128  verbatimDepth
   129  minimumDistanceAboveSurfaceInMeters
   130  maximumDistanceAboveSurfaceInMeters
   131  locationAccordingTo
   132  locationRemarks
   133  decimalLatitude
   134  decimalLongitude
   135  coordinateUncertaintyInMeters
   136  coordinatePrecision
   137  pointRadiusSpatialFit
   138  verbatimCoordinateSystem
   139  verbatimSRS
   140  footprintWKT
   141  footprintSRS
   142  footprintSpatialFit
   143  georeferencedBy
   144  georeferencedDate
   145  georeferenceProtocol
   146  georeferenceSources
   147  georeferenceVerificationStatus
   148  georeferenceRemarks
   149  geologicalContextID
   150  earliestEonOrLowestEonothem
   151  latestEonOrHighestEonothem
   152  earliestEraOrLowestErathem
   153  latestEraOrHighestErathem
   154  earliestPeriodOrLowestSystem
   155  latestPeriodOrHighestSystem
   156  earliestEpochOrLowestSeries
   157  latestEpochOrHighestSeries
   158  earliestAgeOrLowestStage
   159  latestAgeOrHighestStage
   160  lowestBiostratigraphicZone
   161  highestBiostratigraphicZone
   162  lithostratigraphicTerms
   163  group
   164  formation
   165  member
   166  bed
   167  identificationID
   168  identificationQualifier
   169  typeStatus
   170  identifiedBy
   171  dateIdentified
   172  identificationReferences
   173  identificationVerificationStatus
   174  identificationRemarks
   175  taxonID
   176  scientificNameID
   177  acceptedNameUsageID
   178  parentNameUsageID
   179  originalNameUsageID
   180  nameAccordingToID
   181  namePublishedInID
   182  taxonConceptID
   183  scientificName
   184  acceptedNameUsage
   185  parentNameUsage
   186  originalNameUsage
   187  nameAccordingTo
   188  namePublishedIn
   189  namePublishedInYear
   190  higherClassification
   191  kingdom
   192  phylum
   193  class
   194  order
   195  family
   196  genus
   197  subgenus
   198  specificEpithet
   199  infraspecificEpithet
   200  taxonRank
   201  verbatimTaxonRank
   202  vernacularName
   203  nomenclaturalCode
   204  taxonomicStatus
   205  nomenclaturalStatus
   206  taxonRemarks
   207  datasetKey
   208  publishingCountry
   209  lastInterpreted
   210  elevation
   211  elevationAccuracy
   212  depth
   213  depthAccuracy
   214  distanceAboveSurface
   215  distanceAboveSurfaceAccuracy
   216  issue
   217  mediaType
   218  hasCoordinate
   219  hasGeospatialIssues
   220  taxonKey
   221  kingdomKey
   222  phylumKey
   223  classKey
   224  orderKey
   225  familyKey
   226  genusKey
   227  subgenusKey
   228  speciesKey
   229  species
   230  genericName
   231  typifiedName
   232  protocol
   233  lastParsed
   234  lastCrawled
   235  repatriated

$ sed -n "1943818p; 1943819q" occurrence.txt | tr '\t' '\n' | cat -n
     1  865658032
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27  http://data.rbge.org.uk/herb/E00389856
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37  CC_BY_NC_4_0
    38
    39
    40  2013-01-08T01:00Z
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55  PhysicalObject
    56
    57
    58  http://biocol.org/urn:lsid:biocol.org:col:15670
    59
    60  E
    61  E
    62  Royal Botanic Garden Edinburgh Herbarium
    63  E
    64  PRESERVED_SPECIMEN
    65
    66
    67
    68  http://data.rbge.org.uk/herb/E00389856
    69  E00389856
    70  5425
    71  Lace, John Henry
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81  herbarium specimen of unspecified type
    82
    83
    84
    85
    86  BGBASE:553173
    87
    88
    89
    90
    91
    92
    93
    94
    95
    96
    97
    98
    99
   100
   101  258
   102
   103
   104
   105
   106  September 1911
   107
   108
   109
   110
   111
   112
   113
   114
   115
   116  South Asia
   117
   118
   119
   120
   121  MM
   122  Mandalay
   123
   124  Pyin Oo Lwin District
   125
   126  Maymyo Plateau
   127
   128  3500 FT
   129
   130
   131
   132
   133
   134
   135
   136
   137
   138
   139  degrees minutes seconds
   140
   141
   142
   143
   144
   145
   146
   147
   148
   149
   150
   151
   152
   153
   154
   155
   156
   157
   158
   159
   160
   161
   162
   163
   164
   165
   166
   167
   168
   169
   170
   171
   172
   173
   174
   175
   176
   177
   178
   179
   180
   181
   182
   183
   184  Zingiber Mill.
   185
   186
   187
   188
   189
   190
   191
   192  Plantae
   193  Tracheophyta
   194  Liliopsida
   195  Zingiberales
   196  Zingiberaceae
   197  Zingiber
   198
   199
   200
   201  GENUS
   202
   203
   204  ICBN
   205
   206
   207
   208  bf2a4bf0-5f31-11de-b67e-b8a03c50a862
   209  GB
   210  2016-08-04T15:08Z
   211  1067.0
   212  0.0
   213
   214
   215
   216
   217  RECORDED_DATE_UNLIKELY
   218
   219  false
   220  false
   221  2756693
   222  6
   223  7707728
   224  196
   225  627
   226  4687
   227  2756693
   228
   229
   230
   231  Zingiber
   232
   233  DWC_ARCHIVE
   234  2015-06-26T10:52Z
   235  2016-09-13T12:10Z

    


Author: mblissett
Created: 2016-09-29 15:32:14.134
Updated: 2016-09-29 15:32:14.134
        
I ran this (in Zsh) on all Nicky's recent downloads, to find the issues:

{code}
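# Count tabs per line of occurrence.txt and report any count other than 234 (-x matches the whole line only).
for i in *.zip; do echo "$i"; unzip -p "$i" occurrence.txt | tr -d -c $'\t\n' | awk '{ print length }' | grep -vx 234 | sort | uniq -c; echo; done
# Same check for the simple CSV, named after the archive (Zsh :r modifier), expecting 43 tabs.
for i in *.zip; do echo "$i"; unzip -p "$i" "${i:r}.csv" | tr -d -c $'\t\n' | awk '{ print length }' | grep -vx 43 | sort | uniq -c; echo; done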
{code}

This detects problems with these two: 0012956-160910150852091.zip 0010455-160910150852091.zip

* embedded tabs (some lines with >234 tabs)
* embedded newlines (many more lines with <234 tabs) i.e. PF-2625

It doesn't pick up Nicky's example, though (thanks for the undeniable proof): there, the line with the embedded tab is also missing its final field, so it still has 234 tabs.
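
A row like Nicky's, where the tab count is unchanged but the fields are shifted, could be caught by validating a column that has a fixed format. The following is only a sketch under assumptions: it takes datasetKey to be column 207 (as in the header listing above) and assumes every valid row holds a lowercase UUID there. The function name find_shifted_rows is made up for illustration.

```shell
# Hypothetical helper: print the line numbers of data rows whose column 207
# (datasetKey in the header listing above) does not look like a UUID.
# Assumption: every valid row carries a lowercase UUID in that column.
find_shifted_rows() {
  # $1: path to a tab-delimited occurrence.txt that starts with a header row
  tail -n +2 "$1" | cut -f207 |
    grep -Evn '^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$' |
    cut -d: -f1 |
    awk '{ print $1 + 1 }'   # +1 because the header line was skipped
}
```

Called as find_shifted_rows occurrence.txt, it prints one line number per suspicious row and nothing if every row passes.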

    


Author: cgendreau
Created: 2016-09-30 09:48:03.138
Updated: 2016-09-30 09:48:03.138
        
Strings were not sanitized properly in "big downloads" (around 200 000 records). This explains the different results for the same record in two different archives (they are not assembled the same way).

It is now fixed:
https://github.com/gbif/occurrence/commit/06cb96aa5f27051c16a5a989efc6b66e46493bc2

Thanks for reporting the issue.
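
The linked commit is the authoritative fix. Purely as an illustration of the idea, and not the actual GBIF code, sanitizing a value before it is written to a tab-delimited file could mean replacing embedded tabs and line breaks with spaces. The name sanitize_field is hypothetical.

```shell
# Hypothetical helper, not the GBIF implementation: replace tab, carriage
# return and newline characters in a field value with single spaces, so the
# value cannot be mistaken for a field or record delimiter.
sanitize_field() {
  printf '%s' "$1" | tr '\t\r\n' '   '
}

# Example: a county value with a leading tab, as in the reported record.
county=$(printf '\tPyin Oo Lwin District')
sanitize_field "$county"; echo
```

Whether a sanitized value should also be trimmed is a separate design choice that this sketch leaves open.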
    


Author: nickyn
Comment: GBIF occurrence ID 295234012 includes a tab in a field included in the simple CSV download format (locality)
Created: 2016-10-04 13:53:23.741
Updated: 2016-10-04 13:53:23.741


Author: cgendreau
Comment: The fix is now deployed and I have tested occurrenceId 865658032 in a "big download". The field(s) are properly sanitized. 
Created: 2016-10-04 16:56:22.578
Updated: 2016-10-04 16:56:22.578