Issue 12315

Possible DwC-A iterator issue in Scala

Reporter: trobertson
Type: Bug
Summary: Possible DwC-A iterator issue in Scala
Priority: Minor
Status: InProgress
Created: 2012-11-13 15:08:36.594
Updated: 2013-12-17 15:45:09.549
        
Description: This was reported by the UK NBN (Paul Gilbertson) and relates to using the DwC-A reader from Scala.

{quote}
Possible problem with iterator implementation in DwcA Reader
---------------------------------------------------------------------------------------

From memory, the problem happens when using the standard Scala sequence function "grouped", which groups the sequence into a sequence of sequences.

This code should repro it:

{code}
for (group <- reader.iteratorRaw.grouped(100)) {
 for (record <- group) {
   println(record.core.value(DwcTerm.occurrenceID))
 }
}
{code}

This prints the very last record in the archive, repeatedly, each time. It shouldn't do, so it's possible there's something wrong with the iterator implementation.
{quote}
    
Attachment reader-test.tar.gz


Author: lfrancke@gbif.org
Created: 2012-11-13 15:22:05.756
Updated: 2012-11-13 15:22:05.756
        
From the looks of it, this is not Scala-specific. {{grouped}} does nothing that a normal Iterator user wouldn't do as well.

So in Java this can be reduced to:

{code}
Iterator<StarRecord> iter = reader.iteratorRaw();
while (iter.hasNext()) {
  System.out.println(iter.next().core().value(DwcTerm.occurrenceID));
}
{code}

That is frightening, because it is just standard Iterator usage. I'm going to try to reproduce it.
    


Author: lfrancke@gbif.org
Comment: I've attached a simple project using the dwca-reader to iterate over a file. It has a Java and a Scala version. This does show the problem.
Created: 2012-11-13 15:47:34.41
Updated: 2012-11-13 15:47:34.41


Author: lfrancke@gbif.org
Created: 2012-11-13 15:59:22.363
Updated: 2012-11-13 15:59:22.363
        
I found the problem.

dwca-reader reuses the same object over and over in the iterator and just sets new content on it every time {{next}} is called. {{grouped}} builds a list of 100 records, each of which is correct at the time it is placed in the list. They all refer to that one shared object, though, so by the time you iterate over them they will all show the last element read.
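The effect can be shown without dwca-reader at all. The following is a minimal sketch, not the library's code: {{MutableRecord}} and {{ReusingIterator}} are hypothetical stand-ins that only mimic the "one shared object, overwritten on each {{next}}" pattern described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReusedRecordDemo {

    // Hypothetical stand-in for a record; NOT a dwca-reader class.
    static final class MutableRecord {
        String occurrenceId;
    }

    // Hands out the SAME instance on every call to next(),
    // overwriting its content each time.
    static final class ReusingIterator implements Iterator<MutableRecord> {
        private final Iterator<String> source;
        private final MutableRecord shared = new MutableRecord();

        ReusingIterator(List<String> ids) {
            this.source = ids.iterator();
        }

        @Override
        public boolean hasNext() {
            return source.hasNext();
        }

        @Override
        public MutableRecord next() {
            shared.occurrenceId = source.next();
            return shared;
        }
    }

    // Reads each record before asking for the next one.
    static List<String> readImmediately(List<String> ids) {
        List<String> seen = new ArrayList<>();
        Iterator<MutableRecord> it = new ReusingIterator(ids);
        while (it.hasNext()) {
            seen.add(it.next().occurrenceId);
        }
        return seen;
    }

    // Buffers the records first (similar in spirit to grouped), then reads them.
    static List<String> readAfterBuffering(List<String> ids) {
        List<MutableRecord> buffer = new ArrayList<>();
        Iterator<MutableRecord> it = new ReusingIterator(ids);
        while (it.hasNext()) {
            buffer.add(it.next());
        }
        List<String> seen = new ArrayList<>();
        for (MutableRecord record : buffer) {
            seen.add(record.occurrenceId);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "c");
        System.out.println(readImmediately(ids));    // prints [a, b, c]
        System.out.println(readAfterBuffering(ids)); // prints [c, c, c]
    }
}
```

Reading each record right away gives the expected values. Holding on to the references and reading them later gives the last value for every entry, because the buffer contains the same object three times.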

{{Archive.iteratorDwc}} actually documents that behavior; {{iteratorRaw}} does not. I fully agree, though, that this is unexpected behavior. It would have tripped me up as well.

[~mdoering@gbif.org] is there really a huge performance gain from this? I just looked, and it seems the only thing reused is the {{rowTypes}} stuff. I'm in favor of removing this optimization and replacing it with less "surprising" behavior.
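For comparison, here is a rough sketch of what an iterator without the reuse could look like. It is only an illustration under assumptions: {{Row}} and {{FreshIterator}} are made-up names, and nothing here reflects how dwca-reader is actually structured or how {{rowTypes}} would be handled. The one point it makes is that allocating a new object per {{next}} keeps earlier references stable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class FreshRecordSketch {

    // Hypothetical immutable record; NOT a dwca-reader class.
    static final class Row {
        private final String occurrenceId;

        Row(String occurrenceId) {
            this.occurrenceId = occurrenceId;
        }

        String occurrenceId() {
            return occurrenceId;
        }
    }

    // Allocates a new Row on every call to next(), so references handed
    // out earlier are never modified afterwards.
    static final class FreshIterator implements Iterator<Row> {
        private final Iterator<String> source;

        FreshIterator(List<String> ids) {
            this.source = ids.iterator();
        }

        @Override
        public boolean hasNext() {
            return source.hasNext();
        }

        @Override
        public Row next() {
            return new Row(source.next());
        }
    }

    // Buffers all rows first and reads them afterwards.
    static List<String> readAfterBuffering(List<String> ids) {
        List<Row> buffer = new ArrayList<>();
        Iterator<Row> it = new FreshIterator(ids);
        while (it.hasNext()) {
            buffer.add(it.next());
        }
        List<String> seen = new ArrayList<>();
        for (Row row : buffer) {
            seen.add(row.occurrenceId());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(readAfterBuffering(Arrays.asList("a", "b", "c"))); // prints [a, b, c]
    }
}
```

The cost is one extra allocation per record. Whether that matters for large archives would need measuring.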
    


Author: mdoering@gbif.org
Created: 2012-11-13 16:45:09.495
Updated: 2012-11-13 16:45:09.495
        
Indeed, this is an accepted issue that we have always wanted to change:
http://code.google.com/p/darwincore/issues/detail?id=157

The design came from working a lot with Lucene < 2.9, which had a reusable Token in its Tokenizer (removed since 2.9).
I'd be happy if someone removed that behavior!