Issue 18813

duplicated and missing lines in downloaded txt occurrences

Reporter: sant
Assignee: fmendez
Type: Bug
Summary: duplicated and missing lines in downloaded txt occurrences
Priority: Blocker
Resolution: Fixed
Status: Closed
Created: 2016-11-21 13:28:51.425
Updated: 2016-11-24 20:42:16.934
Resolved: 2016-11-23 16:04:21.197
        
Description: I have previously reported this issue through emails sent to helpdesk from November 2 (subject: "my GBIF data download is ready ... but WRONG").
I was referring to duplicated lines ((+)) found within downloaded occurrence text files.
But I believe it was misunderstood, because in other emails that same week I asked helpdesk to delete 5086 duplicated records ((*), caused by inconsistent occurrenceIDs we provided), which is a different issue.

I want to follow progress on this, so here is an example:

*REAL DATA (as seen when browsing dataset in GBIF):*
http://www.gbif.org/occurrence/search?CATALOG_NUMBER=730&CATALOG_NUMBER=36259&CATALOG_NUMBER=72688&CATALOG_NUMBER=7&CATALOG_NUMBER=10027&CATALOG_NUMBER=10030&CATALOG_NUMBER=10105&DATASET_KEY=1c334170-7ed1-11df-8c4a-0800200c9a66
*FILTER:* catalognumber in ("7";"730";"10027";"10030";"10105";"36259";"72688")
*RESULT:*

*gbifid / occurrenceID / catalognumber / ScientificName / ... other fields*
  59969596 / SANT:SANT:7-A / 7 / Acer ...
  59969612 / SANT:SANT:730-A / 730 / Achillea ...
(*) 59972388 / SANT:SANT:10027 / 10027 / Andryala ...
  1324942751 / SANT:SANT:10027-A / 10027 / Andryala ... (same as gbifid 59972388)
  60003744 / SANT:SANT:10030-A / 10030 / Hypericum ...
(*)  60005974 / SANT:SANT:10105 / 10105 / Leontice ...
  1324942753 / SANT:SANT:10105-A / 10105 / Leontice ...  (same as gbifid 60005974)
(*) 234773272 / SANT:SANT:36259 / 36259 / Serratula ...
  1324943549 / SANT:SANT:36259-A / 36259 / Serratula ... (same as gbifid 234773272)
(*) 1305198572 / SANT:SANT:72688 / 72688 / Zostera ...
  1324947540 / SANT:SANT:72688-A / 72688 / Zostera ... (same as gbifid 1305198572)

I requested 7 catalogue numbers but got 11 records, because there are 4 duplicates ((*)) caused by 4 wrong occurrenceIDs uploaded in the past (my fault).

*DOWNLOADED DATASET:*
http://www.gbif.org/occurrence/download/0030359-160910150852091

{color:red}NOW THE ODD THING: if I import the CSV text into a database and filter by the same 7 catalognumbers, I get 17 records instead of 11:{color}
*FILTER:* catalognumber in ("7";"730";"10027";"10030";"10105";"36259";"72688")
*RESULT:*

*gbifid / occurrenceID / catalognumber / ScientificName ... other fields*
  59969596 / SANT:SANT:7-A / 7 / Acer ...
(+) 59969596 / SANT:SANT:7-A / 7 / Acer ...
  59969612 / SANT:SANT:730-A / 730 / Achillea ...
(+) 59969612 / SANT:SANT:730-A / 730 / Achillea ...
(*) 59972388 / SANT:SANT:10027 / 10027 / Andryala ...
  1324942751 / SANT:SANT:10027-A / 10027 / Andryala ... (same as gbifid 59972388)
  60003744 / SANT:SANT:10030-A / 10030 / Hypericum ...
(*)  60005974 / SANT:SANT:10105 / 10105 / Leontice ...
(+)(*)  60005974 / SANT:SANT:10105 / 10105 / Leontice ...
  1324942753 / SANT:SANT:10105-A / 10105 / Leontice ...  (same as gbifid 60005974)
(*) 234773272 / SANT:SANT:36259 / 36259 / Serratula ...
(+)(*) 234773272 / SANT:SANT:36259 / 36259 / Serratula ...
  1324943549 / SANT:SANT:36259-A / 36259 / Serratula ... (same as gbifid 234773272)
(+) 1324943549 / SANT:SANT:36259-A / 36259 / Serratula ... (same as gbifid 234773272)
(*) 1305198572 / SANT:SANT:72688 / 72688 / Zostera ...
  1324947540 / SANT:SANT:72688-A / 72688 / Zostera ... (same as gbifid 1305198572)
(+) 1324947540 / SANT:SANT:72688-A / 72688 / Zostera ... (same as gbifid 1305198572)

There are 6 repeated rows ((+)), which I understand to be a bug in GBIF's download file generation.
I call these "fake duplicates" because they are not present when you browse the dataset; they appear only when you download it.
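
For reference, here is a minimal sketch (not the exact database query I ran) of how this check can be reproduced directly on the downloaded file. It assumes a tab-delimited file named occurrence.txt with columns named "gbifID", "occurrenceID" and "catalogNumber"; those names are only illustrative and should be adjusted to whatever the actual download uses:

{code:python}
import csv

# Catalogue numbers used in the filter above
WANTED = {"7", "730", "10027", "10030", "10105", "36259", "72688"}

# Assumed file name, delimiter and column names; adjust to the real download
with open("occurrence.txt", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    hits = [row for row in reader if row["catalogNumber"] in WANTED]

# Browsing the portal suggests 11 matching records, but this download returned 17
print(len(hits))
for row in hits:
    print(row["gbifID"], row["occurrenceID"], row["catalogNumber"])
{code}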

Some things I have found after several downloads of the same unchanged dataset
 (the IPT-published version did not change between my download tests):

- +The problem happens with both CSV and DwC-A downloads+ (although most of my attempts have been CSV)
- +Repetitions seem to be random+. If I repeat the download request, the newly generated file is not identical (the repeated lines may be different ones)
- +The number of repetitions also changes between download requests+ (so the example filter above would not always return 17 rows).
   In this example, the linked download contains +16726 repeated lines and 58574 non-repeated lines. Total = 75300+
- +Repeated records always appear only twice+ (I never found the same line 3 or more times)
- Repetitions ((+)) occur for all kinds of records: it does not matter whether they were real duplicates ((*)) pending deletion or not.
- +The total number of lines+ in the file does not change. It always +matches the expected number of records+
   (which is equal to the total number of distinct occurrenceIDs uploaded by the data provider).
   In our case, 75300 records (70214 in the IPT-mapped DB, plus the 5086 real duplicates ((*)) we have reported to GBIF for deletion). A quick way to check these counts is sketched right after this list.
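
The counts above can be verified over the whole file with a sketch like this one (same assumptions as the previous snippet: a tab-delimited occurrence.txt with a "gbifID" column, both names illustrative):

{code:python}
import csv
from collections import Counter

# Count how many times each gbifID appears in the downloaded file
with open("occurrence.txt", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    counts = Counter(row["gbifID"] for row in reader)

repeated_lines = sum(c for c in counts.values() if c > 1)   # 16726 in the linked download
unique_lines = sum(c for c in counts.values() if c == 1)    # 58574
total_lines = repeated_lines + unique_lines                 # 75300, the expected record count
missing_records = total_lines - len(counts)                 # records that should be present but are not

print(repeated_lines, unique_lines, total_lines, missing_records)
{code}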

As a conclusion from all of this: we cannot get the whole dataset, because in every download we try, some repeated rows replace some real rows.

Other datasets I have tried to download do not show this problem. This is quite frustrating, because users cannot trust our dataset.

Thanks a lot in advance for your help
David

    


Author: fmendez@gbif.org
Comment: We'll re-index this dataset. It had a very inconsistent state between the occurrence storage (HBase) and the search index (Solr): both reported the same number of records but different data. This could have happened if we processed deletions while crawling the dataset at the same time.
Created: 2016-11-23 10:09:50.531
Updated: 2016-11-23 10:09:50.531


Author: sant
Created: 2016-11-23 12:15:50.214
Updated: 2016-11-23 12:15:50.214
        
Thanks a lot Federico

You mean we published new data through the IPT at the exact moment you were deleting those 5086 records, right?

I was not aware of that possibility. How can I prevent it from happening again?

Once I report occurrenceIDs that need to be deleted, I cannot know the exact moment when GBIF will process the deletions.
Should I stop updating our dataset and republishing it through the IPT until I receive an email back from GBIF?

Suggestion:

Do you have a way to prevent a particular dataset from being crawled for some time?
That way crawling could not occur even if I update the IPT dataset version.
Later on, GBIF could force a crawl after the deletions have finished (in case the IPT version was updated during that time).

Please let me know when I can republish our dataset again.
Thank you very much

David

    


Author: fmendez@gbif.org
Comment: I'm still analyzing that dataset. It seems I was wrong: I have re-indexed it using the latest Darwin Core Archive and it seems to be fine. Probably there's an error that we are not catching while indexing it.
Created: 2016-11-23 13:09:52.049
Updated: 2016-11-23 13:09:52.049


Author: fmendez@gbif.org
Comment: David, please test your downloads again. We found the issue in how we order the results for small downloads. Please contact us/me if you find any problem with your data.
Created: 2016-11-23 16:04:16.488
Updated: 2016-11-23 16:04:16.488


Author: sant
Created: 2016-11-24 00:23:25.953
Updated: 2016-11-24 00:23:53.065
        
All seems to be correct now. Thanks.

So, is it safe to republish an IPT dataset at any time?
Even if the portal contains records pending deletion?


    


Author: fmendez@gbif.org
Comment: Yes, it is. Sorry for the confusion, but my initial diagnosis was wrong: this issue had nothing to do with deletions and indexing. The problem was in the search routines that we use to collect data for downloads.
Created: 2016-11-24 08:47:56.881
Updated: 2016-11-24 08:47:56.881


Author: trobertson@gbif.org
Comment: Please be aware that with the huge new eBird dataset going through now, other datasets are unlikely to be updated until 5 December.
Created: 2016-11-24 09:24:52.89
Updated: 2016-11-24 09:24:52.89


Author: sant
Created: 2016-11-24 20:39:45.122
Updated: 2016-11-24 20:39:45.122
        
Tim, do you mean that requested deletions will not be processed until 5 December?

Or do you mean that IPT changes (republished, migrated and/or new datasets) will not be indexed by the portal until that date?

Or both?
    


Author: trobertson@gbif.org
Comment: Both, most likely. There are still 200M records on the queue being processed.
Created: 2016-11-24 20:42:16.934
Updated: 2016-11-24 20:42:16.934