16026
Reporter: ahahn
Assignee: fmendez
Type: Bug
Summary: Download reports hits, but does not contain data
Priority: Blocker
Resolution: Fixed
Status: Closed
Created: 2014-07-02 10:36:04.965
Updated: 2014-10-03 16:48:59.337
Resolved: 2014-10-03 16:17:11.863
Description: A download for
Taxon Aythya fuligula (Linnaeus, 1758)
contains meta.xml and metadata.xml, but no further files, and no occurrence data. The message in the interface reports
Ready for download (4.8 kB 530,718 records - 0 datasets)
Download link: api.gbif.org/v0.9/occurrence/download/request/0002498-140616093749225.zip
Author: kbraak@gbif.org
Comment: [~fmendez@gbif.org] can you please take a look?
Created: 2014-07-09 10:38:43.046
Updated: 2014-07-09 10:38:43.046
Author: kbraak@gbif.org
Created: 2014-10-02 17:58:21.605
Updated: 2014-10-02 17:58:21.605
Rerun, to try and reproduce once more.
Download is:
http://api.gbif.org/v1/occurrence/download/0000063-141002143147142
Filtered search is:
http://www.gbif.org/occurrence/search?TAXON_KEY=2498261
Author: kbraak@gbif.org
Created: 2014-10-02 18:22:20.663
Updated: 2014-10-02 18:23:24.084
Download was successful, but the number of records in the download did not match the number of records as reported in the email/occurrence search:
dhcp-17:Desktop kbraak$ wc -l occurrence.txt
592145 occurrence.txt
Versus:
Your download 0000063-141002143147142 is ready at the following address: http://api.gbif.org/v1/occurrence/download/request/0000063-141002143147142.zip (77.2 MB - 593,169 records - 0 datasets)
Author: mdoering@gbif.org
Created: 2014-10-02 18:35:02.143
Updated: 2014-10-02 18:39:23.048
I too always get slightly different counts from Solr, the metrics in the download, and the line count (wc -l) of the occurrence file (minus 1 for the header):
-----
Solr: 66.681
Download stat: 66.679
wc -l: 66.679
http://www.gbif.org/occurrence/search?TAXON_KEY=3026372
http://api.gbif.org/v1/occurrence/download/request/0000013-141002143147142.zip
-----
Solr: 503.957
Download stat: 503.914
wc -l: 500.692
http://www.gbif.org/occurrence/search?TAXON_KEY=3026372
http://api.gbif.org/v1/occurrence/download/request/0000014-141002143147142.zip
-----
Solr: 1.414.881
Download stat: 1.414.811
wc -l:
http://www.gbif.org/occurrence/search?TAXON_KEY=3026372
http://api.gbif.org/v1/occurrence/download/request/0000016-141002143147142.zip
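To make the gaps between the three sources explicit, here is a minimal sketch that parses the figures listed above (written with '.' as the thousands separator) and prints the differences. The helper names `parse_count` and `gaps` are hypothetical and used only for illustration; the third download has no wc -l value, so that gap is left as None.

```python
def parse_count(text):
    """Parse a count written with '.' as the thousands separator, e.g. '503.957'."""
    return int(text.replace(".", ""))


def gaps(solr, stat, wc):
    """Return (Solr minus download stat, download stat minus wc -l)."""
    solr_n = parse_count(solr)
    stat_n = parse_count(stat)
    # The wc -l figure is missing for one download, so allow None.
    wc_gap = None if wc is None else stat_n - parse_count(wc)
    return solr_n - stat_n, wc_gap


# (Solr, download stat, wc -l) exactly as reported in this comment.
reported = [
    ("66.681", "66.679", "66.679"),
    ("503.957", "503.914", "500.692"),
    ("1.414.881", "1.414.811", None),
]

for row in reported:
    print(gaps(*row))
```

On the reported figures, Solr exceeds the download stat by 2, 43 and 70 records, and the second download file is 3,222 lines short of its own download stat.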
Author: trobertson@gbif.org
Created: 2014-10-02 18:59:33.302
Updated: 2014-10-02 18:59:33.302
With all the crashing, it is now highly unlikely that we have consistent counts.
The occurrence_hdfs table has not been rebuilt for 36 hours while harvesting and deletions have continued, so the download source table (occurrence_hdfs) is not consistent with SOLR. Tonight at 05:00 CPH it will rerun, so we can see then.
Another inconsistency is that SOLR could be out of sync with HBase: we've crashed so much that flushes may not have completed.
After tonight's HDFS rebuild let's see the results.
Kyle has also identified some queries that return 0 records when they should return records. That is a separate issue.
{code}wc -l{code} will produce a count that differs from SOLR by 1, if not 2: 1 for the header row, and I *think* there is an extra carriage return at the end.
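A minimal sketch of a record count that sidesteps both off-by-one sources (the header row and a possible trailing blank line). It assumes one record per line, with no line breaks embedded inside fields; the function name `count_records` and the sample column names are hypothetical.

```python
import io


def count_records(lines, has_header=True):
    """Count data rows, skipping the header row and any blank lines."""
    total = 0
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue  # header row is not a record
        if line.strip() == "":
            continue  # e.g. an extra empty line at the end of the file
        total += 1
    return total


# Tiny stand-in for occurrence.txt: header, two records, trailing blank line.
sample = io.StringIO(
    "gbifID\tscientificName\n"
    "1\tAythya fuligula\n"
    "2\tAythya fuligula\n"
    "\n"
)
print(count_records(sample))  # prints 2
```

For a real download, the same function can be applied to the open file, for example `count_records(open("occurrence.txt", encoding="utf-8"))`, and the result compared directly with the record count reported in the email.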
Author: trobertson@gbif.org
Created: 2014-10-03 16:17:11.901
Updated: 2014-10-03 16:17:11.901
Closing this issue. The original issue was obviously a registry issue (0 datasets).
The commentary that started later in this thread related to downloads of Aythya fuligula [1] made to test the new downloads using parallel compression. These were done at a time when the table used for downloads was 2-3 days out of sync with live updates and SOLR, due to all the crashing.
Today we have run many tests and confirmed that downloads return the same number of records as in the source database. For this download we get 593,169 records (1 extra line in the downloaded files for the header), which is confirmed to be the same as in HBase, the download table and SOLR.
To anyone who has been commenting (e.g. @markus): please log each new example as an individual issue, and provide correct links, as here the links were all the same. Please note that anything downloaded before 06:00 today is known to be incorrect, so please only log issues related to new downloads, i.e. reproduce them first.
Thanks [~fmendez@gbif.org] for helping test all this
[1] http://www.gbif.org/occurrence/search?TAXON_KEY=2498261