Issue 12622

occurrence download: handle requests to download entire index

Reporter: kbraak
Type: Improvement
Summary: occurrence download: handle requests to download entire index
Priority: Major
Resolution: WontFix
Status: Closed
Created: 2013-01-22 16:12:01.871
Updated: 2014-01-17 10:45:57.281
Resolved: 2014-01-07 15:41:42.991
        
Description: Visit the entry point to occurrence searches and downloads: http://staging.gbif.org:8080/portal-web-dynamic/occurrence/?url=occurrence

Click to search all 391,317,043 occurrences:  http://staging.gbif.org:8080/portal-web-dynamic/occurrence/search

Now imagine that curious users will try to download all records.

Currently there is nothing to stop them.

We (probably) need to:

Avoid running the job and instead link the user to an existing download dump (see the sketch at the end of this description)

[~ahahn] perhaps you would agree that we also need to have the user accept the data usage agreement that is customary when receiving access to the complete index dump? Can you please lead an investigation into this? Thanks
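A minimal sketch of the "avoid running the job" idea, assuming the download service sees a request whose predicate is null or empty when the user asks for everything; the class, field and URL names are hypothetical, not the actual GBIF portal code:

```java
// A minimal sketch: short-circuit unfiltered download requests by handing back a
// link to an existing full-index dump instead of launching a new download job.
// All names and the dump URL are illustrative assumptions.
import java.util.Optional;

public class FullDownloadGuard {

  /** Stand-in for a download request; a null or blank predicate means "all records". */
  public record DownloadRequest(String creator, String predicateJson) {}

  // Placeholder location of a periodically generated full-index dump.
  private static final String LATEST_FULL_DUMP_URL =
      "http://example.org/dumps/occurrence-latest.zip";

  /** For unfiltered requests, return a link to the existing dump instead of running a job. */
  public Optional<String> shortCircuit(DownloadRequest request) {
    if (request.predicateJson() == null || request.predicateJson().isBlank()) {
      return Optional.of(LATEST_FULL_DUMP_URL);
    }
    return Optional.empty(); // a filtered request still goes through the normal download job
  }
}
```

How and how often the pre-built dump itself is produced is left open in this sketch.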
    


Author: mdoering@gbif.org
Created: 2013-01-22 16:32:44.769
Updated: 2013-01-22 16:32:44.769
        
It was agreed with Tim that full downloads without a filter can be run. But obviously we should make use of caching in this case, and probably in others too. Should we move this to a download issue?

Should the data usage agreement be signed when a user registers?
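A minimal caching sketch for the point above, assuming a completed download can be keyed by its normalized predicate plus a marker of the index version it was run against; the class and method names are illustrative, not the real download service code:

```java
// Cache of finished downloads keyed by (normalized predicate, index version), so that
// identical requests reuse an existing archive until the index changes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DownloadResultCache {

  /** Key combining the filter predicate with the index snapshot it was run against. */
  public record CacheKey(String normalizedPredicate, long indexVersion) {}

  private final Map<CacheKey, String> completedDownloads = new ConcurrentHashMap<>();

  /** Returns the URL of an existing archive for this predicate/index version, or null. */
  public String lookup(String normalizedPredicate, long indexVersion) {
    return completedDownloads.get(new CacheKey(normalizedPredicate, indexVersion));
  }

  /** Records a finished download so later identical requests can reuse it. */
  public void store(String normalizedPredicate, long indexVersion, String archiveUrl) {
    completedDownloads.put(new CacheKey(normalizedPredicate, indexVersion), archiveUrl);
  }
}
```

Invalidation then falls out naturally: a new index version produces a new key, so stale archives are simply never returned.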
    


Author: ahahn@gbif.org
Created: 2013-01-22 17:10:32.357
Updated: 2013-01-22 17:15:45.853
        
This is only in part a technical issue; there is a big political one behind it. I don't think we can just wave it away as "people can download all data even now, if they page over them". At the very minimum, we need to ensure proper communication around this and not let people find out after the fact that a full download of the index is possible - but that is a different level of discussion. This is mainly in the context of data attribution and feedback on usage.

In practical terms, many downloads will have to run asynchronously - a full download is only incrementally harder on the system than a 90% download - so this part indeed fits better into a download issue.

Concerning the data usage agreement:
- In the design of the current data portal we made sure that any user (even one just browsing) has to confirm the data use agreement on first use by clicking a "confirm" link. In this sense, I would agree that, as the very minimum, the data usage agreement also has to be signed by any downloader in the new portal. In addition, each download needs to be accompanied by the citations and rights statements of all concerned datasets, and the downloader should receive a reminder of citation requirements together with a link to the appropriate documents (a rough sketch of such a citation bundle follows below).

- Full dump access is currently governed by the signature of a "Letter of Agreement" between two named partners (the applying institution and GBIF), laying out terms and conditions such as feedback on data use, attribution of source data, etc. If this procedure is to be made redundant, the relevant GBIF SciStaff and possibly committees need to be involved beforehand - this is not an IT decision.
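For the citation and rights statements mentioned above, a rough sketch of how a per-download citation bundle might be assembled, assuming a hypothetical Dataset value class populated from the registry; this is not the actual portal implementation:

```java
// Builds a plain-text block listing citation and rights info for every dataset that
// contributed records to a download. The Dataset record is an illustrative assumption.
import java.util.List;
import java.util.stream.Collectors;

public class CitationFileBuilder {

  /** Hypothetical view of a dataset contributing records to a download. */
  public record Dataset(String title, String publisher, String rights, long recordCount) {}

  /** One line per source dataset: title, publisher, record count and rights statement. */
  public String build(List<Dataset> datasetsInDownload) {
    return datasetsInDownload.stream()
        .map(d -> String.format("%s (%s) - %d records - rights: %s",
            d.title(), d.publisher(), d.recordCount(), d.rights()))
        .collect(Collectors.joining(System.lineSeparator()));
  }
}
```

The resulting text could be shipped inside the download archive, alongside the reminder of citation requirements.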
    


Author: mdoering@gbif.org
Created: 2013-01-22 22:21:56.16
Updated: 2013-01-22 22:21:56.16
        
I agree it would be best to raise this with SciStaff again, but my understanding is that users can download any data they like, whether this is a subset or the whole index. Anyway, would it make much of a difference whether someone downloads all data or just the roughly 90% that is Animalia? Rather than treating a full-index download as special, it should be sufficient to:

- track each download, including the registered user who initiated it
- make the user sign some agreement, either at the time of registering or when making their first download; it is common practice to agree to terms and conditions when registering (see the sketch at the end of this comment)

We definitely need a download issue for caching results for the same predicates until the index changes. We can still discuss the IPR- and UI-related problems here.
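An illustrative sketch of the two bullet points above (tracking each download and requiring an accepted agreement), using hypothetical User and DownloadRecord types rather than real GBIF classes:

```java
// Refuse a download unless the registered user has accepted the terms, and keep an
// audit record of who initiated each download. All types here are assumptions.
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class DownloadGatekeeper {

  public record User(String username, boolean termsAccepted) {}
  public record DownloadRecord(String username, String predicateJson, Instant startedAt) {}

  private final List<DownloadRecord> auditLog = new ArrayList<>();

  /** Starts a download only for users who have agreed to the data use terms. */
  public DownloadRecord start(User user, String predicateJson) {
    if (!user.termsAccepted()) {
      throw new IllegalStateException(
          "User must accept the data usage agreement before downloading");
    }
    DownloadRecord record = new DownloadRecord(user.username(), predicateJson, Instant.now());
    auditLog.add(record); // track every download, including the initiating user
    return record;
  }

  /** All downloads initiated by a given user, e.g. for usage feedback to publishers. */
  public List<DownloadRecord> downloadsBy(String username) {
    return auditLog.stream().filter(r -> r.username().equals(username)).toList();
  }
}
```

Whether the terms are accepted at registration time or before the first download only changes where termsAccepted gets set, not this check.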
    


Author: kbraak@gbif.org
Created: 2014-01-07 15:41:43.025
Updated: 2014-01-07 15:41:43.025
        
POR-966 remains open to deal with caching downloads for the same predicates until the index changes.

Otherwise, there is apparently no problem with downloading the entire index, so I'm closing this issue as Won't Fix.