Issue 11657

Implement a Crawling strategy that does not rely on scientific names being present

Reporter: mdoering
Type: Improvement
Summary: Implement a Crawling strategy that does not rely on scientific names being present
Priority: Minor
Resolution: Fixed
Status: Closed
Created: 2012-08-07 17:05:55.332
Updated: 2017-10-06 15:31:38.117
Resolved: 2017-10-06 15:31:38.101
        
Description: We have long crawled endpoints using a scientificName-based search. This might cause trouble with exotic collations, broken charsets or case-sensitive databases.
The biggest criticism we have received for this approach, though, is that records without a scientific name are not indexed at all. This is often the case for fossils in particular, but there are many other cases. Ideally, a missing name for an occurrence should not stop us from indexing its location, time and all other properties.


Author: lfrancke@gbif.org
Created: 2012-08-07 17:08:50.39
Updated: 2012-08-07 17:08:50.39
        
I didn't know about that, thanks.

Let's rename this issue to something like "Implement a Crawling strategy that does not rely on scientific name".

Any suggestions how we could crawl instead?


Author: lfrancke@gbif.org
Created: 2012-08-07 17:12:49.051
Updated: 2012-08-07 17:12:49.051
        
Exotic collations, broken charsets and case-sensitive databases should not be affected, though, because this is purely range-based: it starts with everything "less than" "aaa" and ends with everything "greater than" "zzz". There's a review in Crucible (CR-POR-42) and you're welcome to comment on the specifics of this strategy there.

I'm not sure, though, how databases handle null fields in "greater than" and "less than" searches. Might those include null rows?
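
For illustration, a minimal sketch of how such a set of name ranges could be generated. The function name `name_ranges` and the two-letter step between boundaries are assumptions made for this example; the actual range layout is the one under review in CR-POR-42 and may differ.

```python
from itertools import product
from string import ascii_lowercase
from typing import Iterator, Optional, Tuple

Range = Tuple[Optional[str], Optional[str]]


def name_ranges() -> Iterator[Range]:
    """Yield (lower, upper) bounds; None means unbounded on that side.

    Hypothetical layout: the lower bound is inclusive, the upper bound is
    exclusive, and boundaries advance by two-letter prefix (aaa, aba, ...).
    """
    bounds = [a + b + "a" for a, b in product(ascii_lowercase, repeat=2)]
    bounds.append("zzz")
    # Everything "less than" the first boundary.
    yield (None, bounds[0])
    # Consecutive half-open ranges between neighbouring boundaries.
    for lower, upper in zip(bounds, bounds[1:]):
        yield (lower, upper)
    # Everything from the last boundary upwards.
    yield (bounds[-1], None)
```

Each range would translate into one filtered request against the endpoint. Note that a record whose name is null falls into none of these ranges.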


Author: lfrancke@gbif.org
Comment: MySQL does not include null rows. So it seems we'd need a different strategy if we want to capture these records. We never specified it, but I'd assume that we're not good at handling occurrences without scientific names in the rest of our workflow either.
Created: 2012-08-07 17:26:59.6
Updated: 2012-08-07 17:26:59.6
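
A small sketch of the null behaviour described in this comment. It uses an in-memory SQLite database as a stand-in for MySQL (an assumption made so the example is self-contained); both follow standard SQL semantics, where a comparison against NULL is neither true nor false. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrence (id INTEGER PRIMARY KEY, scientific_name TEXT)"
)
conn.executemany(
    "INSERT INTO occurrence (id, scientific_name) VALUES (?, ?)",
    [(1, "Abies alba"), (2, "Zea mays"), (3, None)],  # row 3 has no name
)


def ids(where: str) -> list:
    """Return the ids of all rows matching the given WHERE clause."""
    query = f"SELECT id FROM occurrence WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(query)]


below = ids("scientific_name < 'M'")     # lower half of the name space
above = ids("scientific_name >= 'M'")    # upper half of the name space
missing = ids("scientific_name IS NULL")  # only an explicit test finds row 3
```

The two ranges together cover every possible name, yet the row with the null name appears in neither result. Only the explicit `IS NULL` filter returns it.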


Author: mdoering@gbif.org
Created: 2012-08-07 17:28:08.123
Updated: 2012-08-07 17:28:08.123
        
It's also not straight SQL, so the BioCASe, DiGIR and TAPIR implementations may well add to the complexity. How about we first try some null cases on each of our test installations?

For TAPIR and BioCASe at least, you don't need any filter and can just page through all records. But that can apparently get quite slow on some databases, e.g. MySQL, if the offsets get too large. We could try to pick an indexing strategy based on the number of expected records.
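
A minimal sketch of the filterless paging idea. `fetch_page(offset, limit)` is a hypothetical callable standing in for one paged request to an endpoint; it is not part of any real TAPIR or BioCASe client.

```python
from typing import Callable, Iterator, Sequence


def page_all(
    fetch_page: Callable[[int, int], Sequence],
    page_size: int = 1000,
) -> Iterator:
    """Yield every record by requesting consecutive pages until one runs short."""
    offset = 0
    while True:
        records = fetch_page(offset, page_size)
        if not records:
            return
        yield from records
        if len(records) < page_size:
            # A short page means the endpoint has no more records.
            return
        # The offset grows with every page; on some databases the cost of
        # skipping that many rows is what makes large datasets slow.
        offset += len(records)
```

Because no filter is involved, records without a scientific name are returned like any others.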


Author: lfrancke@gbif.org
Created: 2012-08-07 18:37:33.621
Updated: 2012-08-07 18:37:33.621
        
I looked at BioCASe, and that translated directly into SQL, but I have no idea whether that differs from installation to installation.

Yes, I was told that we definitely need some "subsetting" so as not to overload databases.

What do you mean by picking a strategy based on the number of expected records?


Author: mdoering@gbif.org
Comment: The no-filter paging strategy would be really simple and clean, but would only be possible for small datasets, maybe up to 100k records. So we could pick such a strategy for small datasets and use the name-based crawling for large ones only. But that would not solve the problem for all datasets, of course. Couldn't we just filter on the record key: occurrenceID, catalogue number, unit ID or whatever is used? Wouldn't the same strategy we use for name-based filtering also work for these, which are truly mandatory fields?
Created: 2012-08-07 22:30:19.082
Updated: 2012-08-07 22:30:19.082
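
A rough sketch of the size-based choice suggested in this comment. The 100k threshold comes from the comment itself and is only a guess there; the function name, the strategy labels and the fallback for an unknown record count are assumptions made for the example.

```python
from typing import Optional

# "Maybe up to 100k records" per the comment; a tentative figure, not a measured one.
SMALL_DATASET_LIMIT = 100_000


def pick_strategy(expected_records: Optional[int]) -> str:
    """Return a hypothetical strategy label for a dataset of the given size."""
    if expected_records is not None and expected_records <= SMALL_DATASET_LIMIT:
        # Small enough to page through everything without a filter.
        return "paging"
    # Large or unknown size: fall back to range-based crawling.
    return "name_range"
```

Filtering on a record key instead of the name would reuse the range-based branch unchanged, with only the filtered field swapped.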


Author: lfrancke@gbif.org
Created: 2012-08-07 22:33:45.08
Updated: 2012-08-08 10:33:35.021
        
In this first iteration I won't implement "smart strategy" choices, but that general idea (being able to change strategies) is what sparked the architecture in the first place.

As for your suggestion about mandatory fields: sounds good. Smarter men than I need to answer that, though. [~trobertson@gbif.org]?


Author: trobertson@gbif.org
Created: 2012-08-08 10:41:19.547
Updated: 2012-08-08 10:41:19.547
        
This was a deliberate decision, as historically a lot of non-biodiversity data was shared and incorrectly registered in GBIF.
Whether the index should hold content with no scientific identification is not a technical decision, and currently the "business" decision is not to. When criticism arises, please file a content Jira so Jan / Andrea can liaise with those providers and determine whether the content really meets the goals of the organization.


Author: mdoering@gbif.org
Comment: Well, it is criticism from people who don't even bother registering datasets because they know the data won't appear. As I said, I know this is the case for paleontological datasets, for example from the MfN in Berlin. Where would be the best place to file a Jira for wider discussion, then: the portal?
Created: 2012-08-08 10:48:35.34
Updated: 2012-08-08 10:48:35.34


Author: trobertson@gbif.org
Comment: DwC-A 
Created: 2017-10-06 15:31:38.115
Updated: 2017-10-06 15:31:38.115