Issue 12034

Decide on the technology to provide the answer to "occurrence datasets for a nub key"

Reporter: trobertson
Assignee: mdoering
Type: Task
Summary: Decide on the technology to provide the answer to "occurrence datasets for a nub key"
Priority: Blocker
Resolution: Fixed
Status: Closed
Created: 2012-10-16 14:48:21.952
Updated: 2013-08-29 14:45:11.929
Resolved: 2012-11-16 18:36:23.762
        
Description: The portal species pages show related datasets: for any taxon, we list the datasets that have occurrences for that taxon.

Thus we need an index that can answer this.

Background: CLB has a table that holds this mapping, but it was never populated correctly, and relied on an Oozie coordinated job to populate it.  This means batch processing.

Now that we are moving to a near real time system we want to keep this index *in sync*.  There are some obvious options:

i) Use facets on the occurrence record index. SOLR can answer this, but it is currently far too slow to use, e.g.:
 http://jawa.gbif.org:8080/occurrence-solr/select/?q=class_key:212&rows=0&facet=true&facet.field=dataset_key&facet.mincount=1
This would be the ideal solution as the structures exist, if only this were performant.

ii) Keep an index in a relational database, and update this as we go

iii) Expand the registry SOLR index with a multivalued field for nub keys (remember higher taxa are also nub keys), so it does not need the facets used in i)

iv) Explore using DataCube for this, such that the Dimension would be the nub key, and the operation would be a serialized Set.  This requires research into the feasibility.
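
To make option iv) concrete, here is a minimal in-memory sketch of the idea: one dimension (the nub key) whose cell holds a set of dataset keys, merged by set union. It does not use the DataCube library's API; the class name, method names and sample keys are all hypothetical, and a real cube would serialize the set into its backing store.

```python
from collections import defaultdict


class DatasetsByNubKey:
    """In-memory stand-in for a cube with one dimension (nub key)
    and a set-valued cell. Hypothetical; not the DataCube API."""

    def __init__(self):
        self._cells = defaultdict(set)

    def add_occurrence(self, nub_keys, dataset_key):
        # nub_keys: the taxon's own nub key plus those of its higher taxa,
        # since higher taxa are nub keys too.
        for key in nub_keys:
            self._cells[key].add(dataset_key)  # set union, so repeats are harmless

    def datasets_for(self, nub_key):
        # .get avoids creating an empty cell for unknown keys
        return sorted(self._cells.get(nub_key, set()))


index = DatasetsByNubKey()
index.add_occurrence([212, 5], "ds-a")
index.add_occurrence([212], "ds-b")
print(index.datasets_for(212))
```

Note that a plain set only grows: it cannot tell when the last occurrence of a dataset for a nub key has been deleted, so removals would need counts or a periodic rebuild.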

Let the discussion begin...

[~fmendez] [~mdoering] [~omeyn] [~jcuadra] [~kbraak] [~lfrancke]







    


Author: mdoering@gbif.org
Created: 2012-10-16 15:08:37.515
Updated: 2012-10-16 15:08:37.515
        
deprecates these issues:
http://dev.gbif.org/issues/browse/ROL-10
http://dev.gbif.org/issues/browse/ROL-21
    


Author: fmendez@gbif.org
Created: 2012-10-16 15:15:07.85
Updated: 2012-10-16 15:55:09.832
        
we have 1 more option:
v) expand the NameUsage Solr index with a multivalued field for dataset keys where the name has been used.
    


Author: trobertson@gbif.org
Comment: Cannot do the related datasets metrics in the species pages without actually having the list of datasets for a nub key
Created: 2012-10-16 15:15:37.688
Updated: 2012-10-16 15:15:37.688


Author: fmendez@gbif.org
Comment: The best way to decide whether a technology is feasible for this is to test it. If everybody agrees, I'd like to test a simple Solr index with the following structure: <field ...name=datasetkey> and <field multivalue=true name=occurrencekey>. I can try to build the index by reading all datasets from the registry Solr index and the occurrences from the occurrence index. It sounds easy to implement; it's going to be a bit slow, but that's OK since later we'll build it incrementally. If the index works, we can later put it in either the registry-solr server or the occurrence-solr server, so one of those servers will have 2 Solr cores (the existing one plus the new datasets-occurrences core). A core is basically a "table" (based on a Solr schema); 1 Solr server can have several cores, and each core has its own schema and data. If someone has an idea for another technology that could be used, that person can do a similar test with that technology.
Created: 2012-11-01 13:18:40.674
Updated: 2012-11-01 13:18:40.674


Author: lfrancke@gbif.org
Created: 2012-11-01 13:57:39.022
Updated: 2012-11-01 13:57:39.022
        
Tim explained all of this to me, and here are my votes:

* If it turns out that the existing Occurrence index can answer the question in a reasonable time (<2s or so) then use that
* Use HBase/Datacube for this. It does not feel like a Solr thing but mostly like a counter/metrics thing
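
A rough way to check the first point is to time the facet query against the existing index. The sketch below only builds the request URL, parses Solr's default JSON facet layout (a flat list alternating value and count) and wraps a call in a timer. The function names are hypothetical, `wt=json` is an assumption added to the query from the description, and the actual HTTP call is left to whatever client is at hand.

```python
import time
from urllib.parse import urlencode


def facet_query_url(base_url, nub_field, nub_key):
    """Build a facet-only query: no rows, one facet bucket per dataset_key."""
    params = {
        "q": f"{nub_field}:{nub_key}",
        "rows": 0,
        "facet": "true",
        "facet.field": "dataset_key",
        "facet.mincount": 1,
        "wt": "json",  # assumption: ask Solr for a JSON response
    }
    return f"{base_url}?{urlencode(params)}"


def parse_dataset_counts(response):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"]["dataset_key"]
    return dict(zip(flat[0::2], flat[1::2]))


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Passing the function that performs the HTTP request to `timed` gives the elapsed seconds to compare against the ~2s threshold.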
    


Author: trobertson@gbif.org
Created: 2012-11-01 14:14:07.364
Updated: 2012-11-01 14:19:56.481
        
Thanks Fede, but I'll continue with this issue.

I would like to test the following SOLR schemas for read performance:
  #1 doc type of occurrence with multivalue nubKey and datasetKey (exists already, but slow)
  #2 doc type of dataset with multivalue nubKey (gets complex to do # records by dataset, and updates become expensive)
  #3 doc type of name_usage with multivalue of occurrenceDatasetKey (gets complex to do # records by dataset, and updates become expensive)
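
To illustrate the trade-off between the three document shapes, here is a small sketch using plain dicts in place of Solr documents. The field names follow the list above; the sample keys and function names are made up. Only shape #1 yields record counts per dataset directly, while #2 and #3 only record membership.

```python
from collections import Counter

# #1: one document per occurrence
occurrences = [
    {"nubKey": [212, 5], "datasetKey": "ds-a"},
    {"nubKey": [212, 5], "datasetKey": "ds-a"},
    {"nubKey": [212, 7], "datasetKey": "ds-b"},
]
# #2: one document per dataset
datasets = [
    {"datasetKey": "ds-a", "nubKey": [5, 212]},
    {"datasetKey": "ds-b", "nubKey": [7, 212]},
]
# #3: one document per name usage
name_usages = [
    {"nubKey": 212, "occurrenceDatasetKey": ["ds-a", "ds-b"]},
    {"nubKey": 5, "occurrenceDatasetKey": ["ds-a"]},
    {"nubKey": 7, "occurrenceDatasetKey": ["ds-b"]},
]


def from_occurrences(docs, nub_key):
    # Equivalent to faceting on datasetKey: scans every matching occurrence.
    return dict(Counter(d["datasetKey"] for d in docs if nub_key in d["nubKey"]))


def from_datasets(docs, nub_key):
    # Membership only; record counts per dataset are not available.
    return sorted(d["datasetKey"] for d in docs if nub_key in d["nubKey"])


def from_name_usages(docs, nub_key):
    # Single document lookup; again membership only.
    for d in docs:
        if d["nubKey"] == nub_key:
            return sorted(d["occurrenceDatasetKey"])
    return []
```

In #2 and #3, indexing a new occurrence can mean rewriting a whole document with a long multivalued field, which is where the update cost comes from.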

I'll start by investigating a DataCube structure to handle this

    


Author: trobertson@gbif.org
Comment: DataCube appears to handle this well; committed in http://code.google.com/p/gbif-metrics/source/detail?r=83
Created: 2012-11-16 18:36:23.787
Updated: 2012-11-16 18:36:23.787