Issue 11433

Test clb service based indexer and decide which index build approach to keep

Reporter: mdoering
Assignee: mdoering
Type: Improvement
Summary: Test clb service based indexer and decide which index build approach to keep
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-06-18 16:37:36.177
Updated: 2013-12-09 14:01:18.191
Resolved: 2012-07-05 11:02:36.514
        
Description: We have 2 ways of building a species Solr index: pure Solr configs with lots of SQL in the DIH, or a service-based approach with Java code that maps the model objects in checklistbank-index. The latter appears to be 30% slower but has no SQL duplication, so it's much simpler to maintain. It's also less tested.

Test the Java approach and decide which index builder to keep.
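
For illustration only, the service-based approach described above boils down to flattening model objects into index documents in Java rather than in SQL. The following is a minimal sketch under stated assumptions: `NameUsage`, `NameUsageDocMapper`, and the field names are hypothetical stand-ins, not the actual checklistbank-index classes or schema, and a plain `Map` stands in for a Solr input document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a checklistbank model object.
final class NameUsage {
    final int key;
    final String scientificName;
    final String rank;                  // may be null when unknown
    final List<String> vernacularNames; // may be empty

    NameUsage(int key, String scientificName, String rank, List<String> vernacularNames) {
        this.key = key;
        this.scientificName = scientificName;
        this.rank = rank;
        this.vernacularNames = vernacularNames;
    }
}

final class NameUsageDocMapper {
    // Flattens one model object into field name -> value pairs.
    // The field names are assumptions made for this sketch.
    static Map<String, Object> toDocument(NameUsage usage) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("key", usage.key);
        doc.put("scientific_name", usage.scientificName);
        if (usage.rank != null) {
            doc.put("rank", usage.rank);
        }
        if (!usage.vernacularNames.isEmpty()) {
            // Multi-valued field: copy the list so the document owns its data.
            doc.put("vernacular_name", new ArrayList<>(usage.vernacularNames));
        }
        return doc;
    }

    public static void main(String[] args) {
        NameUsage usage = new NameUsage(1, "Puma concolor", "SPECIES",
                Arrays.asList("Cougar", "Puma"));
        System.out.println(toDocument(usage));
    }
}
```

The point of the sketch is that the mapping lives in one place and reuses the model objects, which is where the "no SQL duplication" benefit comes from.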


Author: trobertson@gbif.org
Created: 2012-06-18 16:50:21.429
Updated: 2012-06-18 16:50:41.671
        
There are other alternatives worth considering here.  If this is a significant build time (>1hr), I suspect we might improve it with an Oozie workflow which:

a) sqoops in
b) does some hive manipulation (if needed)
c) builds the index
- i) in Hadoop cluster 4, storing on the local file system
- ii) exporting as a file to be indexed using simple SOLR indexing means
- iii) uses something like Katta to hot swap into running distributed SOLR servers (on C4 machines)

This is similar to my ideas for *potential* *future* *simple* (e.g. 5-6 fields) real-time occurrence indexes.


Author: trobertson@gbif.org
Created: 2012-06-18 17:02:49.718
Updated: 2012-06-18 17:02:55.407
        
How long do both approaches take please?
They build an index of approximately 15GB if what I have is correct - can you please confirm?


Author: mdoering@gbif.org
Created: 2012-06-18 17:05:32.607
Updated: 2012-06-18 17:05:32.607
        
Leave that to Fede, but the bottleneck for sure is simply the db queries in both ways. I can't see how sqooping should be quicker, to be honest, unless we don't need to sqoop at all because it's already in Hive for other reasons.

But if we look at the bigger picture we should maybe consider doing incremental updates to the index in sync with indexing of checklists?


Author: trobertson@gbif.org
Created: 2012-06-18 17:30:20.495
Updated: 2012-06-18 17:30:20.495
        
Keeping indexes in sync would be ideal but will be very difficult due to things like batch building of nubs etc.  Also, transaction boundaries become difficult to manage (if SOLR worked but DB failed, we need to roll back SOLR etc).
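
To make the transaction-boundary problem concrete, here is a minimal sketch of a dual write with a compensating action. It is an illustration under assumptions, not an existing implementation: `Store`, `InMemoryStore`, and `DualWriter` are hypothetical, and the in-memory stores merely stand in for SOLR and the database.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction over "something we write to" (index or database).
interface Store {
    void put(String id, String value) throws Exception;
    void delete(String id) throws Exception;
}

// In-memory stand-in used only for this sketch; can simulate a failing write.
final class InMemoryStore implements Store {
    final Map<String, String> data = new HashMap<>();
    boolean failOnPut;

    public void put(String id, String value) throws Exception {
        if (failOnPut) {
            throw new Exception("simulated write failure");
        }
        data.put(id, value);
    }

    public void delete(String id) {
        data.remove(id);
    }
}

final class DualWriter {
    // Writes to the index first, then to the database. If the database write
    // fails, the index write is undone with a compensating delete.
    // Returns true only when both writes succeeded.
    static boolean write(Store index, Store db, String id, String value) {
        try {
            index.put(id, value);
        } catch (Exception indexFailure) {
            return false; // nothing written yet, nothing to undo
        }
        try {
            db.put(id, value);
            return true;
        } catch (Exception dbFailure) {
            try {
                index.delete(id);
            } catch (Exception compensationFailure) {
                // The two stores are now out of sync; a real system would
                // have to record this and repair it later.
            }
            return false;
        }
    }

    public static void main(String[] args) {
        InMemoryStore index = new InMemoryStore();
        InMemoryStore db = new InMemoryStore();
        db.failOnPut = true;
        System.out.println("both written: " + write(index, db, "1", "Puma concolor"));
        System.out.println("index entries after rollback: " + index.data.size());
    }
}
```

Even this small sketch shows why the approach is awkward: a compensating delete cannot restore a previous version of an updated document, and the compensation itself can fail.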

The reason Sqoop will be fast is because it will bring in the tables verbatim (most likely <10 mins for the whole CLB) and then calculate the view in Hive.  Currently there is a very expensive view being calculated in PostgreSQL IIRC, which is the real bottleneck.  Worth exploring, but I would expect a Sqoop, Hive, MR index build to be on the order of 10-20 minutes.

Currently the SQL view SOLR approach is around 24hrs.


Author: mdoering@gbif.org
Comment: Speed is what we will gain, but maintenance would not be better, maybe even worse, if we sqoop out raw tables and do the mapping & logic in Hive again. Should be interesting to find the right balance.
Created: 2012-06-18 17:39:39.327
Updated: 2012-06-18 17:39:39.327


Author: mdoering@gbif.org
Comment: The Java-based indexer has been improved enormously and is the only one left now.
Created: 2012-07-05 11:02:36.587
Updated: 2012-07-05 11:02:36.587