Issue 11505

Bulk load all usage ids in advance when solr indexing

Reporter: mdoering
Assignee: mdoering
Type: Improvement
Summary: Bulk load all usage ids in advance when solr indexing
Description: Paging through all usages already selects only the ids first, but even a simple "select id from usage" query with a limit of 100 and an offset of 12 million takes 16s on boma under load. Use the native Postgres bulk load to fetch all ids into a primitive int array first, then prepare jobs (threads) that each work on a slice of that array.
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-06-27 09:18:53.4
Updated: 2013-12-09 13:41:02.282
Resolved: 2012-06-28 11:34:42.481
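
Note: the slicing idea from the description can be sketched roughly as below. This is only an illustration in Python, not the code that was committed for this issue. The names slice_ids, index_slice and run_jobs are hypothetical, and the id array is generated in memory where the real code would fill it from one bulk read of the usage ids. The point is that each job receives a slice of an array already in memory, so no job issues a limit/offset query that forces Postgres to scan and discard millions of rows.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor


def slice_ids(ids, slice_size):
    """Yield consecutive slices of the id array, each at most slice_size long."""
    for start in range(0, len(ids), slice_size):
        yield ids[start:start + slice_size]


def index_slice(id_slice):
    """Hypothetical stand-in for one indexing job.

    A real job would load the usages for these ids and send them to solr;
    here it only reports how many ids it was given.
    """
    return len(id_slice)


def run_jobs(ids, slice_size=100, threads=4):
    """Run one job per slice on a thread pool and return the total ids handled."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(index_slice, slice_ids(ids, slice_size)))


# Assumption: in the real workflow this array is filled by a single bulk read
# of all usage ids; it is faked here so the sketch runs without a database.
all_ids = array("i", range(1, 1001))
total = run_jobs(all_ids, slice_size=100, threads=4)
```

An array of type "i" stores the ids as packed C ints, which is the closest Python analogue to the primitive int array mentioned in the description.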


Author: trobertson@gbif.org
Comment: I propose we look at the whole CLB workflow (what it takes to rebuild nub, tie checklists to nub and build the index) with a fresh set of eyes before proceeding. From a naive view, it appears to be a prime candidate for a simple Oozie workflow (Sqoop, Hive, SOLR build), but as Markus says, that brings the cost of extra maintenance. Since the SOLR schema is so simple and the CLB schema is stable, it might be an acceptable cost. The benefit is likely to be a sub-1hr re-index and hot deploy.
Created: 2012-06-27 09:41:30.252
Updated: 2012-06-27 09:41:30.252