Id: 11524
Reporter: mdoering
Assignee: mdoering
Type: Improvement
Summary: Improve solr index build performance
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-06-29 10:29:36.756
Updated: 2013-12-09 14:01:27.796
Resolved: 2012-06-29 23:02:31.333
Description: Currently it takes 32h on jawa and 61h on boma.
Investigate why that is so different and what the bottlenecks are and try to remove them.
Initial investigations with YourKit showed:
* 10% of the total time is spent on loading the single vernacular name of a name usage, which is not needed as we load all of them anyway
* the BoneCP connection pool doesn't seem to behave well: 2% of the time is spent on getting a connection when the pool should only be reusing existing ones. There is also a large amount of blocking occurring in the db pool
* the nub usage components (vernacular names, distributions, descriptions, species profiles) are loaded twice: once through the nub usage and once through their original usage. See issue CLB-157
Author: mdoering@gbif.org
Created: 2012-06-29 11:26:21.607
Updated: 2012-06-29 11:26:21.607
Comparing an explain analyze of the same query on boma vs jawa shows that they are not that much different:
{noformat}
JAWA: cost=1555.84, runtime=0.445ms
BOMA: cost=1159.56, runtime=0.585ms
explain analyze
SELECT v.id, v.usage_fk, chk.resource_key, vn.vernacular_name as vernacularName,
       v.language, v.country, area.term as area, null as source, null as preferred,
       null as plural, tsex.preferred_term_fk as sex, tlife.preferred_term_fk as lifestage
FROM vernacular_name v
  JOIN name_usage u ON v.usage_fk = u.id
  JOIN checklist chk ON u.checklist_fk = chk.id
  LEFT JOIN vernacular_name_string vn ON v.name_string_fk = vn.id
  LEFT JOIN term area ON area_fk = area.id
  LEFT JOIN term tsex ON sex_fk = tsex.id
  LEFT JOIN term tlife ON life_stage_fk = tlife.id
WHERE u.nub_fk = 212 AND v.language = 'en' AND vn.blacklisted = false
ORDER BY v.id
LIMIT 1;
{noformat}
Comparison of the postgres settings jawa vs boma:
{noformat}
setting                 JAWA     BOMA
sync                    off      on
work_mem                832MB    128MB
shared_buffers          13GB     32GB
maintenance_work_mem    1GB      8GB
{noformat}
Author: mdoering@gbif.org
Created: 2012-06-29 15:59:42.295
Updated: 2012-06-29 15:59:42.295
Removed the need for a vernacular name in the name usage service:
http://code.google.com/p/gbif-ecat/source/detail?r=5186
Author: mdoering@gbif.org
Created: 2012-06-29 16:04:28.035
Updated: 2012-06-29 16:04:28.035
Bigger refactoring using a single Solr server and many, many jobs, with each job loading a configurable range of usages in one SQL statement. Using a batch size of 10,000 usages this reduces the estimated index build time from 60h on boma down to only 15.5h! YourKit previously profiled more than 60% of the total time in getting the full name usages; now it's only 3%!
http://code.google.com/p/gbif-ecat/source/detail?r=5187
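For illustration, here is a minimal sketch of that job layout, assuming a fixed-size key partitioning; the class and method names are hypothetical and not taken from the actual gbif-ecat sources:
{noformat}
// Hypothetical sketch: cut the usage key space into fixed-size ranges and
// submit one indexing job per range to a thread pool, all feeding a single
// shared Solr server. Names are illustrative, not the real gbif-ecat code.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RangePartitionedIndexer {

  private static final int BATCH_SIZE = 10000; // usages per job, as above

  private final ExecutorService pool = Executors.newFixedThreadPool(16);

  public void buildIndex(int maxUsageKey) throws Exception {
    List<Future<?>> jobs = new ArrayList<Future<?>>();
    // one job per contiguous key range; each job loads its whole range
    // with a single SQL statement and adds the documents to Solr
    for (int start = 0; start <= maxUsageKey; start += BATCH_SIZE) {
      final int from = start;
      final int to = Math.min(start + BATCH_SIZE - 1, maxUsageKey);
      jobs.add(pool.submit(new Runnable() {
        public void run() {
          indexRange(from, to);
        }
      }));
    }
    for (Future<?> job : jobs) {
      job.get(); // propagate failures from individual jobs
    }
    pool.shutdown();
  }

  private void indexRange(int from, int to) {
    // load usages with WHERE u.id BETWEEN from AND to in one statement,
    // then add the resulting documents to the shared Solr server
  }
}
{noformat}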
Author: mdoering@gbif.org
Created: 2012-06-29 16:05:11.189
Updated: 2012-06-29 16:05:11.189
The blocking and the large amount of time spent in the connection pool seen in YourKit seem to be an artifact of the profiling itself.
Author: mdoering@gbif.org
Comment: Still seeing only 40-45% CPU saturation on boma
Created: 2012-06-29 16:22:42.923
Updated: 2012-06-29 16:22:42.923
Author: mdoering@gbif.org
Created: 2012-06-29 16:32:13.132
Updated: 2012-06-29 16:32:13.132
Jawa:
CentOS 6 (with ext4)
2 x Intel® Xeon® Processor E5520 http://ark.intel.com/products/40200/Intel-Xeon-Processor-E5520-(8M-Cache-2_26-GHz-5_86-GTs-Intel-QPI)
64 GB RAM
Hardware RAID Perc 6i controller
8 x 73GB SAS 15k drives, RAID 50
Boma:
CentOS 5 (with ext3)
2 x Intel® Xeon® Processor X7460 http://ark.intel.com/products/36947/Intel-Xeon-Processor-X7460-(16M-Cache-2_66-GHz-1066-MHz-FSB)
128 GB RAM
Hardware RAID Perc 6i controller
8 x 146GB SAS 15k drives; most probably RAID 10 but will have to take a peek at the controller to confirm (reboot required)
Author: trobertson@gbif.org
Created: 2012-06-29 16:35:20.076
Updated: 2012-06-29 16:35:20.076
I find it incredibly difficult to believe that finding out the disk arrangement on a Linux machine requires a reboot.
Author: trobertson@gbif.org
Created: 2012-06-29 18:23:55.382
Updated: 2012-06-29 18:23:55.382
Thanks to Cip:
The short version:
... it’s ... RAID 10 with a 64KB strip size.
The longer version:
Finding the actual disk configuration on a Linux machine where the disks are "hidden" beneath a hardware controller, with 100% certainty, is indeed difficult, as NATIVE Linux tools cannot talk directly to the storage controller.
In order to get that kind of low hardware-level access one must install the proprietary tools from the vendor; in this particular case NOT from Dell, as their tools are geared mainly towards Windows. LSI themselves provide some tools, but those tools bring their own libraries and precompiled binaries. :-s
Hence the dilemma: complicate the OS installation with additional libraries and precompiled binaries (again: NOT from the official distro) or simply stay with the controller BIOS option??! Considering this is only the second time we've had this problem since we bought this box 4 years ago, we thought it would be the safer approach to leave out the precompiled binaries from the vendor and rely on the BIOS.
… OK, to hell with it, I've now installed the LSI tools; it's confirmed: RAID 10 with a 64KB strip size.
Author: mdoering@gbif.org
Created: 2012-06-29 18:51:13.826
Updated: 2012-06-29 18:51:13.826
Improved the indexer to also use usage key ranges for all 4 components:
http://code.google.com/p/gbif-ecat/source/detail?r=5189
Let's see what performance we get now...
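A sketch of what such a range-based component query could look like for vernacular names, reusing the table and column names from the explain analyze quoted earlier; the class and the exact SQL are illustrative, not the code behind r5189:
{noformat}
// Hypothetical JDBC sketch: load one component (vernacular names) for a
// whole usage key range in a single statement instead of per usage.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class VernacularNameRangeLoader {

  private static final String RANGE_SQL =
      "SELECT v.usage_fk, vn.vernacular_name, v.language "
      + "FROM vernacular_name v "
      + "  LEFT JOIN vernacular_name_string vn ON v.name_string_fk = vn.id "
      + "WHERE v.usage_fk BETWEEN ? AND ? "
      + "ORDER BY v.usage_fk";

  public void load(Connection con, int fromKey, int toKey) throws SQLException {
    PreparedStatement st = con.prepareStatement(RANGE_SQL);
    try {
      st.setInt(1, fromKey);
      st.setInt(2, toKey);
      ResultSet rs = st.executeQuery();
      while (rs.next()) {
        // attach the vernacular name to the usage document for rs.getInt(1)
      }
    } finally {
      st.close(); // also closes the result set
    }
  }
}
{noformat}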
Author: mdoering@gbif.org
Comment: Oh well, it now takes 1 minute 10s to index a single 10,000 usage batch, and the complete index build is estimated to take only 55 minutes!!!
Created: 2012-06-29 19:01:16.159
Updated: 2012-06-29 19:01:40.127
Author: mdoering@gbif.org
Created: 2012-06-29 19:09:34.512
Updated: 2012-06-29 19:09:34.512
And now we start to see blocking in Solr/Lucene.
80% of the time is spent in Solr now. We could revive the multiple index writers in case we want some more speed, but for now I am inclined to leave it where it is.
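If more speed is ever needed, one alternative to multiple index writers would be a client that parallelizes the update requests, e.g. SolrJ's StreamingUpdateSolrServer from the Solr 3.x line of that time. This is a sketch under that assumption; the issue does not say which client the indexer actually uses, and the queue and thread values are guesses:
{noformat}
// Illustrative only: StreamingUpdateSolrServer buffers documents in a queue
// and flushes them with several background threads, easing client-side
// blocking on a single HTTP update path.
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelSolrWriter {

  public static void main(String[] args) throws Exception {
    // queue of 1000 documents, 4 background flush threads (guessed values)
    StreamingUpdateSolrServer solr =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "212");
    doc.addField("canonical_name", "example");
    solr.add(doc);             // returns quickly, flushed in the background

    solr.blockUntilFinished(); // wait for all queued updates to drain
    solr.commit();
  }
}
{noformat}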
Author: trobertson@gbif.org
Created: 2012-06-29 22:58:56.134
Updated: 2012-06-29 22:58:56.134
Nice. 29 mins on Jawa with 16 threads and 10k page sizes. Running 36 threads and 36 connections seemed to upset PostgreSQL (?): it started reporting "Maintenance mode" issues and lots of exceptions.
Author: trobertson@gbif.org
Created: 2012-06-29 23:02:05.5
Updated: 2012-06-29 23:02:05.5
Was still optimizing. Finished now:
INFO [2012-29-06 23:01:17,510+0200] [main] org.gbif.checklistbank.index.NameUsageIndexer: Species Index rebuild completed!
INFO [2012-29-06 23:01:17,511+0200] [main] org.gbif.checklistbank.index.NameUsageIndexer: Time taken run and finish all jobs: 0:32:37.615
Author: trobertson@gbif.org
Created: 2012-06-29 23:02:31.365
Updated: 2012-06-29 23:02:31.365
4 days to 30 minutes.
We can say this is closed.
Author: mdoering@gbif.org
Created: 2012-07-11 10:32:36.848
Updated: 2012-07-11 10:32:36.848
The last build on boma, expected to take 55 minutes, took 45 minutes in the end:
NameUsageIndexer: Time taken run and finish all jobs: 0:45:12.077
Author: mdoering@gbif.org
Created: 2012-07-17 12:44:36.988
Updated: 2012-07-17 12:44:36.988
The new boma setup with ext4 and a separate pg_xlog took 39 minutes to complete the Solr build.