Issue 17861

Improve Checklist Pg Syncing

Reporter: mdoering
Assignee: mblissett
Type: Bug
Summary: Improve Checklist Pg Syncing
Priority: Critical
Status: Open
Created: 2015-10-07 11:08:22.381
Updated: 2015-12-14 18:48:40.209
        
Description: Checklist usage records are fully assembled in an intermediate neo4j database and then have to be synced to the main postgres checklist bank database. The clb-import CLI runs this sync for both nub builds and normalized checklists. The sync is currently rather slow and we should try to speed it up.

The syncing per record is initiated here in the Import CLI:
https://github.com/gbif/checklistbank/blob/master/checklistbank-cli/src/main/java/org/gbif/checklistbank/cli/importer/Importer.java#L169

This calls the following mybatis service method, which is the single entry point to the postgres syncing:
https://github.com/gbif/checklistbank/blob/master/checklistbank-mybatis-service/src/main/java/org/gbif/checklistbank/service/mybatis/DatasetImportServiceMyBatis.java#L146



Ideas for improvements:

 - store and compare hashes for the 4 main objects that need syncing (NameUsage, VerbatimNameUsage, NameUsageMetrics, UsageExtensions) before actually engaging in sql updates, so that unchanged objects can be skipped
 - parallelize syncing in the Importer. When iterating over the neo4j nodes in taxonomic order (from root usages to children), look at the number of descendants and put batches on a syncing queue.
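
A minimal sketch of the hash comparison idea, to make the proposal concrete. Everything here is an assumption for illustration: the class `UsageHashCache`, the in-memory map, and the idea of hashing a serialized string form of each object are hypothetical and not part of checklistbank. A real implementation would store the hash in postgres next to the record and would need one hash per synced object type.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: remembers the hash of the last synced state per usage key. */
final class UsageHashCache {
    // Assumption: an in-memory map stands in for a hash column in postgres.
    private final Map<Integer, String> lastSynced = new HashMap<>();

    /** SHA-256 of the serialized object, as a lowercase hex string. */
    static String sha256Hex(String serialized) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(serialized.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True if the object is new or differs from the last synced state. */
    boolean needsSync(int usageKey, String serialized) {
        return !sha256Hex(serialized).equals(lastSynced.get(usageKey));
    }

    /** Call only after the sql update succeeded, so failed syncs are retried. */
    void markSynced(int usageKey, String serialized) {
        lastSynced.put(usageKey, sha256Hex(serialized));
    }
}
```

Keeping `needsSync` and `markSynced` separate means a hash is only recorded once the corresponding update has actually been written.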
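
The batching idea could look roughly like the sketch below. It is written under stated assumptions, not taken from the Importer: `UsageNode`, `BatchSyncSketch`, and `syncOne` are hypothetical stand-ins, and the actual postgres sync is replaced by recording the key. The walk syncs large nodes inline and hands any subtree that fits into one batch to a thread pool, so every ancestor of a batch is already synced before the batch is queued.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a neo4j usage node; not a checklistbank class. */
final class UsageNode {
    final int key;
    final List<UsageNode> children = new ArrayList<>();

    UsageNode(int key) { this.key = key; }

    UsageNode add(UsageNode child) { children.add(child); return this; }

    /** Recomputed on each call for brevity; a real import would precompute this. */
    int descendants() {
        int n = 0;
        for (UsageNode c : children) n += 1 + c.descendants();
        return n;
    }
}

final class BatchSyncSketch {
    private final ExecutorService pool;
    private final int maxBatchSize;
    private final List<Future<?>> pending = new ArrayList<>();
    final Set<Integer> synced = ConcurrentHashMap.newKeySet();
    int batches = 0;

    BatchSyncSketch(int threads, int maxBatchSize) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.maxBatchSize = maxBatchSize;
    }

    /** Placeholder for the real per-record postgres sync. */
    private void syncOne(UsageNode n) { synced.add(n.key); }

    /** Syncs a whole subtree in one thread, parents before children. */
    private void syncSubtree(UsageNode n) {
        syncOne(n);
        for (UsageNode c : n.children) syncSubtree(c);
    }

    private void walk(UsageNode n) {
        if (1 + n.descendants() <= maxBatchSize) {
            // Small enough: queue the node and all its descendants as one batch.
            batches++;
            pending.add(pool.submit(() -> syncSubtree(n)));
        } else {
            // Too large: sync this node now, then look at each child subtree.
            syncOne(n);
            for (UsageNode c : n.children) walk(c);
        }
    }

    /** Walks the tree from the root and waits for all queued batches. */
    void run(UsageNode root) {
        try {
            walk(root);
            for (Future<?> f : pending) f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The batch size threshold and thread count would have to be tuned against the real database; the sketch says nothing about where the reported thread blocking comes from.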


Author: mdoering@gbif.org
Comment: What did the profiling yield? Any idea where the thread blocking came from? We should profile the latest code, which uses neo4j in read-only mode.
Created: 2015-12-14 18:48:40.209
Updated: 2015-12-14 18:48:40.209