Issue 17600

Checklist indexing fails on Brazilian flora checklist

17600
Reporter: mdoering
Assignee: mdoering
Type: Bug
Summary: Checklist indexing fails on Brazilian flora checklist
Priority: Critical
Resolution: Fixed
Status: Closed
Created: 2015-05-27 16:31:13.708
Updated: 2015-05-29 21:06:53.113
Resolved: 2015-05-29 21:06:53.058
        
Description: checklist: http://www.gbif.org/dataset/aacd816d-662c-49d2-ad1a-97e66e2a2908

dwca: http://ipt.jbrj.gov.br/ipt/archive.do?r=lista_especies_flora_brasil

normalization error:
INFO  [2015-05-27 16:26:30,403+0200] [pool-9-thread-2] org.gbif.checklistbank.cli.common.NeoConfiguration: Starting embedded neo4j database from /home/crap/neo/aacd816d-662c-49d2-ad1a-97e66e2a2908
ERROR [2015-05-27 16:26:31,294+0200] [pool-9-thread-2] org.gbif.checklistbank.cli.common.RabbitBaseService: Failed to process dataset aacd816d-662c-49d2-ad1a-97e66e2a2908
java.util.NoSuchElementException: More than one element in org.neo4j.helpers.collection.ResourceClosingIterator$1@681d328f. First element is 'Node[32245]' and the second element is 'Node[32550]'
	at org.neo4j.helpers.collection.IteratorUtil.single(IteratorUtil.java:344) ~[checklistbank-cli.jar:2.15-SNAPSHOT]
	at org.neo4j.helpers.collection.IteratorUtil.singleOrNull(IteratorUtil.java:134) ~[checklistbank-cli.jar:2.15-SNAPSHOT]
	at org.neo4j.helpers.collection.IteratorUtil.singleOrNull(IteratorUtil.java:292) ~[checklistbank-cli.jar:2.15-SNAPSHOT]
	at org.gbif.checklistbank.cli.common.NeoRunnable.nodeBySciname(NeoRunnable.java:84) ~[checklistbank-cli.jar:2.15-SNAPSHOT]
	at org.gbif.checklistbank.cli.normalizer.Normalizer.setupParentRel(Normalizer.java:508) ~[checklistbank-cli.jar:2.15-SNAPSHOT]
    

Attachment Trimezia.png



Author: mdoering@gbif.org
Created: 2015-05-28 20:50:45.988
Updated: 2015-05-28 20:51:54.021
        
There are at least 2 issues responsible for the indexing failure.
The list uses only verbatim names to express parent and synonym relations. If a name is not unique within the dataset, normalization fails.
This is fixed here: https://github.com/gbif/checklistbank/commit/0e041a11e8803085f6b518c11269b9632af7f6c5
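
To illustrate the failure mode, here is a minimal sketch in plain Java. It is not the checklistbank code and not necessarily what the commit above does: `NameIndex` and its methods are hypothetical names, and an in-memory map stands in for the neo4j lookup. A strict "exactly one or fail" lookup throws as soon as two nodes share a name, much like the stack trace in the description, while a tolerant lookup lets the caller decide how to handle the ambiguity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

// Hypothetical in-memory stand-in for a lookup by scientific name.
final class NameIndex {
    private final Map<String, List<Long>> idsByName = new HashMap<>();

    // Registers a node id under its verbatim scientific name.
    void add(String scientificName, long nodeId) {
        idsByName.computeIfAbsent(scientificName, k -> new ArrayList<>()).add(nodeId);
    }

    // Returns all node ids carrying the given name (possibly none).
    List<Long> all(String scientificName) {
        return idsByName.getOrDefault(scientificName, Collections.emptyList());
    }

    // Strict lookup: fails if the name is shared by more than one node.
    Optional<Long> singleOrFail(String scientificName) {
        List<Long> ids = all(scientificName);
        if (ids.size() > 1) {
            throw new NoSuchElementException(
                "More than one node named '" + scientificName + "': " + ids);
        }
        return ids.isEmpty() ? Optional.empty() : Optional.of(ids.get(0));
    }

    // Tolerant lookup: empty for missing or ambiguous names, never throws.
    Optional<Long> uniqueOrEmpty(String scientificName) {
        List<Long> ids = all(scientificName);
        return ids.size() == 1 ? Optional.of(ids.get(0)) : Optional.empty();
    }
}
```

With the tolerant variant, a caller can fall back to `all(name)` and pick a candidate by some other rule, or flag the record, instead of aborting the whole dataset.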

In addition, there is a transaction issue with neo4j. The code assumes that the iterator over all nodes does not visit nodes that were created after the iteration started. That assumption does not hold if the transaction is renewed in between, which we do to allow performant processing of large datasets with hundreds of thousands or millions of nodes.

This should be fixed in this commit, which also adds a new neo TransactionTest:
https://github.com/gbif/checklistbank/commit/c060f1668d85b141ae2905e4893cb0d32012ce14
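
The iteration problem can be shown by analogy with a plain Java list. This is only a sketch under assumptions: it does not use the neo4j API, `SnapshotIteration` is a hypothetical name, and the commit above may solve the problem differently. A loop that re-reads the live collection also visits nodes added during processing, while a loop over a snapshot taken up front visits only the original nodes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical analogy: a growing list stands in for the node store.
final class SnapshotIteration {

    // Live view: the loop bound is re-read on every step, so nodes
    // appended while processing are visited as well.
    static int visitLive(List<String> nodes) {
        int visited = 0;
        for (int i = 0; i < nodes.size(); i++) {
            visited++;
            String node = nodes.get(i);
            // Create one derived node per original node.
            if (!node.startsWith("derived:")) {
                nodes.add("derived:" + node);
            }
        }
        return visited;
    }

    // Snapshot view: copy the nodes first, then iterate over the copy.
    // Nodes created during processing are stored but not visited.
    static int visitSnapshot(List<String> nodes) {
        List<String> snapshot = new ArrayList<>(nodes);
        int visited = 0;
        for (String node : snapshot) {
            visited++;
            nodes.add("derived:" + node);
        }
        return visited;
    }
}
```

Starting from three nodes, `visitLive` visits six while `visitSnapshot` visits three, although both leave six nodes in the list. For very large datasets the snapshot would hold ids rather than full nodes to keep memory use low.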

There is also a Flora of Brazil normalizer test now:
https://github.com/gbif/checklistbank/commit/c060f1668d85b141ae2905e4893cb0d32012ce14#diff-c6d5cff1b5ee900f7d1ef112c1383be0R907
    


Author: mdoering@gbif.org
Created: 2015-05-29 15:39:21.364
Updated: 2015-05-29 15:39:21.364
        
Also fixing an issue where an IllegalStateException is thrown during import, saying that parent usage xyz hasn't been imported yet.
This happens for a synonym that has both a parent and a synonym relationship. See Trimezia juncifolia in the attached neo graph screenshot.
Fixed in https://github.com/gbif/checklistbank/commit/876c5d5869af2eb541e8626038dcccc67192451d
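
One way to tolerate such records is to import every usage only after the usages it points to. The sketch below is an assumption for illustration, not the change made in the commit above: `ImportOrder`, `Usage` and the keys are hypothetical. It orders usages depth-first so that both the parent and the accepted usage of a synonym come before the synonym itself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: order usages so that dependencies are imported first.
final class ImportOrder {

    // A usage may reference a parent and, if it is a synonym, an accepted usage.
    static final class Usage {
        final String parentKey;   // may be null
        final String acceptedKey; // may be null
        Usage(String parentKey, String acceptedKey) {
            this.parentKey = parentKey;
            this.acceptedKey = acceptedKey;
        }
    }

    // Returns usage keys in an order where referenced usages precede referrers.
    static List<String> order(Map<String, Usage> usages) {
        List<String> ordered = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String key : usages.keySet()) {
            visit(key, usages, done, inProgress, ordered);
        }
        return ordered;
    }

    private static void visit(String key, Map<String, Usage> usages, Set<String> done,
                              Set<String> inProgress, List<String> ordered) {
        if (key == null || done.contains(key)) {
            return;
        }
        Usage usage = usages.get(key);
        if (usage == null) {
            return; // dangling reference, nothing to import
        }
        if (!inProgress.add(key)) {
            throw new IllegalStateException("Cyclic relation at usage " + key);
        }
        // Import both kinds of dependency before the usage itself.
        visit(usage.parentKey, usages, done, inProgress, ordered);
        visit(usage.acceptedKey, usages, done, inProgress, ordered);
        inProgress.remove(key);
        done.add(key);
        ordered.add(key);
    }
}
```

The recursion depth follows the length of the dependency chain, which stays small for taxonomic hierarchies; an explicit stack would be safer for arbitrary graphs.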

    


Author: mdoering@gbif.org
Comment: Went through fine in dev: http://www.gbif-dev.org/dataset/0a2f211d-a071-43d4-8211-0897032d15e6
Created: 2015-05-29 21:06:53.11
Updated: 2015-05-29 21:06:53.11