Issue 18059

Verify new backbone is acceptable for production

18059
Reporter: mdoering
Assignee: mdoering
Type: Task
Summary: Verify new backbone is acceptable for production
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2015-12-01 12:56:32.652
Updated: 2016-04-20 10:23:16.317
Resolved: 2016-04-20 10:20:04.808
        
Description: When we create a new backbone we need to verify it is suitable for production.

Think and implement ways to ensure we do not regress our taxonomic backbone.

Consider implementing the existing NubAssertions class to test a newly generated backbone. If tests fail the new backbone will not be synced to postgres automatically. Add various tests to the assertion class with a focus on individual taxon tests verifying known problematic taxa, e.g. the Oenanthe homonyms and stable ids for some taxa.

In addition other tools to screen a backbone would be useful, e.g. smart diff visualizations

The most important way to track significant changes in the backbone is to look at the effect a new nub has on matched occurrences.
When a new candidate is ready we need to rematch all (distinct) occurrences and evaluate the changed matches and classifications. Drastic changes like a different kingdom need to be monitored (there are many bad taxa in our current backbone with a wrong kingdom, so such changes might actually be an improvement!)
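
A minimal sketch of the rematch evaluation described above, assuming the old and new match results are available as plain dictionaries keyed by the distinct occurrence name. The function name, the dictionary layout and the category names are hypothetical and are not part of any existing GBIF code.

```python
from collections import Counter


def summarize_match_changes(old_matches, new_matches):
    """Compare two {occurrence_name: (kingdom, matched_taxon)} mappings.

    Returns a Counter of change categories and a list of
    (name, old_kingdom, new_kingdom) tuples for kingdom changes.
    The data layout is an assumption made for this sketch.
    """
    counts = Counter()
    kingdom_changes = []
    for name, old in old_matches.items():
        new = new_matches.get(name)
        if new is None:
            counts["lost_match"] += 1
        elif new == old:
            counts["unchanged"] += 1
        elif new[0] != old[0]:
            # Drastic change: flag for manual review, it may be a fix or a regression
            counts["kingdom_changed"] += 1
            kingdom_changes.append((name, old[0], new[0]))
        else:
            counts["changed_within_kingdom"] += 1
    # Names that only the new backbone could match
    counts["new_match"] = sum(1 for name in new_matches if name not in old_matches)
    return counts, kingdom_changes
```

The kingdom changes are returned as a separate list because, as noted above, they need a human look: a changed kingdom can be a correction as well as a regression.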
    

Attachment DiffMerge.png


Attachment FileMerge.png


Attachment test-without-authors.png

Attachment families-col.html
Attachment families.html
Attachment test2b.gml
Attachment test2.edit
Attachment test2.gml
Attachment test2.html
Attachment test-without-authors2.gml
Attachment test-without-authors.dot
Attachment test-without-authors.edit
Attachment test-without-authors.gml
Attachment test-without-authors.html


Author: rdmpage
Comment: [~mdoering@gbif.org] Regarding "In addition other tools to screen a backbone would be useful, e.g. smart diff visualizations" I did some work a while back on computing the difference between two classifications http://dx.doi.org/10.1186/1471-2105-6-208. In principle this could be used to create visual diffs, although might want to apply to subtrees rather than entire nub (and/or at various levels). Could imagine doing this recursively and displaying a treemap with colour coding to flag changes (e.g., taxa which are in different positions in the two versions). Is this something that might be of interest?
Created: 2015-12-01 21:41:18.007
Updated: 2015-12-01 21:41:18.007


Author: mdoering@gbif.org
Comment: Thanks Rod, this is of interest indeed. Any visual ideas of how best to show diffs would also be great. Is it worth exporting our backbone to GML to try your tools?
Created: 2015-12-02 11:58:29.498
Updated: 2015-12-02 11:58:29.498


Author: rdmpage
Created: 2015-12-02 12:30:15.322
Updated: 2015-12-02 12:30:15.322
        
Was afraid you'd say that. Let me dig out the code and see if I can build it (C++ tends to suffer bit rot). The repository is here if you're curious: https://github.com/rdmpage/forest

I was thinking of perhaps comparing two trees, and being able to show what of the original tree has changed, and what the new tree has introduced. The edit-script idea might also be of interest in that it could be used to examine what taxa have been changed/deleted. Could use yEd to create a visualisation, GML

Scalability may be an issue, how about we have a couple of small- to mid-sized old-nub and new-nub examples, say with 100 or 1000 nodes? And/or might be useful to have taxa that we know have issues, such as the subtree rooted on "Aves" in the old and new nub (this will be bigger than 1000 nodes, obviously, but would be a good test). 
    


Author: mdoering@gbif.org
Created: 2015-12-02 13:16:40.238
Updated: 2015-12-02 13:16:40.238
        
Aves would be good, yes. If it's too small I'm not sure it is very helpful. A subtree of just 100 taxa is rather easy to compare with human eyes. It is the larger trees that cause trouble to the human eye. Detecting entire moved subtrees, e.g. a tribe, would be good.

Maybe a standard xml diff yields some insight? 
    


Author: rdmpage
Comment: The method I have in mind is pretty close to what an XML diff would do. I'll see if I can make some test files here using the version of nub that I have and see how the code performs. Agree that it would be nice to treat big trees, I'd just like to test this before I claim it's doable. 
Created: 2015-12-02 16:13:51.276
Updated: 2015-12-02 16:13:51.276


Author: rdmpage
Comment: Hand-made demo for small tree here: https://dl.dropboxusercontent.com/u/639486/f.html The difference between two trees is computed, then I've hand coloured the trees based on the diff script to see what it looks like. This step could also be automated, so that the user would simply need to provide two trees in GML format and the output would be browseable differences.
Created: 2015-12-03 16:15:11.732
Updated: 2015-12-03 16:15:11.732


Author: mdoering@gbif.org
Comment: This looks pretty neat, Rod. Does the script detect a moved entry by considering equal name strings the same?
Created: 2015-12-04 10:38:59.293
Updated: 2015-12-04 10:38:59.293


Author: mdoering@gbif.org
Created: 2015-12-04 10:41:20.152
Updated: 2015-12-04 10:41:20.152
        
I am experimenting here with regular diff tools:
https://github.com/mdoering/tree-diff

Simply using GitHub's diff and blame functionality is already pretty useful.
I have compared the original Mantodea tree with the nub version of it:

https://github.com/mdoering/tree-diff/tree/master/mantodea
https://github.com/mdoering/tree-diff/blame/master/mantodea/tree.txt
https://github.com/mdoering/tree-diff/commit/16f88f12e925d7f5cb22b06b9690aa00786bcbda
    


Author: mdoering@gbif.org
Created: 2015-12-04 11:00:43.037
Updated: 2015-12-04 11:04:26.112
        
Unfortunately GitHub fails to show larger diffs.
But local diff tools like DiffMerge or Apple's FileMerge do a good job of quickly visualising the differences.

!DiffMerge.png!

!FileMerge.png!
    


Author: rdmpage
Comment: Looks like standard diffs do a reasonable job. I might mess about with my tool a bit more out of curiosity. It assumes leaves with the same name are the same thing, but internal nodes can have different sets of descendants and still be the "same". It's computing the smallest number of edits needed to convert one thing into another.
Created: 2015-12-04 11:26:25.407
Updated: 2015-12-04 11:26:25.407


Author: mdoering@gbif.org
Comment: I can probably pretty quickly implement something that can dump a neo db into GML. Looking at the example GML you have here: https://github.com/rdmpage/forest/blob/master/example/ncbi_animal.gml is it valid if I omit all the graphics bits and just create nodes and edges with ids and labels?
Created: 2015-12-04 12:32:20.025
Updated: 2015-12-04 12:32:20.025


Author: rdmpage
Created: 2015-12-04 12:40:35.154
Updated: 2015-12-04 12:40:35.154
        
Yes, the extra stuff has been added by yEd to help layout the graph. The basic format is:

graph
[
  node
  [
   id A
   label "Node A"
  ]
  node
  [
   id B
   label "Node B"
  ]
  node
  [
   id C
   label "Node C"
  ]
   edge
  [
   source B
   target A
   label "Edge B to A"
  ]
  edge
  [
   source C
   target A
   label "Edge C to A"
  ]
]
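
A minimal sketch of a writer for the basic format above, assuming the tree is held in memory as a dictionary of labels plus a list of parent/child pairs. The function name `to_gml` and its arguments are hypothetical. It uses integer ids, writes all nodes before any edges, declares the graph as directed and points edges from parent to child, which is what forest turns out to require later in this thread.

```python
def to_gml(labels, parent_of):
    """Render a tree as minimal GML text.

    labels:    {node_id: label}, node ids assumed to be integers
    parent_of: iterable of (parent_id, child_id) pairs
    """
    lines = ["graph", "[", "  directed 1"]
    # All nodes first ...
    for node_id, label in labels.items():
        # Simplification for this sketch: avoid double quotes inside the label
        safe = label.replace('"', "'")
        lines += ["  node", "  [", f"    id {node_id}", f'    label "{safe}"', "  ]"]
    # ... then all edges, oriented parent -> child
    for parent, child in parent_of:
        lines += ["  edge", "  [", f"    source {parent}", f"    target {child}", "  ]"]
    lines.append("]")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_gml({1: "Node A", 2: "Node B", 3: "Node C"}, [(1, 2), (1, 3)]))
```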
    


Author: mdoering@gbif.org
Created: 2015-12-04 14:38:37.567
Updated: 2015-12-04 14:38:37.567
        
Hi Rod, I have added a GML export method and created 2 gml files for the Mantodea dataset, one original based on the source and one that results from a nub build with that source: https://github.com/mdoering/tree-diff/tree/master/mantodea

Wanna give a diff a try? It is about 3800 nodes large
    


Author: rdmpage
Comment: Sorry for the delay in responding, I've had to brush up on my C++ skills. Unfortunately we've stored trees in GML in pretty much exactly the opposite way :O My code assumes parent → child edges, whereas your trees have child → parent edges. I can probably get my code to work with these as well. I'm working on code to output diffs in HTML; once I've got that working I'll try some of my GML files (I've got Mammal Species of the World whales, for example, to compare with a version of the GBIF nub here), then I'll look at handling your GML files.
Created: 2015-12-08 12:27:14.499
Updated: 2015-12-08 12:27:14.499


Author: mdoering@gbif.org
Created: 2015-12-08 14:41:57.556
Updated: 2015-12-08 14:41:57.556
        
Thanks Rod. Let me know if you want me to create a GML with opposite edge directions. I am using these straight from neo4j, but I can invert them if you prefer that:
 - PARENT_OF
 - SYNONYM_OF
 - BASIONYM_OF

I can also create dumps (text, dwca or gml) for subtrees now easily and/or restrict them to ranks above a certain threshold.
For example it would be nice to compare the higher classification down to family alone
    


Author: rdmpage
Created: 2015-12-08 19:40:43.883
Updated: 2015-12-08 19:40:43.883
        
Markus, here's an automatically generated HTML view of the differences between two classifications, on the left the MSW for whales, on the right a GBIF classification. Colours show differences, grey labels represent nodes that are the same in both trees. Click on a shared or moved name to see the name in the other tree. All a bit crude, but would this be useful?

https://dl.dropboxusercontent.com/u/639486/n8.html

Oh, and the GBIF whale tree has a bunch of issues, such as duplication of taxa, the usual stray genera (mostly fossils, often doubtful names) and somehow the family Ziphiidae in GBIF doesn't include the type genus *Ziphius* !?

Code is all in C++. Will add/create github for this. Could create a web form where trees can be uploaded, or you could compile and do it on your machine.
    


Author: mdoering@gbif.org
Created: 2015-12-08 22:32:57.086
Updated: 2015-12-08 22:32:57.086
        
Nice, this is definitely useful, Rod. A simple text diff gets less and less useful the larger the tree to compare. Your output immediately shows the removed and added taxa which is great.

I can run the tool locally if I manage to compile it, thanks!
    


Author: rdmpage
Created: 2015-12-09 12:02:20.735
Updated: 2015-12-09 12:02:20.735
        
Hi Markus, OK, here's the code https://github.com/rdmpage/forest

This is awful, ancient C++ in a poorly laid out directory structure (the repository includes executables and other junk it shouldn't have), but in the interests of getting something working I thought I'd share it. To build this you need GTL as described in the README; there is a copy of the GTL distribution in the repository. Once you've installed that, the code should compile. Then it's a case of

./forest MSW/Cetacea.gml MSW/GBIF-Cetacea.gml > cet.txt

to compute difference between trees, and

./html MSW/Cetacea.gml MSW/GBIF-Cetacea.gml cet.txt > cet.html

to get web page.

Let me know how you get on. I need to add some error checking code (e.g., to test that the GML files are actually trees), etc. The whale example is interesting (I keep getting distracted by trying to chase up original descriptions of odd names). It's tempting to add C++ code to apply the various tests that I've been playing with in Neo4J, etc., to the GBIF tree to catch the problematic taxa. But I'm assuming/hoping your new nub-building code does that already?
    


Author: rdmpage
Comment: [~mdoering@gbif.org] Did you manage to get this working? I could package it as a web page if that would help.
Created: 2015-12-16 16:27:07.614
Updated: 2015-12-16 16:27:07.614


Author: mdoering@gbif.org
Comment: [~rdmpage], I got a local binary running and managed to produce the text and HTML diffs of the example. I'm still fighting small issues with my GML for larger trees like Aves or just the top-down hierarchy down to orders. Once done I'll report, it's a useful tool for sure!
Created: 2015-12-17 19:46:29.215
Updated: 2015-12-17 19:46:29.215


Author: mdoering@gbif.org
Created: 2015-12-18 11:15:58.709
Updated: 2015-12-18 11:17:48.88
        
[~rdmpage], I have created a diff from some test GML I generated. I have manually modified the second tree to just change the name of 2 nodes (Mantodea->Mantodela & Anasigerpes->Anasiperpes) and relinked a species (Hestiasula woodi) to a different genus.
https://dl.dropboxusercontent.com/u/457027/diffs/test.html

Do you have an explanation why so many nodes are marked blue as moved? Is it because the root taxon has changed? If so, why are just some descendants marked?

I've tried it again without the Mantodea change and it's still the same. Maybe it is because of authorships?
https://dl.dropboxusercontent.com/u/457027/diffs/test2.html

    


Author: mdoering@gbif.org
Comment: I'm getting the same output for canonical names: https://dl.dropboxusercontent.com/u/457027/diffs/test-without-authors.html
Created: 2015-12-18 11:24:04.469
Updated: 2015-12-18 11:24:04.469


Author: mdoering@gbif.org
Created: 2015-12-18 11:26:26.145
Updated: 2015-12-18 11:26:35.077
        
The edit script seems to get the deleted and inserted nodes right. But the changed branches (=blue moved?) seem wrong:
{noformat}
delete|node|Mantodea
delete|node|Anasigerpes
insert|node|Mantodela
insert|node|Anasiperpes
delete|branch|Acromantini|Anaxarcha
delete|branch|Acromantini|Chrysomantis
...
{noformat}
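
A minimal sketch for tallying such an edit script, assuming every line has the pipe-delimited shape shown above (operation, kind, then one or two labels). That shape is inferred from this excerpt only and is not a documented forest format; the function name is hypothetical.

```python
from collections import Counter


def summarize_edit_script(text):
    """Count (operation, kind) pairs in a pipe-delimited edit script.

    Lines with fewer than three fields (blank lines, elisions such as
    "...") are skipped.
    """
    counts = Counter()
    for line in text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 3:
            continue
        operation, kind = fields[0], fields[1]
        counts[(operation, kind)] += 1
    return counts
```

A large number of branch operations relative to node operations, as in the excerpt, is a quick hint that the two input trees were not read the way they were intended.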

    


Author: rdmpage
Comment: Hmmm, not sure what is happening here. Can you please send me the source GML files so that I can take a look and see if I can debug it? Sorry about this.
Created: 2015-12-18 11:53:42.322
Updated: 2015-12-18 11:53:42.322


Author: mdoering@gbif.org
Comment: Attached the GML test files and the resulting HTML and edit files
Created: 2015-12-18 12:30:06.906
Updated: 2015-12-18 12:30:06.906


Author: rdmpage
Comment: [~mdoering@gbif.org] There don't seem to be any attachments...?
Created: 2015-12-18 12:46:34.376
Updated: 2015-12-18 12:46:34.376


Author: mdoering@gbif.org
Comment: These are at the top, a bit hidden. Or here a link [^test-without-authors.gml],  [^test-without-authors2.gml],  [^test-without-authors.edit],  [^test-without-authors.html]
Created: 2015-12-18 12:51:12.819
Updated: 2015-12-18 12:51:12.819


Author: rdmpage
Comment: Ah, OK, now I see them.
Created: 2015-12-18 12:54:21.06
Updated: 2015-12-18 12:54:21.06


Author: rdmpage
Created: 2015-12-18 13:27:52.365
Updated: 2015-12-18 13:27:52.365
        
OK there's much weirdness going on. The files you sent aren't trees, in that yEd shows a disconnected graph. I've added code to forest to exit if the graph isn't a tree, and it also doesn't recognise these as tree files.

I think it might be an issue with the GML files, in that I think GML expects all the nodes to be declared first, then all the edges. You've interleaved edges and nodes, and I think this has bad consequences for the GML parser that I use and the yEd program. It looks like if an edge is defined before both nodes are declared, it's basically ignored by yEd.

I'll dig down deeper and try and make sense of what is going on.
    


Author: rdmpage
Created: 2015-12-18 14:29:30.733
Updated: 2015-12-18 14:29:30.733
        
I think the problem lies with the GML files. My code requires the graphs to be trees represented as directed graphs where edges go away from the root. The graph should be declared as directed ("directed 1") in the GML file. You might want to exclude the synonym relationships when generating the GML files (or reorient the edges so that it's still a tree), so that leaf nodes (terminal tips) always have an edge leading to them and are never the source of an edge. Also, trees can only have one edge between nodes, and all nodes must be declared before any edges are declared.
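
A minimal sketch of a pre-flight check for these requirements, assuming the nodes and parent → child edges have already been read into Python collections. It is not the check that forest itself performs, and the function name is hypothetical.

```python
def check_tree(nodes, edges):
    """Return a list of problems; an empty list means a rooted tree.

    nodes: iterable of node ids
    edges: iterable of (parent, child) pairs, pointing away from the root
    """
    nodes = set(nodes)
    problems = []
    parent = {}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        if src not in nodes or dst not in nodes:
            problems.append(f"edge {src}->{dst} uses an undeclared node")
            continue
        if dst in parent:
            # Covers duplicate edges and nodes with two parents
            problems.append(f"node {dst} has more than one incoming edge")
            continue
        parent[dst] = src
        children[src].append(dst)
    roots = [n for n in nodes if n not in parent]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
        return problems
    # Walk down from the root; anything not reached is in a cycle
    seen, stack = set(), [roots[0]]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children[node])
    unreachable = nodes - seen
    if unreachable:
        problems.append(f"{len(unreachable)} node(s) not reachable from the root")
    return problems
```

Child → parent edges show up here as a node with several incoming edges plus several roots, and mixed-in synonym relations typically show up as extra incoming edges or unreachable nodes.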

The free tool yEd http://www.yworks.com/products/yed is great for debugging graphs; you can quickly see if the graph is connected, and whether it looks like a tree.

Hope this helps.
    


Author: mdoering@gbif.org
Comment: Here is a graphviz view on the tree if interested !test-without-authors.png!
Created: 2015-12-18 16:59:56.529
Updated: 2015-12-18 16:59:56.529


Author: mdoering@gbif.org
Comment: [~rdmpage], I have replaced the uploaded files with the latest, showing all edges after nodes in the GML. But the diff is still weird
Created: 2015-12-18 17:09:59.504
Updated: 2015-12-18 17:10:19.699


Author: rdmpage
Comment: [~mdoering@gbif.org] It's still not a tree though :(  Acromantis javana is part of a cycle, and you can't have cycles in trees...
Created: 2015-12-18 17:23:31.605
Updated: 2015-12-18 17:23:31.605


Author: mdoering@gbif.org
Comment: Damn you are right. I need to completely leave out the pro parte and basionym relations.
Created: 2015-12-18 17:25:23.888
Updated: 2015-12-18 17:25:23.888


Author: rdmpage
Created: 2015-12-18 17:30:32.463
Updated: 2015-12-18 17:30:32.463
        
[~mdoering@gbif.org] And the GML files still have edges going in two different directions. The parent_of is fine, but the synonym_of, proparte_synonym_of and basionym_of are not.

Sorry to be pedantic, but tree has a very precise definition in the context of this code (and, indeed, graph theory). If the graph is directed then either everything ultimately goes back to the root, or everything comes from the root (the latter is what my code requires), and there are no cycles (closed loops).
    


Author: rdmpage
Created: 2015-12-18 17:53:55.476
Updated: 2015-12-18 17:53:55.476
        
[~mdoering@gbif.org] OK, I wrote a PHP script to reverse the edges and manually deleted the cycle. Soooooo close. Comparing the two trees reveals a problem. The code assumes every node label is unique, but Oxypiloidea and Oxypilus occur twice (I'm assuming as genera and subgenera). This -buggers- confuses the algorithm. If we can make the typical subgenus name unique then the algorithm should be happy.

    


Author: mdoering@gbif.org
Created: 2015-12-18 18:54:55.681
Updated: 2015-12-18 18:54:55.681
        
I have added the rank to each name and now the result is perfect and nice. I've attached new files called test2.
I think I'm good now to try the Aves backbone subtree as a real example, much obliged, Rod!
    


Author: rdmpage
Comment: Yay! I've added these examples to the github repo, and also tweaked the code to exit if the GML is not a tree. Fingers crossed for Aves...
Created: 2015-12-18 19:04:13.638
Updated: 2015-12-18 19:04:13.638


Author: mdoering@gbif.org
Created: 2015-12-30 16:17:36.636
Updated: 2015-12-30 16:19:10.532
        
I came to the conclusion it is best to do 2 kinds of diffs.
A tree diff for the taxonomy down to families and a simple sorted list of all genera, species and infraspecies with their family.
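
A minimal sketch of the second kind of diff, assuming the genera, species and infraspecies of each backbone are available as (family, name) pairs. It uses Python's standard `difflib`; the function name and the "old-nub" / "new-nub" file labels are placeholders.

```python
import difflib


def name_list_diff(old_rows, new_rows):
    """Diff two collections of (family, name) rows as sorted text lines.

    Returns unified-diff lines without surrounding context, so only
    added ("+") and removed ("-") rows appear besides the headers.
    """
    old_lines = sorted(f"{family}\t{name}" for family, name in old_rows)
    new_lines = sorted(f"{family}\t{name}" for family, name in new_rows)
    return list(
        difflib.unified_diff(
            old_lines, new_lines, "old-nub", "new-nub", n=0, lineterm=""
        )
    )
```

Because the family is part of each line, a name that moves to another family appears as one removed and one added row, which keeps such moves visible even in a flat list.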

Here is a tree diff for the new nub against the Catalogue of Life, which is the sole source for the higher taxonomy above families: [^families-col.html]
The CoL diff is as expected, just removing placeholder taxa and unsupported superfamily ranks.

A diff to the current live backbone looks like this: [^families.html]
There are far more changes, even phyla have changed. This should probably not come as a big surprise either, since the current classification is nearly 4 years old and CoL has advanced quite a bit since.

    


Author: rdmpage
Comment: Interesting. Would it help if the trees were alphabetically sorted? If so, I could look at adding this to the code. My sense from comparing CoL with new nub is that lots of changes are due to fossil taxa. Might be useful to flag fossil-only families as a way to help figure out whether differences are simply scope (CoL doesn't do fossils) or actual classification.
Created: 2015-12-30 23:45:13.119
Updated: 2015-12-30 23:45:13.119


Author: rdmpage
Created: 2015-12-31 12:53:49.174
Updated: 2015-12-31 12:53:49.174
        
[~mdoering@gbif.org] Just noticed another "bug". The "Not assigned" nodes within each taxonomic rank need not be unique (i.e., you can have multiple "Not assigned" nodes that are orders), and so the algorithm misses some changes (such as deletion of "Not assigned" nodes). Is it possible for your graph-generating code to create a unique suffix/code for "Not assigned" nodes? Either adding an incremental counter, or some random string to each one will do the trick.

If I were clever I'd add a test to the code to flag instances of non-unique names (and/or make them unique in the way described above).
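
A minimal sketch of how labels could be made unique before writing the GML, combining the rank suffix used earlier in this thread with an incremental counter for labels that still repeat. The function name and the `[rank]` / `#n` label formats are assumptions for illustration, not what the nub export or forest actually use.

```python
from collections import Counter


def make_labels_unique(nodes):
    """nodes: list of (name, rank) pairs. Returns one unique label per node.

    The rank is appended to every name; labels that still repeat
    (e.g. several "Not assigned" orders) also get a running counter.
    """
    base = [f"{name} [{rank}]" for name, rank in nodes]
    totals = Counter(base)
    seen = Counter()
    labels = []
    for label in base:
        if totals[label] > 1:
            seen[label] += 1
            label = f"{label} #{seen[label]}"
        labels.append(label)
    return labels
```

The counter depends on the order of the input, so the same placeholder node may receive a different number in the old and the new tree; such nodes would then show up as a delete plus an insert rather than as unchanged.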
    


Author: mdoering@gbif.org
Comment: Deemed good enough to go live with no major regressions but lots of improvements
Created: 2016-04-20 10:20:04.931
Updated: 2016-04-20 10:20:04.931


Author: rdmpage
Comment: [~mdoering@gbif.org] Congratulations! Huge amount of work herding taxonomic cats. Let me show my appreciation by seeing if I can poke holes in it ;)
Created: 2016-04-20 10:23:16.317
Updated: 2016-04-20 10:23:16.317