  1. Portal
  2. POR-2986

Verify new backbone is acceptable for production

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Checklistbank
    • Labels:

      Description

      When we create a new backbone we need to verify it is suitable for production.

      Think of and implement ways to ensure we do not regress our taxonomic backbone.

      Consider using the existing NubAssertions class to test a newly generated backbone. If tests fail, the new backbone will not be synced to postgres automatically. Add various tests to the assertion class, with a focus on individual taxon tests that verify known problematic taxa, e.g. the Oenanthe homonyms, and stable ids for some taxa.
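
A minimal sketch of what such individual taxon checks could look like. It is written in Python for brevity and is not the NubAssertions API, which this issue does not show. The `Taxon` record, the `check_backbone` function and all keys are hypothetical; the only assumption about real data is that Oenanthe exists once as an animal genus and once as a plant genus.

```python
from typing import NamedTuple


class Taxon(NamedTuple):
    # Hypothetical, simplified record; real backbone usages carry far more fields.
    key: int
    name: str
    rank: str
    kingdom: str


def check_backbone(taxa, expected_homonyms, stable_keys):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    by_name = {}
    for t in taxa:
        by_name.setdefault((t.name, t.rank), []).append(t)

    # Homonym check: each listed name must exist exactly once per expected kingdom.
    for (name, rank), kingdoms in expected_homonyms.items():
        found = sorted(t.kingdom for t in by_name.get((name, rank), []))
        if found != sorted(kingdoms):
            failures.append(
                f"{name} ({rank}): expected kingdoms {sorted(kingdoms)}, found {found}"
            )

    # Stable id check: a known key must still point at the same name.
    by_key = {t.key: t for t in taxa}
    for key, name in stable_keys.items():
        actual = by_key.get(key)
        if actual is None or actual.name != name:
            failures.append(
                f"key {key}: expected {name}, found {actual.name if actual else None}"
            )
    return failures


# Illustrative data only; the keys are made up.
candidate = [
    Taxon(101, "Oenanthe", "GENUS", "Animalia"),
    Taxon(102, "Oenanthe", "GENUS", "Plantae"),
    Taxon(103, "Puma concolor", "SPECIES", "Animalia"),
]
failures = check_backbone(
    candidate,
    expected_homonyms={("Oenanthe", "GENUS"): ["Animalia", "Plantae"]},
    stable_keys={103: "Puma concolor"},
)
print("OK to sync" if not failures else "\n".join(failures))
```

Returning messages instead of raising on the first failure lets a single run report every regression, and the sync step would then be gated on the list being empty.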

      In addition, other tools to screen a backbone would be useful, e.g. smart diff visualizations.

      The most important way to track significant changes in the backbone is the effect a new nub has on matched occurrences.
      When a new candidate is ready, we need to rematch all (distinct) occurrences and evaluate the changed matches and classifications. Drastic changes, such as a different kingdom, need to be monitored (there are many bad taxa in our current backbone with a wrong kingdom, so such changes might actually be an improvement!).
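
The evaluation of changed matches could be sketched roughly as follows. This assumes both match runs are available as mappings from a distinct occurrence name to a (taxon key, kingdom) pair; that layout, the function name `compare_matches` and the sample data are hypothetical.

```python
from collections import Counter


def compare_matches(old, new):
    """Compare two match results keyed by distinct occurrence name.

    Each value is a hypothetical (taxon_key, kingdom) pair, or None for no match.
    Returns a Counter of change types and the list of names whose kingdom changed.
    Names that only appear in `new` are ignored; both runs should cover the same names.
    """
    summary = Counter()
    kingdom_changes = []
    for name, before in old.items():
        after = new.get(name)
        if before == after:
            summary["unchanged"] += 1
        elif before is None:
            summary["newly matched"] += 1
        elif after is None:
            summary["lost match"] += 1
        elif before[1] != after[1]:
            # Drastic change: needs review, but may also fix a wrong kingdom.
            summary["kingdom changed"] += 1
            kingdom_changes.append((name, before[1], after[1]))
        else:
            summary["taxon changed"] += 1
    return summary, kingdom_changes


# Made-up matches against the live backbone and against a candidate.
old_matches = {
    "Oenanthe oenanthe": (1, "Plantae"),
    "Puma concolor": (2, "Animalia"),
    "Abies alba": None,
    "Quercus robur": (4, "Plantae"),
}
new_matches = {
    "Oenanthe oenanthe": (10, "Animalia"),
    "Puma concolor": (2, "Animalia"),
    "Abies alba": (3, "Plantae"),
    "Quercus robur": (5, "Plantae"),
}
summary, kingdom_changes = compare_matches(old_matches, new_matches)
for change, count in summary.most_common():
    print(change, count)
for name, before, after in kingdom_changes:
    print(f"REVIEW {name}: {before} -> {after}")
```

Kingdom changes are listed individually rather than only counted, since each one has to be judged as either a regression or a correction.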

        Attachments

        1. families.html
          5.60 MB
          Markus Döring
        2. families-col.html
          2.55 MB
          Markus Döring
        3. test2.edit
          0.9 kB
          Markus Döring
        4. test2.gml
          41 kB
          Markus Döring
        5. test2.html
          119 kB
          Markus Döring
        6. test2b.gml
          41 kB
          Markus Döring
        7. test-without-authors.dot
          25 kB
          Markus Döring
        8. test-without-authors.edit
          9 kB
          Markus Döring
        9. test-without-authors.gml
          46 kB
          Markus Döring
        10. test-without-authors.html
          79 kB
          Markus Döring
        11. test-without-authors2.gml
          46 kB
          Markus Döring
        1. DiffMerge.png
          201 kB
        2. FileMerge.png
          166 kB
        3. test-without-authors.png
          2.11 MB
        4. test-without-authors.png
          254 kB

        Issue Links

          Activity

          Markus Döring added a comment (edited)

          I came to the conclusion it is best to do 2 kinds of diffs:
          a tree diff for the taxonomy down to families, and a simple sorted list of all genera, species and infraspecies with their family.

          Here is a tree diff for the new nub against the Catalogue of Life, which is the sole source for the higher taxonomy above families: families-col.html
          The CoL diff is as expected, just removing placeholder taxa and unsupported superfamily ranks.

          A diff to the current live backbone looks like this: families.html
          There are far more changes; even phyla have changed. This should probably not come as a big surprise, since the current classification is nearly 4 years old and CoL has advanced quite a bit since.
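
The second kind of diff, the sorted list of names with their family, could be produced along these lines. This is only a sketch using Python's standard `difflib`; the record layout, the function names and the sample rows are assumptions, and the tree diff itself is not covered here.

```python
import difflib


def family_name_lines(taxa):
    """Render (family, rank, name) records as sorted 'family<TAB>rank<TAB>name' lines."""
    return sorted(f"{family}\t{rank}\t{name}" for family, rank, name in taxa)


def list_diff(old_taxa, new_taxa):
    """Unified diff of the two sorted lists, without surrounding context lines."""
    return list(difflib.unified_diff(
        family_name_lines(old_taxa),
        family_name_lines(new_taxa),
        fromfile="live",
        tofile="candidate",
        lineterm="",
        n=0,
    ))


# Made-up rows: one genus moves to another family between the two backbones.
live = [
    ("Felidae", "GENUS", "Puma"),
    ("Felidae", "SPECIES", "Puma concolor"),
    ("Turdidae", "GENUS", "Oenanthe"),
]
candidate_rows = [
    ("Felidae", "GENUS", "Puma"),
    ("Felidae", "SPECIES", "Puma concolor"),
    ("Muscicapidae", "GENUS", "Oenanthe"),
]
diff = list_diff(live, candidate_rows)
print("\n".join(diff))
```

Putting the family first in each line means a taxon that moves between families shows up as one removed and one added line, which keeps the diff readable in tools such as DiffMerge or FileMerge.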

          Roderic D. M. Page added a comment -

          Interesting. Would it help if the trees were alphabetically sorted? If so, I could look at adding this to the code. My sense from comparing CoL with the new nub is that lots of changes are due to fossil taxa. It might be useful to flag fossil-only families as a way to help figure out whether differences are simply scope (CoL doesn't do fossils) or actual classification.
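
Both suggestions can be sketched in a few lines. The nested-dict tree layout, the `is_extinct` flag and the function names below are hypothetical; where an extinct flag would come from in the real data is left open.

```python
def sort_tree(node):
    """Return a copy of a nested {name: children} tree with siblings in alphabetical order."""
    return {name: sort_tree(children) for name, children in sorted(node.items())}


def fossil_only_families(families):
    """Names of families whose member species are all flagged as extinct.

    `families` maps a family name to a list of (species, is_extinct) pairs.
    Families without any species are not reported.
    """
    return sorted(
        family
        for family, species in families.items()
        if species and all(extinct for _, extinct in species)
    )


# Made-up tree and family contents for illustration.
tree = {
    "Plantae": {"Tracheophyta": {}, "Bryophyta": {}},
    "Animalia": {"Chordata": {}, "Arthropoda": {}},
}
sorted_tree = sort_tree(tree)
print(list(sorted_tree), list(sorted_tree["Animalia"]))

families = {
    "Tyrannosauridae": [("Tyrannosaurus rex", True)],
    "Felidae": [("Puma concolor", False), ("Smilodon fatalis", True)],
}
flagged = fossil_only_families(families)
print(flagged)
```

Sorting siblings before diffing removes differences that are purely due to child order, so the remaining changes are more likely to be real.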

          Roderic D. M. Page added a comment -

          Markus Döring Just noticed another "bug". The "Not assigned" nodes within each taxonomic rank need not be unique (i.e., you can have multiple "Not assigned" nodes that are orders), and so the algorithm misses some changes (such as deletion of "Not assigned" nodes). Is it possible for your graph-generating code to create a unique suffix/code for "Not assigned" nodes? Either adding an incremental counter or some random string to each one will do the trick.

          If I was clever I'd add a test to the code to flag instances of non-unique names (and/or make them unique in the way described above).
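
A minimal sketch of both ideas, the uniqueness test and the incremental counter. It assumes node labels are available as a flat list in traversal order; the function names are hypothetical and this is not the actual graph-generating or diff code.

```python
from collections import Counter


def non_unique_names(names):
    """Return the labels that occur more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


def make_unique(names, placeholder="Not assigned"):
    """Append an incremental counter to every occurrence of the placeholder label."""
    counter = 0
    result = []
    for name in names:
        if name == placeholder:
            counter += 1
            name = f"{placeholder} {counter}"
        result.append(name)
    return result


# Made-up labels of order-level nodes.
labels = ["Not assigned", "Carnivora", "Not assigned"]
print(non_unique_names(labels))
unique_labels = make_unique(labels)
print(unique_labels)
```

A plain counter depends on traversal order, so the same node can receive a different suffix in two graphs; whether that is stable enough for diffing is not addressed here.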

          Markus Döring added a comment -

          Deemed good enough to go live, with no major regressions but lots of improvements.

          Roderic D. M. Page added a comment -

          Markus Döring Congratulations! Huge amount of work herding taxonomic cats. Let me show my appreciation by seeing if I can poke holes in it.


            People

            • Assignee:
              Markus Döring
              Reporter:
              Markus Döring