Issue 14708

Review new Occurrence and VerbatimOccurrence classes

Reporter: kbraak
Assignee: kbraak
Type: SubTask
Summary: Review new Occurrence and VerbatimOccurrence classes
Priority: Critical
Resolution: Fixed
Status: Resolved
Created: 2014-01-15 16:51:44.054
Updated: 2014-01-23 15:02:10.706
Resolved: 2014-01-23 15:02:10.663
        
Description: Issues:

1. DwC term "coordinateUncertaintyInMeters" (called "coordinateAccurracyInMeters") used to be in the former Occurrence class, but is currently missing. I believe this is an important term to interpret, and should be included.

2. Occurrence class uses "coordinateAccurracy", whereas the equivalent DwC term is named "coordinateUncertainty". I remember we wanted to be consistent using "accuracy" vs "uncertainty". Therefore regarding issue 1) above, we would probably use "coordinateAccurracyInMeters" instead of the DwC term coordinateUncertaintyInMeters.

3. Occurrence class uses "identificationDate", whereas the equivalent DwC term is named "dateIdentified". I'm not aware of a reason to deviate from the DwC name.

4. Non-DwC term "hostCountry" used to be in the former Occurrence class, but is currently missing. This is not the same as "publishingOrgCountry", and should be added back in.

5. Non-DwC term "unitQualifier" used to be in the former Occurrence class, but is currently missing. Presumably this was used to store multiple identifications with BioCASE, but there is no javadoc in the former Occurrence class. Presumably this should be added back in.

6. Occurrence class includes DwC terms "waterBody" and "individualCounts", but these terms are not interpreted (I say this since they are of type String). Presumably these can be dropped.

7. Occurrence class does not interpret "geodeticDatum" (I say this since it is of type String). Presumably the datum should be an ENUM.

8. Occurrence class has field "lastInterpreted". VerbatimOccurrence class has field "lastCrawled". I believe these are the same thing, and the "lastInterpreted" field can be dropped.

9. Occurrence class includes fields "altitudeAccurracy" and "depthAccurracy", which are not DwC terms or ABCD terms as far as I am aware. Therefore, how can they be populated? If they can't be populated, they should be dropped.

10. Searches on collector name and collector number are highly sought after (see POR-1550). The DwC term for collector name is "recordedBy", and it should be included and interpreted somehow. I believe the DwC term for collector number is "recordNumber", and this should also be included and interpreted somehow. While we're at it, why not try to interpret the "scientificNameAuthor" or "identifierName" as well?

11. @Min @Max annotations are missing from the "year", "month", and "day" fields in Occurrence class. These used to be in the former Occurrence class.
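
To illustrate point 11, here is a minimal plain-Java sketch of the range checks that @Min/@Max constraints on "year", "month", and "day" would express. The class name, method names, and the year limits are assumptions made for illustration; the bounds used in the former Occurrence class are not quoted in this issue. As with Bean Validation's @Min/@Max, a null value is treated as valid.

```java
// Sketch only: class name, method names and the year bounds are assumptions,
// not taken from the former Occurrence class.
final class DatePartsSketch {

    // Hypothetical year limits; the real class may use different ones.
    static final int MIN_YEAR = 1500;
    static final int MAX_YEAR = 2100;

    /** Mirrors @Min/@Max semantics: null passes, otherwise min <= value <= max. */
    static boolean inRange(Integer value, int min, int max) {
        return value == null || (value >= min && value <= max);
    }

    /** Checks each date part independently; it does not validate real calendar dates. */
    static boolean isValid(Integer year, Integer month, Integer day) {
        return inRange(year, MIN_YEAR, MAX_YEAR)
            && inRange(month, 1, 12)   // comparable to @Min(1) @Max(12)
            && inRange(day, 1, 31);    // comparable to @Min(1) @Max(31)
    }

    public static void main(String[] args) {
        System.out.println(isValid(2014, 1, 15));      // true
        System.out.println(isValid(2014, 13, 15));     // false: month out of range
        System.out.println(isValid(null, null, null)); // true: nulls are not rejected
    }
}
```

Note that per-field limits like these cannot catch combinations such as 31 February; that would need a separate cross-field check.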

The comparison between the new and former Occurrence classes can be seen at http://tinyurl.com/extendedOcc. This spreadsheet also lists which terms are shown on the Occurrence detail page, and which are not.
    


Author: kbraak@gbif.org
Comment: [~mdoering@gbif.org] and [~omeyn@gbif.org] I found 11 issues with the new Occurrence and VerbatimOccurrence classes. Can you please review my list of issues, and provide me with your thoughts? Thanks
Created: 2014-01-15 17:41:10.629
Updated: 2014-01-15 17:41:10.629


Author: mdoering@gbif.org
Created: 2014-01-15 18:56:54.601
Updated: 2014-01-15 18:57:39.775
        
1+2: the class uses the single coordinateAccurracy to represent accuracy. We do not need to interpret the same thing in 2 different ways, do we?

3: let's rename to dateIdentified

4: host country got renamed to publishingOrgCountry, use this please. (Why didn't we call it publishingCountry as we do everywhere else? I would propose to rename it to publishingCountry.)

5: let's add back unitQualifier only if we know what it's used for and document it. BioCASE crawling related maybe?

6: no, we need to keep them, as they will get interpreted even though they are strings. We will clean up noise.

7: same with geodeticDatum, it's an interpreted string

8: we need both lastInterpreted and lastCrawled. The latter is when we crawled the raw record; the former is when we last ran the interpretations (which can happen several times, independent of crawling)

9: these are called "accurracy" for consistency and get populated by looking at min/maxElevationInMeters etc

10: Not sure, but so far we did not think about interpreting person names at all. recordedBy and recordNumber search will simply use the verbatim values. Is it reasonable to think we will do some cleanup and normalisation at least? Could try for spaces and non-alphanumerics at least. We should discuss person names further.

11: good catch!
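
The cleanup floated in point 10 (spaces and non-alphanumerics) could look roughly like the sketch below. This is only an illustration under assumptions: the class and method names are hypothetical, and treating punctuation as a separator (rather than deleting it) is a choice made here, not something decided in this thread.

```java
// Sketch only: hypothetical helper, not part of the GBIF occurrence code.
final class CollectorNameSketch {

    /**
     * Replaces every character that is not a letter, digit or whitespace
     * with a space, then trims and collapses runs of whitespace.
     * Returns null when nothing usable remains.
     */
    static String normalize(String verbatim) {
        if (verbatim == null) {
            return null;
        }
        String cleaned = verbatim
            .replaceAll("[^\\p{L}\\p{N}\\s]", " ") // punctuation and symbols become separators
            .trim()
            .replaceAll("\\s+", " ");              // collapse repeated whitespace
        return cleaned.isEmpty() ? null : cleaned;
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Smith,   J. ")); // Smith J
        System.out.println(normalize("No. 1234-a"));     // No 1234 a
        System.out.println(normalize("..."));            // null
    }
}
```

Case folding and real person-name parsing are deliberately left out, since the thread only proposes whitespace and non-alphanumeric cleanup.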
    


Author: kbraak@gbif.org
Created: 2014-01-16 17:09:47.669
Updated: 2014-01-16 17:09:47.669
        
1+2: How the coordinateAccuracy is calculated using the coordinateUncertaintyInMeters needs to be documented in our API. The "inMeters" is valuable in itself. There is surely going to be confusion if we expose both the interpreted coordinateAccuracy and the verbatim coordinateUncertaintyInMeters.

3: OK

4: host country is the country of the organization serving the dataset the occurrence belongs to. Publishing country is the country of the organization publishing the dataset the occurrence belongs to. I just want to make sure we don't want to capture the former, only the latter.

5: The question of why unitQualifier is currently being stored, and whether it is still needed, goes to [~omeyn@gbif.org].

6: If we are interpreting waterBody, why not island and islandGroup? A search filter by island name is requested in POR-1518.

7: OK

8: OK

9: I'm curious how you would calculate depth and depthAccuracy.

Let's say the data comes in as minimumDepthInMeters (e.g. 3 m) and maximumDepthInMeters (13 m). Keep in mind that in our API both these fields are of type Integer. How it is calculated, and whether the value is rounded up to the next greater integer value, should be documented in our API.

10: Definitely some cleanup and normalization. This is a definite area of improvement.

11: Thanks
    


Author: mdoering@gbif.org
Comment: 1) for the interpreted occurrence page we should only expose the single, interpreted coordinateAccuracy value, and only show the verbatim ones on the verbatim page.
Created: 2014-01-16 20:21:07.128
Updated: 2014-01-16 20:21:07.128


Author: mdoering@gbif.org
Comment: 4) hostCountry was a misnamed property for publishingCountry. We NEVER interpreted or dealt with the installation hosting organisation on occurrence records. So don't worry, let's rename the property to publishingCountry and use it where we used host before.
Created: 2014-01-16 20:22:51.145
Updated: 2014-01-16 20:22:51.145


Author: mdoering@gbif.org
Comment: 6) we could interpret all location terms of course, but when we started the widening we said these are the fields we would like to interpret and make searchable. In my mind it would be better to have a rankless location gazetteer at some point that only returns polygons for the names, and then search with those.
Created: 2014-01-16 20:26:53.728
Updated: 2014-01-16 20:26:53.728


Author: mdoering@gbif.org
Comment: 9) depth = (minDepth + maxDepth) / 2; depthAccuracy = (maxDepth - minDepth) / 2
Created: 2014-01-16 20:27:44.789
Updated: 2014-01-16 20:27:44.789


Author: kbraak@gbif.org
Created: 2014-01-17 10:00:26.361
Updated: 2014-01-17 10:00:26.361
        
Thanks for clarifying 1) 4) and 6).

Regarding 9), take an example calculation for depthAccuracy, e.g. (5 - 2) / 2 = 1.5. Are we going to just round up and keep the type as Integer, or preserve the decimal and use type Double?
    


Author: mdoering@gbif.org
Comment: 9) I would keep it as int. An accuracy of 1 meter is more than good enough, I think.
Created: 2014-01-17 10:34:46.858
Updated: 2014-01-17 10:34:46.858
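
The depth formulas and the int decision from the comments above can be sketched as follows. This is a hedged illustration, not the actual interpreter code: the class and method names are hypothetical, the inputs are assumed to satisfy minDepth <= maxDepth, and the thread does not settle whether halves are truncated or rounded up, so both variants are shown.

```java
// Sketch only: hypothetical names; assumes 0 <= minDepth <= maxDepth.
final class DepthSketch {

    /** Midpoint of the verbatim range; Java int division drops the fraction. */
    static int depth(int minDepth, int maxDepth) {
        return (minDepth + maxDepth) / 2;
    }

    /** Half the width of the range, fraction dropped (1.5 becomes 1). */
    static int depthAccuracyTruncated(int minDepth, int maxDepth) {
        return (maxDepth - minDepth) / 2;
    }

    /** Half the width of the range, rounded up to the next integer (1.5 becomes 2). */
    static int depthAccuracyRoundedUp(int minDepth, int maxDepth) {
        return (maxDepth - minDepth + 1) / 2;
    }

    public static void main(String[] args) {
        // Example from the thread: 3 m to 13 m
        System.out.println(depth(3, 13));                  // 8
        System.out.println(depthAccuracyTruncated(3, 13)); // 5
        // Example from the thread: (5 - 2) / 2 = 1.5
        System.out.println(depthAccuracyTruncated(2, 5));  // 1
        System.out.println(depthAccuracyRoundedUp(2, 5));  // 2
    }
}
```

Rounding up never understates the accuracy, while plain integer division is simpler; whichever is chosen would need to be stated in the API documentation, as requested earlier in the thread.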


Author: omeyn@gbif.org
Comment: We still need unitQualifier for the BioCASE crawling differentiation of multiple "unit"s for a single triplet. But we only need it at the fragment level to determine if it's a new record or an update, so let's leave it off the Occurrence unless we find we need it later.
Created: 2014-01-17 10:42:27.859
Updated: 2014-01-17 10:42:27.859


Author: kbraak@gbif.org
Created: 2014-01-23 11:49:19.272
Updated: 2014-01-23 11:49:19.272
        
Thanks for all the answers.

I just noticed that we misspell accurracy with 2 'r's, e.g. coordinateAccurracy. It should only have 1. I propose changing all such cases in our Occurrence model object.

E.g. coordinateAccurracy -> coordinateAccuracy 
    


Author: kbraak@gbif.org
Comment: Work complete. Closing issue.
Created: 2014-01-23 15:02:10.703
Updated: 2014-01-23 15:02:10.703