Issue 12336

Statistics: analyze counts for occurrences without a basis of record

12336
Reporter: ahahn
Assignee: jlegind
Type: Task
Summary: Statistics: analyze counts for occurrences without a basis of record
Priority: Major
Status: InProgress
Created: 2012-11-20 16:51:42.137
Updated: 2013-11-13 16:56:28.147
        
Description: Generate summary statistics for occurrence records that do have a basis_of_record = 0:
- dataset id; count of occurrence records total; count of occurrence records with basis_of_record = 0; percentage of these compared to total (calculated); count of records where occurrence_record.basis_of_record = 0 AND length(raw_occurrence_record.basis_of_record) > 1; value of data_resource.basis_of_record

The HIVE query to generate the statistics on the BoR issue are as of 21-11-2012:

SELECT  CONCAT(ROUND(((t2.cnt/t1.cnt)*100),1),'%') AS percent, t2.cnt, t2.data_resource_id, res.name, res.basis_of_record, raw.bor_cnt_raw FROM

(SELECT COUNT(id) AS cnt, data_resource_id FROM rollover_portal_occurrence_record
	GROUP BY data_resource_id) t1
JOIN

(SELECT COUNT(id) AS cnt, data_resource_id FROM rollover_portal_occurrence_record
	WHERE basis_of_record = 0
	GROUP BY data_resource_id) t2

ON t1.data_resource_id = t2.data_resource_id

JOIN

(SELECT id, name, basis_of_record FROM rollover_portal_data_resource_verbatim
	WHERE deleted IS NULL) res

ON t2.data_resource_id = res.id

LEFT OUTER JOIN

(SELECT COUNT(*) AS bor_cnt_raw, data_resource_id FROM rollover_portal_raw_occurrence_record_verbatim
	WHERE length(basis_of_record) > 0
	GROUP BY data_resource_id) raw

ON t2.data_resource_id = raw.data_resource_id]]>
    
Attachment Basis_of_record_statistic3.xls


Author: ahahn@gbif.org
Created: 2012-11-21 17:25:22.123
Updated: 2013-02-18 11:47:17.671
        
Summary generated from counts on index version October 2012 (rollover 38):

| coverage for basis of record (bor) (interpreted) per dataset    | harvested bor from original data? | bor value available at dataset level? | number of datasets concerned | sum of records |
|   missing throughout dataset | yes (but not interpretable)|no|661|34,039,286 |
| missing throughout dataset|none|no|274|28,248,511|
|missing throughout dataset|yes (but not interpretable)|yes|128|16,160,007|
|missing throughout dataset|none|yes|93|5,312,269|
|<=50%|yes (but not interpretable)|no|95|3,518,955|
|<=50%|yes (but not interpretable)|yes|7|250,557|
|>50%|(any)|(any)|533|795,938|

total record count: 389,117,719
--> Gains from linking up dataset metadata are not as large as expected - the main focus will have to be on working with publishers to provide and / or correct supplied values.
    


Author: ahahn@gbif.org
Comment: breakdown per data resource, Nov 21, 2012
Created: 2013-02-18 11:54:52.883
Updated: 2013-02-18 11:54:52.883