Issue 12336
Statistics: analyze counts for occurrences without a basis of record
12336
Reporter: ahahn
Assignee: jlegind
Type: Task
Summary: Statistics: analyze counts for occurrences without a basis of record
Priority: Major
Status: InProgress
Created: 2012-11-20 16:51:42.137
Updated: 2013-11-13 16:56:28.147
Description: Generate summary statistics for occurrence records that do have a basis_of_record = 0:
- dataset id; count of occurrence records total; count of occurrence records with basis_of_record = 0; percentage of these compared to total (calculated); count of records where occurrence_record.basis_of_record = 0 AND length(raw_occurrence_record.basis_of_record) > 1; value of data_resource.basis_of_record
The HIVE query to generate the statistics on the BoR issue are as of 21-11-2012:
SELECT CONCAT(ROUND(((t2.cnt/t1.cnt)*100),1),'%') AS percent, t2.cnt, t2.data_resource_id, res.name, res.basis_of_record, raw.bor_cnt_raw FROM
(SELECT COUNT(id) AS cnt, data_resource_id FROM rollover_portal_occurrence_record
GROUP BY data_resource_id) t1
JOIN
(SELECT COUNT(id) AS cnt, data_resource_id FROM rollover_portal_occurrence_record
WHERE basis_of_record = 0
GROUP BY data_resource_id) t2
ON t1.data_resource_id = t2.data_resource_id
JOIN
(SELECT id, name, basis_of_record FROM rollover_portal_data_resource_verbatim
WHERE deleted IS NULL) res
ON t2.data_resource_id = res.id
LEFT OUTER JOIN
(SELECT COUNT(*) AS bor_cnt_raw, data_resource_id FROM rollover_portal_raw_occurrence_record_verbatim
WHERE length(basis_of_record) > 0
GROUP BY data_resource_id) raw
ON t2.data_resource_id = raw.data_resource_id]]>
Attachment Basis_of_record_statistic3.xls
Author: ahahn@gbif.org
Created: 2012-11-21 17:25:22.123
Updated: 2013-02-18 11:47:17.671
Summary generated from counts on index version October 2012 (rollover 38):
| coverage for basis of record (bor) (interpreted) per dataset | harvested bor from original data? | bor value available at dataset level? | number of datasets concerned | sum of records |
| missing throughout dataset | yes (but not interpretable)|no|661|34,039,286 |
| missing throughout dataset|none|no|274|28,248,511|
|missing throughout dataset|yes (but not interpretable)|yes|128|16,160,007|
|missing throughout dataset|none|yes|93|5,312,269|
|<=50%|yes (but not interpretable)|no|95|3,518,955|
|<=50%|yes (but not interpretable)|yes|7|250,557|
|>50%|(any)|(any)|533|795,938|
total record count: 389,117,719
--> Gains from linking up dataset metadata are not as large as expected - the main focus will have to be on working with publishers to provide and / or correct supplied values.