Issue 11905

Remove duplicate access point URLs from portal db

Reporter: ahahn
Assignee: ahahn
Type: Task
Summary: Remove duplicate access point URLs from portal db
Priority: Major
Status: Open
Created: 2012-09-14 14:01:12.803
Updated: 2013-02-19 10:59:26.596
        
Description: Statistics on the usage of different protocols currently rely on evaluating the registered access point URLs. For some datasets, more than one access point is registered and active (i.e. not flagged as deleted) even though the dataset is not indexed twice, so the counts are overestimated. After checking, the inactive (non-indexed) access points need to be flagged as deleted in the data portal. The following query lists the affected datasets:

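-- data resources with more than one active (non-deleted) access point
select dr.id, dr.data_provider_id, dr.name, count(dr.id)
from data_resource dr
   join resource_access_point rap on dr.id = rap.data_resource_id
where dr.deleted is null
and rap.deleted is null
group by dr.id, dr.data_provider_id, dr.name
having count(dr.id) > 1;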

+-------+------------------+--------------------------------------------------------------+--------------+
| id    | data_provider_id | name                                                         | count(dr.id) |
+-------+------------------+--------------------------------------------------------------+--------------+
|  2593 |              261 | ECOCEAN Whale Shark Photo-identification Library             |            2 |
|  8402 |              303 | Royal Botanic Garden Edinburgh Herbarium (E)                 |            2 |
|  9106 |              143 | Australian National Wildlife Collection provider for OZCAM   |            2 |
|  9167 |              303 | Royal Botanic Garden Edinburgh Living Plant Collections (E)  |            2 |
| 12994 |              143 | CSIRO Ichthyology provider for OZCAM                         |            2 |
| 14384 |              461 | Invertebrates of the Gothenburg Natural History Museum (GNM) |            2 |
+-------+------------------+--------------------------------------------------------------+--------------+
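
To decide which of the registered access points is the inactive one, it can help to list the individual rows behind each count. The following is a minimal sketch only, using Python's sqlite3 with an in-memory stand-in for the portal tables. The `id` and `url` columns on resource_access_point, the sample rows, and the function name are assumptions; only the columns used in the query above are taken from this issue.

```python
import sqlite3

# In-memory stand-in for the portal db. Only the columns used in the
# query above come from the issue; rap.id and rap.url are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_resource (
    id INTEGER PRIMARY KEY, data_provider_id INTEGER, name TEXT, deleted TEXT);
CREATE TABLE resource_access_point (
    id INTEGER PRIMARY KEY, data_resource_id INTEGER, url TEXT, deleted TEXT);
INSERT INTO data_resource VALUES
    (1, 10, 'Example dataset A', NULL),
    (2, 10, 'Example dataset B', NULL);
INSERT INTO resource_access_point VALUES
    (101, 1, 'http://example.org/digir', NULL),
    (102, 1, 'http://example.org/ipt', NULL),
    (103, 2, 'http://example.org/tapir', NULL),
    (104, 2, 'http://example.org/old', '2012-09-19');
""")


def list_duplicate_access_points(conn):
    """Return (resource id, access point id, url) for every active access
    point of a resource that has more than one active access point."""
    return conn.execute("""
        SELECT rap.data_resource_id, rap.id, rap.url
        FROM resource_access_point rap
        JOIN data_resource dr ON dr.id = rap.data_resource_id
        WHERE dr.deleted IS NULL
          AND rap.deleted IS NULL
          AND rap.data_resource_id IN (
              SELECT data_resource_id
              FROM resource_access_point
              WHERE deleted IS NULL
              GROUP BY data_resource_id
              HAVING COUNT(*) > 1)
        ORDER BY rap.data_resource_id, rap.id
    """).fetchall()


rows = list_duplicate_access_points(conn)
for row in rows:
    print(row)
```

In the sample data only dataset 1 is listed, because dataset 2 has one of its two access points already flagged as deleted.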
    


Author: ahahn@gbif.org
Created: 2012-09-19 17:25:11.558
Updated: 2012-09-19 17:25:11.558
        
Cleaned up most, but Gothenburg (14384) still persists because of a metadata update issue.

8402 and 9167: both access points remain, as they are for different protocols
9106 and 12994: the access points were for different protocols, but TAPIR no longer exists in the registry - removed
2593: two different protocols, both still valid; but see http://dev.gbif.org/issues/browse/DM-34
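
The clean-up step itself amounts to setting the `deleted` marker on the access point that is no longer served, then re-running the count. The following is a rough sketch under stated assumptions: the issue only shows that `deleted` is NULL for active rows, so storing a UTC timestamp there, the `id` column on resource_access_point, the sample rows, and the helper name `flag_access_point_deleted` are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory stand-in for the portal tables (sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_resource (
    id INTEGER PRIMARY KEY, data_provider_id INTEGER, name TEXT, deleted TEXT);
CREATE TABLE resource_access_point (
    id INTEGER PRIMARY KEY, data_resource_id INTEGER, url TEXT, deleted TEXT);
INSERT INTO data_resource VALUES (1, 10, 'Example dataset A', NULL);
INSERT INTO resource_access_point VALUES
    (101, 1, 'http://example.org/ipt', NULL),
    (102, 1, 'http://example.org/tapir', NULL);
""")

# Same logic as the query in the description, reduced to id and count.
DUPLICATES_SQL = """
SELECT dr.id, COUNT(dr.id)
FROM data_resource dr
JOIN resource_access_point rap ON dr.id = rap.data_resource_id
WHERE dr.deleted IS NULL AND rap.deleted IS NULL
GROUP BY dr.id
HAVING COUNT(dr.id) > 1
"""


def flag_access_point_deleted(conn, access_point_id):
    """Soft-delete one access point; returns the number of rows changed."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE resource_access_point SET deleted = ? "
            "WHERE id = ? AND deleted IS NULL",
            (now, access_point_id),
        )
    return cur.rowcount


before = conn.execute(DUPLICATES_SQL).fetchall()
changed = flag_access_point_deleted(conn, 102)  # 102: the retired endpoint
after = conn.execute(DUPLICATES_SQL).fetchall()
print(before, changed, after)
```

Restricting the update to rows where `deleted IS NULL` keeps an earlier deletion marker from being overwritten if the helper is called twice.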

    


Author: ahahn@gbif.org
Comment: Next step: re-run the script, pick up on new cases by running metadata updaters
Created: 2012-09-19 17:26:19.576
Updated: 2012-09-19 17:26:19.576


Author: ahahn@gbif.org
Created: 2012-11-20 16:34:08.342
Updated: 2012-11-20 16:34:08.342
        
on re-run 20.11.2012:

+-------+------------------+---------------------------------------------------------------------------------------------+--------------+
| id    | data_provider_id | name                                                                                        | count(dr.id) |
+-------+------------------+---------------------------------------------------------------------------------------------+--------------+
|   692 |              139 | EIS_ORTHOPTERA                                                                              |            2 |
|  1602 |               51 | Fishbase                                                                                    |            2 |
|  2593 |              261 | ECOCEAN Whale Shark Photo-identification Library                                            |            2 |
|  7953 |              139 | Naturalis National Natural History Museum (NL) – Crustacea_Decapoda                         |            2 |
|  7984 |              281 | Herbarium                                                                                   |            2 |
|  8076 |              285 | Rapid Assessment Program (RAP) Biodiversity Survey Database                                 |            2 |
|  8402 |              303 | RBGE Herbarium (E)                                                                          |            2 |
|  9153 |              319 | Consortium of California Herbaria                                                           |            2 |
|  9167 |              303 | RBGE Living Collections                                                                     |            2 |
| 10860 |              267 | Flora of tanzania                                                                           |            2 |
| 11290 |              139 | Zoological Museum Amsterdam, University of Amsterdam (NL) – Mollusca_Types                  |            2 |
| 11291 |              139 | Zoological Museum Amsterdam, University of Amsterdam (NL) – Mollusca_Conidae                |            2 |
| 11505 |              139 | Taxonomic revision of the spider family Penestomidae (Araneae, Entelegynae) – Specimen data |            2 |
| 11951 |              130 | Bolus Herbarium                                                                             |            2 |
| 11953 |              130 | SABCA                                                                                       |            2 |
| 11955 |              130 | Ditsong Museum                                                                              |            2 |
| 11956 |              130 | SAIAB                                                                                       |            2 |
| 12489 |              139 | Zoological Museum Amsterdam, University of Amsterdam (NL) – Mammalia                        |            2 |
| 12714 |              130 | Millenium Seedbank (MSB)                                                                    |            2 |
| 12715 |              130 | PRECIS (KwaZulu-Natal Herbarium)                                                            |            2 |
| 12716 |              130 | PRECIS                                                                                      |            2 |
| 14384 |              461 | Invertebrates of the Gothenburg Natural History Museum (GNM)                                |            2 |
+-------+------------------+---------------------------------------------------------------------------------------------+--------------+
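
To pick out the new cases on a re-run, the ids of two runs can be compared as sets. This is a small sketch only: the ids are copied from the initial result in the description and from the 20.11.2012 re-run, and the variable names are illustrative.

```python
# Ids from the initial run (see the description).
first_run = {2593, 8402, 9106, 9167, 12994, 14384}

# Ids from the re-run on 20.11.2012.
rerun = {
    692, 1602, 2593, 7953, 7984, 8076, 8402, 9153, 9167, 10860, 11290,
    11291, 11505, 11951, 11953, 11955, 11956, 12489, 12714, 12715, 12716,
    14384,
}

new_cases = sorted(rerun - first_run)    # appeared since the first run
resolved = sorted(first_run - rerun)     # no longer reported
still_open = sorted(first_run & rerun)   # reported in both runs

print("new:", new_cases)
print("resolved:", resolved)
print("still open:", still_open)
```

With these ids, 9106 and 12994 no longer appear, while 2593, 8402, 9167 and 14384 are reported in both runs.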

    


Author: ahahn@gbif.org
Created: 2013-02-19 10:56:29.402
Updated: 2013-02-19 10:58:35.648
        
| id    | data_provider_id | name | count(dr.id) |

|  1102 |               34 | Centre for Genetic Resources, the Netherlands, PGR passport data | 2 |
--> This dataset is indeed served via both BioCASe (http://gbrds.gbif.org/browse/agent?uuid=60471722-f762-11e1-a439-00145eb45e9a) and IPT (http://gbrds.gbif.org/browse/agent?uuid=85796928-f762-11e1-a439-00145eb45e9a), though the BioCASe endpoint is inactive.

|  1827 |              214 | Harvard University Herbaria | 2 |
--> Duplicate dataset registration through the same organisation (http://gbrds.gbif.org/browse/agent?uuid=485ff490-e3b7-11db-9acc-b8a03c50a862). One IPT access point is linked to a "serves" relationship with a DiGIR installation – needs resolving (http://gbrds.gbif.org/browse/agent?uuid=601f403a-f762-11e1-a439-00145eb45e9a).

| 11505 |              475 | Naturalis Biodiversity Center (NL) - Taxonomic revision of the spider family Penestomidae (Araneae, Entelegynae) - Specimen data | 2 |
--> Indeed has two access points registered (TAPIR, http://gbrds.gbif.org/browse/agent?uuid=605bcad2-f762-11e1-a439-00145eb45e9a, and IPT) - legitimate.