Issue 12407

Clean up on datasets deleted from the registry

12407
Reporter: ahahn
Assignee: ahahn
Type: Task
Summary: Clean up on datasets deleted from the registry
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-11-26 11:17:07.34
Updated: 2012-12-13 15:11:06.721
Resolved: 2012-11-29 14:33:38.013
        
Description: Follow up registry deletions (synchronizer) with
- run batch deletion script for datasets from publisher 145
- double-check that other deleted dataset indeed should be gone
- check whether a duplicate entry got created a) in the registry (new uuid) / b) in the index
- in case of duplicates a), update the index of the old entry with the new uuid to keep indexing into the existing dataset, b) remove the duplicate

Pangea (data_provider_id=145)

531 datasets have been deleted. All of them had reported a total of 37,216 records

Here is a list of the IDs of all these datasets, to make it easier to feed them into a deletion script:


8759, 8760, 8761, 8762, 8763, 8764, 8765, 8766, 8767, 8768, 8770, 8771, 8772, 8773, 8774, 8775, 8793, 8794, 8809, 8810, 8811, 8812, 8813, 8814, 8815, 8816, 8817, 8818, 8819, 8820, 8821, 8822, 8823, 8862, 8953, 8954, 9212, 9226, 9227, 9228, 9229, 9258, 9259, 9260, 9261, 9262, 9263, 9264, 9265, 9266, 9267, 9268, 9269, 9270, 9271, 9272, 9273, 9274, 9275, 9276, 9277, 9278, 9279, 9280, 9281, 9282, 9283, 9284, 9285, 9286, 9287, 9288, 9289, 9297, 9298, 9299, 9300, 9301, 9302, 9303, 9304, 9305, 9306, 9307, 9352, 9353, 9868, 9869, 9870, 9871, 9872, 9873, 9896, 9897, 9990, 9991, 9993, 9994, 9995, 9996, 9997, 9998, 9999, 10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009, 10010, 10031, 10032, 10033, 10034, 10035, 10036, 10037, 10038, 10039, 10040, 10041, 10042, 10043, 10044, 10045, 10046, 10047, 10048, 10049, 10050, 10051, 10052, 10053, 10054, 10055, 10056, 10057, 10058, 10059, 10060, 10061, 10062, 10063, 10064, 10065, 10066, 10067, 10068, 10069, 10070, 10071, 10072, 10073, 10074, 10075, 10076, 10077, 10078, 10079, 10080, 10131, 10132, 10133, 10134, 10135, 10136, 10137, 10138, 10139, 10140, 10141, 10142, 10143, 10144, 10145, 10146, 10147, 10148, 10149, 10150, 10151, 10152, 10153, 10154, 10155, 10156, 10157, 10158, 10159, 10160, 10161, 10162, 10163, 10164, 10165, 10166, 10167, 10168, 10169, 10170, 10171, 10172, 10173, 10174, 10175, 10176, 10177, 10178, 10179, 10180, 10231, 10232, 10233, 10234, 10250, 10251, 10252, 10253, 10254, 10255, 10256, 10257, 10258, 10259, 10260, 10261, 10262, 10279, 10280, 10300, 10301, 10302, 10303, 10304, 10305, 10306, 10307, 10308, 10309, 10310, 10311, 10312, 10313, 10314, 10315, 10316, 10317, 10318, 10319, 10320, 10322, 10323, 10324, 10325, 10326, 10327, 10328, 10329, 10330, 10331, 10332, 10333, 10334, 10335, 10336, 10337, 10338, 10339, 10340, 10341, 10342, 10343, 10344, 10345, 10346, 10347, 10348, 10349, 10399, 10400, 10401, 10402, 10403, 10404, 10405, 10406, 10407, 10408, 10409, 10410, 10411, 10412, 10413, 10414, 10415, 10416, 
10417, 10418, 10419, 10420, 10421, 10422, 10423, 10424, 10425, 10426, 10427, 10729, 10730, 10731, 10736, 10737, 10738, 10739, 10740, 10741, 10742, 10743, 10768, 10769, 10770, 10771, 10773, 10774, 12131, 12834, 12934, 2050, 2075, 2173, 2174, 2203, 2204, 2205, 2206, 2207, 2208, 2209, 2210, 2211, 2212, 2213, 2214, 2215, 2216, 2217, 2394, 2453, 2459, 2460, 2603, 5821, 5822, 5823, 5824, 5825, 5826, 5827, 5828, 5829, 5830, 5831, 5832, 5833, 5835, 5836, 5837, 5838, 5839, 5840, 5841, 5842, 5843, 5844, 5845, 5846, 5847, 5848, 5849, 5850, 5851, 5852, 5853, 5854, 5855, 5856, 5857, 5858, 5859, 5860, 5861, 5862, 6580, 6822, 6823, 6824, 6825, 6826, 6827, 6828, 7115, 7504, 7505, 7506, 7507, 7508, 7509, 7510, 7511, 7766, 7767, 7768, 7769, 7770, 7771, 7772, 7773, 7851, 7852, 7853, 7854, 8446, 8447, 8449, 8450, 8451, 8452, 8453, 8454, 8455, 8456, 8457, 8458, 8459, 8460, 8461, 8462, 8463, 8464, 8465, 8466, 8467, 8468, 8469, 8470, 8471, 8472, 8473, 8474, 8475, 8509, 8510, 8511, 8512, 8513, 8514, 8521, 8522, 8523, 8524, 8525, 8526, 8527, 8528, 8540, 8541, 8542, 8543, 8548, 8549, 8550, 8551, 8608, 8609, 8610, 8611, 8612, 8613, 8614, 8615, 8616, 8617, 8618, 8619, 8620, 8645, 8646, 8647, 8648, 8649, 3827, 3828, 3829, 3830, 3831, 3832, 4753, 4754, 4756, 4757, 4758, 4759, 4760, 4761, 4762, 4763, 5259, 5260, 5261, 5262, 5263, 5264, 5265, 5266, 5267, 5268, 5269, 5270, 5271, 5272, 5273, 5289, 5290, 5291, 5818, 5819, 5820

Doing a quick query on krayt, I guess the number is 37,216 occurrences (SELECT count(*) FROM occurrence_record WHERE data_resource_id IN (...those IDs...))
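
A minimal sketch of how the count query above could be built from a pasted ID list. The table and column names (occurrence_record, data_resource_id) come from the query quoted above; everything else is an assumption made for illustration: the helper names, the in-memory SQLite database standing in for the real one on krayt, and the `?` placeholder style (other database drivers may expect a different marker).

```python
import sqlite3

def parse_ids(text):
    """Turn a comma-separated ID list (as pasted in this issue) into ints."""
    # Line breaks and a trailing comma are tolerated; empty tokens are skipped.
    return [int(tok) for tok in text.replace("\n", " ").split(",") if tok.strip()]

def build_count_query(ids, placeholder="?"):
    """Return (sql, params) counting occurrence records for the given resources."""
    if not ids:
        raise ValueError("no dataset IDs given")
    marks = ", ".join([placeholder] * len(ids))
    sql = ("SELECT count(*) FROM occurrence_record "
           "WHERE data_resource_id IN (" + marks + ")")
    return sql, list(ids)

# Tiny in-memory stand-in for the real database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrence_record (id INTEGER PRIMARY KEY, data_resource_id INTEGER)"
)
conn.executemany(
    "INSERT INTO occurrence_record (data_resource_id) VALUES (?)",
    [(8356,), (8356,), (8357,), (90,)],  # made-up rows, not real record counts
)

ids = parse_ids("14513, 8356, 8357, 8358,\n8359")
sql, params = build_count_query(ids)
count = conn.execute(sql, params).fetchone()[0]
print(count)
```

Passing the IDs as parameters rather than pasting them into the SQL string keeps a stray character in the list from changing the query.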

--> all deleted via script, 26.11.12 <--
=====================================================
Senckenberg (data_provider_id=155)

18 datasets have been deleted. Reporting 2762 records.

14513, 8356, 8357, 8358, 8359, 8360, 8361, 8362, 8363, 8364, 8365, 8366, 8367, 8368, 8369, 8370, 8371, 8372

=====================================================
Comisión nacional para el conocimiento y uso de la biodiversidad (data_provider_id=213)

17 datasets have been deleted. Reporting 359,122 records

12137, 13105, 1593, 1595, 1597, 1599, 1601, 2498, 2499, 2500, 2501, 2502, 2503, 2505, 2506, 2507, 2508

=====================================================
BeBIF Provider (data_provider_id=12)

1 dataset has been deleted. Reporting 2643 records.

dataset = 90

=====================================================
Bird Studies Canada (data_provider_id=18)

1 dataset has been deleted. Reporting 0 records.

dataset = 11529

=====================================================
Korea Institute of Science and Technology Information (data_provider_id=33)

1 dataset has been deleted. Reporting 40 records.

dataset = 109

=====================================================
Finnish Museum of Natural History (data_provider_id=50)

1 dataset has been deleted. Reporting 173,108 records.

dataset = 14040

=====================================================
San Diego Natural History Museum (data_provider_id=151)
SHOULD NOT BE DELETED. See comment
1 dataset has been deleted. Reporting 19,583 records.

dataset = 638

=====================================================
University of Alberta Museums (data_provider_id=178)

1 dataset has been deleted. Reporting 27,202 records.

dataset = 772

=====================================================
AIT Austrian Institute of Technology GmbH (data_provider_id=465)

1 dataset has been deleted. Reporting 0 records.

dataset = 14546
    


Author: ahahn@gbif.org
Comment: Checking with Andrei on an update to the batch deletion script for provider_id 145 - done, all removed
Created: 2012-11-26 11:30:04.21
Updated: 2012-11-27 16:47:34.564


Author: jlegind@gbif.org
Created: 2012-11-27 12:19:01.547
Updated: 2012-11-27 12:19:01.547
        
San Diego Natural History Museum (data_provider_id=151): the Mammals specimens dataset (id 638) should not be deleted.
The curator claims it is an IT department oversight.
    


Author: ahahn@gbif.org
Created: 2012-11-27 16:47:59.76
Updated: 2012-11-29 14:27:12.626
        
Senckenberg: (José)
These are the 18 problematic datasets from Senckenberg

17 of them come from a single access point - http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=zmk

All of these datasets exist in the wrapper response, but for some reason have been deleted from the Registry.
This is a problem and must be solved.

Crustacea ZMK (remote_id_at_url=Crustacea ZMK)
http://data.gbif.org/datasets/resource/8356

Ornithologie ZMK (remote_id_at_url= Ornithologie ZMK)
http://data.gbif.org/datasets/resource/8357

Ichthyologie ZMK (remote_id_at_url= Ichthyologie ZMK)
http://data.gbif.org/datasets/resource/8358

Protozoa ZMK (remote_id_at_url= Protozoa ZMK)
http://data.gbif.org/datasets/resource/8359

Mammalogie ZMK (remote_id_at_url= Mammalogie ZMK)
http://data.gbif.org/datasets/resource/8360

Amphibia ZMK (remote_id_at_url= Amphibia ZMK)
http://data.gbif.org/datasets/resource/8361

Pantopoda SMF (remote_id_at_url= Pantopoda SMF)
http://data.gbif.org/datasets/resource/8362

Reptilia ZMK (remote_id_at_url= Reptilia ZMK)
http://data.gbif.org/datasets/resource/8363

Bryozoa ZMK (remote_id_at_url= Bryozoa ZMK)
http://data.gbif.org/datasets/resource/8364

Cnidaria ZMK (remote_id_at_url= Cnidaria ZMK)
http://data.gbif.org/datasets/resource/8365

Vermes ZMK (remote_id_at_url= Vermes ZMK)
http://data.gbif.org/datasets/resource/8366

Malakologie ZMK (remote_id_at_url= Malakologie ZMK)
http://data.gbif.org/datasets/resource/8367

Echinodermata ZMK (remote_id_at_url= Echinodermata ZMK)
http://data.gbif.org/datasets/resource/8368

Porifera ZMK (remote_id_at_url= Porifera ZMK)
http://data.gbif.org/datasets/resource/8369

Arachnologie ZMK (remote_id_at_url= Arachnologie ZMK)
http://data.gbif.org/datasets/resource/8370

Ctenophora ZMK (remote_id_at_url= Ctenophora ZMK)
http://data.gbif.org/datasets/resource/8371

Tunicata ZMK (remote_id_at_url= Tunicata ZMK)
http://data.gbif.org/datasets/resource/8372


==================================
The last dataset is a different case. This resource exists in the wrapper response, but it seems it was deleted from the Registry and then created again on a later day.

Collection Mammalogie - ZSRO (remote_id_at_url=Collection Mammalogie - ZSRO)
http://data.gbif.org/datasets/resource/14513

GBIF Registry Entry
http://gbrds.gbif.org/browse/agent?uuid=f440262c-1d2c-11e2-8fd4-00145eb45e9a created on the 23rd of October

http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=zsro

Currently they don't share the same UUID (portal vs registry) for obvious reasons, as the one listed in the portal was the old registration that was deleted.


=================================
TEMPORARY FIX

For the last dataset listed above, the fix will be to reassign the UUID in the portal DB to the one pointing to the Registry.

--> done, 29.11.12

For the other deleted DS, I think we should run a synchronization against the access point: http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=zmk and see if they get added again?
If they are not added again, then a manual addition might be necessary.

--> done, 29.11.12 - most of them reappeared in the registry; updated portal.data_resource.gbif_registry_uuid to the new agent UUIDs. Exception: "Pantopoda SMF (remote_id_at_url= Pantopoda SMF)" (not recreated)

In any case, these are temporary fixes, because the metadata sync has a bug: it does not properly recognize this kind of access point and behaves unexpectedly.
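
A minimal sketch of the UUID reassignment described above, for one resource. The column gbif_registry_uuid on data_resource is named in the note above, and the resource ID and UUID are the ones quoted for the ZSRO dataset. The primary key column name (id), the helper function, and the in-memory SQLite table are assumptions for illustration only, not the real portal schema.

```python
import sqlite3

def reassign_registry_uuid(conn, resource_id, new_uuid):
    """Point one data_resource row at a new Registry UUID.

    Returns the number of rows changed, so a wrong ID shows up as 0.
    """
    cur = conn.execute(
        "UPDATE data_resource SET gbif_registry_uuid = ? WHERE id = ?",
        (new_uuid, resource_id),
    )
    conn.commit()
    return cur.rowcount

# In-memory stand-in for the portal DB, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_resource (id INTEGER PRIMARY KEY, gbif_registry_uuid TEXT)"
)
conn.execute("INSERT INTO data_resource VALUES (14513, 'old-uuid-placeholder')")

new_uuid = "f440262c-1d2c-11e2-8fd4-00145eb45e9a"
changed = reassign_registry_uuid(conn, 14513, new_uuid)
missing = reassign_registry_uuid(conn, 999999, new_uuid)  # no such row
print(changed, missing)
```

Checking the returned row count is a cheap guard against silently updating nothing when an ID is mistyped.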

    


Author: ahahn@gbif.org
Created: 2012-11-27 16:48:45.965
Updated: 2012-11-29 14:01:24.369
        
Mexico (José):

All of the following datasets were CORRECTLY deleted by the synchronizer, as they no longer appear on the access point associated with them (http://conabioweb.conabio.gob.mx:8050/digir/DIGIR.php)
I have checked the DiGIR XML response to see if we have a match (on  and ) and it does not seem so; they are gone.

============
http://data.gbif.org/datasets/resource/1593/ - Colección de Mamíferos de Nuevo León, México (UANL) - (remote id=UANL-UANL)

http://data.gbif.org/datasets/resource/1595/  - Herbario del Instituto de Ecología, A.C., México (IE-BAJIO) - (remote id=IE-BAJIO)

http://data.gbif.org/datasets/resource/1597/  - Herbario del Instituto de Ecología, A.C., México (IE-XAL) - (remote id=IE-XAL)

http://data.gbif.org/datasets/resource/1599/  - Banco Nacional de Germoplasma Vegetal, México (BANGEV, UACH) - (remote id=BANGEV-UACH)

http://data.gbif.org/datasets/resource/1601/  - Herbario de la Escuela Nacional de Ciencias Biológicas, México (ENCB, IPN) - (remote id=ENCB-IPN)

http://data.gbif.org/datasets/resource/2498/  - Ejemplares tipo de plantas vasculares del Herbario de la Escuela Nacional de Ciencias Biológicas, México (ENCB, IPN) - (remote id=ENCB-PLANTASVASCULARES)

http://data.gbif.org/datasets/resource/2499/  - Estudio Florístico de la Sierra de Pachuca, Hidalgo, México (ENCB, IPN) - (remote id=ENCB-ESTUDIOFLORISTICO)

http://data.gbif.org/datasets/resource/2500/  - Ictiofauna de la Región de Las Huastecas, México (ENCB, IPN) - (remote id=ENCB-ICTIOFAUNA)

http://data.gbif.org/datasets/resource/2501/ - Colección Nacional de Peces Dulceacuícolas Mexicanos de la Escuela Nacional de Ciencias Biológicas, IPN (ENCB, IPN) - (remote id=ENCB-PECES)

http://data.gbif.org/datasets/resource/2502/ - Ictiofauna de la Cuenca del Río Lerma, México (ENCB, IPN) - (remote id=ENCB-ICITOFAUNARIOLERMA)

http://data.gbif.org/datasets/resource/2503/ - Ictioplancton de las lagunas Madre y Almagre, Tamaulipas, y laguna de Tampamachoco, Veracruz, México (ENCB, IPN) - (remote id=ENCB-ICTIOPLANCTON)

http://data.gbif.org/datasets/resource/2505/ - Las familias Polyporaceae sensu stricto y Albatrellaceae en México (ENCB, IPN) - (remote id=ENCB-POLYPORACEAE_ALBATRELLACEAE)

http://data.gbif.org/datasets/resource/2506/ - Biodiversidad de los mamíferos en el Estado de Michoacán, México (ENCB, IPN) - (remote id=ENCB-MAMIFEROSMICHOACAN)

http://data.gbif.org/datasets/resource/2507/ - Estudio monográfico del género Echinopepon Naud. (Cucurbitaceae) en México (ENCB, IPN) - (remote id=ENCB-ECHINOPEPON)

http://data.gbif.org/datasets/resource/2508/ - Algas coralinas articuladas (Rhodophyta-corallinales) de México (ENCB, IPN) - (remote id=ENCB-ALGASCORALINAS)

http://data.gbif.org/datasets/resource/13105/ - Moluscos macrobénticos del intermareal y plataforma continental de Jalisco y Colima - (remote id=U_de_G_S110)



The following dataset was a special case, which the synchronizer handled as expected. It is just that the Mexican side has bad data. Here is the situation:

http://data.gbif.org/search/Colecci%C3%B3n%20de%20Diatomeas

There are two datasets with the same name, but with different access points and different remote_id_at_url values:

http://data.gbif.org/datasets/resource/11115/  (remote_id_at_url=ICMyL-DF-Diatomeas)  on the access point = http://132.248.15.4/digir/DiGIR.php
http://data.gbif.org/datasets/resource/12137/ (remote_id_at_url=ICMyL-DF) on the access point = http://conabioweb.conabio.gob.mx:8050/digir/DIGIR.php


The one with ID=12137 has been removed from the DiGIR response, and the synchronizer removed it from the Registry as well, which is good.
That leaves us just DR_ID=11115, which I think is good, but we still need to delete 12137 from the indexing database.


======
So in conclusion, I can say with confidence that the synchronizer has acted correctly with this publisher. Here is the list of IDs (same ones as in the last email) so we can batch delete them from the index DB:


12137, 13105, 1593, 1595, 1597, 1599, 1601, 2498, 2499, 2500, 2501, 2502, 2503, 2505, 2506, 2507, 2508

--> all removed, 29.11.12

    


Author: ahahn@gbif.org
Created: 2012-11-27 16:49:36.453
Updated: 2012-11-29 14:32:57.202
        
other resources (José, Jan):

==========================================

BeBIF Provider  (data_provider_id=12)

Metadata sync deleted it correctly.
The resource at http://data.gbif.org/datasets/resource/90/ is gone at the source.

Solution: Delete data_resource_id=90 from the indexing DB.

--> 29.11.12

==========================================

Bird Studies Canada  (data_provider_id=18)
1 conflicting DS
Our data portal has 2 duplicates of the same resource "Marsh Monitoring Program - Birds"
http://data.gbif.org/search/Marsh%20Monitoring which gives

http://data.gbif.org/datasets/resource/11529/ (remote_id_at_url=mmpbirds2)
http://data.gbif.org/datasets/resource/59/ (remote_id_at_url=mmpbirds)

Solution: In this case it will suffice to delete data_resource_id=11529 from the indexing DB.

--> deleted, 29.11.12
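
A minimal sketch of one way to spot such same-name duplicates in a list of resources. The two Marsh Monitoring entries are the ones listed above; the third row, the helper name, and the choice to compare names case-insensitively are assumptions for illustration. Name matching alone would miss a renamed resource, so it is only a first pass.

```python
from collections import defaultdict

def find_duplicate_names(resources):
    """Group (resource_id, name) pairs by name; keep names with more than one ID."""
    by_name = defaultdict(list)
    for resource_id, name in resources:
        # Normalise whitespace and case so trivial differences do not hide a match.
        by_name[name.strip().lower()].append(resource_id)
    return {name: sorted(ids) for name, ids in by_name.items() if len(ids) > 1}

resources = [
    (59, "Marsh Monitoring Program - Birds"),
    (11529, "Marsh Monitoring Program - Birds"),
    (1, "Hypothetical other resource"),  # made-up row, for contrast only
]
duplicates = find_duplicate_names(resources)
print(duplicates)
```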

==========================================

Korea Institute of Science and Technology Information (data_provider_id=33)

Metadata sync deleted it correctly.
The resource at
http://data.gbif.org/datasets/resource/109/ is gone at the source.

Solution: Delete data_resource_id=109 from the indexing DB.

--> resource 109 deleted, 27.11.12

==========================================

San Diego Natural History Museum (data_provider_id=151)

Metadata sync deleted it correctly.
The resource at
http://data.gbif.org/datasets/resource/638/ is gone at the source.

Solution: Delete data_resource_id=638 from the indexing DB.

* Track with Jan, he asked the publisher if they really deleted that Dataset

--> see comment from Jan earlier: _not_ to be deleted!

==========================================

University of Alberta Museums (data_provider_id=178)

Metadata sync deleted it correctly.
The resource at
http://data.gbif.org/datasets/resource/772/ is gone at the source.

Solution: Delete data_resource_id=772 from the indexing DB.

* Track with Jan, he asked the publisher if they really deleted that Dataset

==========================================
AIT Austrian Institute of Technology GmbH (data_provider_id=465)

The resource at http://data.gbif.org/datasets/resource/14546/ has a different UUID from the one in the Registry.
But it also has a different name:

Portal : The DNA & Sample Repository @ AIT

vs

Registry: The DNA and Sample Repository at AIT

I don't know what happened here! How can the name be changed?

Solution: I would strongly suggest deleting data_resource_id=14546, as it currently does not have any occurrences indexed, and then
reindexing it from the Publisher's page (http://gbrds.gbif.org/browse/agent?uuid=81bfa2a5-22a8-4bea-b91c-d54bda1365b9) via the metadata sync.

--> 14546 deleted 27.11.12; >metadata update waiting for service restart< done


==========================================
Finnish Museum of Natural History (data_provider_id=50)

The resource at http://data.gbif.org/datasets/resource/14040/ does not share the same UUID with the resource in the Registry

http://gbrds.gbif.org/browse/agent?uuid=aea3a96a-3580-11e2-918b-00145eb45e9a

This is the known dataset (Fieldjournal.org...). This is part of the Biocase issue, deleting and then creating the same dataset.

Temporary solution: modify the UUID in the portal database, to point to the Registry one. BUT this needs a fix in the synchronizer as well.

--> UUID fixed, 27.11.12


    


Author: ahahn@gbif.org
Comment: Most cases have been handled now. For the remaining ones, I am creating individual issues and then closing this one.
Created: 2012-11-29 14:14:44.26
Updated: 2012-11-29 14:14:44.26