12407 Reporter: ahahn Assignee: ahahn Type: Task Summary: Clean up on datasets deleted from the registry Priority: Major Resolution: Fixed Status: Closed Created: 2012-11-26 11:17:07.34 Updated: 2012-12-13 15:11:06.721 Resolved: 2012-11-29 14:33:38.013 Description: Follow up registry deletions (synchronizer) with - run batch deletion script for datasets from publisher 145 - double-check that other deleted dataset indeed should be gone - check whether a duplicate entry got created a) in the registry (new uuid) / b) in the index - in case of duplicates a), update the index of the old entry with the new uuid to keep indexing into the existing dataset, b) remove the duplicate Pangea (data_provider_id=145) 531 datasets have been deleted. All of them had reported a total of 37,216 records Here I provide a list of IDs of all these datasets, for easing up the inputting of them into a deletion script maybe? 8759, 8760, 8761, 8762, 8763, 8764, 8765, 8766, 8767, 8768, 8770, 8771, 8772, 8773, 8774, 8775, 8793, 8794, 8809, 8810, 8811, 8812, 8813, 8814, 8815, 8816, 8817, 8818, 8819, 8820, 8821, 8822, 8823, 8862, 8953, 8954, 9212, 9226, 9227, 9228, 9229, 9258, 9259, 9260, 9261, 9262, 9263, 9264, 9265, 9266, 9267, 9268, 9269, 9270, 9271, 9272, 9273, 9274, 9275, 9276, 9277, 9278, 9279, 9280, 9281, 9282, 9283, 9284, 9285, 9286, 9287, 9288, 9289, 9297, 9298, 9299, 9300, 9301, 9302, 9303, 9304, 9305, 9306, 9307, 9352, 9353, 9868, 9869, 9870, 9871, 9872, 9873, 9896, 9897, 9990, 9991, 9993, 9994, 9995, 9996, 9997, 9998, 9999, 10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009, 10010, 10031, 10032, 10033, 10034, 10035, 10036, 10037, 10038, 10039, 10040, 10041, 10042, 10043, 10044, 10045, 10046, 10047, 10048, 10049, 10050, 10051, 10052, 10053, 10054, 10055, 10056, 10057, 10058, 10059, 10060, 10061, 10062, 10063, 10064, 10065, 10066, 10067, 10068, 10069, 10070, 10071, 10072, 10073, 10074, 10075, 10076, 10077, 10078, 10079, 10080, 10131, 10132, 10133, 10134, 10135, 10136, 
10137, 10138, 10139, 10140, 10141, 10142, 10143, 10144, 10145, 10146, 10147, 10148, 10149, 10150, 10151, 10152, 10153, 10154, 10155, 10156, 10157, 10158, 10159, 10160, 10161, 10162, 10163, 10164, 10165, 10166, 10167, 10168, 10169, 10170, 10171, 10172, 10173, 10174, 10175, 10176, 10177, 10178, 10179, 10180, 10231, 10232, 10233, 10234, 10250, 10251, 10252, 10253, 10254, 10255, 10256, 10257, 10258, 10259, 10260, 10261, 10262, 10279, 10280, 10300, 10301, 10302, 10303, 10304, 10305, 10306, 10307, 10308, 10309, 10310, 10311, 10312, 10313, 10314, 10315, 10316, 10317, 10318, 10319, 10320, 10322, 10323, 10324, 10325, 10326, 10327, 10328, 10329, 10330, 10331, 10332, 10333, 10334, 10335, 10336, 10337, 10338, 10339, 10340, 10341, 10342, 10343, 10344, 10345, 10346, 10347, 10348, 10349, 10399, 10400, 10401, 10402, 10403, 10404, 10405, 10406, 10407, 10408, 10409, 10410, 10411, 10412, 10413, 10414, 10415, 10416, 10417, 10418, 10419, 10420, 10421, 10422, 10423, 10424, 10425, 10426, 10427, 10729, 10730, 10731, 10736, 10737, 10738, 10739, 10740, 10741, 10742, 10743, 10768, 10769, 10770, 10771, 10773, 10774, 12131, 12834, 12934, 2050, 2075, 2173, 2174, 2203, 2204, 2205, 2206, 2207, 2208, 2209, 2210, 2211, 2212, 2213, 2214, 2215, 2216, 2217, 2394, 2453, 2459, 2460, 2603, 5821, 5822, 5823, 5824, 5825, 5826, 5827, 5828, 5829, 5830, 5831, 5832, 5833, 5835, 5836, 5837, 5838, 5839, 5840, 5841, 5842, 5843, 5844, 5845, 5846, 5847, 5848, 5849, 5850, 5851, 5852, 5853, 5854, 5855, 5856, 5857, 5858, 5859, 5860, 5861, 5862, 6580, 6822, 6823, 6824, 6825, 6826, 6827, 6828, 7115, 7504, 7505, 7506, 7507, 7508, 7509, 7510, 7511, 7766, 7767, 7768, 7769, 7770, 7771, 7772, 7773, 7851, 7852, 7853, 7854, 8446, 8447, 8449, 8450, 8451, 8452, 8453, 8454, 8455, 8456, 8457, 8458, 8459, 8460, 8461, 8462, 8463, 8464, 8465, 8466, 8467, 8468, 8469, 8470, 8471, 8472, 8473, 8474, 8475, 8509, 8510, 8511, 8512, 8513, 8514, 8521, 8522, 8523, 8524, 8525, 8526, 8527, 8528, 8540, 8541, 8542, 8543, 8548, 8549, 8550, 8551, 
8608, 8609, 8610, 8611, 8612, 8613, 8614, 8615, 8616, 8617, 8618, 8619, 8620, 8645, 8646, 8647, 8648, 8649, 3827, 3828, 3829, 3830, 3831, 3832, 4753, 4754, 4756, 4757, 4758, 4759, 4760, 4761, 4762, 4763, 5259, 5260, 5261, 5262, 5263, 5264, 5265, 5266, 5267, 5268, 5269, 5270, 5271, 5272, 5273, 5289, 5290, 5291, 5818, 5819, 5820
Doing a quick query on krayt, I guess the number is 37,216 occurrences (SELECT count(*) FROM occurrence_record WHERE data_resource_id IN (...those IDs...))
--> all deleted via script, 26.11.12 <--
=====================================================
Senckenberg (data_provider_id=155)
18 datasets have been deleted. Reporting 2762 records.
14513, 8356, 8357, 8358, 8359, 8360, 8361, 8362, 8363, 8364, 8365, 8366, 8367, 8368, 8369, 8370, 8371, 8372
=====================================================
Comisión nacional para el conocimiento y uso de la biodiversidad (data_provider_id=213)
17 datasets have been deleted. Reporting 359,122 records.
12137, 13105, 1593, 1595, 1597, 1599, 1601, 2498, 2499, 2500, 2501, 2502, 2503, 2505, 2506, 2507, 2508
=====================================================
BeBIF Provider (data_provider_id=12)
1 dataset has been deleted. Reporting 2643 records.
dataset = 90
=====================================================
Bird Studies Canada (data_provider_id=18)
1 dataset has been deleted. Reporting 0 records.
dataset = 11529
=====================================================
Korea Institute of Science and Technology Information (data_provider_id=33)
1 dataset has been deleted. Reporting 40 records.
dataset = 109
=====================================================
Finnish Museum of Natural History (data_provider_id=50)
1 dataset has been deleted. Reporting 173,108 records.
dataset = 14040
=====================================================
San Diego Natural History Museum (data_provider_id=151) SHOULD NOT BE DELETED. See comment
1 dataset has been deleted. Reporting 19,583 records.
dataset = 638
=====================================================
University of Alberta Museums (data_provider_id=178)
1 dataset has been deleted. Reporting 27,202 records.
dataset = 772
=====================================================
AIT Austrian Institute of Technology GmbH (data_provider_id=465)
1 dataset has been deleted. Reporting 0 records.
dataset = 14546
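The count query quoted in the description can be sketched roughly as follows. This is only an illustration in Python against an in-memory SQLite database: the table `occurrence_record` and column `data_resource_id` come from the query in the description, while the helper names `parse_ids` and `count_occurrences`, the sample rows, and the use of SQLite are assumptions and do not reflect the actual database on krayt.

```python
import sqlite3

def parse_ids(text):
    """Turn a comma-separated ID list (as pasted in the ticket) into ints."""
    return [int(tok) for tok in text.split(",") if tok.strip()]

def count_occurrences(conn, resource_ids):
    """Count occurrence rows that belong to the given data resources."""
    if not resource_ids:
        return 0
    # One "?" placeholder per ID, so the values are bound rather than
    # spliced into the SQL string.
    placeholders = ",".join("?" for _ in resource_ids)
    sql = (
        "SELECT count(*) FROM occurrence_record "
        f"WHERE data_resource_id IN ({placeholders})"
    )
    return conn.execute(sql, resource_ids).fetchone()[0]

# Hypothetical stand-in for the real index database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrence_record "
    "(id INTEGER PRIMARY KEY, data_resource_id INTEGER)"
)
conn.executemany(
    "INSERT INTO occurrence_record (data_resource_id) VALUES (?)",
    [(8759,), (8759,), (8760,), (59,)],  # made-up sample rows
)

ids = parse_ids("8759, 8760, 8761")
total = count_occurrences(conn, ids)
print(total)
```

Running the same count before and after the deletion script gives a quick check that the reported record total matches what was actually removed.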
Author: ahahn@gbif.org
Created: 2012-11-26 11:30:04.21
Updated: 2012-11-27 16:47:34.564
Checking with Andrei on an update to the batch deletion script for provider_id 145 - done, all removed
Author: jlegind@gbif.org
Created: 2012-11-27 12:19:01.547
Updated: 2012-11-27 12:19:01.547
San Diego Natural History Museum (data_provider_id=151): the Mammals specimens dataset (id 638) should not be deleted.
The curator claims it is an IT department oversight.
Author: ahahn@gbif.org
Created: 2012-11-27 16:47:59.76
Updated: 2012-11-29 14:27:12.626
Senckenberg: (José)
These are the 18 problematic datasets from Senckenberg.
17 of them come from a single access point - http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=zmk
All of them exist in the wrapper response, but for some reason have been deleted from the Registry.
This is a problem and must be solved.
Crustacea ZMK (remote_id_at_url=Crustacea ZMK)
http://data.gbif.org/datasets/resource/8356
Ornithologie ZMK (remote_id_at_url= Ornithologie ZMK)
http://data.gbif.org/datasets/resource/8357
Ichthyologie ZMK (remote_id_at_url= Ichthyologie ZMK)
http://data.gbif.org/datasets/resource/8358
Protozoa ZMK (remote_id_at_url= Protozoa ZMK)
http://data.gbif.org/datasets/resource/8359
Mammalogie ZMK (remote_id_at_url= Mammalogie ZMK)
http://data.gbif.org/datasets/resource/8360
Amphibia ZMK (remote_id_at_url= Amphibia ZMK)
http://data.gbif.org/datasets/resource/8361
Pantopoda SMF (remote_id_at_url= Pantopoda SMF)
http://data.gbif.org/datasets/resource/8362
Reptilia ZMK (remote_id_at_url= Reptilia ZMK)
http://data.gbif.org/datasets/resource/8363
Bryozoa ZMK (remote_id_at_url= Bryozoa ZMK)
http://data.gbif.org/datasets/resource/8364
Cnidaria ZMK (remote_id_at_url= Cnidaria ZMK)
http://data.gbif.org/datasets/resource/8365
Vermes ZMK (remote_id_at_url= Vermes ZMK)
http://data.gbif.org/datasets/resource/8366
Malakologie ZMK (remote_id_at_url= Malakologie ZMK)
http://data.gbif.org/datasets/resource/8367
Echinodermata ZMK (remote_id_at_url= Echinodermata ZMK)
http://data.gbif.org/datasets/resource/8368
Porifera ZMK (remote_id_at_url= Porifera ZMK)
http://data.gbif.org/datasets/resource/8369
Arachnologie ZMK (remote_id_at_url= Arachnologie ZMK)
http://data.gbif.org/datasets/resource/8370
Ctenophora ZMK (remote_id_at_url= Ctenophora ZMK)
http://data.gbif.org/datasets/resource/8371
Tunicata ZMK (remote_id_at_url= Tunicata ZMK)
http://data.gbif.org/datasets/resource/8372
==================================
The last dataset is a different case. This resource exists in the wrapper response, but it seems it was deleted from the Registry and then created again on a later day.
Collection Mammalogie - ZSRO (remote_id_at_url=Collection Mammalogie - ZSRO)
http://data.gbif.org/datasets/resource/14513
GBIF Registry Entry
http://gbrds.gbif.org/browse/agent?uuid=f440262c-1d2c-11e2-8fd4-00145eb45e9a created on the 23rd of October
http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=zsro
Currently they don't share the same UUID (portal vs registry) for obvious reasons, as the one listed in the portal was the old registration that was deleted.
=================================
TEMPORARY FIX
For the last dataset listed above, the fix will be to reassign the UUID in the portal DB to the one pointing to the Registry.
--> done, 29.11.12
For the other deleted DS, I think we should run a synchronization against the access point: http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=zmk and see if they get added again?
If they are not added again, then a manual addition might be necessary.
--> done, 29.11.12 - most of them reappeared in the registry; updated portal.data_resource.gbif_registry_uuid to the new agent UUIDs. Exception: "Pantopoda SMF (remote_id_at_url= Pantopoda SMF)" (not recreated)
In any case, these are temporary fixes, because the metadata sync has a bug: it does not properly recognize this kind of access point and behaves unexpectedly.
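The UUID reassignment described under TEMPORARY FIX could look roughly like the sketch below. The column `gbif_registry_uuid` on `data_resource` is named in this comment, and resource 14513 with its Registry UUID are taken from the listing above. The `id` key column, the helper name `reassign_registry_uuid`, the placeholder old value, and the use of an in-memory SQLite database are assumptions for illustration only.

```python
import sqlite3
import uuid

def reassign_registry_uuid(conn, data_resource_id, new_uuid):
    """Point one portal data_resource row at the given Registry UUID.

    Returns the number of rows changed, so the caller can confirm that
    exactly one resource was touched.
    """
    uuid.UUID(new_uuid)  # raises ValueError if the UUID is malformed
    cur = conn.execute(
        "UPDATE data_resource SET gbif_registry_uuid = ? WHERE id = ?",
        (new_uuid, data_resource_id),
    )
    return cur.rowcount

# Hypothetical stand-in for the portal database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_resource "
    "(id INTEGER PRIMARY KEY, gbif_registry_uuid TEXT)"
)
# The old UUID is not given in the ticket, so a placeholder is used here.
conn.execute("INSERT INTO data_resource VALUES (14513, 'old-uuid-placeholder')")

updated = reassign_registry_uuid(
    conn, 14513, "f440262c-1d2c-11e2-8fd4-00145eb45e9a"
)
conn.commit()
print(updated)
```

Checking the returned row count guards against silently updating nothing when a resource ID is mistyped.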
Author: ahahn@gbif.org
Created: 2012-11-27 16:48:45.965
Updated: 2012-11-29 14:01:24.369
Mexico (José):
All of the following datasets were CORRECTLY deleted by the synchronizer, as they no longer appear on the access point associated with them (http://conabioweb.conabio.gob.mx:8050/digir/DIGIR.php)
I have checked the DiGIR XML response to see if we have a match (on and ) and it does not seem so; they are gone.
============
http://data.gbif.org/datasets/resource/1593/ - Colección de Mamíferos de Nuevo León, México (UANL) - (remote id=UANL-UANL)
http://data.gbif.org/datasets/resource/1595/ - Herbario del Instituto de Ecología, A.C., México (IE-BAJIO) - (remote id=IE-BAJIO)
http://data.gbif.org/datasets/resource/1597/ - Herbario del Instituto de Ecología, A.C., México (IE-XAL) - (remote id=IE-XAL)
http://data.gbif.org/datasets/resource/1599/ - Banco Nacional de Germoplasma Vegetal, México (BANGEV, UACH) - (remote id=BANGEV-UACH)
http://data.gbif.org/datasets/resource/1601/ - Herbario de la Escuela Nacional de Ciencias Biológicas, México (ENCB, IPN) - (remote id=ENCB-IPN)
http://data.gbif.org/datasets/resource/2498/ - Ejemplares tipo de plantas vasculares del Herbario de la Escuela Nacional de Ciencias Biológicas, México (ENCB, IPN) - (remote id=ENCB-PLANTASVASCULARES)
http://data.gbif.org/datasets/resource/2499/ - Estudio Florístico de la Sierra de Pachuca, Hidalgo, México (ENCB, IPN) - (remote id=ENCB-ESTUDIOFLORISTICO)
http://data.gbif.org/datasets/resource/2500/ - Ictiofauna de la Región de Las Huastecas, México (ENCB, IPN) - (remote id=ENCB-ICTIOFAUNA)
http://data.gbif.org/datasets/resource/2501/ - Colección Nacional de Peces Dulceacuícolas Mexicanos de la Escuela Nacional de Ciencias Biológicas, IPN (ENCB, IPN) - (remote id=ENCB-PECES)
http://data.gbif.org/datasets/resource/2502/ - Ictiofauna de la Cuenca del Río Lerma, México (ENCB, IPN) - (remote id=ENCB-ICITOFAUNARIOLERMA)
http://data.gbif.org/datasets/resource/2503/ - Ictioplancton de las lagunas Madre y Almagre, Tamaulipas, y laguna de Tampamachoco, Veracruz, México (ENCB, IPN) - (remote id=ENCB-ICTIOPLANCTON)
http://data.gbif.org/datasets/resource/2505/ - Las familias Polyporaceae sensu stricto y Albatrellaceae en México (ENCB, IPN) - (remote id=ENCB-POLYPORACEAE_ALBATRELLACEAE)
http://data.gbif.org/datasets/resource/2506/ - Biodiversidad de los mamíferos en el Estado de Michoacán, México (ENCB, IPN) - (remote id=ENCB-MAMIFEROSMICHOACAN)
http://data.gbif.org/datasets/resource/2507/ - Estudio monográfico del género Echinopepon Naud. (Cucurbitaceae) en México (ENCB, IPN) - (remote id=ENCB-ECHINOPEPON)
http://data.gbif.org/datasets/resource/2508/ - Algas coralinas articuladas (Rhodophyta-corallinales) de México (ENCB, IPN) - (remote id=ENCB-ALGASCORALINAS)
http://data.gbif.org/datasets/resource/13105/ - Moluscos macrobénticos del intermareal y plataforma continental de Jalisco y Colima - (remote id=U_de_G_S110)
The following dataset was a special case, which the synchronizer handled as expected. The problem is bad data on the Mexican side. Here is the situation:
http://data.gbif.org/search/Colecci%C3%B3n%20de%20Diatomeas
There are two datasets with the same name, but with different access points and different remote_id_at_url values:
http://data.gbif.org/datasets/resource/11115/ (remote_id_at_url=ICMyL-DF-Diatomeas) on the access point = http://132.248.15.4/digir/DiGIR.php
http://data.gbif.org/datasets/resource/12137/ (remote_id_at_url=ICMyL-DF) on the access point = http://conabioweb.conabio.gob.mx:8050/digir/DIGIR.php
The one with ID=12137 has been removed from the DiGIR response, and the synchronizer removed it from the Registry as well, which is good.
That leaves us with just DR_ID=11115, which I think is correct, but we still need to delete 12137 from the indexing database.
======
So in conclusion, I can say with confidence that the synchronizer has behaved correctly for this publisher. Here is the list of IDs (the same ones as in the last email) so we can batch delete them from the index DB:
12137, 13105, 1593, 1595, 1597, 1599, 1601, 2498, 2499, 2500, 2501, 2502, 2503, 2505, 2506, 2507, 2508
--> all removed, 29.11.12
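A batch deletion over such an ID list might be sketched as below. This is not the actual deletion script, which the ticket does not show. The table `occurrence_record` and column `data_resource_id` come from the query in the description; the helper name `batch_delete`, the `protected` guard (motivated by resource 638, which must not be deleted), the sample rows, and the in-memory SQLite database are assumptions for illustration.

```python
import sqlite3

# ID list as given in this comment.
MEXICO_IDS = (
    "12137, 13105, 1593, 1595, 1597, 1599, 1601, 2498, 2499, 2500, "
    "2501, 2502, 2503, 2505, 2506, 2507, 2508"
)

def batch_delete(conn, resource_ids, protected=frozenset()):
    """Delete occurrences for the given resources, skipping protected IDs."""
    to_delete = [rid for rid in resource_ids if rid not in protected]
    if not to_delete:
        return 0
    placeholders = ",".join("?" for _ in to_delete)
    with conn:  # commits on success, rolls back if the delete fails
        cur = conn.execute(
            "DELETE FROM occurrence_record "
            f"WHERE data_resource_id IN ({placeholders})",
            to_delete,
        )
    return cur.rowcount

# Hypothetical stand-in for the index database, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrence_record "
    "(id INTEGER PRIMARY KEY, data_resource_id INTEGER)"
)
conn.executemany(
    "INSERT INTO occurrence_record (data_resource_id) VALUES (?)",
    [(1593,), (1593,), (2508,), (638,), (11115,)],
)

ids = [int(tok) for tok in MEXICO_IDS.split(",")]
deleted = batch_delete(conn, ids + [638], protected={638})
remaining = conn.execute("SELECT count(*) FROM occurrence_record").fetchone()[0]
print(deleted, remaining)
```

Keeping an explicit set of protected IDs is one way to avoid removing a dataset that a publisher has asked to keep, even if it slips into a pasted list.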
Author: ahahn@gbif.org
Created: 2012-11-27 16:49:36.453
Updated: 2012-11-29 14:32:57.202
other resources (José, Jan):
==========================================
BeBIF Provider (data_provider_id=12)
Metadata sync deleted it correctly.
The resource at http://data.gbif.org/datasets/resource/90/ is gone at the source.
Solution: Delete data_resource_id=90 from the indexing DB.
--> 29.11.12
==========================================
Bird Studies Canada (data_provider_id=18)
1 conflicting DS
Our data portal has 2 duplicates of the same resource "Marsh Monitoring Program - Birds"
http://data.gbif.org/search/Marsh%20Monitoring which gives
http://data.gbif.org/datasets/resource/11529/ (remote_id_at_url=mmpbirds2)
http://data.gbif.org/datasets/resource/59/ (remote_id_at_url=mmpbirds)
Solution: In this case it will suffice to delete data_resource_id=11529 from the indexing DB.
--> deleted, 29.11.12
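Duplicates like this one could be spotted by grouping portal resources by name, as in the following sketch. The two Marsh Monitoring rows use the IDs and remote IDs listed above. The function name `find_duplicate_names`, the tuple layout, the case-insensitive name matching, and the third row are assumptions for illustration.

```python
from collections import defaultdict

def find_duplicate_names(resources):
    """Group (id, name, remote_id) tuples by normalized name.

    Returns only the names that map to more than one resource.
    """
    by_name = defaultdict(list)
    for resource_id, name, remote_id in resources:
        by_name[name.strip().lower()].append((resource_id, remote_id))
    return {name: rows for name, rows in by_name.items() if len(rows) > 1}

resources = [
    (59, "Marsh Monitoring Program - Birds", "mmpbirds"),
    (11529, "Marsh Monitoring Program - Birds", "mmpbirds2"),
    (1, "Hypothetical unrelated dataset", "example"),  # made-up filler row
]

duplicates = find_duplicate_names(resources)
for name, rows in duplicates.items():
    print(name, rows)
```

A name match only flags candidates; whether an entry is a true duplicate still needs a manual check, as was done for this resource.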
==========================================
Korea Institute of Science and Technology Information (data_provider_id=33)
Metadata sync deleted it correctly.
The resource at
http://data.gbif.org/datasets/resource/109/ is gone at the source.
Solution: Delete data_resource_id=109 from the indexing DB.
--> resource 109 deleted, 27.11.12
==========================================
San Diego Natural History Museum (data_provider_id=151)
Metadata sync deleted it correctly.
The resource at
http://data.gbif.org/datasets/resource/638/ is gone at the source.
Solution: Delete data_resource_id=638 from the indexing DB.
* Track with Jan; he asked the publisher whether they really deleted that dataset
--> see comment from Jan earlier: _not_ to be deleted!
==========================================
University of Alberta Museums (data_provider_id=178)
Metadata sync deleted it correctly.
The resource at
http://data.gbif.org/datasets/resource/772/ is gone at the source.
Solution: Delete data_resource_id=772 from the indexing DB.
* Track with Jan; he asked the publisher whether they really deleted that dataset
==========================================
AIT Austrian Institute of Technology GmbH (data_provider_id=465)
The resource at http://data.gbif.org/datasets/resource/14546/ has a different UUID from the one in the Registry.
It also has a different name:
Portal : The DNA & Sample Repository @ AIT
vs
Registry: The DNA and Sample Repository at AIT
I don't know what happened here! How can the name be changed?
Solution: I strongly suggest deleting data_resource_id=14546, as it currently has no occurrences indexed, and then
reindexing it from the publisher's page (http://gbrds.gbif.org/browse/agent?uuid=81bfa2a5-22a8-4bea-b91c-d54bda1365b9) via the metadata sync.
--> 14546 deleted 27.11.12; >metadata update waiting for service restart< done
==========================================
Finnish Museum of Natural History (data_provider_id=50)
The resource at http://data.gbif.org/datasets/resource/14040/ does not share the same UUID with the resource in the Registry
http://gbrds.gbif.org/browse/agent?uuid=aea3a96a-3580-11e2-918b-00145eb45e9a
This is the known dataset (Fieldjournal.org...). This is part of the Biocase issue, deleting and then creating the same dataset.
Temporary solution: modify the UUID in the portal database, to point to the Registry one. BUT this needs a fix in the synchronizer as well.
--> UUID fixed, 27.11.12
Author: ahahn@gbif.org
Created: 2012-11-29 14:14:44.26
Updated: 2012-11-29 14:14:44.26
Most cases have been handled now. For the remaining ones, I am creating individual issues and then closing this one.