12407 Reporter: ahahn Assignee: ahahn Type: Task Summary: Clean up on datasets deleted from the registry Priority: Major Resolution: Fixed Status: Closed Created: 2012-11-26 11:17:07.34 Updated: 2012-12-13 15:11:06.721 Resolved: 2012-11-29 14:33:38.013 Description: Follow up registry deletions (synchronizer) with - run batch deletion script for datasets from publisher 145 - double-check that other deleted dataset indeed should be gone - check whether a duplicate entry got created a) in the registry (new uuid) / b) in the index - in case of duplicates a), update the index of the old entry with the new uuid to keep indexing into the existing dataset, b) remove the duplicate Pangea (data_provider_id=145) 531 datasets have been deleted. All of them had reported a total of 37,216 records Here I provide a list of IDs of all these datasets, for easing up the inputting of them into a deletion script maybe? 8759, 8760, 8761, 8762, 8763, 8764, 8765, 8766, 8767, 8768, 8770, 8771, 8772, 8773, 8774, 8775, 8793, 8794, 8809, 8810, 8811, 8812, 8813, 8814, 8815, 8816, 8817, 8818, 8819, 8820, 8821, 8822, 8823, 8862, 8953, 8954, 9212, 9226, 9227, 9228, 9229, 9258, 9259, 9260, 9261, 9262, 9263, 9264, 9265, 9266, 9267, 9268, 9269, 9270, 9271, 9272, 9273, 9274, 9275, 9276, 9277, 9278, 9279, 9280, 9281, 9282, 9283, 9284, 9285, 9286, 9287, 9288, 9289, 9297, 9298, 9299, 9300, 9301, 9302, 9303, 9304, 9305, 9306, 9307, 9352, 9353, 9868, 9869, 9870, 9871, 9872, 9873, 9896, 9897, 9990, 9991, 9993, 9994, 9995, 9996, 9997, 9998, 9999, 10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009, 10010, 10031, 10032, 10033, 10034, 10035, 10036, 10037, 10038, 10039, 10040, 10041, 10042, 10043, 10044, 10045, 10046, 10047, 10048, 10049, 10050, 10051, 10052, 10053, 10054, 10055, 10056, 10057, 10058, 10059, 10060, 10061, 10062, 10063, 10064, 10065, 10066, 10067, 10068, 10069, 10070, 10071, 10072, 10073, 10074, 10075, 10076, 10077, 10078, 10079, 10080, 10131, 10132, 10133, 10134, 10135, 10136, 
10137, 10138, 10139, 10140, 10141, 10142, 10143, 10144, 10145, 10146, 10147, 10148, 10149, 10150, 10151, 10152, 10153, 10154, 10155, 10156, 10157, 10158, 10159, 10160, 10161, 10162, 10163, 10164, 10165, 10166, 10167, 10168, 10169, 10170, 10171, 10172, 10173, 10174, 10175, 10176, 10177, 10178, 10179, 10180, 10231, 10232, 10233, 10234, 10250, 10251, 10252, 10253, 10254, 10255, 10256, 10257, 10258, 10259, 10260, 10261, 10262, 10279, 10280, 10300, 10301, 10302, 10303, 10304, 10305, 10306, 10307, 10308, 10309, 10310, 10311, 10312, 10313, 10314, 10315, 10316, 10317, 10318, 10319, 10320, 10322, 10323, 10324, 10325, 10326, 10327, 10328, 10329, 10330, 10331, 10332, 10333, 10334, 10335, 10336, 10337, 10338, 10339, 10340, 10341, 10342, 10343, 10344, 10345, 10346, 10347, 10348, 10349, 10399, 10400, 10401, 10402, 10403, 10404, 10405, 10406, 10407, 10408, 10409, 10410, 10411, 10412, 10413, 10414, 10415, 10416, 10417, 10418, 10419, 10420, 10421, 10422, 10423, 10424, 10425, 10426, 10427, 10729, 10730, 10731, 10736, 10737, 10738, 10739, 10740, 10741, 10742, 10743, 10768, 10769, 10770, 10771, 10773, 10774, 12131, 12834, 12934, 2050, 2075, 2173, 2174, 2203, 2204, 2205, 2206, 2207, 2208, 2209, 2210, 2211, 2212, 2213, 2214, 2215, 2216, 2217, 2394, 2453, 2459, 2460, 2603, 5821, 5822, 5823, 5824, 5825, 5826, 5827, 5828, 5829, 5830, 5831, 5832, 5833, 5835, 5836, 5837, 5838, 5839, 5840, 5841, 5842, 5843, 5844, 5845, 5846, 5847, 5848, 5849, 5850, 5851, 5852, 5853, 5854, 5855, 5856, 5857, 5858, 5859, 5860, 5861, 5862, 6580, 6822, 6823, 6824, 6825, 6826, 6827, 6828, 7115, 7504, 7505, 7506, 7507, 7508, 7509, 7510, 7511, 7766, 7767, 7768, 7769, 7770, 7771, 7772, 7773, 7851, 7852, 7853, 7854, 8446, 8447, 8449, 8450, 8451, 8452, 8453, 8454, 8455, 8456, 8457, 8458, 8459, 8460, 8461, 8462, 8463, 8464, 8465, 8466, 8467, 8468, 8469, 8470, 8471, 8472, 8473, 8474, 8475, 8509, 8510, 8511, 8512, 8513, 8514, 8521, 8522, 8523, 8524, 8525, 8526, 8527, 8528, 8540, 8541, 8542, 8543, 8548, 8549, 8550, 8551, 
8608, 8609, 8610, 8611, 8612, 8613, 8614, 8615, 8616, 8617, 8618, 8619, 8620, 8645, 8646, 8647, 8648, 8649, 3827, 3828, 3829, 3830, 3831, 3832, 4753, 4754, 4756, 4757, 4758, 4759, 4760, 4761, 4762, 4763, 5259, 5260, 5261, 5262, 5263, 5264, 5265, 5266, 5267, 5268, 5269, 5270, 5271, 5272, 5273, 5289, 5290, 5291, 5818, 5819, 5820
Doing a quick query on krayt, I guess the number is 37,216 occurrences (SELECT count(*) from occurrence_record where data_resource_id IN (....those IDs...))
--> all deleted via script, 26.11.12 <--
=====================================================
Senckenberg (data_provider_id=155)
18 datasets have been deleted. Reporting 2762 records.
14513, 8356, 8357, 8358, 8359, 8360, 8361, 8362, 8363, 8364, 8365, 8366, 8367, 8368, 8369, 8370, 8371, 8372
=====================================================
Comisión nacional para el conocimiento y uso de la biodiversidad (data_provider_id=213)
17 datasets have been deleted. Reporting 359,122 records.
12137, 13105, 1593, 1595, 1597, 1599, 1601, 2498, 2499, 2500, 2501, 2502, 2503, 2505, 2506, 2507, 2508
=====================================================
BeBIF Provider (data_provider_id=12)
1 dataset has been deleted. Reporting 2643 records.
dataset = 90
=====================================================
Bird Studies Canada (data_provider_id=18)
1 dataset has been deleted. Reporting 0 records.
dataset = 11529
=====================================================
Korea Institute of Science and Technology Information (data_provider_id=33)
1 dataset has been deleted. Reporting 40 records.
dataset = 109
=====================================================
Finnish Museum of Natural History (data_provider_id=50)
1 dataset has been deleted. Reporting 173,108 records.
dataset = 14040
=====================================================
San Diego Natural History Museum (data_provider_id=151) SHOULD NOT BE DELETED. See comment.
1 dataset has been deleted. Reporting 19,583 records.
dataset = 638
=====================================================
University of Alberta Museums (data_provider_id=178)
1 dataset has been deleted. Reporting 27,202 records.
dataset = 772
=====================================================
AIT Austrian Institute of Technology GmbH (data_provider_id=465)
1 dataset has been deleted. Reporting 0 records.
dataset = 14546
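Note: the count query mentioned in the description can be sketched roughly as below. This is only an illustration under assumptions: it uses Python's built-in sqlite3 with an in-memory table as a stand-in for the real indexing database, and the helper names (count_occurrences, build_demo_db) and the demo rows are hypothetical. Only the table name occurrence_record and the column data_resource_id come from this issue.

```python
import sqlite3


def count_occurrences(conn, resource_ids):
    """Count occurrence_record rows whose data_resource_id is in resource_ids."""
    ids = list(resource_ids)
    if not ids:
        # An empty IN () list is not valid SQL, so answer directly.
        return 0
    # One "?" placeholder per ID keeps the values out of the SQL string.
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT count(*) FROM occurrence_record "
        f"WHERE data_resource_id IN ({placeholders})"
    )
    return conn.execute(sql, ids).fetchone()[0]


def build_demo_db():
    """Create a tiny in-memory stand-in for the index DB (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE occurrence_record ("
        "id INTEGER PRIMARY KEY, data_resource_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO occurrence_record (data_resource_id) VALUES (?)",
        [(90,), (90,), (109,), (59,)],
    )
    return conn
```

Running the count before and after the batch deletion would give a simple check that the expected number of records was actually removed.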
Author: ahahn@gbif.org Comment: Checking with Andrei on an update to the batch deletion script for provider_id 145 - done, all removed Created: 2012-11-26 11:30:04.21 Updated: 2012-11-27 16:47:34.564
Author: jlegind@gbif.org Created: 2012-11-27 12:19:01.547 Updated: 2012-11-27 12:19:01.547 San Diego Natural History Museum (data_provider_id=151): the Mammals specimens dataset (id 638) should not be deleted. The curator says its removal was an IT department oversight.
Author: ahahn@gbif.org Created: 2012-11-27 16:47:59.76 Updated: 2012-11-29 14:27:12.626
Senckenberg: (José)
These are the 18 problematic datasets from Senckenberg. 17 of them come from a single access point - http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=zmk
All of these datasets exist in the wrapper response, but for some reason have been deleted from the Registry. This is a problem and must be solved.
Crustacea ZMK (remote_id_at_url=Crustacea ZMK) http://data.gbif.org/datasets/resource/8356
Ornithologie ZMK (remote_id_at_url= Ornithologie ZMK) http://data.gbif.org/datasets/resource/8357
Ichthyologie ZMK (remote_id_at_url= Ichthyologie ZMK) http://data.gbif.org/datasets/resource/8358
Protozoa ZMK (remote_id_at_url= Protozoa ZMK) http://data.gbif.org/datasets/resource/8359
Mammalogie ZMK (remote_id_at_url= Mammalogie ZMK) http://data.gbif.org/datasets/resource/8360
Amphibia ZMK (remote_id_at_url= Amphibia ZMK) http://data.gbif.org/datasets/resource/8361
Pantopoda SMF (remote_id_at_url= Pantopoda SMF) http://data.gbif.org/datasets/resource/8362
Reptilia ZMK (remote_id_at_url= Reptilia ZMK) http://data.gbif.org/datasets/resource/8363
Bryozoa ZMK (remote_id_at_url= Bryozoa ZMK) http://data.gbif.org/datasets/resource/8364
Cnidaria ZMK (remote_id_at_url= Cnidaria ZMK) http://data.gbif.org/datasets/resource/8365
Vermes ZMK (remote_id_at_url= Vermes ZMK) http://data.gbif.org/datasets/resource/8366
Malakologie ZMK (remote_id_at_url= Malakologie ZMK) http://data.gbif.org/datasets/resource/8367
Echinodermata ZMK (remote_id_at_url= Echinodermata ZMK) http://data.gbif.org/datasets/resource/8368
Porifera ZMK (remote_id_at_url= Porifera ZMK) http://data.gbif.org/datasets/resource/8369
Arachnologie ZMK (remote_id_at_url= Arachnologie ZMK) http://data.gbif.org/datasets/resource/8370
Ctenophora ZMK (remote_id_at_url= Ctenophora ZMK) http://data.gbif.org/datasets/resource/8371
Tunicata ZMK (remote_id_at_url= Tunicata ZMK) http://data.gbif.org/datasets/resource/8372
==================================
The last dataset is a different case. This resource exists in the wrapper response, but it seems it was deleted from the Registry and then created again on a later day.
Collection Mammalogie - ZSRO (remote_id_at_url=Collection Mammalogie - ZSRO) http://data.gbif.org/datasets/resource/14513
GBIF Registry Entry http://gbrds.gbif.org/browse/agent?uuid=f440262c-1d2c-11e2-8fd4-00145eb45e9a created on the 23rd of October
http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=zsro
Currently the portal and the Registry do not share the same UUID, because the one listed in the portal belongs to the old registration that was deleted.
=================================
TEMPORARY FIX
For the last dataset listed above, the fix will be to reassign the UUID in the portal DB to the one now in the Registry.
--> done, 29.11.12
For the other deleted DS, I think we should run a synchronization against the access point http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=zmk and see if they get added again. If they are not added again, then a manual addition might be necessary.
--> done, 29.11.12 - most of them reappeared in the registry; updated portal.data_resource.gbif_registry_uuid to the new agent uuids. Exception: "Pantopoda SMF (remote_id_at_url= Pantopoda SMF)" (not recreated)
In any case, these are temporary fixes, because the metadata-sync has a bug: it does not properly recognize this kind of access point and behaves unexpectedly.
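Note: the UUID reassignment used as the temporary fix could look roughly like the sketch below. It is not the statement that was actually run. The column gbif_registry_uuid on data_resource is taken from this comment; the primary key column name id, the function name reassign_registry_uuid, and the use of sqlite3 as a stand-in connection are assumptions for illustration.

```python
import sqlite3


def reassign_registry_uuid(conn, resource_id, new_uuid):
    """Point one data_resource row at the UUID of its re-created Registry entry.

    Assumes a data_resource table with columns id and gbif_registry_uuid
    (the id column name is a guess, not confirmed by the ticket).
    """
    cur = conn.execute(
        "UPDATE data_resource SET gbif_registry_uuid = ? WHERE id = ?",
        (new_uuid, resource_id),
    )
    if cur.rowcount != 1:
        # Nothing (or more than one row) matched: undo and report it.
        conn.rollback()
        raise LookupError(f"expected exactly one row for id={resource_id}")
    conn.commit()
```

Refusing to commit unless exactly one row changed guards against a mistyped resource ID silently doing nothing.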
Author: ahahn@gbif.org Created: 2012-11-27 16:48:45.965 Updated: 2012-11-29 14:01:24.369
Mexico (José):
The following datasets were all CORRECTLY deleted by the synchronizer, as they no longer appear on the access point associated with them (http://conabioweb.conabio.gob.mx:8050/digir/DIGIR.php). I have checked the DiGIR XML response to see if we have a match (onand ) and it does not seem so; they are gone.
============
http://data.gbif.org/datasets/resource/1593/ - Colección de Mamíferos de Nuevo León, México (UANL) - (remote id=UANL-UANL)
http://data.gbif.org/datasets/resource/1595/ - Herbario del Instituto de Ecología, A.C., México (IE-BAJIO) - (remote id=IE-BAJIO)
http://data.gbif.org/datasets/resource/1597/ - Herbario del Instituto de Ecología, A.C., México (IE-XAL) - (remote id=IE-XAL)
http://data.gbif.org/datasets/resource/1599/ - Banco Nacional de Germoplasma Vegetal, México (BANGEV, UACH) - (remote id=BANGEV-UACH)
http://data.gbif.org/datasets/resource/1601/ - Herbario de la Escuela Nacional de Ciencias Biológicas, México (ENCB, IPN) - (remote id=ENCB-IPN)
http://data.gbif.org/datasets/resource/2498/ - Ejemplares tipo de plantas vasculares del Herbario de la Escuela Nacional de Ciencias Biológicas, México (ENCB, IPN) - (remote id=ENCB-PLANTASVASCULARES)
http://data.gbif.org/datasets/resource/2499/ - Estudio Florístico de la Sierra de Pachuca, Hidalgo, México (ENCB, IPN) - (remote id=ENCB-ESTUDIOFLORISTICO)
http://data.gbif.org/datasets/resource/2500/ - Ictiofauna de la Región de Las Huastecas, México (ENCB, IPN) - (remote id=ENCB-ICTIOFAUNA)
http://data.gbif.org/datasets/resource/2501/ - Colección Nacional de Peces Dulceacuícolas Mexicanos de la Escuela Nacional de Ciencias Biológicas, IPN (ENCB, IPN) - (remote id=ENCB-PECES)
http://data.gbif.org/datasets/resource/2502/ - Ictiofauna de la Cuenca del Río Lerma, México (ENCB, IPN) - (remote id=ENCB-ICITOFAUNARIOLERMA)
http://data.gbif.org/datasets/resource/2503/ - Ictioplancton de las lagunas Madre y Almagre, Tamaulipas, y laguna de Tampamachoco, Veracruz, México (ENCB, IPN) - (remote id=ENCB-ICTIOPLANCTON)
http://data.gbif.org/datasets/resource/2505/ - Las familias Polyporaceae sensu stricto y Albatrellaceae en México (ENCB, IPN) - (remote id=ENCB-POLYPORACEAE_ALBATRELLACEAE)
http://data.gbif.org/datasets/resource/2506/ - Biodiversidad de los mamíferos en el Estado de Michoacán, México (ENCB, IPN) - (remote id=ENCB-MAMIFEROSMICHOACAN)
http://data.gbif.org/datasets/resource/2507/ - Estudio monográfico del género Echinopepon Naud. (Cucurbitaceae) en México (ENCB, IPN) - (remote id=ENCB-ECHINOPEPON)
http://data.gbif.org/datasets/resource/2508/ - Algas coralinas articuladas (Rhodophyta-corallinales) de México (ENCB, IPN) - (remote id=ENCB-ALGASCORALINAS)
http://data.gbif.org/datasets/resource/13105/ - Moluscos macrobénticos del intermareal y plataforma continental de Jalisco y Colima - (remote id=U_de_G_S110)
The following dataset was a special case, which the synchronizer handled as expected. It is just that the data is bad on the Mexican side. Here is the situation:
http://data.gbif.org/search/Colecci%C3%B3n%20de%20Diatomeas
There are two datasets with the same name, but with different access points and different remote IDs:
http://data.gbif.org/datasets/resource/11115/ (remote_id_at_url=ICMyL-DF-Diatomeas) on the access point = http://132.248.15.4/digir/DiGIR.php
http://data.gbif.org/datasets/resource/12137/ (remote_id_at_url=ICMyL-DF) on the access point = http://conabioweb.conabio.gob.mx:8050/digir/DIGIR.php
The one with ID=12137 has been removed from the DiGIR response, and the synchronizer removed it from the Registry as well. That leaves us with just DR_ID=11115, which I think is good, but we still need to delete 12137 from the indexing database.
======
So in conclusion, I can say with confidence that the synchronizer has behaved correctly with this publisher. Here is the list of IDs (same ones as in the last email) so we can batch delete them from the index DB:
12137, 13105, 1593, 1595, 1597, 1599, 1601, 2498, 2499, 2500, 2501, 2502, 2503, 2505, 2506, 2507, 2508
--> all removed, 29.11.12
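Note: the manual check described above (is each registered remote id still listed by the access point?) amounts to a set difference. The sketch below assumes the remote ids advertised in the DiGIR response have already been extracted into a list by some other means; it does not parse DiGIR XML, and the function name missing_at_source is hypothetical.

```python
def missing_at_source(registered_ids, advertised_ids):
    """Return the registered remote ids the access point no longer advertises.

    Both arguments are iterables of remote id strings. Surrounding whitespace
    is ignored; the comparison is otherwise exact (case-sensitive).
    """
    advertised = {remote_id.strip() for remote_id in advertised_ids}
    registered = {remote_id.strip() for remote_id in registered_ids}
    # Sorted so the result is stable and easy to compare against a ticket list.
    return sorted(registered - advertised)
```

Any id returned here would be a candidate for deletion from the index DB, subject to the same double-check with the publisher that was done for the cases in this issue.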
Author: ahahn@gbif.org Created: 2012-11-27 16:49:36.453 Updated: 2012-11-29 14:32:57.202
other resources (José, Jan):
==========================================
BeBIF Provider (data_provider_id=12)
Metadata sync deleted correctly. The resource at http://data.gbif.org/datasets/resource/90/ is gone at the source.
Solution: Delete data_resource_id=90 from the indexing DB.
--> 29.11.12
==========================================
Bird Studies Canada (data_provider_id=18)
1 conflicting DS
Our data portal has 2 duplicates of the same resource "Marsh Monitoring Program - Birds": http://data.gbif.org/search/Marsh%20Monitoring which gives
http://data.gbif.org/datasets/resource/11529/ (remote_id_at_url=mmpbirds2)
http://data.gbif.org/datasets/resource/59/ (remote_id_at_url=mmpbirds)
Solution: In this case it will suffice to delete data_resource_id=11529 from the indexing DB.
--> deleted, 29.11.12
==========================================
Korea Institute of Science and Technology Information (data_provider_id=33)
Metadata sync deleted correctly. The resource at http://data.gbif.org/datasets/resource/109/ is gone at the source.
Solution: Delete data_resource_id=109 from the indexing DB.
--> resource 109 deleted, 27.11.12
==========================================
San Diego Natural History Museum (data_provider_id=151)
Metadata sync deleted correctly. The resource at http://data.gbif.org/datasets/resource/638/ is gone at the source.
Solution: Delete data_resource_id=638 from the indexing DB.
* Track with Jan, he asked the publisher if they really deleted that dataset
--> see comment from Jan earlier: _not_ to be deleted!
==========================================
University of Alberta Museums (data_provider_id=178)
Metadata sync deleted correctly. The resource at http://data.gbif.org/datasets/resource/772/ is gone at the source.
Solution: Delete data_resource_id=772 from the indexing DB.
* Track with Jan, he asked the publisher if they really deleted that dataset
==========================================
AIT Austrian Institute of Technology GmbH (data_provider_id=465)
The resource at http://data.gbif.org/datasets/resource/14546/ has a different UUID from the one in the Registry. It also has a different name:
Portal : The DNA & Sample Repository @ AIT
vs
Registry: The DNA and Sample Repository at AIT
I don't know what happened here! How can the name be changed?
Solution: I would strongly suggest deleting data_resource_id=14546, as it currently does not have any occurrences indexed. Better to delete it and then reindex it from the publisher's page (http://gbrds.gbif.org/browse/agent?uuid=81bfa2a5-22a8-4bea-b91c-d54bda1365b9) via the metadata sync.
--> 14546 deleted 27.11.12; >metadata update waiting for service restart< done
==========================================
Finnish Museum of Natural History (data_provider_id=50)
The resource at http://data.gbif.org/datasets/resource/14040/ does not share the same UUID with the resource in the Registry http://gbrds.gbif.org/browse/agent?uuid=aea3a96a-3580-11e2-918b-00145eb45e9a
This is the known dataset (Fieldjournal.org...). This is part of the Biocase issue, where the same dataset is deleted and then created again.
Temporary solution: modify the UUID in the portal database to point to the Registry one. BUT this needs a fix in the synchronizer as well.
--> UUID fixed, 27.11.12
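Note: the AIT and Finnish Museum cases were both found by comparing the UUID held in the portal with the one in the Registry. A check of that kind could be sketched as follows. It assumes both sides can be exported as mappings keyed by the same (access point URL, remote id) pair, which is a hypothetical join key chosen for illustration; compare_uuids is likewise an invented name.

```python
def compare_uuids(portal, registry):
    """Compare portal and Registry UUIDs for the same datasets.

    portal and registry map a shared key, e.g. (access_point_url, remote_id),
    to a UUID string. Returns (mismatched, missing): mismatched maps each key
    to (portal_uuid, registry_uuid) where they differ, and missing lists the
    keys the Registry no longer has at all.
    """
    mismatched = {}
    missing = []
    for key, portal_uuid in portal.items():
        registry_uuid = registry.get(key)
        if registry_uuid is None:
            missing.append(key)
        elif registry_uuid.lower() != portal_uuid.lower():
            # UUIDs are compared case-insensitively.
            mismatched[key] = (portal_uuid, registry_uuid)
    return mismatched, missing
```

Entries in mismatched correspond to the delete-and-recreate pattern described above, while entries in missing are the ones that need checking against the source before anything is removed from the index.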
Author: ahahn@gbif.org Comment: Most cases have been handled now. For the remaining, creating individual issues, then closing this one. Created: 2012-11-29 14:14:44.26 Updated: 2012-11-29 14:14:44.26