Issue 12027

registry-metadata-sync: duplicate datasets on BioCASE

Reporter: kbraak
Assignee: fmendez
Type: Bug
Summary: registry-metadata-sync: duplicate datasets on BioCASE
Priority: Blocker
Resolution: WontFix
Status: Closed
Created: 2012-10-15 15:30:17.362
Updated: 2013-12-16 17:50:20.188
Resolved: 2012-10-18 10:23:47.24
        
Description: A BioCASE technical installation ( http://gbrds.gbif.org/browse/agent?uuid=603b2dd6-f762-11e1-a439-00145eb45e9a ) has 4 registered endpoints:

BIOCASE - http://dsibib.mnhn.fr/biocase/pywrapper.cgi?dsa=arachne
BIOCASE - http://dsibib.mnhn.fr/biocase/pywrapper.cgi?dsa=coleoptere
BIOCASE - http://dsibib.mnhn.fr/biocase/pywrapper.cgi?dsa=mycobase
BIOCASE - http://dsibib.mnhn.fr/biocase/pywrapper.cgi?dsa=reptamph

The result of metadata synchronization is that each endpoint produces 4 different datasets. Plus there are others. Here's the result:

4 x Arachnides datasets
4 x Coleopteres datasets
4 x Ensiferes datasets
4 x Marine invertebrates, mollusca and crustacea datasets
4 x MNHN Reptiles and Amphibians datasets
4 x Ressources fongiques datasets

Actually, the inventory (scan) from the endpoint http://dsibib.mnhn.fr/biocase/pywrapper.cgi?dsa=arachne returns:

    
      Arachnides
      Coleopteres
      Ensiferes
      MNHN Reptiles and Amphibians Collection Catalog
      Marine invertebrates, mollusca and crustacea
      Ressources fongiques
    

I suspect this is because all 4 endpoints expose all 6 datasets behind them, so each time one endpoint synchronizes, every dataset gets added/updated again.

Surely they all use the same dataset name (code) and we can update each dataset instead of creating a new one for each endpoint?
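
A minimal sketch of that idea, not the actual registry-metadata-sync code. It assumes the dataset title returned by the inventory is stable across endpoints and that endpoints of the same wrapper differ only in the ?dsa= query string; scan, synchronize and both key functions are hypothetical names used for illustration only.

```python
from urllib.parse import urlsplit

# The 4 registered endpoints from this issue.
ENDPOINTS = [
    "http://dsibib.mnhn.fr/biocase/pywrapper.cgi?dsa=arachne",
    "http://dsibib.mnhn.fr/biocase/pywrapper.cgi?dsa=coleoptere",
    "http://dsibib.mnhn.fr/biocase/pywrapper.cgi?dsa=mycobase",
    "http://dsibib.mnhn.fr/biocase/pywrapper.cgi?dsa=reptamph",
]

# The 6 titles listed by the inventory (scan).
DATASET_TITLES = [
    "Arachnides",
    "Coleopteres",
    "Ensiferes",
    "MNHN Reptiles and Amphibians Collection Catalog",
    "Marine invertebrates, mollusca and crustacea",
    "Ressources fongiques",
]


def scan(endpoint_url):
    """Stand-in for a BioCASE inventory request (hypothetical).

    Mirrors what this issue describes: every endpoint lists all six titles.
    """
    return list(DATASET_TITLES)


def key_per_endpoint(endpoint_url, title):
    """Identity = (endpoint, title): one record per endpoint per title."""
    return (endpoint_url, title)


def key_per_wrapper(endpoint_url, title):
    """Identity = (wrapper URL without query string, title).

    Assumption: endpoints that differ only in ?dsa= belong to one wrapper.
    """
    base = urlsplit(endpoint_url)._replace(query="").geturl()
    return (base, title)


def synchronize(endpoints, key_for):
    """Upsert one record per key; an existing key is updated, not duplicated."""
    registry = {}
    for url in endpoints:
        for title in scan(url):
            registry[key_for(url, title)] = {"title": title, "last_seen_at": url}
    return registry


if __name__ == "__main__":
    print(len(synchronize(ENDPOINTS, key_per_endpoint)))  # 4 endpoints x 6 titles
    print(len(synchronize(ENDPOINTS, key_per_wrapper)))   # one record per title
```

Keying on the endpoint reproduces the 4 copies of each dataset listed above, while keying on the wrapper plus title updates the existing record. Whether the title is a safe identifier is exactly the open question here.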

I attach a copy of the logs from the synchronization on the technical installation for our reference.

Thanks

Attachment Screen Shot 2012-10-15 at 3.29.37 PM.png

Attachment agent-603b2dd6-f762-11e1-a439-00145eb45e9a.log


Author: fmendez@gbif.org
Comment: The metadata-sync processes each endpoint separately, which means each endpoint has its own set of datasets. I understand that in this case we have several endpoints serving the same six datasets? That case is not handled by the synchronizer at all, and I don't know if we should handle it; for the metadata-sync, those are different datasets with the same name behind different endpoints.
Created: 2012-10-17 12:22:40.12
Updated: 2012-10-17 12:22:40.12


Author: fmendez@gbif.org
Comment: The real issue here is that we are harvesting 3 endpoints that serve the same 6 datasets. The metadata-synchronizer cannot know whether those datasets are the same or not; it assumes that each endpoint serves its own datasets.
Created: 2012-10-18 10:23:47.264
Updated: 2012-10-18 10:23:47.264