Issue 11951

registry-metadata-sync: new BioCASE Dataset's type not populated

11951
Reporter: ahahn
Assignee: fmendez
Type: Bug
Summary: registry-metadata-sync: new BioCASE Dataset's type not populated
Priority: Blocker
Resolution: Fixed
Status: Closed
Created: 2012-09-27 12:17:29.112
Updated: 2013-12-16 17:50:33.276
Resolved: 2012-10-18 10:30:26.808
        
Description: A registry synchronisation for BioCASe generates a metadata scheduling operator ("red guy") in the HIT that uses the Technical installation homepage URL instead of the Service URL. As the homepage URL was generated as an artificial grouping aid during migration, it never constitutes a functional access point. This means that BioCASe resources cannot be expanded and hence cannot be indexed at all.

Example: http://gbrds.gbif.org/browse/agent?uuid=603e8800-f762-11e1-a439-00145eb45e9a ("BioCASe Installation AIT Austrian Institute of Technology GmbH"), homepage URL: http://picme.ait.ac.at/pywrapper.cgi; endpoint URL: http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples. HIT after registry synchronisation (http://hit.gbif.org/datasource/list.html?filter=provider&value=ait): URL=http://picme.ait.ac.at/pywrapper.cgi, should be http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples.



[11:54:20] Andrea Hahn: at the agent page, the endpoint is registered as http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples. on a registry synch, the HIT pulls it in as http://picme.ait.ac.at/pywrapper.cgi, however - this can never work, and I am wondering why that is
[11:54:43] Andrea Hahn: if you have HIT access: the page is http://hit.gbif.org/datasource/list.html?filter=provider&value=ait
[11:55:10] Federico Mendez: ok...i think I know why is that happening
[11:56:54] Federico Mendez: the migration scripts do this: take all urls that share the same "root" and put them behind a technical installation; a tech installation is created for the root url...for example...
[11:57:35] Federico Mendez: in the old registry we had: http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples2 and http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples3 ...basically a service for each url
[11:58:35] Federico Mendez: the script takes all those 3 urls and creates a Tech Inst using the url  http://picme.ait.ac.at/pywrapper.cgi; and then creates three services attached to the tech inst containing the old urls
[11:59:20] Federico Mendez: so, if the HIT is getting the wrong url, it is because it is using the Tech Inst homepage/url instead of using the service url
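
The grouping described in the chat above can be sketched as follows. This is only an illustration, not the actual migration script: the helper names `root_of` and `group_by_root` are hypothetical, and "root" is assumed here to mean the URL without its query string.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def root_of(url):
    """Return the URL without query string or fragment (assumed meaning of "root")."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def group_by_root(service_urls):
    """Map each root URL (one technical installation) to the service URLs behind it."""
    installations = defaultdict(list)
    for url in service_urls:
        installations[root_of(url)].append(url)
    return dict(installations)


# The three service URLs quoted in the chat log.
old_registry_urls = [
    "http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples",
    "http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples2",
    "http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples3",
]
grouped = group_by_root(old_registry_urls)
```

Under these assumptions, `grouped` has a single key, the installation URL, with the three original service URLs attached to it. Only the service URLs are usable access points.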
    

Attachment Screen Shot 2012-10-15 at 3.35.57 PM.png



Author: kbraak@gbif.org
Created: 2012-10-12 14:24:50.202
Updated: 2012-10-12 14:24:50.202
        
The registry synchronisation runs for the Austrian technical installation, and a new Dataset is successfully added. The problem is that the new Dataset doesn't have type: "OCCURRENCE". Below you can see the HIT logging from the synchronisation.

Compare the Austrian dataset (http://b3g4.gbif.org:8080/registry-ws/dataset/c0f99ffe-1463-11e2-9d5e-00145eb45e9a) with this German one (http://b3g4.gbif.org:8080/registry-ws/dataset/85685a84-f762-11e1-a439-00145eb45e9a) which indexes fine.

So that everybody is aware, all the HIT sends to the registry synchronizer is the technical installation UUID and the BioDatasource ID (used for logging). The registry synchronizer looks up everything else it needs (including endpoint URLs) using the registry web services.

Since a technical installation can have multiple endpoints that all share the same "root", the HIT just takes the first endpoint it encounters and preserves the root. In this case, it's just a coincidence that the root and the homepageURL are the same.
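
A minimal sketch of the "first endpoint, preserve the root" behaviour described above, to show why the result coincides with the homepage URL here. The function name `first_endpoint_root` is hypothetical and does not come from the HIT code; "root" is assumed to mean the URL without its query string.

```python
from urllib.parse import urlsplit, urlunsplit


def first_endpoint_root(endpoint_urls):
    """Take the first endpoint encountered and keep only its root (no query string)."""
    if not endpoint_urls:
        return None
    parts = urlsplit(endpoint_urls[0])
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Values taken from the AIT example in this issue.
endpoints = ["http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples"]
homepage_url = "http://picme.ait.ac.at/pywrapper.cgi"

root = first_endpoint_root(endpoints)
same_as_homepage = (root == homepage_url)
```

For the AIT installation `same_as_homepage` is true, which matches the coincidence noted above: the URL shown in the HIT is derived from the endpoint, not copied from the homepage field.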

For an example of a technical installation with multiple URLs, take this one from the BGBM (http://gbrds.gbif.org/browse/agent?uuid=60454014-f762-11e1-a439-00145eb45e9a).

So the real question is: why doesn't the new Dataset created by the synchronizer contain type=OCCURRENCE?


**********
HIT LOGGING
**********

2012-10-12 13:55:51.0	Finished registry sync by Organization
2012-10-12 13:55:51.0	Dataset with key c0f99ffe-1463-11e2-9d5e-00145eb45e9a was not synchronized because the type was NULL
2012-10-12 13:55:51.0	Dataset with key c0f99ffe-1463-11e2-9d5e-00145eb45e9a was not synchronized because the type was NULL
2012-10-12 13:55:51.0	BioDatasource exits for name: 603e8800 and uuid: 3784 - information updated.
2012-10-12 13:55:51.0	 Writing to file: /mnt/fiber/super_hit/ait_austrian_institute_of_technology_gmbh-81bfa2a5/provider_contact.txt
2012-10-12 13:55:51.0	Synchronizing installation with key: 603e8800-f762-11e1-a439-00145eb45e9a
2012-10-12 13:55:51.0	Start registry sync by Organization

2012-10-12 13:55:44.0	Service 07db8d50-130d-4910-9165-d02bb6aa6983 synchronized, status: OK
2012-10-12 13:55:44.0	Technical Installation 603e8800-f762-11e1-a439-00145eb45e9a synchronized, status: OK
2012-10-12 13:55:44.0	Technical installation 603e8800-f762-11e1-a439-00145eb45e9a synchronization finished
2012-10-12 13:55:41.0	Synchronizing service 07db8d50-130d-4910-9165-d02bb6aa6983, url: http://picme.ait.ac.at/pywrapper.cgi?dsa=AIT_Rep_Centre_Samples
2012-10-12 13:55:41.0	Initiating agent(technical installation) synchronization 603e8800-f762-11e1-a439-00145eb45e9a
2012-10-12 13:55:41.0	Technical Installation synchronization started 603e8800-f762-11e1-a439-00145eb45e9a



    


Author: kbraak@gbif.org
Comment: This issue is a blocker for the Super HIT. Since the dataset type is not populated, the Super HIT doesn't know whether this resource is an Occurrence dataset or not.
Created: 2012-10-12 14:38:03.183
Updated: 2012-10-12 14:38:03.183


Author: kbraak@gbif.org
Comment: Screenshot showing all datasets currently missing category (type = occurrence)
Created: 2012-10-15 15:36:45.239
Updated: 2012-10-15 15:36:45.239


Author: fmendez@gbif.org
Created: 2012-10-18 10:30:26.84
Updated: 2012-10-18 10:30:26.84
        
Bug fixed with commit: http://code.google.com/p/gbif-registry/source/detail?r=3288
Since datasets created by migration scripts could have an invalid dataset type, MetadataSynchronizerBase now sets the dataset type for each update/create operation.
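
The idea behind the fix can be illustrated roughly as below. This is a hypothetical Python sketch, not the code from the commit: the `Dataset` class, the `apply_sync` function and the in-memory `registry` dict are all invented for illustration, and it is assumed that the synchronizer knows the correct type ("OCCURRENCE" here) for the datasets it handles.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Dataset:
    key: str
    type: Optional[str] = None  # None stands in for the NULL type seen in the HIT log


def apply_sync(registry: Dict[str, Dataset], key: str, dataset_type: str) -> Dataset:
    """Create or update a dataset, always (re)setting its type."""
    dataset = registry.get(key)
    if dataset is None:
        # Create path: a new dataset gets its type right away.
        dataset = Dataset(key=key)
        registry[key] = dataset
    # Update path: overwrite whatever the migration scripts left behind.
    dataset.type = dataset_type
    return dataset


# A migrated dataset with a NULL type, using the key from the HIT log above.
registry = {
    "c0f99ffe-1463-11e2-9d5e-00145eb45e9a": Dataset(
        key="c0f99ffe-1463-11e2-9d5e-00145eb45e9a"
    )
}
updated = apply_sync(registry, "c0f99ffe-1463-11e2-9d5e-00145eb45e9a", "OCCURRENCE")
created = apply_sync(registry, "hypothetical-new-key", "OCCURRENCE")
```

Setting the type unconditionally on both paths means that a dataset with a missing or invalid type is repaired the next time it is synchronized, rather than being skipped.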