Issue 11830

duplicate resource access points created (metadata synch from registry into current portal db)

Reporter: ahahn
Assignee: kbraak
Type: Bug
Summary: duplicate resource access points created (metadata synch from registry into current portal db)
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2012-09-12 13:59:36.509
Updated: 2012-11-22 16:44:24.705
Resolved: 2012-11-22 16:44:20.211
        
Description: On synchronisation of metadata from the registry into the data portal index (rancor/krayt), new access points get created instead of re-using existing ones. The reason for this seems to be that registry.service.id, rather than registry.service.uuid, is compared against portal.resource_access_point.uuid. This leads to the generation of resource_access_point records with uuids like "2174.

registry: select * from service where agent_id=1492;
portal index: select * from resource_access_point where data_resource_id = 12034;
    


Author: kbraak@gbif.org
Created: 2012-09-12 16:00:36.352
Updated: 2012-09-12 16:00:36.352
        
Actually, uniqueness of resource access points (RAP) has always been based on data_resource_id (DR ID), data_provider_id (DP ID), and remote_id_at_url. You can ignore the resource access point uuid: it is just extra info added to the table and, given that access points/services change so often in our UDDI/Registry, it has never been used as a stable RAP identifier.

For DiGIR and BioCASE, the remote_id_at_url is being stored in the Registry and will be used in RAP lookups. This is done in the endpoint.code section, see:

http://b3g4.gbif.org:8080/registry-ws/dataset/86594bf6-f762-11e1-a439-00145eb45e9a

endpoints: [
{
id: 11660,
type: "DIGIR_MANIS",
url: "http://digir.ansp.org/digir/DiGIR.php",
preferred: false,
code: "Herpetology",
networkEntityKey: "86594bf6-f762-11e1-a439-00145eb45e9a"
}

For TAPIR and DwC-A, there is no longer a remote_id_at_url stored in the Registry, and therefore RAP lookup only uses the DR ID and DP ID.
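
To make the lookup rule above concrete, here is a minimal Python sketch. It is only an illustration, not the indexing toolkit's actual code: the names RapKey, rap_lookup_key and CODE_BASED_PROTOCOLS are hypothetical, and the protocol labels other than "DIGIR_MANIS" are assumed rather than taken from the Registry.

```python
from typing import NamedTuple, Optional

# Assumption: protocol labels for which the Registry endpoint "code" carries
# the remote_id_at_url. Only "DIGIR_MANIS" appears in the example above;
# the other labels are illustrative guesses.
CODE_BASED_PROTOCOLS = {"DIGIR", "DIGIR_MANIS", "BIOCASE"}


class RapKey(NamedTuple):
    """Hypothetical key identifying one resource access point (RAP)."""
    data_resource_id: int
    data_provider_id: int
    remote_id_at_url: Optional[str]


def rap_lookup_key(data_resource_id: int,
                   data_provider_id: int,
                   endpoint_type: str,
                   endpoint_code: Optional[str]) -> RapKey:
    """Build the key used to look up an existing RAP.

    DiGIR/BioCASE: DR ID + DP ID + endpoint code (remote_id_at_url).
    TAPIR/DwC-A:   DR ID + DP ID only, since no code is stored.
    The endpoint id/uuid is deliberately not part of the key.
    """
    if endpoint_type in CODE_BASED_PROTOCOLS:
        return RapKey(data_resource_id, data_provider_id, endpoint_code)
    return RapKey(data_resource_id, data_provider_id, None)
```

With a key like this, two synchronisation runs for the same DiGIR resource produce the same key even if the endpoint id in the Registry changes, so the existing RAP row would be re-used rather than duplicated.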

Yes, in the past the harvesters derived TAPIR's remote_id_at_url from its access point URL, but since TAPIR and DwC-A datasets are always 1 dataset: 1 access point, I'd vote that:

instead of storing TAPIR RAP/code in the Registry, we just take care during synchronization to ensure that only a single resource access point exists for TAPIR and DwC-A datasets. Thoughts?

 
    


Author: kbraak@gbif.org
Created: 2012-09-13 14:24:13.652
Updated: 2012-09-13 14:24:13.652
        
Fix committed in r1796.

I also updated the code so that it no longer writes the null resource access points for BioCASE, since these were only ever used by the 1st generation harvesters.

I will put this fix up later today. 
    


Author: kbraak@gbif.org
Created: 2012-10-12 15:40:18.939
Updated: 2012-10-12 15:40:18.939
        
Andrea, we agreed that the endpoint UUID would be used for resource_access_point.uuid instead of the endpoint ID.

Forgive me, but I have just remembered why the endpoint ID is used instead: there is no endpoint UUID, only a networkEntityKey. For example, see the endpoint in this dataset: http://b3g4.gbif.org:8080/registry-ws/dataset/16e1958f-cb0a-445d-98cf-4073b344f268

Therefore, we could consider 1) changing the resource_access_point table by changing the uuid column name to id, or 2) simply removing the column since I don't think it is used for anything.  
    


Author: kbraak@gbif.org
Created: 2012-11-22 15:55:44.924
Updated: 2012-11-22 15:55:44.924
        
It was agreed that instead of changing the database, which could have cascading effects on the Web App, we would just set the RAP.UUID to null on updates now.
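
As a rough illustration of that behaviour (a Python sketch under stated assumptions, not the code committed to the toolkit): the function name apply_registry_update and the dictionary field names other than uuid are hypothetical stand-ins for the resource_access_point columns.

```python
def apply_registry_update(existing_rap: dict, endpoint: dict) -> dict:
    """Return an updated copy of an existing resource access point row.

    Assumption: rows and endpoints are plain dicts here; the real toolkit
    works against the portal database.
    """
    updated = dict(existing_rap)      # leave the caller's row untouched
    updated["url"] = endpoint["url"]  # refresh the access point URL
    updated["uuid"] = None            # RAP.UUID is cleared on every update
    return updated
```

The table schema stays as it is, so anything in the Web App that reads the column keeps working; it simply sees null for access points that have been synchronised.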

This change was committed in http://code.google.com/p/gbif-indexingtoolkit/source/detail?r=1806

Can you please confirm this is now happening, so that I can close this issue? Thanks
    


Author: ahahn@gbif.org
Comment: Tested and confirmed for BioCASe, IPT, DiGIR. Overtaken by earlier updates for TAPIR with data looking ok. Can be closed.
Created: 2012-11-22 16:43:49.953
Updated: 2012-11-22 16:43:49.953