Issue 14223
Duplicate/additional records appear when using the migrate-datasets.xml scripts
14223
Reporter: jlegind
Assignee: trobertson
Type: Bug
Summary: Duplicate/additional records appear when using the migrate-datasets.xml scripts
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2013-10-14 14:27:47.296
Updated: 2013-12-16 17:50:27.454
Resolved: 2013-10-15 15:32:45.209
Description: (This is a potential blocker)
The MySQL query in the script occasionally creates rows that contain the exact same values.
Here is the MySQL part:
SELECT a.uuid AS uuid_key,
a.uuid AS parent_dataset_key,
aHosting.uuid AS installation_key,
aOwning.uuid AS owning_organization_key,
CASE a.category WHEN 18000 THEN 'CHECKLIST' WHEN 18010 THEN 'OCCURRENCE' WHEN 18020 THEN 'METADATA' ELSE 'OCCURRENCE' END AS type,
a.name AS title,
a.alias AS alias,
CASE WHEN char_length(trim(a.description))<1 THEN null ELSE a.description END AS description,
CASE WHEN a.language IS NULL THEN 'en' WHEN char_length(trim(a.language))!=2 THEN 'en' ELSE a.language END AS language,
CASE WHEN trim(replace(a.homepage, ' ', '%20')) = '' THEN null
WHEN trim(a.homepage) NOT REGEXP '^[a-z0-9]{2,}://|^[a-z0-9]{2,}\\.' THEN IF(a.homepage REGEXP '`', REPLACE(a.homepage, '`',''), NULL )
WHEN trim(a.homepage) REGEXP '[[:space:]]' THEN REPLACE(a.homepage, ' ','')
WHEN trim(replace(a.homepage, ' ', '%20')) NOT LIKE '%http%' THEN CONCAT('http://', trim(replace(a.homepage, ' ', '%20')))
ELSE trim(a.homepage) END AS homepage,
CASE WHEN trim(replace(a.logoURL, ' ', '%20')) = '' THEN null
WHEN trim(replace(a.logoURL, ' ', '%20')) NOT LIKE '%http%' THEN CONCAT('http://', trim(replace(a.logoURL, ' ', '%20')))
ELSE trim(replace(a.logoURL, ' ', '%20')) END AS logo_url,
CASE WHEN char_length(trim(ep.citation_text))<1 THEN null ELSE ep.citation_text END AS citation,
CASE WHEN char_length(trim(ep.citation_identifier))<1 THEN null ELSE ep.citation_identifier END AS citation_identifier,
CASE WHEN char_length(trim(ep.rights))<1 THEN null ELSE ep.rights END AS rights,
CASE WHEN a.admin_override IS NULL THEN FALSE ELSE a.admin_override END AS locked_for_auto_update,
COALESCE(a.created, a.modified, now()) AS created,
COALESCE(a.modified, a.created, now()) AS modified,
a.deleted,
'registry-migration.gbif.org' AS created_by,
'registry-migration.gbif.org' AS modified_by
FROM agent a
JOIN agent_relation arOwning ON a.id=arOwning.to_agent_id
JOIN agent_relation arHosting ON a.id=arHosting.to_agent_id
JOIN agent aOwning ON aOwning.id=arOwning.from_agent_id
JOIN agent aHosting ON aHosting.id=arHosting.from_agent_id
LEFT JOIN extended_property ep ON a.id=ep.agent_id
WHERE a.type=14020
AND arOwning.relation_type_id=15010
AND arHosting.relation_type_id=15020
AND aOwning.type=14010
AND aHosting.type=14100
AND aHosting.name NOT IN('IPT_TEST', 'Test IPT','Test IPT','Test IPT','Test IPT','TEST IPT FROM K','TEST IPT FROM K #3', 'Test 2.0.2 Kyle', 'TEST IPT 204 Pre-release (edited 5)')
/*
Prefer non deleted records, and newest installation first.
The joins bring back historical owners (e.g. a dataset once was owned by someone, but now it is moved) and also datasets served by multiple installations.
9d392768-27fc-11e2-85e3-00145eb45e9a for example is served by a DiGIR 5feb96ea-f762-11e1-a439-00145eb45e9a and an IPT 9afa1395-6e93-4848-a42d-bce896f5195e - we prefer the IPT.
The first one wins in the insert below
*/
ORDER BY arOwning.deleted, arHosting.deleted, arHosting.created DESC;
??
*(Might be best to take it from the scriptella file)*
Look at the Fishbase records in the result. Also "Fossil (GNHM-FO)" exhibits strange behavior, as do all the GNHM records. The combination of installation_key and owning_organization_key smells a bit cartesian
And "FishBase in the Catalogue of Life" has two different owning organization UUIDs.]]>
Author: kbraak@gbif.org
Comment: I had to consult with [~bko@gbif.org] on the progress Korea made migrating from DiGIR to the IPT. We agreed 5feb96ea-f762-11e1-a439-00145eb45e9a should have been deleted, so I deleted it. Burke will follow up with Korea on the last remaining DiGIR dataset called "mushroom" to see if it, and therefore the installation (and organization too) can all be deleted.
Created: 2013-10-14 16:38:49.476
Updated: 2013-10-14 16:38:49.476
Author: jlegind@gbif.org
Created: 2013-10-15 15:32:25.982
Updated: 2013-10-15 15:32:25.982
Tim explained this by pointing to the comment inside the SQL statement ( /* Prefer non deleted records, and newest installation first... */ )
I consider this issue closed.