Issue 12346

Test the Crawler by crawling all available endpoints

Reporter: lfrancke
Assignee: lfrancke
Type: Task
Summary: Test the Crawler by crawling all available endpoints
Priority: Critical
Resolution: Fixed
Status: Closed
Created: 2012-11-21 23:36:55.713
Updated: 2013-12-17 16:13:02.03
Resolved: 2013-01-15 13:06:49.016
        
Description: This issue is only concerned with BioCASe, DiGIR and TAPIR.

We need to thoroughly test the Crawler before we can put it into production. There are a few things we need in order to achieve this testing:

* Set up the infrastructure to run the Crawler (should probably not be on staging, so we need to find somewhere else for this to live)
** Crawler Coordinator
** Crawler
** RabbitMQ (messaging.gbif.org)
** ZooKeeper (c1n1.gbif.org)
** Logging listener (configured to log per file per uuid)
** ???

* Test the Coordinator by scheduling all Datasets and investigating all errors
** Make a note of the distribution of the errors. I suspect most of them will be due to wrong and/or missing endpoints etc.

* We need to capture and assess the output of the actual crawling
** Tim: Good point.  Why don't we isolate this to a crawler _only_ test, and do the monitoring needed to support only that component?
** Compare the number of records we got to the number of records the current HIT gets
*** Tim: Could be difficult, but Tim will investigate if these can be extracted from the DB behind the HIT.  We could use the ROR table, but that introduces new inconsistencies (e.g. might have errors in HIT->DB sync).  Depends if you want to isolate crawl tests or not.
** Capture any errors that occurred during crawls and analyze them all
** Capture all logs/messages that occurred during a crawl
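As a sketch of the count comparison in the bullets above, assuming per-dataset record counts from the new Crawler and from the current HIT are available as plain mappings (all names and numbers below are invented):

```python
# Per-dataset record counts; in practice the keys would be dataset UUIDs
# and the values would come from the crawler output and the HIT database.
crawler_counts = {"ds-1": 1200, "ds-2": 980, "ds-3": 0}
hit_counts = {"ds-1": 1200, "ds-2": 1000}

# Datasets whose counts differ, or that only one side knows about.
all_ids = set(crawler_counts) | set(hit_counts)
mismatches = {
    ds: (crawler_counts.get(ds), hit_counts.get(ds))
    for ds in sorted(all_ids)
    if crawler_counts.get(ds) != hit_counts.get(ds)
}
for ds, (c, h) in mismatches.items():
    print(f"{ds}: crawler={c} hit={h}")
```

Datasets only one side knows about show up with a None on the other side, which is exactly the "wrong and/or missing endpoints" case worth flagging.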
    
Attachment report_super_hit_23_nov_2012.txt


Author: kbraak@gbif.org
Created: 2012-11-22 15:17:29.656
Updated: 2012-11-22 15:17:29.656
        
[~ahahn@gbif.org][~trobertson@gbif.org][~lfrancke@gbif.org] Please find the report attached, in tab delimited format, with the following 25 columns of information:

DataProviderId (PORTAL)
DataProviderName (HIT)
DataProviderCreated (PORTAL)
DataProviderModified (PORTAL)
DataProviderDeleted (PORTAL)
DataResourceId (PORTAL)
DataResourceRegistryUUID (PORTAL)
DataResourceName (PORTAL)
DataResourceCreated (PORTAL)
DataResourceModified (PORTAL)
DataResourceDeleted (PORTAL)
BioDatasourceName (HIT)
RemoteIdAtURL (PORTAL)
TargetRecords (HIT)
MaxHarvestedRecords (HIT)
HarvestedRecords (HIT)
DroppedRecords (HIT)
DateLastHarvested (HIT)
RawOccurrenceRecordCount (PORTAL)
OccurrenceRecordCount (PORTAL)
ResourceAccessPointId (PORTAL)
URL (HIT)
BioDatasourceId (HIT)
ResourceAccessPointUUID (HIT)
DataProviderUUID (HIT)

I counted just over 1000 rows having a DataResourceRegistryUUID. Exactly how many of these are DiGIR/BioCASE/TAPIR I can't say.

Hope this is helpful, even though it likely won't provide you with counts for _all_ DiGIR/BioCASE/TAPIR datasets.

If you need an explanation of what any of the columns mean, just ask.
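As a sketch of how such a report could be processed, the UUID count above might be reproduced with Python's csv module; only the column names below come from the list above, the sample rows are made up:

```python
import csv
import io

# Tiny in-memory stand-in for the attached tab-delimited report.
# Column names are taken from the list above; the rows are invented.
sample = (
    "DataResourceRegistryUUID\tDataResourceName\tHarvestedRecords\n"
    "00000000-0000-0000-0000-000000000001\tResource A\t1200\n"
    "\tResource B\t0\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
# Count rows where the UUID column is non-empty.
with_uuid = sum(1 for row in reader if row["DataResourceRegistryUUID"].strip())
print(with_uuid)
```

For the real file, `io.StringIO(sample)` would be replaced by `open("report_super_hit_23_nov_2012.txt", newline="")`.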
    


Author: kbraak@gbif.org
Comment: Sorry, I had to delete the attached report, but I will attach a revised version tomorrow. The data resource id lookup in the script had to be modified for the dataset-aware database. That's why there were so few rows having a DataResourceRegistryUUID. 
Created: 2012-11-22 16:33:12.127
Updated: 2012-11-22 16:33:12.127


Author: kbraak@gbif.org
Created: 2012-11-23 10:13:59.11
Updated: 2012-11-23 10:13:59.11
        
Please find the revised report attached, in tab-delimited format. There are now 13163 rows having a DataResourceRegistryUUID. There is also a new column called HarvesterFactory (HIT) that distinguishes whether it is DiGIR, BioCASE, TAPIR, or DwC-A.

Counts by HarvesterFactory are:

9315 DiGIRHarvesterFactory
1583 BioCASEHarvesterFactory
966 DwCArchiveHarvesterFactory
312 TapirHarvesterFactory

When importing into Excel I have problems interpreting the dates. I also notice a handful of rows that break, probably due to line-break characters. Therefore, I attach the original txt file for your own processing.
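The counts by HarvesterFactory above could be reproduced along these lines (a sketch; the HarvesterFactory values match the report, but the sample rows are invented):

```python
import csv
import io
from collections import Counter

# In-memory stand-in for the revised tab-delimited report.
sample = (
    "DataResourceRegistryUUID\tHarvesterFactory\n"
    "uuid-1\tDiGIRHarvesterFactory\n"
    "uuid-2\tDiGIRHarvesterFactory\n"
    "uuid-3\tBioCASEHarvesterFactory\n"
    "uuid-4\tTapirHarvesterFactory\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
counts = Counter(row["HarvesterFactory"] for row in reader)
for factory, n in counts.most_common():
    print(factory, n)
```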



Author: lfrancke@gbif.org
Created: 2013-01-10 18:07:18.162
Updated: 2013-01-10 18:07:18.162
        
Thanks [~kbraak@gbif.org]. Unfortunately, I'm running into the same problems and am unable to process the data. Could you tell me how you generated the report?

I need one with only a few columns: data resource id, target records, max harvested records, harvested records, dropped records, date last harvested, raw occurrence record count, occurrence record count

That should not include any text columns and thus be a bit easier to process.
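A minimal sketch of extracting just those columns from the tab-delimited report, assuming the column names match the report header (the sample row below is invented):

```python
import csv
import io

# The columns requested above; names assumed to match the report header.
WANTED = [
    "DataResourceId", "TargetRecords", "MaxHarvestedRecords",
    "HarvestedRecords", "DroppedRecords", "DateLastHarvested",
    "RawOccurrenceRecordCount", "OccurrenceRecordCount",
]

# Invented single-row stand-in for the full report.
sample = (
    "DataResourceId\tDataResourceName\tTargetRecords\tMaxHarvestedRecords\t"
    "HarvestedRecords\tDroppedRecords\tDateLastHarvested\t"
    "RawOccurrenceRecordCount\tOccurrenceRecordCount\n"
    "42\tSome free-text name\t1000\t1000\t998\t2\t2012-11-20\t998\t990\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
# Keep only the wanted columns, dropping free-text fields entirely.
slim = [[row[c] for c in WANTED] for row in reader]
print(slim)
```

Dropping the text columns this way also sidesteps the embedded line breaks that broke the earlier import.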