Issue 15614

BioCASE metasync throws HTTP 422 exception

Reporter: jlegind
Assignee: jlegind
Type: Task
Summary: BioCASE metasync throws HTTP 422 exception
Priority: Major
Status: InProgress
Created: 2014-05-14 12:40:35.972
Updated: 2014-05-26 12:54:59.314
        
Description: The Senckenberg installation http://registry.gbif.org/web/index.html#/installation/60383ab8-f762-11e1-a439-00145eb45e9a fails to finish its metadata sync because of an HTTP 422 error.

Here is the stack trace:
{quote}
INFO  [2014-05-12 17:23:45,531+0200] [pool-4-thread-25] org.gbif.crawler.registry.metasync.MetasyncService: Done syncing. Processing result.
INFO  [2014-05-12 17:23:45,532+0200] [pool-4-thread-25] org.gbif.registry.metasync.resulthandler.DebugHandler: Installation [60383ab8-f762-11e1-a439-00145eb45e9a] synced successfully. [7] added, [2] deleted, [133] updated
WARN  [2014-05-12 17:23:45,724+0200] [pool-4-thread-25] org.gbif.common.messaging.MessageConsumer: Error handling message, will be acknowledged anyway and not retried
javax.validation.ValidationException: com.sun.jersey.api.client.UniformInterfaceException: HTTP 422: 
        at org.gbif.ws.client.interceptor.HttpErrorResponseInterceptor.invoke(HttpErrorResponseInterceptor.java:74) ~[crawler-cli.jar:na]
        at org.gbif.registry.ws.client.BaseNetworkEntityClient.create(BaseNetworkEntityClient.java:29) ~[crawler-cli.jar:na]
        at org.gbif.ws.client.interceptor.HttpErrorResponseInterceptor.invoke(HttpErrorResponseInterceptor.java:47) ~[crawler-cli.jar:na]
        at org.gbif.registry.metasync.resulthandler.RegistryUpdater.saveAddedDatasets(RegistryUpdater.java:199) ~[crawler-cli.jar:na]
        at org.gbif.registry.metasync.resulthandler.RegistryUpdater.saveSyncResults(RegistryUpdater.java:147) ~[crawler-cli.jar:na]
        at org.gbif.registry.metasync.resulthandler.RegistryUpdater.saveSyncResultsToRegistry(RegistryUpdater.java:43) ~[crawler-cli.jar:na]
        at org.gbif.crawler.registry.metasync.MetasyncService$MetasyncCallback.handleMessage(MetasyncService.java:88) ~[crawler-cli.jar:na]
        at org.gbif.crawler.registry.metasync.MetasyncService$MetasyncCallback.handleMessage(MetasyncService.java:63) ~[crawler-cli.jar:na]
        at org.gbif.common.messaging.MessageConsumer.handleCallback(MessageConsumer.java:101) [crawler-cli.jar:na]
        at org.gbif.common.messaging.MessageConsumer.handleDelivery(MessageConsumer.java:65) [crawler-cli.jar:na]
        at com.rabbitmq.client.impl.ConsumerDispatcher$4.run(ConsumerDispatcher.java:121) ~[crawler-cli.jar:na]
        at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:76) ~[crawler-cli.jar:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_25]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_25]
        at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
Caused by: com.sun.jersey.api.client.UniformInterfaceException: HTTP 422: 
        at org.gbif.ws.client.interceptor.HttpErrorResponseInterceptor.invoke(HttpErrorResponseInterceptor.java:59) ~[crawler-cli.jar:na]
        ... 14 common frames omitted
{quote}

From http://tools.ietf.org/html/rfc4918#section-11.2 :

11.2. 422 Unprocessable Entity

   The 422 (Unprocessable Entity) status code means the server
   understands the content type of the request entity (hence a
   415(Unsupported Media Type) status code is inappropriate), and the
   syntax of the request entity is correct (thus a 400 (Bad Request)
   status code is inappropriate) but was unable to process the contained
   instructions.  For example, this error condition may occur if an XML
   request body contains well-formed (i.e., syntactically correct), but
   semantically erroneous, XML instructions.

    


Author: kbraak@gbif.org
Created: 2014-05-15 15:46:26.211
Updated: 2014-05-15 15:46:26.211
        
The problem happens during synchronization of http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=SGN

One or more datasets have a homepage that doesn't start with "http" or "https". Dataset creation fails because Dataset.homepage is annotated with @HttpURI, which "validates that the URI field is absolute, beginning with either http or https."

See: https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/model/registry/Dataset.java#L353

We could:

1) Remove the @HttpURI on Dataset.homepage
2) During synchronization, add default http protocol to homepage when protocol is missing, e.g. "http://" + "www.senk..."
3) Contact the publisher to update the dataset homepage URLs
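
For reference, option 2) could look roughly like the sketch below. This is only an illustration, not the existing metasync code: the class HomepageNormalizer and both of its methods are hypothetical names, the check is a simplified stand-in for what @HttpURI enforces rather than the actual gbif-api validator, and www.example.org is a placeholder host.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper; not part of gbif-api or the metasync module. */
final class HomepageNormalizer {

    private HomepageNormalizer() {}

    /**
     * Simplified stand-in for the @HttpURI rule: true only if the value
     * parses as a URI whose scheme is http or https.
     */
    static boolean isHttpUri(String value) {
        if (value == null) {
            return false;
        }
        try {
            String scheme = new URI(value.trim()).getScheme();
            return scheme != null
                && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /**
     * Prepends "http://" when the homepage has no protocol at all.
     * Values that already carry some other protocol (e.g. ftp://) are
     * returned unchanged, so they would still be rejected by validation.
     */
    static String withDefaultScheme(String homepage) {
        if (homepage == null) {
            return null;
        }
        String trimmed = homepage.trim();
        if (trimmed.isEmpty() || isHttpUri(trimmed) || trimmed.contains("://")) {
            return trimmed;
        }
        return "http://" + trimmed;
    }
}
```

With this, withDefaultScheme("www.example.org") gives "http://www.example.org", while a homepage that already starts with http:// or https:// passes through untouched. Guessing http for every bare host is an assumption on our side, which is the main drawback of option 2) compared to having the publisher fix the source data.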

    


Author: kbraak@gbif.org
Created: 2014-05-15 16:04:01.387
Updated: 2014-05-15 16:04:01.387
        
We also need to address the fact that our last metadata synchronization was about to delete 129 out of 135 datasets!

"Synchronization succeeded. 0 datasets were updated. 6 datasets were added. 129 datasets were deleted."

This is probably occurring because the search request against the Endpoints (to get all the Metadata for a single Dataset) times out. An example search request against http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=SGN takes about 1 minute.

What is the current timeout? Do we need to give it more time? 
    


Author: kbraak@gbif.org
Comment: Looks like current timeout is actually 120 minutes, so this must not be the issue. See the MetasyncCallback inside org.gbif.crawler.registry.metasync.MetasyncService.
Created: 2014-05-15 16:24:02.665
Updated: 2014-05-15 16:24:02.665


Author: kbraak@gbif.org
Created: 2014-05-19 16:24:03.625
Updated: 2014-05-19 16:24:23.94
        
We're going to go with option 3), contacting the publisher.

[~jlegind@gbif.org] can you please contact the publisher and make sure that their website URLs and logo URLs (if they have any) start with http://? Currently they start with www. only.

For ABCD 2.0.6, we're talking about the following terms:

DATASET.LOGO.URL
*/DataSet/Metadata/Owners/Owner/LogoURI
DATASET.WEBSITE.URL
*/DataSets/DataSet/Metadata/Description/Representation/URI
*/DataSets/DataSet/Metadata/Description/Owners/Owner/URIs/URL
    


Author: jlegind@gbif.org
Comment: Publisher contacted
Created: 2014-05-26 12:54:59.314
Updated: 2014-05-26 12:54:59.314