15614
Reporter: jlegind
Assignee: jlegind
Type: Task
Summary: BioCASE metasync throws HTTP 422 exception
Priority: Major
Status: InProgress
Created: 2014-05-14 12:40:35.972
Updated: 2014-05-26 12:54:59.314
Description: The Senckenberg installation http://registry.gbif.org/web/index.html#/installation/60383ab8-f762-11e1-a439-00145eb45e9a has trouble finishing its sync due to an HTTP 422 error.
Here is the stack trace:
{quote}
INFO [2014-05-12 17:23:45,531+0200] [pool-4-thread-25] org.gbif.crawler.registry.metasync.MetasyncService: Done syncing. Processing result.
INFO [2014-05-12 17:23:45,532+0200] [pool-4-thread-25] org.gbif.registry.metasync.resulthandler.DebugHandler: Installation [60383ab8-f762-11e1-a439-00145eb45e9a] synced successfully. [7] added, [2] deleted, [133] updated
WARN [2014-05-12 17:23:45,724+0200] [pool-4-thread-25] org.gbif.common.messaging.MessageConsumer: Error handling message, will be acknowledged anyway and not retried
javax.validation.ValidationException: com.sun.jersey.api.client.UniformInterfaceException: HTTP 422:
Validation of [homepage] failed:
at org.gbif.ws.client.interceptor.HttpErrorResponseInterceptor.invoke(HttpErrorResponseInterceptor.java:74) ~[crawler-cli.jar:na]
at org.gbif.registry.ws.client.BaseNetworkEntityClient.create(BaseNetworkEntityClient.java:29) ~[crawler-cli.jar:na]
at org.gbif.ws.client.interceptor.HttpErrorResponseInterceptor.invoke(HttpErrorResponseInterceptor.java:47) ~[crawler-cli.jar:na]
at org.gbif.registry.metasync.resulthandler.RegistryUpdater.saveAddedDatasets(RegistryUpdater.java:199) ~[crawler-cli.jar:na]
at org.gbif.registry.metasync.resulthandler.RegistryUpdater.saveSyncResults(RegistryUpdater.java:147) ~[crawler-cli.jar:na]
at org.gbif.registry.metasync.resulthandler.RegistryUpdater.saveSyncResultsToRegistry(RegistryUpdater.java:43) ~[crawler-cli.jar:na]
at org.gbif.crawler.registry.metasync.MetasyncService$MetasyncCallback.handleMessage(MetasyncService.java:88) ~[crawler-cli.jar:na]
at org.gbif.crawler.registry.metasync.MetasyncService$MetasyncCallback.handleMessage(MetasyncService.java:63) ~[crawler-cli.jar:na]
at org.gbif.common.messaging.MessageConsumer.handleCallback(MessageConsumer.java:101) [crawler-cli.jar:na]
at org.gbif.common.messaging.MessageConsumer.handleDelivery(MessageConsumer.java:65) [crawler-cli.jar:na]
at com.rabbitmq.client.impl.ConsumerDispatcher$4.run(ConsumerDispatcher.java:121) ~[crawler-cli.jar:na]
at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:76) ~[crawler-cli.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_25]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_25]
at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
Caused by: com.sun.jersey.api.client.UniformInterfaceException: HTTP 422:
Validation of [homepage] failed:
at org.gbif.ws.client.interceptor.HttpErrorResponseInterceptor.invoke(HttpErrorResponseInterceptor.java:59) ~[crawler-cli.jar:na]
... 14 common frames omitted {quote}
From http://tools.ietf.org/html/rfc4918#section-11.2 :
11.2. 422 Unprocessable Entity
The 422 (Unprocessable Entity) status code means the server
understands the content type of the request entity (hence a
415(Unsupported Media Type) status code is inappropriate), and the
syntax of the request entity is correct (thus a 400 (Bad Request)
status code is inappropriate) but was unable to process the contained
instructions. For example, this error condition may occur if an XML
request body contains well-formed (i.e., syntactically correct), but
semantically erroneous, XML instructions.
Author: kbraak@gbif.org
Created: 2014-05-15 15:46:26.211
Updated: 2014-05-15 15:46:26.211
The problem happens during synchronization of http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=SGN
One or more datasets have a homepage that doesn't start with "http" or "https". Dataset creation fails because Dataset.homepage is annotated with @HttpURI, which "validates that the URI field is absolute, beginning with either http or https."
See: https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/model/registry/Dataset.java#L353
We could:
1) Remove the @HttpURI on Dataset.homepage
2) During synchronization, prepend a default http:// scheme to the homepage when the scheme is missing, e.g. "http://" + "www.senk..."
3) Contact the publisher to update the dataset homepage URLs
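For reference, option 2) could look roughly like the sketch below. This is only an illustration and not the metasync code: the class HomepageNormalizer and both method names are hypothetical, and the check only approximates the quoted @HttpURI rule (absolute URI with scheme http or https).

```java
import java.net.URI;

/** Hypothetical helper, not part of the GBIF code base. */
public class HomepageNormalizer {

  /** Approximates the quoted rule: absolute URI whose scheme is http or https. */
  public static boolean isAbsoluteHttpUri(URI uri) {
    if (uri == null || uri.getScheme() == null) {
      return false;
    }
    String scheme = uri.getScheme().toLowerCase();
    return scheme.equals("http") || scheme.equals("https");
  }

  /**
   * Prepends "http://" when the value has no scheme at all.
   * Values that already carry a scheme are returned unchanged.
   * Caveat: a value like "host.org:8080/x" is parsed with "host.org" as
   * its scheme, so it is NOT repaired here and still fails the check above.
   */
  public static URI withDefaultScheme(String raw) {
    String trimmed = raw.trim();
    // URI.create throws IllegalArgumentException on malformed input.
    URI uri = URI.create(trimmed);
    if (uri.getScheme() == null) {
      return URI.create("http://" + trimmed);
    }
    return uri;
  }

  public static void main(String[] args) {
    URI fixed = withDefaultScheme("www.example.org/collections");
    System.out.println(fixed + " valid=" + isAbsoluteHttpUri(fixed));
  }
}
```

The downside of this approach is that it guesses the protocol on the publisher's behalf, so a homepage that is only reachable over https would be stored with the wrong scheme.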
Author: kbraak@gbif.org
Created: 2014-05-15 16:04:01.387
Updated: 2014-05-15 16:04:01.387
We also need to address the fact that our last metadata synchronization was about to delete 129 out of 135 datasets!
"Synchronization succeeded. 0 datasets were updated. 6 datasets were added. 129 datasets were deleted."
This is probably occurring because the search request against the Endpoints (to get all the Metadata for a single Dataset) times out. An example search request against http://biocase.senckenberg.de/biocase/pywrapper.cgi?dsa=SGN takes about 1 minute.
What is the current timeout? Do we need to give it more time?
Author: kbraak@gbif.org
Comment: Looks like the current timeout is actually 120 minutes, so this cannot be the issue. See the MetasyncCallback inside org.gbif.crawler.registry.metasync.MetasyncService.
Created: 2014-05-15 16:24:02.665
Updated: 2014-05-15 16:24:02.665
Author: kbraak@gbif.org
Created: 2014-05-19 16:24:03.625
Updated: 2014-05-19 16:24:23.94
We're going to go with option 3).
[~jlegind@gbif.org] can you please contact the publisher and make sure that their website URLs and logo URLs (if they have any) start with http://? Currently they start with "www." only.
For ABCD 2.0.6, we're talking about the following terms:
DATASET.LOGO.URL
*/DataSet/Metadata/Owners/Owner/LogoURI
DATASET.WEBSITE.URL
*/DataSets/DataSet/Metadata/Description/Representation/URI
*/DataSets/DataSet/Metadata/Description/Owners/Owner/URIs/URL