Issue 10169

Decide on Etag vs Last Modified for caching response headers

Reporter: omeyn
Assignee: jcuadra
Type: Epic
Summary: Decide on Etag vs Last Modified for caching response headers
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2011-11-08 09:58:04.507
Updated: 2013-12-09 14:01:18.922
Resolved: 2011-11-08 10:00:15.017


Author: omeyn@gbif.org
Created: 2011-11-08 09:58:41.419
Updated: 2011-11-08 09:58:41.419
        
Jose Cuadra Mon, 19 Sep at 8:01am
For client-side caching, there are two options worth considering:
Last-Modified identifier: A client obtains the response for the requested object, and the server adds a "Last-Modified" header to the response. Subsequent requests from the client for the same resource can include an "If-Modified-Since" header carrying the date received in the previous response. If the object has not been modified in the data store since then, the server returns a 304 (Not Modified) status code instead of the full response.
Possible drawback: if the client's and the server's clocks differ significantly, the client might end up with a stale object.
-----
GET /portalapi/someresource HTTP/1.1
Accept: */*
...
If-Modified-Since: Mon, 19 Sep 2011 12:00:00 GMT
...
Host: 127.0.0.1
-----
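
A minimal server-side sketch of the If-Modified-Since check described above, using only the JDK. The class and method names (LastModifiedCheck, statusFor) are made up for illustration and are not part of Jersey or of our code base; it assumes the header arrives as a raw string in the RFC 1123 date format shown in the request above.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper: decides between 200 and 304 for an If-Modified-Since request. */
final class LastModifiedCheck {

    /**
     * @param ifModifiedSince raw header value, or null when the client sent none
     * @param lastModified    modification time of the stored object (server clock)
     * @return 304 if the object is unchanged since the given date, otherwise 200
     */
    static int statusFor(String ifModifiedSince, Instant lastModified) {
        if (ifModifiedSince == null) {
            return 200; // unconditional request: always send the full response
        }
        try {
            Instant since = ZonedDateTime
                    .parse(ifModifiedSince, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            // HTTP dates carry whole seconds only, so drop sub-second precision first.
            Instant modified = lastModified.truncatedTo(ChronoUnit.SECONDS);
            return modified.isAfter(since) ? 200 : 304;
        } catch (DateTimeParseException e) {
            return 200; // unparseable date: ignore the condition
        }
    }

    public static void main(String[] args) {
        Instant modified = Instant.parse("2011-09-19T11:30:00Z");
        System.out.println(statusFor("Mon, 19 Sep 2011 12:00:00 GMT", modified)); // 304
        System.out.println(statusFor("Mon, 19 Sep 2011 11:00:00 GMT", modified)); // 200
    }
}
```

Because the comparison only involves the date the server itself handed out earlier, the check stays independent of the client's clock as long as the client echoes that value back unchanged.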

ETag identifier: 1) The ETag can avoid this date problem by carrying a hash of the resource's content. 2) If we decide to use revision numbers to track changes to our resources (multiversion concurrency control), the ETag could simply contain the revision number.
In either case, if the ETags match, a 304 status code is returned as well.
-----
GET /portalapi/someresource HTTP/1.1
Accept: */*
...
If-None-Match: some hash (e.g. bdf63deecafa09bece92f55510e62aeb)
...
Host: 127.0.0.1
-----
In my opinion, the ETag field provides more flexibility and avoids all these timestamp problems. The only requirement I see is that we need to come up with a list of the fields that uniquely identify each resource (e.g. for Organisation = uuid, name, description, homepage, etc.), and decide whether to store the hash code in the DB or calculate it on the fly each time. (Or the revision number, if we decide on that.)
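
A rough sketch of the "calculate it on the fly" variant, assuming the relevant fields are available as strings. The names ContentEtag, etagOf and statusFor are hypothetical, and MD5 is chosen only because the example hash above looks like an MD5 hex digest. The simplified check ignores the "*" value and comma-separated lists that If-None-Match may also carry.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives an ETag from the fields that define a resource. */
final class ContentEtag {

    /** Builds a quoted ETag from an MD5 digest over the given field values. */
    static String etagOf(String... fields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                // Null is treated like an empty value; the zero byte acts as a
                // separator so ("ab", "c") hashes differently from ("a", "bc").
                md5.update((field == null ? "" : field).getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder("\"");
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.append('"').toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** Returns 304 when the client's If-None-Match value equals the current ETag. */
    static int statusFor(String ifNoneMatch, String currentEtag) {
        return currentEtag.equals(ifNoneMatch) ? 304 : 200;
    }

    public static void main(String[] args) {
        String etag = etagOf("some-uuid", "Some Organisation", "http://example.org");
        System.out.println(etag);
        System.out.println(statusFor(etag, etag)); // 304
    }
}
```

Storing the value in the DB instead would only replace the etagOf call with a column lookup; the comparison stays the same.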

An example might look something like:
http://tescherm.com/blog/2010/07/18/http-caching-with-jersey-jax-rs/
without the Last-Modified bit
Combination of both: We could also use both identifiers (ETag and Last-Modified), but to me this is overkill. I personally do not like this option and am only listing it to cover all bases.

Jose Cuadra Mon, 19 Sep at 8:34am

Lars Francke Mon, 19 Sep at 8:44am
Thank you very much. Very nice summary of the options.
How is the support for these in Jersey?
And any idea about client support for these two? I think ETags are by now supported by most clients, but I am not sure.
About your comment on the unique fields per resource: we need those anyway for proper equals and hashCode methods, which are used elsewhere as well, so this should be a non-issue.
From reading this, it seems we should support both where easily possible (i.e. where we have a "last modified date" we should be able to support Last-Modified easily) and always support ETag, in a way that depends on the outcome of your research into the multiversion approach. But perhaps only advertise or document ETag?

Markus Döring Mon, 19 Sep at 9:07am
Thanks Jose,
I'd like to get more insight into the consequences for clients.
Do all browsers support both variants (and do we care about browsers directly anyway)?
With ETags you need to store the last tag client side, while with modified dates this can be handled by the modified timestamp of the file in the filesystem, in case the response is kept there. For example, our httpUtils class supports reading the last modified date from the filesystem for conditional downloads. Would every object also keep the ETag as a property, or how would we keep track of this?
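
One possible way for a client to keep track of ETags without adding a property to every object is a small cache keyed by URL. The sketch below is an assumption for illustration only: EtagCache, validatorFor and resolve are invented names, it holds everything in memory, and the actual HTTP call is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical client-side store mapping a URL to its last ETag and body. */
final class EtagCache {

    /** One cached response: the validator plus the body it belongs to. */
    static final class Entry {
        final String etag;
        final String body;

        Entry(String etag, String body) {
            this.etag = etag;
            this.body = body;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    /** Value to send as If-None-Match for this URL, or null if nothing is cached. */
    String validatorFor(String url) {
        Entry entry = entries.get(url);
        return entry == null ? null : entry.etag;
    }

    /**
     * Resolves a server response into the body the caller should use:
     * a 304 returns the cached copy, any other status replaces it.
     */
    String resolve(String url, int status, String etag, String body) {
        if (status == 304) {
            Entry cached = entries.get(url);
            if (cached == null) {
                throw new IllegalStateException("304 without a cached entry: " + url);
            }
            return cached.body;
        }
        if (etag != null) {
            entries.put(url, new Entry(etag, body));
        }
        return body;
    }

    public static void main(String[] args) {
        EtagCache cache = new EtagCache();
        cache.resolve("/portalapi/someresource", 200, "\"v1\"", "first body");
        System.out.println(cache.validatorFor("/portalapi/someresource")); // "v1"
        System.out.println(cache.resolve("/portalapi/someresource", 304, null, null)); // first body
    }
}
```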

Markus Döring Mon, 19 Sep at 10:03am
Interestingly the REST plugin for struts2 uses both ETags and last modified.
And they also provide a class that we can look at as a template for dealing with this:
http://svn.apache.org/viewvc/struts/struts2/trunk/plugins/rest/src/main/java/org/apache/struts2/rest/DefaultHttpHeaders.java?view=markup

Jose Cuadra Mon, 19 Sep at 10:28am
Lars:
From the examples I've seen, Jersey handles caching via these two approaches pretty nicely. The URL I copied has a nice example of this.
From the support perspective, the only drawback I see is that ETag is not supported in HTTP/1.0, as it was introduced in HTTP/1.1.
As for supporting both, I agree if we want to stay aligned with the spec (http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13 --- 13.3.4 Rules for When to Use Entity Tags and Last-Modified Dates), which says what you said: we should support both. So I guess you had already read it.
Markus:
I think we don't care much about browsers in our web services. Anything happening through browsers will not involve heavy requests, just a person quickly checking some data, for which caching is not necessary. You are right that the ETag value needs to be stored client side, and I overlooked this problem. I will need to check whether Jersey's client API has any caching support, or we handcraft the cache support ourselves.
For your modified-dates approach, I think every response would need to be stored on the client's filesystem following a directory structure (e.g. portal_responses/organization/xxx , portal_responses/name_usage/xxx). I would rather send the current timestamp and try to come up with a better plan to avoid huge discrepancies between timestamps. Off the top of my head:
   -  Optionally sending the client's current timestamp with all POST requests and saving it as "modified" on the DB (assuming only our client at GBIF is the one allowed to create new resources)
   - Make the client request the server's current timestamp and attach this to the Request
   - will add more possible strategies later...

Markus Döring Mon, 19 Sep at 10:49am
Yes, we don't need to worry about browsers (it was mentioned before, that's why it came to my mind).
I don't understand your current-timestamp approach. It doesn't make sense to me, as you want to know whether something has changed since you last requested it. Passing "now" would always get you a 304, wouldn't it? To avoid clock differences, one could send back the modified date from the object, if it has one, as this would (probably) be managed by the server.

Jose Cuadra Tue, 20 Sep at 3:39am
Markus, yes I messed up on my wording there and your idea is good.
So, to be aligned with the spec and taking your considerations into account, I propose the following:
   - Return an ETag and/or Last-Modified whenever possible (taking back my "overkill" comment from the first post, after reading the HTTP/1.1 specs).   (*)
   - As Markus wrote, use the modified date from the returned object (which is managed by the server).
   - The ETag will be a hash of the object's most relevant fields.
   - Look into any built-in caching support in Jersey's client API (any ideas on this?), or use a third-party library to handle caching. But this is more of a client implementation matter and outside this TODO's scope.
(*) For best practices, we stick to section "13.3.4 Rules for When to Use Entity Tags and Last-Modified Dates" from the HTTP/1.1 spec on how to handle ETag&Last-Modified on the client and server side.
Roughly from the specs:
HTTP/1.1 origin servers:
SHOULD send an entity tag validator unless it is not feasible to generate one.
MAY send a weak entity tag instead of a strong entity tag, if performance considerations support the use of weak entity tags, or if it is unfeasible to send a strong entity tag.
SHOULD send a Last-Modified value if it is feasible to send one, unless the risk of a breakdown in semantic transparency that could result from using this date in an If-Modified-Since header would lead to serious problems.
HTTP/1.1 clients:
If an entity tag has been provided by the origin server, MUST use that entity tag in any cache-conditional request (using If-Match or If-None-Match).
If only a Last-Modified value has been provided by the origin server, SHOULD use that value in non-subrange cache-conditional requests (using If-Modified-Since).
If only a Last-Modified value has been provided by an HTTP/1.0 origin server, MAY use that value in subrange cache-conditional requests (using If-Unmodified-Since:). The user agent SHOULD provide a way to disable this, in case of difficulty.
If both an entity tag and a Last-Modified value have been provided by the origin server, SHOULD use both validators in cache-conditional requests. This allows both HTTP/1.0 and HTTP/1.1 caches to respond appropriately.
I might have overlooked something; I will be glad to read more suggestions or comments.
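
When a client follows the last rule and sends both validators, section 13.3.4 of the same spec also says the origin server must not answer 304 unless that is consistent with all conditional headers in the request. A minimal sketch of that combined check, assuming both headers are already parsed; ConditionalGet and statusFor are hypothetical names, not Jersey API.

```java
import java.time.Instant;

/** Hypothetical helper: evaluates If-None-Match and If-Modified-Since together. */
final class ConditionalGet {

    /**
     * @param ifNoneMatch     entity tag sent by the client, or null
     * @param ifModifiedSince already-parsed date sent by the client, or null
     * @param currentEtag     entity tag of the current representation
     * @param lastModified    modification time of the current representation
     * @return 304 only if every validator the client supplied still matches
     */
    static int statusFor(String ifNoneMatch, Instant ifModifiedSince,
                         String currentEtag, Instant lastModified) {
        if (ifNoneMatch == null && ifModifiedSince == null) {
            return 200; // no validators: plain GET
        }
        // A validator that was not sent cannot contradict a 304.
        boolean etagUnchanged = ifNoneMatch == null || ifNoneMatch.equals(currentEtag);
        boolean dateUnchanged = ifModifiedSince == null || !lastModified.isAfter(ifModifiedSince);
        return (etagUnchanged && dateUnchanged) ? 304 : 200;
    }

    public static void main(String[] args) {
        Instant modified = Instant.parse("2011-09-19T11:30:00Z");
        Instant since = Instant.parse("2011-09-19T12:00:00Z");
        System.out.println(statusFor("\"abc\"", since, "\"abc\"", modified)); // 304
        System.out.println(statusFor("\"old\"", since, "\"abc\"", modified)); // 200
    }
}
```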

Markus Döring Tue, 20 Sep at 3:52am
Sounds good.
Any reason not to simply use the object's hashCode() method for the ETag? The struts2 REST plugin linked above does that too.
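
A sketch of what the hashCode() variant could look like, assuming an Organisation class with a few of the fields mentioned earlier (uuid, name, homepage). The class shape and the etag() method are invented for illustration. One caveat worth keeping in mind: hashCode() is only 32 bits, so two different states can in principle produce the same tag.

```java
import java.util.Objects;

/** Hypothetical resource class whose ETag is derived from hashCode(). */
final class Organisation {
    final String uuid;
    final String name;
    final String homepage;

    Organisation(String uuid, String name, String homepage) {
        this.uuid = uuid;
        this.name = name;
        this.homepage = homepage;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Organisation)) return false;
        Organisation that = (Organisation) other;
        return Objects.equals(uuid, that.uuid)
                && Objects.equals(name, that.name)
                && Objects.equals(homepage, that.homepage);
    }

    @Override
    public int hashCode() {
        // Same fields as equals(), so equal objects always share a hash code.
        return Objects.hash(uuid, name, homepage);
    }

    /** Quoted ETag built from the hash code, rendered as hex. */
    String etag() {
        return "\"" + Integer.toHexString(hashCode()) + "\"";
    }

    public static void main(String[] args) {
        Organisation before = new Organisation("some-uuid", "a", "http://example.org");
        Organisation after = new Organisation("some-uuid", "b", "http://example.org");
        System.out.println(before.etag().equals(after.etag())); // false
    }
}
```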