Issue 17240

failed deploy leaves multiple services in zk

Reporter: omeyn
Assignee: mblissett
Type: Bug
Summary: failed deploy leaves multiple services in zk
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2015-02-17 11:01:15.01
Updated: 2017-10-06 15:24:37.124
Resolved: 2017-10-06 15:24:37.106
        
Description: Trying to do a full deploy of releases to uat. It failed in geocode-ws when moving logs, with tar reporting "file changed as we read it":

stderr: tar: ./1423658554/logs/geocode-ws.log: file changed as we read it
stdout: ./1423658554/logs/geocode-ws-0.6-1423658554.out
./1423658554/logs/geocode-ws.2015-02-11.log.gz
After failure there were multiple services in zk:

[zk: prodmaster1-vh(CONNECTED) 3] ls /uat/services/geocode-ws
[81f10334-0372-4417-ad9e-a60ced3ece7b, 4317dc2d-8c8f-4bfb-94f7-6e7b13ea3d57]
[zk: prodmaster1-vh(CONNECTED) 4] ls /uat/services/registry-ws
[4ba281d3-bdf0-4f03-87c7-3a2688f9c07c, 7c71b254-b21d-4722-b97f-ec790e3f2b91]

I'd expect a single service.

Also there were many timestamped dirs in the services dir for geocode, eg:

[root@uatapps2 geocode-ws]# ls 0.6/
1422526520/  1422527886/  1422536088/  1422874286/  1423053841/  1423658157/  1423658333/  1423658554/  1424165701/  1424166078/  1424166135/
    


Author: mblissett
Created: 2016-02-17 15:45:00.974
Updated: 2016-02-17 15:45:00.974
        
(Exactly a year since this issue was made.)

I think this was fixed by my changes in https://github.com/gbif/c-deploy/commit/774127703682259f2d00a27d16473341c5e08785, which tolerate failure of the curl call asking the service to stop. If I understand it correctly (and I haven't looked into it much), that call asks the service to unregister from ZooKeeper. So if the service can't do that (because it has crashed), we could detect that and remove it from ZK ourselves, or just wait for the registration to time out.
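
For illustration only, the "remove it from ZK ourselves" option could look roughly like the sketch below. It is not what the linked commit does. The function name, the `live_ids` argument, and the idea that the deploy tooling knows which instance IDs are still alive are all assumptions. The `client` is assumed to offer `get_children(path)` and `delete(path)`, as kazoo's `KazooClient` does.

```python
def remove_stale_instances(client, service_path, live_ids):
    """Delete registrations under service_path that are not in live_ids.

    client is assumed to expose get_children(path) and delete(path),
    as kazoo's KazooClient does. live_ids is a hypothetical set of
    instance IDs the caller knows to be running.
    Returns the sorted list of instance IDs that were removed.
    """
    removed = []
    for instance_id in client.get_children(service_path):
        if instance_id not in live_ids:
            # Stale registration left behind by a crashed or killed instance.
            client.delete(f"{service_path}/{instance_id}")
            removed.append(instance_id)
    return sorted(removed)
```

Called with `/uat/services/geocode-ws` and the one ID known to be live, this would leave a single child under that path. If the registrations are ephemeral nodes, waiting for the session to time out gives the same result without any cleanup code.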

The fix should mean it's no longer possible for the stop script to fail to kill the process.
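
As a rough sketch of that behaviour (not the actual c-deploy stop script), the stop request can be treated as best effort, with signals as the fallback. The stop URL, the function names, and the grace period are assumptions made for the example, and it assumes a POSIX host where `SIGKILL` exists.

```python
import http.client
import os
import signal
import time
import urllib.request


def request_graceful_stop(stop_url, timeout=5.0):
    """POST to the (hypothetical) stop URL; report failure instead of raising."""
    try:
        request = urllib.request.Request(stop_url, data=b"", method="POST")
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (OSError, http.client.HTTPException):
        # URLError and socket timeouts are OSError subclasses, so a crashed
        # or unreachable service lands here and the stop carries on.
        return False


def is_running(pid):
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def wait_for_exit(pid, seconds, poll=0.1):
    """Poll until the process is gone or the time is up."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        if not is_running(pid):
            return True
        time.sleep(poll)
    return not is_running(pid)


def stop_service(stop_url, pid, grace_seconds=10.0):
    """Stop the process: the HTTP call is best effort, the signals are the fallback."""
    if request_graceful_stop(stop_url) and wait_for_exit(pid, grace_seconds):
        return
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            return  # already gone
        if wait_for_exit(pid, grace_seconds):
            return
```

Because a failed stop request only changes which path is taken, a crashed service no longer aborts the deploy at this step. The process still receives `SIGTERM` and, if needed, `SIGKILL`.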