How's that for putting the search terms right there in the post title?
I wrote a little ZooKeeper program, very simple, basically just the lock recipe out of the ZooKeeper examples. You know, the one where you create a sequenced child node inside the parent container node, and your child node represents your position in the line of tasks waiting to acquire the lock.
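For readers who haven't seen the recipe, here is a minimal sketch of the "position in line" part. It uses plain Java with no ZooKeeper client. The class name, the method names, and the "lock-" prefix are made up for illustration, and it assumes the usual 10-digit, zero-padded sequence suffix on child names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LockQueue {
    // Assumption: sequential child names end in a 10-digit, zero-padded counter.
    static int sequenceOf(String childName) {
        return Integer.parseInt(childName.substring(childName.length() - 10));
    }

    // Returns null if myNode is first in line (it holds the lock); otherwise
    // returns the child immediately ahead of it, which is the one to watch.
    static String nodeToWatch(List<String> children, String myNode) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingInt(LockQueue::sequenceOf));
        int myIndex = sorted.indexOf(myNode);
        if (myIndex < 0) {
            throw new IllegalArgumentException(myNode + " is not among the children");
        }
        return myIndex == 0 ? null : sorted.get(myIndex - 1);
    }

    public static void main(String[] args) {
        // Children come back from the server in no particular order.
        List<String> children =
                List.of("lock-0000000012", "lock-0000000010", "lock-0000000011");
        System.out.println(nodeToWatch(children, "lock-0000000010")); // prints "null"
        System.out.println(nodeToWatch(children, "lock-0000000012")); // prints "lock-0000000011"
    }
}
```

Watching only the node directly ahead, rather than the whole container, means that one lock release wakes one waiter instead of all of them.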
My program seemed to be working quite well, and I was pleased. But once in a long while it would fail with a "no node" error when trying to create a sequenced child node.
Impossible! I said, for I knew that I had successfully created such a node just a few moments earlier, and I definitely hadn't run any code of my own that deleted the lock container node. So why wasn't it there?
After a while, I found a line in my ZooKeeper leader's log:
[ContainerManagerTask:o.a.z.s.ContainerManager@135] - Attempting to delete candidate container: /path/to/lock/node
That led me to the ContainerManager documentation, which in turn reminded me to check this note in the ZooKeeper documentation, which described my mistake precisely:
Container Nodes
Added in 3.5.3
ZooKeeper has the notion of container znodes. Container znodes are special purpose znodes useful for recipes such as leader, lock, etc. When the last child of a container is deleted, the container becomes a candidate to be deleted by the server at some point in the future.
Given this property, you should be prepared to get KeeperException.NoNodeException when creating children inside of container znodes. i.e. when creating child znodes inside of container znodes always check for KeeperException.NoNodeException and recreate the container znode when it occurs.
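The advice in that note boils down to a small retry loop. What follows is only a sketch of the idea, not my actual program. To keep it runnable without a server, the `Znodes` interface, the two exception classes, and the in-memory implementation are all hypothetical stand-ins. In a real client I'd expect them to map onto `ZooKeeper.create` with `CreateMode.CONTAINER` and `CreateMode.EPHEMERAL_SEQUENTIAL`, and onto `KeeperException.NoNodeException` and `KeeperException.NodeExistsException`. The path `/locks/demo` is made up too.

```java
import java.util.HashMap;
import java.util.Map;

class ContainerRetry {
    // Hypothetical stand-ins for the KeeperException subclasses of the same names.
    static class NoNodeException extends Exception {}
    static class NodeExistsException extends Exception {}

    // Hypothetical minimal view of the two client calls the recipe needs.
    interface Znodes {
        void createContainer(String path) throws NodeExistsException;
        String createSequentialChild(String parent, String prefix) throws NoNodeException;
    }

    // Create a sequenced child; if the container has vanished, recreate it and retry.
    static String createChildWithRetry(Znodes zk, String container, String prefix,
                                       int maxAttempts) throws NoNodeException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        NoNodeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return zk.createSequentialChild(container, prefix);
            } catch (NoNodeException e) {
                last = e; // the container was deleted out from under us
                try {
                    zk.createContainer(container);
                } catch (NodeExistsException raced) {
                    // Another client recreated it first, which is just as good.
                }
            }
        }
        throw last;
    }

    // In-memory fake so the sketch runs on its own; not a model of the real server.
    static class InMemoryZnodes implements Znodes {
        private final Map<String, Integer> nextSequence = new HashMap<>();

        public void createContainer(String path) throws NodeExistsException {
            if (nextSequence.containsKey(path)) throw new NodeExistsException();
            nextSequence.put(path, 0);
        }

        public String createSequentialChild(String parent, String prefix)
                throws NoNodeException {
            Integer next = nextSequence.get(parent);
            if (next == null) throw new NoNodeException();
            nextSequence.put(parent, next + 1);
            return parent + "/" + prefix + String.format("%010d", next);
        }

        // Simulates the server deleting an empty container.
        void deleteContainer(String path) {
            nextSequence.remove(path);
        }
    }

    public static void main(String[] args) throws Exception {
        InMemoryZnodes zk = new InMemoryZnodes();
        // The container does not exist yet, as if it had just been deleted.
        System.out.println(createChildWithRetry(zk, "/locks/demo", "lock-", 3));
        // prints "/locks/demo/lock-0000000000"
    }
}
```

The retry is bounded on purpose: if the container keeps disappearing, something else is wrong, and it's better to surface the error than to spin forever.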
On my system, ContainerManager seems to run this check about once a minute. So every so often, after enough minutes and enough use of my program, it would find the lock container empty (nobody holding or waiting for the lock at that moment) and delete it just as I was about to create a new sequenced child node in it.
Voila!