Lossless restart and deploy

Lossless restart

When the Agent receives a SIGTERM signal and prepares to exit, it saves its task execution state to $CWD/data/state (usually /usr/local/holoinsight/agent/data/state).

When the Agent starts, it tries to load $CWD/data/state to restore task state. The state file is valid for only 2 minutes after it is created.
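The save-and-restore cycle above can be sketched in a few lines of Python. This is only an illustration, not the Agent's actual implementation: the JSON layout, the embedded created_at timestamp, and the function names save_state, load_state and install_sigterm_handler are all assumptions made for the example. The real state file format, and how the Agent measures the 2-minute window, are not described here.

```python
import json
import os
import signal
import time

# Mirrors the 2-minute validity window described above.
STATE_TTL_SECONDS = 120


def save_state(path, tasks):
    """Write task state to `path`, stamping it with its creation time.

    The JSON layout is hypothetical and used only for illustration.
    """
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    payload = {"created_at": time.time(), "tasks": tasks}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(payload, f)
    # Rename into place so a reader never sees a half-written file.
    os.replace(tmp_path, path)


def load_state(path, now=None):
    """Return the saved tasks, or None if the file is missing, corrupt or expired."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            payload = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if now - payload["created_at"] > STATE_TTL_SECONDS:
        return None
    return payload["tasks"]


def install_sigterm_handler(path, get_tasks):
    """On SIGTERM, save the current task state and then exit."""
    def handler(signum, frame):
        save_state(path, get_tasks())
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, handler)
```

In this sketch, a process would call install_sigterm_handler("data/state", get_tasks) early in startup and load_state("data/state") right after, falling back to a cold start when it returns None.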

Lossless deploy/upgrade

Set maxSurge > 0 in the k8s YAML:

  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

After the Agent starts, it listens on $CWD/data/transfer.sock.

When holoinsight-agent is upgraded through k8s with maxSurge enabled, k8s first creates a new pod and deletes the old pod only after the new pod is ready.

When the new pod starts, it tries to connect to the old pod through $CWD/data/transfer.sock to transfer state. After the state transfer completes, the new pod starts working normally. The old pod stays alive but no longer does any work, and waits for k8s to reclaim its resources.
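The handoff described above can be illustrated with a Unix domain socket in Python. This is a minimal sketch under stated assumptions, not the Agent's real protocol: the wire format (a single JSON document followed by connection close) and the names serve_state_once and fetch_state are hypothetical. It also assumes both processes can reach the same socket path, for example through a shared data directory.

```python
import json
import os
import socket


def serve_state_once(sock_path, get_state, ready=None):
    """Old agent side: listen on the socket and hand state to the first peer.

    `ready` is an optional threading.Event that is set once the socket is
    listening. It exists only so callers can synchronise in this sketch.
    """
    # Remove a stale socket file left by a previous process before binding.
    if os.path.exists(sock_path):
        os.unlink(sock_path)
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(sock_path)
        server.listen(1)
        if ready is not None:
            ready.set()
        conn, _ = server.accept()
        with conn:
            # Hypothetical wire format: one JSON document, then close.
            conn.sendall(json.dumps(get_state()).encode("utf-8"))


def fetch_state(sock_path):
    """New agent side: pull state from the old agent.

    Returns None when nobody is listening, in which case the caller
    would start without transferred state.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        try:
            client.connect(sock_path)
        except (FileNotFoundError, ConnectionRefusedError):
            return None
        chunks = []
        while True:
            chunk = client.recv(4096)
            if not chunk:  # The sender closed the connection: transfer done.
                break
            chunks.append(chunk)
    return json.loads(b"".join(chunks).decode("utf-8"))
```

After fetch_state returns, the new process in this sketch would begin listening on the same path itself so that a later upgrade can repeat the handoff.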