# Overview

Distributed systems such as Kubernetes are designed to be resilient to
failures. More details about Kubernetes High-Availability (HA) may be found at
[Building High-Availability Clusters](https://kubernetes.io/docs/admin/high-availability/).

To keep the picture simple, most parts of HA are skipped here and only the
Kubelet<->Controller Manager communication is described.

By default, the normal behavior looks like this:

1. Kubelet reports its status to the apiserver periodically, as specified by
   `--node-status-update-frequency`. The default value is **10s**.

2. Kubernetes controller manager checks the status of each Kubelet every
   `--node-monitor-period`. The default value is **5s**.

3. If the status was updated within `--node-monitor-grace-period`,
   Kubernetes controller manager considers the Kubelet healthy. The
   default value is **40s**.

> Kubernetes controller manager and Kubelet work asynchronously. This means that
> the delay may include network latency, API Server latency, etcd latency,
> latency caused by load on the control plane nodes, and so on. So if
> `--node-status-update-frequency` is set to 5s, in reality the update may appear
> in etcd after 6-7 seconds, or even longer when etcd cannot commit data to a
> quorum of nodes.

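The check described in steps 2 and 3 can be pictured with a small model. This is a minimal sketch, not the controller manager's actual code: `is_node_healthy` and its parameter names are hypothetical, times are plain seconds, and the latencies mentioned in the note are ignored.

```python
# Illustrative model only: the names below are hypothetical, not Kubernetes APIs.
NODE_MONITOR_GRACE_PERIOD = 40.0  # seconds, default of --node-monitor-grace-period


def is_node_healthy(last_status_update: float, now: float,
                    grace_period: float = NODE_MONITOR_GRACE_PERIOD) -> bool:
    """Return True if the last status update is no older than the grace period."""
    return (now - last_status_update) <= grace_period


# Kubelet last reported at t=100s; sample a few later check times.
for now in (105.0, 135.0, 140.0, 145.0):
    print(now, is_node_healthy(last_status_update=100.0, now=now))
```

In this model the node stays healthy as long as a status update lands at least once per grace period, which is why the grace period is normally several times larger than the update frequency.
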
## Failure

Kubelet will try to make `nodeStatusUpdateRetry` post attempts. Currently
`nodeStatusUpdateRetry` is a constant set to 5 in
[kubelet.go](https://github.com/kubernetes/kubernetes/blob/release-1.5/pkg/kubelet/kubelet.go#L102).

Kubelet will try to update the status in the
[tryUpdateNodeStatus](https://github.com/kubernetes/kubernetes/blob/release-1.5/pkg/kubelet/kubelet_node_status.go#L312)
function. Kubelet uses the Golang `http.Client()`, but with no timeout
specified. Thus there may be some glitches when the API Server is overloaded
while the TCP connection is being established.

So, in every `--node-status-update-frequency` interval there will be up to
`nodeStatusUpdateRetry` attempts to set the status of the node.

At the same time, Kubernetes controller manager will try to check
`nodeStatusUpdateRetry` times every `--node-monitor-period`. After
`--node-monitor-grace-period` it will consider the node unhealthy. Pods will
then be rescheduled based on the
[Taint Based Eviction](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions)
timers that you set on them individually, or the API Server's global timers:
`--default-not-ready-toleration-seconds` and
`--default-unreachable-toleration-seconds`.

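The timing described above adds up in a simple way: the grace period has to expire first, and only then does the toleration timer start. The sketch below is an approximation under that reading; `seconds_until_eviction` is a hypothetical helper, and it ignores the network, API Server, and etcd latencies mentioned earlier.

```python
def seconds_until_eviction(node_monitor_grace_period_s: int,
                           toleration_s: int) -> int:
    """Approximate delay between the last good status update and pod eviction.

    Assumes the node is marked unhealthy exactly when the grace period ends
    and that pods use the given toleration value (both are simplifications).
    """
    return node_monitor_grace_period_s + toleration_s


# Defaults quoted in this document: 40s grace period, 300s toleration.
print(seconds_until_eviction(40, 300))  # 340
```
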
Kube proxy has a watcher over the API. Once pods are evicted, Kube proxy will
notice and update the iptables rules of the node. It will remove the endpoints
from services, so pods from the failed node won't be accessible anymore.

## Recommendations for different cases

## Fast Update and Fast Reaction

Suppose `--node-status-update-frequency` is set to **4s** (10s is default),
`--node-monitor-period` to **2s** (5s is default),
`--node-monitor-grace-period` to **20s** (40s is default), and
`--default-not-ready-toleration-seconds` and
`--default-unreachable-toleration-seconds` are set to **30**
(300 seconds is default). Note that these two values must be integers
representing the number of seconds (no "s" or "m" suffix for seconds or minutes
is specified).

In such a scenario, pods will be evicted in **50s** because the node will be
considered down after **20s**, and `--default-not-ready-toleration-seconds` or
`--default-unreachable-toleration-seconds` expires **30s** later. However, this
scenario creates overhead on etcd, as every node will try to update its status
every 4 seconds.

If the environment has 1000 nodes, there will be 15000 node updates per
minute, which may require large etcd containers or even dedicated nodes for etcd.

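The load figure follows directly from the update frequency. As a rough sketch (the function name is hypothetical, and it assumes exactly one status write per node per interval, with no retries):

```python
def node_updates_per_minute(node_count: int, update_frequency_s: float) -> float:
    """Estimate status updates per minute reaching etcd for a whole cluster."""
    return node_count * 60 / update_frequency_s


print(node_updates_per_minute(1000, 4))   # 15000.0
print(node_updates_per_minute(1000, 10))  # 6000.0 with the default 10s frequency
```
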
> If we calculate the number of tries, the division (20s / 4s) gives 5, but in
> reality it will be from 3 to 5, with `nodeStatusUpdateRetry` attempts for each
> try. The total number of attempts will vary from 15 to 25 due to the latency
> of all components.

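The arithmetic in the note can be written out as follows. This is only a sketch of the calculation: the helper names are hypothetical, and the 3 to 5 range is taken from the note rather than derived.

```python
# Constant cited earlier from kubelet.go (release-1.5).
NODE_STATUS_UPDATE_RETRY = 5


def ideal_update_tries(grace_period_s: int, update_frequency_s: int) -> int:
    """Status-update cycles that fit into the grace period, ignoring latency."""
    return grace_period_s // update_frequency_s


def total_attempts(tries: int, retries: int = NODE_STATUS_UPDATE_RETRY) -> int:
    """Each update cycle may make up to `retries` post attempts."""
    return tries * retries


ideal = ideal_update_tries(grace_period_s=20, update_frequency_s=4)
print(ideal, total_attempts(ideal))            # 5 25
print([total_attempts(t) for t in (3, 4, 5)])  # [15, 20, 25]
```
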
## Medium Update and Average Reaction

Let's set `--node-status-update-frequency` to **20s**,
`--node-monitor-grace-period` to **2m**, and
`--default-not-ready-toleration-seconds` and
`--default-unreachable-toleration-seconds` to **60**.
In that case, Kubelet will try to update its status every 20s. So, there will
be 6 * 5 = 30 attempts before Kubernetes controller manager considers the node
unhealthy. After 1m more it will evict all pods. The total time before the
eviction process starts will be 3m.

Such a scenario is good for medium environments, as 1000 nodes will require
3000 etcd updates per minute.

> In reality, there will be from 4 to 6 node update tries. The total number
> of attempts will vary from 20 to 30.

## Low Update and Slow Reaction

Let's set `--node-status-update-frequency` to **1m**,
`--node-monitor-grace-period` to **5m**, and
`--default-not-ready-toleration-seconds` and
`--default-unreachable-toleration-seconds` to **60**. In this scenario, every
kubelet will try to update the status every minute. There will be 5 * 5 = 25
attempts before the unhealthy status. After 5m, Kubernetes controller manager
will set the unhealthy status. This means that pods will be evicted 1m after
the node is marked unhealthy (6m in total).

> In reality, there will be from 3 to 5 tries. The total number of attempts
> will vary from 15 to 25.

There can be different combinations, such as Fast Update with Slow Reaction, to
satisfy specific cases.
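
To compare combinations side by side, the three scenarios above can be put into one small table. This is a sketch under the same simplifications used throughout (no latency, one status write per interval). The `Scenario` class is hypothetical, and the values in the last row are an assumed example of a mixed setup, not a recommendation from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    update_frequency_s: int  # --node-status-update-frequency
    grace_period_s: int      # --node-monitor-grace-period
    toleration_s: int        # --default-*-toleration-seconds

    @property
    def eviction_delay_s(self) -> int:
        """Grace period plus toleration, ignoring all latencies."""
        return self.grace_period_s + self.toleration_s

    def updates_per_minute(self, node_count: int) -> int:
        """Status updates per minute for `node_count` nodes."""
        return node_count * 60 // self.update_frequency_s


SCENARIOS = [
    Scenario("fast update, fast reaction", 4, 20, 30),
    Scenario("medium update, average reaction", 20, 120, 60),
    Scenario("low update, slow reaction", 60, 300, 60),
    # Assumed values, for illustration only.
    Scenario("fast update, slow reaction", 4, 300, 60),
]

for s in SCENARIOS:
    print(f"{s.name:<32} evict after {s.eviction_delay_s:>3}s, "
          f"{s.updates_per_minute(1000):>5} updates/min per 1000 nodes")
```
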