X-Git-Url: https://gerrit.o-ran-sc.org/r/gitweb?a=blobdiff_plain;f=XTesting%2Fkubespray%2Fdocs%2Flarge-deployments.md;fp=XTesting%2Fkubespray%2Fdocs%2Flarge-deployments.md;h=7acbff3d15eac627d01fe73d404e55756558eba1;hb=31af17bb5935b722dcf59d5800aaff9e789cfa93;hp=0000000000000000000000000000000000000000;hpb=c8bda4f07b7e87beb2aa3d8729f9b0b456d4da6f;p=it%2Ftest.git diff --git a/XTesting/kubespray/docs/large-deployments.md b/XTesting/kubespray/docs/large-deployments.md new file mode 100644 index 0000000..7acbff3 --- /dev/null +++ b/XTesting/kubespray/docs/large-deployments.md @@ -0,0 +1,52 @@ +Large deployments of K8s +======================== + +For a large scaled deployments, consider the following configuration changes: + +* Tune [ansible settings](https://docs.ansible.com/ansible/latest/intro_configuration.html) + for `forks` and `timeout` vars to fit large numbers of nodes being deployed. + +* Override containers' `foo_image_repo` vars to point to intranet registry. + +* Override the ``download_run_once: true`` and/or ``download_localhost: true``. + See download modes for details. + +* Adjust the `retry_stagger` global var as appropriate. It should provide sane + load on a delegate (the first K8s control plane node) then retrying failed + push or download operations. + +* Tune parameters for DNS related applications + Those are ``dns_replicas``, ``dns_cpu_limit``, + ``dns_cpu_requests``, ``dns_memory_limit``, ``dns_memory_requests``. + Please note that limits must always be greater than or equal to requests. + +* Tune CPU/memory limits and requests. Those are located in roles' defaults + and named like ``foo_memory_limit``, ``foo_memory_requests`` and + ``foo_cpu_limit``, ``foo_cpu_requests``. Note that 'Mi' memory units for K8s + will be submitted as 'M', if applied for ``docker run``, and cpu K8s units + will end up with the 'm' skipped for docker as well. This is required as + docker does not understand k8s units well. + +* Tune ``kubelet_status_update_frequency`` to increase reliability of kubelet. + ``kube_controller_node_monitor_grace_period``, + ``kube_controller_node_monitor_period``, + ``kube_apiserver_pod_eviction_not_ready_timeout_seconds`` & + ``kube_apiserver_pod_eviction_unreachable_timeout_seconds`` for better Kubernetes reliability. + Check out [Kubernetes Reliability](/docs/kubernetes-reliability.md) + +* Tune network prefix sizes. Those are ``kube_network_node_prefix``, + ``kube_service_addresses`` and ``kube_pods_subnet``. + +* Add calico_rr nodes if you are deploying with Calico or Canal. Nodes recover + from host/network interruption much quicker with calico_rr. + +* Check out the + [Inventory](/docs/getting-started.md#building-your-own-inventory) + section of the Getting started guide for tips on creating a large scale + Ansible inventory. + +* Override the ``etcd_events_cluster_setup: true`` store events in a separate + dedicated etcd instance. + +For example, when deploying 200 nodes, you may want to run ansible with +``--forks=50``, ``--timeout=600`` and define the ``retry_stagger: 60``.