Part II: Kubernetes - High Availability
Create a Master Node and Two Worker Nodes
Before You Begin
Please review this document thoroughly before executing any of the instructions. Many errors and troubleshooting details are provided that may save you time and excessive effort.
High Availability (HA)
One way Kubernetes offers high availability for workloads and applications is by running multiple master (control plane) nodes alongside multiple worker nodes so there is no single point of failure. Placing a load balancer in front of the master nodes spreads traffic across all of the control plane nodes, and if one master node fails, traffic is directed to another.
Another way to achieve high availability is to place one or more control plane components, for instance the cluster database (etcd), on multiple servers and establish intercommunication among all of the etcd processes so the cluster data is replicated between them. If one etcd process fails, no cluster configuration is lost.
Both of these methods can be used together to provide increased reliability.
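As a hedged sketch of the first pattern (not part of the build below), kubeadm expresses the load balancer front end through its --control-plane-endpoint flag; the address k8s-lb.example.com:6443 here is only a placeholder for your load balancer.
sudo kubeadm init --control-plane-endpoint "k8s-lb.example.com:6443" --upload-certs
Additional control plane nodes would then join using the kubeadm join command printed by that init, together with its --control-plane flag.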
Two Workers and a Master Node
To get hands-on experience with high availability, these instructions detail how to set up a master node with two worker nodes using Ubuntu servers. This document does not provide steps for creating those servers; either local virtualization software or cloud provider compute instances will work.
A minimum of two vCPUs and two GB of RAM per node is required, but plan to use more resources for better performance. The cluster will not start if kubeadm detects less CPU and memory than that.
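To confirm a server meets these minimums before proceeding (an optional check, not part of the original steps), these standard Linux commands report CPU count and memory:
nproc
free -h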
Configure the Master Node
Log into the server that will serve as the control plane with an account that has sudo access.
Follow these installation procedures.
Install Docker
sudo apt update -y && sudo apt upgrade -y
sudo apt-get install -y docker.io
sudo apt-get install -y apt-transport-https curl
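As an optional sanity check (not in the original steps), confirm the Docker service is enabled and the client is installed:
sudo systemctl enable --now docker
docker --version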
Install Kubernetes Components
sudo curl https://packages.cloud.google.com/apt/doc/apt-key.gpg --output /etc/apt/trusted.gpg.d/k8s-apt-key.gpg # replaces the deprecated apt-key step
sudo apt-add-repository "deb http://apt.kubernetes.io/ kubernetes-xenial main" -y
sudo apt update
sudo apt install kubeadm kubelet kubectl -y
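You may also want to pin these packages so an unattended upgrade does not replace them with an incompatible version; this mirrors the hold step used later on the worker nodes:
sudo apt-mark hold kubelet kubeadm kubectl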
Errors Received While Installing the Kubernetes Components
I received these errors every time I attempted to install the kube* tools.
Repository: 'deb http://apt.kubernetes.io/ kubernetes-xenial main'
Description:
Archive for codename: kubernetes-xenial components: main
E: Conflicting values set for option Trusted regarding source http://apt.kubernetes.io/ kubernetes-xenial
E: The list of sources could not be read.
E: Conflicting values set for option Trusted regarding source http://apt.kubernetes.io/ kubernetes-xenial
E: The list of sources could not be read.
E: Conflicting values set for option Trusted regarding source http://apt.kubernetes.io/ kubernetes-xenial
E: The list of sources could not be read.
Failed to install kubeadm kubelet kubectl. Please retry
To resolve this issue, remove the contents of these two files, and reinstall.
/etc/apt/sources.list.d/kubernetes.list
/etc/apt/sources.list.d/archive_uri-http_apt_kubernetes_io_-jammy.list
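One way to empty both files before reinstalling (assuming these exact paths exist on your system):
sudo truncate -s 0 /etc/apt/sources.list.d/kubernetes.list
sudo truncate -s 0 /etc/apt/sources.list.d/archive_uri-http_apt_kubernetes_io_-jammy.list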
Disable and Turn Off Swap
sudo swapoff -a && sudo sed -i '/swap/d' /etc/fstab
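To confirm swap is now disabled (an optional check), the following command should print nothing:
swapon --show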
Initialize the Cluster
Documentation on how to create a cluster with kubeadm can be found at this link.
Initialize a cluster using the following command.
sudo kubeadm init --apiserver-advertise-address=<master node IP address> --pod-network-cidr=<pod network CIDR>
where --apiserver-advertise-address is the IP address of the master node (use the server's default network IP address) and --pod-network-cidr is the address range assigned to pods.
The “--pod-network-cidr” and “--apiserver-advertise-address” flags are both optional; a concrete example with both flags filled in follows.
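As a sketch only, using the master address that appears later in this document's join example and assuming Calico's commonly used default pod range of 192.168.0.0/16 (substitute your own values):
sudo kubeadm init --apiserver-advertise-address=172.31.8.229 --pod-network-cidr=192.168.0.0/16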
Make a note of the kubeadm join command that is listed at the end of the command’s output. It will be needed to join worker nodes to the cluster.
Execute these three commands to complete the installation.
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
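To confirm kubectl can now reach the cluster (an optional check), run:
kubectl cluster-info
kubectl get nodes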
Troubleshooting Note
If these errors are generated,
[ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
[ERROR Port-10250]: Port 10250 is in use
[ERROR Port-2379]: Port 2379 is in use
[ERROR Port-2380]: Port 2380 is in use
reset the cluster.
sudo kubeadm reset
Run the kubeadm init command again.
If you are using the root account, run this command.
export KUBECONFIG=/etc/kubernetes/admin.conf
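If you want that setting to survive logouts (an optional convenience, not part of the original steps), append it to root's shell profile:
echo 'export KUBECONFIG=/etc/kubernetes/admin.conf' >> /root/.bashrc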
A High Level Look at the Kubernetes Network Model
Detailed information about Services, Load Balancing, and Networking can be found at this link.
The documentation found at that link notes:
* pods can communicate with all other pods on any other node without NAT
* agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
It also states:
Kubernetes networking addresses four concerns:
* Containers within a Pod use networking to communicate via loopback.
* Cluster networking provides communication between different Pods.
* The Service resource lets you expose an application running in Pods to be reachable from outside your cluster.
* You can also use Services to publish services only for consumption inside your cluster.
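As a minimal, hedged illustration of the Service concept using throwaway names (the nginx deployment here is hypothetical and not part of this cluster build), the following exposes a deployment on a NodePort:
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=NodePort
kubectl get service nginx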
This document focuses on establishing network connectivity for pods running on different worker nodes by using a Container Network Interface (CNI) plugin, and more specifically, by using Calico.
Calico is “an open source networking and network security solution for containers, virtual machines, and native host-based workloads”.
For full instructions on how to install Calico on public cloud, managed public cloud, or on-premises, use this link.
Install the Calico network plugin.
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
List the pods in all namespaces within the cluster, and verify they all have a “Running” status.
kubectl get pods --all-namespaces
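Once the Calico pods are healthy, the control plane node should also report a “Ready” status shortly afterward (an optional check):
kubectl get nodes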
Configuring the Worker Nodes
Log into the first worker node.
Install Docker
sudo apt update -y && sudo apt upgrade -y
sudo apt-get install -y docker.io
sudo apt-get install -y apt-transport-https curl
Install Kubernetes Components
sudo curl https://packages.cloud.google.com/apt/doc/apt-key.gpg --output /etc/apt/trusted.gpg.d/k8s-apt-key.gpg # replaces the deprecated apt-key step
sudo apt-add-repository "deb http://apt.kubernetes.io/ kubernetes-xenial main" -y
sudo apt update
sudo apt install kubeadm kubelet kubectl -y
sudo apt-mark hold kubelet kubeadm kubectl
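The worker nodes also need swap disabled, just as on the master, or the kubelet may refuse to start; the original steps only show this on the master, so repeat the earlier command here if swap is enabled:
sudo swapoff -a && sudo sed -i '/swap/d' /etc/fstab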
A Quick Note
If you are using cloud compute instances, take a snapshot of these servers in case a need arises to revert to the original configuration.
Join the Worker Node to the Cluster
This step requires the kubeadm join command noted earlier in the cluster initialization section.
Find the kubeadm join command, and execute it.
sudo kubeadm join 172.31.8.229:6443 --token 84k88a.0vtnmokcliaeoobv \
--discovery-token-ca-cert-hash sha256:a1f63387a5992688f729958d2571415dc7a38a5cbd1ae75b363e17253ee913be
If the original token provided at the end of the cluster installation is not available, create a new token using this command.
sudo kubeadm token create --print-join-command
Repeat the above steps on each worker node.
On the master node, verify the worker nodes have joined the cluster.
kubectl get nodes
Time for a Reality Check
This article has been delayed for several days.
Why?
My cluster would start, be responsive, and then stop running. After creating multiple new environments and spending hours researching and troubleshooting, there was still no progress.
The current /var/log/syslog file was full of errors.
Here is a small sample of the errors.
Jun 16 15:31:25 localhost kubelet[453]: E0616 15:31:25.400055 453 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-flannel\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=kube-flannel pod=kube-flannel-ds-lg5q5_kube-system(d1365521-0b9c-43bb-b6b1-5f7324ba5466)\"" pod="kube-system/kube-flannel-ds-lg5q5" podUID=d1365521-0b9c-43bb-b6b1-5f7324ba5466
Jun 16 15:31:26 localhost containerd[492]: time="2022-06-16T15:31:26.399268850Z" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:coredns-6d4b75cb6d-vsv27,Uid:66fee7e4-47ac-4e6e-b461-2058192f5ea3,Namespace:kube-system,Attempt:0,}"
Jun 16 15:31:26 localhost containerd[492]: time="2022-06-16T15:31:26.448132869Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:coredns-6d4b75cb6d-vsv27,Uid:66fee7e4-47ac-4e6e-b461-2058192f5ea3,Namespace:kube-system,Attempt:0,} failed, error" error="failed to setup network for sandbox \"09dac84b2502cb73358e9822ae55b71cb45e2417696b12f70ad8ba47b547f539\": open /run/flannel/subnet.env: no such file or directory"
Jun 16 15:31:26 localhost kubelet[453]: E0616 15:31:26.449344 453 remote_runtime.go:212] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"09dac84b2502cb73358e9822ae55b71cb45e2417696b12f70ad8ba47b547f539\": open /run/flannel/subnet.env: no such file or directory"
Jun 16 15:31:26 localhost kubelet[453]: E0616 15:31:26.449417 453 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"09dac84b2502cb73358e9822ae55b71cb45e2417696b12f70ad8ba47b547f539\": open /run/flannel/subnet.env: no such file or directory" pod="kube-system/coredns-6d4b75cb6d-vsv27"
Jun 16 15:31:26 localhost kubelet[453]: E0616 15:31:26.449449 453 kuberuntime_manager.go:815] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"09dac84b2502cb73358e9822ae55b71cb45e2417696b12f70ad8ba47b547f539\": open /run/flannel/subnet.env: no such file or directory" pod="kube-system/coredns-6d4b75cb6d-vsv27"
Jun 16 15:31:26 localhost kubelet[453]: E0616 15:31:26.449520 453 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"coredns-6d4b75cb6d-vsv27_kube-system(66fee7e4-47ac-4e6e-b461-2058192f5ea3)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"coredns-6d4b75cb6d-vsv27_kube-system(66fee7e4-47ac-4e6e-b461-2058192f5ea3)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"09dac84b2502cb73358e9822ae55b71cb45e2417696b12f70ad8ba47b547f539\\\": open /run/flannel/subnet.env: no such file or directory\"" pod="kube-system/coredns-6d4b75cb6d-vsv27" podUID=66fee7e4-47ac-4e6e-b461-2058192f5ea3
The Kubernetes component log files are found in /var/log/containers.
These are the files to explore to determine what the problem is, and there is plenty of information online to assist in finding a resolution; a sample search command follows the list.
etcd-kubernetes-master_kube-system_etcd-*
kube-apiserver-kubernetes-master_kube-system_kube-apiserver-*
kube-controller-manager-kubernetes-master_kube-system_kube-controller-manager-*
kube-proxy-r47t4_kube-system_kube-proxy-*
kube-scheduler-kubernetes-master_kube-system_kube-scheduler-*
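One hedged way to scan them for failures (the wildcards expand to your pod-specific file names, so adjust the paths as needed):
sudo grep -i error /var/log/containers/kube-apiserver-*
sudo journalctl -u kubelet --no-pager | tail -50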
Examining these logs gave insight but no resolution was found.
SUMMARY — REALITY CHECK!!!
Several days have passed, and this configuration is still not error-free.
A new direction is needed.
In an effort to save time, the following series of photos acts as a summary of this author’s thoughts and feelings regarding how incredibly valuable and rewarding troubleshooting this cluster has been.
!! News Flash !!
A fully functional Kubernetes cluster was successfully created. As of this moment, it has been running for the past three hours without issue.
How was it done?
Find out in the next write up which is tentatively titled,
Redemption: Return of the Orchestrator