Course recordings on DaDesktop for Training platform
Visit NobleProg websites for related courses
Visit outline: Kubernetes Comprehensive (Course code: kubernetescompr)
Categories: Kubernetes
Summary
Overview
This course lesson covers advanced strategies for scaling and upgrading Kubernetes clusters, with a focus on practical implementation using MiniKube and Infrastructure as Code (IaC). It explores node scaling (up/down), control plane high availability (HA) configurations, node draining and cordoning procedures, and Kubernetes version upgrades. The session then transitions into Helm chart management, including chart structure, values customization, upstream repository integration, and troubleshooting common Helm deployment issues. The instructor emphasizes real-world best practices, such as using GitOps for stateless workload migration, maintaining odd-numbered control planes, and securing clusters with node-level firewalls.
Topic (Timeline)
1. Kubernetes Cluster Scaling and Upgrading Concepts [00:00:00 - 00:10:32]
- Introduces two primary methods for scaling/upgrading Kubernetes: direct node management (via kubeadm/minikube) and Infrastructure as Code (IaC) using Terraform or similar tools.
- Explains that IaC is preferred for production environments, especially when clusters are provisioned declaratively.
- Describes the Cluster Mesh concept using Cilium to connect two clusters (old and new Kubernetes versions) and drain workloads from the old to the new cluster without downtime.
- Notes that Cluster Mesh has matured significantly, reducing cost and complexity from early beta implementations.
- Recommends a simpler, safer approach: build a new cluster with the latest version, use GitOps to replicate stateless workloads, then switch DNS from old to new ingress.
2. Node Scaling: Downward and Upward Procedures [00:10:32 - 00:19:05]
- Scaling Down: Three-step process: 1) Cordon node (prevents new pod scheduling), 2) Drain node (migrates existing pods to other nodes), 3) Delete node (removes from cluster).
- Scaling Up: Three-step process: 1) Create new node (VM with OS and kubelet), 2) Join node to cluster (using certificate and token), 3) Uncordon node (allows scheduler to assign pods).
- Emphasizes that control plane nodes must be deployed in odd numbers (1, 3, 5) to maintain quorum for leader election.
- Explains that 3-node control planes can tolerate 1 node failure; 5-node control planes can tolerate 2 failures before losing scheduling capability.
- Notes that most clusters are under 20 nodes; 5-node control planes are recommended only for clusters exceeding 100 nodes due to resource overhead.
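The quorum arithmetic behind the odd-number rule is simple: a cluster of n control plane members tolerates floor((n − 1) / 2) failures, which is why moving from 3 to 4 control planes adds cost without adding fault tolerance. A quick sanity check:

```shell
# Tolerated control-plane failures = floor((n - 1) / 2)
for n in 1 3 5; do
  echo "control planes: $n -> tolerated failures: $(( (n - 1) / 2 ))"
done
```

Note that an even count of 4 still only tolerates 1 failure, the same as 3 — the extra node just adds resource overhead.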
3. Node Upgrading and MiniKube HA Demonstration [00:19:05 - 00:25:37]
- Upgrading a node: Cordon → Drain → Upgrade Kubernetes version (avoid .0 or .1 releases) → Restart node → Uncordon.
- Warns that roughly 10% of in-place upgrades may fail due to node unresponsiveness; not recommended for production without prior testing.
- Demonstrates scaling the MiniKube HA control plane: deleting and adding control plane nodes with `minikube node delete` and `minikube node add`.
- Shows that MiniKube HA with Cilium and KubeVip is not fully HA-capable; removing the primary control plane node (m01) causes cluster instability due to misconfigured HA components.
- Concludes that MiniKube HA is a learning tool, not a production substitute; real HA requires proper KubeVip, etcd, and CNI configuration.
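The node lifecycle above can be sketched as the following command sequence (node names are illustrative, and everything here requires a live cluster to run against):

```shell
# Cordon -> drain -> upgrade -> uncordon, as described above.
kubectl cordon worker-1
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
# ...upgrade kubelet/kubeadm on the node, then restart it...
kubectl uncordon worker-1

# MiniKube equivalents for removing and adding a control plane node:
minikube node delete m02
minikube node add --control-plane
```

`--ignore-daemonsets` is usually required because DaemonSet pods cannot be evicted; `--delete-emptydir-data` acknowledges that emptyDir contents on the node will be lost.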
4. Kubernetes Troubleshooting Fundamentals [00:25:37 - 00:32:38]
- Highlights key troubleshooting tools: `kubectl get events`, `kubectl logs`, and `kubectl describe`.
- Notes that events are time-limited (retained for about 1 hour) and that logs are deleted when containers are restarted or removed.
- Identifies hidden resource constraints: PID limits and inode exhaustion (especially in databases with many small files).
- Explains taints/tolerations, node pressure (disk, memory, CPU), and pod readiness as critical failure points.
- Emphasizes that services, ingresses, and headless services require valid endpoints to function.
- Stresses the importance of labels and namespaces for resource discovery and filtering.
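A typical triage loop using these tools looks like the following (namespace, pod name, and label are illustrative; requires a running cluster):

```shell
# Events first: newest last, so failures surface at the bottom.
kubectl get events -n demo --sort-by=.lastTimestamp
# Describe shows taints/tolerations mismatches, pressure conditions, and probe failures.
kubectl describe pod my-app-0 -n demo
# --previous fetches logs from the last restarted container before they are lost.
kubectl logs my-app-0 -n demo --previous
# Labels let you find all resources belonging to one application.
kubectl get pods -n demo -l app=my-app
```

Because events expire after about an hour and logs vanish on container restart, capture both early when debugging an intermittent failure.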
5. Helm Chart Structure and Templating [00:33:39 - 00:49:01]
- Introduces Helm as the standard for production Kubernetes deployments via templated charts.
- Breaks down the Helm chart structure: `Chart.yaml` (chart and app versions), `values.yaml` (user-configurable variables), `templates/` (Kubernetes manifests with Go templates), and `charts/` (dependency charts).
- Explains that upstream chart maintainers ("chart captains") control the templates; users should only modify `values.yaml`.
- Notes common upstream chart issues: missing nodeSelector/tolerations, outdated image versions, undocumented ports, and lack of HA support.
- Recommends installing UFW firewalls on nodes to detect unexpected container port usage.
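When an upstream chart does expose scheduling controls, the fix belongs in `values.yaml` rather than the templates. A minimal sketch, assuming the chart's templates actually consume these keys (the key names below are illustrative — check the chart's own `values.yaml` for the real ones):

```yaml
# Illustrative values.yaml overrides for pinning workload placement.
nodeSelector:
  kubernetes.io/arch: amd64
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```

If the chart's templates do not reference these keys at all, setting them has no effect — one of the common upstream gaps noted above.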
6. Helm Deployment, Customization, and Troubleshooting [00:49:01 - 01:07:37]
- Demonstrates creating a Helm chart with `helm create test-app`, examining its default `values.yaml`, and deploying it with `helm install`.
- Shows how to enable ingress by modifying `values.yaml` and redeploying.
- Highlights that default Helm charts often contain bugs or outdated configurations.
- Adds the Cilium Helm repository with `helm repo add cilium https://helm.cilium.io`.
- Uses `helm repo update` to fetch the latest chart versions and `helm show values cilium/cilium` to generate a custom `values.yaml`.
- Installs Cilium with a custom values file, pinning version 1.17.5 and using the `cilium-system` namespace.
- Demonstrates troubleshooting: Cilium pods fail to start due to port conflicts on a single-node MiniKube cluster.
- Recommends adding a worker node or reducing the Cilium operator replica count in `values.yaml` to resolve the conflict.
- Notes that Cilium requires KubeVip for HA, and that all Cilium components (operator, Hubble, etc.) must be configured for HA and scheduled on control plane nodes.
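The Cilium install flow from this section can be sketched end to end (requires network access and a running cluster; the `operator.replicas` edit reflects the single-node fix discussed above):

```shell
helm repo add cilium https://helm.cilium.io
helm repo update
# Dump the chart's defaults as a starting point for customization.
helm show values cilium/cilium --version 1.17.5 > values.yaml
# ...edit values.yaml, e.g. set operator.replicas: 1 on a single-node cluster...
helm install cilium cilium/cilium \
  --version 1.17.5 \
  --namespace cilium-system --create-namespace \
  -f values.yaml
```

Pinning `--version` in both the `show values` and `install` steps keeps the dumped defaults in sync with the chart actually deployed.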
7. Helm Management and Final Review [01:07:37 - 01:11:42]
- Shows how to upgrade a Helm release using `helm upgrade` with a modified `values.yaml`.
- Demonstrates listing releases with `helm list` and uninstalling with `helm uninstall`.
- Concludes with a summary of key learnings: Helm chart anatomy, customizing values, managing repositories, installing/upgrading charts, and using firewalls to detect port issues.
- Notes that the session intentionally skipped Longhorn and other storage topics due to time constraints.
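The release-management loop covered here reduces to three commands (release and chart names follow the `test-app` example from the previous section):

```shell
# Apply a values change to an existing release.
helm upgrade test-app ./test-app -f values.yaml
# List releases; -A spans all namespaces.
helm list -A
# Remove the release and its Kubernetes resources.
helm uninstall test-app
```

`helm upgrade` records a new revision, so a bad values change can be reverted with `helm rollback`.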
Appendix
Key Principles
- Control Plane HA: Always use odd numbers (3 or 5) for control plane nodes to ensure quorum and leader election.
- IaC First: Prefer Infrastructure as Code (Terraform, etc.) over kubeadm/minikube for production cluster lifecycle management.
- GitOps for Migration: For upgrades, deploy a new cluster, replicate stateless workloads via GitOps, then switch DNS—avoid in-place upgrades in production.
- Cilium Cluster Mesh: Enables seamless workload migration between clusters of different Kubernetes versions; requires proper CNI and network configuration.
- Node Lifecycle: Always cordon → drain → delete for scaling down; create → join → uncordon for scaling up.
Tools Used
- MiniKube: For local development and demonstration of scaling/upgrading.
- Helm: For templated, versioned deployment of applications and CNI (Cilium).
- Cilium: CNI with Cluster Mesh and Hubble observability.
- KubeVip: For load balancing and VIP management in HA setups (not fully functional in MiniKube demo).
- UFW (Uncomplicated Firewall): Recommended for node-level port monitoring and security.
Common Pitfalls
- Inode Exhaustion: Databases (PostgreSQL, MySQL) can crash after prolonged operation due to small file creation.
- PID Limits: Containerized applications may fail silently if system PID limits are exceeded.
- Outdated Helm Charts: Upstream charts often ship with old image versions or broken templates due to lack of active maintainers.
- Port Conflicts: Cilium and other CNIs may fail to install on single-node clusters due to port binding conflicts.
- Missing Tolerations/NodeSelectors: Upstream Helm charts often ignore these, locking workloads to arbitrary nodes.
Practice Suggestions
- Create a custom Helm chart from scratch and deploy it with a modified `values.yaml`.
- Simulate a Kubernetes upgrade using two MiniKube clusters and Cilium Cluster Mesh.
- Use `kubectl get events --watch` and `kubectl describe pod <name>` to troubleshoot failed deployments in real time.
- Install UFW on a MiniKube node and monitor which ports a Helm-deployed application attempts to use.
- Test HA control plane resilience by deleting nodes and observing leader election behavior.
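For the UFW exercise, a starting baseline might look like the following (run as root on the node; the allowed ports are illustrative — a real cluster also needs kubelet, etcd, and CNI ports opened):

```shell
# Deny inbound by default, then allow only what you expect.
sudo ufw default deny incoming
sudo ufw allow 22/tcp      # SSH
sudo ufw allow 6443/tcp    # Kubernetes API server
sudo ufw enable
sudo ufw status verbose
# Compare actual listeners against the allowed list to spot surprises.
ss -tlnp
```

Anything listening in the `ss` output that is not in the UFW allow list is exactly the kind of unexpected container port usage the lesson recommends watching for.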