Qingular

Control Plane Troubleshooting

CKA Exam Domain 5 — API Server, Scheduler, Controller Manager, and etcd troubleshooting

The control plane is the brain of a Kubernetes cluster. In the CKA exam, control plane troubleshooting commonly involves static Pod configuration, etcd health checks, and component restarts.


1. Control Plane Component Overview

Component | Function | Deployment Method
kube-apiserver | Entry point for all API requests | Static Pod (/etc/kubernetes/manifests/)
kube-scheduler | Pod scheduling decisions | Static Pod
kube-controller-manager | Controller management | Static Pod
etcd | Cluster data storage | Static Pod

# View control plane Pods
kubectl get pods -n kube-system

# View the static Pod configuration directory
ls /etc/kubernetes/manifests/
# kube-apiserver.yaml
# kube-scheduler.yaml
# kube-controller-manager.yaml
# etcd.yaml
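
A quick way to confirm that all four manifests are present is a small loop over the expected file names. The sketch below assumes the default kubeadm file names listed above; the function name missing_manifests is a hypothetical helper, not part of any Kubernetes tooling.

```shell
# Hypothetical helper: list the kubeadm control plane manifests that are
# missing from a directory (default: /etc/kubernetes/manifests).
# Prints one missing file name per line and returns 1 if anything is missing.
missing_manifests() {
  dir=${1:-/etc/kubernetes/manifests}
  rc=0
  for f in kube-apiserver.yaml kube-scheduler.yaml kube-controller-manager.yaml etcd.yaml; do
    if [ ! -f "$dir/$f" ]; then
      echo "$f"
      rc=1
    fi
  done
  return "$rc"
}

# Usage on the control plane node:
#   missing_manifests || echo "restore the files listed above"
```

An empty result with exit status 0 means every expected manifest exists; it says nothing about whether their contents are valid.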

2. API Server Troubleshooting

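The API Server is the core component of the cluster. When it is unreachable, kubectl and every other control plane component lose access to cluster state, although containers already running on the nodes keep running.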

Check API Server Status

# Check API Server Pod
kubectl get pods -n kube-system | grep apiserver

# View API Server logs
kubectl logs -n kube-system kube-apiserver-<node-name>
kubectl logs -n kube-system kube-apiserver-<node-name> --tail=100

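# If the API Server is completely unavailable, query the container runtime directly on the master node
# (-a also lists exited containers, which is where a crashing API Server shows up)
crictl ps -a | grep apiserver
crictl logs <container-id>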
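
The two crictl steps above can be combined so that the container ID does not have to be copied by hand. This is a minimal sketch: last_container_logs is a hypothetical helper name, and it assumes that crictl's --name, --latest, -q and --tail flags behave as in current cri-tools releases.

```shell
# Hypothetical helper: print the last N log lines of the most recently
# created container whose name matches $1, including exited containers.
last_container_logs() {
  name=$1
  lines=${2:-50}
  # -a includes exited containers, --latest keeps only the newest match,
  # -q prints just the container ID.
  id=$(crictl ps -a --name "$name" --latest -q)
  if [ -z "$id" ]; then
    echo "no container found for $name" >&2
    return 1
  fi
  crictl logs --tail "$lines" "$id"
}

# Usage on the control plane node (usually as root):
#   last_container_logs kube-apiserver 100
```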

Static Pod Configuration Repair

# Check the static Pod configuration file
cat /etc/kubernetes/manifests/kube-apiserver.yaml

# Common issues: incorrect certificate paths, incorrect etcd addresses, incorrect service-cluster-ip-range
# After modifying the configuration, kubelet will automatically recreate the static Pod
vi /etc/kubernetes/manifests/kube-apiserver.yaml
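
Since wrong certificate paths are among the common issues, it can help to list every flag in the manifest whose value is an absolute path that does not exist on the node. The sketch below assumes kubeadm's default layout, where the paths inside the container are the same as on the host; missing_flag_files is a hypothetical helper name.

```shell
# Hypothetical helper: print every "--flag=/absolute/path" entry in a static
# Pod manifest whose path does not exist on this machine.
missing_flag_files() {
  manifest=${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}
  # "--" stops grep from treating the pattern as an option.
  grep -oE -- '--[a-z-]+=/[^[:space:]"]*' "$manifest" | while IFS= read -r flag; do
    path=${flag#*=}            # strip everything up to the first "="
    [ -e "$path" ] || echo "$flag"
  done
}

# Usage on the control plane node:
#   missing_flag_files /etc/kubernetes/manifests/kube-apiserver.yaml
```

No output means every path-valued flag points at something that exists; it does not prove that the certificate itself is valid or unexpired.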

Troubleshooting Steps When API Server Is Unavailable

# Step 1: SSH to the master node
ssh <master-node>

# Step 2: Check if the static Pod configuration file exists
ls -la /etc/kubernetes/manifests/kube-apiserver.yaml

# Step 3: Check if kubelet is running
systemctl status kubelet

# Step 4: Check the container runtime
crictl ps | grep apiserver

# Step 5: View kubelet logs to locate the issue
journalctl -u kubelet -n 50 --no-pager

3. Scheduler Troubleshooting

Pod Not Being Scheduled

# View unscheduled Pods
kubectl get pods --all-namespaces | grep Pending

# View scheduling failure reasons
kubectl describe pod <pod-name>
# Events:
#   FailedScheduling  30s  default-scheduler  0/2 nodes are available

# View Scheduler logs
kubectl logs -n kube-system kube-scheduler-<master-name>
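
Filtering with grep Pending also matches Pods that merely have "Pending" in their name. A slightly stricter sketch compares the STATUS column instead; it assumes the default column order of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS, ...), and pending_pods is a hypothetical helper name.

```shell
# Hypothetical helper: read "kubectl get pods --all-namespaces" output on
# stdin and print namespace/name for every Pod whose STATUS is Pending.
pending_pods() {
  awk 'NR > 1 && $4 == "Pending" { print $1 "/" $2 }'
}

# Usage:
#   kubectl get pods --all-namespaces | pending_pods
# Server-side alternative using a field selector:
#   kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```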

Common Scheduling Issues

Issue | Cause | Solution
Pod Pending | Insufficient node resources | Add nodes or adjust resource requests
Pod Pending | Node has taints | Add tolerations
Pod Pending | Node selector mismatch | Modify nodeSelector
Pod not scheduled to desired node | Incorrect weight or affinity configuration | Check affinity configuration

Scheduler Configuration Check

# Check Scheduler configuration file
cat /etc/kubernetes/manifests/kube-scheduler.yaml

# Check Scheduler startup parameters
kubectl get pods -n kube-system kube-scheduler-<master-name> -o yaml

4. Controller Manager Troubleshooting

# View Controller Manager logs
kubectl logs -n kube-system kube-controller-manager-<master-name> --tail=50

# Check Controller Manager configuration
cat /etc/kubernetes/manifests/kube-controller-manager.yaml

# Check if control loops are functioning normally
# Logs should contain normal output from controllers: replicaset, deployment, node, serviceaccount, etc.

Common issues:

  • Node Controller not correctly marking node status
  • Deployment Controller not creating ReplicaSets
  • Service Account Controller not creating Tokens

5. etcd Member Health Check

# Use etcdctl inside the etcd Pod
# Note: etcdctl needs the endpoint and client certificate flags shown below (or the matching ETCDCTL_* environment variables)

# Check etcd endpoint health
kubectl exec -it -n kube-system etcd-<master-name> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster

# List etcd members
kubectl exec -it -n kube-system etcd-<master-name> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list

# Check etcd logs
kubectl logs -n kube-system etcd-<master-name> --tail=100
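
The etcdctl invocations above repeat the same endpoint and certificate flags. A small wrapper keeps them in one place. This is only a sketch: etcdctl_exec is a hypothetical function name, and the paths are the kubeadm defaults used throughout this page.

```shell
# Hypothetical helper: run etcdctl inside the etcd static Pod with the
# kubeadm default endpoint and certificate paths.
# Usage: etcdctl_exec <node-name> <etcdctl arguments...>
etcdctl_exec() {
  node=$1
  shift
  kubectl exec -n kube-system "etcd-$node" -- etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}

# Examples:
#   etcdctl_exec <master-name> endpoint health --cluster
#   etcdctl_exec <master-name> member list
```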

Health check output example:

https://192.168.1.10:2379 is healthy: successfully committed proposal: took = 2.345ms
https://192.168.1.11:2379 is healthy: successfully committed proposal: took = 3.012ms
https://192.168.1.12:2379 is healthy: successfully committed proposal: took = 1.987ms
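
With several members it is easy to overlook a single unhealthy line. The sketch below scans output in the format shown above and prints only the endpoints that are not reported as healthy. unhealthy_endpoints is a hypothetical helper name, and the "is unhealthy:" wording is an assumption about how etcdctl reports a failed member.

```shell
# Hypothetical helper: read "etcdctl endpoint health" output on stdin and
# print every endpoint that is not reported as healthy.
# Exit status is 1 if at least one such endpoint was found, otherwise 0.
unhealthy_endpoints() {
  awk '$2 == "is" && $3 != "healthy:" { print $1; bad = 1 } END { exit bad }'
}

# Usage (2>&1 because some etcdctl versions write these lines to stderr):
#   kubectl exec -n kube-system etcd-<master-name> -- etcdctl ... endpoint health --cluster 2>&1 | unhealthy_endpoints
```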

6. etcd Data Directory Full Handling

# Check etcd data directory size
du -sh /var/lib/etcd/

# Check disk space
df -h

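# Compact etcd history (discards old revisions; the freed space is only returned to the filesystem by defrag)
# First read the current revision. jq runs on the node here, because the $(...) substitution
# is evaluated by the local shell, not inside the etcd container.
REV=$(kubectl exec -n kube-system etcd-<master-name> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=json | jq -r '.[0].Status.header.revision')

# Then compact up to that revision
kubectl exec -n kube-system etcd-<master-name> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  compaction "$REV"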

# Defragmentation (actually frees disk space)
kubectl exec -it -n kube-system etcd-<master-name> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  defrag

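# History cleanup after compaction
# kube-apiserver already asks etcd to compact every 5 minutes by default (--etcd-compaction-interval), so manual compaction is rarely needed
# If etcd exceeded its space quota it raises a NOSPACE alarm and rejects writes; after compaction and defrag, clear it with "etcdctl alarm disarm" (same flags as above)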
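
The compaction command needs the current revision from the endpoint status JSON. When jq is not installed on the node, the number can be extracted with grep instead. This sketch assumes the compact JSON that etcdctl prints with --write-out=json; etcd_revision is a hypothetical helper name.

```shell
# Hypothetical helper: read "etcdctl endpoint status --write-out=json" on
# stdin and print the revision of the first endpoint.
etcd_revision() {
  grep -oE '"revision":[0-9]+' | head -n 1 | grep -oE '[0-9]+'
}

# Usage:
#   REV=$(kubectl exec -n kube-system etcd-<master-name> -- etcdctl ... endpoint status --write-out=json | etcd_revision)
#   kubectl exec -n kube-system etcd-<master-name> -- etcdctl ... compaction "$REV"
```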

7. Restarting Control Plane Components

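# Method 1: Delete the Pod
# Note: for a static Pod this removes only the mirror Pod object on the API server. kubelet recreates
# the mirror object, but the underlying container is not necessarily restarted. Use Method 2 when a real restart is required.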
kubectl delete pod -n kube-system kube-apiserver-<master-name>
kubectl delete pod -n kube-system kube-scheduler-<master-name>
kubectl delete pod -n kube-system kube-controller-manager-<master-name>
kubectl delete pod -n kube-system etcd-<master-name>

# Method 2: Move the static Pod configuration file (temporary removal)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 30
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Method 3: Modify the static Pod configuration to trigger a restart
# kubelet detects file changes and recreates the container
vi /etc/kubernetes/manifests/kube-apiserver.yaml

Note: Control plane components are static Pods. kubectl delete removes only their mirror Pod objects from the API server; kubelet keeps managing the Pods from /etc/kubernetes/manifests/ and recreates the mirror objects automatically.
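
Method 2 can be wrapped in a small function so that the manifest is always moved back. This is a sketch under the assumption that a 30 second wait is enough for kubelet to notice the change; restart_static_pod is a hypothetical helper name.

```shell
# Hypothetical helper: restart one static Pod by moving its manifest out of
# the watched directory and back again.
# Usage: restart_static_pod <component> [manifest-dir] [wait-seconds]
restart_static_pod() {
  name=$1
  dir=${2:-/etc/kubernetes/manifests}
  wait_s=${3:-30}
  if [ ! -f "$dir/$name.yaml" ]; then
    echo "no manifest $dir/$name.yaml" >&2
    return 1
  fi
  # Park the file outside the manifest directory; kubelet treats every
  # manifest left inside that directory as a Pod to run.
  tmp=$(mktemp -d) || return 1
  mv "$dir/$name.yaml" "$tmp/" || return 1
  sleep "$wait_s"              # give kubelet time to stop the Pod
  mv "$tmp/$name.yaml" "$dir/" && rmdir "$tmp"
}

# Usage on the control plane node (as root):
#   restart_static_pod kube-apiserver
```

The temporary location is deliberately outside /etc/kubernetes/manifests/, for the same reason that backup copies of a manifest should not be kept in that directory.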


8. General Troubleshooting Command Reference

# View status of all control plane components
kubectl get pods -n kube-system

# View component logs (latest 50 lines)
kubectl logs -n kube-system <pod-name> --tail=50

# View component events
kubectl get events -n kube-system --sort-by='.lastTimestamp'

# View control plane component configuration
kubectl get pods -n kube-system <pod-name> -o yaml

# Check the static Pod directory on the master node
ls -la /etc/kubernetes/manifests/

9. Exam Key Points

  • Control plane components are static Pods, configured under /etc/kubernetes/manifests/
  • kubectl delete on static Pods does not delete them; kubelet recreates them automatically
  • etcd health check uses etcdctl endpoint health
  • The etcd endpoint is typically https://127.0.0.1:2379
  • etcd certificate path: /etc/kubernetes/pki/etcd/
  • When API Server is unavailable, check static Pod configuration and kubelet logs
  • Common CKA exam issues: incorrect certificate paths, incorrect etcd endpoint addresses

🧪 Complete Hands-on Example: Troubleshoot API Server Failure

Scenario Description

The API Server is responding abnormally, and kubectl commands are not working. Walk through the troubleshooting process from checking control plane Pod status, static Pod configuration, and etcd health checks to full recovery.

Prerequisites

  • A cluster with a Master node
  • SSH access to the Master node
  • kubeadm tool available on the Master node

Steps

Step 1: Detect API Server anomaly

kubectl get nodes
# The connection to the server <master-ip>:6443 was refused - did you specify the right host or port?
# API Server is completely unavailable

Step 2: SSH to the Master node and check control plane Pods

ssh master-node

# Use crictl to view container status directly (kubectl is unavailable)
crictl ps | grep apiserver
# If no output, the API Server container is not running

# Check all kube-system containers
crictl ps -a | grep -E "apiserver|scheduler|controller|etcd"
# CONTAINER ID    IMAGE    CREATED     STATUS      NAME
# ...             ...      10m ago     Exited      kube-apiserver

Step 3: Check the static Pod configuration file

ls -la /etc/kubernetes/manifests/
# total 16
# -rw------- 1 root root 2153 May 27 09:00 kube-apiserver.yaml
# -rw------- 1 root root 2000 May 27 09:00 kube-controller-manager.yaml
# -rw------- 1 root root 1585 May 27 09:00 kube-scheduler.yaml
# -rw------- 1 root root 1466 May 27 09:00 etcd.yaml

# Inspect API Server configuration (look for common configuration errors)
cat /etc/kubernetes/manifests/kube-apiserver.yaml
# Focus on:
# - --etcd-servers: Is the address correct?
# - --tls-cert-file / --tls-private-key-file: Do the certificate paths exist?
# - --service-cluster-ip-range: Is it valid?

Step 4: View kubelet logs to get API Server startup errors

sudo journalctl -u kubelet -n 50 --no-pager
# May 27 10:00:00 master-node kubelet[1234]: E0527 10:00:00.123456    1234 kubelet.go:1234] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff: ..."
# May 27 10:00:01 master-node kubelet[1234]: E0527 10:00:01.123456    1234 kubelet.go:5678] "Unable to read config path" err="path does not exist, ignoring" path="/etc/kubernetes/manifests"

A CrashLoopBackOff error means the manifest was found but the container keeps exiting, so read the container logs with crictl logs. If the output shows that the config path does not exist, the static Pod directory that kubelet watches is missing or has been moved. A single missing manifest produces no such error; kubelet simply stops that Pod, which is why Step 3 checks the file list.

Step 5: Check if certificates have expired

# Check API Server certificate
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
# notBefore=May 27 09:00:00 2025 GMT
# notAfter=May 27 09:00:00 2026 GMT
# -> Certificate has expired!

# Check all certificate validity periods
sudo kubeadm certs check-expiration
# CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
# apiserver                  May 27, 2026 09:00 UTC   <invalid>       ca                      no
# apiserver-kubelet-client   ...
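
To pick out only the expired entries from the expiration report, the output can be filtered on the residual-time column. The sketch below assumes that kubeadm marks an expired certificate with <invalid> in that column, as recent versions do; expired_certs is a hypothetical helper name.

```shell
# Hypothetical helper: read "kubeadm certs check-expiration" output on stdin
# and print the name of every certificate marked as <invalid> (expired).
expired_certs() {
  awk '/<invalid>/ { print $1 }'
}

# Usage:
#   sudo kubeadm certs check-expiration | expired_certs
# A single certificate file can also be tested directly; openssl exits
# non-zero when the certificate has already expired:
#   openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -checkend 0
```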

Step 6: Renew certificates and restart components

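# Renew all certificates
sudo kubeadm certs renew all
# Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

# Verify certificates have been renewed
sudo kubeadm certs check-expiration
# apiserver                  May 27, 2027 10:30 UTC   364d            ca                      no

# Restart the control plane static Pods so that they load the renewed certificates.
# Restarting kubelet alone does not restart containers that are already running,
# so move the manifests out of the watched directory and back again.
sudo mkdir -p /tmp/manifests
sudo sh -c 'mv /etc/kubernetes/manifests/*.yaml /tmp/manifests/'
sleep 30   # kubelet stops the Pods once their manifests are gone
sudo sh -c 'mv /tmp/manifests/*.yaml /etc/kubernetes/manifests/'

# Wait for the static Pods to be recreated automatically
# (if ~/.kube/config is a copy of admin.conf, copy the renewed /etc/kubernetes/admin.conf over it before using kubectl)
sleep 30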

Step 7: Verify etcd health

# Use etcdctl to check etcd endpoint health
kubectl exec -it -n kube-system etcd-master-node -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 2.345ms

Step 8: Verify cluster recovery

# Verify from the Master node
kubectl get nodes
# NAME           STATUS   ROLES           AGE   VERSION
# master-node    Ready    control-plane   10d   v1.28.0
# worker-node1   Ready    <none>          10d   v1.28.0

kubectl get pods -n kube-system | grep apiserver
# kube-apiserver-master-node   1/1     Running   0          2m
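
Instead of a fixed sleep, recovery can be confirmed by polling the API server's readiness endpoint until it answers. This is a minimal sketch; wait_for_apiserver is a hypothetical helper name, and the default of 30 attempts with 5 seconds between them is an arbitrary choice.

```shell
# Hypothetical helper: poll the API server readiness endpoint until it
# responds or the attempts are used up.
# Usage: wait_for_apiserver [attempts] [delay-seconds]
wait_for_apiserver() {
  attempts=${1:-30}
  delay=${2:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if kubectl get --raw='/readyz' >/dev/null 2>&1; then
      echo "API server ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "API server not ready after $attempts attempts" >&2
  return 1
}

# Usage:
#   wait_for_apiserver && kubectl get nodes
```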

Verification Results

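# Confirm API Server is running normally
# (componentstatuses is deprecated since v1.19 and prints a warning; the readyz endpoint is the supported check)
kubectl get --raw='/readyz?verbose'
kubectl get componentstatuses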
# NAME                 STATUS    MESSAGE             ERROR
# controller-manager   Healthy   ok
# scheduler            Healthy   ok
# etcd-0               Healthy   {"health":"true"}

# Verify certificate validity
echo | openssl s_client -connect localhost:6443 2>/dev/null | openssl x509 -noout -dates

Exam Tips

  • When API Server is unavailable, SSH directly to the Master node and use crictl (or docker on older Docker-based clusters) to check container status
  • Control plane components are static Pods; configuration files are under /etc/kubernetes/manifests/
  • Certificate expiration is a common exam topic; use kubeadm certs renew all to renew
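  • After renewing certificates, restart the control plane static Pods by moving their manifests out of /etc/kubernetes/manifests/ and back; systemctl restart kubelet alone does not restart containers that are already running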
  • Use etcdctl endpoint health for etcd health checks; memorize the certificate paths
