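Pod Troubleshooting

2026-05-27 · CKA k8s Practice

CKA Exam Domain 5: common Pod failures, CrashLoopBackOff, ImagePullBackOff, and the Pending state

A Pod is the smallest deployable unit in Kubernetes, and Pod failures are the most common troubleshooting scenario on the CKA exam.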

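1. Pod status quick reference

Status	Meaning
`Pending`	Pod has not been scheduled yet, or its images are still being pulled
`Running`	Pod is running normally
`CrashLoopBackOff`	Container keeps crashing and being restarted
`ImagePullBackOff`	Image pull failed; kubelet is backing off before retrying
`ErrImagePull`	Image pull attempt returned an error
`OOMKilled`	Container was killed for exceeding its memory limit
`CreateContainerConfigError`	Container configuration error (for example, a referenced ConfigMap does not exist)
`Init:Error` / `Init:CrashLoopBackOff`	An init container failed
`Terminating`	Pod is being deleted (and may be stuck)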

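2. Troubleshooting CrashLoopBackOff

# 1. Check the Pod status
kubectl get pods

# 2. View the container logs
kubectl logs <pod-name>

# 3. View the logs of the previous (crashed) instance
kubectl logs <pod-name> --previous

# 4. View Pod details (look for the error cause in the Events section)
kubectl describe pod <pod-name>

# 5. Open a shell inside the container (only works while the container is running)
kubectl exec -it <pod-name> -- /bin/sh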
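
The non-interactive steps above can be bundled into one small shell function, which saves typing under exam time pressure. This is a minimal sketch: `triage_pod` is a hypothetical helper name (not a kubectl feature), and the `--tail=50` limit and the `default` namespace fallback are assumptions chosen for illustration.

```shell
# triage_pod: hypothetical helper that runs steps 1-4 for a single Pod.
# Usage: triage_pod <pod-name> [namespace]
triage_pod() {
    pod="$1"
    ns="${2:-default}"   # assumption: fall back to the "default" namespace

    # Step 1: current status and restart count
    kubectl get pod "$pod" -n "$ns" -o wide

    # Step 2: logs of the current container instance (may be empty right after a restart)
    kubectl logs "$pod" -n "$ns" --tail=50

    # Step 3: logs of the previous (crashed) instance; this fails if there has been no restart
    kubectl logs "$pod" -n "$ns" --previous --tail=50 ||
        echo "no previous container instance for $pod" >&2

    # Step 4: events and last container state (reason, exit code)
    kubectl describe pod "$pod" -n "$ns"
}
```

For example, `triage_pod nginx-crash` prints everything needed for a first diagnosis in one pass. Step 5 (`kubectl exec -it`) is left out on purpose because it is interactive.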

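Common causes:

Cause	How to check
Application code error	Read the error in `kubectl logs`
Startup command fails	Check the Dockerfile ENTRYPOINT / CMD and the Pod's `command` / `args`
Configuration error	Check the ConfigMap / Secret mounts
Health check fails	Check the liveness (and startup) probe configuration; a failing readiness probe alone does not restart the container
Port conflict	Check the `containerPort` settings and make sure containers in the same Pod do not bind the same port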

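3. Troubleshooting ImagePullBackOff / ErrImagePull

# 1. View Pod details
kubectl describe pod <pod-name>

# The output contains lines similar to:
# Failed to pull image "nginx:latst": rpc error: ...
# Error: ErrImagePull
# Back-off pulling image "nginx:latst"

Common causes and fixes:

Cause	Fix
Misspelled image name	Check the `image` field; for example, `nginx:latst` should be `nginx:latest`
Image tag does not exist	Use `kubectl edit pod` to correct the tag
Private registry without credentials	Create an image pull Secret and reference it under `imagePullSecrets`
Registry unreachable	Check network connectivity to the registry
Image does not exist	Confirm that the image has been pushed to the registry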

Private registry authentication:

# Create a Docker registry Secret (--docker-email is optional)
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass> \
  --docker-email=<email>

# Reference it in the Pod
# spec:
#   imagePullSecrets:
#     - name: regcred

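4. Troubleshooting the Pending state

kubectl describe pod <pod-name>

The Events section shows why scheduling failed (messages are abbreviated here; real messages include per-node counts and more detail):

Event message	Meaning / fix
`0/1 nodes are available: Insufficient cpu`	Not enough CPU on any node; lower the Pod's CPU request or free up / add capacity
`0/1 nodes are available: Insufficient memory`	Not enough memory on any node; lower the Pod's memory request or free up / add capacity
`0/1 nodes are available: node(s) had taint`	The node is tainted; add a matching toleration or remove the taint
`0/1 nodes are available: pod has unbound PVC`	The PVC is unbound or does not exist; create the PVC or a matching PV
`0/1 nodes are available: node(s) didn't match node selector`	Node labels do not match; fix the `nodeSelector` or label the node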
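
The table above can be turned into a small lookup that maps a FailedScheduling message to the fix to try first. This is a minimal sketch: `explain_scheduling_failure` is a hypothetical helper, and the substrings it matches are assumptions based on the abbreviated messages in the table, so the exact wording may differ between Kubernetes versions.

```shell
# explain_scheduling_failure: hypothetical helper, not part of kubectl.
# Takes one scheduler event message and prints one hint per recognised cause.
explain_scheduling_failure() {
    msg="$1"
    found=0
    case "$msg" in *"Insufficient cpu"*)
        echo "cpu: lower the CPU request or add capacity"; found=1 ;;
    esac
    case "$msg" in *"Insufficient memory"*)
        echo "memory: lower the memory request or add capacity"; found=1 ;;
    esac
    case "$msg" in *taint*)
        echo "taint: add a toleration or remove the taint"; found=1 ;;
    esac
    case "$msg" in *unbound*)
        echo "pvc: create or bind the PersistentVolumeClaim"; found=1 ;;
    esac
    case "$msg" in *"node selector"*|*"node affinity"*)
        echo "selector: fix nodeSelector or label the node"; found=1 ;;
    esac
    # Nothing matched: fall back to reading the raw message
    [ "$found" -eq 1 ] || echo "unknown: read the full event message"
}
```

A single event often lists several reasons at once (for example one tainted node and one node without enough CPU), which is why each pattern is checked independently instead of stopping at the first match.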

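Check resources:

# View node capacity and the "Allocated resources" summary (requests / limits already assigned)
kubectl describe node <node-name>

# View actual node resource usage (requires metrics-server)
kubectl top node

# View the Pod's resource requests
kubectl get pod <pod-name> -o yaml | grep -A 5 resources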

5. OOMKilled (memory limit exceeded)

# STATUS shows OOMKilled
kubectl get pod
# NAME    STATUS     RESTARTS
# my-pod  OOMKilled  5

# View logs (logs may be missing after the container is OOM-killed)
kubectl logs <pod-name> --previous

# View why the container exited
kubectl describe pod <pod-name>
# Last State: Terminated
#   Reason: OOMKilled
#   Exit Code: 137

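Fixes:

# Raise the memory limit on the owning workload
# (the resources of an existing Pod generally cannot be changed in place)
kubectl set resources deployment <deployment-name> --limits=memory=512Mi
# Or edit the Pod template in the Deployment
kubectl edit deployment <deployment-name>

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"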
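
When comparing a limit such as `512Mi` with byte counts reported by monitoring tools, it helps to convert the quantity to bytes. The following is a minimal sketch: `to_bytes` is a hypothetical helper that handles only whole numbers with the binary suffixes `Ki`, `Mi`, and `Gi`, not the full Kubernetes quantity syntax.

```shell
# to_bytes: hypothetical helper that converts a binary-suffixed
# memory quantity (Ki, Mi, Gi) to bytes. Anything else is echoed unchanged.
to_bytes() {
    case "$1" in
        *Ki) echo $(( ${1%Ki} * 1024 )) ;;
        *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
        *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
        *)   echo "$1" ;;   # assumed to be plain bytes already
    esac
}

# Example: the request must not exceed the limit
request="$(to_bytes 256Mi)"   # 268435456
limit="$(to_bytes 512Mi)"     # 536870912
[ "$request" -le "$limit" ] && echo "request fits within limit"
```

With the values from the snippet above, the container is OOM-killed once its memory usage goes beyond 536870912 bytes.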

6. Init container failures

# View the init container status
kubectl describe pod <pod-name>

# View the init container logs
kubectl logs <pod-name> -c <init-container-name>

# View the logs of the init container's previous instance
kubectl logs <pod-name> -c <init-container-name> --previous

Example:

spec:
  initContainers:
    - name: init-setup
      image: busybox
      command: ["sh", "-c", "echo 'init done'"]
  containers:
    - name: app
      image: nginx

7. Readiness / liveness probe failures

kubectl describe pod <pod-name>

The Events section shows entries such as:

Warning  Unhealthy  3s (x5 over 30s)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 500
Warning  Unhealthy  10s (x3 over 50s)  kubelet  Readiness probe failed: Get "http://10.244.1.2:8080/healthz": dial tcp 10.244.1.2:8080: connect: connection refused

Troubleshooting steps:

# 1. Confirm which port the application listens on (requires netstat in the image)
kubectl exec <pod-name> -- netstat -tlnp

# 2. Test the probe path (requires wget in the image)
kubectl exec <pod-name> -- wget -qO- http://localhost:8080/healthz

# 3. Check the probe configuration
kubectl get pod <pod-name> -o yaml | grep -A 15 livenessProbe

8. Diagnosing inside a container with kubectl exec

# Open a shell in the container (use /bin/sh if the image has no bash)
kubectl exec -it <pod-name> -- /bin/sh
kubectl exec -it <pod-name> -- /bin/bash

# Run a command in the container
kubectl exec <pod-name> -- ls /app
kubectl exec <pod-name> -- env
kubectl exec <pod-name> -- cat /etc/config/config.yaml

# Select a container (multi-container Pod)
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh

9. Ephemeral debug containers with kubectl debug

Ephemeral containers, which kubectl debug adds to a running Pod, have been stable since Kubernetes v1.25.

# Add a debug container to a running Pod
kubectl debug <pod-name> -it --image=busybox

# Copy the Pod and replace a container's image for debugging
kubectl debug <pod-name> -it --copy-to=<debug-name> --container=<container> --image=busybox

# Create a debug Pod on a node
kubectl debug node/<node-name> -it --image=busybox

10. General troubleshooting command reference

# Pod status overview
kubectl get pods -o wide
kubectl get pods --all-namespaces | grep -v Running

# View events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events -w

# View the full YAML
kubectl get pod <pod-name> -o yaml

# View events across all namespaces
kubectl get events --all-namespaces
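
The `grep -v Running` filter above still lets through the header line and Completed Pods, and it hides Pods that are Running but not Ready (for example `0/1`). A slightly stricter filter can be sketched with awk. `unhealthy_pods` is a hypothetical helper, and it assumes the default column layout of `kubectl get pods --all-namespaces --no-headers` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# unhealthy_pods: hypothetical helper. Reads Pod rows on stdin and prints
# the ones that are not Running, or are Running but not fully Ready.
unhealthy_pods() {
    awk '{
        split($3, ready, "/")            # READY column looks like "1/2"
        if ($4 == "Completed") next      # finished Jobs are not failures
        if ($4 != "Running" || ready[1] != ready[2])
            print $1 "/" $2 ": " $4 " (" $3 " ready)"
    }'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

Reading from stdin keeps the filter independent of the cluster, so it can be tried against saved output as well.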

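11. Exam key points

For CrashLoopBackOff, check kubectl logs first, then kubectl describe
ImagePullBackOff is usually caused by a misspelled image name or tag
For Pending, read the scheduling failure reason in Events
The exit code for OOMKilled is 137
View init container logs with -c <container-name>
The Events section of kubectl describe pod is the most important diagnostic information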

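🧪 Complete walkthrough: troubleshooting CrashLoopBackOff

Scenario

A Pod keeps crashing and restarting (CrashLoopBackOff). This walkthrough troubleshoots it from start to finish: viewing logs, checking the configuration, applying a fix, and verifying the result.

Prerequisites

The cluster has a Pod in the CrashLoopBackOff state
You have permission to run kubectl logs and kubectl describe

Steps

Step 1: Find the failing Pod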

kubectl get pods
# NAME                      READY   STATUS             RESTARTS      AGE
# nginx-crash               0/1     CrashLoopBackOff   5 (15s ago)   2m
# web-app                   1/1     Running            0             10m

Step 2: View Pod details (look for clues in Events)

kubectl describe pod nginx-crash
# ...
# Containers:
#   nginx:
#     Container ID:   containerd://abc123
#     State:          Waiting
#       Reason:       CrashLoopBackOff
#     Last State:     Terminated
#       Reason:       Error
#       Exit Code:    1
#       Finished At:  2026-05-27T10:01:00Z
#     ...
# Events:
#   Type     Reason     Age                   From               Message
#   ----     ------     ----                  ----               -------
#   Normal   Scheduled  3m                    default-scheduler  Successfully assigned default/nginx-crash to worker-node1
#   Normal   Pulled     3m                    kubelet            Successfully pulled image "nginx:latest" in 2.345s
#   Normal   Created    3m                    kubelet            Created container nginx
#   Normal   Started    3m                    kubelet            Started container nginx
#   Warning  BackOff    15s (x5 over 2m40s)   kubelet            Back-off restarting failed container

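The exit code is 1, which means the application process exited with an error.

Step 3: View the logs of the current instance

kubectl logs nginx-crash
# 2026/05/27 10:00:00 [emerg] 1#1: open() "/etc/nginx/nginx.conf" failed (2: No such file or directory)
# nginx: [emerg] open() "/etc/nginx/nginx.conf" failed (2: No such file or directory)

Nginx cannot find its configuration file.

Step 4: View the logs of the previous crashed instance (if needed)

kubectl logs nginx-crash --previous
# (Same as the current logs, so every crash has the same cause)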

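Step 5: Inspect the configuration from a debug copy of the Pod

# The container keeps crashing, so kubectl exec is unreliable. Create a copy of the Pod
# in which the nginx container runs a shell instead of its normal command;
# the copy keeps the original volume mounts
kubectl debug nginx-crash -it --copy-to=nginx-debug --container=nginx -- /bin/bash
# Inside the shell, inspect the configuration directory
ls -la /etc/nginx/
# nginx.conf is missing → configuration problem
# After exiting the shell, remove the copy
kubectl delete pod nginx-debug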

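Step 6: Fix the problem

# Check the Pod (or Deployment) spec to find the root cause
# The original Pod YAML may mount a wrong ConfigMap that hides nginx.conf

# Option 1: fix the ConfigMap name or mount path in the spec
# nginx-crash is a bare Pod whose volumes cannot be edited in place, so export, edit, and recreate it
kubectl get pod nginx-crash -o yaml > nginx-crash.yaml
vi nginx-crash.yaml
kubectl replace --force -f nginx-crash.yaml
# (If the Pod is owned by a Deployment, use kubectl edit deployment <deployment-name> instead)

# Option 2: if the ConfigMap content is wrong, fix the ConfigMap
kubectl edit configmap nginx-config
# Make sure it contains a valid nginx.conf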

Step 7: Verify that the Pod recovers

kubectl get pods -w
# nginx-crash               1/1     Running            0               30s
# → The Pod is running normally again (press Ctrl+C to stop watching)

kubectl describe pod nginx-crash
# State:          Running
#   Started:      ...
# Events no longer show new BackOff warnings

Verification

# Verify that the Pod status is stable
kubectl get pods nginx-crash
# NAME          READY   STATUS    RESTARTS   AGE
# nginx-crash   1/1     Running   0          1m

# Verify that the service responds
kubectl port-forward pod/nginx-crash 8080:80 &
curl http://localhost:8080
# <!DOCTYPE html>
# <html>... (the Nginx welcome page is returned)

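Exam tips

The order used in this walkthrough for CrashLoopBackOff: kubectl describe → kubectl logs → kubectl logs --previous
Exit code meanings: 0 = normal exit, 1 = application error, 137 = killed by SIGKILL (128 + 9, typical of OOMKilled), 143 = terminated by SIGTERM (128 + 15, graceful shutdown)
kubectl logs --previous shows the logs from before the crash, which is very valuable when the container keeps restarting
If the container exits too quickly to exec into, use kubectl debug to create a debug copy
Misconfigured liveness probes are another common cause; also check readiness probes when a Pod runs but never becomes Ready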
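
The exit code rule from the tips above (codes above 128 mean "128 + signal number") can be written as a small decoder. This is a minimal sketch: `explain_exit_code` is a hypothetical helper name, and it assumes a numeric exit code as input.

```shell
# explain_exit_code: hypothetical helper that decodes a container exit code.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
explain_exit_code() {
    code="$1"
    if [ "$code" -eq 0 ]; then
        echo "0: exited normally"
    elif [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        case "$sig" in
            9)  echo "$code: killed by signal 9 (SIGKILL), e.g. OOMKilled" ;;
            15) echo "$code: terminated by signal 15 (SIGTERM)" ;;
            *)  echo "$code: terminated by signal $sig" ;;
        esac
    else
        echo "$code: application exited with an error"
    fi
}

# Usage (requires cluster access; reads the first container's last exit code):
#   explain_exit_code "$(kubectl get pod <pod-name> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```

An exit code of 137 only proves that the process received SIGKILL. Confirm an actual out-of-memory kill through `Reason: OOMKilled` in kubectl describe pod.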