Qingular

etcd Backup and Restore


etcd is the core data store of Kubernetes. Taking etcd snapshots, restoring from them, and managing cluster members are key skills for the CKA exam.


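Overview

etcd is the key-value database behind a Kubernetes cluster. It holds all cluster state (the data for Pods, Services, ConfigMaps, and every other resource). Backing up and restoring etcd is a core hands-on topic in the CKA exam and a critical skill in disaster recovery.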


1. etcd Basics

1.1 The role of etcd in the Kubernetes architecture

┌─────────────────────────────────────────┐
│               API Server                │
│  (the only component that talks to etcd │
│                directly)                │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│                  etcd                   │
│  ┌──────────┬────────────┬────────────┐ │
│  │ member1  │  member2   │  member3   │ │
│  │ (leader) │ (follower) │ (follower) │ │
│  └──────────┴────────────┴────────────┘ │
│  Raft consensus protocol                │
│  A write returns only after a majority  │
│  (floor(N/2) + 1) of members accept it  │
└─────────────────────────────────────────┘
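
The majority rule in the diagram determines how many members a cluster can lose. The following is a minimal sketch of that arithmetic; the function names `quorum` and `fault_tolerance` are illustrative and not part of any etcd tooling.

```shell
# quorum N: smallest majority of an N-member cluster, floor(N/2) + 1
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: how many members can fail while a quorum still exists
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A 4-member cluster needs 3 members for quorum and still tolerates only one failure, which is why etcd clusters are normally sized with an odd number of members.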

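1.2 Key etcd directories and files

# etcd data directory (default)
/var/lib/etcd/

# etcd configuration (static Pod manifest)
/etc/kubernetes/manifests/etcd.yaml

# etcd TLS certificates
/etc/kubernetes/pki/etcd/
├── ca.crt                   # etcd CA certificate
├── server.crt               # etcd server certificate
├── server.key               # etcd server private key
├── peer.crt                 # etcd peer certificate (member-to-member traffic)
├── peer.key                 # etcd peer private key
├── healthcheck-client.crt   # health check client certificate
└── healthcheck-client.key   # health check client private key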
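
Before running etcdctl it can save time to confirm that the certificate files listed above are actually present. The sketch below is one way to do that; `check_etcd_pki` is a hypothetical helper name, and the file list simply mirrors the tree above.

```shell
# check_etcd_pki DIR: report which of the expected etcd TLS files are missing.
# Prints one line per missing file; returns 0 only if every file is present.
check_etcd_pki() {
    pki_dir=$1
    pki_missing=0
    for f in ca.crt server.crt server.key peer.crt peer.key \
             healthcheck-client.crt healthcheck-client.key; do
        if [ ! -f "$pki_dir/$f" ]; then
            echo "missing: $pki_dir/$f"
            pki_missing=1
        fi
    done
    return "$pki_missing"
}
```

On a control plane node this would be called as `check_etcd_pki /etc/kubernetes/pki/etcd`, typically with sudo because the key files are usually readable by root only.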

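2. Installing and configuring etcdctl

2.1 Installing etcdctl

# Option 1: use the copy on a kubeadm control plane node, if one is installed
# (etcdctl is not always present on the host; the etcd static Pod image also ships it)
which etcdctl

# Option 2: download the etcd release binaries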
wget https://github.com/etcd-io/etcd/releases/download/v3.5.15/etcd-v3.5.15-linux-amd64.tar.gz
tar xzvf etcd-v3.5.15-linux-amd64.tar.gz
sudo cp etcd-v3.5.15-linux-amd64/etcdctl /usr/local/bin/

# Set the API version explicitly (required for etcdctl older than 3.4, harmless on newer versions)
export ETCDCTL_API=3
alias etcdctl='etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                      --cert=/etc/kubernetes/pki/etcd/server.crt \
                      --key=/etc/kubernetes/pki/etcd/server.key'

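2.2 TLS connection parameters

Parameter     Description                                  Default value
--cacert      CA certificate (verifies the etcd server)    /etc/kubernetes/pki/etcd/ca.crt
--cert        Client certificate (authentication)          /etc/kubernetes/pki/etcd/server.crt
--key         Client private key                           /etc/kubernetes/pki/etcd/server.key
--endpoints   etcd endpoint address                        https://127.0.0.1:2379

# Define an alias for convenience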
alias ectl='ETCDCTL_API=3 etcdctl \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --endpoints=https://127.0.0.1:2379'
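
Aliases are not expanded in non-interactive scripts, so an alternative is to pass the same settings through environment variables: etcdctl reads each flag from an `ETCDCTL_`-prefixed variable. The sketch below assumes the kubeadm paths from the table above; `use_etcd_env` is a hypothetical helper name.

```shell
# use_etcd_env PKI_DIR ENDPOINT: export etcdctl connection settings.
# Each variable corresponds to the etcdctl flag of the same name
# (--cacert, --cert, --key, --endpoints).
use_etcd_env() {
    export ETCDCTL_API=3
    export ETCDCTL_CACERT="$1/ca.crt"
    export ETCDCTL_CERT="$1/server.crt"
    export ETCDCTL_KEY="$1/server.key"
    export ETCDCTL_ENDPOINTS="$2"
}

use_etcd_env /etc/kubernetes/pki/etcd https://127.0.0.1:2379
```

After this, a plain `etcdctl member list` picks up the settings. Note that sudo normally resets the environment, so either run from a root shell or use `sudo -E`, and avoid passing the same setting both as a flag and as a variable.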

3. Taking an etcd snapshot

3.1 Creating a snapshot

# Basic backup command
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Specify --endpoints (in a multi-member cluster, pointing at any one member is enough)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M).db \
    --endpoints=https://192.168.1.10:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Shorter form using the alias (if it has been defined)
ectl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db

3.2 Verifying a snapshot

# Show the snapshot status
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db

# Example output:
# 2f0e0b8, 243850, <total keys>, 1.8 MB
# (hash, revision, total keys, total size; there is no "corrupted" column)

# Show the status as a table
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db -w table

# Example output (table format):
# +----------+----------+--------------+------------+
# |   HASH   | REVISION |  TOTAL KEYS  | TOTAL SIZE |
# +----------+----------+--------------+------------+
# | 2f0e0b8  |   243850 | <total keys> | 1.8 MB     |
# +----------+----------+--------------+------------+
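
A cheap check before calling `snapshot status` is to confirm that the snapshot file exists and is not empty, which catches a failed or interrupted `snapshot save`. This is only a sketch: `snapshot_usable` is a hypothetical helper, and it does not validate the snapshot's contents.

```shell
# snapshot_usable FILE: succeed only if FILE is a readable, non-empty regular file.
# This does not check the snapshot's integrity hash.
snapshot_usable() {
    [ -f "$1" ] && [ -r "$1" ] && [ -s "$1" ]
}
```

For example, `snapshot_usable /backup/etcd-snapshot.db || echo "snapshot missing or empty"`.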

# Backup script with timestamped file names
cat <<'EOF' > /usr/local/bin/backup-etcd.sh
#!/bin/bash
# Stop on the first error so a failed snapshot never triggers pruning
set -euo pipefail
BACKUP_DIR="/backup/etcd"
mkdir -p "$BACKUP_DIR"
DATE=$(date +%Y%m%d-%H%M%S)
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/etcd-snapshot-$DATE.db" \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
# Delete snapshots older than 7 days
find "$BACKUP_DIR" -name "etcd-snapshot-*.db" -mtime +7 -delete
EOF
chmod +x /usr/local/bin/backup-etcd.sh
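
Age-based pruning can delete every snapshot if backups stop running for a week. A count-based policy is one alternative; the sketch below keeps the newest N files and relies on the `etcd-snapshot-YYYYmmdd-HHMMSS.db` naming used by the script, whose lexical order matches chronological order. `prune_snapshots` is a hypothetical helper name.

```shell
# prune_snapshots DIR KEEP: delete all but the KEEP newest snapshots in DIR.
# Assumes timestamped names that sort chronologically and contain no newlines.
prune_snapshots() {
    ls "$1"/etcd-snapshot-*.db 2>/dev/null | sort -r | tail -n +"$(( $2 + 1 ))" |
    while IFS= read -r old; do
        rm -f -- "$old"
    done
}
```

Calling `prune_snapshots /backup/etcd 7` would keep the seven most recent snapshots regardless of their age.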

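4. Restoring from an etcd snapshot

4.1 Restoring a single etcd node

# Full restore procedure

# 1. Stop etcd and the API server (important: prevents writes during the restore)
#    by moving their static Pod manifests out of the manifests directory
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 30  # wait for the Pods to stop

# 2. Set the current data directory aside as a backup
sudo mv /var/lib/etcd /var/lib/etcd.bak

# 3. Restore from the snapshot (writing under /var/lib requires root)
#    etcd 3.5 flags this subcommand as deprecated in favor of etcdutl snapshot restore
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd

# 4. Set ownership, only if etcd runs as a dedicated etcd user
#    (a default kubeadm static Pod runs etcd as root; skip this step there)
sudo chown -R etcd:etcd /var/lib/etcd

# 5. Move the static Pod manifests back
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# 6. Wait for the Pods to start
sleep 30
kubectl get pods -n kube-system | grep -E "etcd|kube-apiserver"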
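
The fixed `sleep 30` calls are a guess at how long the kubelet needs. A polling loop is a more reliable alternative; the sketch below is a generic retry helper (`wait_until` is a hypothetical name) that re-runs any command until it succeeds or a timeout expires.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every 2 seconds until it succeeds; returns 1 on timeout.
wait_until() {
    remaining=$1
    shift
    until "$@" >/dev/null 2>&1; do
        remaining=$(( remaining - 2 ))
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 2
    done
}
```

After moving the manifests back, `wait_until 120 kubectl get nodes` blocks until the API server answers again, or gives up after roughly two minutes.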

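4.2 Restore options

# Options accepted by snapshot restore
# (when restoring into a directory other than the one etcd already uses,
#  point the hostPath of the etcd data volume in etcd.yaml at the new directory afterwards)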
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd-restored \
    --name=etcd-0 \
    --initial-cluster=etcd-0=https://192.168.1.10:2380 \
    --initial-cluster-token=etcd-cluster \
    --initial-advertise-peer-urls=https://192.168.1.10:2380
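
When the snapshot is restored into a new directory such as /var/lib/etcd-restored, the etcd static Pod must mount that directory. One common approach is to change only the hostPath of the data volume, leaving `--data-dir` and the container mountPath untouched. The sketch below prints the edited manifest to stdout; `retarget_host_path` is a hypothetical helper and assumes the kubeadm manifest layout, where the host directory appears on a line of the form `path: /var/lib/etcd`.

```shell
# retarget_host_path MANIFEST OLD_DIR NEW_DIR
# Prints MANIFEST with lines of the exact form "path: OLD_DIR" rewritten to
# "path: NEW_DIR". Lines such as "- mountPath: ..." and "--data-dir=..." are
# not matched. OLD_DIR and NEW_DIR must not contain '#'.
retarget_host_path() {
    sed "s#^\([[:space:]]*path:[[:space:]]*\)${2}[[:space:]]*\$#\1${3}#" "$1"
}
```

A cautious workflow is to write the result to a temporary file, review it with diff, and only then copy it over /etc/kubernetes/manifests/etcd.yaml.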

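4.3 Restoring a multi-node etcd cluster

# Run the restore on every etcd node, using the same snapshot file on each

# Node 1 (restored as a member of the new initial cluster)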
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd \
    --name=etcd-1 \
    --initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
    --initial-cluster-token=etcd-cluster-token \
    --initial-advertise-peer-urls=https://192.168.1.10:2380

# Node 2
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd \
    --name=etcd-2 \
    --initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
    --initial-cluster-token=etcd-cluster-token \
    --initial-advertise-peer-urls=https://192.168.1.11:2380

# Node 3
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd \
    --name=etcd-3 \
    --initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
    --initial-cluster-token=etcd-cluster-token \
    --initial-advertise-peer-urls=https://192.168.1.12:2380
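
The three commands above differ only in `--name` and `--initial-advertise-peer-urls`, so the shared `--initial-cluster` value is easy to mistype. The sketch below builds that value from a single member list and prints (without running) the restore command for each node. The helper name `initial_cluster` and the assumption that every member uses https on peer port 2380 are illustrative.

```shell
# initial_cluster "NAME=IP ...": build the --initial-cluster value,
# assuming every member serves peer traffic over https on port 2380.
initial_cluster() {
    result=""
    for pair in $1; do
        name=${pair%%=*}
        ip=${pair#*=}
        result="${result:+$result,}$name=https://$ip:2380"
    done
    echo "$result"
}

members="etcd-1=192.168.1.10 etcd-2=192.168.1.11 etcd-3=192.168.1.12"
cluster=$(initial_cluster "$members")

# Print the restore command for each node; review it, then run it on that node.
for pair in $members; do
    echo "etcdctl snapshot restore /backup/etcd-snapshot.db" \
         "--data-dir=/var/lib/etcd --name=${pair%%=*}" \
         "--initial-cluster=$cluster" \
         "--initial-cluster-token=etcd-cluster-token" \
         "--initial-advertise-peer-urls=https://${pair#*=}:2380"
done
```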

5. etcd member management

5.1 Listing members

# List the members of the etcd cluster
ETCDCTL_API=3 etcdctl member list \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Show the list as a table
ETCDCTL_API=3 etcdctl member list -w table \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Example output:
# +------------------+---------+--------+---------------------------+---------------------------+
# |        ID        | STATUS  |  NAME  |       PEER ADDRS          |       CLIENT ADDRS        |
# +------------------+---------+--------+---------------------------+---------------------------+
# | 8e9e05c52164694d | started | cp-1   | https://192.168.1.10:2380 | https://192.168.1.10:2379 |
# | 6a4d1c8352a47abd | started | cp-2   | https://192.168.1.11:2380 | https://192.168.1.11:2379 |
# | 4f2c7a9621c4a3ef | started | cp-3   | https://192.168.1.12:2380 | https://192.168.1.12:2379 |
# +------------------+---------+--------+---------------------------+---------------------------+

5.2 Adding and removing members

# Add a new member
ETCDCTL_API=3 etcdctl member add etcd-4 \
    --peer-urls=https://192.168.1.13:2380 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Remove a member
ETCDCTL_API=3 etcdctl member remove <member-id> \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Update a member's peer URLs
ETCDCTL_API=3 etcdctl member update <member-id> \
    --peer-urls=https://192.168.1.14:2380 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
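
Both `member remove` and `member update` take a member ID rather than a name. The sketch below looks the ID up by name from the default (non-table) output of `etcdctl member list`, assuming that output is one comma-separated line per member with the ID in the first field and the name in the third. `member_id` is a hypothetical helper.

```shell
# member_id NAME: read `etcdctl member list` default output on stdin
# (fields separated by ", ") and print the ID of the member called NAME.
member_id() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

For example, piping `etcdctl member list` (with the TLS flags shown above) into `member_id cp-2` prints the ID to pass to `member remove`.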

5.3 Health checks

# Check the health of a single etcd endpoint
ETCDCTL_API=3 etcdctl endpoint health \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Check every endpoint in the cluster
ETCDCTL_API=3 etcdctl endpoint health --cluster \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Show endpoint status (including version and DB size)
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
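
In a script, the output of `endpoint health --cluster` can be reduced to a single number. The sketch below counts lines of the form `<endpoint> is healthy: ...`; that line format is an assumption based on typical etcdctl output, and `healthy_count` is a hypothetical helper.

```shell
# healthy_count: read `etcdctl endpoint health` output on stdin and print
# how many endpoints were reported healthy.
healthy_count() {
    # grep -c exits non-zero when the count is 0, so mask that status
    grep -c ' is healthy: ' || true
}
```

Some etcdctl versions write part of the health report to stderr, so it is safest to pipe with `2>&1`. Comparing the count against floor(N/2) + 1 shows whether the cluster still has quorum.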

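6. Full disaster recovery procedures

6.1 Total data loss: single-node etcd

# Scenario: the data on the only etcd node is completely corrupted

# 1. Stop all control plane components
#    (a dedicated directory keeps unrelated /tmp/*.yaml files out of step 5)
sudo mkdir -p /tmp/manifests-backup
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/manifests-backup/
sleep 30

# 2. Delete the corrupted data
sudo rm -rf /var/lib/etcd

# 3. Restore from the snapshot
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd

# 4. Set ownership, only if etcd runs as a dedicated etcd user
#    (a default kubeadm static Pod runs etcd as root; skip this step there)
sudo chown -R etcd:etcd /var/lib/etcd

# 5. Bring the control plane components back
sudo mv /tmp/manifests-backup/*.yaml /etc/kubernetes/manifests/

# 6. Verify the recovery
sleep 60
kubectl get nodes
kubectl get pods --all-namespaces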
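
Step 2 deletes the damaged data directory outright. Outside an exam it is often safer to rename it, so the old files remain available if the restore itself goes wrong. The sketch below is one way to do that; `set_aside` is a hypothetical helper name.

```shell
# set_aside DIR: rename DIR to DIR.bak-<timestamp> and print the new name.
# Succeeds without doing anything if DIR does not exist.
set_aside() {
    [ -e "$1" ] || return 0
    backup="$1.bak-$(date +%Y%m%d-%H%M%S)"
    mv -- "$1" "$backup" && echo "$backup"
}
```

Run as root, `set_aside /var/lib/etcd` would replace the `rm -rf` in step 2; the renamed directory can be deleted once the cluster has been verified.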

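6.2 Majority of etcd nodes lost: HA cluster

# Scenario: 2 of the 3 nodes in an etcd cluster are unrecoverable

# 1. Take a snapshot on the surviving etcd node
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-emergency.db \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# 2. Stop etcd on the surviving node
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# 3. Restore the snapshot as a new single-member cluster.
#    snapshot restore always writes fresh cluster metadata, so no extra flag is needed
#    (--force-new-cluster is an option of the etcd server, not of snapshot restore).
#    <node-name> and <node-ip> stand for the surviving node's etcd member name and IP.
sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-emergency.db \
    --data-dir=/var/lib/etcd-new \
    --name=<node-name> \
    --initial-cluster=<node-name>=https://<node-ip>:2380 \
    --initial-advertise-peer-urls=https://<node-ip>:2380

# 4. Swap in the restored data directory
sudo rm -rf /var/lib/etcd
sudo mv /var/lib/etcd-new /var/lib/etcd
# Only if etcd runs as a dedicated etcd user:
sudo chown -R etcd:etcd /var/lib/etcd

# 5. Move the etcd static Pod manifest back
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/

# 6. Add new etcd members one at a time
ETCDCTL_API=3 etcdctl member add new-member \
    --peer-urls=https://192.168.1.14:2380 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key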

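7. Inspecting etcd with kubeadm and kubectl

# This kubeadm phase regenerates the static Pod manifest for the local etcd member
# (it is not a health check)
sudo kubeadm init phase etcd local --config=/etc/kubernetes/kubeadm-config.yaml

# View the etcd Pod logs
kubectl logs -n kube-system etcd-<node-name> --tail=100

# Open a shell inside the etcd Pod
# (works only if the etcd image includes a shell; some images ship without one)
kubectl exec -n kube-system etcd-<node-name> -it -- sh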

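CKA exam key points

  1. Set ETCDCTL_API=3: etcdctl 3.4 and later default to the v3 API, but older versions default to v2, where the snapshot commands are unavailable. Setting it explicitly is harmless.
  2. TLS flags: in the exam, etcdctl must be given --cacert, --cert, and --key
  3. Stop the API server before restoring: move the etcd and kube-apiserver static Pod manifests out of the manifests directory
  4. --data-dir sets the restore path: the restored data directory must match the path in the etcd configuration
  5. Fix ownership after restoring if etcd runs as a dedicated user: sudo chown -R etcd:etcd /var/lib/etcd (a default kubeadm static Pod runs etcd as root, so nothing needs changing there)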
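
To check point 4, the data directory etcd is configured to use can be read straight from the static Pod manifest. The sketch below extracts the value of a command-line flag from the manifest, assuming the kubeadm layout where each flag is a list item of the form `- --flag=value`. `manifest_flag` is a hypothetical helper.

```shell
# manifest_flag MANIFEST FLAG: print the value of the list item "- --FLAG=value".
# FLAG is given without the leading dashes, e.g. data-dir.
manifest_flag() {
    sed -n "s/^[[:space:]]*-[[:space:]]*--${2}=\(.*\)\$/\1/p" "$1"
}
```

For example, `sudo sh -c '. ./helpers.sh; manifest_flag /etc/kubernetes/manifests/etcd.yaml data-dir'` (with the function saved in a file of your choosing) prints the directory to compare against the `--data-dir` used for the restore. The same helper works for `cert-file`, `key-file`, and `trusted-ca-file`.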

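🧪 Worked example: etcd backup and disaster recovery

Scenario

Take a snapshot backup of etcd, simulate data corruption, and then restore the cluster from the snapshot.

Prerequisites

  • sudo access to the control plane node
  • etcdctl installed (v3 API)
  • etcd TLS certificate files present in /etc/kubernetes/pki/etcd/

Steps

Step 1: Create an etcd snapshot backup

# Select the v3 API (already the default since etcdctl 3.4; required on older versions)
export ETCDCTL_API=3

# Create the backup directory
sudo mkdir -p /backup

# Take the snapshot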
sudo ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Snapshot saved at /backup/etcd-snapshot-20250527.db

Step 2: Verify the snapshot file

# Check the snapshot status
sudo ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250527.db
# 2f0e0b8, 243850, <total keys>, 1.8 MB
# (hash, revision, total keys, total size)

# Show it as a table
sudo ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250527.db -w table
# +----------+----------+--------------+------------+
# |   HASH   | REVISION |  TOTAL KEYS  | TOTAL SIZE |
# +----------+----------+--------------+------------+
# | 2f0e0b8  |   243850 | <total keys> | 1.8 MB     |
# +----------+----------+--------------+------------+

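Step 3: Simulate a failure (stop etcd and the API server)

# Move the etcd and API server static Pod manifests out of the manifests directory
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Wait for the Pods to stop completely
sleep 30

# Confirm that the etcd container has stopped
sudo crictl ps | grep etcd
# (no output means etcd has stopped)

# Delete the current etcd data directory (simulates data corruption)
sudo rm -rf /var/lib/etcd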

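Step 4: Restore from the snapshot

# Restore the snapshot into the data directory
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20250527.db \
  --data-dir=/var/lib/etcd

# Set ownership, only if etcd runs as a dedicated etcd user
# (a default kubeadm static Pod runs etcd as root; skip this step there)
sudo chown -R etcd:etcd /var/lib/etcd

# Confirm the data directory has been restored (the owner shown assumes the chown above was applied)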
ls -la /var/lib/etcd/
# total 24
# drwx------  4 etcd etcd 4096 May 27 10:00 .
# drwxr-xr-x  3 root root 4096 May 27 10:00 ..
# drwx------  3 etcd etcd 4096 May 27 10:00 member

Step 5: Bring the control plane components back

# Move the etcd and API server manifests back
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Wait for the Pods to start (roughly 30 to 60 seconds)
sleep 60

Verification

# Confirm the etcd Pod is running
kubectl get pods -n kube-system | grep etcd
# etcd-control-plane-1    1/1     Running   0   1m

# Confirm the API server is running
kubectl get pods -n kube-system | grep kube-apiserver
# kube-apiserver-control-plane-1    1/1     Running   0   1m

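# Confirm the cluster resources are back
kubectl get nodes
kubectl get pods --all-namespaces
# Every resource that existed when the snapshot was taken should be visible

# Confirm etcd is healthy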
kubectl exec -n kube-system etcd-control-plane-1 -- etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 3.245672ms

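Exam tips

  • Set ETCDCTL_API=3: etcdctl older than 3.4 defaults to the v2 API, where the snapshot commands are unavailable; on newer versions the setting is harmless
  • Stop the API server before restoring: this prevents writes during the restore from leaving the data inconsistent
  • Never omit the TLS flags: every etcdctl command needs --cacert, --cert, and --key
  • Fix ownership after restoring: if etcd runs as a dedicated etcd user, run sudo chown -R etcd:etcd /var/lib/etcd, otherwise etcd cannot start
  • The -w table flag makes etcdctl output easier to read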
