etcd Backup and Restore

etcd is the core data store of Kubernetes. Mastering etcd snapshot backup, restore, and member management is a key skill for the CKA exam.

Overview

etcd is the distributed key-value store behind a Kubernetes cluster; it holds all cluster state (Pods, Services, ConfigMaps, and every other resource). Backing up and restoring etcd is a core hands-on topic in the CKA exam and the foundation of control plane disaster recovery.


1. etcd Basics

1.1 etcd Architecture and Role in Kubernetes

┌─────────────────────────────────────────┐
│              API Server                   │
│      (The only component that accesses     │
│                etcd)                      │
└─────────────────┬───────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────┐
│                etcd                      │
│    ┌──────────┼──────────┐              │
│    │ member1  │ member2  │ member3      │
│    │ (leader) │(follower)│(follower)    │
│    └──────────┴──────────┴──────────────┘
│    Raft Consensus Protocol                │
│    Majority (N/2+1) writes must succeed   │
│    before returning                       │
└─────────────────────────────────────────┘
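
The majority rule in the diagram can be made concrete with a little shell arithmetic. This is only an illustrative sketch: `quorum` and `fault_tolerance` are hypothetical helper names, not part of etcd or etcdctl.

```shell
#!/bin/sh
# Hypothetical helpers (not part of etcd): Raft majority arithmetic.

# quorum N: smallest number of members that forms a majority of N.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: how many members can fail while a majority remains.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why etcd clusters are normally sized with an odd number of members.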

1.2 Key etcd Directories and Files

# etcd data directory (default)
/var/lib/etcd/

# etcd configuration file (static Pod)
/etc/kubernetes/manifests/etcd.yaml

# etcd TLS certificates
/etc/kubernetes/pki/etcd/
├── ca.crt                   # etcd CA certificate
├── server.crt               # etcd server certificate
├── server.key               # etcd server key
├── peer.crt                 # etcd peer certificate (cluster communication)
├── peer.key                 # etcd peer key
├── healthcheck-client.crt   # Health check client certificate
└── healthcheck-client.key   # Health check client key

2. etcdctl Installation and Configuration

2.1 Installing etcdctl

# Method 1: Check whether etcdctl is already available on the control plane node
# (kubeadm does not install etcdctl on the host itself; exam and lab environments often provide it)
which etcdctl

# Method 2: Download the etcd binary
wget https://github.com/etcd-io/etcd/releases/download/v3.5.15/etcd-v3.5.15-linux-amd64.tar.gz
tar xzvf etcd-v3.5.15-linux-amd64.tar.gz
sudo cp etcd-v3.5.15-linux-amd64/etcdctl /usr/local/bin/

# Set environment variables (important!)
export ETCDCTL_API=3
alias etcdctl='etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                      --cert=/etc/kubernetes/pki/etcd/server.crt \
                      --key=/etc/kubernetes/pki/etcd/server.key'

2.2 TLS Connection Parameters

Parameter     Description                              Default Path
--cacert      CA certificate (verifies etcd server)    /etc/kubernetes/pki/etcd/ca.crt
--cert        Client certificate (authentication)      /etc/kubernetes/pki/etcd/server.crt
--key         Client key                               /etc/kubernetes/pki/etcd/server.key
--endpoints   etcd node addresses                      https://127.0.0.1:2379

# Set an alias for convenience
alias ectl='ETCDCTL_API=3 etcdctl \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --endpoints=https://127.0.0.1:2379'
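
Aliases are not expanded in non-interactive bash scripts by default, so a shell function may be a more portable way to bundle the same TLS flags. The sketch below assumes the kubeadm default paths listed above; `ETCD_PKI_DIR` and `ETCD_ENDPOINTS` are hypothetical override variables introduced only for this example.

```shell
#!/bin/sh
# Sketch: a function instead of an alias, so it also works inside scripts.
# ETCD_PKI_DIR and ETCD_ENDPOINTS are hypothetical overrides for this example;
# the defaults are the kubeadm paths used throughout this page.
ectl() {
    ETCDCTL_API=3 etcdctl \
        --cacert="${ETCD_PKI_DIR:-/etc/kubernetes/pki/etcd}/ca.crt" \
        --cert="${ETCD_PKI_DIR:-/etc/kubernetes/pki/etcd}/server.crt" \
        --key="${ETCD_PKI_DIR:-/etc/kubernetes/pki/etcd}/server.key" \
        --endpoints="${ETCD_ENDPOINTS:-https://127.0.0.1:2379}" \
        "$@"
}

# Usage (requires etcdctl and read access to the key files):
#   ectl endpoint health
#   ETCD_ENDPOINTS=https://192.168.1.10:2379 ectl member list -w table
```

Everything after the function name is passed through unchanged, so any etcdctl subcommand works with it.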

3. etcd Snapshot Backup

3.1 Creating a Snapshot

# Basic backup command
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Specify endpoints (choose one for multi-etcd clusters)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M).db \
    --endpoints=https://192.168.1.10:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Use alias to simplify (if already set)
ectl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db

3.2 Verifying a Snapshot

# View snapshot status
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db

# Output example (simple format; the key count is shown as a placeholder):
# 2f0e0b8, 243850, <total-keys>, 1.8 MB
# (hash, revision, total keys, total size)

# Detailed status view
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db -w table

# Output example (tabular format):
# +---------+----------+--------------+------------+
# |  HASH   | REVISION |  TOTAL KEYS  | TOTAL SIZE |
# +---------+----------+--------------+------------+
# | 2f0e0b8 |   243850 | <total-keys> | 1.8 MB     |
# +---------+----------+--------------+------------+

# Create a dated backup script (writing to /usr/local/bin requires root)
cat <<'EOF' | sudo tee /usr/local/bin/backup-etcd.sh > /dev/null
#!/bin/bash
# Abort on the first error so a failed snapshot never triggers the cleanup below
set -euo pipefail
BACKUP_DIR="/backup/etcd"
mkdir -p "$BACKUP_DIR"
DATE=$(date +%Y%m%d-%H%M%S)
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/etcd-snapshot-$DATE.db" \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
# Delete backups older than 7 days
find "$BACKUP_DIR" -name "etcd-snapshot-*.db" -mtime +7 -delete
EOF
sudo chmod +x /usr/local/bin/backup-etcd.sh
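
Age-based cleanup with `find -mtime` eventually removes every snapshot if new backups stop being created. One possible alternative is to keep a fixed number of the newest files. The sketch below assumes the `etcd-snapshot-YYYYmmdd-HHMMSS.db` naming used by the script above; `prune_snapshots` is a hypothetical helper name.

```shell
#!/bin/sh
# prune_snapshots DIR KEEP: delete all but the KEEP newest snapshots in DIR.
# Hypothetical helper. It relies on the etcd-snapshot-YYYYmmdd-HHMMSS.db
# naming, so a plain lexical sort is also a chronological sort.
prune_snapshots() {
    dir=$1
    keep=$2
    # Newest first; skip the first KEEP entries and remove the rest.
    ls -1 "$dir" | grep '^etcd-snapshot-.*\.db$' | sort -r |
        tail -n +"$((keep + 1))" |
        while IFS= read -r name; do
            rm -f -- "$dir/$name"
        done
}

# Usage:
#   prune_snapshots /backup/etcd 7
```

Files that do not match the snapshot naming pattern are left untouched.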

4. etcd Snapshot Restore

4.1 Single etcd Node Restore

# Complete restore process

# 1. Stop etcd and the API Server (prevents writes during the restore)
# Moving the static Pod manifests out of the manifests directory makes the kubelet stop both Pods
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 30  # Wait for Pods to stop

# 2. Back up the current data directory
sudo mv /var/lib/etcd /var/lib/etcd.bak

# 3. Restore from snapshot (sudo is needed to write under /var/lib;
#    restore works on the local file only, so no TLS flags are required)
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd

# 4. Set ownership, only if etcd runs as a dedicated etcd user
#    (kubeadm static Pods run etcd as root by default, so this step can be skipped there)
sudo chown -R etcd:etcd /var/lib/etcd

# 5. Restore static Pod manifests
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# 6. Wait for Pods to start
sleep 30
kubectl get pods -n kube-system | grep -E "etcd|kube-apiserver"
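
The fixed `sleep 30` calls above are a guess at how long the kubelet needs. A polling loop is one way to wait only as long as necessary. This is a minimal sketch; `wait_until` is a hypothetical helper, and the usage lines assume kubectl is configured on the node.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-run COMMAND once per second until it succeeds
# or TIMEOUT_SECONDS have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout=$1
    shift
    elapsed=0
    until "$@" > /dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Usage in place of "sleep 30" (assumes kubectl is configured on this node):
#   wait_until 120 kubectl get --raw=/readyz
#   kubectl get pods -n kube-system | grep -E "etcd|kube-apiserver"
```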

4.2 Specifying Restore Parameters

# Available parameters for snapshot restore
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd-restored \
    --name=etcd-0 \
    --initial-cluster=etcd-0=https://192.168.1.10:2380 \
    --initial-cluster-token=etcd-cluster \
    --initial-advertise-peer-urls=https://192.168.1.10:2380
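
When the snapshot is restored into a new directory such as /var/lib/etcd-restored, etcd only uses it once the static Pod manifest points there. The sketch below assumes a kubeadm-style etcd.yaml whose data volume is declared as a hostPath line `path: /var/lib/etcd`; `set_etcd_hostpath` is a hypothetical helper, and the manifest layout should be checked before relying on it.

```shell
#!/bin/sh
# set_etcd_hostpath MANIFEST OLD_DIR NEW_DIR
# Hypothetical helper: rewrite the hostPath line "path: OLD_DIR" to
# "path: NEW_DIR". The match is anchored to the whole line, so other paths
# such as /etc/kubernetes/pki/etcd are left alone. Only the host side of the
# volume changes; the container still sees the data at its original mount path.
set_etcd_hostpath() {
    manifest=$1
    old=$2
    new=$3
    tmp=$(mktemp) || return 1
    sed "s|^\([[:space:]]*path:[[:space:]]*\)${old}[[:space:]]*\$|\1${new}|" \
        "$manifest" > "$tmp" &&
        cat "$tmp" > "$manifest"   # overwrite in place to keep owner and mode
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage (as root, while the manifest is parked outside the manifests directory):
#   set_etcd_hostpath /tmp/etcd.yaml /var/lib/etcd /var/lib/etcd-restored
```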

4.3 Multi-Node etcd Cluster Restore

# Run the restore on every etcd node, using a copy of the same snapshot file

# Node 1 (restored as initial cluster member)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd \
    --name=etcd-1 \
    --initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
    --initial-cluster-token=etcd-cluster-token \
    --initial-advertise-peer-urls=https://192.168.1.10:2380

# Node 2
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd \
    --name=etcd-2 \
    --initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
    --initial-cluster-token=etcd-cluster-token \
    --initial-advertise-peer-urls=https://192.168.1.11:2380

# Node 3
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd \
    --name=etcd-3 \
    --initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
    --initial-cluster-token=etcd-cluster-token \
    --initial-advertise-peer-urls=https://192.168.1.12:2380
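
The three commands above differ only in --name and --initial-advertise-peer-urls, while --initial-cluster must be identical on every node. A small generator can reduce copy-and-paste mistakes. This sketch only prints the commands; `initial_cluster` is a hypothetical helper, and the https scheme and port 2380 are taken from the examples above.

```shell
#!/bin/sh
# initial_cluster NAME=IP [NAME=IP...]
# Hypothetical helper: print the --initial-cluster value for the given
# members, assuming every peer uses https and port 2380.
initial_cluster() {
    result=""
    for pair in "$@"; do
        name=${pair%%=*}
        ip=${pair#*=}
        result="${result:+$result,}${name}=https://${ip}:2380"
    done
    echo "$result"
}

MEMBERS="etcd-1=192.168.1.10 etcd-2=192.168.1.11 etcd-3=192.168.1.12"
# Word splitting of $MEMBERS is intentional here.
CLUSTER=$(initial_cluster $MEMBERS)

# Print (not run) the restore command for each node.
for member in $MEMBERS; do
    echo "etcdctl snapshot restore /backup/etcd-snapshot.db" \
         "--data-dir=/var/lib/etcd --name=${member%%=*}" \
         "--initial-cluster=$CLUSTER" \
         "--initial-cluster-token=etcd-cluster-token" \
         "--initial-advertise-peer-urls=https://${member#*=}:2380"
done
```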

5. etcd Member Management

5.1 Viewing Members

# List etcd cluster members
ETCDCTL_API=3 etcdctl member list \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# View in table format
ETCDCTL_API=3 etcdctl member list -w table \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Output example:
# +------------------+---------+--------+---------------------------+---------------------------+
# |        ID        | STATUS  |  NAME  |       PEER ADDRS          |       CLIENT ADDRS        |
# +------------------+---------+--------+---------------------------+---------------------------+
# | 8e9e05c52164694d | started | cp-1   | https://192.168.1.10:2380 | https://192.168.1.10:2379 |
# | 6a4d1c8352a47abd | started | cp-2   | https://192.168.1.11:2380 | https://192.168.1.11:2379 |
# | 4f2c7a9621c4a3ef | started | cp-3   | https://192.168.1.12:2380 | https://192.168.1.12:2379 |
# +------------------+---------+--------+---------------------------+---------------------------+
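
Member removal and update take a member ID rather than a name, so it can help to look the ID up from the list output. The sketch below parses the default comma-separated output of `etcdctl member list`, where the ID is the first field and the name the third; `member_id_by_name` is a hypothetical helper.

```shell
#!/bin/sh
# member_id_by_name NAME
# Hypothetical helper: read the default (comma-separated) output of
# "etcdctl member list" on stdin and print the ID of the member called NAME.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage (assumes the ectl alias or the equivalent TLS flags):
#   id=$(ectl member list | member_id_by_name cp-2)
#   ectl member remove "$id"
```

If no member has the given name, the function prints nothing, so the caller should check for an empty result before removing anything.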

5.2 Adding/Removing Members

# Add a new member
ETCDCTL_API=3 etcdctl member add etcd-4 \
    --peer-urls=https://192.168.1.13:2380 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Remove a member
ETCDCTL_API=3 etcdctl member remove <member-id> \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Update a member
ETCDCTL_API=3 etcdctl member update <member-id> \
    --peer-urls=https://192.168.1.14:2380 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

5.3 Health Check

# Check the health of a single etcd endpoint
ETCDCTL_API=3 etcdctl endpoint health \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# Check all cluster endpoints
ETCDCTL_API=3 etcdctl endpoint health --cluster \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# View endpoint status (including version, DB size, etc.)
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

6. Complete Disaster Recovery Workflow

6.1 Complete Corruption -- Single-Node etcd

# Scenario: The only etcd node's data is completely corrupted

# 1. Stop all control plane components
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/
sleep 30

# 2. Delete corrupted data
sudo rm -rf /var/lib/etcd

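# 3. Restore from snapshot (sudo is needed to write under /var/lib)
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
    --data-dir=/var/lib/etcd

# 4. Set ownership, only if etcd runs as a dedicated etcd user
#    (kubeadm static Pods run etcd as root by default)
sudo chown -R etcd:etcd /var/lib/etcd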

# 5. Restore control plane components
sudo mv /tmp/*.yaml /etc/kubernetes/manifests/

# 6. Verify restoration
sleep 60
kubectl get nodes
kubectl get pods --all-namespaces
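
Step 2 above deletes the old data directory outright. If disk space allows, renaming it instead keeps a way back in case the snapshot turns out to be unusable. A minimal sketch, with `archive_dir` as a hypothetical helper name:

```shell
#!/bin/sh
# archive_dir DIR
# Hypothetical helper: rename DIR to DIR.bak-<timestamp> and print the new
# name, so the old data can still be inspected or rolled back to.
archive_dir() {
    src=${1%/}
    dest="${src}.bak-$(date +%Y%m%d-%H%M%S)"
    if [ ! -d "$src" ]; then
        echo "archive_dir: $src is not a directory" >&2
        return 1
    fi
    mv -- "$src" "$dest" && echo "$dest"
}

# Usage (as root, after etcd has stopped):
#   archive_dir /var/lib/etcd
```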

6.2 Majority etcd Node Failure -- HA Cluster

# Scenario: 2 out of 3 etcd nodes in the cluster are unrecoverable

# 1. Back up on the surviving etcd node
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-emergency.db \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# 2. Stop etcd on the surviving node
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/

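# 3. Restore the snapshot into a fresh data directory
#    (a snapshot restore always creates a new logical cluster; --force-new-cluster is a flag
#    of the etcd server, not of "etcdctl snapshot restore")
sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-emergency.db \
    --data-dir=/var/lib/etcd-new

# 4. Replace the data directory
sudo rm -rf /var/lib/etcd
sudo mv /var/lib/etcd-new /var/lib/etcd
sudo chown -R etcd:etcd /var/lib/etcd  # only if etcd runs as a dedicated etcd user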

# 5. Restore the etcd static Pod
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/

# 6. Add new etcd members one by one
ETCDCTL_API=3 etcdctl member add new-member \
    --peer-urls=https://192.168.1.14:2380 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

7. Checking etcd with kubeadm

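# Note: this command is not a health check. It regenerates the static Pod manifest
# for the local etcd member (/etc/kubernetes/manifests/etcd.yaml) from the kubeadm configuration
sudo kubeadm init phase etcd local --config=/etc/kubernetes/kubeadm-config.yaml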

# View etcd Pod logs
kubectl logs -n kube-system etcd-<node-name> --tail=100

# Enter the etcd Pod
kubectl exec -n kube-system etcd-<node-name> -it -- sh

CKA Exam Key Points

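  1. Set ETCDCTL_API=3 -- etcdctl 3.4 and later already default to the v3 API, but older versions default to v2, where the snapshot command is unavailable; setting it explicitly is always safe
  2. TLS certificate parameters -- In the exam, etcdctl commands that connect to the etcd server must specify --cacert, --cert, --key
  3. Stop etcd and the API Server before restoring -- Move the etcd and apiserver static Pod manifests out of /etc/kubernetes/manifests
  4. --data-dir specifies the restore path -- The restored data directory must match the etcd configuration (the data volume hostPath in etcd.yaml)
  5. Check ownership after restore -- Run sudo chown -R etcd:etcd /var/lib/etcd when etcd runs as a dedicated etcd user; kubeadm static Pods run etcd as root by default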
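
A quick pre-flight check for point 2 can catch a wrong certificate path before etcdctl fails with a less obvious TLS error. This is a sketch only; `missing_etcd_pki` is a hypothetical helper, and the file names are the kubeadm defaults used throughout this page.

```shell
#!/bin/sh
# missing_etcd_pki [DIR]
# Hypothetical pre-flight helper: print each of ca.crt, server.crt and
# server.key that is absent from DIR (default: the kubeadm etcd PKI path).
# Returns 0 when all three exist, 1 otherwise.
missing_etcd_pki() {
    dir=${1:-/etc/kubernetes/pki/etcd}
    rc=0
    for f in ca.crt server.crt server.key; do
        if [ ! -f "$dir/$f" ]; then
            echo "$dir/$f"
            rc=1
        fi
    done
    return "$rc"
}

# Usage:
#   missing_etcd_pki || echo "fix the paths above before running etcdctl"
```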

🧪 Complete Hands-on Example: etcd Backup and Disaster Recovery

Scenario Description

Take an etcd snapshot backup, then simulate a data corruption scenario and restore the cluster from the snapshot.

Prerequisites

  • sudo access to the control plane node
  • etcdctl installed (v3 API)
  • etcd TLS certificate files present in /etc/kubernetes/pki/etcd/

Steps

Step 1: Create an etcd snapshot backup

# Select the v3 API explicitly (required for etcdctl older than 3.4, harmless on newer versions)
export ETCDCTL_API=3

# Create backup directory
sudo mkdir -p /backup

# Execute snapshot backup
sudo ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Snapshot saved at /backup/etcd-snapshot-20250527.db

Step 2: Verify the snapshot file

# Check snapshot status
sudo ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250527.db
# 2f0e0b8, 243850, <total-keys>, 1.8 MB
# (hash, revision, total keys, total size; the key count is shown as a placeholder)

# View in table format
sudo ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250527.db -w table
# +---------+----------+--------------+------------+
# |  HASH   | REVISION |  TOTAL KEYS  | TOTAL SIZE |
# +---------+----------+--------------+------------+
# | 2f0e0b8 |   243850 | <total-keys> | 1.8 MB     |
# +---------+----------+--------------+------------+

Step 3: Simulate a failure (stop etcd and API Server)

# Move etcd and API Server static Pod manifests out of the manifests directory
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Wait for Pods to completely stop
sleep 30

# Verify etcd Pod has stopped
sudo crictl ps | grep etcd
# (no output, meaning etcd has stopped)

# Delete the current etcd data directory (simulate data corruption)
sudo rm -rf /var/lib/etcd

Step 4: Restore from snapshot

# Restore from snapshot to the data directory
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20250527.db \
  --data-dir=/var/lib/etcd

# Set ownership if etcd runs as a dedicated etcd user
# (kubeadm static Pods run etcd as root by default; skip this step if the etcd user does not exist)
sudo chown -R etcd:etcd /var/lib/etcd

# Verify data directory has been restored (sudo is needed because the directory mode is 0700)
sudo ls -la /var/lib/etcd/
# total 24
# drwx------  4 etcd etcd 4096 May 27 10:00 .
# drwxr-xr-x  3 root root 4096 May 27 10:00 ..
# drwx------  3 etcd etcd 4096 May 27 10:00 member
Step 5: Restore control plane components

# Move etcd and API Server manifests back
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Wait for Pods to start (approximately 30-60 seconds)
sleep 60

Verification

# Verify etcd Pod is running
kubectl get pods -n kube-system | grep etcd
# etcd-control-plane-1    1/1     Running   0   1m

# Verify API Server is running
kubectl get pods -n kube-system | grep kube-apiserver
# kube-apiserver-control-plane-1    1/1     Running   0   1m

# Verify cluster resources have been restored
kubectl get nodes
kubectl get pods --all-namespaces
# All resources from before the restore should be visible

# Verify etcd health
kubectl exec -n kube-system etcd-control-plane-1 -- etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 3.245672ms

Exam Tips

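  • Set ETCDCTL_API=3 -- etcdctl versions before 3.4 default to the v2 API, where the snapshot command is unavailable; newer versions default to v3, so setting it is harmless
  • Must stop the API Server before restoring -- prevents data writes during restore that would cause inconsistency
  • TLS certificate parameters cannot be omitted -- every etcdctl command that connects to the etcd server needs --cacert, --cert, --key (snapshot restore works on a local file and does not)
  • Check ownership after restore -- if etcd runs as a dedicated etcd user, sudo chown -R etcd:etcd /var/lib/etcd must not be forgotten, otherwise etcd cannot read its data; kubeadm static Pods run etcd as root by default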
  • Use the -w table parameter for clearer etcdctl output
