Scheduling Constraints

2026-05-27 · CKA k8s Practice

nodeSelector, Node/Pod Affinity, Taints & Tolerations, PriorityClass


Overview

Scheduling constraints control which nodes a Pod can be assigned to. The CKA exam focuses on hands-on configuration of Taints & Tolerations, Node Affinity, and nodeSelector.

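1. nodeSelector

The simplest way to select nodes: the Pod is scheduled only onto nodes that carry every label listed in nodeSelector.

1.1 Labeling nodes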

kubectl get nodes --show-labels

# Add labels
kubectl label nodes <node-name> disktype=ssd
kubectl label nodes <node-name> gpu=true

# Remove a label (append "-" to the key)
kubectl label nodes <node-name> disktype-

# Change an existing label (requires --overwrite)
kubectl label nodes <node-name> disktype=hdd --overwrite

1.2 Using nodeSelector

apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: nginx
    image: nginx

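2. Node Affinity

A more flexible form of node selection than nodeSelector; it supports match expressions and soft preferences.

2.1 Two types

Type	Description
`requiredDuringSchedulingIgnoredDuringExecution`	Hard constraint: must be satisfied for the Pod to be scheduled (like nodeSelector, but with expressions)
`preferredDuringSchedulingIgnoredDuringExecution`	Soft constraint: the scheduler favors matching nodes but still schedules the Pod if none match

In both types, IgnoredDuringExecution means that label changes made after scheduling do not evict a Pod that is already running.

2.2 Configuration example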

apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
            - nvme
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1
      - weight: 20
        preference:
          matchExpressions:
          - key: gpu
            operator: Exists
  containers:
  - name: nginx
    image: nginx

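2.3 Match operators

Operator	Meaning	Example
`In`	The label value equals one of the listed values	disktype In [ssd, nvme]
`NotIn`	The label value equals none of the listed values (nodes without the key also match)	disktype NotIn [hdd]
`Exists`	The key is present (values must be omitted)	gpu Exists
`DoesNotExist`	The key is absent (values must be omitted)	gpu DoesNotExist
`Gt`	The label value, parsed as an integer, is greater than the single listed value	memory Gt [32]
`Lt`	The label value, parsed as an integer, is less than the single listed value	memory Lt [64]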
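As a rough illustration of the operators above, the following sketch writes a Pod manifest that combines `NotIn` and `Gt` in one term. The file name, the Pod name, and the `memory` node label are assumptions made for this example (the label is borrowed from the table, not a built-in Kubernetes label). Expressions inside one `matchExpressions` list must all hold, while separate `nodeSelectorTerms` entries are alternatives.

```shell
# Write a hypothetical manifest to a local file; nothing is sent to the cluster here.
cat <<'EOF' > affinity-operators-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-operators-pod   # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:        # both expressions must match the same node
          - key: disktype
            operator: NotIn
            values:
            - hdd
          - key: memory            # assumed custom label, e.g. memory=64
            operator: Gt
            values:
            - "32"                 # a single value, quoted because label values are strings
  containers:
  - name: nginx
    image: nginx
EOF
```

The manifest can then be applied with `kubectl apply -f affinity-operators-pod.yaml`; if no node carries a suitable `memory` label, the Pod would be expected to stay Pending.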

2.4 Generating a manifest for Node Affinity

# kubectl run has no dedicated affinity flag, so generate the YAML first and then edit it
kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml and add spec.affinity.nodeAffinity

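3. Pod Affinity / Anti-Affinity

Controls where a Pod is scheduled relative to other Pods: in the same topology domain (affinity) or in a different one (anti-affinity).

3.1 Configuration example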

apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity-pod
  labels:
    app: frontend
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - backend
        topologyKey: "kubernetes.io/hostname"   # on the same node as the backend Pods
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - frontend
          topologyKey: "kubernetes.io/hostname"  # preferably not on the same node as other frontend Pods
  containers:
  - name: nginx
    image: nginx

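3.2 Topology keys (topologyKey)

Key	Scope
`kubernetes.io/hostname`	Individual node
`topology.kubernetes.io/zone`	Availability zone
`topology.kubernetes.io/region`	Region
`failure-domain.beta.kubernetes.io/zone`	Availability zone (deprecated legacy label)

3.3 Common scenarios

Pod Affinity: schedule Web and Cache Pods onto the same node (lower network latency)
Pod Anti-Affinity: spread Pods of the same application across different nodes (high availability)

3.4 Notes

Pod Affinity/Anti-Affinity adds work for the scheduler, which can slow scheduling in large clusters
A requiredDuringScheduling rule that no node can satisfy leaves the Pod Pending
topologyKey must not be empty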
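The high-availability scenario and the Pending caveat above can be tried with a hard anti-affinity rule. This is only a sketch: the Deployment name `spread-web`, its label, and the file name are hypothetical. Each replica refuses to share a node with another Pod labeled `app: spread-web`.

```shell
# Write a hypothetical Deployment manifest to a local file; nothing is applied here.
cat <<'EOF' > spread-web.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-web               # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spread-web
  template:
    metadata:
      labels:
        app: spread-web
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: at most one spread-web Pod per node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: spread-web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx
EOF
```

After `kubectl apply -f spread-web.yaml`, scaling the Deployment beyond the number of schedulable nodes should leave the extra replicas Pending, which is the trade-off of using the required form instead of the preferred one.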

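4. Taints & Tolerations

A taint makes a node repel Pods; a toleration lets a Pod be scheduled onto a node despite a matching taint. A toleration only permits placement on the tainted node; it does not attract the Pod to it.

4.1 Working with taints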

# View a node's taints
kubectl describe node <node-name> | grep Taints

# Add a taint (kubectl taint nodes <node> <key>=<value>:<effect>)
kubectl taint nodes node1 app=blue:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key1=value1:PreferNoSchedule

# Remove a taint (append "-")
kubectl taint nodes node1 app=blue:NoSchedule-
kubectl taint nodes node1 key1:NoExecute-    # the value can be omitted

# Remove every taint with a given key, whatever its effect
kubectl taint nodes node1 app-               # removes all taints whose key is app

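4.2 Taint effects

Effect	Description
`NoSchedule`	New Pods that do not tolerate the taint are not scheduled; Pods already running are not evicted
`PreferNoSchedule`	Soft constraint: the scheduler tries to avoid the node but may still use it
`NoExecute`	Pods that do not tolerate the taint are evicted if already running, and new ones are rejected

4.3 Toleration configuration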

apiVersion: v1
kind: Pod
metadata:
  name: toleration-pod
spec:
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"
  - key: "key1"
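    operator: "Exists"          # matches any taint with key key1, regardless of value
    effect: "NoExecute"
    tolerationSeconds: 60       # stay on the node for 60 seconds after the taint appears, then be evicted
  - operator: "Exists"          # tolerates every taint (use with care)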
  containers:
  - name: nginx
    image: nginx

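4.4 Toleration operators

Operator	Meaning	Example
`Equal`	key, value, and effect must all match the taint (an omitted effect matches every effect)	key=app, value=blue, effect=NoSchedule
`Exists`	Tolerates any taint with the given key	No value needed
`Exists` with no key	Tolerates every taint	Suitable for DaemonSets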
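The last table row (a key-less `Exists` toleration for DaemonSets) might look like the sketch below. The DaemonSet name `node-agent`, its label, and the nginx placeholder image are assumptions for illustration only.

```shell
# Write a hypothetical DaemonSet manifest to a local file; nothing is applied here.
cat <<'EOF' > node-agent-ds.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent               # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      tolerations:
      - operator: Exists         # no key and no effect: tolerates every taint
      containers:
      - name: agent
        image: nginx             # placeholder image for the example
EOF
```

Applied with `kubectl apply -f node-agent-ds.yaml`, this should place one Pod on every node, tainted or not, including control plane nodes. That breadth is why the table advises using it with care outside of node-level agents.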

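4.5 Common scenarios

# 1. Dedicated nodes (run only specific Pods)
kubectl taint nodes gpu-node dedicated=gpu:NoSchedule
# Only Pods with a matching toleration can be scheduled here

# 2. Default taint on control plane nodes
kubectl describe node controlplane | grep Taints
# Output: node-role.kubernetes.io/control-plane:NoSchedule

# 3. Node failure handling
# (Kubernetes normally adds this taint automatically when a node becomes unreachable;
#  adding it by hand only simulates that situation)
kubectl taint nodes node1 node.kubernetes.io/unreachable:NoExecute

4.6 Exam tips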

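# Keep Pods without a matching toleration off a node (NoSchedule)
kubectl taint nodes worker1 env=production:NoSchedule

# Create a Pod that tolerates the taint
kubectl run toleration-pod --image=nginx --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml and add spec.tolerations, then run: kubectl apply -f pod.yaml

# Check which node the Pod landed on (a toleration permits worker1 but does not force it)
kubectl get pods -o wide | grep toleration-pod

# Running ordinary Pods on the control plane (master) node
# Method 1: remove the control plane taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-

# Method 2: add a toleration to the Pod (recommended; the taint keeps protecting the control plane)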
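Method 2 has no manifest in the notes above, so here is one possible sketch. The Pod and file names are hypothetical. The toleration alone only allows the control plane node; the `nodeSelector` is an extra assumption that the node carries the `node-role.kubernetes.io/control-plane` label with an empty value, as kubeadm-built clusters typically do. Drop it if the Pod is merely permitted, not required, to run there.

```shell
# Write a hypothetical Pod manifest to a local file; nothing is applied here.
cat <<'EOF' > controlplane-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: controlplane-pod         # hypothetical name
spec:
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists             # the default taint has no value, so match on the key only
    effect: NoSchedule
  nodeSelector:
    # Assumption: kubeadm-style label with an empty value on control plane nodes
    node-role.kubernetes.io/control-plane: ""
  containers:
  - name: nginx
    image: nginx
EOF
```

After `kubectl apply -f controlplane-pod.yaml`, `kubectl get pod controlplane-pod -o wide` shows where it landed. The taint stays on the node, so other workloads are still kept away.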

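5. PriorityClass

A PriorityClass sets a Pod's priority. When resources are short, the scheduler can preempt (evict) lower-priority Pods to make room for a pending higher-priority Pod.

5.1 Creating a PriorityClass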

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000              # higher value = higher priority
globalDefault: false        # whether this class is the default for Pods that set no priorityClassName
description: "High priority Pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority Pods"

# Create the PriorityClasses
kubectl apply -f priorityclass.yaml

# List them
kubectl get priorityclass
kubectl get pc

5.2 Using it in a Pod

apiVersion: v1
kind: Pod
metadata:
  name: high-priority-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 512Mi

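5.3 Common PriorityClass values

Name	Value	Description
`system-cluster-critical`	2000000000	Critical cluster components
`system-node-critical`	2000001000	Critical node components
Custom (high)	1000000	High-priority applications
Custom (low)	100	Low-priority batch jobs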

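6. Pod scheduling flow

Filtering (formerly called Predicates):
- Check that the node has enough free resources for the Pod's requests
- Check nodeSelector and Node Affinity
- Check Taints & Tolerations
- Check Pod Affinity/Anti-Affinity
Scoring (formerly called Priorities):
- Resource utilization (resources in use / total resources)
- Node Affinity weight
- Pod spreading (Anti-Affinity)
Binding: the scheduler binds the Pod to the selected node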

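# View scheduler logs (the Pod name ends with the control plane node's name)
kubectl logs -n kube-system kube-scheduler-controlplane

# View scheduling events
kubectl get events --sort-by='.lastTimestamp' | grep -i schedule

# See which node a Pod was scheduled to
kubectl get pods -o wide
kubectl get pod <pod-name> -o wide

7. Practical exam commands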

# 1. Show all node labels
kubectl get nodes --show-labels

# 2. Show node taints
kubectl describe nodes | grep -A 5 Taints

# 3. Allow ordinary Pods on the control plane (master) node by removing its taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-

# 4. Create a Pod with Node Affinity (generate the manifest, then edit it)
kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml and add spec.affinity.nodeAffinity

# 5. Add a toleration to a DaemonSet (a common task)
kubectl get ds fluentd -o yaml > ds.yaml
# In ds.yaml, add tolerations under spec.template.spec

# 6. Check whether a Pod is unschedulable because of a taint
kubectl describe pod <pod-name> | grep -A 10 Events
# Example output (abbreviated; wording varies by version): 0/3 nodes are available: 1 node(s) had taint, 2 node(s) didn't match pod anti-affinity

# 7. Quickly create a PriorityClass
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important
value: 100000
globalDefault: false
EOF

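🧪 Complete walkthrough: controlling Pod scheduling with Taints/Tolerations and NodeAffinity

Scenario

Taint a node to control which Pods may be scheduled onto it, and configure nodeAffinity for finer-grained scheduling constraints.

Prerequisites

A multi-node Kubernetes cluster (at least 2 worker nodes)
kubectl configured to connect to the cluster
Worker nodes named node1 and node2 (if your node names differ, substitute them in the commands)

Steps

Step 1: Label and taint a node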

# List the nodes
kubectl get nodes
# Expected output:
# NAME           STATUS   ROLES           AGE   VERSION
# controlplane   Ready    control-plane   10m   v1.29
# node1          Ready    <none>          9m    v1.29
# node2          Ready    <none>          9m    v1.29

# Label node1
kubectl label nodes node1 disktype=ssd
# Expected output: node/node1 labeled

# Taint node1 (only Pods that tolerate this taint can be scheduled onto it)
kubectl taint nodes node1 dedicated=gpu:NoSchedule
# Expected output: node/node1 tainted

Step 2: Create a Deployment with a toleration (can be scheduled to node1)

cat <<'EOF' > deploy-tolerate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - name: nginx
        image: nginx
EOF

kubectl apply -f deploy-tolerate.yaml
# Expected output: deployment.apps/gpu-workload created

kubectl get pods -l app=gpu-app -o wide
# Expected output: the Pods are allowed onto node1 because of the toleration. A toleration does not
# force placement, so a replica may also land on untainted node2.
# NAME                            READY   STATUS    RESTARTS   AGE   NODE
# gpu-workload-<hash>-<pod-id>    1/1     Running   0          <s>   node1
# gpu-workload-<hash>-<pod-id>    1/1     Running   0          <s>   node1
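Because a toleration only permits node1 and does not select it, one way to make the placement deterministic is to pair the toleration with a required Node Affinity on the `disktype=ssd` label from Step 1. The sketch below is a variant of the Step 2 Deployment. The name `gpu-workload-pinned`, its label, and the file name are hypothetical and are not part of the original walkthrough.

```shell
# Write a hypothetical variant of the Step 2 Deployment to a local file; nothing is applied here.
cat <<'EOF' > deploy-tolerate-pinned.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload-pinned      # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-app-pinned
  template:
    metadata:
      labels:
        app: gpu-app-pinned
    spec:
      tolerations:               # allows the Pods onto the tainted node
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:            # restricts the Pods to nodes labeled disktype=ssd
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: nginx
        image: nginx
EOF
```

With only node1 labeled `disktype=ssd`, applying this file should put both replicas on node1. If you try it, remember to delete `gpu-workload-pinned` during cleanup as well.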

Step 3: Create a Deployment without a toleration (cannot be scheduled)

cat <<'EOF' > deploy-no-tolerate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal
  template:
    metadata:
      labels:
        app: normal
    spec:
      # Hard requirement for the ssd label; without it the Pod would simply run on untainted node2
      nodeSelector:
        disktype: ssd
      containers:
      - name: nginx
        image: nginx
EOF

kubectl apply -f deploy-no-tolerate.yaml
# Expected output: deployment.apps/normal-workload created

kubectl get pods -l app=normal
# Expected output: the Pod stays Pending (only node1 has disktype=ssd, and node1 is tainted)
# NAME                              READY   STATUS    RESTARTS   AGE
# normal-workload-<hash>-<pod-id>   0/1     Pending   0          <seconds>

Step 4: Analyze why scheduling failed

kubectl describe pod -l app=normal | grep -A 10 Events
# Expected output resembles the following (exact wording varies by Kubernetes version):
# Events:
#   Type     Reason            Age   From               Message
#   ----     ------            ----  ----               -------
#   Warning  FailedScheduling  10s   default-scheduler  0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {dedicated: gpu}, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }.

Step 5: Add Node Affinity (soft constraint)

# Delete the Deployment whose Pod is Pending, then recreate it with a soft preference
kubectl delete deployment normal-workload
# Expected output: deployment.apps "normal-workload" deleted

cat <<'EOF' > deploy-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal
  template:
    metadata:
      labels:
        app: normal
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: nginx
        image: nginx
EOF

kubectl apply -f deploy-affinity.yaml
# Expected output: deployment.apps/normal-workload created

kubectl get pods -l app=normal -o wide
# Expected output: the Pod runs on node2. node2 lacks the disktype=ssd label, but a soft constraint
# is only a preference, and the preferred node1 is ruled out by its taint.
# NAME                              READY   STATUS    RESTARTS   AGE   NODE
# normal-workload-<hash>-<pod-id>   1/1     Running   0          <s>   node2

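Verification

# List the gpu-workload Pods running on node1
kubectl get pods -l app=gpu-app -o wide | grep node1
# Expected output: the gpu-workload Pods placed on node1 (a replica on node2 will not appear here)

# Confirm that normal-workload runs on node2
kubectl get pods -l app=normal -o wide | grep node2
# Expected output: the normal-workload Pod, with node2 in the NODE column

# Clean up
kubectl delete deployment gpu-workload normal-workload
kubectl taint nodes node1 dedicated=gpu:NoSchedule-
kubectl label nodes node1 disktype-
# Expected result: the Deployments, the taint, and the label are removed

Exam notes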

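NoSchedule only affects new Pods, not Pods already running; NoExecute also evicts running Pods that do not tolerate it
A toleration with operator: Exists and a key matches that key with any value; operator: Exists without a key matches every taint
Node Affinity's requiredDuringScheduling is a hard constraint; if no node satisfies it, the Pod cannot be scheduled
If a Pod is Pending during the exam, first run kubectl describe pod and read the Events to find the cause
Control plane (master) nodes carry the node-role.kubernetes.io/control-plane:NoSchedule taint by default; removing it allows ordinary Pods to be scheduled there
A DaemonSet that must run on every node, including tainted ones, usually needs a toleration for all taints: - operator: Exists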
