Qingular

Scheduling Constraints

CKA · k8s · Practice

nodeSelector, Node/Pod Affinity, Taints & Tolerations, PriorityClass

Overview

Scheduling constraints control which nodes a Pod can be placed on. The CKA exam focuses on the hands-on configuration of Taints & Tolerations, Node Affinity, and nodeSelector.


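1. nodeSelector

The simplest way to select nodes, based on node labels. A Pod is only scheduled onto nodes that carry every label listed in its nodeSelector.

1.1 Labeling nodes

# List nodes with their labels
kubectl get nodes --show-labels

# Add labels
kubectl label nodes <node-name> disktype=ssd
kubectl label nodes <node-name> gpu=true

# Remove a label (append "-" to the key)
kubectl label nodes <node-name> disktype-

# Change an existing label (requires --overwrite)
kubectl label nodes <node-name> disktype=hdd --overwrite

1.2 Using nodeSelector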

apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: nginx
    image: nginx

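2. Node Affinity

A more flexible form of node selection than nodeSelector, with support for match expressions.

2.1 The two types

Type | Description
requiredDuringSchedulingIgnoredDuringExecution | Hard constraint: the Pod is scheduled only onto nodes that satisfy it (like nodeSelector, but with expressions)
preferredDuringSchedulingIgnoredDuringExecution | Soft constraint: the scheduler favors matching nodes but still schedules the Pod if none match

In both cases, IgnoredDuringExecution means a Pod that is already running stays where it is if the node's labels change later.

2.2 Example configuration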

apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
            - nvme
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1
      - weight: 20
        preference:
          matchExpressions:
          - key: gpu
            operator: Exists
  containers:
  - name: nginx
    image: nginx

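2.3 Match operators

Operator | Description | Example
In | The label value equals one of the listed values | disktype In [ssd, nvme]
NotIn | The label value equals none of the listed values (nodes without the label also match) | disktype NotIn [hdd]
Exists | The key is present (values must be omitted) | gpu Exists
DoesNotExist | The key is absent | gpu DoesNotExist
Gt | The label value, parsed as an integer, is greater than the single value given | memory Gt [32]
Lt | The label value, parsed as an integer, is less than the single value given | memory Lt [64]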
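
As a sketch of how these operators combine, the manifest below puts NotIn, DoesNotExist, and Gt into one required term. The file name operators-pod.yaml, the Pod name, and the node label keys maintenance and cpu-count are hypothetical. Expressions inside a single matchExpressions list are ANDed, while separate nodeSelectorTerms entries are ORed; Gt and Lt take exactly one value, written as a string.

```shell
# Write a Pod manifest that combines three operators in one term.
# Every expression in the same matchExpressions list must hold (AND).
cat <<'EOF' > operators-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: operators-pod          # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype      # value must not be hdd
            operator: NotIn
            values:
            - hdd
          - key: maintenance   # hypothetical label; the node must not carry it
            operator: DoesNotExist
          - key: cpu-count     # hypothetical label holding an integer
            operator: Gt
            values:
            - "8"              # exactly one value, quoted so it stays a string
  containers:
  - name: nginx
    image: nginx
EOF

# On a real cluster, apply it with: kubectl apply -f operators-pod.yaml
```

If the Pod stays Pending, check that at least one node carries a cpu-count label greater than 8; nodes without that label never satisfy Gt.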

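2.4 Creating a Pod with Node Affinity from a generated manifest

# kubectl run has no flag for affinity: generate the Pod YAML, then edit it
kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml to add spec.affinity.nodeAffinity, then run: kubectl apply -f pod.yaml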

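3. Pod Affinity / Anti-Affinity

Controls where a Pod is scheduled relative to other Pods: in the same topology domain (affinity) or in a different one (anti-affinity).

3.1 Example configuration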

apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity-pod
  labels:
    app: frontend
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - backend
        topologyKey: "kubernetes.io/hostname"   # same node as a Pod labeled app=backend
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - frontend
          topologyKey: "kubernetes.io/hostname"  # prefer a node with no other app=frontend Pod
  containers:
  - name: nginx
    image: nginx

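3.2 Topology keys (topologyKey)

Topology key | Scope
kubernetes.io/hostname | Single node
topology.kubernetes.io/zone | Availability zone
topology.kubernetes.io/region | Region
failure-domain.beta.kubernetes.io/zone | Availability zone (legacy label, deprecated in favor of topology.kubernetes.io/zone)

3.3 Common scenarios

  • Pod Affinity: place Web and Cache Pods on the same node to reduce network latency
  • Pod Anti-Affinity: spread Pods of the same application across different nodes for high availability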
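
The anti-affinity scenario above can be sketched as a Deployment whose replicas refuse to share a node. The Deployment name web, the label app: web, and the file name spread-deploy.yaml are hypothetical.

```shell
# Write a Deployment whose Pods repel each other at node level.
cat <<'EOF' > spread-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                    # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web       # repel Pods carrying this Deployment's own label
            topologyKey: kubernetes.io/hostname   # one matching Pod per node
      containers:
      - name: nginx
        image: nginx
EOF

# On a real cluster, apply it with: kubectl apply -f spread-deploy.yaml
```

Because the rule is required, any replica beyond the number of schedulable nodes stays Pending. Switching to preferredDuringSchedulingIgnoredDuringExecution trades strict spreading for guaranteed placement.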

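3.4 Caveats

  • Pod Affinity/Anti-Affinity adds work for the scheduler and can slow scheduling in large clusters
  • A requiredDuringSchedulingIgnoredDuringExecution rule leaves the Pod Pending when no node satisfies it
  • topologyKey is required and must not be empty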

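4. Taints & Tolerations

A taint marks a node so that it repels Pods; a toleration lets a Pod be scheduled onto a node despite a matching taint. A toleration permits placement on the tainted node but does not attract the Pod to it.

4.1 Working with taints

# Show the taints on a node
kubectl describe node <node-name> | grep Taints

# Add a taint: kubectl taint nodes <node> <key>=<value>:<effect>
kubectl taint nodes node1 app=blue:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key1=value1:PreferNoSchedule

# Remove a taint (append "-")
kubectl taint nodes node1 app=blue:NoSchedule-
kubectl taint nodes node1 key1:NoExecute-    # the value can be omitted

# Remove every taint with a given key, whatever its effect
kubectl taint nodes node1 app-               # removes all taints whose key is app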

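4.2 Taint effects

Effect | Description
NoSchedule | New Pods that do not tolerate the taint are not scheduled; Pods already running are not evicted
PreferNoSchedule | Soft constraint: the scheduler tries to avoid the node but may still use it
NoExecute | Pods that do not tolerate the taint are evicted if already running, and new ones are not scheduled

4.3 Configuring tolerations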

apiVersion: v1
kind: Pod
metadata:
  name: toleration-pod
spec:
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"
  - key: "key1"
    operator: "Exists"          # matches a key1 taint with any value
    effect: "NoExecute"
    tolerationSeconds: 60       # stay for 60 seconds after the taint is added, then be evicted
  - operator: "Exists"          # no key: tolerates every taint (use with care)
  containers:
  - name: nginx
    image: nginx
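
A common use of tolerationSeconds is to shorten failover when a node goes down. Kubernetes normally gives Pods a 300-second toleration for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints; the sketch below lowers that to 30 seconds. The Pod name, file name, and the 30-second value are illustrative assumptions.

```shell
# Write a Pod manifest that leaves a failed node after 30 seconds
# instead of the usual 300.
cat <<'EOF' > fast-failover-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-failover-pod      # hypothetical name
spec:
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30      # illustrative value
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30
  containers:
  - name: nginx
    image: nginx
EOF

# On a real cluster, apply it with: kubectl apply -f fast-failover-pod.yaml
```

An evicted bare Pod is not recreated, so in practice these tolerations belong in the Pod template of a Deployment or another controller.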

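4.4 Toleration operators

Operator | Description | Example
Equal | key, value, and effect must all match the taint | key=app, value=blue, effect=NoSchedule
Exists | The key must match; no value is given | key=key1, no value
Exists with no key | Tolerates every taint | Often used for DaemonSets

If effect is omitted, the toleration matches every effect for that key.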

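4.5 Common scenarios

# 1. Dedicated nodes (reserved for specific Pods)
kubectl taint nodes gpu-node dedicated=gpu:NoSchedule
# Only Pods with a matching toleration can be scheduled here

# 2. The default control-plane taint
kubectl describe node controlplane | grep Taints
# Output: Taints: node-role.kubernetes.io/control-plane:NoSchedule

# 3. Node failure handling
# The node controller adds this taint automatically when a node becomes unreachable;
# adding it by hand, as here, only simulates that condition
kubectl taint nodes node1 node.kubernetes.io/unreachable:NoExecute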
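
For the dedicated-node scenario, the taint only keeps other Pods away; it does not steer the intended Pods onto the node. A minimal sketch of the usual pairing follows: a toleration for the taint plus a nodeSelector for a matching label. The node label dedicated=gpu, the Pod name, and the file name are assumptions, and the label has to be added to the node separately.

```shell
# Write a Pod manifest that both tolerates the dedicated=gpu taint
# and selects the node through a label of the same name.
cat <<'EOF' > dedicated-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: dedicated-pod          # hypothetical name
spec:
  nodeSelector:
    dedicated: gpu             # assumes the node was labeled dedicated=gpu
  tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: nginx
    image: nginx
EOF

# On a real cluster:
#   kubectl label nodes gpu-node dedicated=gpu
#   kubectl apply -f dedicated-pod.yaml
```

Without the nodeSelector, the Pod could still land on any untainted node.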

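4.6 Exam tips

# Taint a node so that Pods without a matching toleration are not scheduled on it
kubectl taint nodes worker1 env=production:NoSchedule

# Create a Pod that tolerates the taint
kubectl run toleration-pod --image=nginx --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml to add spec.tolerations, then run: kubectl apply -f pod.yaml

# Check which node the Pod landed on (the toleration allows worker1 but does not force it)
kubectl get pods -o wide | grep toleration-pod

# Running ordinary Pods on the control-plane (master) node
# Option 1: remove the control-plane taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-

# Option 2: add a toleration to the Pod (recommended; the taint keeps protecting the control plane from other Pods)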
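
For the toleration approach, a minimal sketch might look like the following. It assumes a kubeadm-style cluster in which the control-plane node carries the label node-role.kubernetes.io/control-plane with an empty value; the Pod name and file name are hypothetical. The nodeSelector is there because a toleration alone only permits the control-plane node and does not select it.

```shell
# Write a Pod manifest that tolerates the control-plane taint and
# selects the control-plane node by its role label.
cat <<'EOF' > cp-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cp-pod                 # hypothetical name
spec:
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""   # assumed kubeadm label, empty value
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists           # the taint has no value, so match on the key only
    effect: NoSchedule
  containers:
  - name: nginx
    image: nginx
EOF

# On a real cluster, apply it with: kubectl apply -f cp-pod.yaml
```

Drop the nodeSelector if the Pod should merely be allowed on the control plane, not pinned to it.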

5. PriorityClass

A PriorityClass assigns a priority to Pods. When a higher-priority Pod cannot be scheduled, the scheduler may preempt (evict) lower-priority Pods to make room.

5.1 Creating a PriorityClass

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000              # higher value means higher priority
globalDefault: false        # whether Pods without a priorityClassName get this class
description: "High priority Pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority Pods"

# Create the PriorityClasses
kubectl apply -f priorityclass.yaml

# List them
kubectl get priorityclass
kubectl get pc

5.2 Using it in a Pod

apiVersion: v1
kind: Pod
metadata:
  name: high-priority-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 512Mi

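5.3 Common PriorityClass values

Name | Value | Description
system-cluster-critical | 2000000000 | Critical cluster components
system-node-critical | 2000001000 | Critical node components
Custom (high) | 1000000 | High-priority applications
Custom (low) | 100 | Low-priority batch jobs

User-defined classes can use values up to 1000000000; larger values are reserved for the system classes.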
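
Preemption is the default behavior of a PriorityClass, but it can be switched off per class. The sketch below defines a class whose Pods are queued ahead of lower-priority ones without evicting anything; the class name, value, and file name are illustrative.

```shell
# Write a PriorityClass that raises queue priority but never preempts.
cat <<'EOF' > nonpreempting-pc.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # illustrative name
value: 1000000
preemptionPolicy: Never               # default is PreemptLowerPriority
globalDefault: false
description: "Scheduled ahead of lower-priority Pods but never evicts them"
EOF

# On a real cluster, apply it with: kubectl apply -f nonpreempting-pc.yaml
```

Pods reference it through priorityClassName in the same way as any other class.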

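6. Pod scheduling flow

  1. Filtering (called Predicates in older versions):

    • Check that the node has enough free resources for the Pod's requests
    • Check nodeSelector and required Node Affinity
    • Check Taints & Tolerations
    • Check required Pod Affinity/Anti-Affinity
  2. Scoring (called Priorities in older versions):

    • Resource utilization (requested resources / allocatable resources)
    • Preferred Node Affinity weights
    • Pod spreading (preferred Anti-Affinity)
  3. Binding: the scheduler binds the Pod to the highest-scoring node

# View the scheduler logs (the Pod name ends with the control-plane node's name)
kubectl logs -n kube-system kube-scheduler-controlplane

# View scheduling events ("schedul" matches both Scheduled and FailedScheduling)
kubectl get events --sort-by='.lastTimestamp' | grep -i schedul

# See which node a Pod was scheduled to
kubectl get pods -o wide
kubectl get pod <pod-name> -o wide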

7. Handy exam commands

# 1. Show the labels on all nodes
kubectl get nodes --show-labels

# 2. Show node taints
kubectl describe nodes | grep -A 5 Taints

# 3. Allow Pods onto the control-plane (master) node by removing its taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-

# 4. Create a Pod with Node Affinity (generate the YAML, then edit it)
kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml to add spec.affinity.nodeAffinity

# 5. Add a toleration to a DaemonSet (a common task)
kubectl get ds fluentd -o yaml > ds.yaml
# Add tolerations under spec.template.spec in ds.yaml

# 6. Check whether a Pod is unschedulable because of a taint
kubectl describe pod <pod-name> | grep -A 10 Events
# Output resembles (wording varies by version): 0/3 nodes are available: 1 node(s) had untolerated taint {<key>: <value>}, 2 node(s) didn't match pod anti-affinity rules

# 7. Quickly create a PriorityClass
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important
value: 100000
globalDefault: false
EOF
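
For the DaemonSet toleration task, the tolerations belong in the Pod template, not at the top level of the DaemonSet spec. A minimal sketch, assuming a hypothetical DaemonSet named node-agent with nginx as a stand-in image:

```shell
# Write a DaemonSet manifest whose Pods may also run on control-plane nodes.
cat <<'EOF' > node-agent-ds.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent             # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      tolerations:             # lives under spec.template.spec
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      # To tolerate every taint instead, replace the entry above with:
      # - operator: Exists
      containers:
      - name: agent
        image: nginx           # stand-in image
EOF

# On a real cluster, apply it with: kubectl apply -f node-agent-ds.yaml
```

Tolerating only the taints the agent really needs is safer than the blanket operator: Exists form, which also lets the Pods run on nodes tainted for maintenance or failure.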

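🧪 Complete walkthrough: controlling Pod scheduling with Taints/Tolerations and Node Affinity

Scenario

Taint a node to control which Pods may be scheduled on it, then add nodeAffinity for finer-grained placement.

Prerequisites

  • A multi-node Kubernetes cluster (at least 2 worker nodes)
  • kubectl configured to reach the cluster
  • Worker nodes named node1 and node2 (if yours differ, substitute the names in the commands)

Steps

Step 1: Label and taint a node

# List the nodes
kubectl get nodes
# Expected output:
# NAME           STATUS   ROLES           AGE   VERSION
# controlplane   Ready    control-plane   10m   v1.29
# node1          Ready    <none>          9m    v1.29
# node2          Ready    <none>          9m    v1.29

# Label node1
kubectl label nodes node1 disktype=ssd
# Expected output: node/node1 labeled

# Taint node1 (only Pods that tolerate the taint can be scheduled on it)
kubectl taint nodes node1 dedicated=gpu:NoSchedule
# Expected output: node/node1 tainted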
Step 2: Create a Deployment with a toleration (allowed onto node1)

cat <<'EOF' > deploy-tolerate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - name: nginx
        image: nginx
EOF

kubectl apply -f deploy-tolerate.yaml
# Expected output: deployment.apps/gpu-workload created

kubectl get pods -l app=gpu-app -o wide
# Expected output: both Pods are Running. The toleration allows node1 but does not
# force it, so each replica may land on node1 or node2, for example:
# NAME                            READY   STATUS    RESTARTS   AGE   NODE
# gpu-workload-<hash>-<pod-id>    1/1     Running   0          <s>   node1
# gpu-workload-<hash>-<pod-id>    1/1     Running   0          <s>   node2

Step 3: Create a Deployment without a toleration (kept off node1)

cat <<'EOF' > deploy-no-tolerate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal
  template:
    metadata:
      labels:
        app: normal
    spec:
      containers:
      - name: nginx
        image: nginx
EOF

kubectl apply -f deploy-no-tolerate.yaml
# Expected output: deployment.apps/normal-workload created

kubectl get pods -l app=normal -o wide
# Expected output: the Pod runs on node2, the only worker it is allowed on.
# It stays Pending only if no untainted node can accept it (for example, if node2 is also tainted or cordoned).
# NAME                              READY   STATUS    RESTARTS   AGE   NODE
# normal-workload-<hash>-<pod-id>   1/1     Running   0          <s>   node2

Step 4: Analyze a scheduling failure

# If the Pod is Pending, its events name the reason
kubectl describe pod -l app=normal | grep -A 10 Events
# With every node tainted, the output resembles (wording varies by version):
# Events:
#   Type     Reason            Age   From               Message
#   ----     ------            ----  ----               -------
#   Warning  FailedScheduling  10s   default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) had untolerated taint {dedicated: gpu}.

Step 5: Add Node Affinity (soft constraint)

# Delete the previous Deployment, then recreate it with a node affinity rule
kubectl delete deployment normal-workload
# Expected output: deployment.apps "normal-workload" deleted

cat <<'EOF' > deploy-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal
  template:
    metadata:
      labels:
        app: normal
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: nginx
        image: nginx
EOF

kubectl apply -f deploy-affinity.yaml
# Expected output: deployment.apps/normal-workload created

kubectl get pods -l app=normal -o wide
# Expected output: the Pod runs on node2. The preferred node (node1, disktype=ssd) is
# ruled out by its taint, and a soft constraint does not block scheduling elsewhere.
# NAME                              READY   STATUS    RESTARTS   AGE   NODE
# normal-workload-<hash>-<pod-id>   1/1     Running   0          <s>   node2

Verification

# Show where the gpu-workload Pods run (node1 and/or node2, since the toleration does not pin them)
kubectl get pods -l app=gpu-app -o wide

# Confirm normal-workload runs on node2
kubectl get pods -l app=normal -o wide | grep node2
# Expected output: the normal-workload Pod, with node2 in the NODE column

# Clean up
kubectl delete deployment gpu-workload normal-workload
kubectl taint nodes node1 dedicated=gpu:NoSchedule-
kubectl label nodes node1 disktype-
# All resources created in this walkthrough are now removed

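Exam notes

  • NoSchedule only affects new Pods, not Pods already running; NoExecute also evicts running Pods that do not tolerate the taint
  • A toleration with a key and operator: Exists matches any value for that key; operator: Exists with no key matches every taint
  • requiredDuringSchedulingIgnoredDuringExecution in Node Affinity is a hard constraint: the Pod stays Pending if no node satisfies it
  • If a Pod is Pending during the exam, first run kubectl describe pod and read the Events section to find the cause
  • In kubeadm-built clusters the control-plane (master) node has the node-role.kubernetes.io/control-plane:NoSchedule taint by default; removing it lets ordinary Pods be scheduled there
  • A DaemonSet that must run on every node, tainted ones included, usually needs a toleration for all taints: - operator: Exists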
