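Scheduling Constraints
nodeSelector, Node/Pod Affinity, Taints & Tolerations, PriorityClass
Overview
Scheduling constraints control which nodes a Pod may be placed on. The CKA exam focuses on hands-on configuration of Taints & Tolerations, Node Affinity, and nodeSelector.
1. nodeSelector
The simplest way to select nodes: the Pod is scheduled only onto nodes that carry every label listed in nodeSelector.
1.1 Labeling nodes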
kubectl get nodes --show-labels
# Add labels
kubectl label nodes <node-name> disktype=ssd
kubectl label nodes <node-name> gpu=true
# Remove a label (trailing "-")
kubectl label nodes <node-name> disktype-
# Change an existing label (requires --overwrite)
kubectl label nodes <node-name> disktype=hdd --overwrite
1.2 Using nodeSelector
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: nginx
    image: nginx
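The matching rule behind nodeSelector can be sketched in a few lines of Python. This is only an illustrative model, not scheduler code; the node names and labels are made up for the example.

```python
def node_selector_matches(node_labels, node_selector):
    """Return True when the node carries every key=value pair in nodeSelector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical nodes and labels, for illustration only.
nodes = {
    "node1": {"disktype": "ssd", "gpu": "true"},
    "node2": {"disktype": "hdd"},
    "node3": {},
}

selector = {"disktype": "ssd"}
eligible = [name for name, labels in nodes.items() if node_selector_matches(labels, selector)]
print(eligible)  # ['node1']
```

Because every pair must match, adding a second entry to nodeSelector can only shrink the set of eligible nodes.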
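2. Node Affinity
A more expressive form of node selection than nodeSelector: it supports match expressions and soft preferences.
2.1 Two types
| Type | Description |
|---|---|
| requiredDuringSchedulingIgnoredDuringExecution | Hard constraint: must be satisfied for the Pod to be scheduled (like nodeSelector, but with expressions) |
| preferredDuringSchedulingIgnoredDuringExecution | Soft constraint: the scheduler tries to satisfy it, but schedules the Pod anyway if it cannot |
IgnoredDuringExecution means that label changes on a node after scheduling do not evict Pods already running there.
2.2 Example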
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
            - nvme
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1
      - weight: 20
        preference:
          matchExpressions:
          - key: gpu
            operator: Exists
  containers:
  - name: nginx
    image: nginx
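To make the interplay of the two rule types concrete, here is a minimal Python sketch of how the manifest above could be evaluated: the required terms filter nodes (terms are alternatives, expressions inside a term must all match), and the preferred terms add their weight for each node that satisfies them. The node names and labels are hypothetical, only the In and Exists operators are modelled, and the real scheduler normalizes this sum and combines it with other scores.

```python
def expr_matches(labels, expr):
    """Evaluate one matchExpression; only In and Exists are modelled here."""
    key, op = expr["key"], expr["operator"]
    if op == "In":
        return labels.get(key) in expr["values"]
    if op == "Exists":
        return key in labels
    raise ValueError(f"operator {op!r} is not modelled in this sketch")


def term_matches(labels, term):
    """All expressions inside one term must match (AND)."""
    return all(expr_matches(labels, e) for e in term["matchExpressions"])


def required_ok(labels, terms):
    """nodeSelectorTerms are alternatives: one matching term is enough (OR)."""
    return any(term_matches(labels, t) for t in terms)


def preferred_score(labels, preferences):
    """Sum the weights of every preference the node satisfies."""
    return sum(p["weight"] for p in preferences if term_matches(labels, p["preference"]))


required = [{"matchExpressions": [{"key": "disktype", "operator": "In", "values": ["ssd", "nvme"]}]}]
preferred = [
    {"weight": 80, "preference": {"matchExpressions": [{"key": "zone", "operator": "In", "values": ["us-east-1"]}]}},
    {"weight": 20, "preference": {"matchExpressions": [{"key": "gpu", "operator": "Exists"}]}},
]

# Hypothetical nodes, for illustration only.
nodes = {
    "a": {"disktype": "ssd", "zone": "us-east-1"},
    "b": {"disktype": "nvme", "gpu": "true"},
    "c": {"disktype": "hdd", "zone": "us-east-1", "gpu": "true"},
}

scores = {name: preferred_score(labels, preferred)
          for name, labels in nodes.items() if required_ok(labels, required)}
print(scores)  # {'a': 80, 'b': 20}
```

Node c would satisfy both preferences, but it never gets scored because it fails the hard requirement on disktype.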
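2.3 Match operators
| Operator | Description | Example |
|---|---|---|
| In | The label value equals any one of values | disktype In [ssd, nvme] |
| NotIn | The label value equals none of values (also matches when the key is absent) | disktype NotIn [hdd] |
| Exists | The key exists (values must be omitted) | gpu Exists |
| DoesNotExist | The key does not exist | gpu DoesNotExist |
| Gt | The label value, parsed as an integer, is greater than the single value given | memory Gt [32] |
| Lt | The label value, parsed as an integer, is less than the single value given | memory Lt [64] |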
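The operator semantics in the table can be written down as a small Python function. This is a sketch for study purposes, not the Kubernetes implementation; in particular, the API server rejects a Gt/Lt rule that does not carry exactly one integer value, whereas the sketch simply treats it as "no match".

```python
def matches(labels, key, operator, values=None):
    """Evaluate one node-affinity match expression against a node's labels."""
    values = values or []
    present = key in labels
    if operator == "In":
        return present and labels[key] in values
    if operator == "NotIn":
        # A node without the key also satisfies NotIn.
        return not present or labels[key] not in values
    if operator == "Exists":
        return present
    if operator == "DoesNotExist":
        return not present
    if operator in ("Gt", "Lt"):
        if not present or len(values) != 1:
            return False
        try:
            label_value, bound = int(labels[key]), int(values[0])
        except ValueError:
            return False  # non-numeric label or bound never matches
        return label_value > bound if operator == "Gt" else label_value < bound
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical node labels, for illustration only.
node = {"disktype": "ssd", "memory": "64"}
print(matches(node, "disktype", "In", ["ssd", "nvme"]))  # True
print(matches(node, "disktype", "NotIn", ["hdd"]))       # True
print(matches(node, "gpu", "NotIn", ["true"]))           # True (key absent)
print(matches(node, "gpu", "DoesNotExist"))              # True
print(matches(node, "memory", "Gt", ["32"]))             # True
print(matches(node, "memory", "Lt", ["64"]))             # False (64 is not less than 64)
```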
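2.4 Creating Node Affinity imperatively
# kubectl run has no affinity flag: generate the Pod YAML, then add the affinity section by hand
kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml and add spec.affinity.nodeAffinity
3. Pod Affinity / Anti-Affinity
Controls where a Pod is placed relative to other Pods: in the same topology domain (affinity) or in a different one (anti-affinity).
3.1 Example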
apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity-pod
  labels:
    app: frontend
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - backend
        topologyKey: "kubernetes.io/hostname"   # must share a node with an app=backend Pod
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - frontend
          topologyKey: "kubernetes.io/hostname"   # prefer a node with no other app=frontend Pod
  containers:
  - name: nginx
    image: nginx
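3.2 Topology keys (topologyKey)
| Key | Scope |
|---|---|
| kubernetes.io/hostname | Individual node |
| topology.kubernetes.io/zone | Availability zone |
| topology.kubernetes.io/region | Region |
| failure-domain.beta.kubernetes.io/zone | Availability zone (legacy label, deprecated in favor of topology.kubernetes.io/zone) |
3.3 Common scenarios
- Pod Affinity: co-locate Web and Cache Pods on the same node to reduce network latency
- Pod Anti-Affinity: spread replicas of the same application across nodes for high availability
3.4 Caveats
- Pod Affinity/Anti-Affinity adds work for the scheduler and can slow scheduling in large clusters
- requiredDuringScheduling rules can leave a Pod unschedulable (Pending) when no node satisfies them
- topologyKey must not be empty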
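How topologyKey changes the meaning of a rule is easier to see with a small model. The Python sketch below treats a "topology domain" as the set of nodes sharing one value of the topologyKey label and checks a required affinity or anti-affinity term against it. It is a simplification under stated assumptions: the selector is reduced to equality pairs, namespaces are ignored, and the node names, zones, and Pods are invented.

```python
def domains_with_matching_pods(nodes, pods, selector, topology_key):
    """Topology-domain values that already host a Pod matching the label selector."""
    domains = set()
    for pod in pods:
        if all(pod["labels"].get(k) == v for k, v in selector.items()):
            node_labels = nodes[pod["node"]]
            if topology_key in node_labels:
                domains.add(node_labels[topology_key])
    return domains


def feasible_nodes(nodes, pods, selector, topology_key, anti=False):
    """Nodes allowed by a required podAffinity (anti=False) or podAntiAffinity (anti=True) term."""
    occupied = domains_with_matching_pods(nodes, pods, selector, topology_key)
    result = []
    for name, labels in nodes.items():
        domain = labels.get(topology_key)
        in_occupied = domain is not None and domain in occupied
        if in_occupied != anti:
            result.append(name)
    return result


# Hypothetical cluster: two nodes in zone-a, one in zone-b, one backend Pod on node1.
nodes = {
    "node1": {"kubernetes.io/hostname": "node1", "topology.kubernetes.io/zone": "zone-a"},
    "node2": {"kubernetes.io/hostname": "node2", "topology.kubernetes.io/zone": "zone-a"},
    "node3": {"kubernetes.io/hostname": "node3", "topology.kubernetes.io/zone": "zone-b"},
}
pods = [{"labels": {"app": "backend"}, "node": "node1"}]
backend = {"app": "backend"}

print(feasible_nodes(nodes, pods, backend, "kubernetes.io/hostname"))                 # ['node1']
print(feasible_nodes(nodes, pods, backend, "topology.kubernetes.io/zone"))            # ['node1', 'node2']
print(feasible_nodes(nodes, pods, backend, "topology.kubernetes.io/zone", anti=True)) # ['node3']
```

The same selector yields one node with the hostname key and two with the zone key, which is why a zone-level anti-affinity is much stricter than a node-level one.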
4. Taints & Tolerations
A Taint on a node repels Pods; a Toleration on a Pod lets it be scheduled onto a node despite a matching Taint. A Toleration permits placement but does not attract the Pod to that node.
4.1 Working with Taints
# Show a node's Taints
kubectl describe node <node-name> | grep Taints
# Add a Taint: kubectl taint nodes <node> <key>=<value>:<effect>
kubectl taint nodes node1 app=blue:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key1=value1:PreferNoSchedule
# Remove a Taint (trailing "-")
kubectl taint nodes node1 app=blue:NoSchedule-
kubectl taint nodes node1 key1:NoExecute-   # the value may be omitted
# Remove every Taint with a given key, whatever its effect
kubectl taint nodes node1 app-   # removes all Taints whose key is app
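4.2 Taint effects
| Effect | Description |
|---|---|
| NoSchedule | New Pods without a matching Toleration are not scheduled; Pods already running are not evicted |
| PreferNoSchedule | Soft version: the scheduler tries to avoid the node but may still use it |
| NoExecute | Pods without a matching Toleration are evicted if already running, and new ones are not scheduled |
4.3 Configuring Tolerations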
apiVersion: v1
kind: Pod
metadata:
  name: toleration-pod
spec:
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"
  - key: "key1"
    operator: "Exists"        # matches any key1 Taint with this effect, whatever its value
    effect: "NoExecute"
    tolerationSeconds: 60     # stay for 60 s after the Taint appears, then be evicted
  - operator: "Exists"        # no key: tolerates every Taint (use with care)
  containers:
  - name: nginx
    image: nginx
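4.4 Toleration operators
| Operator | Description | Example |
|---|---|---|
| Equal | key, value, and effect must all match (this is the default operator) | key=app, value=blue, effect=NoSchedule |
| Exists | Matches on key alone; value must be omitted | key=key1, no value |
| Exists with no key | Tolerates every Taint | Common in DaemonSets |
With either operator, omitting effect matches every effect for that key.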
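The matching rules in the table can be expressed as a short Python function, together with a helper that decides whether a node's Taints block scheduling. This is a sketch of the documented rules, not the scheduler's source code; the dictionaries mirror the manifest fields.

```python
def tolerates(toleration, taint):
    """Return True if this Toleration matches this Taint."""
    # An empty or missing effect in the Toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty or missing key (only valid with Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def blocks_scheduling(taints, tolerations):
    """True if any untolerated Taint has effect NoSchedule or NoExecute."""
    for taint in taints:
        if taint["effect"] == "PreferNoSchedule":
            continue  # soft: influences scoring only
        if not any(tolerates(t, taint) for t in tolerations):
            return True
    return False


taint = {"key": "app", "value": "blue", "effect": "NoSchedule"}
print(tolerates({"key": "app", "operator": "Equal", "value": "blue", "effect": "NoSchedule"}, taint))   # True
print(tolerates({"key": "app", "operator": "Equal", "value": "green", "effect": "NoSchedule"}, taint))  # False
print(tolerates({"key": "app", "operator": "Exists"}, taint))  # True
print(tolerates({"operator": "Exists"}, taint))                # True
print(blocks_scheduling([taint], []))                          # True
```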
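4.5 Common scenarios
# 1. Dedicated nodes (run only specific Pods)
kubectl taint nodes gpu-node dedicated=gpu:NoSchedule
# Only Pods with a matching Toleration can now be scheduled there
# 2. Default Taint on control-plane nodes
kubectl describe node controlplane | grep Taints
# Output includes: node-role.kubernetes.io/control-plane:NoSchedule
# 3. Node failure handling
# Kubernetes normally adds this Taint itself when a node becomes unreachable; setting it by hand simulates that
kubectl taint nodes node1 node.kubernetes.io/unreachable:NoExecute
4.6 Exam tips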
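# Taint a node so that Pods without a matching Toleration are not scheduled onto it
kubectl taint nodes worker1 env=production:NoSchedule
# Generate a Pod manifest to which the Toleration will be added
kubectl run toleration-pod --image=nginx --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml and add spec.tolerations, then run: kubectl apply -f pod.yaml
# Check where the Pod landed (a Toleration allows worker1 but does not force it; add nodeSelector or nodeName to pin the Pod)
kubectl get pods -o wide | grep toleration-pod
# Running ordinary Pods on the control-plane node
# Option 1: remove the control-plane Taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-
# Option 2: add a Toleration to the Pod (preferred: other workloads stay off the control plane)
5. PriorityClass
A PriorityClass assigns a priority to Pods. When resources are short, the scheduler can preempt (evict) lower-priority Pods to make room for a pending higher-priority Pod.
5.1 Creating a PriorityClass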
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000            # higher value = higher priority
globalDefault: false      # whether Pods without priorityClassName get this class
description: "High priority Pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority Pods"
# Create the PriorityClasses
kubectl apply -f priorityclass.yaml
# List them
kubectl get priorityclass
kubectl get pc
5.2 Using it in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
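5.3 Common PriorityClass values
| Name | Value | Description |
|---|---|---|
| system-cluster-critical | 2000000000 | Critical cluster components |
| system-node-critical | 2000001000 | Critical node components |
| Custom (high) | 1000000 | High-priority applications |
| Custom (low) | 100 | Low-priority batch jobs |
User-defined classes may use values up to 1000000000; larger values are reserved for the built-in system classes.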
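To illustrate what "preempting lower-priority Pods" means, here is a deliberately simplified Python sketch for a single node and a single resource (CPU in millicores). It only shows the core idea that strictly lower-priority Pods may be evicted until the incoming Pod fits. The real scheduler weighs more factors when choosing a node and its victims (for example PodDisruptionBudgets), so treat the function name and the selection order as assumptions of this example.

```python
def pick_victims(capacity_m, running, incoming):
    """Return names of lower-priority Pods to evict so `incoming` fits, or None.

    capacity_m: node CPU capacity in millicores
    running:    list of (name, priority, cpu_request_m) for Pods on the node
    incoming:   (name, priority, cpu_request_m) for the pending Pod
    """
    _, in_priority, in_cpu = incoming
    free = capacity_m - sum(cpu for _, _, cpu in running)
    if free >= in_cpu:
        return []  # fits without preemption
    victims = []
    # Consider only strictly lower-priority Pods, lowest priority first.
    for name, priority, cpu in sorted(running, key=lambda p: p[1]):
        if priority >= in_priority:
            break
        victims.append(name)
        free += cpu
        if free >= in_cpu:
            return victims
    return None  # even evicting every lower-priority Pod is not enough


# Priorities reuse the values from the table: 100 (low) and 1000000 (high).
running = [("batch", 100, 600), ("web", 1000000, 300)]
print(pick_victims(1000, running, ("critical", 1000000, 500)))  # ['batch']
print(pick_victims(1000, running, ("batch2", 100, 500)))        # None
```

A Pod never preempts Pods of equal or higher priority, which is why the second call finds no victims.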
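6. Pod scheduling flow
- Filtering (formerly "Predicates"): discard nodes that cannot run the Pod
  - The node has enough free resources for the Pod's requests
  - nodeSelector and required Node Affinity match
  - Taints are tolerated
  - Required Pod Affinity/Anti-Affinity is satisfied
- Scoring (formerly "Priorities"): rank the remaining nodes
  - Resource utilization (requested / total)
  - Preferred Node Affinity weights
  - Pod spreading (preferred Anti-Affinity)
- Binding: the scheduler binds the Pod to the selected node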
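The three phases above can be sketched as a toy scheduler in Python. Everything here is an assumption made for illustration: the node fields, the precomputed untolerated_taint flag, and the scoring formula (free-CPU share plus preference weights) are stand-ins, not the real plugins or their weights.

```python
def schedule(pod, nodes):
    """Return the chosen node name, or None if the Pod would stay Pending."""
    # 1. Filtering: drop nodes that cannot run the Pod at all.
    feasible = [
        n for n in nodes
        if n["free_cpu_m"] >= pod["cpu_request_m"]
        and all(n["labels"].get(k) == v for k, v in pod.get("nodeSelector", {}).items())
        and not n.get("untolerated_taint", False)
    ]
    if not feasible:
        return None

    # 2. Scoring: rank the survivors (toy formula).
    def score(n):
        remaining = n["free_cpu_m"] - pod["cpu_request_m"]
        spread = 100 * remaining // n["capacity_cpu_m"]  # favour emptier nodes
        preference = sum(w for (k, v, w) in pod.get("preferred", []) if n["labels"].get(k) == v)
        return spread + preference

    best = max(feasible, key=score)
    # 3. Binding: the real scheduler now writes the node name into the Pod.
    return best["name"]


# Hypothetical nodes, for illustration only.
nodes = [
    {"name": "node1", "capacity_cpu_m": 4000, "free_cpu_m": 3000, "labels": {"disktype": "ssd"}, "untolerated_taint": True},
    {"name": "node2", "capacity_cpu_m": 4000, "free_cpu_m": 1000, "labels": {"disktype": "ssd"}},
    {"name": "node3", "capacity_cpu_m": 4000, "free_cpu_m": 3500, "labels": {"disktype": "hdd"}},
]

print(schedule({"cpu_request_m": 500}, nodes))                                       # node3
print(schedule({"cpu_request_m": 500, "preferred": [("disktype", "ssd", 80)]}, nodes))  # node2
print(schedule({"cpu_request_m": 500, "nodeSelector": {"disktype": "nvme"}}, nodes))   # None
```

The second call shows a soft preference outweighing the resource score, while the third shows a hard constraint leaving the Pod unschedulable.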
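# View scheduler logs (the Pod is named kube-scheduler-<control-plane-node-name>)
kubectl logs -n kube-system kube-scheduler-controlplane
# View scheduling events
kubectl get events --sort-by='.lastTimestamp' | grep -i schedule
# See which node each Pod was scheduled to
kubectl get pods -o wide
kubectl get pod <pod-name> -o wide
7. Handy exam commands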
# 1. Show all node labels
kubectl get nodes --show-labels
# 2. Show node Taints
kubectl describe nodes | grep -A 5 Taints
# 3. Allow Pods onto the control-plane node by removing its Taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-
# 4. Create a Pod with Node Affinity (generate the manifest, then edit it)
kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml > pod.yaml
# Edit pod.yaml and add spec.affinity.nodeAffinity
# 5. Add a Toleration to a DaemonSet (common task)
kubectl get ds fluentd -o yaml > ds.yaml
# Add tolerations under spec.template.spec in ds.yaml
# 6. Check whether a Pod is unschedulable because of a Taint
kubectl describe pod <pod-name> | grep -A 10 Events
# Sample output (exact wording varies by version): 0/3 nodes are available: 1 node(s) had taint, 2 node(s) didn't match pod anti-affinity
# 7. Quickly create a PriorityClass
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important
value: 100000
globalDefault: false
EOF
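🧪 Complete walkthrough: controlling Pod scheduling with Taints/Tolerations and Node Affinity
Scenario
Taint a node so that only specific Pods can use it, and combine this with nodeAffinity for finer-grained placement.
Prerequisites
- A multi-node Kubernetes cluster (at least 2 worker nodes)
- kubectl configured to talk to the cluster
- Worker nodes named node1 and node2 (if yours differ, substitute the names in the commands)
Steps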
Step 1: Label and taint a node
# List nodes
kubectl get nodes
# Expected output:
# NAME           STATUS   ROLES           AGE   VERSION
# controlplane   Ready    control-plane   10m   v1.29
# node1          Ready    <none>          9m    v1.29
# node2          Ready    <none>          9m    v1.29
# Label node1
kubectl label nodes node1 disktype=ssd
# Expected output: node/node1 labeled
# Taint node1 (only Pods that tolerate the Taint can be scheduled there)
kubectl taint nodes node1 dedicated=gpu:NoSchedule
# Expected output: node/node1 tainted
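Before creating any workloads, it helps to reason about which nodes are even eligible after Step 1. The Python sketch below models the three nodes with the label and Taint just applied, plus the default control-plane Taint, and lists the feasible nodes for a few Pod configurations. It is a study aid under simplifying assumptions (only NoSchedule Taints and equality-based selectors), not output from a real cluster.

```python
# Cluster state after Step 1; each Taint is (key, value, effect).
NODES = {
    "controlplane": {"labels": {}, "taints": [("node-role.kubernetes.io/control-plane", "", "NoSchedule")]},
    "node1": {"labels": {"disktype": "ssd"}, "taints": [("dedicated", "gpu", "NoSchedule")]},
    "node2": {"labels": {}, "taints": []},
}


def tolerates(toleration, taint):
    """Equal/Exists matching; a missing key or effect in the Toleration matches anything."""
    key, value, effect = taint
    if toleration.get("effect") not in (None, effect):
        return False
    if toleration.get("key") not in (None, key):
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == value


def feasible(tolerations=(), node_selector=None):
    """Names of nodes whose Taints are all tolerated and whose labels satisfy nodeSelector."""
    names = []
    for name, node in NODES.items():
        if any(not any(tolerates(t, taint) for t in tolerations) for taint in node["taints"]):
            continue  # at least one Taint is not tolerated
        if any(node["labels"].get(k) != v for k, v in (node_selector or {}).items()):
            continue  # a required label is missing
        names.append(name)
    return names


gpu_toleration = {"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}
print(feasible())                                       # ['node2']
print(feasible([gpu_toleration]))                       # ['node1', 'node2']
print(feasible([gpu_toleration], {"disktype": "ssd"}))  # ['node1']
print(feasible(node_selector={"disktype": "ssd"}))      # []
```

The second and third lines show the key point for the exam: a Toleration alone makes node1 possible, and only an additional node selector or required affinity makes it certain.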
Step 2: Create a Deployment with a Toleration (pinned to node1)
cat <<'EOF' > deploy-tolerate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      # The Toleration only permits node1; without the nodeSelector the Pods could also land on node2
      nodeSelector:
        disktype: ssd
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - name: nginx
        image: nginx
EOF
kubectl apply -f deploy-tolerate.yaml
# Expected output: deployment.apps/gpu-workload created
kubectl get pods -l app=gpu-app -o wide
# Expected output: both Pods run on node1 (tolerated Taint plus matching label)
# NAME                            READY   STATUS    RESTARTS   AGE   NODE
# gpu-workload-<hash>-<pod-id>    1/1     Running   0          <s>   node1
# gpu-workload-<hash>-<pod-id>    1/1     Running   0          <s>   node1
Step 3: Create a Deployment without a Toleration (stays Pending)
cat <<'EOF' > deploy-no-tolerate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal
  template:
    metadata:
      labels:
        app: normal
    spec:
      # Hard requirement for the ssd node, but no Toleration for its Taint.
      # (Without this nodeSelector the Pod would simply run on the untainted node2.)
      nodeSelector:
        disktype: ssd
      containers:
      - name: nginx
        image: nginx
EOF
kubectl apply -f deploy-no-tolerate.yaml
# Expected output: deployment.apps/normal-workload created
kubectl get pods -l app=normal
# Expected output: the Pod stays Pending
# NAME                               READY   STATUS    RESTARTS   AGE
# normal-workload-<hash>-<pod-id>    0/1     Pending   0          <seconds>
Step 4: Find out why scheduling failed
kubectl describe pod -l app=normal | grep -A 10 Events
# Expected output (exact wording varies by Kubernetes version), roughly:
# Events:
#   Type     Reason            Age   From               Message
#   ----     ------            ----  ----               -------
#   Warning  FailedScheduling  10s   default-scheduler  0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {dedicated: gpu}, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }.
Step 5: Add Node Affinity (soft constraint)
# Delete the Deployment whose Pod is Pending, then recreate it with a soft preference
kubectl delete deployment normal-workload
# Expected output: deployment.apps "normal-workload" deleted
cat <<'EOF' > deploy-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal
  template:
    metadata:
      labels:
        app: normal
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: nginx
        image: nginx
EOF
kubectl apply -f deploy-affinity.yaml
# Expected output: deployment.apps/normal-workload created
kubectl get pods -l app=normal -o wide
# Expected output: the Pod runs on node2. The preferred ssd node (node1) is tainted, and a soft constraint lets the scheduler fall back to a node without the label.
# NAME                               READY   STATUS    RESTARTS   AGE   NODE
# normal-workload-<hash>-<pod-id>    1/1     Running   0          <s>   node2
Verification
# Confirm gpu-workload runs on node1
kubectl get pods -l app=gpu-app -o wide | grep node1
# Expected output: the gpu-workload Pods, each showing node1
# Confirm normal-workload runs on node2
kubectl get pods -l app=normal -o wide | grep node2
# Expected output: the normal-workload Pod, showing node2
# Clean up
kubectl delete deployment gpu-workload normal-workload
kubectl taint nodes node1 dedicated=gpu:NoSchedule-
kubectl label nodes node1 disktype-
# Afterwards the Deployments, the Taint, and the label are gone
Exam notes
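- NoSchedule affects only new Pods, not Pods already running; NoExecute also evicts running Pods
- A Toleration with operator: Exists and a key matches every value of that key; operator: Exists without a key matches every Taint
- Node Affinity's requiredDuringScheduling is a hard constraint: if no node satisfies it, the Pod cannot be scheduled
- If a Pod is Pending in the exam, first run kubectl describe pod and read the Events to find the cause
- Control-plane nodes carry the node-role.kubernetes.io/control-plane:NoSchedule Taint by default; removing it lets Pods be scheduled there
- A DaemonSet usually needs a Toleration for all Taints, that is, a tolerations entry of the form - operator: Exists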