Scheduling Constraints
nodeSelector, Node/Pod Affinity, Taints & Tolerations, PriorityClass
Overview
Scheduling constraints control which nodes Pods are assigned to. The CKA exam focuses on practical configuration of Taints & Tolerations, Node Affinity, and nodeSelector.
1. nodeSelector
The simplest node selection method, based on node labels.
1.1 Labeling Nodes
kubectl get nodes --show-labels
# Add label
kubectl label nodes <node-name> disktype=ssd
kubectl label nodes <node-name> gpu=true
# Remove label
kubectl label nodes <node-name> disktype-
# Modify label (--overwrite)
kubectl label nodes <node-name> disktype=hdd --overwrite
1.2 Using nodeSelector
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: nginx
    image: nginx
2. Node Affinity
More flexible than nodeSelector: it supports match expressions with operators, and soft preferences in addition to hard requirements.
2.1 Two Types
| Type | Description |
|---|---|
| requiredDuringSchedulingIgnoredDuringExecution | Hard constraint: must be met for scheduling (similar to nodeSelector but supports expressions) |
| preferredDuringSchedulingIgnoredDuringExecution | Soft constraint: best-effort, schedules even if not met |
2.2 Configuration Example
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
            - nvme
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1
      - weight: 20
        preference:
          matchExpressions:
          - key: gpu
            operator: Exists
  containers:
  - name: nginx
    image: nginx
2.3 Match Operators
| Operator | Description | Example |
|---|---|---|
| In | Matches any value in values | disktype In [ssd, nvme] |
| NotIn | Does not match any value in values | disktype NotIn [hdd] |
| Exists | Key exists (values ignored) | gpu Exists |
| DoesNotExist | Key does not exist | gpu DoesNotExist |
| Gt | Value greater than (numeric comparison) | memory Gt [32] |
| Lt | Value less than (numeric comparison) | memory Lt [64] |
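The operator semantics in the table can be sketched as a small shell function. This is a toy model for building intuition, not the scheduler's implementation; the `NODE_LABELS` sample data and the helper names (`label_value`, `is_int`, `match_expr`) are hypothetical. It assumes that `NotIn` also matches when the key is absent and that `Gt`/`Lt` compare a single integer value.

```shell
#!/bin/sh
# Toy evaluator for one matchExpression against one node's labels.
# NODE_LABELS is hypothetical sample data: space-separated key=value pairs.
NODE_LABELS="disktype=ssd memory=48"

# label_value KEY: print the label's value and return 0, or return 1 if absent.
label_value() {
  for pair in $NODE_LABELS; do
    if [ "${pair%%=*}" = "$1" ]; then
      printf '%s\n' "${pair#*=}"
      return 0
    fi
  done
  return 1
}

# is_int STRING: succeed when STRING is a (possibly negative) integer.
is_int() {
  case ${1#-} in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# match_expr KEY OPERATOR [VALUES...]: return 0 when the expression matches.
match_expr() {
  key=$1 op=$2
  shift 2
  if val=$(label_value "$key"); then has=yes; else has=no; fi
  case $op in
    In)
      [ "$has" = yes ] || return 1
      for v in "$@"; do [ "$v" = "$val" ] && return 0; done
      return 1 ;;
    NotIn)
      [ "$has" = yes ] || return 0   # a missing key also satisfies NotIn
      for v in "$@"; do [ "$v" = "$val" ] && return 1; done
      return 0 ;;
    Exists)       [ "$has" = yes ] ;;
    DoesNotExist) [ "$has" = no ] ;;
    Gt) [ "$has" = yes ] && is_int "$val" && is_int "$1" && [ "$val" -gt "$1" ] ;;
    Lt) [ "$has" = yes ] && is_int "$val" && is_int "$1" && [ "$val" -lt "$1" ] ;;
    *)  return 2 ;;
  esac
}

# Usage against the sample labels above.
match_expr disktype In ssd nvme && echo "disktype In [ssd, nvme]: match"
match_expr gpu NotIn true       && echo "gpu NotIn [true]: match (key absent)"
match_expr memory Gt 32         && echo "memory Gt [32]: match"
match_expr memory Lt 32         || echo "memory Lt [32]: no match"
```

The `NotIn` branch is the one most often misread: a node that lacks the label entirely still satisfies it, which is why `NotIn` and `DoesNotExist` are the usual tools for keeping Pods away from a class of nodes.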
2.4 Imperative Creation of Node Affinity
# Use kubectl run then edit YAML to add the affinity section
kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml > pod.yaml
# Edit to add spec.affinity.nodeAffinity
3. Pod Affinity / Anti-Affinity
Controls where a Pod is placed relative to other Pods: in the same topology domain (affinity) or in a different one (anti-affinity).
3.1 Configuration Example
apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity-pod
  labels:
    app: frontend
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - backend
        topologyKey: "kubernetes.io/hostname" # On the same node
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - frontend
          topologyKey: "kubernetes.io/hostname" # Not on the same node
  containers:
  - name: nginx
    image: nginx
3.2 Topology Keys (topologyKey)
| Key | Description |
|---|---|
| kubernetes.io/hostname | Node level |
| topology.kubernetes.io/zone | Availability zone |
| topology.kubernetes.io/region | Region |
| failure-domain.beta.kubernetes.io/zone | Legacy availability zone |
3.3 Common Use Cases
- Pod Affinity: Schedule Web and Cache Pods on the same node (reduce network latency)
- Pod Anti-Affinity: Spread Pods of the same application across different nodes (high availability)
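As a sketch of the anti-affinity use case, the snippet below writes a Deployment manifest whose replicas refuse to share a node. The Deployment name `web`, the label `app=web`, and the file name `web-spread.yaml` are hypothetical; the kubectl commands are left as comments because they need a live cluster.

```shell
#!/bin/sh
# Write a Deployment manifest that uses a hard podAntiAffinity rule
# keyed on the node hostname, so no two replicas land on the same node.
cat <<'EOF' > web-spread.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx
EOF

# Against a cluster, apply it and check the NODE column:
#   kubectl apply -f web-spread.yaml
#   kubectl get pods -l app=web -o wide
```

Because the rule is a hard constraint, a cluster with fewer schedulable nodes than replicas leaves the surplus replicas Pending; switching to `preferredDuringSchedulingIgnoredDuringExecution` trades strict spreading for schedulability.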
3.4 Notes
- Pod Affinity/Anti-Affinity increases scheduler computation overhead
- requiredDuringScheduling may cause Pods to be unschedulable
- topologyKey cannot be empty
4. Taints & Tolerations
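Taints mark a node so that it repels Pods; tolerations let specific Pods be scheduled onto a tainted node anyway. A toleration only permits scheduling on the tainted node; it does not attract the Pod to it.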
4.1 Taint Operations
# View node Taints
kubectl describe node <node-name> | grep Taints
# Add Taint (kubectl taint nodes <node> <key>=<value>:<effect>)
kubectl taint nodes node1 app=blue:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key1=value1:PreferNoSchedule
# Remove Taint
kubectl taint nodes node1 app=blue:NoSchedule-
kubectl taint nodes node1 key1:NoExecute- # No need to specify value
# Remove all taints with a given key
kubectl taint nodes node1 app- # Removes every taint whose key is 'app', regardless of value or effect
4.2 Taint Effect Types
| Effect | Description |
|---|---|
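| NoSchedule | Pods without a matching toleration are not scheduled; Pods already running on the node are not evicted |
| PreferNoSchedule | Soft constraint: the scheduler tries to avoid the node but may still use it |
| NoExecute | Pods without a matching toleration are rejected, and those already running on the node are evicted |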
4.3 Toleration Configuration
apiVersion: v1
kind: Pod
metadata:
  name: toleration-pod
spec:
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"
  - key: "key1"
    operator: "Exists" # Matches all taints containing key1
    effect: "NoExecute"
    tolerationSeconds: 60 # Tolerate for 60 seconds before eviction
  - operator: "Exists" # Tolerate all taints (use with caution)
  containers:
  - name: nginx
    image: nginx
4.4 Toleration Operators
| Operator | Description | Example |
|---|---|---|
| Equal | Full match on key + value + effect | key=app, value=blue, effect=NoSchedule |
| Exists | Tolerates as long as key matches | No value needed |
| Only operator (no key) | Tolerates all taints | Suitable for DaemonSet |
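The matching rules in the table can be sketched as a shell function that compares one taint with one toleration. This is an illustrative model rather than the scheduler's code; the function name `tolerates` and its argument order are hypothetical, and empty strings stand for omitted fields. It assumes that an omitted effect matches every effect and that an omitted operator defaults to `Equal`.

```shell
#!/bin/sh
# tolerates T_KEY T_VALUE T_EFFECT  TOL_KEY TOL_OP TOL_VALUE TOL_EFFECT
# Returns 0 when the toleration (last four arguments) matches the taint.
tolerates() {
  t_key=$1 t_value=$2 t_effect=$3
  key=$4 op=${5:-Equal} value=$6 effect=$7

  # An omitted effect matches every effect; otherwise effects must be equal.
  if [ -n "$effect" ] && [ "$effect" != "$t_effect" ]; then return 1; fi

  # An omitted key is only meaningful with Exists, and then matches every key.
  if [ -z "$key" ]; then
    [ "$op" = Exists ]
    return $?
  fi

  [ "$key" = "$t_key" ] || return 1
  case $op in
    Exists) return 0 ;;                      # value is ignored
    Equal)  [ "$value" = "$t_value" ] ;;     # value must match exactly
    *)      return 1 ;;
  esac
}

# Usage: the three tolerations from section 4.3, plus one mismatch.
tolerates app blue NoSchedule   app Equal blue NoSchedule  && echo "Equal: tolerated"
tolerates key1 value1 NoExecute key1 Exists "" NoExecute   && echo "Exists with key: tolerated"
tolerates dedicated gpu NoSchedule "" Exists "" ""         && echo "Exists without key: tolerated"
tolerates app green NoSchedule  app Equal blue NoSchedule  || echo "value differs: not tolerated"
```

A Pod is schedulable on a node only when every `NoSchedule` and `NoExecute` taint on that node is matched by at least one of its tolerations, so in practice this check runs once per taint.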
4.5 Common Use Cases
# 1. Dedicated node (only run specific Pods)
kubectl taint nodes gpu-node dedicated=gpu:NoSchedule
# Only allow Pods with toleration to be scheduled
# 2. Control plane node default Taint
kubectl describe node controlplane | grep Taints
# Output: node-role.kubernetes.io/control-plane:NoSchedule
# 3. Node failure handling (the node controller normally adds this taint automatically when a node becomes unreachable)
kubectl taint nodes node1 node.kubernetes.io/unreachable:NoExecute
4.6 Exam Tips
# Set node as non-schedulable (NoSchedule)
kubectl taint nodes worker1 env=production:NoSchedule
# Create a Pod that tolerates this taint
kubectl run toleration-pod --image=nginx --dry-run=client -o yaml > pod.yaml
# Edit to add spec.tolerations
# Check which node the Pod landed on (the toleration permits worker1 but does not force the Pod onto it)
kubectl get pods -o wide | grep toleration-pod
# Running regular Pods on master nodes
# Method 1: Remove the master's taint
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-
# Method 2: Add toleration (recommended, doesn't affect the control plane)
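Method 2 could look like the sketch below, which writes a Pod manifest that tolerates the control plane taint. The Pod name `cp-pod` and the file name are hypothetical. Since a toleration alone does not steer the Pod to the control plane, the sketch also adds a nodeSelector; it assumes a kubeadm-style cluster where control plane nodes carry the label `node-role.kubernetes.io/control-plane` with an empty value, which is worth confirming with `kubectl get nodes --show-labels`.

```shell
#!/bin/sh
# Write a Pod manifest that tolerates the control plane taint and
# selects control plane nodes by their role label (assumed to exist).
cat <<'EOF' > cp-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cp-pod
spec:
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""
  containers:
  - name: nginx
    image: nginx
EOF

# Against a cluster:
#   kubectl apply -f cp-pod.yaml
#   kubectl get pod cp-pod -o wide   # NODE column should show the control plane node
```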
5. PriorityClass
PriorityClass sets Pod priority; higher priority Pods can preempt lower priority Pods.
5.1 Create PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000 # Higher value means higher priority
globalDefault: false # Whether this is the default PriorityClass
description: "High priority Pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority Pods"
# Create PriorityClass
kubectl apply -f priorityclass.yaml
# View
kubectl get priorityclass
kubectl get pc
5.2 Using in Pods
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
5.3 Common PriorityClass Values
| Name | Value | Description |
|---|---|---|
| system-cluster-critical | 2000000000 | Cluster-critical components |
| system-node-critical | 2000001000 | Node-critical components |
| Custom (high) | 1000000 | High priority applications |
| Custom (low) | 100 | Low priority batch tasks |
6. Pod Scheduling Flow
- Node Filtering (Filtering / Predicates):
  - Check if node resources satisfy Pod requests
  - Check nodeSelector, Node Affinity
  - Check Taints & Tolerations
  - Check Pod Affinity/Anti-Affinity
- Node Scoring (Scoring / Priorities):
  - Resource utilization (used resources / total resources)
  - Node Affinity weight
  - Pod dispersion (Anti-Affinity)
- Binding: The scheduler binds the Pod to the selected node
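The filter, score, and bind phases above can be sketched as a toy shell model. Everything here is hypothetical: the `NODES` sample data, the `pick_node` helper, and especially the scoring formula (free CPU divided by 100, plus the preference weight), which stands in for the scheduler's real, normalized plugin scores.

```shell
#!/bin/sh
# Toy model of filter -> score -> bind. All node data is made up.
# Each line: name  untolerated-taint(yes/no)  disktype-label  free-CPU(millicores)
NODES="node1 yes ssd 3000
node2 no hdd 1500
node3 no ssd 1000"

# pick_node REQUEST_CPU PREFERRED_DISK WEIGHT
# Prints the chosen node name, or an empty line when no node passes filtering.
pick_node() {
  request_cpu=$1 preferred_disk=$2 weight=$3
  best="" best_score=-1
  while read -r name tainted disk free_cpu; do
    # Filtering: drop nodes that cannot run the Pod at all.
    [ "$tainted" = yes ] && continue
    [ "$free_cpu" -ge "$request_cpu" ] || continue
    # Scoring: rank the surviving nodes (simplified stand-in formula).
    score=$(( free_cpu / 100 ))
    if [ "$disk" = "$preferred_disk" ]; then score=$(( score + weight )); fi
    if [ "$score" -gt "$best_score" ]; then best=$name best_score=$score; fi
  done <<EOF
$NODES
EOF
  # Binding: report the winner.
  printf '%s\n' "$best"
}

# A 500m Pod that prefers disktype=ssd with weight 80.
echo "bind to: $(pick_node 500 ssd 80)"
```

With the sample data, node1 is filtered out by its taint even though it has the most free CPU, and node3 outscores node2 because the soft preference adds 80 points. Hard constraints remove candidates, while soft constraints only reorder the candidates that remain.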
# View scheduler logs
kubectl logs -n kube-system kube-scheduler-controlplane
# View scheduling events
kubectl get events --sort-by='.lastTimestamp' | grep -i schedule
# View the node a Pod is scheduled to
kubectl get pods -o wide
kubectl get pod <pod-name> -o wide
7. Useful Exam Commands
# 1. View all node labels
kubectl get nodes --show-labels
# 2. View node Taints
kubectl describe nodes | grep -A 5 Taints
# 3. Schedule a Pod to the master node
kubectl taint nodes controlplane node-role.kubernetes.io/control-plane:NoSchedule-
# 4. Create a Pod with Node Affinity (generate a manifest, then edit it)
kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml > pod.yaml
# Edit to add spec.affinity.nodeAffinity
# 5. Add Toleration to DaemonSet (common)
kubectl get ds fluentd -o yaml > ds.yaml
# Add tolerations to spec.template.spec in ds.yaml
# 6. Check if a Pod cannot be scheduled due to Taint
kubectl describe pod <pod-name> | grep -A 10 Events
# Output: 0/3 nodes are available: 1 node(s) had taint, 2 node(s) didn't match pod anti-affinity
# 7. Quickly create a PriorityClass
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important
value: 100000
globalDefault: false
EOF
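For item 5 above, an alternative to exporting and editing the full DaemonSet YAML is a small patch file. The sketch below only writes the patch; the file name is hypothetical, the DaemonSet name `fluentd` comes from the command above, and the `--patch-file` flag is assumed to be available in the installed kubectl version.

```shell
#!/bin/sh
# Patch that sets a tolerate-everything entry on the DaemonSet's Pod template.
cat <<'EOF' > ds-toleration-patch.yaml
spec:
  template:
    spec:
      tolerations:
      - operator: Exists
EOF

# Against a cluster:
#   kubectl patch ds fluentd --patch-file ds-toleration-patch.yaml
#   kubectl rollout status ds fluentd
#   kubectl get ds fluentd -o jsonpath='{.spec.template.spec.tolerations}'
```

The patch most likely replaces any existing tolerations list instead of appending to it, so inspect the result with the last command before relying on it.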
🧪 Complete Hands-on Example: Controlling Pod Scheduling with Taints/Tolerations and NodeAffinity
Scenario
Apply taints to nodes to control specific Pod scheduling, while configuring nodeAffinity for more fine-grained scheduling constraints.
Prerequisites
- A multi-node Kubernetes cluster (at least 2 worker nodes)
- kubectl is configured to connect to the cluster
- The cluster has node1 and node2 (if node names differ, replace them in the commands)
Steps
Step 1: Label Nodes and Apply Taints
# View nodes
kubectl get nodes
# Expected output:
# NAME           STATUS   ROLES           AGE   VERSION
# controlplane   Ready    control-plane   10m   v1.29
# node1          Ready    <none>          9m    v1.29
# node2          Ready    <none>          9m    v1.29
# Label node1
kubectl label nodes node1 disktype=ssd
# Expected output: node/node1 labeled
# Apply a taint to node1 (only allow Pods tolerating this taint to be scheduled)
kubectl taint nodes node1 dedicated=gpu:NoSchedule
# Expected output: node/node1 tainted
Step 2: Create a Deployment with Toleration (Can be Scheduled on node1)
cat <<'EOF' > deploy-tolerate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - name: nginx
        image: nginx
EOF
kubectl apply -f deploy-tolerate.yaml
# Expected output: deployment.apps/gpu-workload created
kubectl get pods -l app=gpu-app -o wide
# Expected output: the toleration allows node1 but does not attract Pods to it,
# so replicas may land on node1 or on the untainted node2, for example:
# NAME                           READY   STATUS    RESTARTS   AGE   NODE
# gpu-workload-<hash>-<pod-id>   1/1     Running   0          <s>   node1
# gpu-workload-<hash>-<pod-id>   1/1     Running   0          <s>   node2
Step 3: Create a Deployment without Toleration, Pinned to node1 (Cannot be Scheduled)
cat <<'EOF' > deploy-no-tolerate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal
  template:
    metadata:
      labels:
        app: normal
    spec:
      # Hard constraint: only node1 carries disktype=ssd, and node1 is tainted.
      # Without this selector the Pod would simply be scheduled to the untainted node2.
      nodeSelector:
        disktype: ssd
      containers:
      - name: nginx
        image: nginx
EOF
kubectl apply -f deploy-no-tolerate.yaml
# Expected output: deployment.apps/normal-workload created
kubectl get pods -l app=normal
# Expected output: Pod will show Pending status
# NAME                              READY   STATUS    RESTARTS   AGE
# normal-workload-<hash>-<pod-id>   0/1     Pending   0          <seconds>
Step 4: Analyze the Reason for Scheduling Failure
kubectl describe pod -l app=normal | grep -A 10 Events
# Expected output (exact wording varies by Kubernetes version):
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Warning FailedScheduling 10s default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {dedicated: gpu}, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }.
# node2 fails the selector, node1 and controlplane are tainted, so no node is eligible.
Step 5: Add Node Affinity (Soft Constraint)
# First delete pending Pods and update the Deployment
kubectl delete deployment normal-workload
# Expected output: deployment.apps "normal-workload" deleted
cat <<'EOF' > deploy-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal
  template:
    metadata:
      labels:
        app: normal
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: nginx
        image: nginx
EOF
kubectl apply -f deploy-affinity.yaml
# Expected output: deployment.apps/normal-workload created
kubectl get pods -l app=normal -o wide
# Expected output: Pod is scheduled to node2 (no disktype=ssd label, but soft constraint is not mandatory)
# NAME READY STATUS RESTARTS AGE NODE
# normal-workload-<hash>-<pod-id> 1/1 Running 0 <s> node2
Verification
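# Check which gpu-workload Pods run on node1 (the toleration permits node1 but does not force it)
kubectl get pods -l app=gpu-app -o wide | grep node1
# Expected output: any gpu-workload Pods placed on node1; replicas placed on node2 are not listed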
# Confirm normal-workload runs on node2
kubectl get pods -l app=normal -o wide | grep node2
# Expected output: normal-workload Pod shows node2
# Cleanup
kubectl delete deployment gpu-workload normal-workload
kubectl taint nodes node1 dedicated=gpu:NoSchedule-
kubectl label nodes node1 disktype-
# Expected output: both deployments reported as deleted, then node/node1 untainted and node/node1 unlabeled
Exam Tips
- NoSchedule only affects new Pods, not existing ones; NoExecute also evicts existing Pods
- Toleration operator: Exists with a key matches all values; operator: Exists without a key matches all taints
- Node Affinity's requiredDuringScheduling is a hard constraint; Pods cannot be scheduled if not met
- In the exam, if a Pod is in Pending state, first use kubectl describe pod to check Events to determine the cause
- Master nodes have the node-role.kubernetes.io/control-plane:NoSchedule taint by default; removing it allows Pods to be scheduled on the master
- When creating a DaemonSet, you usually need to add a toleration for all taints (a tolerations entry containing only operator: Exists)