In OpenShift, the kube-scheduler binds a unit of work (a Pod) to a Node. The scheduler reads the work from a scheduling queue, retrieves the current state of the cluster, scores the work against the scheduling rules (from the policy) and the cluster’s state, and prioritizes binding the Pod to a Node.
Pods are scheduled based on an instantaneous read of the policy and the environment, so each placement is a best estimate at that moment. Because clusters constantly change shape and context, a Pod sometimes needs to be descheduled and scheduled anew.
Four actors are involved in descheduling:
- The User configures the KubeDescheduler resource.
- The Operator creates the Descheduler Deployment.
- The Descheduler runs on a set interval and re-evaluates each scheduled Pod against its Node and the Descheduler Policy, marking the Pod for eviction if the policy says it should be removed.
- The Pod is removed (unbound).
Thankfully, OpenShift has a Descheduler Operator that makes it easier to unbind a Pod from a Node based on a cluster-wide configuration, the KubeDescheduler CustomResource. A cluster has at most one KubeDescheduler, which must be named cluster (the name is fixed), and it configures one or more Descheduler Profiles.
| Profile | Description |
|---|---|
| AffinityAndTaints | Evicts pods that violate inter-pod anti-affinity, node affinity, or node taints. |
| TopologyAndDuplicates | Spreads pods evenly among nodes based on topology constraints and duplicate replicas on the same node. Cannot be used with SoftTopologyAndDuplicates. |
| SoftTopologyAndDuplicates | Same as TopologyAndDuplicates, but also considers pods with soft topology constraints for eviction. Cannot be used with TopologyAndDuplicates. |
| LifecycleAndUtilization | Balances pods based on pod lifecycle and node resource usage. Cannot be used with DevPreviewLongLifecycle. |
| EvictPodsWithLocalStorage | Allows pods with local storage to be evicted by all other profiles. |
| EvictPodsWithPVC | Allows pods with PVCs to be evicted by all other profiles. By default, pods with PVCs are not evicted. |
| DevPreviewLongLifecycle | Lifecycle management for long-running pods. Cannot be used with LifecycleAndUtilization. |
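At least one DeschedulerProfile must be specified, and the list cannot contain duplicate entries. There are two possible mode values: Automatic and Predictive. In Predictive mode the descheduler only simulates evictions (a dry run); in Automatic mode it actually evicts Pods. In either mode, check the output of the descheduler Pod to see what was predicted or completed.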
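The profile rules above (at least one profile, no duplicates, and the two mutually exclusive pairs from the table) can be sketched as a small pre-flight check. This is only an illustration: `validate_profiles` is a hypothetical helper, not part of the operator or the `oc` CLI, and the operator performs its own validation when the CR is applied.

```shell
# Hypothetical helper: validate_profiles PROFILE...
# Prints "ok" and returns 0 when the list satisfies the rules described in
# this article; otherwise prints the first problem found and returns 1.
validate_profiles() {
  local seen p pair a b
  if [ "$#" -eq 0 ]; then
    echo "at least one profile is required"
    return 1
  fi
  # Track profiles already seen as a space-delimited string.
  seen=" "
  for p in "$@"; do
    case "$seen" in
      *" $p "*) echo "duplicate profile: $p"; return 1 ;;
    esac
    seen="$seen$p "
  done
  # Mutually exclusive pairs, as listed in the profile table.
  for pair in "TopologyAndDuplicates SoftTopologyAndDuplicates" \
              "LifecycleAndUtilization DevPreviewLongLifecycle"; do
    a=${pair% *}
    b=${pair#* }
    case "$seen" in
      *" $a "*)
        case "$seen" in
          *" $b "*) echo "$a cannot be combined with $b"; return 1 ;;
        esac ;;
    esac
  done
  echo "ok"
}
```

For example, calling `validate_profiles LifecycleAndUtilization EvictPodsWithPVC` passes, while listing both TopologyAndDuplicates and SoftTopologyAndDuplicates is rejected.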
The DeschedulerOperator excludes the openshift-*, kube-system and hypershift namespaces.
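As a rough way to reason about which namespaces fall under that exclusion, the list can be expressed as a shell pattern match. The function name `is_excluded_namespace` is made up for this sketch, and it encodes only the three patterns named above; the operator's real exclusion logic lives in its generated policy and may differ between versions.

```shell
# Hypothetical helper: returns 0 if the namespace matches one of the
# exclusions listed in this article, 1 otherwise.
is_excluded_namespace() {
  case "$1" in
    openshift-*|kube-system|hypershift) return 0 ;;
    *) return 1 ;;
  esac
}
```

With this sketch, `is_excluded_namespace openshift-monitoring` succeeds, while `is_excluded_namespace default` fails, so Pods in the default namespace remain candidates for eviction.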
1. Log in to your OpenShift cluster.
oc login --token=sha256~1111-g --server=https://api..sslip.io:6443
2. Create a Pod that indicates it is available for eviction using the annotation descheduler.alpha.kubernetes.io/evict: "true". Update the manifest for your environment as needed (for example, the node name).
    cat << EOF > pod.yaml
    kind: Pod
    apiVersion: v1
    metadata:
      annotations:
        descheduler.alpha.kubernetes.io/evict: "true"
      name: demopod1
      labels:
        foo: bar
    spec:
      containers:
      - name: pause
        image: docker.io/ibmcom/pause-ppc64le:3.1
    EOF
    oc apply -f pod.yaml
    pod/demopod1 created
3. Create the KubeDescheduler CR with a Descheduling Interval of 60 seconds and a Pod Lifetime of 1m.
    cat << EOF > kd.yaml
    apiVersion: operator.openshift.io/v1
    kind: KubeDescheduler
    metadata:
      name: cluster
      namespace: openshift-kube-descheduler-operator
    spec:
      logLevel: Normal
      mode: Predictive
      operatorLogLevel: Normal
      deschedulingIntervalSeconds: 60
      profileCustomizations:
        podLifetime: 1m0s
      observedConfig:
        servingInfo:
          cipherSuites:
          - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
          - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
          - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
          - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
          - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
          - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
          minTLSVersion: VersionTLS12
      profiles:
      - LifecycleAndUtilization
      managementState: Managed
    EOF
    oc apply -f kd.yaml
4. Get the Pods in the openshift-kube-descheduler-operator namespace.
    oc get pods -n openshift-kube-descheduler-operator
    NAME                                    READY   STATUS    RESTARTS   AGE
    descheduler-f479c5669-5ffxl             1/1     Running   0          2m7s
    descheduler-operator-85fc6666cb-5dfr7   1/1     Running   0          27h
5. Check the logs for the descheduler Pod.
    oc -n openshift-kube-descheduler-operator logs descheduler-f479c5669-5ffxl
    I0506 19:59:10.298440 1 pod_lifetime.go:110] "Evicted pod because it exceeded its lifetime" pod="minio-operator/console-7bc65f7dd9-q57lr" maxPodLifeTime=60
    I0506 19:59:10.298500 1 evictions.go:158] "Evicted pod in dry run mode" pod="default/demopod1" reason="PodLifeTime"
    I0506 19:59:10.298532 1 pod_lifetime.go:110] "Evicted pod because it exceeded its lifetime" pod="default/demopod1" maxPodLifeTime=60
    I0506 19:59:10.298598 1 toomanyrestarts.go:90] "Processing node" node="master-0.rdr-rhop-.sslip.io"
    I0506 19:59:10.299118 1 toomanyrestarts.go:90] "Processing node" node="master-1.rdr-rhop.sslip.io"
    I0506 19:59:10.299575 1 toomanyrestarts.go:90] "Processing node" node="master-2.rdr-rhop.sslip.io"
    I0506 19:59:10.300385 1 toomanyrestarts.go:90] "Processing node" node="worker-0.rdr-rhop.sslip.io"
    I0506 19:59:10.300701 1 toomanyrestarts.go:90] "Processing node" node="worker-1.rdr-rhop.sslip.io"
    I0506 19:59:10.301097 1 descheduler.go:287] "Number of evicted pods" totalEvicted=5
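The podLifetime of 1m0s set in the KubeDescheduler CR appears in the log as maxPodLifeTime=60, that is, the same value expressed in seconds. To sanity-check other values, the conversion can be sketched as below. `duration_to_seconds` is a hypothetical helper written for this article. It assumes simple durations built from whole-number h, m and s parts without leading zeros, and it is not how the operator itself parses the field.

```shell
# Hypothetical helper: converts a duration such as 1m0s, 24h or 1h30m into
# whole seconds. Only whole-number h/m/s components are handled.
duration_to_seconds() {
  local rest total num tail unit
  rest=$1
  total=0
  if [ -z "$rest" ]; then
    echo "empty duration" >&2
    return 1
  fi
  while [ -n "$rest" ]; do
    num=${rest%%[hms]*}        # digits before the first unit letter
    case "$num" in
      ''|*[!0-9]*) echo "unsupported duration: $1" >&2; return 1 ;;
    esac
    tail=${rest#"$num"}        # unit letter plus whatever follows
    rest=${tail#?}             # what follows the unit letter
    unit=${tail%"$rest"}       # the unit letter itself
    case "$unit" in
      h) total=$((total + num * 3600)) ;;
      m) total=$((total + num * 60)) ;;
      s) total=$((total + num)) ;;
      *) echo "unsupported duration: $1" >&2; return 1 ;;
    esac
  done
  echo "$total"
}
```

Running `duration_to_seconds 1m0s` prints 60, matching the maxPodLifeTime value in the log lines above.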
This article walks through a simple case for the Descheduler: running in Predictive mode, it performed a dry run and reported that it would evict five pods.
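When the log is long, it can help to pull out only the Pods reported as dry-run evictions. The filter below is a minimal sketch that assumes the log message format shown in this article ("Evicted pod in dry run mode" followed by pod="namespace/name"). Other descheduler versions may word the message differently, and `dry_run_evictions` is a made-up name.

```shell
# Hypothetical helper: reads descheduler log lines on stdin and prints the
# namespace/name of every pod reported as evicted in dry-run mode.
dry_run_evictions() {
  sed -n 's/.*"Evicted pod in dry run mode" pod="\([^"]*\)".*/\1/p'
}
```

It is meant to be used at the end of a pipe, for example by sending the output of the oc logs command from step 5 into `dry_run_evictions`. For the log excerpt shown in this article it prints default/demopod1.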