OpenShift Descheduler Operator: How-To

In OpenShift, the kube-scheduler binds a unit of work (a Pod) to a Node. The scheduler takes work from a scheduling queue, retrieves the current state of the cluster, scores the placement based on the scheduling rules (from the policy) and the cluster’s state, and binds the Pod to the highest-priority Node.

Pods are scheduled based on an instantaneous read of the policy and the environment, so each placement is only a best estimate at that moment. Because clusters are constantly changing shape and context, a placement that was right at scheduling time can stop being a good fit, and there is a need to deschedule the Pod so it can be scheduled anew.

There are four actors in the Descheduler:

  1. User configures the KubeDescheduler resource.
  2. Operator creates the Descheduler Deployment.
  3. Descheduler runs on a set interval and re-evaluates each scheduled Pod against its Node and the Descheduler Policy, evicting the Pod if the policy says it should be removed.
  4. Pod is removed (unbound).

Thankfully, OpenShift has a Descheduler Operator that makes it easier to unbind a Pod from a Node based on a cluster-wide configuration in the KubeDescheduler CustomResource. A single cluster has at most one KubeDescheduler, which must be named cluster (the name is fixed), and it configures one or more Descheduler Profiles.

Descheduler Profiles are predefined and available in the profiles folder (DeschedulerProfile):

AffinityAndTaints: Balances pods based on node taint violations.
TopologyAndDuplicates: Spreads pods evenly among nodes based on topology constraints and duplicate replicas on the same node. This profile cannot be used with SoftTopologyAndDuplicates.
SoftTopologyAndDuplicates: Spreads pods like the prior profile, but also considers soft constraints. This profile cannot be used with TopologyAndDuplicates.
LifecycleAndUtilization: Balances pods based on node resource usage. This profile cannot be used with DevPreviewLongLifecycle.
EvictPodsWithLocalStorage: Enables pods with local storage to be evicted by all other profiles.
EvictPodsWithPVC: Enables pods with PVCs to be evicted by all other profiles; by default, pods with PVCs are protected from eviction.
DevPreviewLongLifecycle: Lifecycle management for pods that are ‘long running’. This profile cannot be used with LifecycleAndUtilization.

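At least one DeschedulerProfile must be specified, and there cannot be any duplicate entries. There are two possible mode values: Automatic and Predictive. In Predictive mode the Descheduler only performs a dry run and reports what it would evict; in Automatic mode it actually evicts Pods. Either way, you have to check the logs of the Descheduler Pod to see what was predicted or completed.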
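
The rules above (at least one profile, no duplicates, and the mutually exclusive pairs named in the profile descriptions) can be sketched as a small shell check to run before applying a KubeDescheduler. This is only an illustration: validate_profiles is a hypothetical helper, not part of the operator or of oc, and the operator’s own validation remains authoritative.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the operator or oc).
# Checks a list of profile names against the rules described in this article.
# Prints "ok" and returns 0 when the list is acceptable; otherwise prints the
# first problem found and returns 1.
validate_profiles() {
  if [ "$#" -eq 0 ]; then
    echo "error: at least one profile is required"
    return 1
  fi

  # Space-delimited list of the names seen so far, used for duplicate checks.
  local seen=" " p
  for p in "$@"; do
    case "$seen" in
      *" $p "*) echo "error: duplicate profile $p"; return 1 ;;
    esac
    seen="$seen$p "
  done

  # Pairs the article says cannot be combined.
  local pair a b
  for pair in \
    "TopologyAndDuplicates:SoftTopologyAndDuplicates" \
    "LifecycleAndUtilization:DevPreviewLongLifecycle"; do
    a="${pair%%:*}"
    b="${pair##*:}"
    case "$seen" in
      *" $a "*)
        case "$seen" in
          *" $b "*) echo "error: $a cannot be combined with $b"; return 1 ;;
        esac ;;
    esac
  done

  echo "ok"
}
```

For example, validate_profiles LifecycleAndUtilization EvictPodsWithPVC is accepted, while validate_profiles TopologyAndDuplicates SoftTopologyAndDuplicates is rejected.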

The Descheduler Operator excludes the openshift-*, kube-system, and hypershift namespaces, so Pods in those namespaces are not evicted.
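
As a rough illustration of that exclusion rule, the sketch below classifies a namespace name using only the three patterns named above. namespace_status is a hypothetical helper for reasoning about which of your Pods could be affected; it does not query the cluster and is not how the operator itself is implemented.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the operator): prints "excluded" when the
# namespace name matches one of the exclusions named in this article,
# otherwise prints "eligible".
namespace_status() {
  case "$1" in
    openshift-*|kube-system|hypershift) echo "excluded" ;;
    *) echo "eligible" ;;
  esac
}
```

With this, namespace_status openshift-kube-descheduler-operator prints excluded, while namespace_status default prints eligible.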

Steps

1. Log in to your OpenShift cluster.
oc login --token=sha256~1111-g --server=https://api..sslip.io:6443

2. Create a Pod that indicates it’s available for eviction using the annotation descheduler.alpha.kubernetes.io/evict: "true". Adjust the manifest for your environment (for example, the proper node name) before applying it.

cat << EOF > pod.yaml 
kind: Pod
apiVersion: v1
metadata:
  annotations:
    descheduler.alpha.kubernetes.io/evict: "true"
  name: demopod1
  labels:
    foo: bar
spec:
  containers:
  - name: pause
    image: docker.io/ibmcom/pause-ppc64le:3.1
EOF
oc apply -f pod.yaml 
pod/demopod1 created
3. Create the KubeDescheduler CR with a Descheduling Interval of 60 seconds and a Pod Lifetime of 1m.
cat << EOF > kd.yaml 
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  logLevel: Normal
  mode: Predictive
  operatorLogLevel: Normal
  deschedulingIntervalSeconds: 60
  profileCustomizations:
    podLifetime: 1m0s
  observedConfig:
    servingInfo:
      cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
      minTLSVersion: VersionTLS12
  profiles:
    - LifecycleAndUtilization
  managementState: Managed
EOF
oc apply -f kd.yaml
4. Get the Pods in the openshift-kube-descheduler-operator namespace.
oc get pods -n openshift-kube-descheduler-operator
NAME                                    READY   STATUS    RESTARTS   AGE
descheduler-f479c5669-5ffxl             1/1     Running   0          2m7s
descheduler-operator-85fc6666cb-5dfr7   1/1     Running   0          27h
5. Check the logs for the descheduler Pod.
oc -n openshift-kube-descheduler-operator logs descheduler-f479c5669-5ffxl
I0506 19:59:10.298440       1 pod_lifetime.go:110] "Evicted pod because it exceeded its lifetime" pod="minio-operator/console-7bc65f7dd9-q57lr" maxPodLifeTime=60
I0506 19:59:10.298500       1 evictions.go:158] "Evicted pod in dry run mode" pod="default/demopod1" reason="PodLifeTime"
I0506 19:59:10.298532       1 pod_lifetime.go:110] "Evicted pod because it exceeded its lifetime" pod="default/demopod1" maxPodLifeTime=60
I0506 19:59:10.298598       1 toomanyrestarts.go:90] "Processing node" node="master-0.rdr-rhop-.sslip.io"
I0506 19:59:10.299118       1 toomanyrestarts.go:90] "Processing node" node="master-1.rdr-rhop.sslip.io"
I0506 19:59:10.299575       1 toomanyrestarts.go:90] "Processing node" node="master-2.rdr-rhop.sslip.io"
I0506 19:59:10.300385       1 toomanyrestarts.go:90] "Processing node" node="worker-0.rdr-rhop.sslip.io"
I0506 19:59:10.300701       1 toomanyrestarts.go:90] "Processing node" node="worker-1.rdr-rhop.sslip.io"
I0506 19:59:10.301097       1 descheduler.go:287] "Number of evicted pods" totalEvicted=5
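
To pull just the dry-run evictions out of output like this, a small filter can help. The sketch below assumes the log line format shown above (the "Evicted pod in dry run mode" message followed by pod="namespace/name"); dry_run_evictions is a hypothetical helper, and the format may differ between descheduler versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads descheduler log lines on stdin and prints the
# namespace/name of each pod reported as evicted in dry run mode.
# Assumes the message format shown in the log output in this article.
dry_run_evictions() {
  sed -n 's/.*"Evicted pod in dry run mode" pod="\([^"]*\)".*/\1/p'
}
```

It can be fed from the logs command in step 5 by piping that command’s output into dry_run_evictions; for the excerpt above it would print default/demopod1.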

This article shows a simple case for the Descheduler: because the mode is Predictive, it performed a dry run and reported that it would evict five pods.
