Operator Training – Part 1: Concepts and Why Use Go

These notes come from a brief Operator training I gave to my team. Thanks to the many authors listed in the Reference section.

An Operator codifies the tasks commonly associated with administering, operating, and supporting an application. The codified tasks are event-driven responses to changes (create-update-delete-time) in the declared state relative to the actual state of an application. The Operator uses domain knowledge to reconcile the two states and report on the status.

Figure 1 Operator Pattern
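
The reconcile idea can be sketched in a few lines of Go. This is a minimal illustration only, not how any particular operator is written: the `State` type, the `Action` names, and the replica count are all assumptions made up for the example.

```go
package main

import "fmt"

// State is a hypothetical, simplified view of an application:
// only the number of replicas is tracked.
type State struct {
	Replicas int
}

// Action names the operational step the operator would take.
type Action string

const (
	ScaleUp   Action = "scale-up"
	ScaleDown Action = "scale-down"
	NoOp      Action = "no-op"
)

// Reconcile compares the declared state with the actual state and
// returns the action needed to converge, plus a status message.
func Reconcile(declared, actual State) (Action, string) {
	switch {
	case actual.Replicas < declared.Replicas:
		return ScaleUp, fmt.Sprintf("scaling up from %d to %d", actual.Replicas, declared.Replicas)
	case actual.Replicas > declared.Replicas:
		return ScaleDown, fmt.Sprintf("scaling down from %d to %d", actual.Replicas, declared.Replicas)
	default:
		return NoOp, "in sync"
	}
}

func main() {
	action, status := Reconcile(State{Replicas: 3}, State{Replicas: 1})
	fmt.Println(action, "-", status)
}
```

A real operator would be triggered by events and would act on the cluster rather than return a string, but the shape is the same: compare declared with actual, act, report.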

Operators are used to execute basic and advanced operations:

Basic (Helm, Go, Ansible)

  1. Installation and Configuration
  2. Uninstall and Destroy
  3. Seamless Upgrades

Advanced (Go, Ansible)

  1. Application Lifecycle (Backup, Failure Recovery)
  2. Monitoring, Metrics, Alerts, Log Processing, Workload Analysis
  3. Auto-scaling: Horizontal and Vertical
  4. Event (Anomaly) Detection and Response (Remediation)
  5. Scheduling and Tuning
  6. Application Specific Management
  7. Continuous Testing and Chaos Monkey

Helm operators wrap Helm charts and offer a simplified view of operations: they pass through the Helm verbs, so one can install, uninstall, destroy, and upgrade using an Operator.
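
As a rough sketch of that pass-through idea, the mapping from a Custom Resource event to a Helm verb might look like the following. The `Event` type and `helmVerb` function are hypothetical names for illustration; the sketch does not call Helm and is not the Helm operator's actual code.

```go
package main

import "fmt"

// Event is a change to the Custom Resource observed by the operator.
type Event string

const (
	Created Event = "created"
	Updated Event = "updated"
	Deleted Event = "deleted"
)

// helmVerb maps a Custom Resource event to the Helm verb a
// pass-through operator would run. The mapping is illustrative.
func helmVerb(e Event) (string, error) {
	switch e {
	case Created:
		return "install", nil
	case Updated:
		return "upgrade", nil
	case Deleted:
		return "uninstall", nil
	}
	return "", fmt.Errorf("unhandled event %q", e)
}

func main() {
	for _, e := range []Event{Created, Updated, Deleted} {
		verb, _ := helmVerb(e)
		fmt.Printf("%s -> helm %s\n", e, verb)
	}
}
```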

There are four actors in the Operator Pattern.

  1. Initiator – The user who creates the Custom Resource
  2. Operator – The Controller that operates on the Operand
  3. Operand – The target application
  4. OpenShift and Kubernetes Environment

Figure 2 Common Terms

Each Operator operates on an Operand using Managed Resources (Kubernetes and OpenShift) to reconcile states. The state of the application is described in a domain-specific language (DSL) encapsulated in a Custom Resource:

  1. spec – The User communicates to the Operator the desired state (Operator reads)
  2. status – The Operator communicates back to the User (Operator writes)

$ oc get authentications cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Authentication
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
spec:
  oauthMetadata:
    name: ""
  serviceAccountIssuer: ""
  type: ""
  webhookTokenAuthenticator:
    kubeConfig:
      name: webhook-authentication-integrated-oauth
status:
  integratedOAuthMetadata:
    name: oauth-openshift

An Operator is not strictly limited to reading spec and writing status. However, if we treat spec as specified by the Initiator and status as written by the Operator, we limit the chances of creating an unintended reconciliation loop.
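
A minimal sketch of that separation, assuming a made-up `Resource` type with a generation counter that is bumped only when the spec changes. The types and field names here are illustrative assumptions, not a real API.

```go
package main

import "fmt"

// Spec is written by the Initiator and only read by the Operator.
type Spec struct {
	Size int
}

// Status is written by the Operator and only read by the Initiator.
type Status struct {
	ObservedGeneration int64
	Ready              bool
}

// Resource is a hypothetical custom resource, reduced to the fields
// needed for this illustration.
type Resource struct {
	Generation int64 // assumed to be bumped whenever the spec changes
	Spec       Spec
	Status     Status
}

// Reconcile reads the spec and writes only the status. It reports
// whether any work was done, so repeat events for an unchanged spec
// do not cause another round of updates.
func Reconcile(r *Resource) bool {
	if r.Status.ObservedGeneration == r.Generation {
		return false // spec already handled; nothing to write
	}
	// ... operational work driven by r.Spec would happen here ...
	r.Status = Status{ObservedGeneration: r.Generation, Ready: r.Spec.Size > 0}
	return true
}

func main() {
	r := &Resource{Generation: 1, Spec: Spec{Size: 3}}
	fmt.Println(Reconcile(r)) // first pass does the work
	fmt.Println(Reconcile(r)) // second pass is a no-op
}
```

Because the Operator never writes the spec, its own status update does not look like a new request, and the second call returns without doing anything.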

The DSL is specified as a Custom Resource Definition (CRD):

$ oc get crd machinehealthchecks.machine.openshift.io -o=yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
spec:
  conversion:
    strategy: None
  group: machine.openshift.io
  names:
    kind: MachineHealthCheck
    listKind: MachineHealthCheckList
    plural: machinehealthchecks
    shortNames:
    - mhc
    - mhcs
    singular: machinehealthcheck
  scope: Namespaced
  versions:
  - name: v1beta1
    schema:
      openAPIV3Schema:
        description: 'MachineHealthCheck'
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource'
            type: string
          metadata:
            type: object
          spec:
            description: Specification of machine health check policy
            properties:
              expectedMachines:
                description: total number of machines counted by this machine health
                  check
                minimum: 0
                type: integer
              unhealthyConditions:
                description: UnhealthyConditions contains a list of the conditions.
                items:
                  description: UnhealthyCondition represents a Node.
                  properties:
                    status:
                      minLength: 1
                      type: string
                    timeout:
                      description: Expects an unsigned duration string of decimal
                        numbers each with optional fraction and a unit suffix, eg
                        "300ms", "1.5h" or "2h45m". Valid time units are "ns", "us"
                        (or "µs"), "ms", "s", "m", "h".
                      pattern: ^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$
                      type: string
                    type:
                      minLength: 1
                      type: string
                  type: object
                minItems: 1
                type: array
            type: object
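
To make the schema rules concrete, here is a small Go sketch that applies the same constraints the excerpt declares for an unhealthy condition: minLength 1 on type and status, and the duration pattern on timeout. The `UnhealthyCondition` struct and `Validate` method are illustrative stand-ins, not the generated OpenShift types; in a cluster the API server enforces the schema for us.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// timeoutPattern is copied from the pattern field in the CRD schema.
var timeoutPattern = regexp.MustCompile(`^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$`)

// UnhealthyCondition mirrors the three properties shown in the schema.
type UnhealthyCondition struct {
	Type    string
	Status  string
	Timeout string
}

// Validate checks the constraints declared in the schema excerpt.
// This sketch assumes all three fields are set.
func (c UnhealthyCondition) Validate() error {
	if len(c.Type) < 1 {
		return errors.New("type: minLength is 1")
	}
	if len(c.Status) < 1 {
		return errors.New("status: minLength is 1")
	}
	if !timeoutPattern.MatchString(c.Timeout) {
		return fmt.Errorf("timeout %q does not match the schema pattern", c.Timeout)
	}
	return nil
}

func main() {
	for _, timeout := range []string{"300s", "2h45m", "300"} {
		c := UnhealthyCondition{Type: "Ready", Status: "False", Timeout: timeout}
		fmt.Printf("%-6s valid=%t\n", timeout, c.Validate() == nil)
	}
}
```

The last value, "300", is rejected because the pattern requires a unit suffix.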

For example, these operators manage the applications by orchestrating operations based on changes to the CustomResource (DSL):

| Operator | Type/Language | What it does | Operations |
|---|---|---|---|
| cluster-etcd-operator | go | Manages etcd in OpenShift | Install, Monitor, Manage |
| prometheus-operator | go | Manages Prometheus monitoring on a Kubernetes cluster | Install, Monitor, Manage, Configure |
| cluster-authentication-operator | go | Manages OpenShift Authentication | Manage, Observe |

As developers, we follow a common development pattern:

  1. Implement the Operator Logic (Reconcile the operational state)
  2. Bake Container Image
  3. Create or regenerate Custom Resource Definition (CRD)
  4. Create or regenerate Role-based Access Control (RBAC)
    1. Role
    1. RoleBinding
  5. Apply Operator YAML

Note that we are not necessarily writing business logic, but rather operational logic.

There are some best practices we follow:

  1. Develop one operator per application
    1. One CRD per Controller. Created and Fit for Purpose. Less Contention.
    1. No Cross Dependencies.
  2. Use Kubernetes Primitives when Possible
  3. Be Backwards Compatible
  4. Compartmentalize features via multiple controllers
    1. Scale = one controller
    1. Backup = one controller
  5. Use asynchronous metaphors with the synchronous reconciliation loop
    1. Error, then immediate return, backoff and check later
    1. Use concurrency to split the processing / state
  6. Prune Kubernetes Resources when not used
  7. Apps Run when Operators are stopped
  8. Document what the operator does and how it does it
  9. Install in a single command

We use the Operator SDK, which is supported by Red Hat and the CNCF.

operator-sdk: Which one? Ansible and Go

Kubernetes is written in Go. Currently, OpenShift uses Go 1.17, and most operators are implemented in Go. Because the community has built many Go-based operators, there is much more support available on StackOverflow and in forums.

| | Ansible | Go |
|---|---|---|
| Kubernetes Support | Cached Clients | Solid, Complete and Rich Kubernetes Client |
| Language Type | Declarative – describe the end state | Imperative – describe how to get to the end state |
| Operator Type | Indirect, wrapped in the Ansible-Operator | Direct |
| Style | Systems Administration | Systems Programming |
| Performance | Link | ~4M at startup, single layer scratch image |
| Security | Expanded Surface Area | Limited Surface Area |

Go is well suited to concurrency and has strong memory management, and everything is baked into the executable deliverable, so it is in memory and ready to go. There are many alternative languages for writing operators, such as NodeJS, Rust, Java, C#, and Python. The OpenShift Operators are not necessarily built on the Operator SDK.
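
As a small taste of the concurrency point, the sketch below checks several items in parallel using goroutines and a `sync.WaitGroup`. The `checkAll` helper and the names passed to it are hypothetical; a real operator would call the cluster API instead of a local function.

```go
package main

import (
	"fmt"
	"sync"
)

// checkAll runs check for every name in its own goroutine and stores
// each result at that name's index, so no locking is needed.
func checkAll(names []string, check func(string) bool) []bool {
	results := make([]bool, len(names))
	var wg sync.WaitGroup
	for i, name := range names {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			results[i] = check(name)
		}(i, name)
	}
	wg.Wait() // block until every goroutine has finished
	return results
}

func main() {
	healthy := func(name string) bool { return name != "backup" }
	fmt.Println(checkAll([]string{"scale", "backup", "upgrade"}, healthy))
}
```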

Summary

We’ve run through a lot of detail on Operators and learned why we should go with Go operators.

Reference

  1. CNCF Operator White Paper https://github.com/cncf/tag-app-delivery/blob/main/operator-wg/whitepaper/Operator-WhitePaper_v1-0.md
  2. Operator pattern https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
  3. Operator SDK Framework https://sdk.operatorframework.io/docs/overview/
  4. Kubernetes Operators 101, Part 2: How operators work https://developers.redhat.com/articles/2021/06/22/kubernetes-operators-101-part-2-how-operators-work?source=sso#
  5. Build Your Kubernetes Operator with the Right Tool https://cloud.redhat.com/blog/build-your-kubernetes-operator-with-the-right-tool
  6. Build Your Kubernetes Operator with the Right Tool https://hazelcast.com/blog/build-your-kubernetes-operator-with-the-right-tool/
  7. Operator SDK Best Practices https://sdk.operatorframework.io/docs/best-practices/
  8. Google Best practices for building Kubernetes Operators and stateful apps https://cloud.google.com/blog/products/containers-kubernetes/best-practices-for-building-kubernetes-operators-and-stateful-apps
  9. Kubernetes Operator Patterns and Best Practises https://github.com/IBM/operator-sample-go
  10. Fast vs Easy: Benchmarking Ansible Operators for Kubernetes https://www.ansible.com/blog/fast-vs-easy-benchmarking-ansible-operators-for-kubernetes
  11. Debugging a Kubernetes Operator https://www.youtube.com/watch?v=8hlx6F4wLAA&t=21s
  12. Contributing to the Image Registry Operator https://github.com/openshift/cluster-image-registry-operator/blob/master/CONTRIBUTING.md
  13. Leszko’s OperatorCon Presentation
    1. YouTube https://www.youtube.com/watch?v=hTapESrAmLc
    1. GitHub Repo for Session: https://github.com/leszko/build-your-operator
