Updates for End of March 2024

Here are some great updates for the end of March 2024.

Sizing and configuring an LPAR for AI workloads

Sebastian Lehrig has a great introduction to CPU/AI/NUMA on Power10.

https://community.ibm.com/community/user/powerdeveloper/blogs/sebastian-lehrig/2024/03/26/sizing-for-ai

FYI: a new article is published – Improving the User Experience for Multi-Architecture Compute on IBM Power

More and more IBM® Power® clients are modernizing securely with lower risk and faster time to value with cloud-native microservices on Red Hat® OpenShift® running alongside their existing banking and industry applications on AIX, IBM i, and Linux. With the availability of Red Hat OpenShift 4.15 on March 19th, Red Hat and IBM introduced a long-awaited innovation called Multi-Architecture Compute that enables clients to mix Power and x86 worker nodes in a single Red Hat OpenShift cluster. With the release of Red Hat OpenShift 4.15, clients can now run the control plane for a Multi-Architecture Compute cluster natively on Power.

Some tips for setting up a Multi-Arch Compute Cluster

To set up a multi-arch compute cluster manually, without automation, follow this process:

  1. Set up the initial cluster with the multi payload on Intel or Power for the control plane.
  2. Open the network ports between the two environments.

Allow ICMP, TCP, and UDP traffic to flow in both directions.

  3. Configure the cluster.

a. Change the MTU if it differs between the networks

oc patch Network.operator.openshift.io cluster --type=merge --patch \
    '{"spec": { "migration": { "mtu": { "network": { "from": 1400, "to": 1350 } , "machine": { "to" : 9100} } } } }'

b. Limit the CSI drivers to a single architecture

oc annotate --kubeconfig /root/.kube/config ns openshift-cluster-csi-drivers \
  scheduler.alpha.kubernetes.io/node-selector=kubernetes.io/arch=amd64

c. Disable offloading (I do this in the ignition)

d. Move the imagepruner jobs to the architecture that makes the most sense

oc patch imagepruner/cluster -p '{ "spec" : {"nodeSelector": {"kubernetes.io/arch" : "amd64"}}}' --type merge

e. Move the ingress operator pods to the architecture that makes the most sense. If you want the ingress pods to run on Intel, patch the cluster.

oc edit IngressController default -n openshift-ingress-operator

Change ingresscontroller.spec.nodePlacement.nodeSelector to use kubernetes.io/arch: amd64 to move the workload to Intel only.

f. Use routing via host

oc patch network.operator/cluster --type merge -p \
  '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":true}}}}}'

Wait until the MCP has finished updating and has the latest MTU.

g. Download the ignition file and host it on the local network via HTTP.
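The ingress placement and the wait for the MCP can also be scripted instead of being done interactively. The following is a minimal sketch, not the tooling used for this post: the function names `pin_ingress_to_arch`, `wait_for_mcp`, and `show_cluster_mtu` are hypothetical, the 30m default timeout is an assumption, and the patch assumes that `nodePlacement.nodeSelector` on the IngressController takes a `matchLabels` map.

```shell
# Hypothetical helpers; names and default values are illustrative assumptions.

# Non-interactive alternative to `oc edit` for the default IngressController.
# $1 is the architecture label value, for example amd64 or ppc64le.
pin_ingress_to_arch() {
  arch="${1:-amd64}"
  oc patch ingresscontroller/default -n openshift-ingress-operator \
    --type=merge \
    -p "{\"spec\":{\"nodePlacement\":{\"nodeSelector\":{\"matchLabels\":{\"kubernetes.io/arch\":\"${arch}\"}}}}}"
}

# Block until every MachineConfigPool reports the Updated condition as True.
# $1 is an optional timeout (assumed default: 30m).
wait_for_mcp() {
  timeout="${1:-30m}"
  oc wait mcp --all --for=condition=Updated=True --timeout="${timeout}"
}

# Print the cluster network MTU currently reported by the cluster.
show_cluster_mtu() {
  oc get network.config cluster -o jsonpath='{.status.clusterNetworkMTU}'
}
```

For example, `pin_ingress_to_arch amd64 && wait_for_mcp 45m && show_cluster_mtu` would apply the placement, wait for the pools, and print the MTU for a final check.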

  4. Create a new VSI worker and point to the ignition in userdata
{
    "ignition": {
        "version": "3.4.0",
        "config": {
            "merge": [
                {
                    "source": "http://${ignition_ip}:8080/ignition/worker.ign"
                }
            ]
        }
    },
    "storage": {
        "files": [
            {
                "group": {},
                "path": "/etc/hostname",
                "user": {},
                "contents": {
                    "source": "data:text/plain;base64,${name}",
                    "verification": {}
                },
                "mode": 420
            },
            {
                "group": {},
                "path": "/etc/NetworkManager/dispatcher.d/20-ethtool",
                "user": {},
                "contents": {
                    "source": "data:text/plain;base64,aWYgWyAiJDEiID0gImVudjIiIF0gJiYgWyAiJDIiID0gInVwIiBdCnRoZW4KICBlY2hvICJUdXJuaW5nIG9mZiB0eC1jaGVja3N1bW1pbmciCiAgL3NiaW4vZXRodG9vbCAtLW9mZmxvYWQgZW52MiB0eC1jaGVja3N1bW1pbmcgb2ZmCmVsc2UgCiAgZWNobyAibm90IHJ1bm5pbmcgdHgtY2hlY2tzdW1taW5nIG9mZiIKZmkKaWYgc3lzdGVtY3RsIGlzLWZhaWxlZCBOZXR3b3JrTWFuYWdlci13YWl0LW9ubGluZQp0aGVuCnN5c3RlbWN0bCByZXN0YXJ0IE5ldHdvcmtNYW5hZ2VyLXdhaXQtb25saW5lCmZpCg==",
                    "verification": {}
                },
                "mode": 420
            }
        ]
    }
}

${name} is base64 encoded.
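Because `${name}` has to be base64 encoded before it is substituted into the userdata, the value can be prepared with standard tools. This is a minimal sketch that assumes a hypothetical hostname, `worker-0`, and a `base64` command that accepts `-d` for decoding:

```shell
# Hypothetical hostname; replace it with the name of the new worker.
hostname_plain="worker-0"

# printf is used so that no trailing newline ends up in the encoded value.
name="$(printf '%s' "${hostname_plain}" | base64)"
echo "${name}"        # prints d29ya2VyLTA=

# Decode again to confirm the round trip before templating the userdata.
decoded="$(printf '%s' "${name}" | base64 -d)"
echo "${decoded}"     # prints worker-0
```

The same `base64 -d` call can be used to inspect the `20-ethtool` payload above, which decodes to a NetworkManager dispatcher script that turns off tx-checksumming on `env2` when the interface comes up.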

  5. Post configuration tasks

a. Configure shared storage using the NFS provisioner and limit it to run on the architecture that hosts the NFS shared volumes.

b. Approve the CSRs for the workers. Do this carefully: it’s easy to lose count, because the pending list may also include CSRs from Machine updates.

  6. Check the cluster operators and nodes; they should be up and working.
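The CSR approval and the final health check can be sketched as shell helpers. This is an illustration under stated assumptions, not the exact procedure used here: the function names `pending_csrs`, `approve_pending_csrs`, and `check_cluster` are hypothetical, and a CSR is treated as pending when it has no status yet.

```shell
# Hypothetical helpers; the function names are illustrative assumptions.

# Print the names of CSRs that have no status yet, that is, are still pending.
pending_csrs() {
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}'
}

# Approve pending CSRs one at a time so each approval shows up in the output.
approve_pending_csrs() {
  pending_csrs | while read -r csr; do
    if [ -n "${csr}" ]; then
      oc adm certificate approve "${csr}" </dev/null
    fi
  done
}

# Show operator status, then every node together with its architecture label.
check_cluster() {
  oc get clusteroperators
  oc get nodes -L kubernetes.io/arch
}
```

Running `pending_csrs` on its own first gives a chance to review the list before approving anything. A new worker typically raises a client CSR followed by a serving CSR, so `approve_pending_csrs` may need a second run before the node shows up as Ready in `check_cluster`.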
