RoCE (RDMA over Converged Ethernet): Demo

The following are notes from a research project I investigated, along with what I would do, saved here for others to take advantage of:

To demonstrate RoCE (RDMA over Converged Ethernet) usage across nodes on Red Hat OpenShift, you need a container image that includes the RDMA core libraries, OFED drivers, and performance testing tools like perftest (which provides ib_write_bw, ib_send_lat, etc.).

Based on the Red Hat learning path, here is an optimized Dockerfile for Podman or Docker, plus the configuration needed to run it.

1. The Podman/Docker Image

This Dockerfile uses Red Hat Universal Base Image (UBI) 9 and installs the essential RDMA stack and the perftest suite.

# Use RHEL 9 UBI as the base
FROM registry.access.redhat.com/ubi9/ubi:latest

LABEL maintainer="OpenShift RoCE Demo"

# Install RDMA core libraries, drivers, and performance testing tools
# 'perftest' contains the ib_write_bw, ib_read_bw, etc. commands
RUN dnf install -y \
    libibverbs \
    libibverbs-utils \
    rdma-core \
    iproute \
    pciutils \
    ethtool \
    perftest \
    && dnf clean all

# Set working directory
WORKDIR /root

# Default command to keep the container running so you can 'exec' into it
CMD ["sleep", "infinity"]
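
Before moving on, it can help to confirm that the image really contains the tools used later in this walkthrough. The sketch below is one way to check, meant to be run in a shell inside the container. The helper name missing_tools is made up for this example, and the tool list assumes the packages above provide ibv_devinfo, ib_write_bw, ib_send_lat, ethtool, and ip.

```shell
#!/bin/sh
# missing_tools: print the name of every argument that is NOT found on PATH.
# Hypothetical helper for this demo, not part of any RDMA package.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example (run inside the container):
#   missing_tools ibv_devinfo ib_write_bw ib_send_lat ethtool ip
# Empty output means every listed tool was found.
```

If a name is printed, the corresponding package was probably not installed from the repositories available to the build.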

2. Build and Push the Image

Use Podman to build the image and push it to a registry accessible by your OpenShift cluster (e.g., Quay.io or your internal OpenShift registry).

# Build the image
podman build -t quay.io/<your-username>/roce-test:latest .

# Push the image
podman push quay.io/<your-username>/roce-test:latest

3. Demonstrating Cross-Node Usage (The Test)

To prove RoCE is working across nodes, the pods must bypass the cluster's default SDN (software-defined network), either with host networking or with a secondary network attached through Multus. For a quick demonstration, this walkthrough uses hostNetwork: true.

Step A: Deploy two Pods on different nodes

Create a file named roce-demo.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: roce-server
  labels:
    app: roce-test
spec:
  hostNetwork: true # Required to access the host's RDMA/RoCE hardware
  containers:
  - name: main
    image: quay.io/<your-username>/roce-test:latest
    securityContext:
      privileged: true # Required for RDMA device access
---
apiVersion: v1
kind: Pod
metadata:
  name: roce-client
  labels:
    app: roce-test
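spec:
  hostNetwork: true
  affinity:
    podAntiAffinity: # Keep this pod off any node that already runs an app: roce-test pod
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: roce-test
        topologyKey: kubernetes.io/hostname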
  containers:
  - name: main
    image: quay.io/<your-username>/roce-test:latest
    securityContext:
      privileged: true
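
Once the manifest has been applied (for example with oc apply -f roce-demo.yaml), it is worth confirming that the two pods actually landed on different nodes before running the benchmark. The following is a minimal sketch of that check. The function name nodes_differ is hypothetical; the oc get ... -o jsonpath calls in the usage comment read each pod's .spec.nodeName.

```shell
#!/bin/sh
# nodes_differ SERVER_NODE CLIENT_NODE
# Report whether the two pods were scheduled on different nodes.
# Hypothetical helper for this demo.
# Returns 0 if the nodes differ, 1 if they match, 2 if either name is empty.
nodes_differ() {
    server_node=$1
    client_node=$2
    if [ -z "$server_node" ] || [ -z "$client_node" ]; then
        echo "unknown: a pod has not been scheduled yet"
        return 2
    fi
    if [ "$server_node" = "$client_node" ]; then
        echo "same node: $server_node (the test would not cross the network)"
        return 1
    fi
    echo "different nodes: $server_node and $client_node"
}

# Example:
#   nodes_differ \
#     "$(oc get pod roce-server -o jsonpath='{.spec.nodeName}')" \
#     "$(oc get pod roce-client -o jsonpath='{.spec.nodeName}')"
```

If the check reports the same node, delete the pods and reschedule them before reading anything into the benchmark numbers.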

Step B: Run the Performance Benchmark

  1. Identify the IP of the Server Node:
oc get pod roce-server -o wide
# Note the IP (since it's hostNetwork, this is the Node's IP)
  2. Start the Server:
oc exec -it roce-server -- ib_write_bw -d <rdma_device_name> -a

(Note: Use ibv_devinfo inside the pod to find your device name, e.g., mlx5_0)

  3. Run the Client (from the other pod):
oc exec -it roce-client -- ib_write_bw -d <rdma_device_name> <server_ip> -a
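
Picking the device name out of ibv_devinfo by eye gets tedious on hosts with several adapters. The sketch below filters the output down to devices that have an active port with an Ethernet link layer, which is what a RoCE test needs. The function name roce_devices is hypothetical, and the parsing assumes the usual "key: value" layout in which each port's state line appears before its link_layer line.

```shell
#!/bin/sh
# roce_devices: read 'ibv_devinfo' output on stdin and print the hca_id of
# every device that has a port with state PORT_ACTIVE and link_layer Ethernet.
# Hypothetical helper; assumes 'state:' precedes 'link_layer:' for each port.
roce_devices() {
    awk '
        $1 == "hca_id:" { dev = $2; active = 0 }            # new device block
        $1 == "state:"  { active = ($2 == "PORT_ACTIVE") }  # new port block
        $1 == "link_layer:" && $2 == "Ethernet" && active && !(dev in seen) {
            seen[dev] = 1   # print each device only once
            print dev
        }
    '
}

# Example:
#   oc exec roce-server -- ibv_devinfo | roce_devices
```

Any name it prints can be passed to the -d option of ib_write_bw on that node.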

How this demonstrates RoCE:

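  • Zero-Copy: The ib_write_bw tool performs memory-to-memory transfers in which the NIC reads and writes application buffers directly, so the payload does not pass through the kernel's TCP/IP stack.
  • Performance: If RoCE is correctly configured in your OpenShift cluster (via the Node Network Configuration Policy), you should see bandwidth near the line rate (e.g., ~95Gbps on a 100G link). ib_write_bw reports bandwidth only; to compare latency against standard TCP over Ethernet, run one of the perftest latency tools such as ib_send_lat.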
  • Verification: You can run ethtool -S <interface> on the host while the test is running to see the rdma_ counters increasing, confirming the traffic is not using standard TCP.
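
The counter check in the last bullet is easier to read as a before/after comparison than by watching raw numbers scroll past. The sketch below compares two saved ethtool -S snapshots and prints only the counters whose names contain "rdma" and whose values grew. The function name rdma_counter_delta is hypothetical, and the exact counter names depend on the NIC driver.

```shell
#!/bin/sh
# rdma_counter_delta BEFORE AFTER
# Compare two saved 'ethtool -S <interface>' snapshots and print every counter
# whose name contains "rdma" and whose value increased, with the increase.
# Hypothetical helper for this demo.
rdma_counter_delta() {
    awk '
        # Field 1 is the counter name with a trailing colon; strip the colon.
        { name = $1; sub(/:$/, "", name) }
        # Ignore the header line and any counter without "rdma" in its name.
        name !~ /rdma/ || NF != 2 { next }
        # First file: remember the starting value.
        NR == FNR { before[name] = $2 + 0; next }
        # Second file: report counters that grew.
        (name in before) && ($2 + 0) > before[name] {
            printf "%s +%.0f\n", name, ($2 + 0) - before[name]
        }
    ' "$1" "$2"
}

# Example (on the host, or in the hostNetwork pod):
#   ethtool -S <interface> > before.txt
#   ... run the ib_write_bw test ...
#   ethtool -S <interface> > after.txt
#   rdma_counter_delta before.txt after.txt
```

Empty output after a completed test suggests the traffic did not go over the RDMA path on that interface, or that the driver names its counters differently.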