Troubleshooting Isolated CPUs for low latency workloads on OpenShift
Table of Contents
Introduction
When we talk about low latency workloads, there are some requirements at Linux Kernel settings and how we place those workloads to run on the hardware. It is important that the workloads run as close as possible to the hardware they are going to interact with, like a network interface. Hence we will need to ensure that these workloads run in CPUs within the same NUMA node to which the network interface is connected, to avoid cross NUMA connections.
In OpenShift there is a Cluster Operator (it comes by default with OpenShift) that allows configuring the cluster nodes in an opinionated way based on Kubernetes CR. This CR is called PerformanceProfile and it is described in the next section.
Even though we have our OpenShift cluster configured with a PerformanceProfile, we need to deploy our application accordingly with those settings, and we are going to describe in this article how to review that our application is running on isolated CPUs in the desired NUMA node, and that there are no task switches or undesired IRQ interruptions.
In this article we will explain how to configure the reserved CPUs for the OpenShift control plane and the isolated CPUs to run the workload. Later on we are going to deploy a DPDK application based on testpmd, and we are going to go through some steps to validate that the workloads are running as expected in the OpenShift cluster.
PerformanceProfile
The PerformanceProfile is a Custom Resource (CR) provided by the Performance Addon Operator (now part of the Node Tuning Operator) in OpenShift. It allows cluster administrators to configure nodes for low-latency workloads by applying a set of performance-focused kernel and system configurations.
Key features of the PerformanceProfile include:
- CPU Isolation: Segregates CPUs into reserved (for system/control plane tasks) and isolated (dedicated for workload execution) sets
- NUMA Awareness: Ensures workloads run on CPUs that share the same NUMA node as their hardware resources
- Kernel Tuning: Applies kernel parameters like
isolcpus,nohz_full, andrcu_nocbsto minimize interruptions on isolated CPUs - IRQ Affinity: Routes hardware interrupts away from isolated CPUs to prevent latency spikes
- Huge Pages: Configures huge pages for memory-intensive applications like DPDK
By defining a PerformanceProfile, OpenShift automatically configures the underlying nodes with the necessary optimizations for deterministic, low-latency application performance.
Configuring reserved/isolated CPUs
As described above, different aspects can be configured from the PerformanceProfile, but in this article we are going to describe how to set the right values for the reserved and isolated CPUs.
The reserved CPUs are the ones used by the system and the OpenShift control plane. This is important to ensure that in case the user workloads start consuming extra resources, they do not affect the cluster itself. Also the other way around, if the Control Plane needs more resources, we need to ensure the workloads have room to keep working.
Depending on the use case, we are going to configure more or less reserved and isolated CPUs, and also depending on whether we are configuring only a group of workers or also the control plane nodes. For this article we are going to discuss only the creation of a PerformanceProfile for workers, for which we are going to reserve 4 vCPUs (hyperthreading is enabled) on both NUMA nodes. It is important to highlight that we are working on bare metal.
The first thing we have to do is check the distribution of the cores and the siblings between NUMA nodes, to ensure we reserve the first core with its sibling plus an additional core. The below command shows the distribution of the cores (this command is run within the node):
sh-5.1# lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 3800.0000 800.0000 1800.0000
1 1 1 1 64:64:64:1 yes 3800.0000 800.0000 2165.7891
2 0 0 2 19:19:19:0 yes 3800.0000 800.0000 1800.0000
3 1 1 3 83:83:83:1 yes 3800.0000 800.0000 1800.0000
4 0 0 4 8:8:8:0 yes 3800.0000 800.0000 2260.8970
5 1 1 5 72:72:72:1 yes 3800.0000 800.0000 1800.0000
6 0 0 6 27:27:27:0 yes 3800.0000 800.0000 1800.0000
# output omitted ...
64 0 0 0 0:0:0:0 yes 3800.0000 800.0000 3800.0000
65 1 1 1 64:64:64:1 yes 3800.0000 800.0000 2058.6201
66 0 0 2 19:19:19:0 yes 3800.0000 800.0000 3800.0000
67 1 1 3 83:83:83:1 yes 3800.0000 800.0000 2275.6350
68 0 0 4 8:8:8:0 yes 3800.0000 800.0000 3800.0000
69 1 1 5 76:76:76:1 yes 3800.0000 800.0000 3800.0000
70 0 0 6 27:27:27:0 yes 3800.0000 800.0000 3800.0000
# output omitted ...
123 1 1 59 84:84:84:1 yes 3800.0000 800.0000 3800.0000
124 0 0 60 15:15:15:0 yes 3800.0000 800.0000 3800.0000
125 1 1 61 72:72:72:1 yes 3800.0000 800.0000 3800.0000
126 0 0 62 22:22:22:0 yes 3800.0000 800.0000 3800.0000
127 1 1 63 80:80:80:1 yes 3800.0000 800.0000 3800.0000
In the output above we can see in the first column the vCPU id, in the second column the NUMA node, and in the fourth column the Core id. We must use the first vCPU and its sibling, which is vCPU 64. We add three more vCPUs for the reservation, then we are going to reserve 0-1 and 64-65. The rest will be configured within the isolated CPUs, and the sum of both must match the total amount of CPUs, otherwise it will fail.
Below is the PerformanceProfile used for this article. Even though there are other parameters, we are going to cover only the part of the reserved/isolated CPUs configuration here.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
annotations:
kubeletconfig.experimental: |
{"systemReserved": {"memory": "8Gi"}, "topologyManagerScope": "pod"}
name: lowlatency
spec:
additionalKernelArgs:
- nohz_full='4-63,69-127'
- nohz_full='2-63,66-127'
cpu:
isolated: 2-63,66-127
reserved: 0-1,64-65
globallyDisableIrqLoadBalancing: false
hugepages:
defaultHugepagesSize: 1G
pages:
- count: 6
node: 0
size: 1G
- count: 6
node: 1
size: 1G
kernelPageSize: 4k
machineConfigPoolSelector:
pools.operator.machineconfiguration.openshift.io/lowlatency: ""
net:
devices:
- interfaceName: eno12399
- interfaceName: eno12419
userLevelNetworking: true
nodeSelector:
node-role.kubernetes.io/lowlatency: ""
numa:
topologyPolicy: single-numa-node
realTimeKernel:
enabled: false
workloadHints:
highPowerConsumption: false
perPodPowerManagement: true
realTime: false
It is not covered in this article, but we must create a MachineConfigPool called lowlatency in our cluster where this PerformanceProfile will be applied. Once applied, we can check as shown in the command below that the status of the MachineConfigPool of all the nodes belonging to it is UPDATED True.
$oc get mcp lowlatency
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
lowlatency rendered-lowlatency-0c3b91cfa6eaa7b559a2eb994cd2c4f1 True False False 3 3 3 0 18d
Deploy testpmd workload
Now that we have the workers configured with the PerformanceProfile, we can deploy a pod to run a DPDK application with CPU pinning. The below pod manifest is an example that deploys a testpmd container.
It is important to highlight some aspects of the below Pod manifests:
- In order to allow the scheduler to use the QoS, we have to set the same amount of
limitsandresourceswithin the.spec.containers.resources. - The field
runtimeClassNamemust be set withperformance-plus the name of thePerformanceProfile. In this case the value isperformance-lowlatency. - The annotations regarding the irq load balancing is important to ensure the pined CPUs will not run interruptions.
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/networks: '[
{
"name": "net1",
"namespace": "testpmd"
}
]'
cpu-load-balancing.crio.io: disable
cpu-quota.crio.io: disable
irq-load-balancing.crio.io: "disable"
labels:
app: testpmd
name: testpmd
namespace: testpmd
spec:
tolerations:
- key: "high-throughput"
value: "true"
effect: "NoSchedule"
containers:
- command:
- /bin/bash
- -c
- sleep infinity
image: quay.io/javierpena/dpdk:21.11.2_c9s
imagePullPolicy: Always
name: pktgen
resources:
limits:
cpu: "10"
hugepages-1Gi: 4Gi
memory: 2Gi
requests:
cpu: "10"
hugepages-1Gi: 4Gi
memory: 2Gi
securityContext:
privileged: true
capabilities:
add:
- IPC_LOCK
- SYS_RESOURCE
- NET_RAW
- NET_ADMIN
runAsUser: 0
volumeMounts:
- mountPath: /mnt/huge
name: hugepages
- name: modules
mountPath: /lib/modules
terminationGracePeriodSeconds: 5
volumes:
- name: modules
hostPath:
path: /lib/modules
- emptyDir:
medium: HugePages
name: hugepages
runtimeClassName: performance-lowlatency
We apply the manifests above to run our workload in the testpmd Namespace.
NOTE
In this example, SR-IOV is already configured to attach the VFs to the pod. That is not included in this article to focus only on the troubleshooting of the CPU resources.
Troubleshooting isolated CPUs
- Find worker where the workload is running
$oc -n testpmd get pods testpmd -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
testpmd 1/1 Running 0 5d14h 10.129.2.12 worker3.ocp1.e5gc.bos2.lab <none> <none>
- Get the ContainerId of pod
$oc debug node/worker3.ocp1.e5gc.bos2.lab
Starting pod/worker3ocp1e5gcbos2lab-debug-vb2t4 ...
To use host binaries, run `chroot /host`. Instead, if you need to access host namespaces, run `nsenter -a -t 1`.
Pod IP: 192.168.82.73
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1#
sh-5.1# crictl ps | grep testpmd
3c0e19d3e6964 quay.io/javierpena/dpdk@sha256:bc647e696a16332d7c129d33ccccf40a157f88acd644eff7f6ce148e206b1d43 5 days ago Running pktgen 0 dc474cfee8a7e testpmd
- Get PID of the container
[root@worker3 ~]# crictl inspect 3c0e19d3e6964 | jq .info.pid
2138345
- Get CPU pinned
[root@worker3 ~]# crictl inspect 3c0e19d3e6964 | jq .status.resources.linux.cpusetCpus
"4,6,8,10,12,68,70,72,74,76"
- Check if there are task switches associated with given PID. The important thing is to ensure there are no changes in the switches. Only the processes from the pod should be in there and no new task must be scheduled.
watch grep switches /proc/2138345/task/2138345/sched
nr_switches : 6
nr_voluntary_switches : 6
nr_involuntary_switches : 0
In case we see new switches we can check which processes are being scheduled in the pinned CPU beside the ones from the pod with the below command. For this it is required to exit from the chroot /host or run toolbox to have the perf tool installed in an OpenShift node.
perf top --sort comm,dso -C <CPU> -z
- Review the interruptions in the pined cores
CPUMAX=`cat /proc/cpuinfo | grep processor | tail -n 1 | egrep -o [0-9]*$`
$ echo === NAME of IRQs for every CPU ===
$ for C in `seq 0 $CPUMAX` ; do
echo -n CPU${C}:
IRQS=`grep -H ${C} /proc/irq/*/effective_affinity_list | grep :${C}$ | cut -f 4 -d '/'`
for I in $IRQS ; do
IRQNAME=`cat /proc/interrupts | grep \ ${I}\: | awk '{print $(NF)}'`
echo -n " "${IRQNAME}
done
echo
done
=== NAME of IRQs for every CPU ===
CPU0: timer
CPU1:
CPU2: AMD-Vi
CPU3: AMD-Vi
CPU4: AMD-Vi
CPU5: AMD-Vi
...
CPU71:
CPU72: megasas0-msix80
CPU73: megasas0-msix81
CPU74: megasas0-msix82
CPU75: megasas0-msix83
Conclusions
In this article we have explored the essential steps for configuring and troubleshooting CPU isolation for low-latency workloads on OpenShift using the PerformanceProfile custom resource. Proper CPU isolation is critical for achieving deterministic performance in telco and high-performance computing workloads, where even minimal jitter and latency can significantly impact application behavior.
We demonstrated how to configure reserved and isolated CPUs by analyzing the NUMA topology and core distribution, ensuring that system tasks and user workloads are properly segregated. The PerformanceProfile provides a declarative way to apply complex kernel tuning and resource isolation automatically across the cluster nodes, simplifying what would otherwise be a complex manual configuration process.
Through the testpmd deployment example, we showed how to properly configure a DPDK application with CPU pinning and NUMA awareness. The troubleshooting steps outlined - from identifying the worker node and container PID, to monitoring task switches and IRQ affinity - provide a systematic approach to validate that CPU isolation is working as expected.
Key takeaways:
- Reserve adequate CPUs (including siblings) on both NUMA nodes for system and control plane operations
- Ensure the sum of reserved and isolated CPUs matches the total CPU count
- Use CPU pinning annotations and appropriate runtime class for workload pods
- Monitor task switches to detect unwanted scheduling on isolated cores
- Verify IRQ affinity to prevent hardware interrupts on isolated CPUs
Mastering these techniques enables you to deploy and maintain high-performance, low-latency workloads on OpenShift with confidence that your CPU isolation configuration is functioning correctly.