
AI clusters have reshaped data center traffic, shifting the dominant pattern from north-south traffic between applications and the internet to east-west traffic between GPUs during model training and checkpointing. This shift exposes new bottlenecks: CPUs now sit on the critical path for tasks like encapsulation, flow control, and security, adding latency and variability that make it harder to keep GPUs fully utilized. As a result, Data Processing Units (DPUs) and SmartNICs have evolved from optional accelerators into essential infrastructure, a point NVIDIA CEO Jensen Huang underscored at GTC 2021.
“Data center is the new unit of computing,” Huang told The Next Platform in an interview. “There’s no way you’re going to do that on the CPU. So you have to move the networking stack off. You want to move the security stack off, and you want to move the data processing and data movement stack off.”
NVIDIA claims its Spectrum-X Ethernet fabric, which combines congestion control, adaptive routing, and telemetry, can deliver up to 48% higher storage read bandwidth for AI workloads. In effect, the network interface becomes a data-processing layer in its own right, and offload becomes necessary for predictable performance.
Where AI Fabric Traffic and Reliability Become Significant
AI workloads operate synchronously, meaning when one node experiences congestion, all GPUs in the cluster wait. Meta reported that routing-induced flow collisions and uneven traffic distribution in early RoCE deployments “degraded the training performance up to more than 30%,” prompting changes in routing and collective tuning. These issues aren’t purely architectural; they emerge from how east-west flows behave at scale.
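The flow-collision failure mode Meta describes can be reproduced in miniature. With static ECMP, a hash of the 5-tuple pins each flow to one uplink, so a handful of long-lived elephant flows can land on the same link while others sit idle, and the whole synchronous collective runs at the slowest flow's rate. The sketch below uses hypothetical flows and link counts purely to illustrate the mechanism:

```python
import hashlib

LINKS = 4
# Hypothetical 5-tuples for eight long-lived RDMA flows (illustrative only).
flows = [("10.0.0.%d" % i, "10.0.1.%d" % (i % 3), 4791, 4791 + i, "UDP")
         for i in range(8)]

def ecmp_link(five_tuple, n_links=LINKS):
    """Pick an uplink the way a static-hash ECMP switch would:
    hash the 5-tuple, take it modulo the number of links."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links

placement = {}
for ft in flows:
    placement.setdefault(ecmp_link(ft), []).append(ft)

for link in range(LINKS):
    n = len(placement.get(link, []))
    # Several elephant flows pinned to one link each get 1/n of it,
    # while other links may carry nothing at all.
    share = f"{1 / n:.2f}x per-flow bandwidth" if n else "idle"
    print(f"link {link}: {n} flow(s), {share}")
```

Because the hash is static, the imbalance persists for the lifetime of the flows, which is why adaptive routing and endpoint-aware congestion control matter for collectives.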
The Evolution of Flow Control in AI Fabrics
InfiniBand has long provided credit-based link-level flow control (per-VL) to guarantee lossless delivery and prevent buffer overruns, a hardware mechanism built into the link layer. Ethernet is evolving along similar lines through the Ultra Ethernet Consortium (UEC): its Ultra Ethernet Transport (UET) work introduces endpoint/host-aware transport, congestion management guided by real-time feedback, and coordination between endpoints and switches. This explicitly moves more congestion handling and telemetry into the NIC/endpoint.
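The credit mechanism can be illustrated with a toy model (not the actual InfiniBand per-VL state machine): the receiver advertises its buffer slots as credits at link bring-up, the sender stalls rather than drops when credits run out, and credits flow back as buffers drain.

```python
from collections import deque

class CreditedLink:
    """Toy model of credit-based link-level flow control (InfiniBand-style).
    The sender may only transmit while it holds credits, so the receive
    buffer can never overflow; loss is prevented by construction."""
    def __init__(self, rx_buffer_slots=4):
        self.credits = rx_buffer_slots   # advertised at link bring-up
        self.rx_queue = deque()
        self.dropped = 0                 # stays 0: stalls replace drops

    def send(self, packet):
        if self.credits == 0:
            return False                 # sender stalls instead of dropping
        self.credits -= 1
        self.rx_queue.append(packet)
        return True

    def drain_one(self):
        if self.rx_queue:
            self.rx_queue.popleft()
            self.credits += 1            # credit returned to the sender

link = CreditedLink(rx_buffer_slots=4)
sent = sum(link.send(p) for p in range(10))  # burst of 10 into 4 buffers
print(f"accepted {sent}, stalled {10 - sent}, dropped {link.dropped}")
# → accepted 4, stalled 6, dropped 0
```

The six unsent packets are back-pressured, not lost; UET-style transports aim for a similar no-loss outcome on Ethernet, but driven by end-to-end congestion feedback rather than per-hop credits alone.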
InfiniBand remains the benchmark for deterministic fabric behavior, but Ethernet-based AI fabrics are closing the gap through UET and SmartNIC/DPU innovation. Network professionals must evaluate silicon capabilities, not just link speeds: reliability is now determined by telemetry, congestion control, and offload support at the NIC/DPU level.
Offload Pattern: Encapsulation and Stateless Pipeline Processing
AI clusters at cloud and enterprise scale rely on overlays such as VXLAN and GENEVE to segment traffic across tenants and domains. Traditionally, these encapsulation tasks run on the CPU. DPUs and SmartNICs offload encapsulation, hashing, and flow matching directly into hardware pipelines, reducing jitter and freeing CPU cycles.
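What the NIC pipeline does in hardware can be sketched on the host for intuition: VXLAN encapsulation is an 8-byte header (RFC 7348) prepended to the inner Ethernet frame, carried over UDP on port 4789. A minimal sketch:

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned UDP destination port for VXLAN

def vxlan_encap(inner_frame: bytes, vni: int) -> bytes:
    """Prepend an RFC 7348 VXLAN header (8 bytes) to an inner Ethernet
    frame. Flags byte 0x08 sets the I bit, marking the VNI as valid;
    the VNI is 24 bits wide, allowing ~16M tenant segments."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Layout: flags(1) + reserved(3) + VNI(3) + reserved(1)
    header = struct.pack("!B3xI", 0x08, vni << 8)
    return header + inner_frame

pkt = vxlan_encap(b"\xaa" * 60, vni=5001)  # 60-byte dummy inner frame
print(len(pkt), hex(pkt[0]), int.from_bytes(pkt[4:7], "big"))
# → 68 0x8 5001
```

At tens of millions of packets per second, even this trivial prepend, plus the outer UDP/IP headers, hashing, and checksum work, consumes host cycles that a hardware pipeline performs at line rate with no jitter contribution from the CPU scheduler.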
NVIDIA documents VXLAN hardware offloads on its NICs/DPUs and claims Spectrum-X delivers material AI-fabric gains, including up to 48% higher storage read bandwidth in partner tests and more than 4x lower latency versus traditional Ethernet in Supermicro benchmarking. Offload for VXLAN and stateless flow processing is supported across NVIDIA BlueField, AMD Pensando Elba, and Marvell OCTEON 10 platforms.
Competitive Landscape in DPU/SmartNIC Offloads
From a competitive perspective:
- NVIDIA integrates BlueField tightly with its DOCA (Data Center Infrastructure-on-a-Chip Architecture) SDK for GPU-accelerated AI workloads.
- AMD Pensando offers P4 programmability and integration with Cisco Smart Switches.
- Intel IPUs pair Arm-based compute complexes with programmable pipelines for transport programmability.
Encapsulation offload is no longer merely a performance enhancer; it is foundational to predictable AI fabric behavior.
Offload Pattern: Inline Encryption and East-West Security
As AI models cross sovereign boundaries and multi-tenant clusters become common, encryption of east-west traffic has become mandatory. Encrypting that traffic on the host CPU, however, carries a measurable performance penalty.
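The penalty is easy to estimate from first principles. The figures below are stated assumptions for illustration, not measurements: a 400 Gbps link, 1500-byte packets, 3 GHz cores, and a ballpark one CPU cycle per byte for software AES-GCM.

```python
# Back-of-envelope budget for host-CPU IPsec at AI-fabric line rates.
# Assumptions (illustrative only): 400 Gbps link, 1500-byte packets,
# 3 GHz cores, ~1 cycle/byte for software AES-GCM.
LINE_RATE_BPS = 400e9
PKT_BYTES = 1500
CORE_HZ = 3e9
CYCLES_PER_BYTE = 1.0

pps = LINE_RATE_BPS / (PKT_BYTES * 8)   # packets per second on the wire
budget = CORE_HZ / pps                  # cycles one core has per packet
cost = PKT_BYTES * CYCLES_PER_BYTE      # cycles to encrypt one packet
cores = cost / budget                   # cores consumed by crypto alone

print(f"{pps / 1e6:.1f} Mpps, {budget:.0f} cycles/pkt per core, "
      f"~{cores:.0f} cores for encryption alone")
# → 33.3 Mpps, 90 cycles/pkt per core, ~17 cores for encryption alone
```

Under these assumptions, a core has roughly 90 cycles per packet while encryption needs around 1500, which is why inline crypto engines on the NIC/DPU, rather than host software, are the practical path at these rates.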
In a joint VMware–6WIND–NVIDIA validation, BlueField-2 DPUs offloaded IPsec for a 25 Gbps testbed (2×25 GbE BlueField-2), demonstrating higher throughput and lower host-CPU use for the 6WIND vSecGW on vSphere 8.
Marvell positions its OCTEON 10 DPUs for inline security offload in AI data centers, citing integrated crypto accelerators capable of 400+ Gbps IPsec/TLS (Marvell OCTEON 10 DPU Family media deck); the company also highlights growing AI-infrastructure demand in its investor communications.
Encryption offload is shifting from optional to required as AI becomes regulated infrastructure.
Offload Pattern: Microsegmentation and Distributed Firewalling
GPU servers are often deployed in dense, multi-tenant environments, making microsegmentation and distributed firewalling crucial for security. Traditional CPUs can struggle with the high-speed, low-latency requirements of these tasks. DPUs and SmartNICs can offload these functions, providing better performance and security.
For example, NVIDIA’s BlueField-2 DPUs support hardware offload for microsegmentation, allowing for fine-grained traffic control and improved security. This capability is particularly valuable in AI data centers, where workloads are often sensitive and require robust security measures.
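Conceptually, the offloaded policy is a match-action table: the DPU matches each packet's header fields against tenant rules in hardware and applies the first hit, with no host CPU on the data path. A minimal first-match sketch with hypothetical addresses and policy:

```python
import ipaddress
from typing import NamedTuple

class Rule(NamedTuple):
    src: ipaddress.IPv4Network
    dst: ipaddress.IPv4Network
    dport: int    # -1 matches any destination port
    action: str   # "allow" or "deny"

# Hypothetical tenant policy: GPU pod A may reach pod B's RoCE port only.
RULES = [
    Rule(ipaddress.ip_network("10.1.0.0/24"),
         ipaddress.ip_network("10.2.0.0/24"), 4791, "allow"),
    Rule(ipaddress.ip_network("0.0.0.0/0"),
         ipaddress.ip_network("0.0.0.0/0"), -1, "deny"),  # default deny
]

def classify(src_ip: str, dst_ip: str, dport: int) -> str:
    """First-match classification, as a DPU's TCAM/flow-cache pipeline
    would perform per packet."""
    s, d = ipaddress.ip_address(src_ip), ipaddress.ip_address(dst_ip)
    for r in RULES:
        if s in r.src and d in r.dst and r.dport in (-1, dport):
            return r.action
    return "deny"

print(classify("10.1.0.7", "10.2.0.9", 4791))  # → allow (RoCE between pods)
print(classify("10.1.0.7", "10.2.0.9", 22))    # → deny (SSH blocked)
```

In hardware the table lives in TCAM or a flow cache and evaluates at line rate; the default-deny final rule is what makes the segmentation a real security boundary rather than a best-effort filter.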
Real-World Use Cases
Consider a large-scale AI training job that spans multiple data centers. Without DPU/SmartNIC offloads, the CPUs would be overwhelmed by tasks like encapsulation, encryption, and microsegmentation. This would introduce latency, reduce throughput, and increase the risk of security breaches. With offloads in place, these tasks are handled efficiently, ensuring smooth operation and enhanced security.
Conclusion
The shift towards AI fabrics in data centers has made DPUs and SmartNICs indispensable. They offload critical tasks from CPUs, reducing latency, improving performance, and enhancing security. As AI workloads become more complex and security-sensitive, the role of DPUs/SmartNICs will only grow. Network professionals must stay abreast of these developments to ensure their AI fabrics run smoothly and securely.
FAQ
What are DPUs and SmartNICs, and why are they important in AI fabrics?
DPUs (Data Processing Units) and SmartNICs (smart Network Interface Cards) are specialized hardware components designed to offload work from CPUs, reducing latency and improving performance. In AI fabrics, they handle tasks like encapsulation, flow control, security, and encryption, freeing CPUs to focus on other critical functions.
How do DPUs and SmartNICs enhance AI fabric performance?
DPUs and SmartNICs enhance AI fabric performance by offloading tasks that were traditionally handled by CPUs. This reduces latency, improves throughput, and makes fabric behavior more predictable. In vendor benchmarks, for example, offloading encapsulation and fabric-level congestion handling has been credited with more than 4x lower latency and up to 48% higher storage read bandwidth.
What are the key offload patterns in AI fabrics?
The key offload patterns in AI fabrics include encapsulation and stateless pipeline processing, inline encryption and east-west security, and microsegmentation and distributed firewalling. Each of these patterns addresses specific challenges in AI workloads, from performance optimization to security enhancement.
How do DPUs and SmartNICs contribute to AI fabric security?
DPUs and SmartNICs contribute to AI fabric security by offloading tasks like encryption and microsegmentation. This reduces the load on CPUs, improves performance, and ensures that security measures are applied consistently and efficiently. For instance, inline encryption offload can handle high-speed traffic securely without overwhelming the host CPU.
What are the benefits of using DPUs and SmartNICs in AI data centers?
Using DPUs and SmartNICs in AI data centers offers several benefits, including reduced latency, improved throughput, enhanced security, and better resource utilization. They free up CPUs to focus on other critical tasks, ensuring that AI workloads run smoothly and efficiently. Additionally, they support innovations like P4 programmability and integration with smart switches, providing flexibility and scalability.