
Arcfra AECP 6.3 Tech Insights: Why RDMA Needs Cross-NIC HA?

Published by Arcfra Team

RDMA (Remote Direct Memory Access) lets one node read and write another node's memory directly over the network, and it is designed for high-performance scenarios. By bypassing the operating system kernel and its protocol stack on the data path, RDMA transfers data directly between nodes and offers three core advantages: microsecond-level latency, high throughput, and low CPU usage.

This is especially important in latency-sensitive scenarios such as low-latency financial trading, core databases, high-performance computing, and AI training. In these use cases, traditional TCP/IP networks cannot meet the demand for microsecond-level responses, while RDMA can significantly improve I/O efficiency and reduce business processing latency. That makes RDMA an essential building block of high-performance architectures for core business systems.

Despite RDMA’s exceptional performance, traditional approaches to achieving RDMA high availability still have clear limitations.

  • Limited to Linux Bond: For RDMA traffic, only multiple ports on the same physical NIC can be bonded; cross-NIC bonding is not supported.
  • A physical NIC failure can directly break the RDMA link: Because every bonded port sits on the same NIC, Linux Bond cannot ride out a NIC-level failure, which may cause storage I/O interruption, business stalls, or even service outages.
  • Unable to meet the HA requirements of mission-critical workloads: Core scenarios such as low-latency trading require NIC-level and switch-level high availability, which traditional solutions fail to provide.
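
The limitation in the list above comes down to where the bonded ports physically live. The following is a minimal Python sketch of that check, not Arcfra code: the port names and PCI addresses are hypothetical, and it assumes that the ports of one multi-port NIC appear as PCI functions sharing the same domain:bus:device prefix, which is common but not guaranteed on every adapter.

```python
from collections import defaultdict


def physical_nic(pci_address):
    """Return the domain:bus:device part of a PCI address like '0000:3b:00.1'.

    Assumption: ports of the same physical NIC differ only in the
    function number that follows the final dot.
    """
    return pci_address.rsplit(".", 1)[0]


def group_ports_by_nic(port_to_pci):
    """Group bond member ports by the physical NIC they belong to."""
    groups = defaultdict(list)
    for port, pci in port_to_pci.items():
        groups[physical_nic(pci)].append(port)
    return {nic: sorted(ports) for nic, ports in groups.items()}


def is_single_nic_bond(port_to_pci):
    """True when every member port sits on one physical NIC."""
    return len(group_ports_by_nic(port_to_pci)) == 1


# Hypothetical bond whose two ports are both functions of one adapter.
same_nic_bond = {"ens1f0": "0000:3b:00.0", "ens1f1": "0000:3b:00.1"}

print(group_ports_by_nic(same_nic_bond))  # {'0000:3b:00': ['ens1f0', 'ens1f1']}
print(is_single_nic_bond(same_nic_bond))  # True
```

A bond for which this check returns True has a single point of failure: if that one adapter fails, every member port goes down with it.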

To address the long-standing challenge of insufficient RDMA HA capabilities, Arcfra AECP 6.3 redesigns the RDMA network architecture by replacing Linux Bond with OVS Bond, while supporting multi-port bonding across different physical NICs.

This means that ports from two different physical NICs can now be added to the same bond. As a result, a single NIC failure no longer interrupts business traffic, and neither does a single switch failure when the two NICs are uplinked to different switches. This delivers true high availability against NIC-level failures. Existing clusters can also gain the feature by upgrading to AECP 6.3, helping mission-critical workloads achieve both high performance and stability.
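
The availability argument can be checked with a small model. The sketch below is illustrative only and says nothing about how AECP implements its OVS Bond: the `Port` type, the NIC names, and the switch names are all hypothetical. It treats a bond as healthy while at least one member port is still up, then tests every single-NIC and single-switch failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    nic: str     # physical NIC the port belongs to
    switch: str  # upstream switch the port is cabled to


def surviving_ports(bond, failed_nic=None, failed_switch=None):
    """Ports still usable after the given NIC and/or switch has failed."""
    return [p for p in bond
            if p.nic != failed_nic and p.switch != failed_switch]


def survives_any_single_failure(bond):
    """True if the bond keeps at least one port after any single NIC
    failure and after any single switch failure."""
    nics = {p.nic for p in bond}
    switches = {p.switch for p in bond}
    nic_ok = all(surviving_ports(bond, failed_nic=n) for n in nics)
    switch_ok = all(surviving_ports(bond, failed_switch=s) for s in switches)
    return nic_ok and switch_ok


# Both ports on one NIC: a NIC failure removes every member.
same_nic = [Port("p0", "nic0", "sw0"), Port("p1", "nic0", "sw1")]
# Ports on different NICs, each cabled to a different switch.
cross_nic = [Port("p0", "nic0", "sw0"), Port("p1", "nic1", "sw1")]

print(survives_any_single_failure(same_nic))   # False
print(survives_any_single_failure(cross_nic))  # True
```

In this model, switch-level protection depends on cabling as well as bonding: a cross-NIC bond whose ports both connect to the same switch still fails the check.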

Feature Highlights

  • True NIC-Level High Availability: Supports bonding RDMA ports across different physical NICs, ensuring that a single NIC failure does not interrupt business traffic.
  • Dual-Switch High Availability as a Standard Architecture: Meets the stringent reliability requirements of financial services, trading systems, and mission-critical databases.
  • Lower Cost for Higher Reliability: Provides a financial-grade highly available architecture without depending on FC (Fibre Channel) networks, reducing infrastructure cost while maintaining reliability.

Learn more about the performance leap and upgraded features in AECP 6.3 from our latest blog posts:

Arcfra AECP 6.3 Breaks the 11M IOPS Barrier, Delivering Tier-1 All-Flash Performance and RPO=0 Resilience for Enterprise Cloud

What’s New in Arcfra Enterprise Cloud Platform 6.3

Arcfra AECP 6.3 Deep Dive: Full-Stack Disaster Recovery with Synchronous Replication and Arcfra Operation Center High Availability

Arcfra AECP 6.3 Tech Insights: Stretched Cluster (Active-Active) vs. Synchronous Replication

About Arcfra

Arcfra simplifies enterprise cloud infrastructure with a full-stack, software-defined platform built for the AI era. We deliver computing, storage, networking, security, Kubernetes, and more — all in one streamlined solution. Supporting VMs, containers, and AI workloads, Arcfra offers future-proof infrastructure trusted by enterprises across e-commerce, finance, and manufacturing. Arcfra is recognized by Gartner as a Representative Vendor in full-stack hyperconverged infrastructure. Learn more at www.arcfra.com.