
Arcfra AECP 6.3 Deep Dive: RDMA Cross-NIC HA for High-Performance Workload Reliability

Published on by Arcfra Team

High-performance workloads such as low-latency financial trading and mission-critical databases often rely on RDMA networks to achieve ultra-low latency. However, traditional HCI solutions typically provide RDMA redundancy protection only at the network-port level. When a physical NIC failure occurs, core business systems may still face the risk of service outages.

To address this, Arcfra AECP 6.3 introduces support for multi-path RDMA cross-NIC bonding. Based on OVS Bond, it enables multi-link bonding across different physical NICs, providing NIC-level storage network redundancy while maintaining RDMA’s microsecond-level low-latency advantage. Overall, AECP 6.3 can provide RDMA high availability protection comparable to FC networks, meeting the demanding requirements of mission-critical workloads for both performance and stability.

>>Learn more about why RDMA needs cross-NIC HA

How Does AECP 6.3 Deliver Complete High Availability for RDMA Networks?

AECP 6.3 redesigns the RDMA network architecture by replacing Linux Bond with OVS Bond and supporting multi-port bonding across different physical NICs. Ports from two different physical NICs can now be added to the same bond, so a single NIC failure or a single switch failure no longer interrupts business traffic. This delivers true high availability against NIC-level failures. Existing clusters can also gain the feature by upgrading to the new version, which helps mission-critical workloads achieve both high performance and stability.

Tech Insights: Evolution of Implementation

Arcfra’s exploration of high availability for RDMA networks has gone through several stages of technical evolution.

Earlier Releases: Linux Bonding + OpenFabrics

In earlier HCI RDMA network designs, NIC hardware bonding was commonly implemented on top of the OpenFabrics Alliance standard. This approach relies on NIC hardware capabilities to distribute traffic across multiple ports within the same physical NIC. Its core assumption is that the RDMA Queue Pair (QP) state can be shared across multiple physical ports on the same NIC. However, this architecture has several limitations:

  • Only supports bonding within the same NIC: It cannot aggregate ports across different physical NICs, so a NIC-level failure will directly interrupt services.
  • Difficult to implement in active-active / dual-switch architectures: In LACP scenarios, if a QP sends packets through Port A but receives through Port B, Port B cannot recognize the QP state, and the packets will be dropped directly.
  • Incomplete HA capability: It cannot meet the redundancy requirements of dual-NIC and dual-switch architectures required by financial services and other mission-critical workloads.

The root cause is that QP state is tightly bound to physical ports and cannot be shared across devices, which fundamentally limits the HA capability of RDMA networks.

AECP 6.3: A New OVS + RDMA Software-Coordinated Bonding Approach

To overcome hardware-level limitations, AECP 6.3 introduces an innovative architecture based on OVS Bonding and RDMA software-layer multipathing.

1. Unified Abstraction at the Network Layer: OVS Bonding as the Foundation

With OVS Bonding, multiple physical ports are aggregated into a single logical port, providing a unified storage IP externally and remaining fully transparent to upper-layer applications. This design provides several key benefits:

  • Supports bonding across different physical NICs, fundamentally breaking hardware constraints.
  • Remains fully compatible with traditional TCP-based applications, such as ZooKeeper clusters.
  • Natively supports load balancing and failover, providing complete HA capabilities.
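As a rough illustration of the cross-NIC constraint, the sketch below builds an `ovs-vsctl add-bond` command line only when the chosen ports span at least two physical NICs. The bridge, bond, and interface names and the port-to-NIC mapping are hypothetical, and this is not how AECP itself configures the bond. It only assembles the arguments and does not run them.

```python
from typing import Dict, List

# Bonding modes the article lists as supported for OVS Bond.
SUPPORTED_MODES = ("active-backup", "balance-tcp")


def build_cross_nic_bond_cmd(
    bridge: str,
    bond: str,
    ports: List[str],
    port_to_nic: Dict[str, str],
    bond_mode: str = "active-backup",
) -> List[str]:
    """Return an ovs-vsctl argv for a bond whose ports span >= 2 physical NICs.

    port_to_nic is a hypothetical mapping from interface name to an
    identifier of the physical adapter it belongs to.
    """
    if bond_mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported bond mode: {bond_mode}")
    if len(ports) < 2:
        raise ValueError("a bond needs at least two ports")
    nics = {port_to_nic[p] for p in ports}
    if len(nics) < 2:
        raise ValueError("all ports sit on one NIC; no NIC-level redundancy")
    cmd = ["ovs-vsctl", "add-bond", bridge, bond, *ports, f"bond_mode={bond_mode}"]
    if bond_mode == "balance-tcp":
        # In OVS, balance-tcp relies on LACP being negotiated with the switch.
        cmd.append("lacp=active")
    return cmd


if __name__ == "__main__":
    nic_map = {"ens1f0": "nic1", "ens1f1": "nic1", "ens2f0": "nic2"}
    print(" ".join(build_cross_nic_bond_cmd(
        "br-storage", "bond-storage", ["ens1f0", "ens2f0"], nic_map)))
```

Rejecting a single-NIC port set up front keeps the bond from silently falling back to port-level redundancy only.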

2. RDMA Software-Layer Innovation: Multipathing Solves the Core Challenge

At the RDMA layer, AECP 6.3 introduces a multipathing mechanism to fundamentally solve the QP-bonding limitations of hardware-based approaches.

  • Multiple QPs: Establishes multiple QPs for the same logical connection, with each QP bound to a different physical port.
  • L4 Path Detection: Uses a Layer 4 path detection mechanism to identify the actual forwarding path of each five-tuple flow, and selects stable paths with consistent send/receive behavior.
  • Fast Reconnection Within Seconds: When one path becomes unavailable, the application layer quickly switches to a standby QP, so failover completes without the service noticing and without packet loss.
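The multi-QP idea can be sketched as a small data model: one logical connection holds several QPs, each tied to a different physical port, and switches to a healthy standby when the active one fails. The class and field names are hypothetical. Real QPs would be created through an RDMA verbs library, which is not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueuePair:
    port: str             # physical port this QP is bound to
    healthy: bool = True  # updated by path detection (not modeled here)


@dataclass
class LogicalConnection:
    qps: List[QueuePair]  # one QP per physical port
    active: int = 0       # index of the QP currently carrying traffic

    def current(self) -> QueuePair:
        return self.qps[self.active]

    def mark_failed(self, port: str) -> None:
        """Record that the path through `port` is no longer usable."""
        for qp in self.qps:
            if qp.port == port:
                qp.healthy = False

    def failover(self) -> Optional[QueuePair]:
        """Keep the active QP if healthy, else switch to the first healthy standby.

        Returns the QP to use, or None when every path is down.
        """
        if self.current().healthy:
            return self.current()
        for index, qp in enumerate(self.qps):
            if qp.healthy:
                self.active = index
                return qp
        return None


if __name__ == "__main__":
    conn = LogicalConnection([QueuePair("ens1f0"), QueuePair("ens2f0")])
    conn.mark_failed("ens1f0")
    print(conn.failover())
```

Because the standby QP already exists on another port, the switch is an index change rather than a full connection setup.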

The new design also addresses three long-standing issues in OVS Bonding-based RDMA architectures, described below, making it well suited to high-performance scenarios such as financial trading and active-active architectures.

MAC Flapping

Issue: MAC flapping in RDMA networks typically occurs when the same MAC address is incorrectly learned on multiple switch ports due to host or network configuration issues (such as bonding, LAG/MLAG inconsistency, or duplicate MAC assignment). This can result in packet loss and degraded RDMA performance.

Solutions:

  • Periodically retrieve the Active Slave of the OVS Bond.
  • Force RDMA connections to be established only through the active port.
  • Keep traffic paths consistent to eliminate MAC flapping at the source.
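One way to picture the first two steps is to read the bond's active member from `ovs-appctl bond/show` and allow RDMA connection setup only on that port. The sketch below is an assumption about how this could be done, not AECP's implementation. The sample text is modeled on typical `bond/show` output, whose exact wording varies between OVS versions (older releases print "slave", newer ones "member").

```python
import re
import subprocess
from typing import Optional

# Matches "active slave mac: <mac>(<port>)" and the newer "active member mac:" form.
_ACTIVE_RE = re.compile(
    r"^active (?:slave|member) mac: \S+\((?P<port>[^)]+)\)\s*$", re.MULTILINE
)


def fetch_bond_show(bond: str) -> str:
    """Run `ovs-appctl bond/show <bond>` and return its text output."""
    return subprocess.run(
        ["ovs-appctl", "bond/show", bond],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_active_port(bond_show_text: str) -> Optional[str]:
    """Return the active member's interface name, or None if there is none."""
    match = _ACTIVE_RE.search(bond_show_text)
    if match is None or match.group("port") == "none":
        return None
    return match.group("port")


def may_establish_rdma(port: str, bond_show_text: str) -> bool:
    """Allow new RDMA connections only through the bond's active port."""
    return parse_active_port(bond_show_text) == port


# Assumed sample output, trimmed; interface names are hypothetical.
SAMPLE = """\
---- bond-storage ----
bond_mode: active-backup
active slave mac: 52:54:00:aa:bb:01(ens1f0)

slave ens1f0: enabled
  active slave
  may_enable: true

slave ens2f0: enabled
  may_enable: true
"""

if __name__ == "__main__":
    print(parse_active_port(SAMPLE))
```

A periodic task could call `fetch_bond_show` on a timer and re-evaluate `may_establish_rdma` before each new connection, so the host's MAC is only ever seen behind one switch port.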

Long and Unpredictable Failover Reconnection Time

Issue: During failover, network disconnection and slow reconnection may affect replica safety and I/O latency.

Solutions:

  • Use an L3/L4-like detection mechanism to quickly identify available ports.
  • Optimize connection rebuilding and QP reconnection logic.
  • Significantly reduce failover latency and improve overall stability.
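The article does not describe the detection mechanism in detail. As a stand-in, the sketch below uses a plain TCP connect with a short timeout to check candidate endpoints in order and return the first one that answers. The function names and the default timeout are assumptions for illustration only.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Endpoint = Tuple[str, int]  # (IP address, TCP port)


def tcp_probe(endpoint: Endpoint, timeout: float = 0.2) -> bool:
    """Return True if a TCP connection to the endpoint succeeds within timeout."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def first_reachable(
    endpoints: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Return the first endpoint that passes the probe, or None if none does."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None
```

With a fixed per-probe timeout, the worst-case detection time is bounded by the number of candidates multiplied by that timeout, which is what makes reconnection time predictable.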

Unstable Connections in balance-tcp Mode

Issue: In Port Channel scenarios, inconsistent switch hashing may cause return traffic to arrive on a different port than the one used to establish the RDMA connection, resulting in packet drops.

Solutions:

  • First send probe packets to determine the available ports on both ends.
  • Then establish the RDMA connection and bind it to the selected port.
  • Ensure send/receive path consistency so the connection remains stable and reliable.
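The probe-then-bind steps can be simulated with two independent hash functions, one for the host's egress choice and one for the switch's return choice. The sketch tries source ports until both directions land on the same bond member. The CRC-based hash and the seeds are hypothetical stand-ins. A real system would observe the return port from probe replies instead of computing the switch's hash. The destination port 4791 is the standard RoCEv2 UDP port.

```python
import zlib
from typing import Iterable, Optional, Tuple

FiveTuple = Tuple[str, int, str, int, str]  # src IP, src port, dst IP, dst port, proto

ROCEV2_UDP_PORT = 4791


def member_index(flow: FiveTuple, n_members: int, seed: str) -> int:
    """Map a five-tuple to a bond member; `seed` models a device-specific hash."""
    key = seed + "|" + "|".join(str(part) for part in flow)
    return zlib.crc32(key.encode()) % n_members


def find_consistent_src_port(
    src_ip: str,
    dst_ip: str,
    n_members: int,
    host_seed: str,
    switch_seed: str,
    candidates: Iterable[int],
) -> Optional[Tuple[int, int]]:
    """Return (src_port, member) where send and return paths use the same member."""
    for src_port in candidates:
        forward = (src_ip, src_port, dst_ip, ROCEV2_UDP_PORT, "udp")
        reverse = (dst_ip, ROCEV2_UDP_PORT, src_ip, src_port, "udp")
        egress = member_index(forward, n_members, host_seed)    # host's choice
        ingress = member_index(reverse, n_members, switch_seed)  # switch's choice
        if egress == ingress:
            return src_port, egress
    return None


if __name__ == "__main__":
    print(find_consistent_src_port(
        "10.0.0.1", "10.0.0.2", 2, "host", "switch", range(49152, 49216)))
```

Binding the connection to a flow whose two directions agree is what keeps the receiving port in possession of the QP state for every arriving packet.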

Solution Comparison: Linux Bond vs. OVS Bond

| Dimensions | Linux Bond | OVS Bond |
| --- | --- | --- |
| Cross-NIC Bonding | Not supported (only multiple ports on the same NIC) | Supported (true NIC-level HA) |
| HA Coverage | Port-level failures only | Port-level + NIC-level + switch-level failures |
| Stability | High, based on the standard implementation | High (all historical issues have been fixed in AECP 6.3) |
| Failover Perception | No service impact, no packet loss | Short failover time, acceptable to business workloads |
| Supported Bonding Modes | active-backup / balance-xor / 802.3ad | active-backup / balance-tcp |
| Unsupported Modes | None | balance-slb |
| Applicable Scenarios | General architectures without HA requirements | Financial core systems, active-active architectures, and high-performance / high-reliability scenarios |
| Upgrade for Existing Clusters | Used natively | Supports one-click migration through network-tool/ |

Feature Highlights

1. True NIC-Level High Availability

Supports bonding RDMA ports across different physical NICs, ensuring that a single NIC failure does not interrupt business traffic.

2. Dual-Switch High Availability as a Standard Architecture

Meets the stringent reliability requirements of financial services, trading systems, and mission-critical databases.

3. Fully Resolves Long-Standing Technical Issues

Fixes issues such as MAC flapping, unpredictable reconnection, and unstable balance-tcp behavior.

4. Supports Two Production-Ready Bonding Modes

  • active-backup: primary/standby high availability
  • balance-tcp: load balancing with aggregated bandwidth across multiple ports

5. Automatic Multipathing for Data Channels

In balance-tcp mode, multipathing is enabled automatically, fully utilizing multi-NIC bandwidth and improving cluster write performance.

6. Smooth Upgrade for Existing Clusters

Existing clusters using Linux Bond can be switched via network-tool in one click, without requiring architectural changes.

7. Unified Network Architecture

Storage networks are unified under OVS Bond, simplifying management and improving operational efficiency.

Conclusion

Arcfra AECP 6.3 addresses the long-standing industry challenge of “insufficient HA for RDMA networks” through cross-NIC RDMA bonding:

  • High performance without compromise: preserves RDMA’s microsecond-level low-latency advantage.
  • Comprehensive HA enhancement: supports protection against NIC-level and switch-level failures.
  • Ready for mission-critical workloads: reliable and trusted for core business scenarios.
  • Financial-grade reliability without FC costs: delivers high reliability without requiring expensive FC networks.
  • Dual-NIC and dual-switch HA architectures in practice: provides reliable protection for core business systems.

Learn more about upgraded features and DR capabilities of AECP 6.3 from our latest blogs:

Arcfra AECP 6.3 Breaks the 11M IOPS Barrier, Delivering Tier-1 All-Flash Performance and RPO=0 Resilience for Enterprise Cloud

Arcfra AECP 6.3 Deep Dive: Full-Stack Disaster Recovery with Synchronous Replication and Arcfra Operation Center High Availability

Arcfra AECP 6.3 Tech Insights: Stretched Cluster (Active-Active) vs. Synchronous Replication

What’s New in Arcfra Enterprise Cloud Platform 6.3

Arcfra AECP 6.3 Tech Insights: Why RDMA Needs Cross-NIC HA?

Arcfra AECP 6.3 Tech Insights: How to Achieve HA Protection for VMs Using SR-IOV & vGPU?

About Arcfra

Arcfra simplifies enterprise cloud infrastructure with a full-stack, software-defined platform built for the AI era. We deliver computing, storage, networking, security, Kubernetes, and more — all in one streamlined solution. Supporting VMs, containers, and AI workloads, Arcfra offers future-proof infrastructure trusted by enterprises across e-commerce, finance, and manufacturing. Arcfra is recognized by Gartner as a Representative Vendor in full-stack hyperconverged infrastructure. Learn more at www.arcfra.com.