High-performance workloads such as low-latency financial trading and mission-critical databases often rely on RDMA networks to achieve ultra-low latency. However, traditional HCI solutions typically provide RDMA redundancy protection only at the network-port level. When a physical NIC failure occurs, core business systems may still face the risk of service outages.
To address this, Arcfra AECP 6.3 introduces support for multi-path RDMA cross-NIC bonding. Based on OVS Bond, it enables multi-link bonding across different physical NICs, providing NIC-level storage network redundancy while maintaining RDMA’s microsecond-level low-latency advantage. Overall, AECP 6.3 can provide RDMA high availability protection comparable to FC networks, meeting the demanding requirements of mission-critical workloads for both performance and stability.
AECP 6.3 redesigns the RDMA network architecture by replacing Linux Bond with OVS Bond and supporting multi-port bonding across different physical NICs. Ports from two different physical NICs can now be added to the same bond, so a single NIC failure or a single switch failure no longer interrupts business traffic. This delivers true high availability against NIC-level failures. Existing clusters can also gain the feature through a version upgrade, which helps mission-critical workloads achieve both high performance and stability.
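To make the failure-domain argument concrete, here is a minimal Python sketch that checks whether a bond has a single point of failure at the NIC or switch level. The port, NIC, and switch names are hypothetical, and the model is purely illustrative; it is not how AECP validates a bond.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BondMember:
    port: str    # port name (hypothetical, e.g. "ens1f0")
    nic: str     # physical NIC the port belongs to
    switch: str  # upstream switch the port is cabled to


def single_points_of_failure(members):
    """Return (kind, name) for every NIC or switch whose loss removes all members."""
    spofs = []
    for kind in ("nic", "switch"):
        domains = {getattr(m, kind) for m in members}
        for domain in sorted(domains):
            survivors = [m for m in members if getattr(m, kind) != domain]
            if not survivors:
                spofs.append((kind, domain))
    return spofs


# Two ports on the same NIC: port-level redundancy only.
same_nic = [
    BondMember("ens1f0", nic="nic-A", switch="sw-1"),
    BondMember("ens1f1", nic="nic-A", switch="sw-2"),
]
# One port on each of two NICs, cabled to two switches.
cross_nic = [
    BondMember("ens1f0", nic="nic-A", switch="sw-1"),
    BondMember("ens2f0", nic="nic-B", switch="sw-2"),
]

print(single_points_of_failure(same_nic))   # [('nic', 'nic-A')]
print(single_points_of_failure(cross_nic))  # []
```

In this toy model the same-NIC bond survives a port or switch failure but not the loss of the NIC itself, while the cross-NIC bond has no single NIC or switch whose failure removes every member.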
Arcfra’s exploration of high availability for RDMA networks has gone through several stages of technical evolution.
Earlier Releases: Linux Bonding + OpenFabrics
In earlier HCI RDMA network designs, NIC hardware bonding was commonly implemented according to the OpenFabrics Alliance standard. This approach relies on NIC hardware capabilities to distribute traffic across multiple ports within the same physical NIC. Its core assumption is that the RDMA Queue Pair (QP) state can be shared across multiple physical ports on the same NIC. However, this architecture has several limitations:
The root cause is that QP state is tightly bound to physical ports and cannot be shared across devices, which fundamentally limits the HA capability of RDMA networks.
AECP 6.3: A New OVS + RDMA Software-Coordinated Bonding Approach
To overcome hardware-level limitations, AECP 6.3 introduces an innovative architecture based on OVS Bonding and RDMA software-layer multipathing.
1. Unified Abstraction at the Network Layer: OVS Bonding as the Foundation
With OVS Bonding, multiple physical ports are aggregated into a single logical port, providing a unified storage IP externally and remaining fully transparent to upper-layer applications. This design provides several key benefits:
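As a rough illustration of the aggregation described in this subsection, the Python sketch below builds (but does not run) an `ovs-vsctl add-bond` command and refuses to do so unless the member ports span at least two physical NICs. The bridge, bond, and interface names are hypothetical, and grouping ports by the `domain:bus:device` part of their PCI address is a simplifying assumption about how ports map to physical NICs; the actual AECP tooling may work differently.

```python
import re

# PCI address: domain:bus:device.function, e.g. "0000:3b:00.0"
PCI_RE = re.compile(r"^([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2})\.([0-7])$")

# The two OVS Bond modes listed as supported in the comparison table.
SUPPORTED_MODES = ("active-backup", "balance-tcp")


def nic_id(pci_address):
    """Assumption: ports of one physical NIC share domain:bus:device."""
    match = PCI_RE.match(pci_address.lower())
    if match is None:
        raise ValueError(f"not a PCI address: {pci_address!r}")
    return match.group(1)


def build_add_bond_cmd(bridge, bond, ports, mode="active-backup"):
    """Build an ovs-vsctl command; `ports` maps interface name -> PCI address."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported bond mode: {mode}")
    if len(ports) < 2:
        raise ValueError("a bond needs at least two ports")
    if len({nic_id(addr) for addr in ports.values()}) < 2:
        raise ValueError("all ports are on the same physical NIC")
    cmd = ["ovs-vsctl", "add-bond", bridge, bond, *ports, f"bond_mode={mode}"]
    if mode == "balance-tcp":
        # OVS balance-tcp depends on LACP being negotiated with the switch.
        cmd.append("lacp=active")
    return cmd


cmd = build_add_bond_cmd(
    "br-storage",                      # hypothetical bridge name
    "bond-storage",                    # hypothetical bond name
    {"ens1f0": "0000:3b:00.0", "ens2f0": "0000:5e:00.0"},
    mode="balance-tcp",
)
print(" ".join(cmd))
```

Returning the argument list instead of executing it keeps the sketch safe to run on any machine; passing it to `subprocess.run` would be the obvious next step on a host that actually has Open vSwitch installed.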
2. RDMA Software-Layer Innovation: Multipathing Solves the Core Challenge
At the RDMA layer, AECP 6.3 introduces a multipathing mechanism to fundamentally solve the QP-bonding limitations of hardware-based approaches.
At the same time, it addresses the long-standing issues in OVS Bonding-based RDMA architectures, making it ideally suited for high-performance scenarios such as financial trading and active-active architectures.
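The general idea of software-layer multipathing can be sketched in a few lines of Python: keep one path per physical port, spread requests across the healthy paths, and skip a path once it is marked down. This is a generic round-robin model written for illustration only. The class name, path labels, and selection policy are assumptions and are not taken from the AECP implementation.

```python
class MultipathChannel:
    """Toy round-robin selector over a fixed set of paths."""

    def __init__(self, paths):
        if not paths:
            raise ValueError("at least one path is required")
        self.healthy = {path: True for path in paths}
        self._next = 0  # index of the path to try first on the next pick

    def mark_down(self, path):
        self.healthy[path] = False

    def mark_up(self, path):
        self.healthy[path] = True

    def pick(self):
        """Return the next healthy path, skipping any that are down."""
        paths = list(self.healthy)
        for offset in range(len(paths)):
            index = (self._next + offset) % len(paths)
            if self.healthy[paths[index]]:
                self._next = (index + 1) % len(paths)
                return paths[index]
        raise RuntimeError("no healthy path available")


channel = MultipathChannel(["nic-A/port0", "nic-B/port0"])  # hypothetical labels
print([channel.pick() for _ in range(4)])  # alternates between both paths

channel.mark_down("nic-A/port0")           # simulate a NIC failure
print([channel.pick() for _ in range(2)])  # only the surviving path is used
```

Because each path owns its own connection state in this model, losing one NIC only shrinks the set of candidates instead of requiring QP state to move between devices.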
MAC Flapping
Issue: MAC flapping in RDMA networks typically occurs when the same MAC address is incorrectly learned on multiple switch ports due to host or network configuration issues (such as bonding, LAG/MLAG inconsistency, or duplicate MAC assignment). This can result in packet loss and degraded RDMA performance.
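The symptom can be reproduced with a toy model of a switch MAC table. In the Python sketch below, a "move" is counted each time a known MAC address is seen on a different port than the one it was last learned on. The MAC address, port names, and frame sequences are made up for illustration; real switches apply their own thresholds and timers before reporting a flap.

```python
from collections import defaultdict


class MacTable:
    """Toy switch MAC learning table that counts port moves per MAC."""

    def __init__(self):
        self.port_of = {}
        self.moves = defaultdict(int)

    def learn(self, mac, port):
        previous = self.port_of.get(mac)
        if previous is not None and previous != port:
            self.moves[mac] += 1  # same MAC now seen on another port
        self.port_of[mac] = port


def count_moves(frames):
    """frames: iterable of (source MAC, ingress switch port)."""
    table = MacTable()
    for mac, port in frames:
        table.learn(mac, port)
    return dict(table.moves)


MAC = "52:54:00:aa:bb:cc"  # hypothetical bond MAC address

# The bond transmits with one MAC on both links, but the switch ports
# are not configured as a single LAG: the MAC bounces between ports.
misconfigured = [(MAC, "eth1/1"), (MAC, "eth1/2"), (MAC, "eth1/1"), (MAC, "eth1/2")]

# Only one link carries the MAC at a time: nothing moves.
single_active = [(MAC, "eth1/1")] * 4

print(count_moves(misconfigured))  # {'52:54:00:aa:bb:cc': 3}
print(count_moves(single_active))  # {}
```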
Solutions:
Long and Unpredictable Failover Reconnection Time
Issue: During failover, network disconnection and slow reconnection may affect replica safety and I/O latency.
Solutions:
Unstable Connections in balance-tcp Mode
Issue: In Port Channel scenarios, inconsistent switch hashing may cause return traffic to arrive on a different port than the one used to establish the RDMA connection, resulting in packet drops.
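The mismatch is easy to see with two deliberately simple hash functions. In the Python sketch below, the host picks a bond member from the UDP ports of a flow, while the switch picks a port-channel member from the IP addresses. Both hash functions and all addresses are invented for illustration; real OVS and switch hashing algorithms differ. Only the RoCEv2 UDP destination port, 4791, is a real value.

```python
from dataclasses import dataclass

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2


@dataclass(frozen=True)
class Flow:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int = ROCEV2_UDP_PORT


def last_octet(ip):
    return int(ip.rsplit(".", 1)[1])


def host_member(flow, n_links):
    """Toy host-side hash: uses the L4 ports only."""
    return (flow.src_port + flow.dst_port) % n_links


def switch_member(flow, n_links):
    """Toy switch-side hash: uses the IP addresses only.

    The sum is symmetric, so the return packet (source and destination
    swapped) hashes to the same member as the forward packet.
    """
    return (last_octet(flow.src_ip) + last_octet(flow.dst_ip)) % n_links


def asymmetric_flows(flows, n_links=2):
    """Flows whose return traffic lands on a different link than they left on."""
    return [f for f in flows if host_member(f, n_links) != switch_member(f, n_links)]


flows = [
    Flow("10.0.0.11", "10.0.0.12", src_port=49152),
    Flow("10.0.0.11", "10.0.0.12", src_port=49153),
]
for flow in asymmetric_flows(flows):
    print("return path differs for source port", flow.src_port)
```

With these toy hashes, the two connections share the same IP pair, yet only the one with source port 49153 leaves on one link and returns on the other. That is the kind of per-connection inconsistency described in the issue above.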
Solutions:
Solution Comparison: Linux Bond vs. OVS Bond
| Dimensions | Linux Bond | OVS Bond |
|---|---|---|
| Cross-NIC Bonding | Not supported (only multiple ports on the same NIC) | Supported (true NIC-level HA) |
| HA Coverage | Port-level failures only | Port-level + NIC-level + switch-level failures |
| Stability | High, based on the standard implementation | High (all historical issues have been fixed in AECP 6.3) |
| Failover Perception | No service impact, no packet loss | Short failover time, acceptable to business workloads |
| Supported Bonding Modes | active-backup / balance-xor / 802.3ad | active-backup / balance-tcp |
| Unsupported Modes | None | balance-slb |
| Applicable Scenarios | General architectures without HA requirements | Financial core systems, active-active architectures, and high-performance / high-reliability scenarios |
| Upgrade for Existing Clusters | Used natively | Supports one-click migration through network-tool/ |
1. True NIC-Level High Availability
Supports bonding RDMA ports across different physical NICs, ensuring that a single NIC failure does not interrupt business traffic.
2. Dual-Switch High Availability as a Standard Architecture
Meets the stringent reliability requirements of financial services, trading systems, and mission-critical databases.
3. Fully Resolves Long-Standing Technical Issues
Fixes issues such as MAC flapping, unpredictable reconnection, and unstable balance-tcp behavior.
4. Supports Two Production-Ready Bonding Modes
OVS Bond supports the active-backup and balance-tcp modes.
5. Automatic Multipathing for Data Channels
In balance-tcp mode, multipathing is enabled automatically, fully utilizing multi-NIC bandwidth and improving cluster write performance.
6. Smooth Upgrade for Existing Clusters
Existing clusters using Linux Bond can be switched to OVS Bond in one click via network-tool, without requiring architectural changes.
7. Unified Network Architecture
Storage networks are unified under OVS Bond, simplifying management and improving operational efficiency.
Arcfra AECP 6.3 addresses the long-standing industry challenge of “insufficient HA for RDMA networks” through cross-NIC RDMA bonding:
Arcfra simplifies enterprise cloud infrastructure with a full-stack, software-defined platform built for the AI era. We deliver computing, storage, networking, security, Kubernetes, and more — all in one streamlined solution. Supporting VMs, containers, and AI workloads, Arcfra offers future-proof infrastructure trusted by enterprises across e-commerce, finance, and manufacturing. Arcfra is recognized by Gartner as a Representative Vendor in full-stack hyperconverged infrastructure. Learn more at www.arcfra.com.