Arcfra AECP 6.3 Deep Dive | Expanding VM HA to SR-IOV and vGPU Workloads

By Arcfra Team

In enterprise-critical scenarios such as low-latency trading and AI inference, organizations often attach hardware devices like SR-IOV passthrough NICs and GPUs to virtual machines (VMs) to achieve maximum performance and meet strict security and compliance requirements.

However, due to the hardware-bound nature of passthrough devices, High Availability (HA) cannot be enabled for these VMs. In the event of a physical host failure, recovery relies on manual intervention, resulting in downtime typically measured in hours — far from meeting production requirements.

>>Learn more about HA challenges for VMs using SR-IOV and vGPU

To address this challenge, Arcfra AECP 6.3 introduces HA capabilities for VMs that use SR-IOV NICs and vGPUs. Its device tagging feature removes the long-standing limitation that prevented VMs with passthrough devices from enabling HA.

AECP 6.3 Breakthrough: Enabling HA for VMs Using SR-IOV & vGPU with Device Tagging

With HA for SR-IOV and vGPU workloads in AECP 6.3, VMs attached to SR-IOV NICs or vGPUs are automatically rebuilt and rapidly recovered after a host failure, achieving “no compromise on performance with improved business reliability.” The core objectives of this feature include:

  • Enabling passthrough devices to evolve from “dedicated, static, and non-HA” to “pooled, schedulable, and HA-enabled.”
  • Allowing mission-critical workloads in finance and AI scenarios to achieve both extreme performance and high business continuity.
  • Unifying VM HA policies across AECP clusters to reduce O&M complexity and the risk of human error.

Tech Behind SR-IOV & vGPU VM HA: Device Tagging

Arcfra AECP 6.3 introduces device tagging to enable unified identification, scheduling, and rebuilding of VMs configured with SR-IOV passthrough NICs and vGPUs. Device tags are what make HA possible for VMs attached to these virtualized hardware devices.

  • SR-IOV NICs: Users can customize the device tag.
  • vGPU: The system automatically generates and matches device tags based on GPU model and partitioning configuration.
  • A VM automatically qualifies for HA when the cluster contains devices of the same type and with the same tag, and the available count of those devices is greater than 0.
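
The eligibility rule in the last bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DevicePool`, `DeviceRequest`, `ha_eligible`, and the tag strings are hypothetical names, not part of any AECP API, and requiring a match for every device a VM uses is an assumption about how the rule extends to VMs with several devices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DevicePool:
    """Hypothetical summary of one kind of tagged device on one host."""
    host: str
    device_type: str  # e.g. "sriov_nic" or "vgpu" (illustrative values)
    tag: str
    available: int    # number of devices of this kind that are currently free


@dataclass(frozen=True)
class DeviceRequest:
    """Hypothetical description of one device a VM is configured with."""
    device_type: str
    tag: str


def ha_eligible(requests, pools):
    """Return True if every requested device has a pool in the cluster with
    the same type, the same tag, and an available count greater than 0."""
    return all(
        any(
            p.device_type == r.device_type
            and p.tag == r.tag
            and p.available > 0
            for p in pools
        )
        for r in requests
    )


# Example: one free SR-IOV NIC tagged "trading"; the vGPU pool is exhausted.
pools = [
    DevicePool("host-b", "sriov_nic", "trading", 1),
    DevicePool("host-c", "vgpu", "modelX-4g", 0),
]
print(ha_eligible([DeviceRequest("sriov_nic", "trading")], pools))
print(ha_eligible([DeviceRequest("vgpu", "modelX-4g")], pools))
```

In this sketch the first check succeeds and the second fails, because a pool with an available count of 0 does not count toward eligibility.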

HA Triggering and Rebuild Workflow

  1. A physical host experiences an unexpected failure.
  2. The system detects the failure and automatically triggers VM HA.
  3. The system selects a target host within the cluster that has an available SR-IOV passthrough NIC or vGPU of the same type and with the same tag.
  4. The VM is automatically rebuilt and booted on the target host.
  5. SR-IOV passthrough NICs and vGPUs are automatically attached to the rebuilt VM, enabling rapid business recovery.
  6. The entire process requires no manual intervention, reducing downtime from hours to minutes.
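
The target selection and device attachment steps of this workflow can be pictured with the following Python sketch. It is a simplified model, not Arcfra's scheduler: `select_target_host`, `rebuild_vm`, the inventory layout, and the tag strings are all hypothetical, and picking the first suitable host in name order is an assumption made only to keep the example deterministic.

```python
from collections import Counter


def select_target_host(needed, inventory, failed_host):
    """Pick a surviving host that can supply every device the VM needs.

    needed:    Counter mapping (device_type, tag) -> number required
    inventory: dict mapping host name -> Counter of (device_type, tag) -> free count
    Returns the first suitable host in name order, or None if there is none.
    """
    for host in sorted(inventory):
        if host == failed_host:
            continue  # the failed host cannot be a rebuild target
        free = inventory[host]  # a Counter returns 0 for missing keys
        if all(free[key] >= count for key, count in needed.items()):
            return host
    return None


def rebuild_vm(needed, inventory, failed_host):
    """Model the rebuild: choose a target host, then reserve the devices on it."""
    target = select_target_host(needed, inventory, failed_host)
    if target is None:
        return None  # no host has matching devices available
    inventory[target].subtract(needed)  # attach: devices are no longer free
    return target


inventory = {
    "host-a": Counter({("sriov_nic", "trading"): 2}),
    "host-b": Counter({("sriov_nic", "trading"): 1, ("vgpu", "modelX-4g"): 1}),
    "host-c": Counter({("sriov_nic", "trading"): 1}),
}
needed = Counter({("sriov_nic", "trading"): 1, ("vgpu", "modelX-4g"): 1})
print(rebuild_vm(needed, inventory, failed_host="host-a"))
```

Here only host-b holds both a matching NIC and a matching vGPU, so it is chosen; a second VM with the same requirements would find no target, because host-b's vGPU is now in use.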

Key Feature Highlights

  • Does not rely on specific hardware and requires no changes to VM configurations.
  • Fully integrates with existing VM HA frameworks, providing unified policies.
  • Supports hybrid deployment, allowing unified management of both standard VMs and VMs using SR-IOV passthrough NICs or vGPUs.
  • Automatic failure detection, VM rebuilding, and business recovery.

Product Comparison: Arcfra vs. VMware vs. Nutanix

| Comparison Item | Arcfra AECP 6.3 | VMware/Nutanix and Other HCI Products |
| --- | --- | --- |
| SR-IOV NIC VM HA | ✅ Supported | ❌ Not Supported |
| vGPU VM HA | ✅ Supported | ✅ Supported |
| Automatic Failure Rebuild | ✅ Supported | ✅ Supported |
| Core Scenario Coverage | Low-Latency Trading & AI | AI & General Scenarios |

Overall Advantages

  1. Industry-Exclusive Support: Supports HA for VMs using both SR-IOV and vGPU devices, covering core financial trading and AI scenarios. Per the comparison above, other HCI products do not support HA for SR-IOV NIC VMs.
  2. High Performance and HA Protection: Maintains passthrough device performance while providing automatic failure recovery capabilities.
  3. Reduced Business Interruption Risk: Shifts from manual recovery (hours) to automatic rebuild (minutes).
  4. Simplified O&M and Lower Complexity: VMs attached to SR-IOV passthrough NICs or vGPUs use the same HA, alerting, and monitoring framework as standard VMs.
  5. Enabling Scalable Deployment of Core Workloads: Ensures production-level HA for low-latency trading and AI inference workloads.

Achieving Comprehensive Availability Protection with AECP 6.3

In addition to HA for VMs with hardware-accelerated devices, Arcfra AECP 6.3 introduces new HA features such as RDMA cross-NIC HA, placement group-based availability zone policies, and end-to-end HA alerting. By enhancing AECP availability across four dimensions — device, network, scheduling, and O&M — AECP 6.3 builds a comprehensive availability framework that covers all core business scenarios.

  • Device-Level HA: Provides HA support for VMs using SR-IOV passthrough NICs and vGPUs, enabling high-performance workloads to benefit from hardware acceleration while also gaining automated failure recovery.
  • Network-Level HA: Supports RDMA multi-link cross-NIC bonding, elevating storage network redundancy from the traditional port level to the NIC level. This significantly improves the reliability of high-performance networks while maintaining RDMA’s low-latency, high-throughput characteristics.
  • Scheduling-Level HA: Placement group rules now include availability zone policies, allowing VMs to be bound to primary or secondary availability zones. This ensures stable operation of workloads under active-active architectures. Regardless of cluster expansion, host replacement, or failure scheduling, VMs remain within the designated zone, preventing a single-zone failure from causing widespread business interruption and enhancing reliability and operability in active-active deployments.
  • O&M-Level HA: Introduces end-to-end HA alerting, covering key scenarios such as VM HA rebuild success, VM HA rebuild failure, local rebuild failure, and network fault–triggered HA. With clear alerts and event logs, O&M teams can monitor HA execution in real time, quickly identify anomalies, and respond promptly to failures, making HA truly observable, perceivable, and guaranteed.
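
To make the scheduling-level policy above more concrete, here is a minimal Python sketch of zone-restricted candidate selection. It assumes a hypothetical mapping of hosts to zones and a hypothetical `candidate_hosts` helper; it does not represent AECP's placement group implementation.

```python
def candidate_hosts(host_zones, vm_zone, failed_hosts=()):
    """Return the healthy hosts a VM may be placed on, restricted to its zone.

    host_zones:   dict mapping host name -> zone name (hypothetical layout)
    vm_zone:      the availability zone the VM is bound to
    failed_hosts: hosts that are currently down
    """
    failed = set(failed_hosts)
    return sorted(
        host
        for host, zone in host_zones.items()
        if zone == vm_zone and host not in failed
    )


host_zones = {"host-a": "primary", "host-b": "primary", "host-c": "secondary"}
print(candidate_hosts(host_zones, "primary", failed_hosts=["host-a"]))
```

With host-a down, the only candidate for a VM bound to the primary zone is host-b; host-c is never offered because it belongs to the secondary zone.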

Leveraging existing standard VM HA capabilities, placement group functionality, and HA priority settings, Arcfra AECP can provide more comprehensive HA protection for mission-critical business systems in industries such as finance, healthcare, and manufacturing, helping enterprise users build a stable, efficient, and deployable foundation.

Learn more about the performance leap and upgraded features in AECP 6.3 from our latest blogs:

Arcfra AECP 6.3 Breaks the 11M IOPS Barrier, Delivering Tier-1 All-Flash Performance and RPO=0 Resilience for Enterprise Cloud

What’s New in Arcfra Enterprise Cloud Platform 6.3

Arcfra AECP 6.3 Deep Dive: Full-Stack Disaster Recovery with Synchronous Replication and Arcfra Operation Center High Availability

Arcfra AECP 6.3 Tech Insights: Stretched Cluster (Active-Active) vs. Synchronous Replication

Arcfra AECP 6.3 Deep Dive: RDMA Cross-NIC HA for High-Performance Workload Reliability

Arcfra AECP 6.3 Tech Insights: Does Its Real-World Performance Deliver?

About Arcfra

Arcfra simplifies enterprise cloud infrastructure with a full-stack, software-defined platform built for the AI era. We deliver computing, storage, networking, security, Kubernetes, and more — all in one streamlined solution. Supporting VMs, containers, and AI workloads, Arcfra offers future-proof infrastructure trusted by enterprises across e-commerce, finance, and manufacturing. Arcfra is recognized by Gartner as a Representative Vendor in full-stack hyperconverged infrastructure. Learn more at www.arcfra.com.