Virtual Machine High Availability (VM HA) is a crucial feature for virtualization platforms. It ensures that VMs can be quickly rebuilt after a server failure, thereby minimizing the failure's impact on business operations. To maintain business continuity as far as possible, VM HA should cover a broad range of failure scenarios, accurately identify the specific failure, and apply the switchover strategy appropriate to it.
As a full-stack software-defined infrastructure, Arcfra Enterprise Cloud Platform (AECP) provides integrated compute and storage services, thereby enhancing the effectiveness of VM HA through storage features such as multi-replica strategy and rack awareness (topology-aware data allocation). Additionally, it offers rich VM rebuilding services that enable IT engineers to address various server and VM failure scenarios through an optimized VM HA function.
In the following sections, we will first explore how to ensure the availability of an enterprise cloud platform like AECP, followed by an introduction to the VM HA designs and functionality in AECP.
Like traditional virtualization platforms, an enterprise cloud platform with hyperconvergence that integrates compute and storage infrastructure provides the VM HA feature that automatically rebuilds affected VMs and migrates them to other healthy hosts within the cluster in the event of a server failure. This helps to reduce downtime and minimize service interruptions, eliminating the need for dedicated backup hardware and additional software installations. Key stages of VM HA generally include:
However, differences still exist in the design and implementation of VM HA between HCI-based enterprise cloud platforms and traditional virtualization.
Once a failure occurs in a virtualization platform, to transfer and resume VMs on other hosts, VM data must be stored in a shared storage device accessible to all hosts. This means VM HA on a virtualization platform heavily relies on shared storage (such as external storage devices like FC-SAN or IP-SAN). In traditional architecture, as hosts and shared storage operate independently, storage availability is ensured by the storage device itself. Consequently, the HA feature of the virtualization platform does not involve storage availability capabilities.
In contrast, enterprise cloud platforms like AECP converge compute virtualization and software-defined storage on the same host. This convergence makes HA design more complex, because the platform must ensure the availability of both compute virtualization and storage.
Server node failures can directly impact the functionality of VMs:
Regardless of its type, a failure is highly likely to affect the normal operation of business services, and VM HA is needed to restore those services as quickly as possible.
Arcfra Cloud Operating System (ACOS, AECP’s foundation software) provides data redundancy strategies including data replication and erasure coding. Under the replica strategy, all written data is automatically replicated into multiple copies (optionally 2 copies or 3 copies), and different copies of the same data are written to separate servers. Therefore, in the event of one or more server node failures, one or more copies of the software-defined storage will go offline.
Under the erasure coding (EC) strategy, the data and parity blocks on the failed server node will also go offline. We will explore this topic in future blog posts.
Given the complex and varied nature of failures in production environments, VM HA must accurately identify failure scenarios. Incorrect or overly frequent HA triggering can negatively impact business continuity. To address this, VM HA in ACOS is designed to handle a wide range of failure scenarios.
If a server node goes down and the VMs on it are passively shut down, the system will automatically trigger VM HA. It will obtain a list of VMs from the failed server and rebuild these VMs on other healthy nodes in the cluster to resume business operations.
If all the storage network ports of a server node fail, the node's disks and data go offline and the node cannot communicate with the other healthy nodes in the cluster. The network heartbeat then times out. If the network outage lasts for 9 detection cycles (90 seconds), the system triggers the node to isolate itself and actively suspends the VMs on the failed node (if suspension is not possible, these VMs are shut down directly). Subsequently, the system rebuilds these VMs on other healthy nodes to resume business operations.
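The isolation rule can be sketched as a counter of missed heartbeats. This is a minimal illustration rather than ACOS code: the class and function names are hypothetical, the 10-second cycle is derived from "9 detection cycles (90 seconds)", and resetting the counter on a successful heartbeat is an assumption.

```python
from dataclasses import dataclass

DETECTION_CYCLE_SECONDS = 10    # derived from the text: 90 s / 9 cycles
ISOLATION_THRESHOLD_CYCLES = 9  # stated in the text


@dataclass
class StorageHeartbeatMonitor:
    """Counts consecutive missed storage-network heartbeats (hypothetical helper)."""
    missed_cycles: int = 0
    isolated: bool = False

    def record_cycle(self, heartbeat_ok: bool) -> bool:
        """Record one detection cycle; return True once the node should isolate itself."""
        if self.isolated:
            return True
        if heartbeat_ok:
            # Assumption: any successful heartbeat resets the outage counter.
            self.missed_cycles = 0
        else:
            self.missed_cycles += 1
        if self.missed_cycles >= ISOLATION_THRESHOLD_CYCLES:
            self.isolated = True
        return self.isolated


def handle_isolation(vm_can_suspend: dict[str, bool]) -> dict[str, str]:
    """Suspend each VM where possible; otherwise shut it down directly."""
    return {vm: ("suspend" if ok else "shutdown") for vm, ok in vm_can_suspend.items()}
```

Once `record_cycle` returns `True`, the VMs on the isolated node would be suspended or shut down and then rebuilt on other healthy nodes.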
ACOS also supports the detection of VM network failures. If all business network ports on a node fail while the storage network ports remain functional, the system will trigger an HA within 60 seconds. In this scenario, the VMs will not be restarted. Instead, the system will automatically migrate the VMs from the failed node to other healthy nodes in the cluster.
In case of a VM guest OS failure (no I/O, no network activity, and abnormal processes), if the VM heartbeat has been lost for 20/50/110 seconds while the total I/O count and NIC packet count have remained unchanged for 10/40/100 seconds, the system will also trigger VM HA, reboot the VM, and record the failure.
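The two-part condition (heartbeat lost and counters unchanged) can be expressed as a simple check. The sketch below is illustrative: the type and function names are hypothetical, and because the text does not say which setting selects each threshold pair, the three pairs are simply listed in the order quoted.

```python
from typing import NamedTuple


class GuestFailureThreshold(NamedTuple):
    heartbeat_lost_s: int   # seconds without a VM heartbeat
    counters_static_s: int  # seconds with unchanged I/O and NIC packet counters


# The three pairs quoted in the text (20/50/110 s and 10/40/100 s).
THRESHOLDS = (
    GuestFailureThreshold(20, 10),
    GuestFailureThreshold(50, 40),
    GuestFailureThreshold(110, 100),
)


def guest_os_failed(
    heartbeat_lost_s: float,
    counters_static_s: float,
    threshold: GuestFailureThreshold,
) -> bool:
    """True only when both conditions hold: heartbeat lost AND counters unchanged."""
    return (
        heartbeat_lost_s >= threshold.heartbeat_lost_s
        and counters_static_s >= threshold.counters_static_s
    )
```

Requiring both signals avoids rebooting a VM whose heartbeat agent has stalled while the guest is still doing I/O or sending packets.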
If a power outage occurs in the server room, causing all nodes in the cluster to go offline simultaneously, the system cannot initiate VM HA. However, once power is restored, the HA function will be retriggered. Since there is no hardware failure, the system will prioritize restarting the VMs on the original nodes to restore normal VM operations.
In addition, for VMs on which users do not want HA to be triggered automatically, ACOS provides a VM HA switch. This feature allows users to enable or disable the HA function for each VM individually. In the event of a node failure, VMs with the HA function disabled will remain powered off, and users can restart them manually.
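The scenarios above can be read as a small decision table. The sketch below is illustrative only: the enum names and `plan_action` are hypothetical, not an ACOS API, and treating the per-VM HA switch as suppressing every automatic action is an assumption (the text describes it only for node failures).

```python
from enum import Enum, auto


class Failure(Enum):
    NODE_DOWN = auto()
    STORAGE_NETWORK_DOWN = auto()
    BUSINESS_NETWORK_DOWN = auto()
    GUEST_OS_FAILURE = auto()
    CLUSTER_POWER_OUTAGE = auto()


class Action(Enum):
    REBUILD_ON_OTHER_NODE = auto()     # node is down, VMs were shut down passively
    ISOLATE_THEN_REBUILD = auto()      # suspend or shut down, then rebuild elsewhere
    MIGRATE_WITHOUT_RESTART = auto()   # VM keeps running and moves to a healthy node
    REBOOT_VM = auto()                 # guest OS failure: reboot and record the failure
    RESTART_ON_ORIGINAL_NODE = auto()  # after power returns, prefer the original node
    NONE = auto()                      # HA disabled: left for manual handling


RESPONSES = {
    Failure.NODE_DOWN: Action.REBUILD_ON_OTHER_NODE,
    Failure.STORAGE_NETWORK_DOWN: Action.ISOLATE_THEN_REBUILD,
    Failure.BUSINESS_NETWORK_DOWN: Action.MIGRATE_WITHOUT_RESTART,
    Failure.GUEST_OS_FAILURE: Action.REBOOT_VM,
    Failure.CLUSTER_POWER_OUTAGE: Action.RESTART_ON_ORIGINAL_NODE,
}


def plan_action(failure: Failure, ha_enabled: bool) -> Action:
    """Pick the HA response for one VM, honoring its per-VM HA switch."""
    if not ha_enabled:
        return Action.NONE
    return RESPONSES[failure]
```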
HA Trigger Timeline:
Theoretically, when HA is triggered, the system can restart the VM on any randomly selected host in the cluster, as long as that host is healthy. However, not all hosts in a cluster have identical hardware configurations, so restarting VMs on another host may disrupt normal business operations if they are sensitive to the host's hardware environment. Therefore, ACOS provides a fine-grained VM rebuilding mechanism.
The essence of a VM placement group is to enforce placement rules for a group of VMs so that they are placed on suitable nodes, without unexpected interruption, every time they are powered on, migrated, or rebuilt by HA. Applicable scenarios include:
1) HA for business services
Multiple VMs supporting the same business should not be placed on the same host when implementing application-level failover. Otherwise, a single host failure may affect the business continuity. In this case, users can leverage the placement group policy and set relevant VMs to get rebuilt on different hosts when HA is triggered.
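A "keep these VMs on different hosts" rule can be sketched as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, the rule is treated as strict (placement fails rather than doubling up), and capacity checks are omitted.

```python
def place_on_distinct_hosts(
    vms_to_rebuild: list[str],
    healthy_hosts: list[str],
    occupied_hosts: set[str],
) -> dict[str, str]:
    """Assign each VM of a placement group to a host that runs no other group member.

    occupied_hosts: hosts already running surviving members of the same group.
    """
    free_hosts = [h for h in healthy_hosts if h not in occupied_hosts]
    if len(free_hosts) < len(vms_to_rebuild):
        raise ValueError("not enough healthy hosts to keep group members apart")
    # zip pairs each VM with a different free host, so no host is used twice.
    return dict(zip(vms_to_rebuild, free_hosts))
```

For example, if `web-2` already runs on `h2` and `web-1` must be rebuilt, `place_on_distinct_hosts(["web-1"], ["h2", "h3"], {"h2"})` places `web-1` on `h3`.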
2) Business service is sensitive to the host CPU
For the above scenarios, users can set the placement group policy to ensure that VMs will be rebuilt on a designated host (with specific CPU resources) after triggering HA.
3) Business services with specific network requirements
If VMs need to access a particular network, and only certain hosts in the cluster can reach that network or network port, the VMs may be unable to communicate properly after HA. In this case, users can set the placement group policy to ensure that the VMs are rebuilt on the designated hosts (with the required network resources) when HA is triggered.
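Restricting a rebuild to designated hosts, whether for CPU or network reasons, amounts to filtering the healthy hosts by what each one offers. The sketch below is hypothetical: the function name and the capability labels such as `net:dmz` are invented for illustration and do not reflect how ACOS models host resources.

```python
def hosts_meeting_requirements(
    host_capabilities: dict[str, set[str]],
    healthy_hosts: list[str],
    required: set[str],
) -> list[str]:
    """Return the healthy hosts that offer every required capability label."""
    return [
        host
        for host in healthy_hosts
        if required <= host_capabilities.get(host, set())  # subset test
    ]
```

With `{"h1": {"net:dmz"}, "h2": {"net:dmz", "cpu:avx512"}, "h3": set()}` and `h1` failed, a VM requiring `{"net:dmz"}` can only be rebuilt on `h2`.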
When HA is triggered, all the VMs with HA enabled on the failed node enter the rebuild queue in random order. By default, this queue does not prioritize VMs that carry critical business applications. Furthermore, a node failure reduces the overall resources of the cluster (CPU, memory, storage, and so on). If the remaining cluster resources are too tight to support all the VMs that need to be rebuilt, crucial VMs may fail to rebuild.
To address this issue, ACOS provides the Failure Detecting Sensitivity and HA priority feature.
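The effect of an HA priority on the rebuild queue can be sketched as follows. This is a simplified sketch: the names are hypothetical, priority is modeled as an integer where larger means more important, and cluster capacity is reduced to free memory alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingVM:
    name: str
    priority: int    # hypothetical scale: larger means more important
    memory_gib: int


def plan_rebuild(
    pending: list[PendingVM], free_memory_gib: int
) -> tuple[list[str], list[str]]:
    """Admit VMs in priority order while memory remains; defer the rest."""
    admitted: list[str] = []
    deferred: list[str] = []
    # sorted() is stable, so VMs with equal priority keep their queue order.
    for vm in sorted(pending, key=lambda v: -v.priority):
        if vm.memory_gib <= free_memory_gib:
            admitted.append(vm.name)
            free_memory_gib -= vm.memory_gib
        else:
            deferred.append(vm.name)
    return admitted, deferred
```

With 48 GiB free and three pending VMs (`db`: priority 3, 32 GiB; `web`: priority 2, 16 GiB; `batch`: priority 1, 16 GiB), the database and web VMs are rebuilt first and the batch VM is deferred, whatever order they entered the queue in.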
As previously mentioned, ACOS places multiple data replicas on different servers, which allows it to withstand server hardware failures and automatically recover data from the surviving copies. However, if all these servers are placed in the same cabinet and the shared PDU power supply fails, multiple hosts will go offline simultaneously, potentially defeating multi-replica protection. The rack awareness feature solves this problem by detecting the physical topology of the storage servers (which cabinet each one sits in) and automatically placing data replicas on servers located in different cabinets. Even if one cabinet loses power, the system can retrieve the corresponding data copies from servers in other cabinets and trigger the data recovery process.
As previously emphasized, storage availability is the key to VM HA. The rack awareness feature not only enhances cluster availability but also improves the effectiveness of VM HA.
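Rack-aware placement can be sketched as choosing replica hosts so that no two share a cabinet. This is an illustrative sketch, not the ACOS allocation algorithm: the function name is hypothetical, hosts are visited in name order for determinism, and capacity and load are ignored.

```python
def place_replicas(host_rack: dict[str, str], replicas: int) -> list[str]:
    """Choose `replicas` hosts, each in a different rack (cabinet)."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    chosen: list[str] = []
    used_racks: set[str] = set()
    for host in sorted(host_rack):
        rack = host_rack[host]
        if rack in used_racks:
            continue  # this rack already holds a replica
        chosen.append(host)
        used_racks.add(rack)
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough distinct racks for the requested replica count")
```

Because every replica lands in a different cabinet, losing power to any single cabinet removes at most one copy, so the data stays available for VM HA to rebuild against.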
Overall, in the initial stage of failure, ACOS can accurately identify the failure scenario and carry out corresponding HA reactions to minimize the impact of HA switching. After triggering HA switching, the system will accurately arrange the VMs to be rebuilt on appropriate hosts based on predefined rules, with the rebuilding order being arranged according to the business’s importance. Moreover, with rack awareness, ACOS VM HA can ensure business continuity effectively.
To learn more about AECP features and capabilities, please visit our website.
Arcfra simplifies enterprise cloud infrastructure with a full-stack, software-defined platform built for the AI era. We deliver computing, storage, networking, security, Kubernetes, and more — all in one streamlined solution. Supporting VMs, containers, and AI workloads, Arcfra offers future-proof infrastructure trusted by enterprises across e-commerce, finance, and manufacturing. Arcfra is recognized by Gartner as a Representative Vendor in full-stack hyperconverged infrastructure. Learn more at www.arcfra.com.