Enterprise cloud platforms differ in how data is read from and written to the storage layer, which leads to differences in performance. Understanding how I/O works on your platform can therefore benefit IT infrastructure planning and deployment (e.g., tailoring network bandwidth to the I/O paths).
This article compares the I/O paths of VMware Cloud Foundation (VCF) with vSAN and the Arcfra Enterprise Cloud Platform (AECP) when replicas are used for data redundancy.* It analyzes read and write I/O in various scenarios, along with their locality probabilities, to explain how the I/O path affects storage and cluster performance.
*Note: Both VCF and AECP support data redundancy strategies such as replication and erasure coding (EC), and the I/O paths differ between the two strategies. This article focuses on the I/O paths of VCF and AECP under their replica strategies.
The application layer transfers data to a storage medium (usually a disk) for persistent storage through the system's I/O path. This process involves a variety of hardware devices and software logic, and the overhead and processing time (i.e., latency) increase as more hardware and software are involved. The design of the I/O path therefore has a significant impact on overall application performance.
In a traditional three-tier architecture, writes are routed from a VM through the hypervisor, the host's FC HBA, the FC SAN switch, and the SAN storage controller before being stored on a physical disk.
Each of the hardware devices and software layers involved in this process adds its own overhead to the write path.
In enterprise cloud platforms that converge computing, networking, and storage, writes are routed from a VM through the hypervisor and the software-defined storage (SDS) layer before being stored on physical disks, either locally or over Ethernet.
Again, each hardware device and software layer involved in this process adds overhead to the write path.
In legacy virtualization with SAN storage, VM data must pass through the SAN network before reaching the SAN storage; in other words, 100% of reads and writes traverse the network. In contrast, VM I/O on a converged architecture is written to disks via the built-in SDS, which is based on a distributed architecture. As a result, data reads and writes occur both remotely (over the network) and locally (on local disks).
With the same hardware and software configuration, local reads and writes should have shorter response times than remote I/O, because network transmission inevitably adds latency and overhead even when low-latency switches are deployed.
vSAN stores VM disk files (.vmdk files) as objects and provides a variety of data redundancy mechanisms, including FTT=1 (RAID1 Mirror/RAID5) and FTT=2 (RAID1 Mirror/RAID6). The following discussion is based on the commonly used FTT=1 (RAID1 Mirror), since RAID5/6 applies only to all-flash clusters, whereas hybrid configurations are more common in converged deployments.
With FTT=1, two replicas of the VM disk (.vmdk) are created and placed on two different hosts. In vSAN, the default object size is 255 GB with one stripe, so if a VM creates a 200 GB virtual disk, vSAN generates a set of mirrored components: an object and a replica distributed across two hosts. There is also a witness component, but we leave it aside here because it contains no business data. If a virtual disk is larger than 255 GB, it is split into multiple components in 255 GB units.
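As a rough illustration of the 255 GB splitting rule, here is a minimal sketch (assuming components are simply sized by rounding up, which is a simplification of vSAN's actual placement logic):

```python
# Minimal sketch of how a virtual disk maps to vSAN components (illustrative
# only; real vSAN placement involves more policy logic than shown here).
import math

OBJECT_SIZE_GB = 255  # default component size with 1 stripe

def component_count(vdisk_size_gb: int) -> int:
    """Number of components per replica for a virtual disk of the given size."""
    return max(1, math.ceil(vdisk_size_gb / OBJECT_SIZE_GB))

print(component_count(200))  # 1 -> mirrored as one object plus one replica
print(component_count(600))  # 3 -> each replica is split into three components
```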
Write I/O Path
I/O Path under Normal State
Instead of using a locality-preference strategy, vSAN distributes components across nodes randomly. Let's look at vSAN's write I/O using a 4-node cluster as an example.
In a 4-node cluster, there are 6 possible ways to place an object and its replica. In three of them (a 50% chance), both copies are remote and must be written over the network; in the other three, the placement is hybrid, with one copy local and one remote. In other words, even in the best case, at least one copy is written over the network in vSAN.
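A quick enumeration makes these numbers concrete (a sketch assuming the two copies are placed on two distinct hosts chosen uniformly at random, with no locality preference):

```python
# Enumerate all placements of two copies across a 4-node cluster and count
# how many leave the VM's host without a copy (illustrative assumption:
# placement is uniformly random with no locality preference).
from itertools import combinations

HOSTS = range(4)   # 4-node cluster
VM_HOST = 0        # host where the VM runs

placements = list(combinations(HOSTS, 2))               # 6 possible host pairs
fully_remote = [p for p in placements if VM_HOST not in p]

print(len(placements))                          # 6
print(len(fully_remote) / len(placements))      # 0.5 -> both copies written remotely
print(1 - len(fully_remote) / len(placements))  # 0.5 -> hybrid: one local, one remote
```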
Read I/O Path
I/O Path under Normal State
According to technical data disclosed at VMworld, vSAN's read I/O follows three principles, the most relevant of which here is that reads are load-balanced across replicas.
Because of this load balancing across replicas, even if the host where the VM resides holds a replica (a 50% probability, according to Figure 4), 50% of reads are still served over the network. Furthermore, there is a 50% probability that the VM's node holds no replica at all, in which case all data is read from remote replicas. In short, under normal conditions, vSAN does not necessarily read from local replicas.
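Combining the two cases gives a rough expected value (a back-of-the-envelope sketch based only on the probabilities stated above, not an official vSAN figure):

```python
# Expected share of remote reads in vSAN under normal conditions,
# using the probabilities discussed above (illustrative arithmetic only).
p_local_replica = 0.5   # chance the VM's host holds one of the two replicas
remote_if_local = 0.5   # load balancing still sends half of the reads remotely
remote_if_none = 1.0    # no local replica: every read crosses the network

expected_remote = p_local_replica * remote_if_local + (1 - p_local_replica) * remote_if_none
print(expected_remote)  # 0.75 -> on average, about 75% of reads are remote
```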
I/O Path in Downtime
When a replica fails in the cluster, all read I/O goes to the surviving replica; with the other copy lost, balanced reading is no longer possible.
In this scenario, there is a 25% probability of reading all data locally (a shorter I/O path) and a 75% probability of reading all data remotely.
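The 25% figure follows from the placement probabilities above; here is the arithmetic as a sketch (assuming either replica is equally likely to be the one that fails):

```python
# Probability that all reads stay local after one replica fails (illustrative;
# assumes the failed replica is equally likely to be either of the two copies).
p_local_replica = 0.5       # the VM's host held one of the two replicas
p_remote_copy_failed = 0.5  # given a local replica, the failed copy is the remote one

p_all_local = p_local_replica * p_remote_copy_failed
print(p_all_local)          # 0.25
print(1 - p_all_local)      # 0.75 -> all reads go over the network
```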
In addition, in the event of a hard disk failure, data repair reads from the only remaining replica. The default component size (255 GB) and stripe count (1) in vSAN concentrate a component on one or two hard disks, making it difficult to parallelize recovery across disks. This is why VMware suggests configuring more stripes in the storage policy to avoid the performance bottleneck of a single hard disk.
Arcfra's distributed block storage, ABS, slices each virtual disk into multiple data blocks (extents) and provides data redundancy through 2-replica and 3-replica strategies. The 2-replica strategy offers the same level of redundancy as vSAN FTT=1 (RAID1 Mirror), so we analyze the ABS I/O path based on the 2-replica strategy.
With the 2-replica strategy, a virtual disk in ABS consists of multiple extents (256 MiB each) that exist as a set of mirrors, with a default stripe count of 4. Notably, ABS features data localization, which allows precise control of replica locations: one full replica is placed on the host where the VM runs, and the other is placed on a remote host.
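The sketch below illustrates this layout (host names, the round-robin choice of the remote host, and the helper functions are illustrative assumptions, not ABS's actual implementation):

```python
# Illustrative model of ABS-style extent slicing and 2-replica placement with
# data locality (simplified; not the actual ABS code).
import math

EXTENT_MIB = 256

def extent_count(vdisk_size_gib: int) -> int:
    """Number of 256 MiB extents a virtual disk is sliced into."""
    return math.ceil(vdisk_size_gib * 1024 / EXTENT_MIB)

def place_replicas(vm_host, remote_hosts, extent_id):
    """One copy always on the VM's host (data locality); the other copy on a
    remote host, chosen round-robin here purely for illustration."""
    return (vm_host, remote_hosts[extent_id % len(remote_hosts)])

print(extent_count(200))  # 800 extents for a 200 GiB virtual disk
print(place_replicas("node-1", ["node-2", "node-3", "node-4"], 0))
# ('node-1', 'node-2') -> a local copy plus a remote copy
```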
Write I/O Path
I/O Path under Normal State
Again, we take a 4-node cluster as an example. Since ABS guarantees there is always one replica on the VM's node, 50% of writes occur locally and 50% remotely, regardless of where the other replica is placed. As a result, ABS never writes 100% of the data remotely in this scenario.
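A short check of this claim (host names are arbitrary examples; the point is that the local share of each write is fixed by data locality):

```python
# With data locality, one copy of every write lands on the VM's host, so the
# remote share is 50% no matter where the second replica lives (illustrative).
vm_host = "node-1"
for second_replica_host in ["node-2", "node-3", "node-4"]:
    copies = [vm_host, second_replica_host]          # the two copies of each write
    remote_share = sum(h != vm_host for h in copies) / len(copies)
    print(second_replica_host, remote_share)         # always 0.5
```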
I/O Path Before and After VM Migration
ABS data locality effectively reduces I/O access latency because one replica is stored on the local host. But what happens if the VM is migrated? In general, there are two possible scenarios after VM migration.
Scenario 1: after migration, neither replica is on the new local host, so 100% of writes are remote (66.7% probability).
Scenario 2: after migration, the VM lands on the host that holds the other replica, so 50% of writes are remote (33.3% probability).
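These probabilities can be confirmed by a quick enumeration (a sketch assuming the VM migrates to one of the other three nodes with equal probability; host names are arbitrary):

```python
# Enumerate possible migration targets in a 4-node cluster (illustrative).
hosts = ["node-1", "node-2", "node-3", "node-4"]
old_vm_host = "node-1"           # held the local replica before migration
other_replica_host = "node-2"    # location of the second replica

destinations = [h for h in hosts if h != old_vm_host]  # 3 possible targets
no_local_copy = [d for d in destinations if d != other_replica_host]

print(len(no_local_copy) / len(destinations))      # ~0.667 -> 100% remote writes
print(1 - len(no_local_copy) / len(destinations))  # ~0.333 -> 50% remote writes
```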
As shown above, it is more likely that I/O is written entirely remotely after migration. To address this, ABS optimizes the I/O path: after VM migration, newly written data is stored directly on the new local node, and the corresponding replica is moved to the new node after 6 hours. This re-establishes data localization for the migrated VM and also solves the problem of post-migration remote reads.
Read I/O Path
I/O Path under Normal State
Owing to data localization, ABS always reads data from the local host.
I/O Path under Data Recovery
Scenario 1: a hard disk fails on the node where the VM is located.
I/O access is quickly switched from the local node to a remote node to keep the storage service running as usual. Once data recovery is triggered, a local replica is rebuilt to resume local I/O access.
Scenario 2: a hard disk fails on a remote node.
I/O continues to be read locally, while data is recovered by copying the available replica from the local node to a remote node.
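The read-routing decision in these two scenarios can be summarized in a few lines (a hedged sketch of the logic described above, not ABS's actual code):

```python
# Simplified read routing during recovery (illustrative logic only).
def route_read(local_replica_healthy: bool, remote_replica_healthy: bool) -> str:
    if local_replica_healthy:
        return "local"   # Scenario 2: keep reading locally while repairing remotely
    if remote_replica_healthy:
        return "remote"  # Scenario 1: fail over until the local replica is rebuilt
    raise RuntimeError("no healthy replica available")

print(route_read(local_replica_healthy=False, remote_replica_healthy=True))  # remote
print(route_read(local_replica_healthy=True, remote_replica_healthy=False))  # local
```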
However, during this process the single available replica has to handle application I/O and data recovery at the same time. Does this put too much pressure on the storage system? In ABS, the answer is no.
Based on the preceding discussion, we can summarize the vSAN and ABS I/O paths in various scenarios as follows. In terms of latency, 100% local access is the most desirable, followed by 50% remote access, and then 100% remote access.
Under the normal state and during data recovery, ABS has a higher probability of local I/O and thus a theoretically lower latency than vSAN. vSAN, although it frequently performs remote I/O in the normal state, offers a better I/O path immediately after VM migration. ABS's lower probability of local I/O right after migration improves once data locality is re-established (which takes at least a few hours).
In short, ABS shows a clear advantage when VM live migration is not required frequently (i.e., not every few hours); otherwise, vSAN would be preferable, although such frequent live migrations are uncommon in production environments.
Regarding the I/O path's impact on cluster performance, you may also wonder: if the total application I/O in a cluster is quite low, does the I/O path design still affect cluster performance? Is it really necessary to keep I/O reads and writes local?
The answer is yes, because remote I/O access not only increases latency but also consumes additional network resources. Under the normal state these network overheads may not cause heavy stress, but the impact can become significant when network traffic surges, for example during large-scale data recovery.
After all, users may end up bearing additional costs (e.g., upgrading to a 25G network) to avoid network resource contention, which may not be a desirable solution.
For more information on ABS and AECP, please visit our website.
Arcfra is an IT innovator that simplifies on-premises enterprise cloud infrastructure with its full-stack, software-defined platform. In the cloud and AI era, we help enterprises effortlessly build robust on-premises cloud infrastructure from bare metal, offering computing, storage, networking, security, backup, disaster recovery, Kubernetes service, and more in one stack. Our streamlined design supports both virtual machines and containers, ensuring a future-proof infrastructure.
For more information, please visit www.arcfra.com.