An In-Depth Look at Arcfra Erasure Coding: Configuration Strategies, Performance, and Best Practices
2025-02-13
Arcfra Team

Arcfra Enterprise Cloud Platform (AECP) allows users to choose between replication and erasure coding (EC) strategies for data redundancy protection. Unlike data replication, EC calculates parity blocks (m) for multiple data blocks (k), eliminating the need to store complete data replicas and thereby saving storage space. When data blocks are damaged (up to m of them), the corrupted data can be reconstructed from the surviving data and parity blocks.
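As a rough intuition, here is a minimal Python sketch of single-parity recovery (k = 3, m = 1, XOR parity). It is only a toy illustration; production EC implementations typically use Reed-Solomon codes, which generalize this idea to m > 1:

```python
# Toy erasure coding with k = 3 data blocks and m = 1 parity block.
data_blocks = [b"\x01\x02", b"\x0a\x0b", b"\x10\x20"]       # k = 3
parity = bytes(a ^ b ^ c for a, b, c in zip(*data_blocks))  # m = 1 (XOR)

# Suppose block 1 is lost: XOR of the survivors and the parity rebuilds it.
rebuilt = bytes(a ^ c ^ p for a, c, p in
                zip(data_blocks[0], data_blocks[2], parity))
assert rebuilt == data_blocks[1]
```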

In our previous blog, we explained the storage I/O path when using EC for data redundancy. Below, we dive deeper into EC configuration strategies and share best practices for using data replication and EC in AECP through comparisons and a real-world case.

EC Configuration Strategies

Since EC does not store full redundant data blocks, it can significantly reduce storage space compared to replication at the same level of fault tolerance. However, more nodes are required to place the coded blocks and to ensure that, in case of node failure, the remaining nodes can recover them. With the same fault tolerance, increasing the number of EC data blocks improves space utilization but also raises the number of nodes required.

[Figure: EC space utilization vs. number of nodes required]

* Under the EC strategy, data redundancy is configured as k+m, with space utilization calculated as k/(k+m). The minimum node configuration is k+m+1, and the recommended configuration is k+m+m.
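These rules are simple enough to capture in a short illustrative snippet (the function name is ours, not an AECP API):

```python
def ec_profile(k: int, m: int) -> dict:
    """Space utilization and node counts for an EC k+m scheme:
    utilization = k/(k+m), minimum nodes = k+m+1, recommended = k+m+m."""
    return {
        "scheme": f"EC {k}+{m}",
        "utilization": k / (k + m),
        "min_nodes": k + m + 1,
        "recommended_nodes": k + m + m,
    }

print(ec_profile(4, 2))
# {'scheme': 'EC 4+2', 'utilization': 0.666..., 'min_nodes': 7, 'recommended_nodes': 8}
```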

On AECP, users can configure EC's fault tolerance (m) from 1 to 4 (1 or 2 is recommended). The tolerable failure scenarios depend on the number of master nodes in the cluster, as well as the redundancy policies of the cache tier and the capacity tier, as outlined below:

[Table: tolerable failure scenarios by master node count and tier redundancy policies]

The number of data blocks (k) must be configured as an even number for convenient I/O splitting. For example, if a volume's strip is 256KiB and the EC block is 4KiB, an even number of k ensures that the volume corresponds to more than one full strip after splitting. For m = 1 or 2, even k can range from 2 to 22; for m = 3 or 4, even k can range from 4 to 8. Thus, AECP supports 28 different erasure coding ratio schemes.

When selecting a ratio scheme, it is recommended to use the more reliable storage strategy whenever possible. If multiple strategies offer equal reliability, choose the one with higher space utilization.
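To make this concrete, the following illustrative snippet enumerates the 28 supported ratios and ranks them by the rule above, treating fault tolerance (m) as the reliability measure:

```python
# The 28 supported ratios: even k from 2-22 for m = 1 or 2,
# and even k from 4-8 for m = 3 or 4.
schemes = [(k, m) for m in (1, 2) for k in range(2, 23, 2)]
schemes += [(k, m) for m in (3, 4) for k in range(4, 9, 2)]
assert len(schemes) == 28

# Rank: higher fault tolerance first, then higher utilization k/(k+m).
ranked = sorted(schemes, key=lambda s: (-s[1], -s[0] / (s[0] + s[1])))
for k, m in ranked[:3]:
    print(f"EC {k}+{m}: utilization {k / (k + m):.1%}")
# EC 8+4: 66.7%, EC 6+4: 60.0%, EC 4+4: 50.0%
```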

Regarding the number of nodes, EC requires a minimum configuration of k+m+1 nodes. This ensures that after a single node failure, the remaining nodes can restore the data to the expected redundancy level, and that in other tolerable failure scenarios, the data can still be read through the surviving coded blocks. When m > 1, to improve reliability, it is recommended to configure k+m+m nodes, which ensures expected data recovery in case of a 2-node failure or an m-data-disk failure. For different node counts, the recommended EC configurations are as follows:

[Table: recommended EC configurations by node count]
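As a sketch of how such a recommendation can be derived (assuming m = 2 and the k+m+m recommended footprint; the helper below is hypothetical, not an AECP API):

```python
def schemes_for_nodes(nodes: int, m: int = 2):
    """Hypothetical helper: EC m=2 schemes whose recommended
    footprint (k + m + m nodes) fits the given node count."""
    return [f"EC {k}+{m}" for k in range(2, 23, 2) if k + 2 * m <= nodes]

print(schemes_for_nodes(8))   # ['EC 2+2', 'EC 4+2']
```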

Additionally, to gain rack-level fault tolerance, users need to distribute the nodes across different racks, which requires deploying more racks.

EC vs. Data Replication: Comparison of Performance

Will the adoption of different data redundancy strategies affect AECP’s performance?

Both EC and data replication write new data to the write cache as replicas, and the system reads from the write cache if the data has not yet been sunk to the capacity tier. In this scenario, there is no difference between the two redundancy mechanisms in terms of storage performance.

However, once data is sunk to the capacity tier, EC needs to read data across nodes, resulting in reduced read performance compared to that before sinking. With data replication, thanks to its I/O localization capability, read performance is not impacted.

Read performance when using EC strategy

Before data is sunk to the capacity tier — 4K random read IOPS averaged about 319K*, the same as that of using replication.

[Graph: 4K random read IOPS with EC, data in write cache]

After data is sunk to the capacity tier and then promoted to the cache tier — with all data promoted to cache, 4K random read IOPS averaged around 200K*.

[Graph: 4K random read IOPS with EC, data promoted to cache tier]

* The test data may vary depending on the hardware and EC configuration.

When a failure occurs, storage performance degrades under both data redundancy strategies. With data replication, the system reads directly from surviving replicas, and the cross-node reads cause a performance decrease. Under the EC strategy, the system must reconstruct each corrupted data block from the available coded blocks before reading, so storage performance also degrades.

The following graphs show the performance when reading data from the capacity tier (before data is promoted to the cache tier) under the failure scenario. In the test environment, HDDs were used as data disks and became the main performance bottleneck. When using the EC strategy, since the system can read data across multiple nodes, its performance under failure is slightly higher than that of the replication strategy.

Storage performance when using data replication

4K random read IOPS decreased by about 10%.

[Graph: 4K random read IOPS under failure, data replication]

Storage performance when using EC

4K random read IOPS decreased by about 40%.

[Graph: 4K random read IOPS under failure, EC]

In summary, when using EC, the system's write performance and hot-data read performance are comparable to those of data replication. In some scenarios, data replication has a smaller impact on read performance than EC. For performance-sensitive applications such as databases, the data replication strategy is still recommended.

Summary: EC vs. data replication in features, performance, and applicable scenarios

[Table: EC vs. data replication in features, performance, and applicable scenarios]

Best Practices for Using EC and Data Replication in AECP

Here we demonstrate how to calculate the required resources and configure data replication and EC strategies based on a real case.

Project Requirements

  • Performance requirements: 8K random read/write (r:w = 7:3), with a minimum of 2500 IOPS per TB of raw capacity for mixed read/write
  • Fault tolerance: 2
  • Access protocol: vhost
  • Storage protocol: RDMA
  • Physical disk: All-Flash NVMe SSDs
  • Cluster capacity: 550TB

Calculation and Configuration

1. Calculate the maximum bare capacity of a single node based on performance requirements

Given:

a. A single node delivers 224K mixed read IOPS and 96K mixed write IOPS, for a total of 320K mixed read/write IOPS (performance depends on hardware configuration; for reference only).

b. The requirement is ≥ 2500 IOPS per TB of raw capacity.

c. Maximum node bare capacity = 320K / 2500 = 128TB.

A single node can be configured with sixteen 8TB NVMe SSDs, for a bare capacity of 8TB × 16 = 128TB. The IOPS per TB of bare capacity = 320K / 128 = 2500 IOPS, which meets the performance requirement.
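The step-1 arithmetic, as a quick check in Python (figures taken from this case):

```python
node_iops = 224_000 + 96_000        # per-node mixed read + write IOPS
required_iops_per_tb = 2_500        # project requirement
max_bare_tb = node_iops / required_iops_per_tb
print(max_bare_tb)                  # 128.0 -> sixteen 8TB NVMe SSDs per node
```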

2. Calculate the available storage capacity on a single node based on system space usage

In all-flash untiered mode, each node must be configured with at least 2 data disks containing metadata partitions, with the remaining disks configured as data disks:

a. Data disk with metadata partition: the system occupies 305GiB per disk.

b. Data disk: the system occupies 20GiB per disk.

c. The system occupies 305 × 2 + 20 × 14 = 890GiB ≈ 0.87TiB ≈ 0.96TB per node.

d. Therefore, the space available for storage on a single node is 128 − 0.96 = 127.04TB.

In all-flash tiered mode (single type of SSD), 10% of each node is used as write cache. Space available for storage on a single node = 127.04 × 0.9 ≈ 114.34TB.
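The step-2 arithmetic in Python (GiB converted to decimal TB, figures from this case):

```python
GIB_IN_TB = 2**30 / 10**12                       # one GiB in decimal TB
overhead_tb = (305 * 2 + 20 * 14) * GIB_IN_TB    # ~0.96TB system overhead
untiered_tb = 128 - overhead_tb                  # ~127.04TB usable, untiered
tiered_tb = untiered_tb * 0.9                    # ~114.34TB usable, tiered
print(f"{untiered_tb:.2f} TB untiered, {tiered_tb:.2f} TB tiered")
```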

3. Calculate the minimum number of nodes in the cluster, assuming the 3-replica strategy

Adopt an all-flash untiered mode.

3 replicas’ space utilization rate is 33%.

Minimum number of nodes = 550TB / 0.33 / 127.04 ≈ 13.12 for 550TB of available capacity; thus, 14 nodes are required.
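The step-3 arithmetic in Python:

```python
import math
utilization_3rep = 0.33                     # 3-replica space utilization
nodes = 550 / utilization_3rep / 127.04     # ~13.12
print(math.ceil(nodes))                     # 14
```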

4. Calculate the minimum number of nodes in the cluster, assuming the EC strategy

Adopt an all-flash tiered mode (single type of SSD).

The required number of nodes to be deployed with different EC ratios for 550TB available capacity is shown in the table below, with the minimum (8 nodes) achieved by EC 4+2.

[Table: required node count per EC ratio for 550TB available capacity]
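The table's results can be approximated along these lines (an illustrative reconstruction: m = 2 to satisfy the fault-tolerance requirement, applying the k+m+m recommended node floor; exact per-ratio counts may differ from the table):

```python
import math
per_node_tb = 114.34                        # usable TB per node, tiered mode
for k, m in [(2, 2), (4, 2), (6, 2), (8, 2)]:
    by_capacity = math.ceil(550 / (k / (k + m)) / per_node_tb)
    nodes = max(by_capacity, k + 2 * m)     # apply the recommended node floor
    print(f"EC {k}+{m}: {nodes} nodes")
# EC 4+2 requires the fewest nodes: 8
```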

In summary, compared with using 3 replicas, configuring EC 4+2 meets the capacity, performance, and fault tolerance requirements with fewer nodes (8 vs. 14), significantly reducing the hardware investment cost.

To learn more about AECP replication strategies, please read our previous article: Arcfra Data Replication Explained: An Enhanced Strategy with Temporary Replica

About Arcfra

Arcfra is an IT innovator that simplifies on-premises enterprise cloud infrastructure with its full-stack, software-defined platform. In the cloud and AI era, we help enterprises effortlessly build robust on-premises cloud infrastructure from bare metal, offering computing, storage, networking, security, backup, disaster recovery, Kubernetes service, and more in one stack. Our streamlined design supports both virtual machines and containers, ensuring a future-proof infrastructure.

For more information, please visit www.arcfra.com.