Arcfra Data Replication Explained: An Enhanced Strategy with Temporary Replica
2024-11-14
Arcfra Team

Data replication is a commonly used data redundancy strategy for enterprise cloud platforms. It ensures that even if one or more replicas become abnormal, the storage system can restore them from a healthy replica.

However, the mainstream data replication design cannot avoid the risk of data loss during the replica recovery process. Until replica recovery is fully completed, the number of replicas in the cluster remains below the expected level, that is, the replica count is degraded. If the healthy replica also fails or unintentionally goes offline during this period, data loss becomes much more likely.

Arcfra Enterprise Cloud Platform (AECP) supports two data redundancy strategies: data replication and erasure coding (EC). It also enhances the replica strategy with the “temporary replica” mechanism, which prevents the replica count from degrading during replica recovery and improves the stability of core business services.

Data Replication in AECP: Basic Rules

In AECP, the multi-replica strategy ensures that each piece of data has multiple identical replicas, which are distributed and stored across different devices based on established rules. This strategy is designed to mitigate data damage or loss from hardware failures. In the event of a hardware failure that causes one or more replicas in the cluster to go offline or become damaged, the remaining healthy replica(s) can continue to read and write data. At the same time, the system initiates the process of regenerating new replicas based on the available healthy data. This approach safeguards data integrity and ensures the continuous availability of the data.

Figure 1

Taking the two-replica strategy as an example (Figure 1), the storage volume is partitioned into multiple data blocks, each having two replicas. For instance, data block A’s replicas are stored on node 1 and node 2 respectively. This configuration ensures that even if one server (node) experiences a crash or failure, there is still at least one replica accessible and available.
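The placement in Figure 1 can be illustrated with a minimal sketch. The round-robin rule and the names `place_replicas` and `readable_blocks` are assumptions made for illustration only; the article does not describe AECP's actual placement rules. The sketch only shows that when the replicas of each block sit on distinct nodes, losing any single node leaves every block readable.

```python
def place_replicas(blocks, nodes, replica_count=2):
    """Assign each block to `replica_count` distinct nodes (hypothetical round-robin rule)."""
    if replica_count > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        # Consecutive node indices are distinct because replica_count <= len(nodes).
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replica_count)]
    return placement


def readable_blocks(placement, failed_node):
    """Blocks that still have at least one replica on a node other than `failed_node`."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


nodes = ["node1", "node2", "node3"]
placement = place_replicas(["A", "B", "C"], nodes)
# Block A lands on node1 and node2, as in the article's example.
survivors = {node: readable_blocks(placement, node) for node in nodes}
```

With two replicas on distinct nodes, `survivors` contains all three blocks for every single-node failure.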

In AECP, users have the flexibility to choose between two-replica and three-replica policies, each offering varying levels of resilience against hardware damage.

Enhanced Feature: Temporary Replica

Why It Matters

In addition to hardware damage, AECP clusters may encounter other issues, such as misoperation of hard disks, accidental server node restarts, and storage network disconnections. These problems can cause temporary hardware disconnections (the hardware can be reconnected after a while) and degrade the number of replicas.

Figure 2

Under a two-replica policy (Figure 2), for example, data A is synchronously written to two replicas. If Replica 2 is abnormally disconnected, it is removed, which triggers data recovery to a new replica, Replica 2'. During this phase, only Replica 1 (the healthy replica) can respond to I/O requests, so the number of replicas is temporarily degraded. Once Replica 2' is fully rebuilt (shown as Replica 2''), it can accept write I/O again, and the number of available replicas returns to the expected level.

If the healthy replica is also damaged or disconnected during the replica recovery process, data loss is very likely.

Figure 3

For instance (Figure 3), if Replica 2 is abnormally disconnected, I/O is still written to Replica 1 as usual, and Replica 2 is restored from Replica 1. If Replica 1 also becomes inaccessible during the recovery process, due to hardware damage or other reasons, it cannot be restored because no healthy replica is available. In this scenario, changes to the data cannot be written to any replica, and because every replica is damaged or incomplete, the risk of data loss rises significantly.
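The window described above can be reproduced with a toy model. This is a sketch under stated assumptions, not AECP code: `Replica` and `write` are hypothetical names, each replica is modeled as a plain key/value store, and writes simply go to whichever replicas are online.

```python
class Replica:
    """Toy replica: a name, a key/value store, and an online flag."""

    def __init__(self, name, data=None):
        self.name = name
        self.data = dict(data or {})
        self.online = True


def write(replicas, key, value):
    """Write to every online replica and return how many copies were stored."""
    targets = [r for r in replicas if r.online]
    for r in targets:
        r.data[key] = value
    return len(targets)


replica1 = Replica("replica1", {"a": 1})
replica2 = Replica("replica2", {"a": 1})
replicas = [replica1, replica2]

# Replica 2 disconnects; without a temporary replica, new writes land on Replica 1 only.
replica2.online = False
copies_during_recovery = write(replicas, "b", 2)

# Replica 1 is then lost before recovery finishes, and Replica 2 later reconnects.
replica1.online = False
replica2.online = True

# Keys that existed only on the lost replica are not recoverable from Replica 2.
lost_keys = set(replica1.data) - set(replica2.data)
```

Here the write of `"b"` is stored only once, so when Replica 1 disappears, the reconnected Replica 2 has no record of it.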

How It Works

To prevent the degradation of replica numbers, AECP introduces the “temporary replica” design. This new strategy can keep the number of accessible replicas the same as expected during the replica recovery process. Even if the healthy replicas also become abnormal during this period, data can be restored through a specific mechanism (supporting complete restoration and partial restoration), greatly enhancing data security.

Definition of concepts

  • Healthy replica: A data replica that can provide complete read and write capabilities.
  • Failed replica: A data replica that is abnormal and cannot provide read and write capabilities. It can be restored with the help of temporary replicas.
  • Temporary replica: A replica that accepts newly written I/O during the replica recovery process but does not serve reads.
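The three definitions above amount to a small capability table. The enum below is only an illustrative encoding of that table; the names `ReplicaState`, `can_read`, and `can_write` are assumptions and do not come from AECP.

```python
from enum import Enum


class ReplicaState(Enum):
    """Replica kinds from the definitions above, valued by the I/O they can serve."""

    HEALTHY = ("read", "write")   # complete read and write capabilities
    FAILED = ()                   # abnormal: serves neither reads nor writes
    TEMPORARY = ("write",)        # accepts new writes, does not serve reads

    def can_read(self):
        return "read" in self.value

    def can_write(self):
        return "write" in self.value


# Only healthy replicas may serve a read; healthy and temporary replicas take writes.
read_capable = [s.name for s in ReplicaState if s.can_read()]
write_capable = [s.name for s in ReplicaState if s.can_write()]
```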

Temporary replica mechanism

  • When the number of replicas degrades and abnormal replicas need to be removed, temporary replicas are introduced to handle write requests and record newly written data during the data recovery process. This maintains the expected number of replicas: a complete copy of the data is formed by the temporary replica data together with the data from the healthy replicas.
  • During data recovery, the abnormal replicas are retained but marked as failed replicas. Once a new healthy replica has been successfully rebuilt, the corresponding failed replica and its associated temporary replica are removed from the system.
  • If additional faults occur during the data recovery, causing all replicas to become abnormal, the data on the temporary replica can be integrated with the failed replica once it becomes accessible again. This integration results in a healthy replica with complete data. However, this process requires manual intervention and follows a specific mechanism.
  • It’s important to note that the temporary replica strategy can only prevent the degradation of replica numbers caused by recoverable faults, such as temporary disconnections of replicas due to network issues or other reasons.

Examples

1. Replica recovery with at least one healthy replica

Under the two-replica and three-replica policies, when a single replica becomes abnormal, a temporary replica will be assigned to handle write operations for the new data. Simultaneously, a new replica is created based on the healthy replica.

Figure 4

For example (Figure 4), if Replica 2 becomes disconnected and inaccessible, the system marks it as a failed replica, initiates replica recovery, and allocates a temporary replica. All data newly written during the replica recovery process is synchronously written to Replica 1 and the temporary replica, so the replica count remains intact for the new data. Meanwhile, a new replica (Replica 2') is created by copying the data from Replica 1. Once the recovery process is complete, the rebuilt replica (Replica 2'') becomes a new healthy replica, and the failed replica and the temporary replica are deleted from the system.
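The sequence in Figure 4 can be walked through with a toy model. This is a minimal sketch, assuming each replica is a key/value store; the class `Volume` and its methods `disconnect` and `finish_recovery` are hypothetical names, and the rebuild is modeled as a single copy rather than a background process.

```python
class Volume:
    """Toy model of one piece of data under a two-replica policy."""

    def __init__(self):
        self.healthy = {"replica1": {}, "replica2": {}}
        self.failed = {}     # retained failed replicas, keyed by name
        self.temporary = {}  # write-only temporary replicas, keyed by the failed replica's name

    def write(self, key, value):
        """Write to all healthy and temporary replicas; return the number of copies."""
        targets = list(self.healthy.values()) + list(self.temporary.values())
        for store in targets:
            store[key] = value
        return len(targets)

    def disconnect(self, name):
        """Mark a replica as failed (but keep it) and allocate a temporary replica."""
        self.failed[name] = self.healthy.pop(name)
        self.temporary[name] = {}

    def finish_recovery(self, name, new_name):
        """Rebuild a new replica from a healthy one, then drop the failed and temporary replicas."""
        if not self.healthy:
            raise RuntimeError("no healthy replica to rebuild from")
        source = next(iter(self.healthy.values()))
        self.healthy[new_name] = dict(source)
        del self.failed[name]
        del self.temporary[name]


vol = Volume()
copies_before = vol.write("x", 1)
vol.disconnect("replica2")
copies_during = vol.write("y", 2)   # goes to Replica 1 and the temporary replica
vol.finish_recovery("replica2", "replica2_rebuilt")
copies_after = vol.write("z", 3)
```

In this model every write is stored twice before, during, and after recovery, which is the property the temporary replica is meant to preserve.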

2. Replica recovery with no healthy replica

If the last remaining healthy replica also becomes disconnected, the system can merge the failed replica with the temporary replica once the failed replica is reconnected (for example, after the host is rebooted or the network is re-established). This merge produces a complete replica. Note, however, that during this recovery period the VM still cannot respond to I/O requests.

Figure 5

For example (Figure 5), if Replica 2 experiences an abnormal disconnection, the system will automatically initiate data recovery and generate a temporary replica. New data will be written to both Replica 1 and the temporary replica, while Replica 2' is created based on the data from Replica 1.

If additional faults occur during the replica recovery, there may be no complete replica available for access, leaving data A completely inaccessible. However, once the failed replica (Replica 2) is reconnected, the system can integrate the data from Replica 2 with the temporary replica containing the incremental data, forming a replica with complete data (Replica 3). At this point, data A is accessible again and can accept read and write requests.

The system then reinitiates replica recovery based on the data from Replica 3, creating Replica 1'. After the recovery process is completed, data A once again has two healthy replicas (Replica 3 and the rebuilt Replica 1''). This entire process does not involve any degradation in the number of replicas.
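The merge in Figure 5 can be sketched as overlaying the temporary replica's incremental writes on the reconnected failed replica. This is an illustration under assumptions: `merge_failed_with_temporary` is a hypothetical name, data is modeled as a key/value store, and the sketch assumes the temporary replica captured every write made after Replica 2 disconnected.

```python
def merge_failed_with_temporary(failed_data, temporary_writes):
    """Overlay the writes recorded during recovery on the stale failed replica."""
    merged = dict(failed_data)
    merged.update(temporary_writes)  # newer writes win over the stale values
    return merged


# State of Replica 2 at the moment it disconnected.
replica2_failed = {"x": 1, "y": 1}
# Writes made afterwards, recorded on both Replica 1 and the temporary replica.
temporary = {"y": 2, "z": 3}

# Replica 1 is lost as well; later Replica 2 reconnects and is merged into Replica 3.
replica3 = merge_failed_with_temporary(replica2_failed, temporary)

# Recovery then restarts from Replica 3 to rebuild the second healthy replica.
replica1_rebuilt = dict(replica3)
```

The result matches what Replica 1 held before it was lost: the old value of `x`, the updated `y`, and the new `z`.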

Limitations and Impacts of Temporary Replica Strategy

Limitations

The temporary replica is mainly designed to address the degradation in replica number caused by short-term hardware disconnection. To cope with unrecoverable hardware failures such as disk damage and multiple hardware failures, it is recommended to use a higher-level replica strategy (e.g., a three-replica policy) for data protection.

Impacts

Temporary replicas occupy additional storage space, but this space is automatically reclaimed after replica recovery is completed.
