
Huawei All-Flash Reliability for Mission-Critical Business

2018-08-13

Disclaimer: The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy, position, products, and technologies of Huawei Technologies Co., Ltd. If you need to learn more about the products and technologies of Huawei Technologies Co., Ltd., please visit our product pages or contact us.

By Wang Jiaxin from Huawei

High-performance flash storage carries mission-critical business systems across many industries, so any failure hits enterprises hard. Figures released by Qualix Group show the impact of business interruption. In transportation, a one-minute stoppage results in average losses of 150,000 USD, and for banks the figure is 270,000 USD. The same one-minute stoppage costs a telecommunications company an average of 350,000 USD and a manufacturer 420,000 USD, while securities traders top the list at 450,000 USD.

Therefore, ensuring mission-critical business continuity is a top priority for all-flash storage systems. Designing reliability end to end is no easy task. Working up from media to systems and solutions, let’s examine how Huawei OceanStor Dorado all-flash storage provides high performance and reliability for customers.

Disk-level reliability

SSD reliability is measured by the mean time between failures (MTBF) and the annualized failure rate (AFR). The industry MTBF benchmark is between 2 and 2.5 million hours. Huawei raises the bar well beyond this, reaching an MTBF of 3 million hours on its homegrown disks.
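The two metrics are related. As a rough sketch, assuming a constant failure rate (an exponential failure model, which the article does not state), AFR can be estimated from MTBF as 1 - exp(-8760 / MTBF). The function name below is illustrative only.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Estimate AFR from MTBF, assuming a constant failure rate (an assumption)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# MTBF figures quoted in the article: industry benchmark and Huawei's claim.
for mtbf in (2_000_000, 2_500_000, 3_000_000):
    print(f"MTBF {mtbf:>9,} h -> estimated AFR {afr_from_mtbf(mtbf):.3%}")
```

Under this model, a longer MTBF translates directly into a lower estimated AFR, which is why the two figures are usually quoted together.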

How does Huawei accomplish this feat and extend the life of its SSDs? Huawei has maintained long-standing cooperation with its vendors, such as Samsung, Micron, and Toshiba, to ensure that components are manufactured according to Huawei’s solution design objectives. Another reason is the close cooperation between arrays and disks, which combines a series of reliability designs, such as optimized graphene dissipation technology (GDT), global wear leveling, and global anti-wear leveling.


1. Global wear leveling design

At the beginning of the SSD lifecycle, service loads are spread evenly across all SSDs to avoid overloading specific disks. This prevents some disks from sitting idle while others wear out and retire prematurely.


2. Global anti-wear leveling design (Huawei-patented)

At the end of the SSD lifecycle, when the wear on an SSD exceeds 80%, the anti-wear leveling mechanism deliberately staggers wear so that disks differ from one another by more than 2%, which prevents them from failing simultaneously. Disks can then be replaced gradually, which prolongs the service life of the system and leaves sufficient time for a system upgrade.
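The two phases can be illustrated with a small write-placement sketch. This is not Huawei's patented algorithm. The selection policy, the function names, and the use of integer per-mille wear units are all assumptions chosen to keep the example deterministic. Only the 80% threshold and the 2% gap come from the text.

```python
WEAR_THRESHOLD = 800  # 80% wear, expressed in per-mille (assumed unit)
MIN_GAP = 20          # 2% wear gap, expressed in per-mille


def choose_disk(wear: list[int]) -> int:
    """Return the index of the disk that should absorb the next write."""
    if max(wear) <= WEAR_THRESHOLD:
        # Early life: wear leveling, send the write to the least-worn disk.
        return min(range(len(wear)), key=wear.__getitem__)
    # Late life: anti-wear leveling, widen any gap that is not above MIN_GAP.
    order = sorted(range(len(wear)), key=wear.__getitem__, reverse=True)
    for more_worn, less_worn in zip(order, order[1:]):
        if wear[more_worn] - wear[less_worn] <= MIN_GAP:
            return more_worn
    return order[-1]  # all gaps are wide enough: use the least-worn disk


def simulate(wear: list[int], writes: int) -> list[int]:
    """Apply `writes` unit writes; each adds one per-mille of wear (hypothetical)."""
    wear = list(wear)
    for _ in range(writes):
        wear[choose_disk(wear)] += 1
    return wear


final = simulate([790, 790, 790], 300)
print(sorted(final, reverse=True))
```

In this toy model the disks stay within one unit of each other until the threshold is crossed, after which their wear levels spread out and remain at least 2% apart, so they would reach end of life at different times.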



3. System software optimization

First, at the algorithm layer, Huawei is the first vendor to commercialize the LDPC algorithm in SSDs. After years of optimization, Huawei now supports a 4 K ultra-long code algorithm, which brings the error correction granularity to twice that of other SSD providers in the industry.

Second, at the flash chip layer, the number of erase cycles is limited in an SSD. The service life of an SSD can be prolonged if the number of erase cycles can be increased through algorithms. Huawei’s innovative adaptive program & erase (APE) technology automatically controls the erase strength and frequency of flash chips based on the amount of read and write data. In this way, the number of erase cycles can be effectively extended without changing costs or media granules, prolonging the SSD service life.

Third, at the data protection layer, while the storage controller system has RAID protection, SSDs also support two-dimensional RAID groups with interleaved parity at the channel and CE (chip enable) levels, protecting data against chip-level failures. The disk RAID and system RAID groups work together to recover data automatically if multiple chips of a single disk are faulty. Once recovery is complete, the SSD is operational again.
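To see why parity in two dimensions helps, consider the following minimal sketch. It uses plain XOR parity over a small grid whose rows stand in for channels and whose columns stand in for CEs. The layout, chunk contents, and function names are assumptions for illustration, and the SSD's real parity scheme is not described in the article.

```python
from functools import reduce


def xor_chunks(*chunks: bytes) -> bytes:
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# Hypothetical layout: rows model channels, columns model CEs.
grid = [
    [b"A0", b"A1", b"A2"],
    [b"B0", b"B1", b"B2"],
    [b"C0", b"C1", b"C2"],
]
row_parity = [xor_chunks(*row) for row in grid]
col_parity = [xor_chunks(*col) for col in zip(*grid)]


def rebuild_from_column(damaged, col_parity, r, c):
    """Rebuild chunk (r, c) from its column parity and the surviving chunks."""
    survivors = [damaged[i][c] for i in range(len(damaged)) if i != r]
    return xor_chunks(col_parity[c], *survivors)


# Two chunks in the same row fail: row parity alone cannot rebuild both,
# but each lost chunk is still recoverable through its own column.
damaged = [row[:] for row in grid]
damaged[0][0] = damaged[0][1] = None
print(rebuild_from_column(damaged, col_parity, 0, 0))
print(rebuild_from_column(damaged, col_parity, 0, 1))
```

With a single parity dimension, two failures in the same group are unrecoverable. The second dimension gives each chunk an independent recovery path.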

System-level reliability

Achieving reliability is complex. In addition to the hardware structure design and software fault tolerance mechanism, the storage system must tolerate physical and logical faults and support quick recovery. This will prevent data loss caused by system faults and ensure businesses continue running stably.


1. Magnitude 9.0 earthquake-resistant design

The irregular seismic waves and intensified shaking caused by huge earthquakes affect the stability and service life of electronic equipment. Huawei OceanStor Dorado all-flash storage has passed the magnitude 9 earthquake-resistance test run by China Telecommunication Technology Labs (TTL). This makes Huawei the only company to have done so and to satisfy TTL’s IT standards. Once an exception is detected, the system can also diagnose and rectify the fault quickly enough to prevent business interruption.


2. Tolerance of three-disk failures

Disk reconstruction time increases linearly with disk capacity. Traditional RAID 5 or RAID 6 technologies need 5 hours to reconstruct 1 TB of data, and 80 hours for 16 TB. If further disks become faulty during reconstruction beyond what the RAID level tolerates (one failed disk in total for RAID 5, two for RAID 6), data is lost and business is severely disrupted. Traditional RAID technologies therefore cannot ensure system reliability at these capacities.

Huawei’s innovative RAID-TP software technology is based on the erasure coding (EC) algorithm. It supports one, two, or three parity dimensions and can therefore tolerate one to three simultaneous disk failures. This means that in the case of three disk failures, the system will not suffer from data loss or service interruption. Currently, only products from Huawei, NetApp, and Nimble can tolerate the simultaneous failure of three disks.

Although NetApp and Nimble can tolerate simultaneous failures of three disks, they both use traditional RAID architecture with fixed data disks and hot spare disks. For these companies, hot spare disk reconstruction for 1 TB of data takes 5 hours. OceanStor Dorado employs a global virtualization system able to reconstruct the data in just 30 minutes, fulfilling the requirements of ultra-large capacity profiles.
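The reconstruction figures quoted above can be checked with simple arithmetic. The sketch below assumes rebuild time scales linearly with capacity in both cases. The article states this for traditional RAID, but extending the 30-minute-per-TB figure to larger capacities is an assumption, and the function name is illustrative.

```python
TRADITIONAL_HOURS_PER_TB = 5.0  # hot-spare rebuild rate quoted in the article
DORADO_HOURS_PER_TB = 0.5       # 30 minutes per TB quoted in the article


def rebuild_hours(capacity_tb: float, hours_per_tb: float) -> float:
    """Rebuild time under an assumed linear scaling with capacity."""
    return capacity_tb * hours_per_tb


for capacity in (1, 16):
    traditional = rebuild_hours(capacity, TRADITIONAL_HOURS_PER_TB)
    virtualized = rebuild_hours(capacity, DORADO_HOURS_PER_TB)
    print(f"{capacity:>2} TB: traditional {traditional:g} h, "
          f"global virtualization {virtualized:g} h")
```

The shorter the rebuild, the smaller the window in which additional disk failures can coincide with an ongoing reconstruction.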


3. End-to-end data integrity protection and tolerance of silent data corruption

As data travels through multiple components, channels, and complex software layers during access, an error at any point can compromise data integrity. Such errors are not reported when they occur and are detected only in subsequent data checks or accesses. This phenomenon is called silent data corruption.

Often overlooked, silent data corruption has a major impact on services, such as databases, that require absolute data integrity. The data integrity solution launched by Huawei, Emulex, and Oracle changes the traditional model, in which hosts and storage systems protect data independently, by implementing end-to-end protection across applications, hosts, storage systems, and disks. As a result, this solution prevents silent data corruption for mission-critical businesses and eliminates potential downtime.
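The general idea of end-to-end protection can be sketched as a checksum that is attached where the data originates and re-verified at every later hop. The example below uses CRC-32 from Python's standard `zlib` module purely for illustration. The actual protection format used by the joint solution is not specified in the article, and the function names are hypothetical.

```python
import zlib


def protect(payload: bytes) -> tuple[bytes, int]:
    """Attach a checksum at the point where the data originates."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, checksum: int) -> bool:
    """Re-check the payload at a later hop (host, array, or disk)."""
    return zlib.crc32(payload) == checksum


block, tag = protect(b"account=42;balance=1000")

# Simulate a single bit flipping somewhere along the I/O path.
corrupted = bytes([block[0] ^ 0x01]) + block[1:]

print(verify(block, tag))      # intact data passes the check
print(verify(corrupted, tag))  # the flipped bit is detected
```

Because the tag travels with the data, corruption introduced at any layer is caught at the next verification point instead of surfacing much later.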


4. Intelligent prefetch

When a disk detects block faults or even severe die failures, the storage system receives failure reports from SSDs and uses redundant data in RAID groups to rapidly reconstruct and repair damaged data, reducing data loss risks and ensuring system reliability.

Huawei all-flash storage systems can accurately query internal data, such as SSD data, and use innovative prediction algorithms to monitor and predict the service life of disks. The personnel in charge of customer businesses will be told that their disks need replacing before the disks become faulty or one month before the service life is exhausted.
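As a simplified illustration of service-life prediction, the sketch below extrapolates remaining life linearly from two wear readings and raises a notice once fewer than 30 days remain. The linear model, the sample format, and the function names are assumptions. The article does not describe Huawei's actual prediction algorithms.

```python
import math


def days_until_worn_out(samples: list[tuple[float, float]]) -> float:
    """Linear extrapolation from (day, wear_fraction) samples (assumed model)."""
    (first_day, first_wear), (last_day, last_wear) = samples[0], samples[-1]
    rate = (last_wear - first_wear) / (last_day - first_day)  # wear per day
    if rate <= 0:
        return math.inf  # no measurable wear trend
    return (1.0 - last_wear) / rate


def needs_replacement_notice(samples, notice_days: float = 30.0) -> bool:
    """True once the predicted remaining life drops to the notice window."""
    return days_until_worn_out(samples) <= notice_days


healthy = [(0, 0.90), (100, 0.95)]   # about 100 days of life left
near_end = [(0, 0.90), (100, 0.98)]  # about 25 days of life left
print(needs_replacement_notice(healthy), needs_replacement_notice(near_end))
```

A production system would draw on far richer telemetry than two data points, but the principle is the same: estimate the wear trend and warn well before the limit is reached.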

Solution-level reliability

Huawei OceanStor Dorado all-flash storage supports multiple data protection technologies, such as snapshot, clone, remote replication, and active-active data protection. This allows it to implement data protection solutions from local or intra-city to remote disaster recovery. This solution provides high availability and non-disruptive storage data services for customers, preventing data loss caused by logical or physical disasters.


1. Lossless snapshot

Traditional COW-based (copy-on-write) snapshot technology must read the original data and copy it to a new location before the new data can be written to the original location. Each such write therefore involves one read, two writes, and one metadata update, and this extra data migration degrades system performance.

Huawei OceanStor Dorado all-flash storage implements lossless snapshot using ROW (redirect-on-write). When a snapshot is activated, new data is written to a new location and the pointer in the mapping table is modified. Only one data write and one metadata update are involved, so the number of data operations is only 1/3 of that for COW-based snapshot. In addition, no extra data migration is required when the ROW snapshot is activated, so production performance is not compromised.
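The difference in I/O cost can be made concrete with two toy volume models that count operations for a single overwrite after a snapshot. The classes below are illustrative sketches rather than OceanStor internals, and all names are hypothetical.

```python
from collections import Counter


class CowVolume:
    """Copy-on-write: preserve the old block, then overwrite in place."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # live data by logical block address
        self.snapshot_area = {}     # old data preserved for the snapshot
        self.ops = Counter()

    def write(self, lba, data):
        if lba not in self.snapshot_area:
            old = self.blocks[lba]              # read the original data
            self.ops["read"] += 1
            self.snapshot_area[lba] = old       # write 1: copy it aside
            self.ops["write"] += 1
        self.blocks[lba] = data                 # write 2: new data in place
        self.ops["write"] += 1
        self.ops["metadata_update"] += 1

    def read_snapshot(self, lba):
        return self.snapshot_area.get(lba, self.blocks[lba])


class RowVolume:
    """Redirect-on-write: write to a new location and repoint the mapping."""

    def __init__(self, blocks):
        self.store = []             # append-only physical locations
        self.live_map = {}
        for lba, data in blocks.items():
            self.store.append(data)
            self.live_map[lba] = len(self.store) - 1
        self.snapshot_map = dict(self.live_map)  # snapshot = frozen pointers
        self.ops = Counter()

    def write(self, lba, data):
        self.store.append(data)                     # single data write
        self.ops["write"] += 1
        self.live_map[lba] = len(self.store) - 1    # repoint the mapping table
        self.ops["metadata_update"] += 1

    def read(self, lba):
        return self.store[self.live_map[lba]]

    def read_snapshot(self, lba):
        return self.store[self.snapshot_map[lba]]


cow, row = CowVolume({0: b"old"}), RowVolume({0: b"old"})
cow.write(0, b"new")
row.write(0, b"new")
print(dict(cow.ops), dict(row.ops))
```

Both models keep the snapshot view intact, but the ROW model reaches that result with one data operation where the COW model needs three.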

In addition, OceanStor Dorado storage supports second-level periodic snapshot, which is superior to the minute- or hour-level snapshots used by competitors’ all-flash storage. OceanStor Dorado snapshot provides users with a more intensive and powerful continuous data management (CDM) solution, enabling real-time data protection.


2. Gateway-free active-active architecture

Huawei OceanStor Dorado storage adopts a gateway-free active-active layout, removing the gateways on both sides. This immediately reduces customer procurement costs and removes potential points of failure, which lowers latency and improves reliability and performance. In addition, the overall networking is greatly simplified, with the number of deployment steps halved, thereby shortening the delivery cycle.

Active-active architecture

HyperMetro is deployed on two arrays in an active-active profile. Data on the active-active LUNs at both ends is synchronized in real time, and both ends process read and write I/Os from application servers to provide the servers with parallel active-active access. Should either array encounter a fault, services are seamlessly switched to the other end without interrupting service access, achieving RPO = 0 and RTO ≈ 0.
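The behavior described above can be modeled as synchronous mirroring across two arrays with transparent failover. The following is a conceptual sketch only. It omits arbitration, locking, and resynchronization after recovery, and none of the class or method names correspond to HyperMetro's actual interfaces.

```python
class Array:
    """A storage array holding one mirrored LUN (hypothetical model)."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.online = True


class ActiveActiveLun:
    """Both arrays accept I/O; writes are mirrored before being acknowledged."""

    def __init__(self, site_a: Array, site_b: Array):
        self.arrays = [site_a, site_b]

    def write(self, lba: int, payload: bytes) -> int:
        targets = [a for a in self.arrays if a.online]
        if not targets:
            raise RuntimeError("no array available")
        for array in targets:
            array.data[lba] = payload  # synchronous copy to every online site
        return len(targets)            # number of copies committed

    def read(self, lba: int, preferred: int = 0) -> bytes:
        # Try the preferred (local) array first, then fail over to the peer.
        order = self.arrays[preferred:] + self.arrays[:preferred]
        for array in order:
            if array.online:
                return array.data[lba]
        raise RuntimeError("no array available")


site_a, site_b = Array("A"), Array("B")
lun = ActiveActiveLun(site_a, site_b)
lun.write(7, b"order-1001")
site_a.online = False            # simulate a fault at one site
print(lun.read(7, preferred=0))  # served by the surviving array
```

Because every acknowledged write already exists on both arrays, losing one site loses no data (RPO = 0), and reads continue from the surviving site almost immediately.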

In remote data protection scenarios, the active-active solution can be upgraded to a geo-redundant data center solution without extra gateways and without business interruption. This allows it to deliver 99.9999% reliability for customers. Third-party sites can even use Huawei OceanStor converged storage systems to provide cost-effective DR solutions for remote DR centers that require only ordinary response times.
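To put the 99.9999% figure in perspective, the sketch below converts it to an annual downtime budget. It assumes the figure is interpreted as availability over a 365-day year, which is an assumption since the article does not define the metric.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_seconds_per_year(availability: float) -> float:
    """Annual downtime implied by an availability fraction (assumed metric)."""
    return (1.0 - availability) * SECONDS_PER_YEAR


for label, value in (("99.99%", 0.9999), ("99.999%", 0.99999), ("99.9999%", 0.999999)):
    print(f"{label:>9}: {downtime_seconds_per_year(value):,.1f} s per year")
```

Under that interpretation, six nines corresponds to roughly half a minute of downtime per year.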

Summary

All-flash storage focuses on performance and efficiency. Like a giant container ship on the sea, all-flash storage continuously pursues higher speeds and a larger capacity. With continuous and stable operations, Huawei OceanStor Dorado all-flash systems can deliver 99.9999% reliability, providing the public with a lightning-fast, rock-solid platform.

