Reliability Analysis on Video Surveillance Storage Systems

Video surveillance technologies are now penetrating every aspect of modern life. With the huge surge in security protection demands and the wide application of high-speed networks, video surveillance systems are being transformed from analog surveillance to digital and network-based surveillance. Conventional storage devices like tapes, digital video recorders (DVRs), and network video recorders (NVRs) cannot cope with the new challenges arising in this process, and professional storage arrays are called on to provide larger storage capacity, higher performance, and improved reliability.

A storage array allows multiple physical disks to form a virtual group, which is called a Redundant Array of Independent Disks (RAID) group. Data to be written to a RAID group is first sliced into blocks and then distributed across the member disks of the RAID group, and verification checks are performed routinely. In addition, these member disks can work in tandem to handle a data read or write request, significantly reducing response latency and helping to protect data security. A common practice is to combine two storage arrays into a highly reliable storage array (dual-controller storage array), which enables mirrored channels to achieve service failover and failback in real time.
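The slicing and verification described above can be illustrated with a minimal sketch. The article does not say which RAID level or verification scheme is used, so this example assumes simple XOR parity over one stripe; the block size and the function names (`write_stripe`, `rebuild_block`) are hypothetical and chosen only for illustration.

```python
from functools import reduce

BLOCK_SIZE = 4  # bytes per block; deliberately tiny (assumed value)


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data, data_disks):
    """Slice data into one block per data disk and append one parity block."""
    padded = data.ljust(BLOCK_SIZE * data_disks, b"\x00")
    blocks = [padded[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE] for i in range(data_disks)]
    return blocks + [xor_blocks(blocks)]


def rebuild_block(stripe, lost_index):
    """Recover the block of one failed member disk from the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Three data disks plus one parity disk (assumed layout).
stripe = write_stripe(b"VIDEOFRAME01", data_disks=3)
recovered = rebuild_block(stripe, lost_index=1)  # pretend disk 1 failed
```

Because XOR is its own inverse, any single missing block in the stripe equals the XOR of all the others, which is why one failed member disk can be tolerated in this assumed layout.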

Huawei's approach to achieving robust storage reliability

As an enterprise with the most solid R&D capability in China, Huawei has invested billions into research and development of storage equipment. By inheriting the advantages of conventional storage arrays, Huawei's storage arrays are further optimized for video surveillance systems and Safe City projects, incorporating a variety of cutting-edge technologies in equipment, software, and disks to achieve zero loss of video data.

Huawei storage adopts a highly reliable design, precise manufacturing processes, and impressive protection and maintenance utilities, providing end-to-end assurance to ensure stability in system operations. Disks are the most critical component of a storage array and are used for processing data read and write requests. In a large-scale Safe City project, the video surveillance system usually accommodates a huge number of surveillance cameras. Surveillance data is retained in the system for long periods, imposing even higher requirements on the disks. To better serve Safe City agendas everywhere, Huawei thoroughly analyzed project requirements and then optimized its disks and other relevant components according to the general deployment scenarios involved. The optimized storage arrays achieve a disk failure rate far lower than the industry average of around 10%. The mechanisms Huawei applies to its reliability design, disk optimizations, and component enhancements are generalized in Figure 1.

Huawei storage reliability concepts

Figure 1: Storage reliability

A disk, or a hard disk drive (HDD), is a mechanical device with magnetic media used to store information. Its major components include platters, magnetic heads, an actuator motor, a spindle motor, interfaces, and a printed circuit board assembly (PCBA).

A disk is not a sealed device but connects to the external environment through a small breather hole (the hole is usually covered with a breather filter to prevent dust from entering). The fastest disks spin at 15,000 rpm.

The airflow generated by the platters' high-speed rotation lifts the heads so that they float above the platter surface at a flying height of 10 nm. The actuator motor swings the heads back and forth along the platter radius and precisely places the heads at an optimum position for data reads and writes (see Figure 2).

Parts of the disk

Figure 2: Disk components

Major factors causing disk faults:

  • The flying height of heads is only 10 nm; therefore, vibrations may cause the heads to collide with platters
  • If dust particles enter the disk enclosure and collide with the heads or platters, the platter surfaces may become scratched
  • Temperature, humidity, environmental contamination, and working altitude are other factors that may result in disk failures

Table 1: Factors leading to disk faults

Factor | Impact | Source
Vibration | Interferes with head rotation, prevents precise data locating, causes heads to collide with platters. | Vibration
Dust particle | Causes heads to collide with platters, scratches platter surface. | Air dust, smoke
Temperature | Accelerates head aging. | Excessively high temperature
Humidity | Corrodes disk circuit boards. | Excessively high humidity
Air pollution | Volatile materials like sulfur | Sulfur
Voltage | Interferes with head rotation, prevents precise data locating, causes heads to collide with platters. | Altitude

Technologies improving disk reliability

Quality control in precision manufacturing and inspection procedures

Based on its accrued experience in manufacturing electronic products and a detailed analysis of disk characteristics, Huawei has worked out a series of stringent disk manufacturing and inspection procedures, and uses advanced technologies such as environmental stress screening (ESS), aging testing, and built-in testing to locate disks at risk of failure. Figure 3 illustrates Huawei's disk inspection procedure.

Disk inspection flow

Figure 3: Huawei disk inspection procedure

Based on these precise procedures, up to 99.99% of potential disk faults are discovered before a fault event actually occurs, greatly reducing the disk failure rate.

Anti-vibration design

Storage systems are often deployed in complex and even adverse environments. Vibration is a particular concern for systems using mechanical disks: the high-speed rotation of the disks itself generates vibration, which displaces heads and platters and may in turn shorten disk service life. To counter this intrinsic weakness, Huawei continuously looks for ways to optimize disk trays and overall system structures to eliminate the impact of vibration and improve disk reliability.

Isolation of disk vibrations

By monitoring and analyzing disk vibrations, Huawei has redesigned its disk tray structures to improve their anti-vibration performance. This innovative disk tray design can efficiently absorb the vibration energy generated by both disk rotation and the external environment (see Figure 4).

Disk tray design features

Figure 4: Disk tray design

Insulation of fan vibrations

Horizontal and vertical cushioning materials are placed around the fans, brackets, and the enclosure, removing 40% or more of the vibration from fans (see Figure 5).

Horizontal and vertical vibration dampening devices

Figure 5: Fan vibration dampening

High-density design of enclosure and disk guide rail

The double-deck structure improves the strength of enclosures by more than 20%, and the guide rails are made of die-cast zinc alloy, helping keep disks in place and the enclosure intact in extremely adverse environmental conditions. The shock-resistant design of the material helps reduce the vibration transmitted from enclosures to disks, extending service life and improving system stability. Huawei equipment has passed magnitude-9 earthquake resistance testing, another testament to the level of engineering and precision in manufacturing that goes into each piece of Huawei storage.

Enclosure and disk rail design features

Figure 6: Enclosures and disk rails

Real-time monitoring of operating environments

The vibration sensor integrated on the midplane monitors vibration and shock in the system environment in real time. If a vibration or shock value exceeds the threshold, the network management system immediately notifies the administrator to take corrective measures, thereby preemptively avoiding system shutdown (see Figure 7).

Vibration sensors monitor vibration and shock

Figure 7: Vibration sensors

Non-disruptive disk diagnosis

Storage systems typically accommodate a huge number of disks and carry mission-critical data, so fault diagnosis and rectification must be transparent to services. The challenge is to detect disk faults promptly and handle them efficiently while minimizing the impact on service continuity. Huawei storage arrays employ a disk fault management mechanism that integrates proactive prevention, partial isolation, and fast recovery to improve disk fault tolerance.

Preemptive prevention

Disk fault diagnosis and warning provided by Disk Health Analyzer (DHA): Disks are an important component of a storage array and become more prone to failure as service time increases, especially after long-term operation. The Huawei proprietary DHA sets up a disk fault model to monitor key indicators. It adopts advanced algorithms to assess disk health and predict disk faults. The DHA can dynamically switch between assessment algorithms according to the working status of disks and the service models of the storage system. Currently, Huawei employs the following three DHA modeling technologies to determine disk status:

  • Routine SMART information collection (single factor): uses disk SMART information for analysis and prediction.
  • Weibull distribution probability analysis (multiple factors): uses disk Power-On Hours (POH) indicators for analysis and prediction.
  • Fuzzy comprehensive evaluation (multiple factors): uses comprehensive indicators including disk SMART information and I/O models for analysis and prediction.
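To make the Weibull-based approach more concrete, here is a minimal sketch of how a failure probability could be derived from Power-On Hours. It is not the DHA algorithm, which the article does not disclose. The shape and scale parameters below are invented for illustration, and the function names are hypothetical.

```python
import math


def weibull_cdf(hours, shape, scale):
    """Probability that a disk has failed by `hours` of operation."""
    return 1.0 - math.exp(-((hours / scale) ** shape))


def conditional_failure_probability(poh, horizon, shape, scale):
    """Probability of failing within `horizon` more hours, given that the
    disk has already survived `poh` power-on hours."""
    survive_now = 1.0 - weibull_cdf(poh, shape, scale)
    survive_later = 1.0 - weibull_cdf(poh + horizon, shape, scale)
    return 1.0 - survive_later / survive_now


# Assumed parameters: shape > 1 models wear-out (risk grows with age).
SHAPE, SCALE = 1.5, 50_000.0
young = conditional_failure_probability(10_000, 720, SHAPE, SCALE)
old = conditional_failure_probability(40_000, 720, SHAPE, SCALE)
```

With a shape parameter above 1, the older disk gets a higher 30-day (720-hour) failure probability than the younger one, which matches the observation that disks become more failure-prone with service time. In practice the parameters would have to be fitted to observed failure data.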

Background bad sector scanning

Bad sectors are a common cause of disk faults; however, a disk does not proactively report bad sectors to the host, so they are detected only during data reads and writes. Huawei provides a background bad sector scanning function that proactively detects and recovers bad sectors without affecting services or disk reliability, thereby reducing the risk of data loss. This function also allows users to set scan policies for specific scanning periods based on the physical parameters of disks and according to site particulars.
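The idea of a policy-driven background scan can be sketched as follows. This is a simplified model and not Huawei's implementation: the disk is a Python list in which `None` stands for an unreadable sector, recovery is assumed to come from a mirror copy, and the scan window and all names are hypothetical.

```python
from datetime import time


def in_scan_window(now, start, end):
    """True if `now` lies in the scan window, including windows that wrap
    past midnight (for example 22:00 to 04:00)."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def scan_batch(disk, mirror, first_sector, batch_size):
    """Read one batch of sectors and rewrite unreadable ones from the mirror."""
    repaired = []
    last = min(first_sector + batch_size, len(disk))
    for sector in range(first_sector, last):
        if disk[sector] is None:           # None models a failed read
            disk[sector] = mirror[sector]  # recover from the redundant copy
            repaired.append(sector)
    return repaired


disk = [b"a", None, b"c", None]
mirror = [b"a", b"b", b"c", b"d"]
window_open = in_scan_window(time(2, 30), start=time(1, 0), end=time(5, 0))
repaired = scan_batch(disk, mirror, first_sector=0, batch_size=4) if window_open else []
```

Scanning in small batches inside an off-peak window is one way to keep the scan from competing with foreground video writes; the batch size and window would be tuned per site.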

Online disk diagnosis

Huawei introduces online disk diagnosis into the disk troubleshooting process to analyze the causes and impact scope of disk faults. If a disk is reported as faulty, it is not removed from service immediately. First, fault diagnosis is run to check whether the disk is really faulty. In addition, the mechanism offers a variety of measures to restore disks and bad sectors, greatly improving service continuity and system reliability.

Partial reconstruction

Huawei provides a function called partial reconstruction to minimize the impact caused by a disk removal. If a disk is removed without going through the proper removal procedures and checks, data increments are recorded until the disk is reinserted into the system, and only the incremental data is written to the disk after it is reinserted. This function significantly reduces the amount of data to be reconstructed, shortens the reconstruction period, and minimizes risk of data loss.
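The increment tracking described above can be sketched with a two-copy mirror. This is an assumed, simplified model (the article does not describe the actual data structures): writes that arrive while the secondary disk is absent are recorded in a set of dirty block numbers, and only those blocks are copied on reinsertion. The class and method names are hypothetical.

```python
class MirroredPair:
    """Two-copy mirror whose secondary disk can be pulled and reinserted."""

    def __init__(self, block_count):
        self.primary = [b""] * block_count
        self.secondary = [b""] * block_count
        self.secondary_online = True
        self.dirty = set()  # blocks written while the secondary was absent

    def write(self, block, data):
        self.primary[block] = data
        if self.secondary_online:
            self.secondary[block] = data
        else:
            self.dirty.add(block)  # record the increment for later

    def remove_secondary(self):
        self.secondary_online = False

    def reinsert_secondary(self):
        """Copy only the recorded increments; return the number of blocks copied."""
        for block in self.dirty:
            self.secondary[block] = self.primary[block]
        copied = len(self.dirty)
        self.dirty.clear()
        self.secondary_online = True
        return copied


pair = MirroredPair(block_count=1000)
pair.write(0, b"before")
pair.remove_secondary()
pair.write(7, b"during")
pair.write(8, b"during")
copied = pair.reinsert_secondary()
```

In this example only 2 of 1000 blocks are copied after reinsertion, instead of a full rebuild of the disk.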

Thin reconstruction

Generally, storage space assigned to users will not be used up immediately; that is to say, most storage space is in the idle state. Therefore, if a disk permanently fails, it is a waste of time and resources to reconstruct the unused space. Thin reconstruction only reconstructs valid user data, thereby reducing the reconstruction time and the impact on reliability.
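A minimal sketch of the idea, under stated assumptions: the array is assumed to know which blocks hold valid user data (modeled here as a set of block numbers), and the surviving redundancy is modeled as a plain list. The function name `thin_reconstruct` and the data layout are hypothetical, not taken from Huawei's implementation.

```python
def thin_reconstruct(source, allocated_blocks, block_count):
    """Rebuild a replacement disk, copying only blocks that hold user data."""
    replacement = [None] * block_count  # None marks space left untouched
    for block in sorted(allocated_blocks):
        replacement[block] = source[block]
    return replacement


block_count = 10
source = [bytes([i]) for i in range(block_count)]  # surviving redundant copy
allocated = {0, 1, 5}                              # blocks actually in use
replacement = thin_reconstruct(source, allocated, block_count)
copied = sum(block is not None for block in replacement)
```

Here 3 of 10 blocks are rebuilt. The saving grows with the share of idle space, which is why the approach suits arrays where most assigned capacity is still unused.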