Huawei DME Opens Up a New Era of Intelligent O&M for Enterprise Storage

Jun 29, 2021

Let's start with a hypothetical. What happens when you equip industrial equipment with flash arrays? You may be surprised to hear that flash arrays are fallible: through wear and tear, the circuit boards in flash controllers can slowly corrode or even dissolve. This may not even be the worst-case scenario. If the arrays do not support intelligent failure prediction, users often notice a problem only after a system outage or data loss has already occurred, when it is too late to act.

You might think that these are only extreme circumstances or low-probability events. However, the growing trends of Internet-based and mobile business have caused a shift to 24/7 sales support in many industries. Any failure at the data infrastructure level can be devastating to the survival and growth of an enterprise. This is one of the main reasons why AIOps has soared in the data infrastructure field in recent years.

1. We Do Not Need Hindsight

AIOps was first proposed by Gartner in 2016. It uses big data analytics and machine learning algorithms to automatically analyze and study massive volumes of O&M data. By doing so, enterprises can implement exception detection, bottleneck hotspot analytics, and multi-dimensional relationship analytics. AIOps helps IT O&M personnel identify system exceptions and quickly locate root causes, and proactively predicts system running risks and generates alarms. It delivers continuous insights and optimization of IT infrastructure and services, making it a must-have when working with sensitive and digital workloads.

With the in-depth convergence and broad application of cloud computing, big data, and artificial intelligence technologies, AIOps has rapidly penetrated various segments, especially the data infrastructure layer. In fact, intelligent O&M is one of the most competitive areas in the storage domain. For example, HPE acquired Nimble Storage to build its cloud-based intelligent O&M platform, InfoSight. At present, HPE is gradually introducing InfoSight across its entire storage and server line, covering Primera and Nimble storage as well as ProLiant servers, Synergy composable infrastructure, and Apollo systems. InfoSight is the unsung hero of the HPE portfolio, providing global visibility at the infrastructure level, predictive analytics, and intelligent recommendations.

According to Gartner's research, AIOps tools had become essential to enterprises in every industry by 2020, with a global adoption rate of 50%. In addition, IDC's report "IDC FutureScape: GLOBAL AI Market 2021 forecast - China" predicted that AIOps will become the new normal for IT operations by 2024, and that at least half of all large enterprises globally will use automated O&M solutions for key IT systems and service management processes.

Advancements in smart and digital transformation solutions mean that unattended data centers, unattended factories, and unmanned driving are no longer a fantasy. At the macro level, the boom in New Infrastructure in 2020 created an urgent need to build a new data infrastructure centered on "data + intelligence". Such infrastructure enables intelligent upgrades across a range of industries and services in support of high-quality economic development. At the micro level, intelligent O&M facilitates stable and efficient operations and predicts potential risks and faults in data infrastructure, so that the entire system can manage, repair, and optimize itself. This improves the running efficiency of data infrastructure and accelerates development, all at a low cost.

2. Intelligence – The Next Storage Trend

Big data is a new means of production, and AI is a new productive force. Together, these new factors of production will profoundly change and even transform industrial processes from top to bottom. The COVID-19 pandemic has reaffirmed the importance of data insights and accelerated the development of the data intelligence market.

Industry users have specific requirements. In finance, digitalization and intelligence are key factors in whether enterprises can compete and succeed in the future. In smart finance scenarios, data infrastructure must be readily available, efficient, and stable, and must provide services that can be released and iterated in an agile way. Intelligent O&M and automation of data infrastructure make this possible. When improving and optimizing its data infrastructure, CITIC Bank used the Huawei DME data management engine to introduce secure and controllable automation and intelligence capabilities for converged management, service change, and unified O&M scenarios. With DME, the bank has significantly improved O&M efficiency and service agility.

In the electric power industry, grid companies place higher requirements on storage management during service integration and digitalization. For historical reasons, a single company often runs systems and devices from multiple brands and series, which complicates management and introduces O&M risks. To unify data storage management and improve efficiency, power companies must adopt a centralized, standardized management solution that provides automated, intelligent storage O&M.

To sum up, from the perspective of market and technology development trends as well as customer requirements, the in-depth integration of storage and AI technologies is imperative to reduce storage O&M costs, complexity, and risks. Currently, smart storage startups are mushrooming, with intelligent storage O&M acting as the catalyst for change.

At Huawei, we are pleased to see that the entire storage industry is paying more attention to intelligent O&M. In December 2020, at the DOIT Annual Awards, DOIT established the 2020 AI Technology Innovation Award, which was awarded to the Huawei DME data management and O&M automation solution. The solution was recognized for introducing a three-layer AI architecture for automatic resource provisioning, intelligent O&M, and intelligent data mobility on data center storage networks, facilitating simplified O&M and agile service innovation. Undoubtedly, intelligence will become an important indicator for future storage.

3. Huawei Storage Leads AIOps Innovation

The storage industry must invest heavily in long-term R&D of core technologies and related capabilities. Huawei is an end-to-end IT infrastructure solution provider, and intelligent O&M is an indispensable capability in IT infrastructure. Huawei has accumulated many years of experience and successful practices in this field, and has been named a Leader in the Gartner Magic Quadrant for primary storage multiple times. In terms of capabilities, Huawei is on par with other major players in the storage field. As the role of intelligence in storage becomes more prominent, the Leader position attests to Huawei's expertise in intelligent O&M.

Huawei DME is an intelligent data infrastructure O&M platform that has earned plaudits from around the world. This platform implements three-layer AI collaboration with built-in device AI and cloud AI (eService), over a unified management interface, automatic closed-loop mechanism, and open APIs. This architecture enables data storage lifecycle management and O&M automation from start to finish, covering planning, construction, O&M, and optimization. DME is designed to simplify storage management and improve data center operation efficiency.

Based on the four paradigms of storage AI, Huawei storage continues to lead the industry in AIOps innovation.

Paradigm I

Workload fingerprints are used for data placement, data mobility, hardware expansion, and service expansion, among others, to improve service identification accuracy to over 80% and the resource usage efficiency by 30%. The policy-based SLA change function allocates better storage resources for mission-critical services and ensures stable service running.

Paradigm II

A knowledge graph can locate recoverable units within minutes. It is used in scenarios such as full-stack topology visualization, VM-to-storage E2E analysis, alarm correlation and root cause analysis, and interference analysis between neighboring modules. Alarm correlation and one-click impact scope evaluation improve analytics efficiency, which helps predict performance exceptions, conduct KPI correlation analytics, and recommend corrective solutions. In 2019, on the network of a transportation group company, Huawei eService detected that the peak write latency of an OceanStor 5500 V5 had reached 190.66 ms, and automatically recommended enabling SmartTier on the system to shorten the latency.
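To make alarm correlation and impact scope evaluation more concrete, here is a minimal sketch in Python. It is not Huawei's implementation: the topology, the component names, and the functions `root_cause_candidates` and `impact_scope` are hypothetical, and the rule used (an alarming component is a root cause candidate if nothing it depends on is also alarming) is a deliberately simple stand-in for knowledge-graph reasoning.

```python
from typing import Dict, List, Set

# Hypothetical VM-to-storage topology: each component lists what it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "vm-01": ["host-01"],
    "vm-02": ["host-01"],
    "host-01": ["lun-07"],
    "lun-07": ["pool-a"],
    "pool-a": ["disk-3"],
    "disk-3": [],
}


def root_cause_candidates(depends_on: Dict[str, List[str]],
                          alarming: Set[str]) -> List[str]:
    """Return alarming components with no alarming (transitive) dependency."""

    def has_alarming_dependency(node: str, seen: Set[str]) -> bool:
        for dep in depends_on.get(node, []):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in alarming or has_alarming_dependency(dep, seen):
                return True
        return False

    return sorted(n for n in alarming if not has_alarming_dependency(n, set()))


def impact_scope(depends_on: Dict[str, List[str]], component: str) -> List[str]:
    """Return every component that directly or indirectly depends on `component`."""
    impacted: Set[str] = set()
    changed = True
    while changed:  # repeat until no new dependents are discovered
        changed = False
        for node, deps in depends_on.items():
            if node not in impacted and any(
                d == component or d in impacted for d in deps
            ):
                impacted.add(node)
                changed = True
    return sorted(impacted)


alarms = {"vm-01", "vm-02", "lun-07", "pool-a"}
candidates = root_cause_candidates(DEPENDS_ON, alarms)
scope = impact_scope(DEPENDS_ON, "pool-a")
```

In this toy topology, four alarms collapse to a single candidate (the storage pool), and the impact scope lists the LUN, host, and VMs that sit above it.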

Paradigm III

Media faults (disk or partial disk faults, memory failures, and optical module faults) can be predicted 14 days in advance, with an identification rate of over 80% and a misreporting rate below 0.1%. To predict HDD and SSD faults, data generated within the 14 days before a fault occurs is selected as samples, and large-sample learning methods are used to analyze data distribution, growth trends, feature correlation, and feature importance. Training and testing on 110,000 disks show that this approach performs significantly better than random sampling. In June 2018, a Huawei data center predicted three faulty disks within a month, before they impacted services. In addition, the AI algorithms help identify memory fault modes and predict memory faults, and memory bank isolation is used to implement self-recovery and pre-warning of memory faults, thereby reducing system breakdowns.
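The sample selection step described above can be illustrated with a short Python sketch. It only shows how daily disk records might be labeled using the 14-day window from the text. The record layout, the feature name `reallocated_sectors`, and the function `label_samples` are assumptions made for illustration, and the model training itself is not shown.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

WINDOW_DAYS = 14  # from the text: data within 14 days before the fault

Record = Tuple[date, Dict[str, float]]  # (collection day, feature values)


def label_samples(records: List[Record],
                  failure_day: Optional[date]) -> List[Tuple[Dict[str, float], int]]:
    """Label each daily record 1 if it falls in the pre-failure window, else 0.

    Records collected on or after the failure day are dropped, because they
    are not available at prediction time. Healthy disks (failure_day is None)
    yield only negative samples.
    """
    samples = []
    for day, features in records:
        if failure_day is None:
            samples.append((features, 0))
            continue
        days_before = (failure_day - day).days
        if days_before <= 0:
            continue  # on or after the failure: not usable for prediction
        label = 1 if days_before <= WINDOW_DAYS else 0
        samples.append((features, label))
    return samples


# Hypothetical records for one disk that failed on 30 June 2021.
records = [
    (date(2021, 6, 1), {"reallocated_sectors": 0.0}),   # 29 days before
    (date(2021, 6, 16), {"reallocated_sectors": 4.0}),  # 14 days before
    (date(2021, 6, 29), {"reallocated_sectors": 9.0}),  # 1 day before
    (date(2021, 6, 30), {"reallocated_sectors": 9.0}),  # failure day, dropped
]
labels = [label for _, label in label_samples(records, date(2021, 6, 30))]
healthy_labels = [label for _, label in label_samples(records, None)]
```

Labeling by a fixed pre-failure window concentrates the positive class on the period in which degradation signals are most likely to be visible, which is one plausible reading of why the text contrasts it with random sampling.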

Paradigm IV

Performance and capacity prediction accuracy is higher than 85%. This paradigm is mainly used for performance bottleneck prediction, capacity bottleneck prediction, KPI exception analytics, and network subhealth analytics. For instance, storage resources (front-end ports, controllers, caches, storage pools, LUNs, and disks) are monitored in real time to detect performance exceptions. The change trends of performance indicators (IOPS, latency, and I/O bandwidth) are learned in advance, helping customers take proactive measures to reduce the performance accident rate. Capacity trend prediction is implemented with ensemble learning: multiple time series-based prediction algorithms are weighted to capture the characteristics of different time series for accurate prediction. A convolutional neural network (CNN) is used to predict service performance trends. The model is generated through large-scale training on the performance data of LUNs across the entire Huawei network over the most recent three-month period. Another innovation is performance tidal analytics, which displays service patterns in heat maps. This technology helped a municipal government select the optimal upgrade window, when service volumes were at their lowest, avoiding adverse impacts on services during peak hours.
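The weighted combination of time-series prediction algorithms used for capacity trends can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the DME algorithm: the two base forecasters (`naive_last` and `linear_trend`), the inverse-error weighting scheme, and the sample capacity figures are all hypothetical choices.

```python
from typing import Callable, List, Sequence, Tuple

Forecaster = Callable[[Sequence[float], int], List[float]]


def naive_last(history: Sequence[float], horizon: int) -> List[float]:
    """Repeat the last observed value."""
    return [history[-1]] * horizon


def linear_trend(history: Sequence[float], horizon: int) -> List[float]:
    """Extrapolate an ordinary least-squares line fitted to the history."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    denom = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history)) / denom
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n + h) for h in range(horizon)]


def mean_abs_error(pred: Sequence[float], actual: Sequence[float]) -> float:
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)


def ensemble_forecast(history: Sequence[float], horizon: int, holdout: int,
                      models: Sequence[Forecaster]) -> Tuple[List[float], List[float]]:
    """Weight each model by its inverse error on the last `holdout` points."""
    train, valid = history[:-holdout], history[-holdout:]
    errors = [mean_abs_error(m(train, holdout), valid) for m in models]
    inverse = [1.0 / (e + 1e-9) for e in errors]  # small constant avoids division by zero
    total = sum(inverse)
    weights = [w / total for w in inverse]
    forecasts = [m(history, horizon) for m in models]  # refit on the full history
    combined = [
        sum(w * f[h] for w, f in zip(weights, forecasts)) for h in range(horizon)
    ]
    return combined, weights


# Hypothetical used capacity (TB) per week, growing steadily.
used_tb = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
forecast, weights = ensemble_forecast(
    used_tb, horizon=2, holdout=2, models=[naive_last, linear_trend]
)
```

On steadily growing data, the trend model has almost no holdout error and therefore receives nearly all of the weight. On flat or noisy data the weights would shift toward the simpler model, which is the point of combining several algorithms.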

The above examples show that AI technologies have played a critical role in optimizing storage infrastructure, configuring storage resources, and improving performance, as well as automation and intelligence. With more industry giants exploring new depths of storage intelligence, it has become essential for enterprises to build an intelligent data infrastructure for intelligent transformation.