In the second half of 2021, one of the world's most influential Internet companies suffered a system-wide, global service collapse. As a result, its suite of key social apps was forced offline for more than 24 hours, leaving billions of users stranded.
Given the severity of the event and the unprecedented length of the downtime, many drew their own conclusions about the cause, with attention falling on remote management and control. In the aftermath, enterprises operating in similar fields rushed to reassess the robustness of their own systems, eager to avoid a similar failure. But the incident was not an isolated case, and the key to avoiding such situations is simple to state: enhance network Operations and Maintenance (O&M) capabilities.
As technologies such as virtualization and cloud computing become more widely used, the network complexity of data centers, the production centers of enterprises, increases exponentially. The number of O&M engineers cannot grow at the same pace, so traditional O&M methods, which rely on the experience and expertise of team members, can no longer meet the O&M demands of the digital era. As a result, the entire industry is moving toward intelligent O&M models to address today's operational concerns.
One example of such a model is Huawei's iMaster NCE-FabricInsight, a Data Center Network (DCN) analyzer capable of intelligently analyzing thousands of disordered network and service data records, to introduce an all-new intelligent network O&M experience.
Traditional O&M practices, using the securities industry as an example, require three engineers to perform routine inspections for one hour before the market opens each day.
Yet because inspections are conducted in a non-service period, before the market opens, they cannot reflect the real-time network status when services are actually live. In addition, the collected data is fragmented, so associations between metrics cannot be determined. And because analysis is manual and relies on fixed thresholds, sudden increases or decreases in the data are hard to identify.
In contrast, with intelligent O&M, iMaster NCE-FabricInsight collects multi-source data based on telemetry, introduces knowledge graph technology to network O&M, and associates massive network device metrics to comprehensively evaluate the health of multi-cloud and multi-vendor networks, all in real time. In addition, it deploys Artificial Intelligence (AI) technologies to proactively detect abnormal changes in network behavior, predict risks relating to future traffic and capacity, automatically generate evaluation reports, and send warnings by email.
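To make the contrast between fixed thresholds and detecting abnormal changes concrete, here is a minimal sketch of a dynamic baseline in Python. It is an illustration only, not the product's algorithm: the function name `detect_anomalies`, its parameters, and the sample traffic values are all assumptions made for this example. Each sample is compared against the mean and standard deviation of the samples just before it, so a sudden jump is flagged even though it never crosses a static limit.

```python
from statistics import mean, stdev

def detect_anomalies(samples, window=5, k=3.0, min_delta=5.0):
    """Return indices of samples that deviate sharply from a trailing baseline.

    Hypothetical helper for illustration. `window` must be at least 2,
    because the standard deviation needs two or more points.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        baseline = mean(history)
        # Tolerance scales with recent variability; min_delta keeps a
        # perfectly flat history from flagging tiny fluctuations.
        tolerance = max(k * stdev(history), min_delta)
        if abs(samples[i] - baseline) > tolerance:
            anomalies.append(i)
    return anomalies

# Made-up link utilization samples (Mbit/s). A fixed alarm threshold of
# 200 would stay silent, but the jump to 180 is far outside the baseline.
traffic = [100, 102, 98, 101, 99, 100, 180, 101, 99, 100]
print(detect_anomalies(traffic))  # [6]
```

The same check catches sudden drops as well as spikes, because it tests the absolute deviation from the baseline rather than an upper limit.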
With traditional O&M practices, locating a fault takes 76 minutes on average and depends on the expert experience of O&M staff, who must check the entire network bit by bit to confirm that it is fault-free.
With intelligent O&M, iMaster NCE-FabricInsight builds on more than 30 years of O&M experience, as well as the knowledge gained from addressing thousands of customer network issues. On this basis, it summarizes typical faults and performs continuous drills. It operates a self-learning model based on knowledge graphs, accumulating and growing the fault knowledge base. Currently, iMaster NCE-FabricInsight can detect 75 types of typical faults within just 1 minute and automatically locate root causes within 3 minutes, and it can also infer unknown faults. In addition, iMaster NCE-FabricInsight can interwork with iMaster NCE-Fabric, Huawei's network automation and intelligence platform, to intelligently analyze the impact of faults and recommend troubleshooting plans, enabling typical faults to be efficiently rectified in under 5 minutes.
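As a rough illustration of how a graph of relationships can narrow many alarms down to a root cause, the sketch below uses a plain dependency map. It is a toy stand-in, not the knowledge-graph model described above: the function `root_causes`, the entity names, and the rule it applies (an alarming entity whose direct dependencies are all healthy is a root-cause candidate) are assumptions made for this example.

```python
def root_causes(depends_on, alarms):
    """Return alarming entities whose direct dependencies are all healthy.

    Hypothetical helper: `depends_on` maps an entity to the entities it
    relies on. Only direct dependencies are inspected, so a healthy
    intermediate entity breaks the chain.
    """
    alarming = set(alarms)
    return sorted(
        entity
        for entity in alarming
        if not any(dep in alarming for dep in depends_on.get(entity, ()))
    )

# Made-up topology: an application runs on a VM, which sits behind a
# switch port, which relies on an optical module.
depends_on = {
    "app-payments": ["vm-12"],
    "vm-12": ["leaf1:eth1/3"],
    "leaf1:eth1/3": ["optic-7"],
    "app-search": ["vm-20"],
}
alarms = ["app-payments", "vm-12", "leaf1:eth1/3", "optic-7"]
print(root_causes(depends_on, alarms))  # ['optic-7']
```

Four alarms collapse into a single candidate, which is the kind of reduction that makes it possible to locate a fault in minutes rather than by checking the whole network.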
With traditional O&M models, device configurations and entries need to be manually compared before and after a change, which is inefficient. Service connectivity is spot-checked with ping and traceroute before and after changes, because it is difficult, verging on impossible, to manually traverse the thousands of applications on the network.
With intelligent O&M, iMaster NCE-FabricInsight automatically compares configurations, entries, performance, and topologies before and after changes, delivering a tenfold improvement in verification efficiency. It also establishes a network-wide forwarding model and uses a formal verification algorithm to verify intents. As a result, the connectivity of tens of thousands of services can be verified in seconds. The process is similar to navigation on a Geographic Information System (GIS) map, where users simply enter a starting point, a destination, and any necessary intermediate points along the way. Available routes, along with those not advised, are automatically displayed, ensuring zero change risks.
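The map-navigation analogy can be sketched in a few lines of Python. This is a simplified illustration, not the formal verification algorithm the product uses: a breadth-first search over a small directed graph stands in for the network-wide forwarding model, and the names `diff_links`, `reachable`, and `verify_intent`, along with the node names, are assumptions made for this example.

```python
from collections import deque

def diff_links(before, after):
    """Return (added, removed) sets of (node, next_hop) links between snapshots."""
    old = {(node, hop) for node, hops in before.items() for hop in hops}
    new = {(node, hop) for node, hops in after.items() for hop in hops}
    return new - old, old - new

def reachable(links, src, dst):
    """Breadth-first search over directed links; return one path or None."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in links.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def verify_intent(links, src, dst, waypoints=()):
    """Check that src reaches dst through the waypoints, in order."""
    stops = [src, *waypoints, dst]
    full_path = [src]
    for start, end in zip(stops, stops[1:]):
        leg = reachable(links, start, end)
        if leg is None:
            return None  # intent violated on this leg
        full_path.extend(leg[1:])
    return full_path

# Made-up snapshots: the change accidentally removes the firewall's next hop.
before = {"web": ["leaf1"], "leaf1": ["spine1"], "spine1": ["fw"],
          "fw": ["leaf2"], "leaf2": ["db"]}
after = {**before, "fw": []}

print(diff_links(before, after))  # (set(), {('fw', 'leaf2')})
print(verify_intent(before, "web", "db", waypoints=["fw"]))
print(verify_intent(after, "web", "db", waypoints=["fw"]))  # None
```

Running every stated intent against the post-change snapshot is what replaces the manual ping and traceroute spot checks: the comparison shows what changed, and the intent check shows whether it matters.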
With traditional O&M, applications and their associated data are isolated from the network system, so multiple departments need several days of cooperation to locate exceptions, especially faults that cause poor Quality of Experience (QoE). It's clear that Information Technology (IT) O&M departments need to build an integrated O&M system.
Intelligent O&M provides just that. iMaster NCE-FabricInsight provides over 100 open data services and can roll out scenario-specific applications in mere minutes with drag-and-drop operations. This shortens the integration period with third-party systems from months to weeks and overcomes the issues involved in integrating multiple sets of O&M data.
Huawei and Netis — service and network performance management specialists — work together to build service-level intelligent O&M using iMaster NCE-FabricInsight and CrossFlow Business Performance Center (BPC), providing integrated service and network O&M capabilities for underlay and overlay networks, and applying intelligent O&M across diverse industries, including the finance and government sectors.
In today's digital economy, stable networks are the foundation for business success. Huawei iMaster NCE-FabricInsight, one of the key products within Huawei's CloudFabric 3.0 Hyper-Converged DCN Solution, has already seen successful commercial use in diverse industries, from finance to education, as well as in the government sector and by large enterprises. It provides all-scenario intelligent network O&M based on AI, implementing flexible service-oriented integration with partners and industry customers alike, improving the network O&M experience and helping customers achieve business success.
Disclaimer: The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy, position, products, and technologies of Huawei Technologies Co., Ltd. If you need to learn more about the products and technologies of Huawei Technologies Co., Ltd., please visit our website at e.huawei.com or contact us.