Since beginning an internal cloud transformation in 2014, the Huawei private cloud has grown exponentially in scale, and its corporate users are spread across the globe. The range of available services has diversified and now includes office productivity, eCommerce, technology development, and product testing. To support this growth, the number of network devices in cloud data centers has increased by 50 percent every year, to more than 20,000 network devices across over 10 data centers.
Every year, changes to Huawei’s internal network policies involve approximately 500,000 lines of software code, in addition to concentrated bursts of work driven by data center migrations and device replacements. More than 500 operational changes are also applied each year for version or patch upgrades, device replacement, configuration, optimization, and emergency drills. How does a network O&M team of only 10 people handle such an intense workload? And, in the era of cloud computing, how does the company maintain a 99.999 percent level of IT system availability?
Cloud Data Center Network O&M
Modern cloud data center networks have developed to incorporate four primary characteristics: service-oriented, virtualized, automated, and intelligent. Networks are provisioned as services that allow users to scale resources up and down on demand. With automated network deployment, it is straightforward to create and modify different clouds based on policies tailored to specific tenant requirements. The key technologies that separate the underlay from the overlay and make this level of automation possible are Software-Defined Networking (SDN) and Network Functions Virtualization (NFV). Resources are managed centrally and network intelligence is visualized. Though these four characteristics make cloud services fast, flexible, and elastic, their development is unbalanced, in part because user-oriented features often take precedence over O&M-oriented ones. As a result, SDN/NFV networks face great O&M challenges in the course of the rapid transformation to cloud computing:
Network device numbers are growing and O&M manpower is limited
Traditional O&M has low automation and relies on skilled personnel
Networks are increasingly complex, posing difficulties for availability
Limited breadth and depth of network alarms cause excessive alerts
Closed-loop O&M capability safeguards availability, improves efficiency and service levels, and reduces customer costs. It is a core competitive strength for Huawei in the market for cloud data center networks, and it points to a future in which network O&M is increasingly automated, intelligent, and unattended.
Building Intelligent O&M Platforms
Intelligent network O&M platforms have four targets:
Locate faults within seconds
Isolate faults and self-heal in minutes
Predict and optimize network quality
Automate full O&M lifecycle
Huawei’s intelligent network O&M platform consists of three integrated sub-systems: network monitoring, intelligent analysis, and automation. These three components form a closed-loop O&M system that achieves the goal of unattended operation, as illustrated in the framework below.
Intelligent network O&M platforms have the following features:
Traditional network O&M platforms provide many tools that operate independently of each other. In contrast, the Huawei intelligent network platform serves as an open platform that enables closed-loop automation, from adding and monitoring network devices to information collection and analysis, alarm reporting, and fault self-healing.
The number of network elements in cloud data centers is increasing exponentially, producing an enormous jump in the amount of information that must be monitored. For instance, Huawei IT cloud data centers had fewer than 40,000 network monitoring metrics in 2014; by 2017, this number had increased to over 10 million, bringing great challenges to the monitoring and collection system.
The depth, coverage, and frequency of network monitoring are also greatly improved, the collected information is more accurate, and the calculated responses are more effective. Past monitoring systems focused only on key information, whereas today the goal is to collect as much information as possible. Experience shows that the more information collected, the more effective the monitoring and analysis. For example, when the network traffic sampling frequency is changed from once every 300 seconds to once every 10 seconds, the peak data rate jumps from 1.29 Gbit/s to 8.3 Gbit/s, more than six times the original value. The benefit is that many previously hidden problems are discovered thanks to the extra effort.
Monitoring data is no longer isolated, and data from multiple collection systems may be integrated for correlation analysis. In the past, Simple Network Management Protocol (SNMP) data and logs were managed separately. In contrast, all monitoring data is now consolidated onto a single data platform for analysis across multiple dimensions by time and device.
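The idea of consolidating heterogeneous monitoring data onto one platform can be sketched as follows. This is a minimal illustration, not Huawei's implementation: the record shape, device names, and the `correlate` helper are all assumptions, but they show how normalizing SNMP samples and syslog events to a common (timestamp, device) key enables correlation across sources.

```python
from datetime import datetime, timedelta

# Hypothetical normalized records: every source is mapped to
# (timestamp, device, source, payload) before landing on the platform.
snmp_samples = [
    {"ts": datetime(2017, 5, 1, 10, 0, 5), "device": "core-sw-01",
     "source": "snmp", "payload": {"if_in_errors": 1204}},
]
device_logs = [
    {"ts": datetime(2017, 5, 1, 10, 0, 8), "device": "core-sw-01",
     "source": "syslog", "payload": {"msg": "IF_DOWN: GigabitEthernet1/0/1"}},
]

def correlate(records, device, center, window=timedelta(seconds=30)):
    """Return all events for one device within +/- window of a timestamp,
    regardless of which collection system produced them."""
    return sorted(
        (r for r in records
         if r["device"] == device and abs(r["ts"] - center) <= window),
        key=lambda r: r["ts"],
    )

# The SNMP error-counter spike and the syslog IF_DOWN message for the
# same switch appear together in a single time-ordered view.
events = correlate(snmp_samples + device_logs, "core-sw-01",
                   datetime(2017, 5, 1, 10, 0, 6))
```

Once both sources share one schema, queries by time and device become trivial, which is what makes the cross-dimensional analysis described above possible.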
Intelligent analysis platform
Fault prediction: The function of traditional network management systems is to monitor resources. The question is whether potential faults can be identified before they happen. Many Internet enterprises can now predict disk faults, with prediction accuracy reaching 90 percent and higher. Areas for improvement include predicting the failure modes of consumable optical modules and gaining insight into ‘unpredictable’ service surges.
Correlation analysis: When tracking fewer than 40,000 metrics, about 40 alarms were generated per day. With the number of metrics now exceeding 10 million, on the order of 10,000 alarms are generated per day. Correlation analysis has proven to be the method of choice for tracking and managing these growing data sets.
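One common correlation technique is to bucket alarms that arrive close together in time, on the assumption that a burst of alarms usually has a single root cause. The sketch below is a deliberately naive illustration of that idea, with invented device names and a fixed time window; production systems also correlate by topology and alarm type.

```python
def group_incidents(alarms, window=30):
    """Naively bucket alarms that arrive within `window` seconds of the
    first alarm in the current bucket; each bucket is one candidate
    incident for an operator to investigate."""
    incidents = []
    for ts, device, msg in sorted(alarms):
        if incidents and ts - incidents[-1][0][0] <= window:
            incidents[-1].append((ts, device, msg))
        else:
            incidents.append([(ts, device, msg)])
    return incidents

# Hypothetical alarm stream: (seconds offset, device, message).
alarms = [
    (0,   "spine-01", "BGP peer down"),
    (2,   "leaf-07",  "uplink unreachable"),
    (3,   "leaf-12",  "uplink unreachable"),
    (120, "leaf-03",  "fan speed high"),
]

# Four raw alarms collapse into two incidents: the spine failure with
# its dependent leaf alarms, and the unrelated fan alarm.
incidents = group_incidents(alarms)
```

Even this crude grouping cuts the number of items an operator must triage; real correlation engines refine the buckets using the network topology.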
Fault analysis: The prerequisite for a self-healing network is automated fault analysis, and Big Data tools are making these automated processes more effective. Using the layer-2 network loop as an example, Huawei once relied on expert system support to locate faults manually: when faults occurred, analysis tools were used to log in to the failed device and collect information. Because the tools could not recognize changes in network architecture and topology, the process was complicated, inefficient, and less than accurate. Today, by collecting API information from all devices, network-based tools can conduct complex statistical analysis over the layer-2 network loop. Using this technique, faults are located in near real-time.
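At its core, detecting a layer-2 loop from collected topology data is cycle detection on a link graph. The sketch below is a simplified stand-in for the network-based analysis described above, assuming the per-device API data has already been reduced to a list of device-name pairs (the function name and topology are illustrative only).

```python
def find_l2_loop(links):
    """Detect a cycle in an undirected link graph built from per-device
    API data; each link is a pair of device names. A cycle in the
    layer-2 topology indicates a potential forwarding loop."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    visited = set()

    def dfs(node, parent):
        visited.add(node)
        for nbr in adj[node]:
            if nbr == parent:
                continue  # skip the edge we arrived on
            if nbr in visited or dfs(nbr, node):
                return True  # reached an already-seen node: cycle
        return False

    return any(node not in visited and dfs(node, None) for node in adj)

# Three switches cabled in a triangle form a layer-2 loop;
# a simple chain does not.
loop = find_l2_loop([("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "sw1")])
chain = find_l2_loop([("sw1", "sw2"), ("sw2", "sw3")])
```

Because the graph is rebuilt from freshly collected API data on every run, the analysis automatically tracks topology changes, which is precisely what the older device-by-device tooling could not do.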
Service analysis: Cloud resource pools are deployed across multiple data centers. By using intelligent analysis on service and application processes, the traffic models of all Virtual Machines (VMs) will be revealed. The goal is to physically cluster application resources in order to conserve network resources. Service analysis capabilities include resource scheduling, security policy recommendations, and application-association, service-impact, and fault analysis.
Huawei O&M Platform Practices
Network automation covers 22 O&M scenarios, including the policy configuration activities (addition, deletion, and modification) that frequently incur high labor costs, as well as health checks, power-off maintenance, and transition-to-production acceptance. Internally, over Huawei’s corporate intranet, firewall policies are adjusted more than 150,000 times every year. Fulfilling this workload using traditional O&M methods would require the company to direct all technical labor resources to this single task. Following the deployment of the automated O&M platform, large-scale interventions are no longer required at Huawei, and all policies are realized consistently, rationally, and in full compliance.
Automating operations across large numbers of devices is very different in practice from running traditional scripts. Automation involves device configuration plans that account for multiple scenarios to resolve known inefficiencies between programs and devices. Using rules for automatic program decoupling, non-blocking sockets, and thread optimization and control, configuration provisioning of 10,000 devices can be finished in 20 minutes.
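The concurrency pattern behind that kind of throughput can be sketched with a bounded worker pool. This is an assumption-laden illustration, not Huawei's tooling: `push_config` is a placeholder for the real per-device push (in practice NETCONF or SSH with scenario-specific templates), and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def push_config(device):
    """Placeholder for a non-blocking configuration push to one device;
    a real implementation would open a NETCONF/SSH session and apply
    the scenario-specific configuration plan."""
    return device, "ok"

def provision(devices, max_workers=200):
    """Push configuration to many devices in parallel. A bounded worker
    pool keeps socket and thread usage under control instead of opening
    one connection per device at once."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(push_config, d) for d in devices]
        for future in as_completed(futures):
            device, status = future.result()
            results[device] = status
    return results

# Provision a (simulated) fleet of 1,000 switches concurrently.
results = provision([f"switch-{i:05d}" for i in range(1000)])
```

The bounded pool is the key design choice: it converts a serial, script-style loop into controlled parallelism without exhausting sockets or threads, which is what makes fleet-scale provisioning finish in minutes rather than days.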
Relying on the native capabilities of Cacti, the open-source, Web-based network monitoring and graphing tool, a single server can monitor 300,000 metrics in five minutes, so roughly 30 servers would be needed to monitor nearly 10 million metrics in the same period. Huawei optimized the Cacti data storage and alarm algorithms to enable each server to monitor 2 million metrics; as a result, only five servers are needed to monitor 10 million metrics. In addition, Huawei designed a loosely coupled master-slave architecture to share the load of monitoring data collection. Further, Huawei ensures the consistency and centralized distribution of data templates by deploying MySQL databases in cluster mode. The result is a scalable tool architecture, with data instrumentation that can be displayed and queried on a Web-based console. Backend data collection, SQL server maintenance, and data storage are performed on separate servers.
The Huawei network log system is built to collect device logs on the live network in real time. The system monitors keywords, reports alarms, and preprocesses 15 million log records every day. The application transforms the logs into structured data by extracting information such as time, type, level, and keywords. The readability of the log files is enhanced by importing device and network information in the Configuration Management Database (CMDB).
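Extracting structured fields from raw log lines is typically done with pattern matching. The sketch below assumes a hypothetical syslog line format (the `%%TYPE/LEVEL/KEYWORD:` shape and the sample line are invented for illustration); the real platform additionally joins in CMDB attributes such as site, role, and owner, keyed on the device name.

```python
import re

# Hypothetical device log format:
#   <date> <time> <device> %%<type>/<level>/<keyword>: <message>
LOG_RE = re.compile(
    r"^(?P<time>\S+ \S+) (?P<device>\S+) "
    r"%%(?P<type>\w+)/(?P<level>\d)/(?P<keyword>\w+):(?P<msg>.*)$"
)

def parse_log(line):
    """Turn one raw log line into a structured record with time, device,
    type, level, and keyword fields; return None if it does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None

record = parse_log(
    "2017-05-01 10:00:08 core-sw-01 "
    "%%IFNET/3/LINK_DOWN: Interface GE1/0/1 changed to DOWN"
)
```

Once every line carries explicit time, type, level, and keyword fields, keyword monitoring and alarm reporting reduce to simple filters over structured records instead of free-text searches.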
Intelligent network analysis
The number of actionable alarms generated by the Huawei cloud data center network has decreased from 10 per day in 2014 to 0.5 per day in 2017. This 20-fold improvement comes from continually optimizing alarm thresholds based on Big Data analysis of historical alarms, combined with methods that reduce invalid alarms, including alarm filtering, deduplication, and flapping control.
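Deduplication and flapping control can be illustrated with a small reducer. This is a simplified sketch under stated assumptions (invented device names, a fixed window, and a simple repeat-count rule), not the production algorithm: repeated firings of the same alarm within a window collapse into one "flapping" summary instead of a stream of raw alerts.

```python
from collections import defaultdict

def reduce_alarms(events, window=60, flap_threshold=3):
    """Collapse duplicate alarms per (device, alarm) key. If the same
    alarm fires `flap_threshold` or more times inside `window` seconds,
    emit a single 'flapping' summary; otherwise emit it as 'raised'."""
    buckets = defaultdict(list)
    for ts, device, alarm in events:
        buckets[(device, alarm)].append(ts)
    actionable = []
    for (device, alarm), times in buckets.items():
        times.sort()
        if len(times) >= flap_threshold and times[-1] - times[0] <= window:
            actionable.append((device, alarm, "flapping", len(times)))
        else:
            actionable.append((device, alarm, "raised", len(times)))
    return actionable

# Hypothetical raw stream: (seconds offset, device, alarm keyword).
raw = [(0, "leaf-07", "IF_DOWN"), (5, "leaf-07", "IF_DOWN"),
       (9, "leaf-07", "IF_DOWN"), (30, "spine-01", "BGP_DOWN")]

# Three IF_DOWN repeats collapse into one flapping summary, so four raw
# events become two actionable alarms.
actionable = reduce_alarms(raw)
```

The same principle, tuned with thresholds learned from historical alarm analysis, is what turns tens of thousands of raw events into a handful of alarms an operator actually needs to act on.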
Statistics show the annual failure rate for optical modules to be about 2 percent — the highest component loss in the data center network. Packet losses suffered by faulty optical modules incur a severe negative impact on services.
To solve this problem, Huawei closely monitors the metrics that affect the operating status of all optical components, for which 80,000 pieces of run-time information are collected each day. By combining machine learning with time-series analysis, Huawei IT staff continually improve the detection algorithms that predict component failure. Historical snapshots are used as samples for further optimization.
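A minimal stand-in for this kind of time-series analysis is a trailing-window anomaly score. The sketch below is an assumption: the readings are invented received-power values, and a rolling z-score is far simpler than the machine-learning models mentioned above, but it shows the underlying idea of flagging deviations from a stable baseline before the module fails outright.

```python
import statistics

def anomaly_flags(samples, window=5, z_threshold=3.0):
    """Flag each point whose value deviates from the trailing-window
    mean by more than z_threshold standard deviations -- a minimal
    stand-in for the time-series models used to spot optical-module
    degradation early."""
    flags = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mean = statistics.mean(history)
        sd = statistics.pstdev(history) or 1e-9  # avoid divide-by-zero
        flags.append(abs(samples[i] - mean) / sd > z_threshold)
    return flags

# Hypothetical received optical power readings (dBm); the sudden drop
# in the last sample stands out against the stable baseline.
rx_power = [-2.0, -2.1, -1.9, -2.0, -2.1, -2.0, -2.05, -6.5]
flags = anomaly_flags(rx_power)
```

In production, historical snapshots of failed modules serve as labeled samples for tuning such detectors, trading the fixed z-score threshold for learned models.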
Cloud networks have introduced a new age of network O&M. Huawei continues to research O&M applications for intelligent networks by developing automated smart analysis scenarios for closed-loop systems. Based on optimized prediction routines, Huawei network O&M systems are built to locate faults within seconds, and to isolate and self-heal those faults within minutes. Most importantly, for customers who adopt the intelligent network O&M platform, the automation system is built to operate across every contingency throughout the solution lifecycle.