Cloud Brain Platform Simplifies IT O&M
The cloud brain represents a new application of Artificial Intelligence (AI) technologies for IT Operations and Maintenance (O&M). Highly available, high-performance information systems have long been the most important research direction in the O&M field. Recent years have seen frequent calls for intelligent O&M applications that can quickly locate the root causes of faults, predict capacity risks, and properly configure resources. The emergence of cloud computing has improved the ability of O&M activities to resolve pressing problems in enterprise information system management.
• Scenario 1: Fast Fault Locating
When faults occur, centralized alarm systems generate large numbers of alarm messages that are sent to O&M engineers through various channels, resulting in flooded screens and annoyed users.
Once a fault alarm is received, O&M engineers process the messages one by one, access the corresponding server, and search the run logs. With a 30-minute target to locate and resolve the fault, O&M engineers face mounting pressure with each passing minute. While business personnel track the process and demand quicker troubleshooting, O&M engineers must work through hundreds or even thousands of scattered alarms.
• Scenario 2: Predict Capacity Risks
Business departments often need to perform O&M activities at unpredictable times. When required, the departments can request IT O&M services anytime and anywhere. Even with an extremely tight deadline, the IT department must deliver reliably.
Upon receiving a new request, the O&M team will quickly complete system checks in accordance with the manual of Standard Operating Procedures (SOPs). However, even the O&M director cannot be 100 percent certain that the expanded system can guarantee operation. Both the business and technical personnel strive for a timely and successful resolution to the business activity assurance process. Despite careful preparation, everything becomes meaningless if the system breaks down.
When can we enjoy peace of mind? When will the risks be completely eliminated?
AI is the cloud brain’s core. As defined by Gartner, AIOps stands for Algorithmic IT Operations: platforms that add machine learning and other algorithmic capabilities to O&M applications. By leveraging existing O&M data (such as logs and other monitoring and application information), machine learning can solve problems that traditional automatic O&M systems cannot, improve the predictive accuracy and stability of the system, reduce IT costs, and improve product competitiveness, all of which lays the foundation for the next phase of automated O&M.
Common AIOps application scenarios include quality assurance, cost management, and improvements in efficiency. In one example, China Pacific Insurance (CPIC) and Huawei collaborated on alarm work order convergence and service trend prediction scenarios.
• AI-Based Alarm Work Order Convergence — ‘Inside’
Work order convergence includes offline model training modules and online work order convergence modules. Offline model training modules perform data collection and preprocessing, feature selection, and model training in sequence; and then work order convergence modules complete work order classification, information extraction, clustering, and root cause analysis.
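The clustering step of the online module can be illustrated with a small sketch. It is not the CPIC/Huawei implementation: it assumes a simple greedy rule (alarms close in time with similar wording belong to one work order), and the names Alarm, converge, window, and threshold are hypothetical. Classification, information extraction, and root cause analysis are not covered.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    timestamp: float  # seconds since an arbitrary epoch (assumed unit)
    message: str


def jaccard(a: set, b: set) -> float:
    """Token-set similarity between two alarm messages."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def converge(alarms, window=300.0, threshold=0.5):
    """Greedily merge alarms into clusters; one cluster stands for one work order.

    An alarm joins the first cluster that was last updated within `window`
    seconds and whose first message is at least `threshold` similar.
    Otherwise it opens a new cluster. Both parameters are illustrative.
    """
    clusters = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        tokens = set(alarm.message.lower().split())
        for cluster in clusters:
            recent = alarm.timestamp - cluster["last_seen"] <= window
            if recent and jaccard(tokens, cluster["tokens"]) >= threshold:
                cluster["alarms"].append(alarm)
                cluster["last_seen"] = alarm.timestamp
                break
        else:
            clusters.append(
                {"tokens": tokens, "alarms": [alarm], "last_seen": alarm.timestamp}
            )
    return clusters


def convergence_rate(alarms, clusters):
    """Share of alarms that no longer need their own work order."""
    return 1.0 - len(clusters) / len(alarms)
```

In this sketch, four alarms that collapse into three clusters give a convergence rate of 0.25. A production system would replace the token similarity with features learned during offline training.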
In recent years, Huawei has been applying AI technologies to reduce costs and improve efficiency, and has gained extensive experience through successful cases in intelligent network and IT system O&M. Our algorithm selection process has leveraged Huawei’s successful experience in AIOps, and adopted machine learning and deep learning algorithms, such as Long Short-Term Memory (LSTM) networks, correlation data mining, decision trees, and random forests.
In addition, Huawei optimized algorithm performance and accelerated model generation using open-source algorithms. Huawei also improved the generalization capability and prediction precision of models through algorithm optimization.
The modeling and verification of this project indicated a 60 to 80 percent decrease in work orders based on the alarm data generated by different service systems. Currently, the convergence rate of alarm work orders exceeds 70 percent. According to O&M engineering assessments, the accuracy rate of convergence results has exceeded 90 percent.
• AI-Based Alarm Work Order Convergence — ‘Outside’
The cloud brain presents analysis results on an analysis dashboard. After the alarms from each architectural layer are analyzed by the cloud brain, the system outputs the convergence and source tracing results of the alarm work orders. Should a fault occur, the cloud brain makes the entire analysis process highly efficient. O&M engineers can directly identify root causes and perform troubleshooting operations using the automatic O&M platform, which greatly simplifies the analysis process.
• AIOps Practice: Business Volume Prediction — ‘Inside’
The key data for predicting business volume includes the number of life insurance issuance orders, vehicle insurance reports, settled automotive insurance claims, underwriting issuance orders, and CPIC life and property insurance calls from 2016 and 2017. XGBoost, a boosting ensemble algorithm that has proven highly effective in prediction tasks, was selected as the main modeling algorithm.
Based on the XGBoost algorithm, a basic model of 2017 property and vehicle insurance cases was generated, and its predictions of case volume during the Spring Festival, National Day, and Minor Vacation were roughly correct. However, the model was still not accurate enough. New models were established for each mode using historical data, and the basic model was further adjusted. The result showed that the enhanced model’s error rate was 50 percent lower than that of the previous iteration.
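The idea of a basic model corrected by per-mode models can be shown in miniature. The sketch below deliberately swaps XGBoost for a weekday-average baseline so that it runs with the standard library alone. The mode labels and the multiplicative correction are assumptions for illustration, not the project's actual method.

```python
from collections import defaultdict
from statistics import mean


def fit_base(history):
    """Base model: mean volume per weekday, learned from 'normal' days only.

    `history` is a list of (weekday, mode, volume) tuples; the layout is assumed.
    """
    by_weekday = defaultdict(list)
    for weekday, mode, volume in history:
        if mode == "normal":
            by_weekday[weekday].append(volume)
    return {weekday: mean(volumes) for weekday, volumes in by_weekday.items()}


def fit_mode_factors(history, base):
    """Per-mode correction: average ratio of observed volume to base prediction."""
    ratios = defaultdict(list)
    for weekday, mode, volume in history:
        if mode != "normal" and base.get(weekday, 0) > 0:
            ratios[mode].append(volume / base[weekday])
    return {mode: mean(values) for mode, values in ratios.items()}


def predict(weekday, mode, base, factors):
    """Base prediction scaled by the mode factor (1.0 for unknown modes)."""
    return base[weekday] * factors.get(mode, 1.0)
```

With a history in which holiday volumes run at half the normal level, fit_mode_factors learns a factor of 0.5 for that mode, and predict scales the weekday baseline accordingly.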
• AIOps Practice: Business Volume Prediction — ‘Outside’
The system dynamically presented the predicted trends to service and technical personnel in real time through Kanban boards. This helped warn of risks and minimize the impact of service changes on IT resource support. Adopting predictive Kanban boards has become an O&M development trend. Measured by the Normalized Root-Mean-Square Error (NRMSE, also called Normalized Root-Mean-Square Deviation), taken here as the prediction difference divided by the average daily transaction volume, the cloud brain’s performance reached the reference range with an error rate of less than 30 percent.
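For readers who want to reproduce the metric, a minimal sketch of NRMSE follows. It assumes normalization by the mean of the actual values, which matches the description of dividing by average daily volume. Other conventions, such as dividing by the range, also exist, and the function name is illustrative.

```python
import math


def nrmse(actual, predicted):
    """Root-mean-square error divided by the mean of the actual values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(squared_errors) / len(actual))
    return rmse / (sum(actual) / len(actual))
```

For example, daily volumes of 100, 200, and 300 predicted as 110, 190, and 330 give an NRMSE of roughly 0.096, well inside a 30 percent threshold.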
After the alarm convergence model has been integrated into the current alarm platform, the system will connect to the automatic O&M platform and implement intermediate processing capabilities like alarm combinations, and fault and correlation analysis. O&M personnel will no longer need to manually review historical alarm information, and root causes will be located quickly and automatically. It is estimated that for every 700,000 alarm work orders in a year, the alarm convergence model will reduce the manual workload by more than seven person-years, and reduce troubleshooting time by 22 percent.
The integration of the trend prediction alarm model into the Kanban platform will facilitate the collaboration between business and IT departments, predict the system capacities required to respond to service changes, create running data archive files, and explore more extended applications.
In the two scenarios, the cloud brain’s model management platform provides online and offline running models with functions like structured and unstructured data processing, as well as image/text recognition. The models are constructed, trained, optimized, verified, and then released. The system can interconnect with other business systems through interfaces. In addition to the current knowledge base, the underlying technologies used for knowledge mapping will be upgraded to provide more efficient and accurate results. Huawei’s graph engine products will be used to construct a knowledge graph and to provide high-performance storage, multi-hop query, and relationship analysis capabilities for the relationships it contains.
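A multi-hop query asks which entities can be reached from a starting entity within a given number of relationships. The sketch below shows the concept with a breadth-first search over a plain Python dictionary. It does not use or represent the API of Huawei's graph engine, and the entity and relation names are invented for the example.

```python
from collections import deque


def multi_hop(graph, start, max_hops):
    """Return {entity: hops} for every entity within `max_hops` edges of `start`.

    `graph` maps an entity to a list of (relation, neighbor) pairs.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del hops[start]
    return hops


# Hypothetical O&M topology: application -> VM -> host -> switch
topology = {
    "app": [("runs_on", "vm1")],
    "vm1": [("hosted_on", "host1")],
    "host1": [("connected_to", "switch1")],
}
within_two_hops = multi_hop(topology, "app", 2)
```

Here within_two_hops contains vm1 at one hop and host1 at two hops, while switch1 is excluded because it lies three hops away.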
• Cloud Brain Delivery
Kanban board: Visualized charts clearly display complex processes and data, correctly express data values, and transform the data into ‘user stories’ so that problems can be located and understood quickly. The system can filter details based on user requirements, gain insight into root causes, and provide decision-making capabilities.
Model: Through application scenario analysis, scenario data can be used to train and select algorithm solutions, and create models. The system sets tracing points, cleans and stores data, and selects features. It can filter out invalid information, minimize uncertainties, mine information, and maximize algorithm performance. Application scenario analysis can continuously measure model intelligence, as well as verify, train, and iterate data until preset requirements are met.
Engine: The engine enables service personnel to easily and efficiently configure real-time and quasi-real-time decision models and develop rules based on application scenarios. In this way, thousands of decision-making models and rules can be calculated based on massive streams of real-time data that meet the scenario requirements for high concurrency and low latency. Therefore, the engine is programmable, scalable, highly compatible, energy efficient, and elastic.
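As a rough illustration of configurable rules evaluated against a stream of events, consider the sketch below. The Rule structure, the metric names, and the actions are hypothetical. A real engine would add concurrency, windowing, and model scoring, none of which is shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]  # returns True when the rule fires
    action: str                        # label of the decision to take


def evaluate(events: Iterable[Dict], rules: List[Rule]) -> Iterator[Tuple[Dict, str]]:
    """Lazily yield (event, action) for each rule that matches each event."""
    for event in events:
        for rule in rules:
            if rule.condition(event):
                yield event, rule.action


# Illustrative rules; thresholds are assumptions, not recommended values.
rules = [
    Rule("cpu_hot", lambda e: e.get("cpu", 0) > 90, "scale_out"),
    Rule("disk_full", lambda e: e.get("disk", 0) > 80, "clean_logs"),
]
events = [
    {"cpu": 95, "disk": 10},
    {"cpu": 20, "disk": 85},
    {"cpu": 10, "disk": 10},
]
actions = [action for _event, action in evaluate(events, rules)]
```

Because evaluate is a generator, it can consume an unbounded stream one event at a time, and rules can be added or removed without changing the evaluation loop.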
Interface: The cloud brain provides standard interfaces based on actual system requirements, delivers analysis and reasoning services, and connects multiple platforms to make services faster and more convenient.
• Cloud Brain Application Scenarios
Inside the industry: Problem defect convergence, resource usage forecasts, user behavior predictions for insurance customization, cost (claim) forecasts, and fraud prevention through insurance system association.
Outside the industry: Massive information convergence, logistics forecasts, website traffic forecasts, sales forecasts, fraud prevention through insurance system association, and human flow forecasts.
The cloud brain focuses on essential customer requirements in specific application scenarios and addresses customer O&M pain points. Customers want stable services and smooth user experiences. The cloud brain can help O&M personnel quickly locate faults, repair faults at their source, and resolve problems in a timely manner.
According to Gartner’s analysis, the global AIOps deployment rate will increase from 10 percent in 2017 to 50 percent by 2020. In addition to the Internet, AIOps application domains will include High-Performance Computing (HPC), telecommunications, finance, electric power, the Internet of Things (IoT), healthcare, aerospace, military equipment, and networks.
Regarding intelligent O&M, the entire industry is still in the exploration stage, and many companies are taking a ‘wait-and-see’ approach before making their foray into the domain. However, when it comes to intelligence, no one is an outsider.
What is your role in the transformation toward intelligent O&M?