New Challenges in Traditional O&M Domains
With the rapid development of cloud computing, Big Data, and AI applications, demands on servers and computing capabilities keep rising. In turn, data center construction around the globe is advancing in both speed and scale. It is now common to see data center deployments that involve tens to hundreds of thousands of servers.
As reported by Gartner, global server revenue increased by 25.7 percent in the fourth quarter of 2017, and the server technology industry is in a prime period of rapid growth. With the blazing-fast development of services, IT infrastructures need to be quickly deployed, brought online, and conveniently managed. The management scenarios for massive numbers of servers are becoming more and more complex, and traditional Operations and Maintenance (O&M) domains face many new challenges.
• Challenges to Server Deployment: In data center expansion, migration, and consolidation scenarios, newly purchased servers need to be assembled, commissioned, allocated network resources, and provisioned with configurations. This on-site work requires O&M technical staff to install hardware and deploy software by hand. Statistics gathered by Huawei IT service departments indicate that more than 50 percent of operating faults are caused by inefficient and error-prone manual operations, which result in extra labor and material costs.
• Challenges to Energy Consumption Management: According to a Climate Change News report, global data centers accounted for 3 percent of total global power consumption in 2017, and the proportion is expected to reach 20 percent in 2025. Statistics show that energy consumption accounts for 35 percent of data center Operating Expense (OPEX), so skyrocketing OPEX will become a global challenge. As a result, customers mainly need reliable power management policies that reduce energy consumption, along with accurate predictions of energy cost, which are critical to planning data center investment precisely.
• Challenges to Fault Prewarning and Diagnostics: In a traditional O&M mode, technical personnel reactively wait for faults to occur and then rectify them. In this old-school mode, each person can maintain only 50 to 100 servers. As the scale of data centers continues to rise, faults will occur more frequently, and the associations between faults will become more complex. Efficiency will fall proportionally if the industry is not able to move beyond the traditional modes of server maintenance. Further, traditional maintenance is based on alarm reporting, which means that problems are noticed and fixed only after critical thresholds are crossed. This, in turn, leads to service interruptions. Against such a backdrop, it is difficult to deliver on Service Level Agreement (SLA) guarantees of 99.95 percent or above.
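To make the figures in the last challenge concrete, the short sketch below works through the arithmetic. It uses only the numbers quoted above (50 to 100 servers per person, a 99.95 percent SLA); the 100,000-server fleet size and the function names are illustrative assumptions, not Huawei data.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_budget_hours(availability_percent: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime (in hours) that still meets the availability target."""
    return period_hours * (1 - availability_percent / 100)


def staff_needed(servers: int, servers_per_person: int) -> int:
    """Headcount required when each person can maintain a fixed number of servers."""
    return math.ceil(servers / servers_per_person)


# A 99.95 percent SLA leaves roughly 4.38 hours of downtime per year.
yearly_budget = downtime_budget_hours(99.95)
print(f"Yearly downtime budget: {yearly_budget:.2f} h")

# Assumed fleet of 100,000 servers at 50 to 100 servers per person.
fleet = 100_000
print(f"Staff needed: {staff_needed(fleet, 100)} to {staff_needed(fleet, 50)}")
```

Under these assumptions, a reactive team of 1,000 to 2,000 people has only a few hours per year in which any interruption can go unresolved, which is why alarm-driven maintenance struggles at this scale.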
How Is Huawei Tackling These Challenges?
Gartner proposed the concept of Algorithmic IT Operations (AIOps), a novel form of intelligent O&M, in 2016. The global deployment ratio of AIOps was lower than 5 percent in 2016, but is expected to reach 25 percent in 2019. In other words, intelligent O&M will become the new normal. The AIOps platform is defined by 11 capabilities: historical data management, stream data management, log data extraction, network data extraction, algorithm data extraction, text and Natural Language Processing (NLP) document extraction, automatic model discovery and prediction, exception detection, root cause analysis, on-demand delivery, and software service delivery. These capabilities enable targeted solutions to the preceding pain points, and are the main development direction of massive data center server management.
Figure 1. Algorithmic IT Operations (AIOps) overview (Gartner, 2016)
AIOps is still in the midst of a long-term evolution. Today, AIOps focuses on detection and prediction based on massive machine data, which turns reactive O&M into a proactive practice. The optimization is mainly on the software side. However, delivering a material leap forward in aspects such as deployment, energy saving, and fault management requires close synergy between hardware and software.
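The difference between reactive alarms and proactive prediction can be illustrated with a minimal sketch. It fits a straight line to recent sensor samples and estimates how long remains before a threshold is crossed. The linear-trend method, the function names, and the sample temperatures are assumptions chosen for illustration; they do not describe Gartner's definition or any Huawei algorithm.

```python
from typing import Optional, Sequence


def least_squares_slope(values: Sequence[float]) -> float:
    """Slope of the best-fit line through evenly spaced samples (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def samples_until_threshold(values: Sequence[float],
                            threshold: float) -> Optional[float]:
    """Estimate how many more samples pass before the threshold is reached.

    Returns 0.0 if the latest sample already meets the threshold, and None
    if the metric is not trending toward the threshold.
    """
    if values[-1] >= threshold:
        return 0.0
    slope = least_squares_slope(values)
    if slope <= 0:
        return None
    return (threshold - values[-1]) / slope


# Hypothetical inlet temperatures (degrees Celsius), one sample per minute.
temperatures = [60.0, 62.0, 64.0, 66.0, 68.0]
alarm_threshold = 80.0

reactive_alarm = temperatures[-1] >= alarm_threshold          # still silent
lead_time = samples_until_threshold(temperatures, alarm_threshold)
print(f"Alarm raised now: {reactive_alarm}, predicted crossing in {lead_time} samples")
```

In this example the threshold alarm has not fired yet, while the trend estimate already gives a few minutes of warning. A production system would use far richer models and data, but the contrast between the two modes is the same.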
In response, Huawei has put forward the concept of Intelligent Servers that integrate intelligent management chips and intelligent algorithms to implement more efficient server deployments, fault diagnostics and prediction, energy consumption management, mobile O&M, and version management.
The Huawei Intelligent Server is an integrated software and hardware solution that combines intelligent chips with an O&M platform and intelligent Baseboard Management Controller (iBMC) software.
Figure 2. Five major functions of the Intelligent Server
So, what are the advantages of this holistic hardware and software solution?
Compared to traditional OEM servers, Huawei servers provide intelligent management functions, such as single-node-level fault prediction and analysis, and intelligent power consumption management. In addition, the Graphical User Interface (GUI) is designed to present the operating status in a user-friendly and intelligent fashion. The result is a reduction of O&M personnel costs and an improved O&M experience. What’s more, Huawei Intelligent Servers allow maintenance personnel to access server O&M systems locally via Bluetooth and Wi-Fi, which greatly simplifies server deployment and fault location.
For example, in deployment and maintenance scenarios, the Huawei Intelligent Server provides a one-click Wi-Fi hotspot button. After arriving at the site, the maintenance engineer can touch the Wi-Fi hotspot button, use the mobile App to scan the bar code on the server to access the server O&M network, and then quickly access the server enclosure information and provision configurations. The maintenance engineer can also perform assembly and maintenance according to the guidance provided by the mobile App.
The Huawei Intelligent Server is based on a hardware platform that supports feature-rich intelligent management, which, in turn, greatly complements intelligent O&M scenarios. In many scenarios, the main bottleneck in the manual operation of O&M personnel is not that the desired information is lost in an ocean of data, but that the hardware itself does not support intelligent management. Intelligent Servers bridge the gap between hardware and software to resolve fundamental problems that software-only O&M approaches cannot address. Thanks to improvements in silicon chip capabilities, the information collected by the servers is more comprehensive and provides a more reliable reference for the intelligent O&M platform to make decisions.
For energy consumption management, the Intelligent Server integrates functions such as dynamic frequency scaling for CPU cores, fan speed tuning, and power supply hibernation. When service loads are low at night, users can set the energy consumption profile to the energy-saving mode. The Intelligent Server will then adjust the CPU clock frequency to keep power consumption within a specified range of values. It can also hibernate some Power Supply Units (PSUs) to further reduce power consumption. When service loads are heavy during daytime peak hours, users can set the energy consumption profile to the high-performance mode. The Intelligent Server will then lift all CPU frequency scaling restrictions and PSU hibernation configurations. In addition, it will apply the high-performance heat dissipation specifications for fan control, so that energy-saving policies follow real-world conditions. Combined, these features can reduce the energy draw of a single server cabinet by more than 10 percent.
The intelligent power consumption management platform also provides intelligent control of cabinet-level power consumption, where the power-capping value is recommended based on historical power records. In typical service scenarios, the server density of a single cabinet can be increased by more than 10 percent.
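The profile switching and history-based power capping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the iBMC interface: the class and function names, the night hours (22:00 to 06:00), the "peak plus 5 percent headroom" capping rule, the low-power fan setting for the energy-saving profile, and all wattage figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PowerProfile:
    name: str
    cpu_frequency_capped: bool  # limit CPU clock to hold power within a range
    psu_hibernation: bool       # allow some PSUs to hibernate
    fan_policy: str             # heat dissipation specification to apply


# Fan policy for the energy-saving profile is an assumption.
ENERGY_SAVING = PowerProfile("energy-saving", True, True, "low-power")
HIGH_PERFORMANCE = PowerProfile("high-performance", False, False, "high-performance")


def select_profile(hour: int, night_start: int = 22, night_end: int = 6) -> PowerProfile:
    """Pick a profile by hour of day (assumed night window: 22:00 to 06:00)."""
    is_night = hour >= night_start or hour < night_end
    return ENERGY_SAVING if is_night else HIGH_PERFORMANCE


def recommend_power_cap(history_watts: Sequence[float], headroom: float = 0.05) -> float:
    """Recommend a per-server cap: observed peak plus a safety headroom (assumed rule)."""
    if not history_watts:
        raise ValueError("power history must not be empty")
    return max(history_watts) * (1 + headroom)


def servers_per_cabinet(cabinet_budget_watts: float, per_server_watts: float) -> int:
    """Number of servers that fit within the cabinet power budget."""
    return int(cabinet_budget_watts // per_server_watts)


# Hypothetical numbers: 500 W nameplate rating, 5,000 W cabinet budget.
history = [310.0, 345.0, 380.0, 400.0]
cap = recommend_power_cap(history)
by_nameplate = servers_per_cabinet(5000.0, 500.0)
by_cap = servers_per_cabinet(5000.0, cap)
print(select_profile(2).name, select_profile(14).name)
print(f"Recommended cap: {cap:.0f} W, density: {by_nameplate} -> {by_cap} servers")
```

With these made-up figures, planning the cabinet around the recommended cap rather than the nameplate rating fits 11 servers instead of 10. The gain in a real deployment depends entirely on the actual power history and the capping rule the platform applies.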
The Intelligent Server inherits all the existing functions of the intelligent O&M platform and provides a new direction for the evolution of O&M. With the Intelligent Server solution in place, traditional O&M personnel will be freed from repetitive and low-value daily work. Automation can streamline manual operations and boost the efficiency of on-site personnel. In addition, intelligent energy consumption and fault management capabilities help achieve higher SLA fulfillment rates and save customers further OPEX.
Inspired by innovations from the very core of Huawei’s silicon chips, Intelligent Servers better position data center customers for excellence and success in the future.