Intelligent O&M in Future Cloud Data Centers
In the cloud computing era, building IT systems has become an increasingly vital part of achieving development agendas. Business systems and the infrastructure supporting those systems are a prime point of concern for many enterprises. Operations and Maintenance (O&M) systems are the ‘heroes behind the scenes’ that keep these platforms up and running, which makes them mission critical. In each IT system transformation, the greatest difficulties are often achieving the needed level of service assurance and enacting a viable O&M program; moving to cloud architectures adds several more challenges.
As more and more enterprises embrace the cloud, O&M personnel will face more pressure than ever before in achieving the rapid rollout, flexibility, scalability, and higher SLA requirements on business systems, all while working with limited budgets. When O&M is placed into the high complexity of Cloud Data Center (DC) environments saturated with massive amounts of equipment, achieving 99.95 percent quality in delivery of IT services while improving efficiency and lowering costs is the biggest challenge personnel face.
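To make the 99.95 percent target concrete, the following minimal sketch converts a service level into a downtime budget. It assumes the figure is read as an availability percentage; the function name downtime_budget_hours is illustrative and not part of any Huawei tool.

```python
def downtime_budget_hours(availability_percent: float,
                          period_hours: float = 365 * 24) -> float:
    """Return the hours of downtime allowed in the period at the given availability."""
    return period_hours * (1 - availability_percent / 100)


# Assumption: 99.95 percent is interpreted as availability over the period.
yearly_budget = downtime_budget_hours(99.95)              # about 4.38 hours per year
monthly_budget = downtime_budget_hours(99.95, 30 * 24)    # about 0.36 hours per 30 days
print(f"{yearly_budget:.2f} hours per year, {monthly_budget * 60:.1f} minutes per 30 days")
```

Under this reading, the whole O&M process has only a few hours per year in which services may be unavailable, which is why manual fault handling becomes hard to sustain at scale.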
To improve utilization efficiency, the resources in cloud architectures are shared, meaning that specific services and applications do not run on dedicated equipment. This approach is completely different from traditional IT models. Automated and flexible scalability strategies in cloud computing achieve a balance in terms of resource sharing, user experience, and service availability. This is a core advantage of the cloud computing model, but it also brings with it new challenges in O&M. Personnel often do not even know which piece of equipment is carrying a particular service, making it difficult to pinpoint faults. The ability to prepare for these unknowns requires more complete monitoring across the entire system to yield the needed visibility.
The leap to cloud architectures in enterprise IT does not happen in one stride; it is a long-term, phase-in process. The difference between traditional and cloud architectures requires many different tools, which presents greater challenges to O&M personnel. Achieving unification across both IT structures and centralizing management are just two of the new issues being faced.
The distributed architecture of cloud computing systems features automation in resource scheduling, fault isolation, fault recovery, and other utilities. This level of automation has overturned the traditional approaches to installing and deploying IT software as well as the familiar models in service usage and maintenance. The vast majority of operations are no longer carried out by personnel but are now automated. Therefore, personnel tasks are changing from being management-focused to requiring them to build automated O&M schemes and develop tools. O&M systems are also expected to evolve.
The key to automation is making IT systems more intelligent. Only when systems are fully enabled with smart attributes can their scalability, fault isolation, and restoration capabilities reach the level required by the enterprise’s system status, user scale, service experience quality, and policy rules. Smart management and O&M systems are being relied on to deliver 1) automation in management throughout the entire lifecycle, 2) the needed intelligent fault prevention, detection, and self-healing mechanisms, and 3) wide-open flexibility in how capacities are allocated with intelligent capacity management utilities.
The scale of resource and business capacity at cloud DCs far exceeds that of traditional layouts. If manual approaches were used to roll out, monitor, upgrade, expand, limit, downgrade, or decommission cloud-based services, the result would be low efficiency and a higher risk of operator error. Automation is imperative to improving the per-capita efficiency rating of personnel, satisfying agile service requirements, and gradually enabling the move to unattended O&M.
In addition to making use of the common capabilities afforded from the framework, O&M personnel can customize or make changes to the automation services whenever needed. After personnel develop the atomic scripts, the scripts are then visualized and submitted. The platform can then automatically schedule and implement the various online management utilities for each service.
Figure 1: Automated job customization process
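The idea of composing atomic scripts into a job that a platform then runs can be sketched as follows. This is a simplified illustration, not the Huawei platform's interface: the Job class, the step functions, and the context keys are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Job:
    """A named sequence of atomic scripts executed in order (hypothetical model)."""
    name: str
    steps: List[Callable[[Dict], None]] = field(default_factory=list)

    def run(self, context: Dict) -> List[str]:
        """Run every step against the shared context and return the step names executed."""
        completed = []
        for step in self.steps:
            step(context)
            completed.append(step.__name__)
        return completed


# Hypothetical atomic scripts: each performs one small, reusable operation.
def drain_traffic(ctx: Dict) -> None:
    ctx["traffic"] = "drained"


def upgrade_version(ctx: Dict) -> None:
    ctx["version"] = ctx["target_version"]


def restore_traffic(ctx: Dict) -> None:
    ctx["traffic"] = "serving"


upgrade_job = Job("rolling_upgrade", [drain_traffic, upgrade_version, restore_traffic])
context = {"version": "1.0", "target_version": "1.1", "traffic": "serving"}
completed = upgrade_job.run(context)
print(completed, context)
```

Keeping each script atomic means the same step can be reused in several jobs, and a scheduler only needs to know the order of steps, not their internals.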
In traditional models, the working style for O&M personnel was to wait for faults to occur. According to statistics, the planned work for O&M staff accounted for only about 50 percent of their day with the remaining time allotted to putting out fires. With the rapid growth in scale of cloud data centers, O&M personnel need to handle a growing number of events. Using manual-intensive methodologies to put out each little fire in the system is just not a viable solution. This is precisely why intelligent O&M platforms are required. These platforms make use of Big Data associative analysis and machine-learning technology to deliver the needed artificial intelligence enablement for the O&M system and provide intelligent support capabilities in everything from fault prevention and location to closed-loop handling.
[Key measurement #1]: Reduce manual operation-induced faults
According to the IT Department at Huawei, manually performed change operations are the reason for over 50 percent of faults. Most level-one events are also caused by some change operation. Those types of operations tend to be rather complex, which, in turn, makes manual handling prone to error. Automating the processes avoids unnecessary faults from manual operations, which is a key measurement for reducing failure rates.
[Key measurement #2]: Implement intelligent analysis on systems to find sub-health contributors and detect potential faults early
Using Big Data technology combined with cross-domain correlation analysis on the features of the faults enables early detection and predictive analysis. Integration with the automated strategy execution system allows problems to be solved without any interruption to services, before users even know something is wrong.
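As a much simpler stand-in for the Big Data correlation analysis described here, the sketch below flags a metric sample as a possible sub-health signal when it deviates strongly from its recent history. The rolling z-score technique, the window size, the threshold, and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import List


def find_anomalies(samples: List[float], window: int = 5,
                   threshold: float = 3.0) -> List[int]:
    """Return indexes of samples more than `threshold` standard deviations
    away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat history to avoid dividing by zero.
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Illustrative response-time samples in milliseconds (made-up data).
latency_ms = [20, 21, 19, 20, 22, 21, 20, 48, 21, 20]
print(find_anomalies(latency_ms))
```

In a real platform the flagged index would feed the automated strategy execution system rather than a print statement, and the detection would correlate many metrics instead of one.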
[Key measurement #1]: Build a fully linked, active, intelligent, and multi-indexed monitoring system to comprehensively cover all elements with multi-mechanism integrations
O&M systems need to support unified management over equipment rooms and facilities, physical infrastructures, cross-center backbone networks, virtualized resource pools, cloud services, and applications to offer centralized and multi-faceted monitoring across multiple DCs.
If a fault occurs at a data center, the current and historical operating status of each resource and cloud service at the center can be quickly obtained with the conveniences of visualization. The information that can be queried includes performance capacity, associated objects and alarms, and information on the topology and various logs.
[Key measurement #2]: Visualization of system status
Visualized display of the application topology and health status allows personnel to view operational indicators and changes to critical services at a glance. The business application indexes we usually collect include user experience (page response speed and access performance), user behavior (number of user visits, number of active users, and maximum amount of concurrent access), business efficiency (end-to-end business processing time, transaction success rate, and the volume of supported traffic), and SLA.
IT operations personnel and administrators can use the information relating to the performance capacity of O&M items, alarm statistics and analysis, resource utilization reports, and health and capacity forecasts to generate monthly and quarterly reports on O&M quality analysis to support annual IT planning.
[Key measurement #1]: Use of traffic tracking systems to locate faults rapidly
Scheduling in cloud-based and microservice environments is complex, which makes faults in such environments hard to locate, so supplementary fault locating tools are needed to improve efficiency. Through monitoring of various metrics, the time needed to locate a fault is reduced from hours down to minutes.
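One way traffic tracking can narrow down a fault is to record each hop of a request as a span and then look for the failing span that has no failing child. The sketch below assumes a simple parent/child span model; the Span fields, the service names, and the locate_fault function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One hop of a traced request (hypothetical record layout)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float
    error: bool = False


def locate_fault(spans: List[Span]) -> Optional[str]:
    """Return the service of the first failing span that has no failing child,
    which is the likely origin of the error."""
    failing = [s for s in spans if s.error]
    failing_parents = {s.parent_id for s in failing}
    for span in failing:
        if span.span_id not in failing_parents:
            return span.service
    return None


# Made-up trace: the error propagates upward from the database call.
trace = [
    Span("a", None, "gateway", 950, error=True),
    Span("b", "a", "order-service", 900, error=True),
    Span("c", "b", "inventory-db", 880, error=True),
    Span("d", "a", "auth-service", 12),
]
origin = locate_fault(trace)
print(origin)
```

Because errors usually propagate from callee to caller, the deepest failing span points to where investigation should start, even when personnel do not know which equipment carries the service.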
[Key measurement #2]: Build an expert diagnostic system complete with intelligent fault locating capabilities, and automated recovery and processing on known faults
Routine analysis of fault summaries, and their continuous accumulation in the fault feature library, gives the expert system intelligent fault location capabilities and allows automated recovery operations on known faults.
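A fault feature library can be pictured as a list of known signatures, each mapped to a recovery action. The sketch below is a deliberately small illustration: the regular-expression patterns, the action names, and the match_known_fault function are hypothetical and do not describe the actual library format.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical library entries: (alarm text pattern, recovery action name).
FAULT_LIBRARY: List[Tuple[str, str]] = [
    (r"disk usage .* (9\d|100)%", "purge_temp_files"),
    (r"connection pool exhausted", "restart_service"),
]


def match_known_fault(alarm_text: str) -> Optional[str]:
    """Return the recovery action for the first matching known fault,
    or None when the alarm matches no entry and needs an expert."""
    for pattern, action in FAULT_LIBRARY:
        if re.search(pattern, alarm_text, re.IGNORECASE):
            return action
    return None


print(match_known_fault("Disk usage on /var at 95%"))
print(match_known_fault("unknown kernel panic"))
```

Each fault that experts diagnose manually can be added as a new entry, so the share of alarms handled automatically grows over time.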
In traditional DCs, business systems deployed by each respective department cannot be shared, and server utilization is as low as 20 percent. Moving DCs to the cloud enables resource sharing and dynamic scheduling capabilities but brings challenges such as fragmentation, load imbalances, and difficulty in guaranteeing SLAs.
Intelligent capacity management combines Big Data analysis and forecasting technologies to present the available capacity of physical resources (server, storage, and network devices) and cloud-based resources (VMs, block storage, and so on) in real time. Utilities are also able to capture snapshots of capacity, the loads on devices, and an overall view on fragmentation. Traditional O&M is unable to migrate or dynamically expand capacities, resulting in unbalanced loads. In cloud DCs, capacity management supplies O&M administrators with information on resources with low loads and provides recommendations on adjustments. In traditional layouts, resource fragmentation often leads to utilization efficiencies as low as 20 percent while other silos are running out of resources. Capacity fragmentation management in cloud-enabled centers provides O&M administrators a view on the physical distribution of the various types of resources as well as recommendations on resource tuning, which provides greatly improved utilization rates of existing resources.
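Two of the views mentioned above, low-load resources and fragmentation, can be approximated with very little code. The sketch below assumes per-host CPU utilization and free vCPU counts are already collected; the 20 percent threshold, the host names, and both function names are illustrative assumptions.

```python
from typing import Dict, List


def low_load_hosts(cpu_utilization: Dict[str, float],
                   threshold: float = 0.2) -> List[str]:
    """Return hosts whose utilization is below the threshold, sorted by name."""
    return sorted(h for h, u in cpu_utilization.items() if u < threshold)


def fragmentation_ratio(free_vcpus_per_host: List[int], vm_size: int) -> float:
    """Return the share of free vCPUs that cannot host a VM of `vm_size` vCPUs
    because they are scattered in pieces that are too small."""
    total_free = sum(free_vcpus_per_host)
    if total_free == 0:
        return 0.0
    usable = sum((free // vm_size) * vm_size for free in free_vcpus_per_host)
    return (total_free - usable) / total_free


# Made-up inventory data for illustration.
utilization = {"host-1": 0.12, "host-2": 0.65, "host-3": 0.08}
print(low_load_hosts(utilization))
print(fragmentation_ratio([3, 3, 2, 8], vm_size=4))
```

In the example, 16 vCPUs are free in total but only 8 of them sit on a host with room for a 4-vCPU VM, so half of the free capacity is fragmented. A recommendation to migrate workloads off the low-load hosts would consolidate that capacity.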
When the utilization of cloud resources reaches a certain threshold, planners need to start thinking about expansion. Traditional approaches to forecasting the required capacity mainly relied on the limited experience of individuals and the limited amount of data that could be imported to help derive a trend. From there, a forecast was made on how much capacity would actually be needed. To be safe, planners usually overshot the capacity, causing 20 percent to 30 percent of resources to remain idle. In contrast, intelligent capacity management combines data on resource capacities, application behavior analysis, performance, financial information, and other dimensions to provide accurate predictions on how the applications used in multiple business departments will affect the various types of resource capacity in the IT infrastructure. Planners then have a much clearer picture of the resources required for the future as they formulate effective procurement and expansion plans.
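The simplest data-driven forecast is a straight-line trend fitted to past utilization. The sketch below uses an ordinary least-squares line, which is far cruder than the multi-dimensional analysis described above and is shown only to illustrate the principle; the function name, the monthly sampling interval, and the sample data are assumptions.

```python
from typing import List, Optional


def months_until_threshold(utilization: List[float],
                           threshold: float) -> Optional[float]:
    """Fit a least-squares line to monthly utilization samples and return how many
    months after the last sample the trend reaches the threshold.
    Returns None when there are too few samples or utilization is not growing."""
    n = len(utilization)
    if n < 2:
        return None
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(utilization) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, utilization))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (n - 1))


# Made-up history: utilization grows five points per month.
history = [0.40, 0.45, 0.50, 0.55, 0.60]
print(months_until_threshold(history, threshold=0.80))
```

With this history the trend reaches 80 percent about four months after the last sample, which gives planners a lead time for procurement instead of a safety margin chosen by guesswork.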
Intelligent capacity management achieves improved visibility into resource status, the ability to observe and track issues, identification of risks, and measurable and adjustable controls to improve the resource utilization rate to 70 percent or more.
More successful approaches in cloud DCs apply automated and intelligent O&M systems. The advancements achieve impressive improvements in O&M efficiency, all while ensuring 99.95 percent or higher quality in services at the user level. In traditional O&M approaches, each technician could manage 50 to 100 devices. Now, each technician can manage 5,000 to 10,000 devices (a 100-fold improvement). Overall, resource utilization has also improved from as low as 20 percent to between 60 percent and 70 percent, roughly three to 3.5 times the original level (Table 1).
Table 1: Improvements in O&M efficiency
In one example, by using standardized, automated, intelligent O&M, Huawei’s R&D department improved resource utilization from as low as 10 percent in many cases to between 40 percent and 50 percent, with only 11 people maintaining 100,000 devices.
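The efficiency figures quoted above can be checked with simple arithmetic. The sketch below only restates numbers from the text; the fold_change helper is an illustrative name.

```python
def fold_change(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


# Devices per technician: 50-100 before, 5,000-10,000 after.
devices_low = fold_change(50, 5_000)      # 100.0
devices_high = fold_change(100, 10_000)   # 100.0

# Resource utilization: 20 percent before, 60-70 percent after.
utilization_low = fold_change(20, 60)     # 3.0
utilization_high = fold_change(20, 70)    # 3.5

# R&D example: 100,000 devices maintained by 11 people.
devices_per_person = 100_000 / 11         # roughly 9,091

print(devices_low, utilization_low, utilization_high, round(devices_per_person))
```

The R&D example works out to roughly 9,000 devices per person, which falls inside the 5,000 to 10,000 range given for automated O&M.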
At the same time, the introduction of an automated, intelligent, visualized O&M platform allows personnel to break away from mechanical, repetitive, and low-value tasks while lowering incidences of human error in processing, which indirectly helps protect the quality of IT services and lower operational costs. More importantly, O&M personnel are freed up to take part in higher value tasks, such as architectural design and development in addition to assessment and the introduction of new technologies to better support business innovation. IT teams and individuals can create more value for the enterprise by applying automated and intelligent O&M platforms that also help standardize IT management processes with the use of tools. With this automation, the entire O&M process becomes standardized, and compliance is improved to ensure SLAs are met to support the healthy development of business.
Figure 2: Open Huawei cloud O&M platform
In addition to helping enterprises build an automated, intelligent, and visualized O&M platform, the Huawei Cloud Data Center solution leverages the telecom team’s many years of expertise and explorative achievements in new technologies.
Huawei’s internal O&M teams are responsible for maintaining the massive scale of the Huawei Enterprise Cloud and its own private cloud. These teams are also responsible for monthly O&M quality analysis as well as statistical analysis and summaries on faults. High-risk and high-frequency operations require automation. Huawei’s self-operated enterprise cloud employs the DevOps model to rapidly build up and improve on O&M capabilities. After being fully validated, O&M capabilities are commercialized and included in the baseline version of the Huawei Cloud O&M solution to make the best practices available to our wide customer base. For example, the ECS service calls upon tracking tools, as one of the methods for accumulating experience from routine O&M, to integrate capabilities and improve the platform.
The developer community for cloud-based O&M that Huawei operates opens up APIs used at several layers to meet the application development requirements of various use cases. The community allows partners to participate in the accumulation of capabilities, enrich tool catalogs, and enhance the components going into the O&M offerings, thereby strengthening the cloud-based O&M ecosystem.
Huawei cloud O&M systems adopt a microservice architecture in support of container-based deployments, enabling agility in delivery and contributing to excellent scalability. Agility in delivery means each microservice is independently developed, released, and updated for quick iterations. Excellent scalability means flexible expansion of each microservice, which, in turn, helps to ensure ease in scalability for the entire O&M system; a minimal amount of resources can be deployed at the outset and grown on demand. Container-based deployments significantly reduce the costs in managing nodes.
Huawei has engaged in the communications technology field for 28 years, serving the carrier domain throughout the world. Huawei has established a complete technical support system comprising two global technical assistance centers and numerous regional technical assistance centers. The company has trained teams of highly skilled experts around the world to work in its technical support system that spans the globe.
Huawei offers a wide range of O&M models for customers to choose from, including customer-managed O&M, Huawei-managed O&M with on-site personnel, or Huawei-assisted remote O&M. Customers opting for the self-managed model can still avail themselves of the 24/7 customer hotline, deploy Cloud Service to enable automatic fault reporting and troubleshooting, and make use of the Huawei eCare tool to monitor all processes and ensure timely solutions to customer problems.
Figure 3: Technical support system
Figure 4: Intelligent O&M platform
With its extensive expertise in the O&M field of ICT infrastructure and leveraging the total advantages of its own product line, Huawei delivers complete cloud DC management capabilities covering everything from servers and storage equipment to networking, virtualized resources, and cloud-based services and applications. Full-stack management paves the way for end-to-end service monitoring, fault diagnosis and locating, and automation during the entire lifecycle, among other capabilities.
In the last three years, the scale of Huawei’s cloud-based DCs has increased several times over. The O&M solution has reached a 99.6 percent SLA fulfillment rate with less than a 10 percent increase in O&M personnel. The average utilization rate of computing resources has reached over 50 percent, better supporting agile development in R&D. In one instance, the planned maintenance and version upgrade to DCs during the 2016 National Day holiday in China involved 11 equipment rooms across the country with a total of 15,000 physical servers and 300,000 VMs. If traditional O&M approaches were used, each engineer would have been able to handle only 3,000 to 4,000 VMs, which would have required more than 100 employees. With the one-click power-up and power-down and the batch version upgrade capabilities of the intelligent O&M platform, fewer than 20 people were needed, and the time needed to power up and power down each equipment room was cut in half (from 10 hours down to 5).
Cloud-based O&M is an essential part of any cloud computing layout. The importance continues to grow as these platforms are becoming a core competitive strength. As a next step, Huawei will increase investments in artificial intelligence applied to cloud-based O&M and extend the inclusion of robotics at DCs in more O&M use cases to replace the traditional approach in manual operations. All these investments are aimed at providing customers with a highly automated and intelligence-infused cloud DC O&M solution able to achieve at least limited ‘unattended’ operations.