
    Intelligent O&M in Future Cloud Data Centers

In the cloud computing era, building IT systems is becoming an increasingly vital link in achieving development agendas. Business systems and the infrastructure supporting them are a prime point of concern for many enterprises. Operations and Maintenance (O&M) systems are the ‘heroes behind the scenes’ that keep these platforms up and running, which makes them mission critical. In each IT system transformation, the greatest difficulties are often achieving the needed level of service assurance and enacting a viable O&M program. Moving to cloud architectures presents several more challenges.

New O&M Systems Requirements from Cloud Architectures 

O&M Pressures from Cloud Computing and Business Requirements

As more and more enterprises embrace the cloud, O&M personnel will face more pressure than ever before in achieving the rapid rollout, flexibility, scalability, and higher SLA requirements on business systems, all while working with limited budgets. When O&M is placed into the high complexity of Cloud Data Center (DC) environments saturated with massive amounts of equipment, achieving 99.95 percent quality in delivery of IT services while improving efficiency and lowering costs is the biggest challenge personnel face.

  • Guaranteeing high O&M quality: The amount of equipment within data centers has grown exponentially, from several dozen to hundreds and thousands, or even millions of pieces, as we migrate to cloud-based DCs. The massive number of devices presents a huge challenge to achieving rapid fault positioning and isolation. Adding virtualization and distributed elasticity technologies makes the cloud DC environment even more complex which, in turn, makes O&M more difficult. Once seldom-occurring faults become the norm and system impact increases, reaching 99.95 percent quality at the user-level in the SLA is difficult to achieve.
  • Improving O&M efficiency: Adding virtualization capabilities and open-source technologies makes O&M even more complicated. Manual operations and maintenance on a network prove too slow and the probability of error too high. Most personnel can handle from 50 to 100 pieces of equipment, which indicates that a large amount of manpower would be needed to operate any large cloud-based environment.
  • Maintaining low operation costs: Resource utilization is typically 20 percent or less in traditional IT. That rate improves significantly when resources are moved to the cloud model. However, the very nature of personalized applications and requirements for on-demand elasticity tend to fragment resources and lead to load imbalance, making it difficult to plan capacities. As a result, planning objectives are not met, and O&M costs continue to run high.
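The figures above can be turned into concrete budgets. The following is a minimal sketch, not part of the original article: it converts an availability target into allowed downtime and estimates head count from a devices-per-person ratio. The function names and the 10,000-device fleet are assumptions chosen for illustration.

```python
import math


def downtime_budget_hours(availability_percent: float,
                          period_hours: float = 365 * 24) -> float:
    """Hours of downtime allowed in the period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


def staff_needed(devices: int, devices_per_person: int) -> int:
    """Head count needed if each person handles a fixed number of devices."""
    if devices_per_person <= 0:
        raise ValueError("devices_per_person must be positive")
    return math.ceil(devices / devices_per_person)


# 99.95 percent over a 365-day year leaves about 4.38 hours of downtime.
print(f"{downtime_budget_hours(99.95):.2f} hours per year")

# Hypothetical fleet of 10,000 devices at 50 to 100 devices per person.
print(staff_needed(10_000, 100), "to", staff_needed(10_000, 50), "people")
```

At 50 to 100 devices per person, head count grows linearly with the fleet, which is why manual O&M does not scale to large cloud environments.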

Service Assurances and High Availability Requirements of Cloud Architectures Create Several Unknowns for O&M Planning

To improve utilization efficiency, the resources in cloud architectures are shared, meaning that specific services and applications do not run on dedicated equipment. This approach is completely different from traditional IT models. Automated and flexible scalability strategies in cloud computing achieve a balance in terms of resource sharing, user experience, and service availability. This is a core advantage of the cloud computing model, but it also brings with it new challenges in O&M. Personnel often do not even know which piece of equipment is carrying a particular service, making it difficult to pinpoint faults. The ability to prepare for these unknowns requires more complete monitoring across the entire system to yield the needed visibility. 

Unified O&M Management in Hybrid IT Systems

The leap to cloud architectures in enterprise IT does not happen in one stride; it is a long-term, phase-in process. The difference between traditional and cloud architectures requires many different tools, which presents greater challenges to O&M personnel. Achieving unification across both IT structures and centralizing management are just two of the new issues being faced. 

Full Automation Requires Personnel to Transition from a Management Role to a Development Role

The distributed architecture of cloud computing systems features automation in resource scheduling, fault isolation, fault recovery, and other utilities. This level of automation has overturned the traditional approaches to installing and deploying IT software as well as the familiar models in service usage and maintenance. The vast majority of operations are no longer carried out by personnel but are now automated. Therefore, personnel tasks are changing from being management-focused to requiring them to build automated O&M schemes and develop tools. O&M systems are also expected to evolve.

Intelligent O&M Supports Automation in IT Systems

The key to automation is making IT systems more intelligent. Only when fully enabled with smart attributes can system scalability, fault isolation, and restoration capabilities reach the level needed for enterprise system status, user scale, and service experience quality, as well as policy rules. Smart management and O&M systems are being relied on to deliver 1) automation in management throughout the entire lifecycle, 2) the needed intelligent fault prevention, detection, and self-healing mechanisms, and 3) wide-open flexibility in how capacities are allocated with intelligent capacity management utilities. 

Automation in Management throughout the Entire Lifecycle

The scale of resource and business capacity at cloud DCs far exceeds that of traditional layouts. If manual approaches were used to roll out, monitor, upgrade, expand, limit, downgrade, or decommission cloud-based services, the result would be low efficiency and a higher risk of operator error. Automation is imperative to improving the per-capita efficiency rating of personnel, satisfying agile service requirements, and gradually enabling the move to unattended O&M.

  • Workflow-centric automation in service platforms simplifies complicated operations: Automated service platforms provide standardized and tool-based architectures that, through reconfigurations for high-change scenarios, are able to achieve out-of-the-box results, greatly simplify formerly complex operations, and significantly improve O&M efficiencies while lowering the incidence of erroneous operations. These high-change scenarios include rectification of known typical faults, expansions and reductions to capacities of resource pools, installation of patches, execution of health checks, compliance auditing, rectification of non-compliance issues, batch software installations, backup policies for management node configurations, extraction of configuration information, and power-on and power-off processes for large numbers of devices. With authority and domain-based management in addition to provisioning of operation logs, security and auditing requirements are satisfied while O&M and changes become more controllable and efficient.

In addition to making use of the common capabilities afforded from the framework, O&M personnel can customize or make changes to the automation services whenever needed. After personnel develop the atomic scripts, the scripts are then visualized and submitted. The platform can then automatically schedule and implement the various online management utilities for each service.

Figure 1: Automated job customization process
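The customization flow described above (develop atomic scripts, submit them, and let the platform schedule them) can be sketched as follows. This is an illustrative toy, not Huawei's platform API: `JobPlatform`, `Job`, and the script names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    """A named workflow made of atomic script names, run in order."""
    name: str
    steps: List[str]


class JobPlatform:
    """Hypothetical platform that stores atomic scripts and runs jobs."""

    def __init__(self) -> None:
        self._scripts: Dict[str, Callable[[str], bool]] = {}
        self.log: List[str] = []  # operation log kept for auditing

    def register(self, name: str, script: Callable[[str], bool]) -> None:
        self._scripts[name] = script

    def run(self, job: Job, targets: List[str]) -> Dict[str, bool]:
        missing = [s for s in job.steps if s not in self._scripts]
        if missing:
            raise KeyError(f"unregistered scripts: {missing}")
        results: Dict[str, bool] = {}
        for target in targets:
            ok = True
            for step in job.steps:
                ok = self._scripts[step](target)
                status = "ok" if ok else "failed"
                self.log.append(f"{job.name}:{step}:{target}:{status}")
                if not ok:
                    break  # stop this target at the first failing step
            results[target] = ok
        return results


platform = JobPlatform()
# Made-up atomic scripts: the health check fails on node-2 only.
platform.register("health_check", lambda node: node != "node-2")
platform.register("install_patch", lambda node: True)

patch_job = Job("patch", ["health_check", "install_patch"])
print(platform.run(patch_job, ["node-1", "node-2"]))
```

Keeping each script atomic and logging every step is what lets one workflow definition be reused across many devices while staying auditable.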

  • Standardized and consistent O&M approaches are key: In traditional DCs, large differences between the software and hardware from various vendors necessitated extensive configuration to get the components to work with each other. Build and day-two operation complexity continued to rise, making it difficult to implement a fully viable plan. In the cloud era, the use of standard compute, storage, and networking hardware, together with standardized software installation packages, configurations, permissions, dark launch strategies, scripts, and system health indicators, enables O&M personnel to manage the entire cloud environment with the convenience of visualization and improved predictability. Self-corrections are executed according to presets to reduce the risk of operator error in frequent changes.
  • Hardware plug-and-play; easy replacement: As the scale of the data center increases, manual identification and installation of hardware become incapable of supporting rapid rollout, expansion, and decommissioning requirements. With plug-and-play technology, less-skilled personnel can install equipment on shelves, connect appliances to the network, and power on devices. The O&M system completes the end-to-end hardware system deployment and rollout according to the presets. Cloud isolation technology also allows less-skilled personnel to change out failing hardware.
  • One-click software deployment; always online: With the rise of agile, distributed software development and deployment models, system upgrades are more frequent and complex in cloud DCs versus their traditional counterparts. Tools provisioned with a single click achieve automation in end-to-end deployments in everything from applying for resources to provisioning and deployment, system self-tests, service tests, rollbacks, and dark launches while supporting centralized provisioning of hundreds or even thousands of instances at multiple data centers located throughout the world.
  • Mobile O&M: With O&M Apps available in the palm of the hand, experts can perform tasks on cloud resources from anywhere and at any time throughout the entire management lifecycle.
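The one-click deployment item above mentions self-tests and rollbacks. A minimal sketch of that control flow, assuming a hypothetical in-memory fleet and a caller-supplied self-test, might look like this. Real deployment tooling would call provisioning and test services instead.

```python
from typing import Callable, Dict, List


def deploy_with_rollback(instances: Dict[str, str],
                         new_version: str,
                         self_test: Callable[[str, str], bool]) -> List[str]:
    """Upgrade instances one by one; restore every version if a test fails.

    `instances` maps an instance name to its running version and is
    updated in place. Returns a list of events for the operation log.
    """
    previous = dict(instances)  # snapshot used for rollback
    events: List[str] = []
    for name in list(instances):
        instances[name] = new_version
        events.append(f"upgraded {name} to {new_version}")
        if not self_test(name, new_version):
            events.append(f"self-test failed on {name}")
            instances.update(previous)  # roll back the whole batch
            events.append("rolled back")
            return events
    events.append("done")
    return events


# Hypothetical fleet in which instance "b" fails its self-test.
fleet = {"a": "1.0", "b": "1.0", "c": "1.0"}
for event in deploy_with_rollback(fleet, "1.1", lambda name, v: name != "b"):
    print(event)
print(fleet)
```

Taking the snapshot before any change is the design choice that makes the rollback safe: the fleet ends either fully upgraded or exactly as it started.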

Intelligent Fault Prevention, Detection, and Self-healing

In traditional models, the working style for O&M personnel was to wait for faults to occur. According to statistics, the planned work for O&M staff accounted for only about 50 percent of their day with the remaining time allotted to putting out fires. With the rapid growth in scale of cloud data centers, O&M personnel need to handle a growing number of events. Using manual-intensive methodologies to put out each little fire in the system is just not a viable solution. This is precisely why intelligent O&M platforms are required. These platforms make use of Big Data associative analysis and machine-learning technology to deliver the needed artificial intelligence enablement for the O&M system and provide intelligent support capabilities in everything from fault prevention and location to closed-loop handling.

  • Active fault prevention: No matter how fast a fault can be handled, it is still not as good as the fault never happening at all, especially in large-scale cloud DCs. Even a very low failure rate means a certain degree of impact. Prevention is the absolute best approach in avoidance of troublesome operations and those ‘little fires’ that keep personnel running around all day.

[Key measurement #1]: Reduce manual operation-induced faults

According to the IT Department at Huawei, manually performed change operations are the reason for over 50 percent of faults. Most level-one events are also caused by some change operation. Those types of operations tend to be rather complex, which, in turn, makes manual handling prone to error. Automating the processes avoids unnecessary faults from manual operations, which is a key measurement for reducing failure rates.

[Key measurement #2]: Implement intelligent analysis on systems to find sub-health contributors and detect potential faults early

Using Big Data technology combined with cross-domain correlation analysis on the features of the faults enables early detection and predictive analysis. Integration with the automated strategy execution system allows problems to be solved before users even know something is wrong without any interruption to services.
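As a much simpler stand-in for the Big Data correlation analysis described above, the sketch below flags 'sub-health' samples in a single metric with a rolling z-score. The window size, threshold, and latency values are assumptions for illustration. A production system would correlate many metrics across domains.

```python
from statistics import mean, stdev
from typing import List


def sub_health_points(samples: List[float],
                      window: int = 5,
                      threshold: float = 3.0) -> List[int]:
    """Return indexes whose value deviates strongly from the recent baseline.

    Each sample is compared with the mean and standard deviation of the
    `window` samples before it (a rolling z-score).
    """
    flagged: List[int] = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up response times in milliseconds with one spike at index 7.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 45, 21, 20]
print(sub_health_points(latency_ms))  # -> [7]
```

A flagged index could then trigger the automated strategy execution mentioned above, so that the issue is handled before users notice it.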

  • Timely fault detection: Cloud DCs are stacked with layers of technologies and feature complex architectures, making it difficult to identify faults. Building an end-to-end monitoring system that analyzes the status of all systems, from resources to tenant experience, helps identify problems in service systems such as sluggish responses, slow queries, deteriorating device performance, frequent faults, and high transaction failure rates. This type of all-around monitoring helps find the root causes of issues such as low user participation and low resource utilization. The data helps technical teams continuously improve O&M management.

[Key measurement #1]: Build a fully linked, active, intelligent, and multi-indexed monitoring system to comprehensively cover all elements with multi-mechanism integrations

O&M systems need to support unified management over equipment rooms and facilities, physical infrastructures, cross-center backbone networks, virtualized resource pools, cloud services, and applications to offer centralized and multi-faceted monitoring across multiple DCs.

If a fault occurs at a data center, the current and historical operating status of each resource and cloud service at the center can be quickly obtained with the conveniences of visualization. The information that can be queried includes performance capacity, associated objects and alarms, and information on the topology and various logs.

[Key measurement #2]: Visualization of systems’ statuses

Visualized display of the application topology and health status allows personnel to view operational indicators and changes to critical services at a glance. The business application indexes we usually collect include user experience (page response speed and access performance), user behavior (number of user visits, number of active users, and maximum amount of concurrent access), business efficiency (end-to-end business processing time, transaction success rate, and the volume of supported traffic), and SLA.

IT operations personnel and administrators can use the information relating to the performance capacity of O&M items, alarm statistics and analysis, resource utilization reports, and health and capacity forecasts to generate monthly and quarterly reports on O&M quality analysis to support annual IT planning.

  • Intelligent fault locating: The cloud era features distributed and microservice-based software architectures. The relationship between services and the scheduling that takes place is becoming increasingly more complex. This poses a great challenge to quick location of faults.

[Key measurement #1]: Use of traffic tracking systems to locate faults rapidly

Scheduling among cloud-based services and microservices is complex, which makes faults in such environments hard to locate, so supplementary fault locating tools are needed to improve efficiency. By tracking traffic across services and monitoring various metrics along the way, the time needed to locate a fault is reduced from hours to minutes.
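To illustrate how traffic tracking narrows down a fault, the sketch below models one request as a tree of spans and picks the most suspicious service. The `Span` fields and the heuristic (deepest failing span first, otherwise largest self-time) are assumptions for illustration, not a description of any specific tracing product. It also assumes child calls run sequentially.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One hop of a traced request through a service."""
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float
    error: bool = False


def self_time(span: Span, spans: List[Span]) -> float:
    """Time spent in the span itself, excluding its direct children."""
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - children


def suspect_service(spans: List[Span]) -> str:
    """Pick the service most likely responsible for a failed or slow request."""
    if not spans:
        raise ValueError("trace is empty")
    failed = [s for s in spans if s.error]
    for span in failed:
        # A failing span with no failing child is where the error started.
        if not any(c.parent_id == span.span_id for c in failed):
            return span.service
    return max(spans, key=lambda s: self_time(s, spans)).service


# Made-up trace: gateway -> order -> (inventory, payment).
trace = [
    Span("A", None, "gateway", 500),
    Span("B", "A", "order", 480),
    Span("C", "B", "inventory", 30),
    Span("D", "B", "payment", 420),
]
print(suspect_service(trace))  # -> payment
```

Because the trace records which service called which, the search is limited to a handful of spans instead of every device that might be carrying the service.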

[Key measurement #2]: Build an expert diagnostic system complete with intelligent fault locating capabilities, and automated recovery and processing on known faults

Routinely analyzing fault summaries and continuously accumulating the results in a fault feature library give the system expert-level capabilities for intelligent fault locating and for automated recovery from known faults.

  • Automatic fault recovery: Cloud DC expansion results in a dramatic increase in the number of faults. Huawei’s experience in DC O&M has shown that, if automatic fault classification and processing are not carried out on large-scale cloud DCs, thousands of trouble tickets of varying severity would be logged each day. Hence the need for O&M systems that can identify common faults and implement the appropriate self-healing strategy. When a fault does occur, a closed-loop strategy is automatically initiated without the need for manual intervention.
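The idea of a fault feature library driving self-healing can be sketched as a table that maps known alarm signatures to recovery actions, with unknown alarms falling back to a trouble ticket. The signatures and action names below are invented for illustration. A real library would be built from accumulated fault analysis.

```python
import re
from typing import Dict, List, Optional

# Hypothetical fault feature library: (alarm signature, self-healing action).
FAULT_LIBRARY = [
    (re.compile(r"disk .* read-only"), "remount_disk"),
    (re.compile(r"process \S+ not responding"), "restart_process"),
    (re.compile(r"host .* unreachable"), "migrate_vms"),
]


def choose_action(alarm: str) -> Optional[str]:
    """Return the self-healing action for a known fault, else None."""
    for pattern, action in FAULT_LIBRARY:
        if pattern.search(alarm):
            return action
    return None


def handle_alarms(alarms: List[str]) -> Dict[str, List[str]]:
    """Split alarms into automatically handled faults and manual tickets."""
    outcome: Dict[str, List[str]] = {"auto": [], "ticket": []}
    for alarm in alarms:
        action = choose_action(alarm)
        if action is None:
            outcome["ticket"].append(alarm)  # unknown fault: needs a person
        else:
            outcome["auto"].append(f"{action} <- {alarm}")
    return outcome


print(handle_alarms([
    "process nginx not responding",
    "fan speed abnormal",
]))
```

Each alarm matched by the library closes its own loop, so only the unmatched remainder reaches O&M personnel as tickets.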

Intelligent Capacity Management Improves Utilization Rates

In traditional DCs, business systems deployed by each respective department cannot be shared, and server utilization is as low as 20 percent. Moving DCs to the cloud enables resource sharing and dynamic scheduling capabilities but brings challenges such as fragmentation, load imbalances, and difficulty in guaranteeing SLAs.

Intelligent capacity management combines Big Data analysis and forecasting technologies to present the available capacity of physical resources (server, storage, and network devices) and cloud-based resources (VMs, block storage, and so on) in real time. Utilities are also able to capture snapshots of capacity, the loads on devices, and an overall view on fragmentation. Traditional O&M is unable to migrate or dynamically expand capacities, resulting in unbalanced loads. In cloud DCs, capacity management supplies O&M administrators with information on resources with low loads and provides recommendations on adjustments. In traditional layouts, resource fragmentation often leads to utilization efficiencies as low as 20 percent while other silos are running out of resources. Capacity fragmentation management in cloud-enabled centers provides O&M administrators a view on the physical distribution of the various types of resources as well as recommendations on resource tuning, which provides greatly improved utilization rates of existing resources.
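A rough sketch of the two views described above follows: a fragmentation score for free capacity and a list of low-load hosts that are candidates for consolidation. The score (one minus the largest free block over total free capacity) and the 20 percent load threshold are assumptions chosen for illustration, not a published formula.

```python
from typing import Dict, List


def fragmentation(free_cpu_per_host: Dict[str, int]) -> float:
    """Share of free capacity that the largest single host cannot offer.

    0.0 means all free vCPUs sit on one host; values near 1.0 mean the
    free capacity is scattered in small pieces.
    """
    total = sum(free_cpu_per_host.values())
    if total == 0:
        return 0.0
    return 1 - max(free_cpu_per_host.values()) / total


def low_load_hosts(utilization: Dict[str, float],
                   threshold: float = 0.2) -> List[str]:
    """Hosts below the load threshold, least loaded first."""
    hosts = [h for h, load in utilization.items() if load < threshold]
    return sorted(hosts, key=lambda h: utilization[h])


# Made-up pool: 16 vCPUs are free, but no host can take an 8-vCPU VM.
free = {"h1": 4, "h2": 4, "h3": 4, "h4": 4}
print(fragmentation(free))  # -> 0.75

load = {"h1": 0.15, "h2": 0.65, "h3": 0.05}
print(low_load_hosts(load))  # -> ['h3', 'h1']
```

A high score with plenty of total free capacity is the signal that resource tuning, such as migrating VMs off the low-load hosts, would recover usable capacity without new purchases.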

When the utilization of cloud resources reaches a certain threshold, planners need to start thinking about expansion. Traditional approaches to forecasting the required capacity relied mainly on the limited experience of individuals and the limited amount of data that could be imported to derive a trend. From there, a forecast was made on how much capacity could actually be created. To be safe, planners usually overshoot the capacity, leaving 20 percent to 30 percent of resources idle. In contrast, intelligent capacity management combines data on resource capacities, application behavior analysis, performance, financial information, and other dimensions to provide accurate predictions of how the resource capacities used by applications in multiple business departments will impact the IT infrastructure. Planners then have a much clearer picture of the resources required for the future as they formulate effective procurement and expansion plans.
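The simplest form of trend-based forecasting is a least-squares line through past utilization, used to estimate when an expansion threshold will be crossed. The sketch below shows only that baseline, with made-up monthly figures. The multi-dimensional analysis described above would combine far more inputs than a single series.

```python
import math
from typing import List, Optional, Tuple


def linear_fit(ys: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for points (0, y0), (1, y1), ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(ys: List[float], threshold: float) -> Optional[int]:
    """Whole periods after the last sample until the trend hits the threshold.

    Returns None when utilization is flat or falling.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0, math.ceil(crossing - (len(ys) - 1)))


# Made-up monthly utilization in percent, growing 4 points per month.
usage = [40, 44, 48, 52, 56]
print(periods_until(usage, 70))  # -> 4
```

Knowing the lead time lets planners order capacity when it is needed rather than overshooting to be safe.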

Intelligent capacity management achieves improved visibility into resource status, the ability to observe and track issues, identification of risks, and measurable and adjustable controls to improve the resource utilization rate to 70 percent or more.

Results from O&M Practices at Cloud DCs

More successful approaches in cloud DCs apply automated and intelligent O&M systems. These advancements achieve impressive improvements in O&M efficiency, all while ensuring 99.95 percent or higher service quality at the user level. In traditional O&M approaches, each technician could manage 50 to 100 devices. Now, each technician can manage 5,000 to 10,000 devices (a 100-fold improvement). Overall resource utilization has also improved from as low as 20 percent to between 60 and 70 percent, roughly a threefold improvement (Table 1).
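As a quick check of the figures in this paragraph, the ratios can be computed directly. The helper names are only for illustration. Note that moving from 20 percent to between 60 and 70 percent utilization is 3 to 3.5 times the original level, which corresponds to a relative increase of 200 to 250 percent.

```python
def fold_change(before: float, after: float) -> float:
    """How many times larger the new value is than the old one."""
    return after / before


def percent_increase(before: float, after: float) -> float:
    """Relative increase over the old value, in percent."""
    return (after - before) / before * 100


# Devices per technician: 50 -> 5,000 and 100 -> 10,000.
print(fold_change(50, 5_000), fold_change(100, 10_000))  # -> 100.0 100.0

# Resource utilization: 20 percent -> 60 to 70 percent.
print(fold_change(20, 60), fold_change(20, 70))  # -> 3.0 3.5
print(percent_increase(20, 60), percent_increase(20, 70))  # -> 200.0 250.0
```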

Table 1: Improvements in O&M efficiency

In one example, by using standardized, automated, intelligent O&M, Huawei’s R&D department improved resource utilization from 10 percent in many cases to between 40 and 50 percent, with only 11 people maintaining 100,000 devices.

At the same time, the introduction of an automated, intelligent, visualized O&M platform allows personnel to break away from mechanical, repetitive, and low-value tasks while lowering incidences of human error in processing, which indirectly helps protect the quality of IT services and lower operational costs. More importantly, O&M personnel are freed up to take part in higher value tasks, such as architectural design and development in addition to assessment and the introduction of new technologies to better support business innovation. IT teams and individuals can create more value for the enterprise by applying automated and intelligent O&M platforms that also help standardize IT management processes with the use of tools. With this automation, the entire O&M process becomes standardized, and compliance is improved to ensure SLAs are met to support the healthy development of business.

Figure 2: Open Huawei cloud O&M platform

Best Practices in Huawei O&M Solution Deployments at Cloud DCs

In addition to helping enterprises build an automated, intelligent, and visualized O&M platform, the Huawei Cloud Data Center solution leverages the telecom team’s many years of expertise and explorative achievements in new technologies. 

Accumulation of O&M Expertise, Commercialization of O&M Capabilities

Huawei’s internal O&M teams are responsible for maintaining the massive scale of the Huawei Enterprise Cloud and its own private cloud. These teams are also responsible for monthly O&M quality analysis as well as statistical analysis and summaries on faults. High-risk and high-frequency operations require automation. Huawei’s self-operated enterprise cloud employs the DevOps model to rapidly build up and improve on O&M capabilities. After being fully validated, O&M capabilities are commercialized and included in the baseline version of the Huawei Cloud O&M solution to make the best practices available to our wide customer base. For example, the ECS service uses tracking tools as one method of accumulating experience from routine O&M, and that experience is then integrated into the platform’s capabilities.

Opening up Capabilities to Build up the Cloud-based O&M Ecosystem

The developer community for cloud-based O&M that Huawei operates opens up APIs used at several layers to meet the application development requirements of various use cases. The community allows partners to participate in the accumulation of capabilities, enrich tool catalogs, and enhance the components going into the O&M offerings, thereby strengthening the cloud-based O&M ecosystem.

  • Opening up the service layer: All interfaces on the service console are opened up to secondary development so third parties can customize interfaces and portals to suit various industry use cases.
  • Opening up the backend service layer: All O&M services are opened up to secondary development through API gateways to permit third parties to develop new O&M tools or allow the Huawei offerings to connect to third-party O&M tools and systems. For example, field-specific service topology views can be developed from alarm and resource management utilities opened up for further development, which also yields visualization of service node status. In hybrid IT architectures, performance capacity, configuration information, and logs can all be linked to the customer’s centralized O&M management platform through API gateways to deliver an O&M system complete with global sharing capabilities.
  • Opening up the access layer: Provisioning of southbound drivers in the plug-in framework allows third parties to develop their own device drivers. New device objects can be accessed through the dynamic capabilities in driver management. The driver developed by ZOHO, for example, is used to execute monitoring and report management on non-Huawei equipment.
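The plug-in approach to the access layer can be sketched as a driver interface plus a registry that the O&M system consults at run time. Everything below (`DeviceDriver`, `register_driver`, the example switch and its metrics) is hypothetical and only illustrates the pattern. It is not the actual southbound driver API.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class DeviceDriver(ABC):
    """Interface a third-party driver would implement (hypothetical)."""

    @abstractmethod
    def collect_metrics(self, address: str) -> Dict[str, float]:
        """Return current metrics for the device at the given address."""


_REGISTRY: Dict[str, Type[DeviceDriver]] = {}


def register_driver(device_type: str) -> Callable[[Type[DeviceDriver]], Type[DeviceDriver]]:
    """Class decorator that adds a driver to the registry under a type name."""
    def decorator(cls: Type[DeviceDriver]) -> Type[DeviceDriver]:
        _REGISTRY[device_type] = cls
        return cls
    return decorator


@register_driver("example-switch")
class ExampleSwitchDriver(DeviceDriver):
    """Made-up driver returning fixed values instead of polling a device."""

    def collect_metrics(self, address: str) -> Dict[str, float]:
        return {"cpu_percent": 12.0, "port_errors": 0.0}


def monitor(device_type: str, address: str) -> Dict[str, float]:
    """Look up the driver for a device type and collect its metrics."""
    driver_cls = _REGISTRY.get(device_type)
    if driver_cls is None:
        raise LookupError(f"no driver registered for {device_type!r}")
    return driver_cls().collect_metrics(address)


print(monitor("example-switch", "10.0.0.1"))
```

Because the monitoring code depends only on the interface, a new device type becomes manageable by registering a driver, with no change to the core O&M system.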

Microservice Architecture and Container-based Deployments

Huawei cloud O&M systems adopt a microservice architecture in support of container-based deployments, enabling agility in delivery and contributing to excellent scalability. Agility in delivery means each microservice is independently developed, released, and updated for quick iterations. Excellent scalability means each microservice can be expanded flexibly, which, in turn, makes the entire O&M system easy to scale; a minimal amount of resources can be deployed at the beginning and grown on demand. Container-based deployments significantly reduce the cost of managing nodes.

Global Technical Support System

Huawei has engaged in the communications technology field for 28 years, serving the carrier domain throughout the world. Huawei has established a complete technical support system comprising two global technical assistance centers and numerous regional technical assistance centers. The company has trained teams of highly skilled experts around the world to work in its technical support system that spans the globe.

Huawei offers a wide range of O&M models for customers to choose from, including customer-managed O&M, Huawei-managed O&M with on-site personnel, or Huawei-assisted remote O&M. Customers opting for the self-managed model can still avail themselves of the 24/7 customer hotline, deploy Cloud Service to enable automatic fault reporting and troubleshooting, and make use of the Huawei eCare tool to monitor all processes and ensure timely solutions to customer problems. 

Figure 3: Technical support system

Figure 4: Intelligent O&M platform

Support for Full-stack Management

With its extensive expertise in the O&M field of ICT infrastructure and leveraging the total advantages of its own product line, Huawei delivers complete cloud DC management capabilities covering everything from servers and storage equipment to networking, virtualized resources, and cloud-based services and applications. Full-stack management paves the way for end-to-end service monitoring, fault diagnosis and locating, and automation during the entire lifecycle, among other capabilities.

In the last three years, the scale of Huawei’s cloud-based DCs has increased several times over. The O&M solution has reached a 99.6 percent SLA fulfillment rate with less than 10 percent increase in O&M personnel. The average utilization rate of computing resources has reached over 50 percent, better supporting agile development in R&D. In one instance, the planned maintenance and version upgrade to DCs during the 2016 National Day holiday in China involved 11 equipment rooms across the country with a total of 15,000 physical servers and 300,000 VMs. If traditional O&M approaches were used, each engineer would have been able to handle only 3,000 to 4,000 VMs, which would have required more than 100 employees. With the power-up and power-down in a single click and the batch version upgrade capabilities of the intelligent O&M platform, fewer than 20 people were needed and the necessary time to power up and power down each equipment room was cut in half (from 10 hours down to 5).

Cloud-based O&M is an essential part of any cloud computing layout. The importance continues to grow as these platforms are becoming a core competitive strength. As a next step, Huawei will increase investments in artificial intelligence applied to cloud-based O&M and extend the inclusion of robotics at DCs in more O&M use cases to replace the traditional approach in manual operations. All these investments are aimed at providing customers with a highly automated and intelligence-infused cloud DC O&M solution able to achieve at least limited ‘unattended’ operations.
