Ever-expanding network bandwidth is driving a deluge of applications, services, and data, which in turn poses constant challenges for data center loads, process response times, and fault tolerance.
Against this backdrop, cloud computing has emerged as a mainstream solution to deliver resource flexibility and elasticity so that data centers can cope with the growing workload.
Intelligent Cloud Explained
Intelligent cloud is a form of computing capable of autonomous sensing, scheduling, repair, and evolution. By leveraging Artificial Intelligence (AI) technologies to automate cloud-based data centers, it frees operators from basic repetitive tasks.
Before the rise of civilization, apex predators such as lions and tigers dominated the food chain and early humans had to rely on physical strength to survive. The requirement for strength alone changed with the emergence of the more calculating Homo sapiens. Intelligence, the power of inductive and deductive reasoning, has made it possible for human beings to assume dominance over life on Earth.
Cloud computing is one of mankind’s most recent and advanced accomplishments.
The earliest improvements in the cloud computing domain relied on the simple expansion of equipment room floor space, which was accompanied by a matched increase in power consumption. A negative result of this type of growth was ‘cloud sprawl,’ or the uncontrolled proliferation of cloud instances.
A primary advantage of ‘intelligent cloud’ computing is the ability to manage resources for the precise purpose of maximizing power efficiency by scheduling resource use with a neural network system.
Overall, there are two broad issues that limit the performance of cloud platform operations:
1. The deployment of applications on the cloud
2. Effective and efficient cloud operations and resource management
Development, Data Division, and Deployment
Deploying applications on the cloud requires three steps:
1. Adapting applications into services or micro-services
2. Structurally dividing data into databases and tables, and storing those segments
3. Deploying applications and migrating existing data to the cloud
Data structure segmentation is tricky. It includes a variety of methods like horizontal, vertical, and hash-based segmentation, and each has its characteristic pain points.
Service-based segmentation degrades performance for cross-service joins, and region-based segmentation does the same for cross-region joins. Likewise, hash-based segmentation performs poorly on cross-table joins.
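To make the hash-based case concrete, here is a minimal Python sketch. The shard count, the key format, and the helper names `shard_for` and `shards_touched` are assumptions made for illustration and do not come from any particular product. It shows why a single-key lookup stays on one shard while a join over many keys tends to fan out across all of them.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shards_touched(keys: list[str], num_shards: int = NUM_SHARDS) -> set[int]:
    """Return the set of shards that a query over these keys must visit."""
    return {shard_for(k, num_shards) for k in keys}


# A lookup by one key touches exactly one shard.
single = shards_touched(["user-42"])

# A join over many keys generally has to visit several shards.
many = shards_touched([f"user-{i}" for i in range(100)])
```

The more shards a query touches, the more network round trips and merge work it needs, which is the pain point described above.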
Deciding which method to use is difficult, so how do we solve this problem?
Micro-services are the best answer.
Data can be segmented based on cloud data center service requirements. Doing so may cause problems with joins and transactional consistency, but these can be resolved through micro-services, messages, and eventual consistency.
Such solutions can improve performance considerably.
Huawei’s FunctionStage provides event-driven function hosting and computing services. Based on the split data, FunctionStage decomposes SQL statements and uses function services to execute the SQL statements separately and then merge the results, allowing the functions to run in an elastic, O&M-free, and highly reliable manner.
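The decompose, execute, and merge pattern can be sketched in plain Python. This is not FunctionStage code or its API; the shard contents and the function names `partial_aggregate`, `merge`, and `average_amount_by_region` are hypothetical. Each shard computes a partial result independently, and a final step merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical order rows, already split across three shards.
SHARDS = [
    [{"region": "east", "amount": 10.0}, {"region": "west", "amount": 30.0}],
    [{"region": "east", "amount": 20.0}],
    [{"region": "west", "amount": 50.0}, {"region": "east", "amount": 30.0}],
]


def partial_aggregate(rows):
    """Per-shard step: like SELECT region, SUM(amount), COUNT(*) GROUP BY region."""
    out = {}
    for row in rows:
        total, count = out.get(row["region"], (0.0, 0))
        out[row["region"]] = (total + row["amount"], count + 1)
    return out


def merge(partials):
    """Merge step: combine per-shard sums and counts, then derive the average."""
    combined = {}
    for partial in partials:
        for region, (total, count) in partial.items():
            t, c = combined.get(region, (0.0, 0))
            combined[region] = (t + total, c + count)
    return {region: t / c for region, (t, c) in combined.items()}


def average_amount_by_region(shards):
    # Each shard is processed independently, as a separate function call would be.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_aggregate, shards))
    return merge(partials)


result = average_amount_by_region(SHARDS)
```

Note that each shard returns sums and counts rather than its own average. Averaging the per-shard averages would give the wrong answer whenever shards hold different numbers of rows.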
Adapting applications into services and micro-services is a fundamental change in the way that applications are managed. In contrast, the separation of front-end and back-end processes while implementing a Service-Oriented Architecture (SOA) is only a mild alteration for managing applications.
Deploying a cloud-based architecture for services and micro-services splits the applications apart. Thus, the applications can only be pieced together again through registration and governance centers.
Traditional application development is not dependent on a large number of architects, and programmers understand the project’s big picture.
On the other hand, cloud-based application development is more reliant on the number of architects, and programmers have only a partial understanding of the whole project. The applications are brought together to form a whole through the architecture.
Huawei’s ServiceStage provides exactly such an architecture, which simplifies application lifecycle management, such as deployment, monitoring, Operations and Maintenance (O&M), and governance, for enterprise DevOps personnel.
Application deployment and data migration, coupled with the basic service capability provided by the cloud, are not as complex as application development.
There are many application deployment modes available in cloud data centers, including virtual machine-based image deployment, manual deployment, container-based image deployment, software package deployment, and even serverless deployment. Data can be migrated either by interrupting services, or by bypassing access to gradually divert data for bidirectional consistency.
Huawei’s Data Replication Service (DRS) is an easy-to-use, stable, and efficient data migration tool for use in scenarios like online database migration and real-time synchronization.
Cloud-Based Resource and Application O&M
Imagine our forefathers building the very first stone houses. They must have entered these dwellings with overwhelming dread, fearing collapse at any moment.
As a new resource form, cloud data centers are fragile and prone to collapse much like the earliest stone houses. Think of efficient and effective O&M tools as the key systems that ensure cloud data center stability.
Physical device layer O&M is similar to that for traditional data centers. The original method uses device status alarms and indicators to keep maintenance personnel informed and help troubleshoot the devices. In addition, the cloud management platform can also be used to monitor device management controllers for platform-based O&M.
Huawei’s eSight is an integrated management solution that provides comprehensive O&M for servers, storage, virtualization, switches, routers, Wireless Local Area Networks (WLANs), and firewalls.
Because the virtualization layer is less visible, its O&M requires more dedicated tools.
The cloud data center boasts a standard virtual resource O&M system that collects monitoring parameters for virtual computing, virtual storage, and virtual networks to implement threshold-crossing alarms and performance monitoring, and provides resource tenants with resource monitoring and an alarm system.
Huawei’s OperationCenter tool and cloud services, such as the Cloud Eye Service (CES) and Cloud Trace Service (CTS), provide threshold setting, fault alarms, and performance monitoring to allow systematic resource O&M at the virtualization layer.
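A threshold-crossing alarm can be sketched in a few lines of Python. This is an illustrative model only, not the CES or OperationCenter API; the `AlarmRule` fields, the metric name, and the sample values are assumptions. The rule fires when a metric stays above its threshold for a set number of consecutive samples, which avoids alarming on a single spike.

```python
from dataclasses import dataclass


@dataclass
class AlarmRule:
    metric: str
    threshold: float
    consecutive: int  # samples in a row that must exceed the threshold


def evaluate(rule: AlarmRule, samples: list[float]) -> list[int]:
    """Return the sample indices at which the alarm fires."""
    fired = []
    streak = 0
    for i, value in enumerate(samples):
        streak = streak + 1 if value > rule.threshold else 0
        # Fire once, when the streak first reaches the required length.
        if streak == rule.consecutive:
            fired.append(i)
    return fired


cpu_rule = AlarmRule(metric="cpu_util_percent", threshold=80.0, consecutive=3)
cpu_samples = [70, 85, 90, 60, 81, 82, 83, 95, 40]
alarm_indices = evaluate(cpu_rule, cpu_samples)
```

With these made-up samples, the first two breaches are ignored because the value drops back below the threshold before three in a row are seen. The alarm fires only during the later sustained run.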
Due to the introduction of a service- and micro-service-oriented architecture, application layer O&M is highly dependent on advanced tools for comprehensively monitoring application status and health. The Elasticsearch, Logstash, Kibana (ELK)-based log collection and retrieval system is the tool that Huawei uses to perform basic O&M for cloud-based applications, covering everything from traditional application nodes to hundreds of complicated cloud-native application nodes.
Huawei’s Application Operations Management (AOM) service uses ELK-based log collection to monitor and manage application performance and faults in real time, and analyze performance bottlenecks in the distributed architecture.
Looking Beyond Current Issues to New Solutions
The intelligent cloud platform is designed to let the network manage the network and enable applications to manage applications. The intelligent cloud is still in its infancy and currently lacks a complete technical stack, but two examples reveal how AI can improve the platform.
Deploying Cloud Applications
The cloud data center provides elastic scalability that enables applications to provision computing resources dynamically. But how do we determine the minimum and maximum resource usage for applications? How do we determine the size of storage resources? How do resources change during service and production processes?
Today, answering these questions relies on coarse-grained calculations and simplistic planning. Once AI is introduced, intelligent algorithms are expected to provide answers better suited to real-world scenarios.
The above figure plots the total monitored application memory usage against the time axis on a two-dimensional plane.
The Bayesian algorithm fits a curve that estimates total memory use at specific points in the future. In addition, simulated annealing can estimate the maximum value (peak memory usage) over the entire curve, so that resource tenants can adjust their resource usage strategies to achieve optimum resource allocation ratios.
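The peak-finding step can be illustrated with a small simulated annealing routine in Python. The Bayesian curve fitting itself is not implemented here: `forecast_memory_gb` is a hypothetical stand-in for an already fitted curve, and the cooling schedule, step size, and seed are arbitrary choices for the sketch.

```python
import math
import random


def forecast_memory_gb(t: float) -> float:
    """Hypothetical fitted curve: memory (GB) over a 24-hour horizon."""
    # A daily cycle plus a smaller ripple, so the curve has local maxima.
    return 8.0 + 3.0 * math.sin(math.pi * t / 24.0) + 0.5 * math.sin(math.pi * t / 2.0)


def anneal_peak(f, lo, hi, steps=20000, t0=1.0, seed=0):
    """Simulated annealing for the maximum of f on [lo, hi]."""
    rng = random.Random(seed)
    x = rng.uniform(lo, hi)
    fx = f(x)
    best_x, best_f = x, fx
    for k in range(steps):
        temp = t0 * (1.0 - k / steps) + 1e-9  # linear cooling schedule
        # Propose a nearby point, clamped to the interval.
        cand = min(hi, max(lo, x + rng.gauss(0.0, (hi - lo) * 0.05)))
        fc = f(cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if fc >= fx or rng.random() < math.exp((fc - fx) / temp):
            x, fx = cand, fc
        if fx > best_f:
            best_x, best_f = x, fx
    return best_x, best_f


peak_hour, peak_gb = anneal_peak(forecast_memory_gb, 0.0, 24.0)
```

Accepting some worse moves while the temperature is high lets the search escape the smaller local peaks of the ripple. A tenant could then size the upper resource limit from `peak_gb` plus a safety margin.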
Huawei is an active industry leader in this regard. We have established a complete horizontal service chain, and proactively check core services such as cloud servers, cloud hard drives, and object storage by enabling cloud monitoring and security index services. We analyze tenant services in an end-to-end manner, and provide platform-level optimization solutions.
Checking Network Health
Traditional network O&M is more likely to address faults in a reactive manner. That is, a solution is only provided after problems occur.
By contrast, the new network O&M is required to focus on predictive insight and proactive processing, and must use predictive knowledge to prevent network faults.
A large number of Network Elements (NEs) exist in the cloud data center. These include underlay physical NEs and virtual NEs such as Open vSwitch (OVS) and Distributed Virtual Routing (DVR).
Monitoring the entire network requires processing a massive collection of data, such as logs and status data.
Large resource overhead is required for big data processing tools like Hadoop and Spark. Therefore, AI is a better option, as it can estimate and locate network faults using only a small amount of computing power.
After a Convolutional Neural Network (CNN) is designed, n × n collection matrices serve as the input, and network and NE failure rates are the output. A ping latency-based function then evaluates the CNN output.
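The data flow can be sketched as a toy forward pass in plain Python. This is not Huawei's model. The single untrained filter, the pooling and sigmoid layers, the 4 x 4 matrix values, and the `latency_score` evaluation (which compares the prediction with the share of ping probes slower than an assumed timeout) are all illustrative assumptions.

```python
import math


def conv2d_valid(matrix, kernel):
    """Plain 2D cross-correlation with no padding (the core CNN operation)."""
    n, k = len(matrix), len(kernel)
    out_size = n - k + 1
    return [
        [
            sum(matrix[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


def predict_failure_rate(matrix, kernel, bias):
    """Toy forward pass: convolution -> ReLU -> global average -> sigmoid."""
    feature_map = conv2d_valid(matrix, kernel)
    activated = [max(0.0, v) for row in feature_map for v in row]
    pooled = sum(activated) / len(activated)
    return 1.0 / (1.0 + math.exp(-(pooled + bias)))


def latency_score(predicted_rate, ping_ms, timeout_ms=100.0):
    """Hypothetical evaluation: 1 minus the gap between prediction and observed slow pings."""
    observed = sum(1 for p in ping_ms if p > timeout_ms) / len(ping_ms)
    return 1.0 - abs(predicted_rate - observed)


# 4 x 4 collection matrix of normalized NE metrics (made-up values).
collection = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.8, 0.9, 0.2],
    [0.0, 0.1, 0.2, 0.1],
]
kernel = [[1.0, 1.0], [1.0, 1.0]]  # untrained, illustrative weights
rate = predict_failure_rate(collection, kernel, bias=-2.0)
score = latency_score(rate, ping_ms=[12.0, 250.0, 15.0, 300.0])
```

In a real system the filter weights would be learned, and a score such as this one could serve as the feedback signal during training.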
Huawei generates a Deep Reinforcement Learning (DRL) model through this design, adds the DRL to the network, and directs the model to begin learning to estimate faults within the network.
Huawei combines networks and AI to research optimum solutions and practices. By harnessing network AI technology, Huawei is striving to reduce the complexity of network O&M and fault prevention to deliver greater stability for the entire network.