CloudFabric: Leading DCNs into the Intelligence Era
Having experienced the agricultural and industrial eras, the world is now entering the era of the digital economy, driven by the rapid development of Information and Communications Technology (ICT). According to a Gartner survey, 75 percent of large enterprises have already shifted their strategic focus to digital transformation. The most critical factors of production were land and labor in the agricultural era and capital and technology in the industrial era; in the digital economy era, data and intelligence have taken their place. Digital transformation generates a deluge of data, and this data has become part of enterprises’ core assets. However, data is not an end in itself: knowledge and wisdom remain the true pursuits. In this context, the focus of enterprise digital transformation is how to harness Artificial Intelligence (AI) to gain genuine insight from transient data and ultimately monetize it. AI has therefore become the key driving force for enterprises to reshape their business models, improve customer experience, and redefine their futures. AI signifies a key milestone for enterprise digital transformation in the intelligence era.
AI is driving Data Center (DC) reconstruction as Data Center Networks (DCNs) face new challenges. Intelligent upgrades of enterprises drive DCs to transition from the cloud era into the AI era. Compared with traditional DCs, cloud DCs are more like service support centers, with applications at the core, and can quickly provision IT resources through a cloud platform. From this foundation, the AI DC goes further still, evolving into a business value center that focuses on how to efficiently process data using AI.
Without a doubt, running AI efficiently requires an enormous amount of computing power. For example, a typical AI training task for speech recognition involves 20E (1E = 10^18) floating-point operations. Even on the world’s most powerful supercomputer, such a task would take an extended period of time. These stringent requirements for AI computing power are the driving force behind the evolution of DC architecture. The emerging DC architecture in the intelligence era is characterized by all-flash storage data lakes serving as the core, with GPU/AI diversified computing as the computing base. Storage and computing facilities are both undergoing drastic changes: all-flash storage has improved storage performance 100-fold, while GPU/AI intelligent computing has improved computing performance 100-fold.
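To put the 20E figure in perspective, the ideal compute time is simply the operation count divided by the sustained throughput. The sketch below shows that arithmetic only. The throughput values are hypothetical assumptions chosen for illustration, not measurements of any real system, and the estimate ignores I/O and network stalls.

```python
# Back-of-the-envelope estimate: time = total operations / sustained throughput.
TOTAL_OPS = 20 * 10**18  # 20E floating-point operations, as quoted in the text


def training_time_seconds(total_ops: float, sustained_flops: float) -> float:
    """Return the ideal compute time in seconds (no I/O or network stalls)."""
    if sustained_flops <= 0:
        raise ValueError("sustained_flops must be positive")
    return total_ops / sustained_flops


# Hypothetical sustained throughputs (assumptions for illustration only).
HYPOTHETICAL_RATES = {
    "single accelerator at 10 TFLOPS": 10 * 10**12,
    "small cluster at 1 PFLOPS": 10**15,
}

for label, rate in HYPOTHETICAL_RATES.items():
    hours = training_time_seconds(TOTAL_OPS, rate) / 3600
    print(f"{label}: about {hours:.1f} hours")
```

Because the result scales inversely with sustained throughput, any stall that lowers effective throughput lengthens training time proportionally.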
Just as the running efficiency of a single server can be raised by improving the performance of the processor and storage medium, the running efficiency of the entire DC can be raised by enhancing the performance of the DCN. Indeed, DCNs have become the impetus for unleashing DC computing power and monetizing data value in the intelligence era. As an enabling technology in the intelligence era, AI presents both new opportunities and new challenges for DCNs seeking to complete intelligent upgrades and improve deployment and O&M efficiency.
As the key to unlocking the gold mine that is data, AI is essential to the success of enterprises’ digital transformation and intelligent upgrade. The pervasive use of AI technologies has driven disruptive changes in the mission of enterprise DCs. As AI technologies are widely used in DCs, Huawei has upgraded the CloudFabric solution to help enterprises overcome the new challenges.
World’s Highest-Density 400GE DCN, Connecting Enterprises to the Intelligence Era
Enterprise digitalization has led to an exponential increase in global data volume every year. Huawei GIV predicts that the data volume will reach 180 ZB by 2025, a 20-fold increase in a span of just 10 years. Currently, 100GE DCNs cannot cope with the challenges posed by the surge in data volume expected over the next few years. In addition, from the perspective of mainstream AI service servers in the industry, 100GE NIC interfaces have become standard configurations, indicating that the 400GE era has arrived.
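The growth figures above imply a baseline volume and an average yearly growth rate, which a short calculation makes explicit. This sketch assumes a constant compound annual growth rate over the 10 years, which the forecast itself does not state. The function and variable names are illustrative.

```python
# Figures quoted in the text.
TARGET_ZB = 180.0      # forecast data volume by 2025, in zettabytes
GROWTH_FACTOR = 20.0   # 20-fold increase
YEARS = 10             # over a span of 10 years


def implied_annual_growth(factor: float, years: int) -> float:
    """Return the constant yearly growth rate that compounds to `factor`."""
    return factor ** (1.0 / years) - 1.0


baseline_zb = TARGET_ZB / GROWTH_FACTOR          # implied starting volume
rate = implied_annual_growth(GROWTH_FACTOR, YEARS)

print(f"Implied baseline: {baseline_zb:.0f} ZB")
print(f"Implied compound annual growth: {rate:.1%}")
```

Under that assumption, the data volume grows by roughly a third each year, from an implied baseline of 9 ZB.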
In 2019, Huawei launched the CloudEngine 16800, the industry’s first DC switch designed for the AI era. The CloudEngine 16800 upgrades the hardware switching platform and, based on an orthogonal architecture, makes breakthroughs in ultra-high-speed signal transmission, heat dissipation, and efficient power supply. It provides the industry’s highest-density 48-port 400GE line card in a single slot and the industry’s largest switching capacity of 768 400GE ports. With five times the industry average switching capacity, the CloudEngine 16800 easily satisfies the traffic multiplication requirements of the AI era.
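The port figures above can be cross-checked with simple arithmetic. The sketch below derives the number of fully populated line-card slots and the aggregate nominal port bandwidth. It assumes every port runs at the nominal 400 Gbit/s and counts one direction only, so the result is not a vendor-specified switching capacity.

```python
# Figures quoted in the text.
PORTS_PER_LINE_CARD = 48   # 400GE ports per slot
TOTAL_PORTS = 768          # 400GE ports per chassis
PORT_RATE_GBPS = 400       # nominal rate of one 400GE port, in Gbit/s

# Number of fully populated line cards needed to reach the total port count.
line_cards, remainder = divmod(TOTAL_PORTS, PORTS_PER_LINE_CARD)

# Aggregate nominal port bandwidth, one direction only (assumption).
aggregate_gbps = TOTAL_PORTS * PORT_RATE_GBPS
aggregate_tbps = aggregate_gbps / 1000

print(f"Line cards required: {line_cards} (remainder {remainder})")
print(f"Aggregate nominal port bandwidth: {aggregate_tbps:.1f} Tbit/s")
```

The two quoted port counts are consistent with each other: 768 ports divide evenly into 48-port line cards.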
The core of the intelligence era is the use of AI to mine data value. AI computing, characterized by deep learning, depends on the input of massive data, and the data access speed directly affects the computing power. Improvements in both computing and storage performance, however, further aggravate congestion and packet loss on the traditional network. In the AI era, a packet loss rate of just 0.1 percent will directly cause computing power to decrease by nearly 50 percent. Even worse, packet loss becomes more serious as the service load and distributed computing traffic increase. Moreover, because the computing power of AI DCs is so expensive, insufficient computing power has become a major challenge, and even the computing power that is available cannot be fully used due to network bottlenecks. Building a lossless DCN has therefore become a priority in the AI era.
Huawei CloudEngine 16800 is the industry’s first DC switch equipped with high-performance AI chips, and it features an innovative iLossless algorithm that implements adaptive traffic model optimization. Intelligent and lossless DCNs built on CloudEngine switches achieve zero packet loss on Ethernet, fully unleashing the potential of AI computing power. As verified by Tolly, Huawei’s intelligent and lossless DCN achieves 27 percent higher AI training efficiency than other networks in the industry when the same GPU cluster is used.
Huawei’s intelligent and lossless DCN has been applied to the Atlas 900 AI training cluster, which boasts the world’s highest computing power. Indeed, the intelligent lossless DCN was the key to enabling Huawei to break through the performance bottleneck to set a new world record. Besides being a high-performance network oriented to AI training clusters, Huawei’s intelligent and lossless DCN is also a next-generation network architecture oriented to DCs in the intelligence era.
As DCs evolve towards autonomous driving, first implementing full network intelligence before advancing towards autonomy and self-healing, they are constantly growing in scale and their structures are becoming increasingly complex. The Operating Expenditure (OPEX) of some DCs may even be three times higher than the Capital Expenditure (CAPEX), and DCs face structural challenges in efficiency and cost. Even if mainstream SDN is used to implement automatic network deployment, administrators still need to understand service intents, perform routine network inspections, and locate and rectify faults.
Huawei was the first to propose the autonomous driving network concept. Based on the SDN network architecture, Huawei introduced AI technologies into the end-to-end process of planning, deployment, running, maintenance, optimization, and operation for network devices, network management and control, and upper-layer service orchestration systems. With AI, networks evolve beyond automated service deployment and action execution towards intelligent fault self-healing, network self-optimization, and network autonomy, free from manual intervention.
The fully intelligent AI-powered CloudFabric solution can preliminarily implement intelligent understanding of service intents, intelligent selection of the optimal network path, intelligent evaluation of change risks, intelligent fault detection, and quick location of root causes. For 75 types of common faults, the solution can detect faults within one minute, locate them within three minutes, and rectify them within five minutes. As certified by Tolly, the solution implements the industry’s first L3 autonomous driving network in the DCN field.
Around the year 2000, with the development of enterprise informatization strategies, real enterprise DCs were born.
In 2010, Huawei proposed the enterprise digitalization strategy. As cloud computing boomed, Huawei took the lead in releasing the industry’s first cloud DCN, CloudFabric, leading DCs into the cloud era and enabling the elastic scaling and automatic provisioning of IT resources.
Enterprise digital transformation has entered a new phase of intelligent upgrade. As AI is widely adopted in DCs, Huawei has upgraded the CloudFabric solution. Huawei CloudFabric is the first solution to offer full intelligence for DCNs and implement the industry’s first L3 autonomous driving network. In addition, Huawei CloudFabric uses the world’s highest-density 400GE CloudEngine switches with embedded AI chips and an innovative iLossless algorithm. The solution also uses the industry’s only intelligent and lossless DCN with zero packet loss, which unleashes the full computing power potential for AI. It enables AI services to run more efficiently while fully monetizing the value of data, leading DCNs into the era of intelligence.
Data has become the core factor of production in driving economic growth, and whoever has the leading “data infrastructure” can gain an edge. DCs have become a strategic high ground for the digital economy. To that end, enterprises are prioritizing the optimization of DCs to more effectively unlock the computing power potential and data value.