The Technologies Behind the World's Fastest AI Cluster — Atlas 900

2020-06-16

On February 25, 2020, the GSMA — an industry organization representing mobile operators — awarded the Global Mobile Awards 2020 (GLOMO Awards) Tech of the Future Award to Huawei's Atlas 900 Artificial Intelligence (AI) cluster. Created by the GSMA to recognize revolutionary technology that can reshape the world, this award marks the industry's recognition of Huawei's Atlas 900 as a breakthrough in AI innovation.

Released at HUAWEI CONNECT 2019, Atlas 900 is an AI training cluster that boasts the world's fastest AI computing power, equivalent to that of 500,000 PCs. Atlas 900 shattered a world record by completing ResNet-50 training on ImageNet in just 59.8 seconds. How was this remarkable feat accomplished?

Atlas 900 system

How did the Atlas 900 AI training cluster achieve a time of 59.8 seconds?

ImageNet started off as a computer vision recognition project, but has since grown into the world's largest database for image recognition. It contains tens of millions of sample images and provides sample data for numerous image recognition AI algorithms. Since its inception in 2010, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has become an authoritative academic competition for the AI industry and has made significant contributions to the advancement of AI technologies. For example, in a span of just seven years, winners of the ILSVRC have raised image recognition accuracy from 71.8% to 97.3%, surpassing what humans can achieve.

Today, ImageNet is not only a competition for AI algorithms, but also a test of computing power for a substantial number of AI vendors. Indeed, the time required to complete ImageNet training has become the gold standard for AI computing power in the industry, with public institutions and private companies alike competing to set new records.

In September 2017, UC Berkeley completed ImageNet training within 24 minutes, setting a new world record. This was broken a mere three months later, when UC Berkeley’s Deep Neural Network (DNN) training completed the challenge within 11 minutes. Tencent was the next to break the record by completing the training in four minutes in August 2018.

Every minute slashed and every record broken is exciting news for the AI industry. Requiring on the order of ten billion billion (10^18) floating-point computations, an ImageNet training task is challenging even for the world's most powerful supercomputers. Remarkably, Huawei's Atlas 900 AI training cluster slashed the completion time to under one minute, winning it the honor of the Tech of the Future Award.

Why is improving the performance of AI training clusters so difficult?

The performance of AI processors is the basis for the overall performance of training clusters, so one way to improve computing power is simply to use higher-performance processors. In recent years, the performance of AI processors has grown at an explosive rate. However, a cluster usually involves thousands of AI processors in the computing process. Making these processors collaborate effectively remains the greatest challenge for the industry.

Processors are key to the performance of a single AI server.

The Atlas 900 AI training cluster uses Ascend processors, which offer the highest computing power in the industry: each processor integrates 32 Da Vinci AI cores and delivers twice the industry-average computing power. One server can be configured with eight Ascend AI chips, giving it peak floating-point performance at the petaFLOPS level.
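As a rough check on the petaFLOPS figure, the per-server peak is simply the per-chip peak multiplied by the number of chips. The sketch below assumes 256 TFLOPS per chip at half precision, a figure that does not appear in this article and is used here only for illustration. The function name is likewise hypothetical.

```python
# Assumed per-chip peak (not stated in this article): 256 TFLOPS, half precision.
ASSUMED_TFLOPS_PER_CHIP = 256.0
# From the text: one server can be configured with eight Ascend AI chips.
CHIPS_PER_SERVER = 8


def server_peak_pflops(tflops_per_chip: float, chips: int) -> float:
    """Return peak server throughput in petaFLOPS (1 PFLOPS = 1000 TFLOPS)."""
    return tflops_per_chip * chips / 1000.0


peak = server_peak_pflops(ASSUMED_TFLOPS_PER_CHIP, CHIPS_PER_SERVER)
print(f"Peak per server: {peak:.3f} PFLOPS")
```

Under that assumption, a single eight-chip server lands at roughly two petaFLOPS of peak half-precision throughput, which is consistent with the "petaFLOPS level" described above.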

Powerful AI chips alone are still not enough to deliver the volume of floating-point computation required for AI training tasks such as ImageNet. Multiple AI servers must form a cluster and complete the computation collaboratively. Many have argued that the larger the AI training cluster, the greater its computing power. Scale alone is not sufficient, however: improvements in other areas are also needed to boost the overall performance of the cluster.

Packet loss limits the performance of AI training clusters.

Theoretically, the overall performance of an AI cluster made up of two servers is twice that of a single server. In practice, however, the actual performance is less than twice that of a single server due to collaboration overhead. According to industry experience, the maximum performance of an AI cluster made up of 32 nodes can reach only half of the theoretical value. Indeed, more server nodes may even reduce the overall performance of the cluster as AI training clusters reach their performance ceilings.
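The gap between theoretical and actual performance can be illustrated with a toy model. The sketch below is not Huawei's measurement method: it assumes each training step costs the single-node compute time divided by the node count, plus a hypothetical synchronization overhead that grows with the number of peers. The overhead constant is chosen only so that 32 nodes reach half of the ideal speedup, matching the figure quoted above.

```python
def scaling_efficiency(measured_speedup: float, nodes: int) -> float:
    """Fraction of the ideal linear speedup that is actually achieved."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return measured_speedup / nodes


def toy_speedup(nodes: int, overhead: float) -> float:
    """Toy model: step time = 1/nodes (compute) + overhead * (nodes - 1) (sync).

    Single-node compute time is normalised to 1. `overhead` is a
    hypothetical per-peer synchronisation cost, not a measured value.
    """
    step_time = 1.0 / nodes + overhead * (nodes - 1)
    return 1.0 / step_time


# Hypothetical overhead, picked so that 32 nodes give a 16x speedup (50%).
OVERHEAD = 1.0 / (32 * 31)

for n in (1, 2, 8, 32, 64):
    s = toy_speedup(n, OVERHEAD)
    print(f"{n:3d} nodes: speedup {s:5.2f}, efficiency {scaling_efficiency(s, n):.0%}")
```

In this model the speedup peaks near 32 nodes and then falls, which mirrors the observation that adding servers beyond the performance ceiling can reduce overall cluster performance.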

Performance curve of an AI training cluster

The theoretical value is not reached because a large number of parameters must be synchronized frequently between servers during training. Network congestion worsens as the number of servers increases, resulting in greater packet loss. According to test data, a packet loss rate of just 0.1% cuts network throughput in half. Since packet loss increases with the number of server nodes, and the network breaks down when the packet loss rate reaches 2%, packet loss is the key factor limiting AI cluster performance.
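The article does not explain why such a small loss rate is so costly. One plausible mechanism, offered here purely as an assumption, is go-back-N style retransmission, in which a single lost packet forces the sender to resend the whole in-flight window. The sketch below uses the textbook go-back-N efficiency formula. The window size of 1,000 packets is hypothetical and was chosen to show how figures of this magnitude could arise, not to reproduce the test data.

```python
def go_back_n_goodput(loss_rate: float, window_packets: int) -> float:
    """Fraction of link capacity carrying useful data under go-back-N.

    Each lost packet forces `window_packets` packets to be resent, so the
    expected cost per delivered packet is 1 + window * p / (1 - p) slots.
    """
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    p = loss_rate
    return (1.0 - p) / (1.0 - p + window_packets * p)


WINDOW = 1000  # hypothetical number of packets in flight

for loss in (0.0, 0.001, 0.02):
    print(f"loss {loss:.1%}: goodput {go_back_n_goodput(loss, WINDOW):.1%}")
```

Under these assumptions, a loss rate of one in a thousand leaves about half of the link capacity for useful data, and a 2% loss rate leaves less than 5%.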

How did Huawei overcome this challenge?

As the world's fastest AI training cluster, Atlas 900 connects hundreds of server nodes consisting of thousands of Ascend processors. But how did Atlas 900 break the performance ceiling and ensure efficient, lossless interconnection between hundreds of server nodes without losing computing power? The key lies in creating a network with zero packet loss.

Creating an intelligent lossless algorithm after seven years of dedication.

As early as 2012, Huawei devoted significant resources to researching and developing next generation lossless networks, to help tackle the challenges that arise from the rapid growth of data. Indeed, Huawei remains committed to building Ethernet networks with zero packet loss and low latency. After working tirelessly for seven years, researchers created an iLossless algorithm solution that uses AI technologies to implement network congestion scheduling and network self-optimization. The iLossless algorithm provides intelligent predictions for Ethernet traffic scheduling, and is capable of accurately predicting congestion status in the next moment based on current traffic status, making preparations accordingly. The mechanism is similar to how congestion predictions for airport runways are made, based on the frequency of take-offs and landings. Scheduling is then performed in advance to improve traffic flow.
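The idea of predicting congestion one step ahead and acting before it occurs can be sketched in a few lines. This is not the iLossless algorithm, whose details are not given here. It is a deliberately simple stand-in that extrapolates switch queue depth linearly from the last two samples, and every name and threshold in it is hypothetical.

```python
from typing import Sequence


def predict_next_depth(history: Sequence[float]) -> float:
    """Linearly extrapolate the next queue depth from the last two samples."""
    if not history:
        return 0.0
    if len(history) < 2:
        return history[-1]
    trend = history[-1] - history[-2]
    return max(0.0, history[-1] + trend)


def should_signal_early(history: Sequence[float], capacity: float,
                        headroom: float = 0.8) -> bool:
    """Return True if the predicted depth exceeds headroom * capacity.

    A real switch would then ask senders to slow down before the queue
    overflows, rather than after packets have been dropped.
    """
    return predict_next_depth(history) > headroom * capacity


# Queue depths (in packets) sampled at regular intervals; capacity is 1000.
rising = [200.0, 450.0, 700.0]
print(predict_next_depth(rising))          # predicted depth for the next interval
print(should_signal_early(rising, 1000.0))  # act while the queue is still below 800
```

The point of the example is the timing: the current depth of 700 is still under the 800-packet threshold, but the predicted depth is not, so the scheduler reacts one interval earlier than a purely reactive scheme would.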

As an AI algorithm, iLossless must be trained based on a massive amount of sample data before commercial use. Over the years, Huawei has continued to work with hundreds of customers to further optimize the algorithm. Based on the running scenarios of customers' live networks and unique random sample generation technology, Huawei has accumulated substantial amounts of valid sample data. This has been done so the algorithm can function optimally, with zero packet loss and 100% network throughput for any scenario.

The iLossless algorithm marks an end to the 40-year history of packet loss due to congestion over Ethernet. Recently, under the leadership of Huawei, the Institute of Electrical and Electronics Engineers (IEEE) set up the IEEE 802 Network Enhancements for the Next Decade Industry Connections Activity (Nendica) working group. Intelligent lossless Data Center Network (DCN) is a major breakthrough and has become the new trend for Ethernet development.

Industry's only Ethernet with zero packet loss, enabling Atlas 900 to achieve the world's highest computing power.

At the beginning of 2019, Huawei launched CloudEngine, the industry's first data center switch with embedded AI chips. This next generation switch is the best operating platform for the innovative iLossless algorithm. After many years of research, CloudEngine series switches now incorporate all three AI elements — algorithms, big data, and computing power — and can be deployed on a commercial scale.

Network connection architecture of Atlas 900

CloudEngine series switches are used to build an intelligent lossless Ethernet network with zero packet loss. Atlas 900 is built on this network, which gives each AI server in the cluster an access capacity of eight 100 GE links and forms a 100 Tbit/s full-mesh, non-blocking, dedicated parameter synchronization network with zero packet loss. The intelligent lossless DCN, built with the world's highest-density 400G switch, the CloudEngine 16800, not only meets the requirement of zero packet loss but also supports evolution to large-scale 400 GE networks. This ensures linear scale-out performance in the future and continuous peak performance. Simply put, the world's highest computing power, achieved by Atlas 900, is made possible only by Huawei's intelligent lossless DCN.
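To give a sense of what a dedicated parameter synchronization network has to carry, the sketch below estimates the traffic for one gradient exchange. Several inputs are assumptions not taken from this article: ring all-reduce as the collective (the article does not say which one Atlas 900 uses), roughly 25.6 million parameters for ResNet-50 stored as 4-byte floats, and a hypothetical ring of 128 servers. Only the eight 100 GE links per server come from the text, and latency is ignored.

```python
def ring_allreduce_bytes_per_node(payload_bytes: float, nodes: int) -> float:
    """Bytes each node sends in one ring all-reduce: 2 * (n - 1) / n * payload."""
    if nodes < 2:
        return 0.0
    return 2.0 * (nodes - 1) / nodes * payload_bytes


def transfer_seconds(num_bytes: float, link_gbit_per_s: float) -> float:
    """Ideal transfer time over a link, ignoring latency and protocol overhead."""
    return num_bytes * 8.0 / (link_gbit_per_s * 1e9)


ASSUMED_PARAMS = 25.6e6        # approximate ResNet-50 parameter count (assumption)
BYTES_PER_PARAM = 4            # 32-bit floats (assumption)
ASSUMED_NODES = 128            # hypothetical number of servers in the ring
ACCESS_GBIT_PER_S = 8 * 100.0  # from the text: eight 100 GE links per server

payload = ASSUMED_PARAMS * BYTES_PER_PARAM
sent = ring_allreduce_bytes_per_node(payload, ASSUMED_NODES)
seconds = transfer_seconds(sent, ACCESS_GBIT_PER_S)
print(f"Per-node traffic per step: {sent / 1e6:.1f} MB")
print(f"Ideal sync time per step:  {seconds * 1e3:.2f} ms")
```

With these inputs each server sends about 200 MB per synchronization, which takes on the order of two milliseconds at full line rate. Because this exchange repeats at every training step, any throughput lost to retransmission translates directly into idle processors.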

Intelligent and lossless DCN achieves three-network convergence DCN architecture.

Huawei's intelligent lossless DCN is not only a high-performance network for AI training clusters, but also applies to cloud and AI data centers. Ethernet networks with zero packet loss offer superior performance for storage (including all-flash distributed storage and distributed databases), high-performance computing, and big data. According to test results from the Tolly Group, an independent testing and validation company, Huawei's intelligent lossless DCN delivers service performance that is comparable to that of dedicated (private) networks and 30% higher than that of traditional Ethernet networks.

Building a converged DCN has always been a dream for network operators. In the past, traditional Ethernet networks could not meet the requirements of scenarios such as storage due to packet loss. And despite their shortcomings, namely closed ecosystems and incompatibility with the live network, dedicated networks such as Fibre Channel (FC) and InfiniBand could not be completely phased out, and a certain number of dedicated networks remained deployed.

Huawei's intelligent lossless DCN makes it possible to integrate the three networks of a data center — computing, storage, and service — reducing the total cost of operations by an estimated 53%. Currently, the next generation DCN has been commercially deployed in 47 data centers worldwide, with customers including HUAWEI CLOUD, China Merchants Bank branch cloud, Baidu, and UCloud.

Finally, the intelligent lossless DCN is becoming the foundation of next generation, three-network convergence DCN architecture.
