[Shanghai, China, September 20, 2023] During HUAWEI CONNECT 2023, the Xinghe Network White Paper was jointly released by Huawei, China Academy of Information and Communications Technology (CAICT), and iFLYTEK Engineering Institute at the Xinghe Network Summit hosted by the Data Communication Product Line. The white paper examines the development trends, network architecture, and key technological innovations of AI services, demonstrating the technical leadership of the Xinghe Network Solution. Aiming to promote digital and intelligent transformation across industries and to propel industry upgrades and cooperation, the white paper provides a reference for building a high-performance network for foundation model training.
Guo Liang (left), Chief Engineer, Cloud Computing and Big Data Research Institute, CAICT
The white paper points out the major challenges that the network faces in the foundation model era. To begin with, AI algorithms have in recent years entered an era of foundation models with trillions of parameters, and computing power requirements have grown nearly 100,000-fold. Foundation model computing requires efficient collaboration between tens of thousands of AI processors. As such, the network needs to be continuously optimized to improve parallel computing efficiency. In addition, due to the high cost of AI processors, a high-performance network featuring zero packet loss and high throughput is urgently needed to fully unleash AI processor efficiency. Furthermore, training a foundation model takes a long time, and a cluster with about 10,000 GPUs/NPUs can produce hundreds of thousands of flows. Efficient O&M methods are required to extend the mean time between failures (MTBF).
This is where the Xinghe Network Solution comes in. It addresses these challenges with the following innovations:
• High performance: The innovative AI accelerator, network scale load balancing (NSLB) technology, improves effective network throughput to 98% and increases AI training efficiency by 20%.
• High reliability: The Data Plane Fast Recovery (DPFR) technology enables application-unaware, rapid link failover at the sub-millisecond level.
• High maintainability: The visualized operations and maintenance (O&M) solution enables high-precision data collection and one-click network fault diagnosis, improving the in-training troubleshooting efficiency by 90%.
• Large scale: Xinghe Network supports cluster training of about 10,000 GPUs/NPUs, with 4x higher computing power than the industry's second-ranked solution.
• High openness: Huawei's hyper-converged Ethernet solution fully reuses the Ethernet ecosystem, delivers almost the same performance as its counterparts, and reduces O&M costs by 30%.
AI models have shifted from an era with tens of thousands of small-scale models to a new era with a myriad of foundation models that differ in scale and form, bringing new network requirements and challenges. To meet these requirements and challenges, the Xinghe Network Solution continuously optimizes the network architecture and innovates network technologies to provide a reference for building a high-performance network for foundation model training, thereby promoting the development of AI technologies.
So far, Xinghe Network Solution has been deployed and put into commercial use in more than 100 enterprises around the world. Looking ahead, we are ready to work with more partners to promote technology advancement and application scenario expansion, and ultimately achieve the sustainable development of AI technologies and the prosperity of society.
For more information about the white paper, visit: https://e.huawei.com/cn/material/enterprise/03173eb3ef52423c9dd4cd24e5d3ef48