Intelligent Lossless HPC Network Empowers Peking University With Unrivaled Performance
"I really need to run this task to catch up with my deadline. The queuing time for resources is way too long. What do I do?"
"My experiment deadline's next week, but I just noticed some data was incorrect. It's going to take more than 100 hours to run the simulation again. Can it go any faster?"
"This experiment is so important to me. The deadline is coming up fast. Will I be able to run my task first?"
What troubles scientific researchers is not only molecular motion, deoxyribonucleic acid (DNA) composition, wind tunnel testing, and complex modeling and simulation experiments, but also having to manage limited computing resources and coordinate around long queuing times.
In order to improve HPC efficiency and reduce the costs of scientific research, the public HPC platform of Peking University organized a vendor appraisal to select an HPC network that can live up to their expectations. Huawei's intelligent lossless HPC network ranked No. 1 due to its unrivaled computing performance.
Peking University led Chinese universities in setting up a computing center when it bought its very first computer in 1963. In 2001, it gathered experts from various fields to found the Center for Computational Science & Engineering, which is positioned as a multi-disciplinary research platform serving the university's teaching and research activities. In 2018, the public HPC platform was unveiled, and three clusters (Weiming No. 1, Weiming Teaching No. 1, and Weiming Biological Science No. 1) were gradually put into operation. The total number of computing cores on the public platform reached 31,732, and the peak computing power reached 3.65 PFLOPS. The platform provides an HPC environment for a host of disciplines such as mathematics, mechanics, physics, chemistry, biology, and geology.
An HPC platform functions as a key support for a university's scientific research. By May 12, 2023, the HPC platform of Peking University had 5,070 users across 96 faculties. The platform has supported more than 545 research projects with total funding of CNY3.136 billion, as well as over 1,400 high-quality papers. It also supported the project that won the Gordon Bell Prize in 2020. This award-winning project used machine learning to push the limit of molecular dynamics simulation to 100 million atoms, and it is considered one of the most significant breakthroughs made in the computational science field to date.
Higher Computing Demands Make Network Reconstruction Urgent
As the number of users on the platform continues to increase, the operation workload is gradually creeping beyond the platform's upper limits, pushing the required network throughput and complexity to unprecedented levels. Take Weiming Biological Science No. 1 as an example. Its node utilization has remained above 95% for a long time, its maximum task operation time is as long as 109 hours, and its maximum queuing time is 550 hours. Clearly, reconstruction of the system and network is urgent.
To solve these problems, vendors proposed using lossless network technologies such as InfiniBand (IB), RoCEv1, and RoCEv2. After strict tests, the public HPC platform of Peking University finally chose Huawei's CloudFabric 3.0 hyper-converged DCN solution due to its unrivaled performance. Based on an intelligent lossless HPC network, this solution is ideal for building HPC clusters that can unleash 100% of computing power and minimize task operation and queuing times.
The tests compared the performance of TCP/IP, IB, and RoCEv2 across different workloads: the HPC benchmark LINPACK, the Community Earth System Model (CESM), and the ab initio molecular dynamics software Vienna Ab initio Simulation Package (VASP).
In the VASP test, Huawei's intelligent lossless HPC network, running 100GE RoCEv2, outperformed IB. In the LINPACK and CESM tests, Huawei's 100GE RoCEv2 delivered essentially the same performance as IB. Together, these results showed that Huawei's intelligent lossless HPC network could replace IB in real application scenarios.
Huawei's intelligent lossless HPC network solution uniquely enables lossless Ethernet. Compared with conventional Ethernet, lossless Ethernet can double computing power at the same server scale. Another highlight of the solution is the CloudEngine 16800 switch. This feature-rich switch offers the industry's highest density of 768 x 400GE ports, and is ideal for building a 10E-level ultra-large compute cluster. Furthermore, Huawei is the only vendor that implements network-assisted computing, that is, in-network computing (INC). As verified by Tolly, the job completion time (JCT) of Huawei's solution is 17% shorter than that of IB.
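To put the 17% JCT figure in perspective, the short Python sketch below converts a fractional reduction in job completion time into a new run time and an equivalent speedup. The 100-hour baseline and the function names are illustrative assumptions, not values or tools from the Tolly verification or the Peking University tests.

```python
def jct_after_reduction(baseline_hours: float, reduction: float) -> float:
    """Return the job completion time after cutting it by a fraction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return baseline_hours * (1.0 - reduction)


def equivalent_speedup(reduction: float) -> float:
    """Return how many times faster a job runs after the reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    baseline = 100.0   # hypothetical job time on IB, in hours (assumption)
    reduction = 0.17   # the 17% shorter JCT cited in the article
    print(f"New JCT: {jct_after_reduction(baseline, reduction):.1f} hours")
    print(f"Equivalent speedup: {equivalent_speedup(reduction):.2f}x")
```

Under these assumptions, a job that takes 100 hours on IB would finish in about 83 hours, which corresponds to roughly a 1.2x speedup rather than a 1.17x one, because the saving is measured against the original run time.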
The HPC platform of Peking University boasts the leading supercomputing cluster in all of China. The LINPACK efficiency of the entire system consistently ranks first, which places extremely high requirements on network performance and reliability. These tests prove again how powerful Huawei's hyper-converged DCN is and help Huawei win more recognition from the supercomputing industry. Looking ahead, Huawei's intelligent lossless HPC network will see wider application across fields such as education and scientific research, laying a solid foundation for scientific computing, engineering innovation, and high-end scientific research.