Huawei OceanStor AI Storage Accelerates the Evolution of iFLYTEK's SparkDesk Cognitive Model
Training resumed within 1 minute, 15 times faster than before
iFLYTEK is a well-known company in the Asia-Pacific region specializing in the development of intelligent speech and artificial intelligence (AI) technologies. It owns the National Engineering Research Center of Speech and Language Information Processing and the State Key Laboratory of Cognitive Intelligence. In order to quickly deploy a high-performance platform to train and roll out large AI models and gain a competitive advantage, iFLYTEK has partnered with Huawei to create a full-stack large AI model solution encompassing storage, compute, and networking, and to build Feixing No. 1—China's first ultra-large-scale computing platform that trains large AI models with trillions of parameters.
Huawei's AI data lake storage solution plays a pivotal role in supporting the platform, providing both high storage capacity and exceptional performance. Its tiered storage architecture uses multiple sets of OceanStor professional storage with a total capacity of dozens of PBs. The solution also leverages intelligent data tiering, multi-cluster fault isolation, and efficient data governance to deliver TB-level bandwidth, accelerating all stages of large AI model development.
iFLYTEK has kept improving its SparkDesk cognitive model using massive amounts of data and large-scale knowledge. The latest version of the SparkDesk model, trained on Feixing No. 1, can carry out a closed-loop troubleshooting process from problem identification to solution planning and execution. This shows that AI technology has evolved from a specialized field focused on perceiving and understanding the world into one focused on shaping and transforming it. Such improvements in AI technology pose new challenges for data storage.
• Low cluster utilization: Training of large AI models mainly involves multi-node and multi-computing-card training jobs, which have high fault rates. High-performance storage I/O and bandwidth are needed for checkpoint reads and writes during model loading and training resumption. For example, a training cluster with thousands of computing cards experiences a fault about once a day and takes at least 15 minutes to resume training, which leads to a significant monetary loss each time a fault occurs.
• Separate and unreliable clusters: iFLYTEK's live network used to consist of multiple siloed storage systems from different vendors. They provided a total storage capacity of dozens of PBs in the form of separate PB-level clusters, which made management highly complex. In addition, deploying software and hardware independently reduced the reliability and bandwidth of the storage clusters.
• Difficult data governance: The training dataset of a large AI model consists of tens of billions of files. These files were usually stored in separate clusters, causing data silos and inefficient manual migration. In addition, without global data visualization, hot data, cold data, and high-value data could not be identified, which made data governance difficult.
Therefore, storage for large AI model vendors must feature:
1. A high-performance foundation that optimizes multi-node and multi-card AI cluster training duration, accelerates training resumption, and reduces rollback
2. Unified management of AI data lake storage and efficient and reliable data governance
Feixing No. 1 is China's first ultra-large-scale computing platform to train foundation models. It leverages a decoupled compute and storage architecture that combines strengths of the heterogeneous computing platform and Huawei Data Storage. While iFLYTEK focuses on unleashing the computing power, Huawei Data Storage creates an AI data lake storage foundation using multiple sets of OceanStor AI storage products to provide dozens of PBs of capacity and ensure reliable and efficient storage.
Figure: AI data lake solution architecture
15 times faster training resumption, leading to significant monetary savings
The training cluster provides up to TB-level bandwidth for faster checkpoint reads and writes. This cuts training resumption time from 15 minutes to one minute, 15 times faster than before.
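To make the saving concrete, the sketch below estimates how much compute time sits idle each day while training resumes. The fault rate (about one per day) and the 15-minute and 1-minute resume times come from this case study. The cluster size of 1,000 cards and the function name are illustrative assumptions, and the model ignores any work lost to rollback since the last checkpoint.

```python
def lost_card_hours_per_day(cards: int, faults_per_day: float, resume_minutes: float) -> float:
    """Card-hours spent idle per day while the cluster waits for training to resume."""
    return cards * faults_per_day * resume_minutes / 60.0


# Assumption: a 1,000-card cluster (the article only says "thousands of computing cards").
CARDS = 1000
# From the article: roughly one fault per day.
FAULTS_PER_DAY = 1.0

before = lost_card_hours_per_day(CARDS, FAULTS_PER_DAY, resume_minutes=15)
after = lost_card_hours_per_day(CARDS, FAULTS_PER_DAY, resume_minutes=1)
speedup = 15 / 1  # ratio of old to new resume time

print(f"Idle card-hours per day before: {before:.1f}")
print(f"Idle card-hours per day after:  {after:.1f}")
print(f"Resume speedup: {speedup:.0f}x")
```

Under these assumptions the idle time scales linearly with cluster size, so the same 15-to-1 ratio applies to larger clusters.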
Unified cluster management for high resilience and 99.999% reliability
Huawei OceanStor AI storage supports the unified management of multiple storage pools in a single cluster. Because the data planes are separated, a fault in one storage pool does not affect the other storage pools in the cluster. In addition, the reliability of a single cluster can reach 99.999% thanks to storage reliability functions such as subhealth management and high-ratio erasure coding (EC).
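Two quick calculations help put these figures in context. The sketch below assumes the 99.999% figure can be read as availability (the article calls it reliability) and converts it to a yearly downtime budget. It also shows why a high data-to-parity ratio in erasure coding improves usable capacity. The EC layouts used (4+2 and 22+2) are hypothetical examples, not the configuration deployed at iFLYTEK.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability fraction (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def ec_usable_fraction(data_fragments: int, parity_fragments: int) -> float:
    """Share of raw capacity that holds user data in a k+m erasure-coding layout."""
    return data_fragments / (data_fragments + parity_fragments)


# Assumption: treat "99.999% reliability" as availability.
five_nines_budget = downtime_minutes_per_year(0.99999)

# Hypothetical layouts: both tolerate two lost fragments, but the
# higher-ratio layout spends far less raw capacity on parity.
low_ratio = ec_usable_fraction(4, 2)
high_ratio = ec_usable_fraction(22, 2)

print(f"Downtime budget at 99.999%: {five_nines_budget:.2f} minutes/year")
print(f"Usable capacity, 4+2 EC:  {low_ratio:.1%}")
print(f"Usable capacity, 22+2 EC: {high_ratio:.1%}")
```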
Cost-effective data governance, reducing the TCO of full-lifecycle management by 30%
Unified data lake management is achieved using the global file system (GFS) and seamless multi-protocol interworking, which eliminates data silos. Global data visualization and management enable efficient data flow, three times more efficient cross-region scheduling, and zero data copy, thereby accelerating AI model development. Retrieval across hundreds of billions of metadata records in seconds, intelligent identification of data access frequency, and precise data tiering optimize the balance between storage performance and capacity.
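The idea of tiering by access frequency can be illustrated with a minimal sketch. This is not Huawei's algorithm: the tier names, the 30-day window, the thresholds, and the `FileStats` type are all hypothetical, chosen only to show how access counts could drive placement decisions.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    accesses_last_30_days: int  # hypothetical metric gathered from metadata


def choose_tier(stats: FileStats, hot_threshold: int = 100, warm_threshold: int = 5) -> str:
    """Pick a storage tier from access frequency (thresholds are illustrative)."""
    if stats.accesses_last_30_days >= hot_threshold:
        return "hot"   # e.g. a performance tier
    if stats.accesses_last_30_days >= warm_threshold:
        return "warm"
    return "cold"      # e.g. a capacity tier


# Hypothetical sample files.
files = [
    FileStats("/datasets/current/shard-0001", 480),
    FileStats("/datasets/previous/shard-0001", 12),
    FileStats("/archive/2019/raw-audio.tar", 0),
]

placement = {f.path: choose_tier(f) for f in files}
for path, tier in placement.items():
    print(f"{tier:>4}  {path}")
```

A production system would weigh more signals than a raw access count, such as file size, recency, and data value, but the principle of separating frequently read data from rarely read data is the same.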
Huawei OceanStor AI storage is designed to accommodate the growing scale of computing clusters. It serves as an AI data lake storage foundation for iFLYTEK's groundbreaking ultra-large-scale computing platform. This storage foundation will promote the ongoing evolution of the SparkDesk cognitive model using massive amounts of data and knowledge, and assist iFLYTEK in building a better world with AI by helping machines to listen, speak, understand, and think.