What Kind of Storage Architecture Is Best for Large AI Models?

2025-02-17

At the launch of Huawei's new AI storage products for the large-model era, one expression stood out: "one storage system for the entire AI process." What is the entire AI process, and why should one storage system serve all of it?

The entire AI process generally consists of four phases: data acquisition, data preprocessing, model training and evaluation, and model deployment. Each phase involves storing and accessing massive amounts of data. Today, most customers build siloed IT systems for these phases, with independent storage clusters for data collection and preprocessing, model training, and inference. However, data must flow between phases, and these siloed IT systems will face unprecedented challenges in the large-AI-model era.

Technological trends in large AI models

As new AI technologies mature, large AI models are developing toward cognitive intelligence: stronger emergence and generalization abilities, more accurate semantic understanding of language, and better reasoning skills. Three major trends currently shape the development of large AI models:

First, the number of parameters in large models continues to grow exponentially, from hundreds of billions to trillions.

Second, large AI models have evolved from unimodal to multimodal, and will evolve toward full modality. The datasets used to train large models have grown from about 3 TB for NLP models to 40 TB for multimodal models, and are projected to reach several PB for full-modal models.

Third, the demand for computing power is growing faster than the computing power of a single GPU, so large-model training clusters will keep getting larger.

Challenges faced by AI development platforms

The aforementioned trends will bring the following challenges for the AI process:

First, as training datasets grow, today's mainstream architecture, which combines shared storage with local SSDs on compute nodes, will no longer meet the development requirements of large models.

Second, with raw data, data preprocessing, and AI model training handled by siloed storage clusters, the frequent migration of PB-scale data between them will become the primary factor limiting the production efficiency of large models.

Third, larger AI clusters shorten the mean time between failures (MTBF), and the more frequent checkpoints needed to compensate place significant write-bandwidth demands on the storage system.
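To see why checkpoint writes become a bottleneck, here is a rough back-of-the-envelope estimate in Python. Every figure in it (parameter count, bytes per parameter, checkpoint window) is an illustrative assumption, not a number from this article:

```python
# Back-of-the-envelope estimate of checkpoint write bandwidth.
# All figures are illustrative assumptions, not measured values.

params = 1e12                # assumed: a trillion-parameter model
bytes_per_param = 16         # assumed: FP16 weights + FP32 master weights
                             # + two FP32 Adam optimizer states
checkpoint_bytes = params * bytes_per_param       # ~16 TB per checkpoint

window_s = 5 * 60            # assumed: checkpoint must complete in 5 minutes
                             # so the GPU cluster is not stalled for long

required_gbs = checkpoint_bytes / window_s / 1e9
print(f"Checkpoint size: {checkpoint_bytes / 1e12:.0f} TB")
print(f"Required aggregate write bandwidth: {required_gbs:.0f} GB/s")
```

Under these assumed figures, the cluster needs roughly 53 GB/s of sustained write bandwidth for checkpoints alone, which is why more frequent checkpoints stress the storage system.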

The volume and quality of data determine the competence of large AI models, while the efficiency of data preparation and data transfer throughout the entire AI process determines their end-to-end production costs.

Key technical requirements for large-AI-model services

A storage system that can keep pace with the rapid development of large AI models is critical to improving production efficiency and reducing total cost of ownership (TCO). So what kind of storage architecture is best for large AI models? In my opinion, it should have all five of the following key features:

(1) The storage system has both a high-performance layer and a large-capacity layer, presents a unified namespace, and manages data throughout its lifecycle in four ways:

First, data placement policies can be specified when data is first written. For example, in the data acquisition phase, newly acquired data that must be processed soon can be written directly to the high-performance layer, while data that will not be processed soon, or that is destined for archiving, can be written directly to the large-capacity layer.

Second, cross-tier data flow policies can be defined based on data access time and frequency, or on a capacity watermark.

Third, data flows automatically between the high-performance layer and the large-capacity layer according to these user-defined tiering policies, and the migration is transparent to applications.

Finally, for data that has been tiered down to the large-capacity layer, a warm-up policy can be configured through commands or APIs to accelerate the cold start of scheduled tasks.
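The sketch below shows what such lifecycle rules might look like. It is a minimal illustration in Python; the policy names, thresholds, and the warm_up helper are all hypothetical, not a real product API:

```python
import time

# Hypothetical lifecycle policy for a two-tier storage system.
# Every name and threshold here is illustrative, not a product API.

PLACEMENT_RULES = {
    # First write: route data to a tier based on its intended use.
    "ingest/hot":     "performance",  # needs preprocessing soon
    "ingest/archive": "capacity",     # written once for cold retention
}

TIERING_POLICY = {
    "demote_after_days_unread": 14,   # flow down by last access time
    "demote_above_capacity_pct": 85,  # or when the hot tier fills up
}

def tier_for_new_data(dataset_class: str) -> str:
    """Pick the initial tier when data is first written."""
    return PLACEMENT_RULES.get(dataset_class, "capacity")

def should_demote(last_access_ts: float, hot_tier_used_pct: float) -> bool:
    """Decide whether a file should flow from the performance tier to the
    capacity tier, per the access-time and watermark rules above."""
    idle_days = (time.time() - last_access_ts) / 86400
    return (idle_days > TIERING_POLICY["demote_after_days_unread"]
            or hot_tier_used_pct > TIERING_POLICY["demote_above_capacity_pct"])

def warm_up(paths: list[str]) -> None:
    """Warm-up hook: promote files back to the performance tier ahead of a
    scheduled task, so it does not cold-start from the capacity tier."""
    for path in paths:
        print(f"promote {path} -> performance tier")  # placeholder action
```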

(2) One storage system supports every service in the AI process, as well as all the protocols required by the full-process AI tool chain, including NAS, HDFS, object, and a parallel client. It also guarantees lossless semantics for each protocol, matching the ecosystem compatibility of the native protocols. All of these protocols share the same storage space, and each uses only the space allocated to it through thin provisioning, so storage space can be allocated quickly and dynamically to each phase of the AI process.
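To make multi-protocol access concrete, the sketch below reads the same file through three front ends that share one namespace. The mount point, endpoint URL, bucket, and host are hypothetical; boto3 and pyarrow are standard clients, and a real deployment would substitute its own addresses and credentials:

```python
# One dataset, reachable through several protocol front ends that share
# the same namespace. All paths, endpoints, and names are hypothetical.

# --- POSIX / NAS view: the file as seen through an NFS mount ---
with open("/mnt/ai_lake/datasets/imgs/000001.jpg", "rb") as f:
    head = f.read(16)

# --- Object view: the same bytes through an S3-compatible endpoint ---
import boto3

s3 = boto3.client("s3", endpoint_url="https://storage.example.com")
obj = s3.get_object(Bucket="ai-lake", Key="datasets/imgs/000001.jpg")
assert obj["Body"].read(16) == head   # same data, no copy between protocols

# --- HDFS view: the same namespace through a Hadoop-compatible client ---
import pyarrow.fs as pafs

hdfs = pafs.HadoopFileSystem(host="storage.example.com", port=8020)
with hdfs.open_input_stream("/datasets/imgs/000001.jpg") as f:
    assert f.read(16) == head
```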

(3) Data can be handed off efficiently between the phases of the AI process. In each phase, the same data and metadata are visible through the tool chains of the different protocol ecosystems, with zero data copying and zero format conversion. The output of one AI phase can therefore be used directly as the input of the next, without waiting for data-movement jobs.
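For example, a training job might consume preprocessing output in place, as in this minimal PyTorch sketch. The path is hypothetical; the point is that the files written by the previous phase (for instance, by a Spark job over HDFS) are read directly over POSIX, with no export or conversion step:

```python
import os
from torch.utils.data import Dataset

class PreprocessedSamples(Dataset):
    """Reads preprocessing output in place from the shared namespace."""

    def __init__(self, root: str):
        # Hypothetical path: the preprocessing phase wrote its output here.
        self.paths = [os.path.join(root, n) for n in sorted(os.listdir(root))]

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> bytes:
        # No copy or format conversion: the file written by the previous
        # phase is the training input as-is.
        with open(self.paths[idx], "rb") as f:
            return f.read()

ds = PreprocessedSamples("/mnt/ai_lake/preprocessed")
```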

(4) The storage system can scale out to thousands of nodes. It must adopt a fully symmetric architecture with no independent metadata service nodes, so that system bandwidth and metadata performance grow in proportion to the number of storage nodes. In the shuffle phase of each training epoch, the storage system must be able to list hundreds of millions of files efficiently. It must also support hundreds of millions of training-set files and the frequent creation of new hard links to each file, which is how version management of the training set is implemented.
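The hard-link mechanism is worth a concrete illustration. The sketch below snapshots a training-set version by hard-linking every file into a per-version directory: a hard link adds a directory entry for the same inode, so no file data is duplicated. The directory paths are hypothetical; with hundreds of millions of files, metadata throughput dominates this operation, which is why metadata performance must scale with the cluster:

```python
import os

def snapshot_dataset(src_root: str, version_root: str) -> int:
    """Create a 'version' of src_root made entirely of hard links."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.join(version_root, rel)
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            # os.link creates a new name for the same inode: the version
            # costs metadata only, never a second copy of the file data.
            os.link(os.path.join(dirpath, name), os.path.join(dst_dir, name))
            count += 1
    return count

n = snapshot_dataset("/mnt/ai_lake/train_set", "/mnt/ai_lake/train_set@v2")
print(f"linked {n} files")
```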

(5) The storage system can sustain high performance under dynamic hybrid loads. In the data import phase, large and small files are written simultaneously. In the data preprocessing phase, large and small files are read in batches, and massive numbers of small files are generated. In the model training phase, those small files are read randomly in batches. In the checkpoint generation phase, high write bandwidth is required. In the model deployment phase, even when many instances read the same model file concurrently, the aggregate read throughput of the cluster must still scale in proportion to the number of deployed devices.
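As a final illustration of the checkpoint workload, here is a minimal PyTorch checkpointing sketch. The mount point is a hypothetical path on the high-performance tier; at cluster scale, many ranks write such files concurrently, which is what produces the burst write load described above:

```python
import os
import torch

def save_checkpoint(model, optimizer, step: int,
                    root: str = "/mnt/ai_lake/ckpt") -> None:
    """Write one checkpoint file; hypothetical path on the fast tier."""
    os.makedirs(root, exist_ok=True)
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    torch.save(state, os.path.join(root, f"step_{step:08d}.pt"))

# Typical pattern: checkpoint often enough that a node failure costs only
# minutes of recomputation, shifting the burden to storage write bandwidth.
# if step % 500 == 0:
#     save_checkpoint(model, optimizer, step)
```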

Summary

Once we have a storage system with the five features outlined above, we can build an AI-native data lake storage platform for large AI models. All data that needs fast processing is served from the high-performance layer, and data no longer has to be migrated repeatedly as it moves between phases of the AI process. This greatly improves data preparation efficiency for AI training, raises the GPU utilization of AI computing clusters, and significantly reduces both the investment in GPU computing and the labor cost of data preprocessing. It also shortens the development cycle of large AI models and lowers electricity costs. With an AI-native architecture, the end-to-end TCO of a large model with hundreds of billions of parameters could be reduced by more than 10%.

It is not difficult for a storage system to perform well under one or a few I/O patterns. It is rare, however, for a single system to perform well under all of the I/O patterns generated by the full-process tool chain for large-AI-model development. Huawei OceanStor is the only storage system that offers all five features and performs well across all of these I/O patterns. It has been meticulously designed and built on Huawei's extensive industry expertise and decades of experience with scale-out file systems.

Disclaimer: The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy, position, products, and technologies of Huawei Technologies Co., Ltd. If you need to learn more about the products and technologies of Huawei Technologies Co., Ltd., please visit our website at e.huawei.com or contact us.
