Two decades ago, the Human Genome Project (HGP) announced its initial results: a sequence and genetic map of the entire human genome, from both functional and physical standpoints. Since then, advances in big data, storage, and computing technologies have ushered in a digital genome era, one where genome sequencing and analysis are essential in the fight against viral diseases, especially Severe Acute Respiratory Syndrome (SARS) and the current Coronavirus Disease 2019 (COVID-19).
That said, despite these advances, decoding genes is still far from easy. The human genome, distributed across 23 pairs of chromosomes in the nucleus, contains roughly 20,000 protein-coding genes (early estimates ran from 60,000 to 100,000) and approximately three billion base pairs, so sequencing and analysis are incredibly difficult. Scientists spent US$3 billion and 13 years to complete that first human genome sequence. Fast forward to 2019 and the time needed for an individual to undergo Whole Genome Sequencing (WGS) had been cut to just a single day. And then, in 2021, West China Hospital (WCH) of Sichuan University, located in the city of Chengdu, Southwest China, took the next step.
In September 2021, WCH, together with Huawei and Sailegene, a provider of bioinformatics data analysis solutions, released an acceleration analysis platform for multiomics data (data that combines the data sets of different omic groups). This platform shortens the germline mutation analysis of a 30X human WGS sample from 24 hours to just 7 minutes: a feat of huge significance, signaling a new chapter in the exploration of human life.
WCH is an important, national-level base for medical research and technological innovation. It has ranked first for five consecutive years in the Chinese Hospital Science and Technology Evaluation Metrics (STEM), conducted by the Chinese Academy of Medical Sciences. Its West China Biomedical Big Data Center, an open platform for sharing research and applying health and medical big data, collects and analyzes big data in biomedicine to optimize all aspects of clinical medical treatment. The center is committed to building an efficient multiomics analysis platform so that large-scale WGS analysis results can be rapidly translated into clinical practice.
Functions and Positioning of the West China Biomedical Big Data Center
With multiomics data analysis now cast as the foundation of precision medicine and medical big data, countries around the world are starting to invest in genome sequencing programs for population cohorts: this marks the arrival of genome big data. WCH, for example, started a WGS program in 2018 for a natural population cohort of 10,000 elderly people from different ethnic groups in western China, and another in 2020 for 100,000 Chinese patients with rare diseases. Analyzing WGS results at this scale requires a high-performance genome analysis platform.
Genome sequencing, a process of analyzing and determining the complete sequence of genes from blood or saliva, consists of three steps: extraction, analysis, and interpretation. In essence, library preparation and sequencing convert biological information that cannot be observed directly into text information (extraction), probability and statistics are used to reduce the deviation between that text information and the underlying biological information (analysis), and the results are then used for research (interpretation). The analysis step involves file format conversion, decompression, gene splicing, sequence alignment, sorting, deduplication, mutation detection, and joint genotyping. Because this step depends so heavily on the performance of the bioinformatic analysis system, it is a main focus of High-Performance Computing (HPC) solutions for genome sequencing. For WCH, then, what were the specific requirements?
First, high data volumes needed to be handled efficiently. Genome sequencing generates terabytes (TB) of data. For example, a single DNBSEQ-T7 sequencer, made by MGI, a leading producer of high-throughput genome sequencing machines, produces 4.5 TB of data every 24 hours and 6 TB over 30 hours. This means that, under full load, it generates approximately 1.7 PB annually. In addition, the intermediate files and results from bioinformatic analysis bring the total to about five times the raw data volume, so one DNBSEQ-T7 requires approximately 8.5 PB of effective storage capacity annually to generate, store, and analyze data. Storing genetic data for long periods at low cost, with automatic management of online, offline, and archived data, is a significant challenge.
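The capacity figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal units (1 PB = 1,000 TB), a sequencer running back to back all year, and that the 8.5 PB figure is the rounded 1.7 PB of raw data multiplied by five. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: the sequencer runs all year without pause


def annual_raw_tb(tb_per_run: float, hours_per_run: float) -> float:
    """Raw output per year, in TB, for back-to-back runs of a given length."""
    return tb_per_run * (HOURS_PER_YEAR / hours_per_run)


def effective_pb(raw_pb: float, total_to_raw_ratio: float = 5.0) -> float:
    """Capacity needed when raw, intermediate, and result files total
    `total_to_raw_ratio` times the raw volume (assumed reading of the text)."""
    return raw_pb * total_to_raw_ratio


raw_24h_tb = annual_raw_tb(4.5, 24)  # 4.5 TB per 24-hour run
raw_30h_tb = annual_raw_tb(6.0, 30)  # 6 TB per 30-hour run

print(f"24 h runs: {raw_24h_tb / 1000:.2f} PB of raw data per year")
print(f"30 h runs: {raw_30h_tb / 1000:.2f} PB of raw data per year")
print(f"Effective capacity for 1.7 PB raw: {effective_pb(1.7):.1f} PB")
```

Both run lengths land between 1.6 PB and 1.8 PB of raw data per year, which is consistent with the rounded 1.7 PB quoted above.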
WCH also required a solution built for application-driven scientific computing workflows as well as the hybrid workloads of heterogeneous computing. Genetic data analysis involves Input/Output (I/O)-, Central Processing Unit (CPU)-, and memory-intensive services for different research purposes, along with dedicated software and different computing scenarios. The analysis and mining of mass genetic data therefore requires stream processing and high-performance heterogeneous computing clusters that combine Graphics Processing Units (GPUs) and CPUs. In addition, sequence alignment, often performed during the analysis of genome sequences, requires a one-off import of mass data into memory for processing, which translates into high memory capacity requirements.
Put simply, the highest possible storage performance was also required. Mass data transmission puts huge pressure on the network and its bandwidth. In addition, the computing process involves high-speed data sharing, reads, writes, and searches, which demand high I/O bandwidth from storage and devices. A storage system must deliver a throughput of at least 6 GB/s, along with strong real-time performance, while ensuring data integrity.
Challenges and Requirements of Data Infrastructure for the Genomics Data Analysis Platform
The cooperation between WCH, Sailegene, and Huawei draws on WCH's leading academic and industrial position in multiomics data analysis and genome application, Sailegene's industry experience in GPU-accelerated bioinformatics data analysis, and Huawei's technical expertise in high-performance data storage and advanced genetic data management systems. Combining these strengths has resulted in an acceleration analysis platform that brings together data and storage technologies, allowing the biotechnology industry to operate at far faster speeds and promoting, even leading, the digital transformation of the healthcare industry.
WHS-IMOAP Acceleration Analysis Platform for Multiomics Data
WCH has now deployed the optimization solution, whose top-level architecture analyzes running data with high-performance software algorithms to identify performance bottlenecks. The hospital has also set up a Research and Development (R&D) team and an acceleration analysis platform for multiomics data, creating a world-leading high-performance genomics analysis platform.
Sailegene has provided BaseNumber, its ultra-fast Next Generation Sequencing (NGS) data analysis platform, which replaces single-threaded processing with multithreading, adds an intensive concurrent read/write mode for disks, and raises write bandwidth to 6–12 GB/s while also improving I/O throughput. BaseNumber also offers a fast cache synchronization mode to accelerate the reading and writing of large files.
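To illustrate the general idea of moving from single-threaded to concurrent disk reads, here is a minimal, generic sketch in Python. It is not BaseNumber's implementation, whose internals are not described here; the function names, chunk size, and worker count are all assumptions chosen for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path: str, offset: int, length: int) -> bytes:
    """Read one byte range; each call opens its own file handle so that
    threads never share a file position."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def parallel_read(path: str, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Read a file as fixed-size chunks issued concurrently by a thread pool.

    Illustrative only: chunk_size and workers are arbitrary defaults.
    """
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of `offsets`,
        # so the chunks can be joined directly.
        chunks = pool.map(lambda off: read_chunk(path, off, chunk_size), offsets)
        return b"".join(chunks)
```

Whether concurrent reads actually help depends on the storage behind them: a single local disk may gain little, while a parallel file system that serves many requests at once is where this pattern tends to pay off.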
Huawei OceanStor Pacific mass data storage provides an extremely high-performance storage foundation. Sequence alignment places high demands on the single-thread bandwidth of storage. Compared with WCH's legacy storage devices, OceanStor Pacific provides double the single-thread read bandwidth and four times the single-thread write bandwidth. With four nodes, it offers an aggregate bandwidth of 30 GB/s for reads and 25 GB/s for writes, significantly improving the performance of the multiomics joint innovation platform.
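To put these bandwidth figures in perspective, the sketch below estimates how long it would take to move one day of DNBSEQ-T7 output (4.5 TB, as quoted earlier) at each rate. It assumes decimal units (1 TB = 1,000 GB), sustained throughput with no protocol overhead, and, for the per-node figures, that bandwidth divides evenly across the four nodes; none of these assumptions come from the source, and the function name is hypothetical.

```python
def transfer_seconds(data_tb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized time to move `data_tb` terabytes at a sustained rate."""
    return data_tb * 1000 / bandwidth_gb_per_s


DAILY_OUTPUT_TB = 4.5  # one 24-hour DNBSEQ-T7 run, from the text

rates_gb_per_s = {
    "minimum requirement": 6.0,
    "aggregate write": 25.0,
    "aggregate read": 30.0,
}

for label, rate in rates_gb_per_s.items():
    minutes = transfer_seconds(DAILY_OUTPUT_TB, rate) / 60
    print(f"{label:>20}: {rate:5.1f} GB/s -> {minutes:5.1f} min for {DAILY_OUTPUT_TB} TB")

# Assumption: bandwidth scales evenly across the four nodes.
NODES = 4
print(f"per-node read:  {30.0 / NODES} GB/s")
print(f"per-node write: {25.0 / NODES} GB/s")
```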
This combined innovation in architecture, computing, and storage completes the analysis of WGS results in just 7 minutes, three and a half times faster than WCH's legacy platform and 180 times faster than a traditional solution. At the 6th Biomedical Big Data and Intelligent Technology Application Summit, Dr. Yu Haopeng, a data scientist at the West China Biomedical Big Data Center, formally released the WHS-IMOAP high-performance genome analysis joint solution, announcing that the era of multiomics big data had arrived.
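The speedup factors can be turned back into run times with simple arithmetic. The sketch below assumes that "N times faster" means the ratio of the old run time to the new one; under that reading, the factors imply the baselines printed here. The function name is hypothetical.

```python
NEW_RUNTIME_MIN = 7.0  # WGS analysis time on the new platform, from the text


def implied_baseline_minutes(new_minutes: float, speedup: float) -> float:
    """Baseline run time implied by a speedup factor (old time / new time)."""
    return new_minutes * speedup


legacy_min = implied_baseline_minutes(NEW_RUNTIME_MIN, 3.5)
traditional_min = implied_baseline_minutes(NEW_RUNTIME_MIN, 180)

print(f"Implied legacy platform run time: {legacy_min:.1f} min")
print(f"Implied traditional run time:     {traditional_min / 60:.1f} h")

# For comparison, the 24-hour figure quoted earlier gives a slightly larger ratio.
ratio_from_24h = 24 * 60 / NEW_RUNTIME_MIN
print(f"24 h versus 7 min: about {ratio_from_24h:.0f} times")
```

Under this reading, the legacy platform would have needed roughly 24.5 minutes and the traditional solution about 21 hours, the same order of magnitude as the 24-hour figure cited earlier.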
All of this will only accelerate the wider application of precision medicine big data in healthcare. Once limited to scientific research laboratories, genome sequencing is now widely used in clinical applications. With big data and Artificial Intelligence (AI) technologies, WCH has conducted in-depth cross-disciplinary research that integrates medicine and engineering, made possible through comprehensive collaboration across industry, academia, research institutions, and clinical application. And the end result? The hospital has built a new full-lifecycle healthcare system, one that helps patients get better and recover faster, which is the ultimate aim of healthcare.