Advanced Computing in the Age of AI | Sunday, October 2, 2022

Genome Researchers Battling Data Storage Bottlenecks 

Genome sequencing, which determines the complete DNA sequence of an organism, is generating petabytes of raw and processed data that are proving extremely difficult to manage and store.

Here is a key metric that illustrates the problem: The cost of DNA sequencing is dropping faster than the cost of storing a byte of sequencing data, according to government estimates cited in a white paper on storing and managing genome sequencing data. The problem is complicated by the fact that researchers seldom if ever delete gene data.

To get their arms around the data management problem, researchers are looking for affordable, scalable data storage with robust I/O performance. But the current lack of data integrity, along with memory I/O and other shortfalls, has stymied otherwise booming genome research.

In one example that illustrates the dimensions of the storage problem, a university sequencing center quickly used about 90 percent of its 5 petabytes of available data storage. Additional federal funding is intended to double storage capacity at the sequencing center. Moreover, power consumption at the center is so high that it requires its own electrical substation.

Another strategy is to upload genome-sequencing data to relatively inexpensive cloud storage facilities. The downside is limited bandwidth at data centers, which slows the transfer of gene data.

The white paper also cites the example of one European researcher who uploaded 100 gigabytes of compressed data to an Amazon cloud storage facility in northern Virginia using 40 computers and a very fast Internet connection. The transfer still took an hour and fifteen minutes.
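The transfer figures above imply a sustained rate that can be worked out directly. The following is a back-of-the-envelope sketch, assuming decimal units (1 GB = 10^9 bytes) and a constant transfer rate; the function names are illustrative and do not come from the white paper.

```python
# Figures from the article: 100 GB uploaded in 1 hour 15 minutes.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 PB = 10**15 bytes).
GB = 10**9
PB = 10**15


def throughput_bytes_per_sec(num_bytes, seconds):
    """Average transfer rate in bytes per second."""
    return num_bytes / seconds


def transfer_days(num_bytes, rate_bytes_per_sec):
    """Days needed to move num_bytes at a constant rate."""
    return num_bytes / rate_bytes_per_sec / 86_400


elapsed = 75 * 60  # 1 hour 15 minutes = 4,500 seconds
rate = throughput_bytes_per_sec(100 * GB, elapsed)

print(f"{rate / 1e6:.1f} MB/s")        # 22.2 MB/s
print(f"{rate * 8 / 1e6:.0f} Mbit/s")  # 178 Mbit/s
print(f"{transfer_days(1 * PB, rate):.0f} days per petabyte")  # 521 days
```

At that rate, a single petabyte would take well over a year to upload, which gives a sense of why bandwidth is treated as a bottleneck.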

Along with high-performance, scalable and affordable storage, researchers require integrated software that allows them to better manage petabytes of data and billions of files, the white paper emphasizes. In the era of open data, these files must also be accessible to multiple researchers who are liable to generate redundant sets of the same sequencing data.

Researchers exploring the frontiers of genome sequencing also place a premium on data integrity since the mountains of data they generate must be stored for years as colleagues sift through it.

Along with boosting performance and bandwidth and lowering costs, another potential solution is "tiered storage."

"Multiple tiers of storage allow higher-cost technologies for fast access to be blended with lower-cost near-line and archival storage," the white paper notes. "Managing this dynamic prioritization of data access and tiered storage is an ongoing challenge for researchers."
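One simple way to picture tiering is an age-based placement rule. The sketch below is only an illustration: the tier names, the 30-day and 365-day thresholds, and the `FileRecord` type are all hypothetical and are not taken from the white paper or from any storage product.

```python
from dataclasses import dataclass

# Hypothetical tiers and age thresholds, chosen only for illustration.
TIER_RULES = [
    (30, "fast-disk"),   # accessed within the last 30 days
    (365, "near-line"),  # accessed within the last year
]
ARCHIVE_TIER = "archive"  # everything older than the last threshold


@dataclass
class FileRecord:
    path: str
    days_since_access: int


def choose_tier(record):
    """Return the first tier whose age threshold the file satisfies."""
    for max_age_days, tier in TIER_RULES:
        if record.days_since_access <= max_age_days:
            return tier
    return ARCHIVE_TIER


files = [
    FileRecord("run_001.bam", 3),
    FileRecord("run_002.bam", 120),
    FileRecord("run_003.bam", 900),
]
placement = {f.path: choose_tier(f) for f in files}
print(placement)
```

A real policy engine would likely weigh file size, project priority and cost per terabyte as well; last-access age is used here only because it is the easiest rule to read.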

A potential solution from HPC specialist RAID Inc. of Andover, Massachusetts, is built around software-defined storage and IBM's General Parallel File System. RAID claims its Perseus system addresses many of the pain points faced by genome researchers, including a maximum file system limit of 1 million yottabytes (1 yottabyte equals 1 billion petabytes).
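The unit conversion quoted above can be checked with a few lines of arithmetic. This sketch assumes decimal (SI) prefixes, where a petabyte is 10^15 bytes and a yottabyte is 10^24 bytes; the 5-petabyte figure is the sequencing center capacity mentioned earlier in the article.

```python
# Assumption: decimal (SI) prefixes.
PETABYTE = 10**15
YOTTABYTE = 10**24

petabytes_per_yottabyte = YOTTABYTE // PETABYTE
claimed_limit_bytes = 1_000_000 * YOTTABYTE  # the quoted 1 million yottabytes
center_capacity_bytes = 5 * PETABYTE         # the sequencing center's 5 petabytes

print(petabytes_per_yottabyte)                       # 1000000000
print(claimed_limit_bytes == 10**30)                 # True
print(claimed_limit_bytes // center_capacity_bytes)  # 200000000000000
```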

A token mechanism is employed for data integrity to ensure only one file "owner" at any given time. Tiered storage and overall storage performance are said to be improved by distributing gene data across multiple disks attached to multiple servers.
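The two ideas in this paragraph, single ownership through tokens and spreading data across disks, can be sketched in a toy form. This is not how Perseus or IBM's file system is implemented; the `TokenManager` class, the `stripe` function, and the node and file names are hypothetical and exist only to illustrate the concepts.

```python
class TokenManager:
    """Toy token server: grants at most one owner per file at a time."""

    def __init__(self):
        self._owners = {}  # file path -> node currently holding the token

    def acquire(self, path, node):
        """Grant the token if it is free or already held by this node."""
        owner = self._owners.setdefault(path, node)
        return owner == node

    def release(self, path, node):
        """Give the token back; only the current owner can release it."""
        if self._owners.get(path) == node:
            del self._owners[path]


def stripe(data, num_disks, block_size):
    """Split data into fixed-size blocks and deal them round-robin to disks."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % num_disks].append(data[offset:offset + block_size])
    return disks


tokens = TokenManager()
first = tokens.acquire("sample.fastq", "node-a")   # True: token was free
second = tokens.acquire("sample.fastq", "node-b")  # False: node-a owns it
tokens.release("sample.fastq", "node-a")
third = tokens.acquire("sample.fastq", "node-b")   # True: token was released

layout = stripe(b"ACGTACGTAC", num_disks=3, block_size=2)
print(first, second, third)
print(layout)  # [[b'AC', b'GT'], [b'GT', b'AC'], [b'AC']]
```

Striping lets several disks and servers serve different blocks of the same file in parallel, which is the general reason distribution of this kind is associated with higher throughput.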

Meanwhile, I/O performance would be improved by installing Perseus at sequencing centers via a Network Shared Disk protocol or direct attachment to a storage area network.

The company did not reveal pricing for its Perseus system.

Given the soaring amounts of sequencing data being generated by relatively well-funded genome researchers, the key questions appear to be how soon the life sciences will be generating yottabytes of genome data and how it will all be managed and stored. Emerging solutions are sure to drive storage innovation as the human and other genomes are mapped.

About the author: George Leopold

George Leopold has written about science and technology for more than 30 years, focusing on electronics and aerospace technology. He previously served as executive editor of Electronic Engineering Times. Leopold is the author of "Calculated Risk: The Supersonic Life and Times of Gus Grissom" (Purdue University Press, 2016).

4 Responses to Genome Researchers Battling Data Storage Bottlenecks

  1. jimheffner says:

    The NSA has bin der and dun dat…..sorta. Who knows more about handling metadata than the boyz and gurlz at the Puzzle Palace?

  2. gdoc says:

    data in the yottabytes+? even random data gets more possible to compress, due to likelihood of larger chunks of repeated data increasing. this amount of data, i would use a learning algorithm, and have the data constantly being compressed.

  3. Glenn K. Lockwood says:

    “The cost of DNA sequencing is dropping faster than the cost of storing a byte of sequencing data.”

    The solution, then, is to simply not store all the sequencing data. If it’s cheaper to re-sequence a genome than store it, then that’s what the industry will do. Storing blood samples in a fridge is both cheaper and more storage-dense than buying racks and racks of disk.

    “The downside is limited bandwidth at data centers, which slows the transfer of gene data.”

    The rate of data generation from a sequencer is quite slow. The problem arises when researchers generate and store data locally, then try to bulk-upload it to the cloud. Several sequencing platforms come with the ability to stream their output directly to the cloud, severely mitigating this problem.

    It seems to me like the bioinformatics industry is already beginning to find solutions to these data problems, and they don’t involve just buying more and more disk.

  4. John Aiken says:

    A multi-tiered storage approach becomes interesting when using a tape-based tier for archiving and data-protection.

    In the hands of experienced GPFS architects, IBM’s tools allow blending tier-1 disk with a tape-based pool; using a particular IBM tool, we’ve delivered 30+PB environments at under $100/TB.

    To be relevant, tape has to deliver a massive advantage (both CAPEX and OPEX) while delivering Tier-1 disk performance coupled with automated, policy-driven ILM to ensure files reside on the most cost-appropriate storage tier.

    Also, true data protection (i.e., backup) is possible with this approach. Otherwise, the ‘disk-only’ folks are only too happy to sell 2x the amount of Tier-1 disk required to support off-site replication (btw, replication does not equal backup).

    For a broader sense of what’s possible using the IBM GPFS file system, talk to experienced GPFS solution architects – you might be surprised!

    John Aiken
