Genome Researchers Battling Data Storage Bottlenecks
Genome sequencing, which determines the complete DNA sequence of an organism, is generating petabytes of raw and processed data that are proving extremely difficult to manage and store.
Here is a key metric that illustrates the problem: The cost of DNA sequencing is dropping faster than the cost of storing a byte of sequencing data, according to government estimates cited in a white paper on storing and managing genome sequencing data. The problem is complicated by the fact that researchers seldom if ever delete gene data.
To get their arms around the data management problem, researchers are looking for affordable, scalable data storage with robust I/O performance. But shortfalls in data integrity, memory I/O and other areas have stymied otherwise booming genome research.
In one example that illustrates the dimensions of the storage problem, a university sequencing center quickly used up about 90 percent of its 5 petabytes of available data storage. Additional federal funding is intended to double storage capacity at the sequencing center. Moreover, power consumption at the center is so high that it requires its own electrical substation.
Another strategy is to upload genome-sequencing data to relatively inexpensive cloud storage facilities. The downside is limited bandwidth at data centers, which slows the transfer of gene data.
The white paper also cites the example of one European researcher who uploaded 100 gigabytes of compressed data to an Amazon cloud storage facility in northern Virginia using 40 computers and a very fast Internet connection. The transfer still took an hour and fifteen minutes.
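To put that transfer in perspective, the short Python sketch below works out the effective throughput from the two figures quoted above and extrapolates to a full petabyte. It assumes decimal units (1 gigabyte equals 10**9 bytes) and a constant transfer rate; the function names are illustrative and do not come from the white paper.

```python
# Figures quoted in the article: 100 gigabytes moved in 1 hour 15 minutes.
# Assumption: decimal units, so 1 GB = 10**9 bytes and 1 PB = 10**15 bytes.
DATA_BYTES = 100 * 10**9
DURATION_SECONDS = 75 * 60


def throughput_megabits_per_second(num_bytes, seconds):
    """Average transfer rate in megabits per second."""
    return num_bytes * 8 / seconds / 10**6


def hours_to_transfer(num_bytes, megabits_per_second):
    """Hours needed to move num_bytes at a constant rate."""
    return num_bytes * 8 / (megabits_per_second * 10**6) / 3600


rate = throughput_megabits_per_second(DATA_BYTES, DURATION_SECONDS)
petabyte_hours = hours_to_transfer(10**15, rate)

print(f"Effective rate: {rate:.0f} Mbit/s")
print(f"One petabyte at that rate: {petabyte_hours / 24:.0f} days")
```

Under those assumptions the upload averaged roughly 178 megabits per second, and a single petabyte sent at the same rate would take about 521 days.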
Along with high-performance, scalable and affordable storage, researchers require integrated software that allows them to better manage petabytes of data and billions of files, the white paper emphasizes. In the era of open data, these files must also be accessible to multiple researchers who are liable to generate redundant sets of the same sequencing data.
Researchers exploring the frontiers of genome sequencing also place a premium on data integrity, since the mountains of data they generate must be stored for years as colleagues sift through them.
Beyond boosting performance and bandwidth and lowering costs, another potential solution is "tiered storage."
"Multiple tiers of storage allow higher-cost technologies for fast access to be blended with lower-cost near-line and archival storage," the white paper notes. "Managing this dynamic prioritization of data access and tiered storage is an ongoing challenge for researchers."
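The prioritization the white paper describes can be pictured as a placement policy. The Python sketch below is a minimal illustration that assigns each file to a tier based on how recently it was accessed. The tier names, the 30-day and 365-day thresholds and the SequencingFile type are all hypothetical and are not taken from the white paper or from any product.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
FAST_TIER_MAX_IDLE_DAYS = 30
NEARLINE_MAX_IDLE_DAYS = 365


@dataclass
class SequencingFile:
    name: str
    days_since_last_access: int


def choose_tier(f):
    """Return the storage tier for a file based on how long it has been idle."""
    if f.days_since_last_access <= FAST_TIER_MAX_IDLE_DAYS:
        return "fast"        # higher-cost storage for fast access
    if f.days_since_last_access <= NEARLINE_MAX_IDLE_DAYS:
        return "near-line"   # lower-cost storage, slower access
    return "archive"         # cheapest tier for long-term retention


files = [
    SequencingFile("run_a.bam", 3),
    SequencingFile("run_b.bam", 120),
    SequencingFile("run_c.bam", 900),
]
placement = {f.name: choose_tier(f) for f in files}
print(placement)
```

A real policy would likely weigh more than idle time, such as file size, project status and retrieval cost, which is part of why the white paper calls this an ongoing challenge.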
A potential solution from HPC specialist RAID Inc. of Andover, Massachusetts, is built around software-defined storage and IBM's General Parallel File System. RAID claims its Perseus system addresses many of the pain points faced by genome researchers, including a maximum file system size of 1 million yottabytes (1 yottabyte equals 1 billion petabytes).
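The scale of that limit is easier to grasp with a quick unit check. The Python sketch below assumes decimal prefixes (1 petabyte equals 10**15 bytes, 1 yottabyte equals 10**24 bytes) and compares the claimed ceiling with the 5-petabyte sequencing center mentioned earlier; the variable names are illustrative.

```python
# Assumption: decimal (SI) prefixes.
BYTES_PER_PETABYTE = 10**15
BYTES_PER_YOTTABYTE = 10**24

# Check the article's parenthetical: petabytes in one yottabyte.
petabytes_per_yottabyte = BYTES_PER_YOTTABYTE // BYTES_PER_PETABYTE

# The claimed ceiling of 1 million yottabytes, expressed in bytes.
max_file_system_bytes = 10**6 * BYTES_PER_YOTTABYTE

# How many 5-petabyte sequencing centers would fit under that ceiling.
five_pb_centers = max_file_system_bytes // (5 * BYTES_PER_PETABYTE)

print(f"Petabytes per yottabyte: {petabytes_per_yottabyte:,}")
print(f"5 PB centers under the limit: {five_pb_centers:,}")
```

With decimal prefixes, one yottabyte is indeed 1 billion petabytes, and the claimed limit would cover 200 trillion storage pools the size of the university center's.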
A token mechanism is employed for data integrity to ensure only one file "owner" at any given time. Tiered storage and overall storage performance are said to be improved by distributing gene data across multiple disks attached to multiple servers.
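The single-owner idea can be sketched in a few lines of Python. This is a simplified, single-process illustration of a token that only one holder can own at a time. It is not how Perseus or the General Parallel File System implements tokens, and the FileToken class and node names are hypothetical.

```python
import threading


class FileToken:
    """Toy ownership token: at most one node owns the file at any time."""

    def __init__(self):
        self._lock = threading.Lock()  # guards changes to the owner field
        self._owner = None

    def acquire(self, node):
        """Grant the token if it is free or already held by this node."""
        with self._lock:
            if self._owner is None:
                self._owner = node
                return True
            return self._owner == node

    def release(self, node):
        """Free the token, but only if this node is the current owner."""
        with self._lock:
            if self._owner == node:
                self._owner = None
                return True
            return False


token = FileToken()
first = token.acquire("node-a")     # token is free, so node-a becomes owner
second = token.acquire("node-b")    # refused while node-a holds the token
released = token.release("node-a")  # node-a gives the token back
third = token.acquire("node-b")     # now node-b can take ownership
print(first, second, released, third)
```

In a distributed file system the token would be granted over the network by a coordinating service rather than by a local lock, but the invariant is the same: a write proceeds only while the writer holds the token.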
Meanwhile, I/O performance would be improved by installing Perseus at sequencing centers via a Network Shared Disk protocol or direct attachment to a storage area network.
The company did not reveal pricing for its Perseus system.
Given the soaring amounts of sequencing data being generated by relatively well-funded genome researchers, a key question appears to be how soon the life sciences will be generating yottabytes of genome data, and how it will all be managed and stored. Emerging solutions are sure to drive storage innovation as the human and other genomes are mapped.