Advanced Computing in the Age of AI | Wednesday, June 19, 2024

SGI LiveArc Data Management Advances Cancer Research 

Established in 2000, the University of Queensland's Institute for Molecular Bioscience (IMB) in Brisbane, Australia has become a leading center for molecular bioscience research. The Queensland Centre for Medical Genomics (QCMG), which is part of the Institute, uses SGI's LiveArc data management software to support its human cancer genome sequencing work. LiveArc manages QCMG's data and metadata throughout their entire lifecycle.

The QCMG is part of the International Cancer Genome Consortium (ICGC), a worldwide effort to sequence 50 different types of cancers from 25,000 individuals. By mapping the genetic changes that lead to these cancers, researchers are enabling a more complete understanding of the mechanisms that lead to genetic instability and, ultimately, cancer itself.

QCMG's mission is to sequence and characterize pancreatic and ovarian cancer, two of the most deadly forms of cancer in the developed world. Pancreatic cancer has a 95 percent mortality rate within 12 months of diagnosis, and ovarian cancer, while not quite as deadly, currently has no screening test. It is thus not often discovered until it has spread, which complicates treatment.

According to a case study published by SGI, QCMG runs 11 Life Technologies ABI SOLiD V4 genome sequencers and a Life Technologies NGS 5500xL, each of which produces half a terabyte of summarized data per week, for a total of 6 terabytes per week across the 12 instruments. As sequencer technology is updated, data streams continue to multiply. IT managers face the challenge of cost-effectively providing higher levels of utilization across various workflows and data types.
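The weekly total follows directly from the sequencer counts quoted above. A minimal sketch of the arithmetic is shown here; the variable names are illustrative, and the yearly figure is an extrapolation that assumes constant output, which the case study does not state.

```python
# Figures quoted in the article; names are illustrative only.
SOLID_V4_SEQUENCERS = 11
NGS_5500XL_SEQUENCERS = 1
TB_PER_SEQUENCER_PER_WEEK = 0.5


def weekly_volume_tb(sequencers: int, tb_per_sequencer: float) -> float:
    """Return the summarized data volume produced per week, in terabytes."""
    return sequencers * tb_per_sequencer


total_sequencers = SOLID_V4_SEQUENCERS + NGS_5500XL_SEQUENCERS
weekly_tb = weekly_volume_tb(total_sequencers, TB_PER_SEQUENCER_PER_WEEK)

# Assumption (not from the article): output stays constant for 52 weeks.
yearly_tb = weekly_tb * 52

print(f"{total_sequencers} sequencers -> {weekly_tb} TB per week")
print(f"Extrapolated yearly volume: {yearly_tb} TB")
```

At a constant rate, that would put the summarized data alone in the range of a few hundred terabytes per year, before any secondary or tertiary analysis output is added.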

All this data needs to be catalogued with metadata from the sequencers and the Laboratory Information Management System (LIMS), and then directed across networks to the HPC clusters and storage for processing and analysis. Managing this data and the associated workflow presents major practical challenges for the QCMG staff.

"We need to keep the operations side of the QCMG lean so we can concentrate on research, and that means automating as many of our workflows as possible. That's where we're using LiveArc," explains John Pearson, senior bioinformatics manager at QCMG. "To complete an analysis, we have to manage sequencing, storage and computational resources, as well as move raw and derived datasets from resource to resource."

SGI's LiveArc digital content management software enables automated data management and storage for each step of the QCMG workflow. It handles the data, metadata and workflow processes in QCMG's processing pipeline and is responsible for the following:

• Ingesting data and metadata from the genome sequencers and Laboratory Information Management System (LIMS) for the wet lab.

• Replicating metadata and data from the local data store to a highly available, long-term tiered data store, using the SGI DMF tiered storage virtualization solution with geographically distributed tape libraries.

• Providing sophisticated search facilities that allow researchers to select the data required for transformation and analysis.

• Automatically creating workflow jobs that move data to high-speed scratch storage for processing on an SGI cluster system.

• Re-ingesting the resulting secondary and tertiary data and metadata from the HPC analysis process back into the LiveArc managed data repository.
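The five responsibilities listed above can be sketched as a small in-memory pipeline. This is a toy illustration only: it does not use LiveArc's actual interface, and every class, function, field, and path name below (`Repository`, `create_jobs`, `/scratch`, the sample records) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Asset:
    asset_id: str
    metadata: dict
    derived_from: Optional[str] = None  # parent asset for analysis output


class Repository:
    """Toy in-memory stand-in for a managed data repository (hypothetical)."""

    def __init__(self):
        self.assets = {}   # local data store
        self.archive = {}  # stands in for the long-term tiered store

    def ingest(self, asset_id, sequencer_meta, lims_meta, derived_from=None):
        # Steps 1 and 5: catalogue data with merged sequencer and LIMS metadata.
        asset = Asset(asset_id, {**sequencer_meta, **lims_meta}, derived_from)
        self.assets[asset_id] = asset
        return asset

    def replicate(self, asset_id):
        # Step 2: copy the asset record to the archive store.
        self.archive[asset_id] = self.assets[asset_id]

    def search(self, **criteria):
        # Step 3: select assets whose metadata matches every criterion.
        return [a for a in self.assets.values()
                if all(a.metadata.get(k) == v for k, v in criteria.items())]


def create_jobs(assets, scratch_dir="/scratch"):
    # Step 4: one job description per asset, staged under scratch storage.
    return [{"input": a.asset_id, "workdir": f"{scratch_dir}/{a.asset_id}"}
            for a in assets]


repo = Repository()
repo.ingest("run-001", {"instrument": "SOLiD-4"},
            {"sample": "S1", "cancer_type": "pancreatic"})
repo.ingest("run-002", {"instrument": "5500xL"},
            {"sample": "S2", "cancer_type": "ovarian"})

for asset_id in list(repo.assets):
    repo.replicate(asset_id)

selected = repo.search(cancer_type="pancreatic")
jobs = create_jobs(selected)

# Step 5: re-ingest the (pretend) analysis output, linked to its source run.
for job in jobs:
    repo.ingest(job["input"] + "-aligned", {"stage": "secondary"}, {},
                derived_from=job["input"])

print(jobs)
print(sorted(repo.assets))
```

Recording `derived_from` on re-ingested results is one simple way to keep secondary and tertiary data traceable to the sequencing run that produced them.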

[Figure: QCMG SGI LiveArc dataflow]

Since first deploying LiveArc in 2010, QCMG staffers have automated more than 80 percent of their data and metadata capture, analysis, processing and long-term archival requirements, from the LIMS all the way to cluster processing and archival storage. The resulting data and metadata are shared with medical researchers around the world to advance the understanding of cancer and lay the groundwork for improved treatment options.

LiveArc is available as an add-on to SGI's InfiniteStorage arrays as part of an OEM agreement between SGI and Arcitecta Pty Ltd of Australia. The platform employs a binary XML object database that, according to SGI, delivers orders of magnitude better performance than relational and other databases while using a much smaller memory and database footprint.

LiveArc also features minimal administration requirements and open standards data portability. The software supports replication across multiple sites, and provides a parallel federated search and data administration capability across multiple repositories. It was designed to be highly scalable to support expanded data volume and increased file count.
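The general idea of a parallel federated search can be illustrated with a short sketch: the same query is sent to every repository concurrently and the results are merged. This is not LiveArc's implementation or API; the site names, records, and functions below are hypothetical, and plain Python lists stand in for remote repositories.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical repositories: site name -> list of metadata records.
REPOSITORIES = {
    "site-a": [
        {"id": "a1", "cancer_type": "pancreatic"},
        {"id": "a2", "cancer_type": "ovarian"},
    ],
    "site-b": [
        {"id": "b1", "cancer_type": "pancreatic"},
    ],
}


def search_site(site, records, **criteria):
    """Return records from one site that match every criterion."""
    return [{"site": site, **record} for record in records
            if all(record.get(k) == v for k, v in criteria.items())]


def federated_search(repositories, **criteria):
    """Query all sites concurrently and merge results in submission order."""
    workers = max(1, len(repositories))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(search_site, site, records, **criteria)
                   for site, records in repositories.items()]
        merged = []
        for future in futures:
            merged.extend(future.result())
    return merged


hits = federated_search(REPOSITORIES, cancer_type="pancreatic")
print([(hit["site"], hit["id"]) for hit in hits])
```

Tagging each hit with its originating site keeps merged results attributable, which matters when the same query spans several repositories.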

About the author: Tiffany Trader

With over a decade’s experience covering the HPC space, Tiffany Trader is one of the preeminent voices reporting on advanced scale computing today.