Advanced Computing in the Age of AI | Monday, June 24, 2024

Can Biology Handle Big Data? 

When the Human Genome Project began in 1990, it was one of the most ambitious scientific endeavors to date. But today, sequencing a human genome is considered old hat compared with sequencing the microorganisms in our own bodies, which together are made up of roughly 100 billion DNA base pairs to our 3 billion.

The effort to study and understand the cornucopia of microbes present within the human body, called the human microbiome, is quickly bringing the field of biology into big data’s domain.

Quanta Magazine’s Emily Singer observed that this shift is due in part to the rapidly falling cost of DNA sequencing, which over the past five years has dropped even more quickly than the cost of computer chips. With the help of resources made possible by public genome repositories, such as that of the National Center for Biotechnology Information, scientists are now producing 15 petabytes of sequence-related information every year, which presents the field with a problem.

Biologists will soon have to overcome obstacles ranging from how to move and store data to how to integrate and analyze it, which Singer says will require a cultural shift for the field in addition to a shift in infrastructure. Even today, the easiest way to transfer biological data is to ship hard drives via postal mail, as computing can often be more expensive than the experimentation itself.

“In physics, a lot of effort is organized around a few big colliders,” said Michael Schatz, a computational biologist at Cold Spring Harbor Laboratory in New York. “In biology, there are something like 1,000 sequencing centers around the world. Some have one instrument, some have hundreds.” That means that for biology, big data brings with it a significant problem in aggregating and analyzing individual pieces of the larger genome puzzle.

But according to Jeff Lichtman, a Harvard neuroscientist, the question that awaits beyond funding and advances in hardware, software and analytics is just how much big data can deliver. Nonetheless, Lichtman is sure that findings will come in time. “I feel confident that you don’t have to know beforehand what questions to ask,” he said. “Once the data is there, anyone who has an idea has a dataset they can use to mine it for an answer.”