Advanced Computing in the Age of AI | Wednesday, April 24, 2024

Lighten ‘Data Gravity,’ Ease Public Cloud Data Transfer for Analytics Workloads 


Big data is opening a lot of doors for companies, providing real insights for those that want to understand their customers better and identify new products and services to offer. But big data has a big problem: managing and storing all of that data.

When Apache Hadoop arrived, it was considered an on-site storage panacea. But then came the proliferation of Hadoop projects, presenting major challenges. First, data scientists were trying to contort Hadoop to do things it wasn’t built to do. Second, they discovered they had multiple copies of data across silos, with each line of business creating its own “version of the truth” through multiple iterations and transformations. And finally, because traditional Hadoop could be scaled only in tightly coupled blocks of storage and compute, enterprises found themselves over-provisioned on compute when all they needed was more storage.
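The over-provisioning problem is easy to see with a little arithmetic. The sketch below assumes a cluster that can only grow by adding identical nodes, each bundling a fixed amount of disk and a fixed number of cores. The node size and workload figures are hypothetical, chosen only for illustration.

```python
import math


def coupled_nodes_needed(storage_tb, cores, tb_per_node, cores_per_node):
    """Nodes required when storage and compute can only be added together."""
    nodes_for_storage = math.ceil(storage_tb / tb_per_node)
    nodes_for_compute = math.ceil(cores / cores_per_node)
    # The larger requirement dictates the cluster size.
    return max(nodes_for_storage, nodes_for_compute)


def idle_cores(storage_tb, cores, tb_per_node, cores_per_node):
    """Cores purchased beyond what the workload actually needs."""
    nodes = coupled_nodes_needed(storage_tb, cores, tb_per_node, cores_per_node)
    return nodes * cores_per_node - cores


# Hypothetical workload: 960 TB of data, but only 128 cores of analytics demand.
# Hypothetical node: 48 TB of disk and 32 cores.
nodes = coupled_nodes_needed(960, 128, tb_per_node=48, cores_per_node=32)
unused = idle_cores(960, 128, tb_per_node=48, cores_per_node=32)
print(f"{nodes} nodes purchased, {unused} cores sitting idle")
```

In this made-up case storage drives the purchase, so most of the compute that comes bundled with each node goes unused.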

In response, many companies have created a “data lake,” a single vast pool into which they pour raw, unfiltered, untreated data. Often that data lake is located in a public cloud, which has the scalability to handle this ever-expanding pool, and corporate users can use the analytics tools offered by the cloud provider. Other users have tools that retrieve data from the cloud so they can run analytics on-premises. Still other companies are investing in on-premises storage that users can draw from to run analytics.

Data Gravity

Unfortunately, all of these variations have drawbacks. Companies find it expensive to handle the high volume of data generated every day, not all of it equally valuable. If the data is stored in a public cloud, transferring it into, and especially out of, the cloud for analytics is both costly and unpredictable.
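To make “costly and unpredictable” concrete, here is a rough sketch of how a monthly data-transfer bill might be estimated. The per-gigabyte price and the workload figures are placeholders, not any provider’s actual rates; the point is only that the bill scales with how often analysts pull data, which is hard to forecast.

```python
def monthly_egress_cost(tb_per_run, runs_per_month, price_per_gb):
    """Estimate the monthly cost of pulling data out of a public cloud.

    Uses decimal units (1 TB = 1000 GB) and a flat per-GB price, which is
    a simplification: real price lists are often tiered.
    """
    gb_transferred = tb_per_run * 1000 * runs_per_month
    return gb_transferred * price_per_gb


# Hypothetical figures: each analytics run pulls 2 TB at $0.05 per GB.
PRICE_PER_GB = 0.05  # placeholder rate, not a quoted price

quiet_month = monthly_egress_cost(2, runs_per_month=10, price_per_gb=PRICE_PER_GB)
busy_month = monthly_egress_cost(2, runs_per_month=60, price_per_gb=PRICE_PER_GB)

print(f"quiet month: ${quiet_month:,.0f}")
print(f"busy month:  ${busy_month:,.0f}")
```

The same dataset and the same price produce a sixfold swing in cost, driven entirely by how many times the data has to leave the cloud.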

This results in what’s known as “data gravity” or “data inertia.” Some organizations try to avoid this by expanding their compute resources by adding more Hadoop data nodes to support the analytical needs of different divisions and departments. But then they run the risk of over-provisioning compute resources and incurring even higher costs.

Additionally, some data scientists find themselves limited by the analytics tools and frameworks offered by their cloud provider. As we know with all things big data, obsolescence happens at the speed of thought. And users’ needs vary wildly; some may need to run web-scale analytics, while others may require very small datasets. Being tied down by outdated analytics tools is as good as not investing in big data analytics at all.

Ending Storage Sprawl

For all these reasons, organizations are increasingly interested in software-defined storage (SDS). Open source SDS solutions break the connection between compute and storage, and thus end big data storage “sprawl.” By using SDS, companies can keep data on-site, stationary and in one place. In effect, this allows users to bring analytics to the data by having the analytics tools side-by-side with the data, eliminating the need to continually transfer it from the cloud.

SDS offers more control, multiple layers of cost efficiency and improved ability to scale to keep pace with data growth. Because the storage is software-defined, it stores only the information that it’s told to store, rather than creating a single, gigantic lake from which users must fish the data they require for their analytics workloads. This approach is more cost-effective, too: no more over-purchasing of compute power or unpredictable expenses incurred by moving data from the cloud and back again.

S3A, the Killer App

When Amazon Web Services created its Elastic MapReduce analytics tools, which let users conduct their own analytics, it encouraged third-party and open source providers to support an implementation of the S3 interface (S3A) with a similar look and feel. S3A can speak Hadoop at one end and object storage at the other, making it ideal for enterprises looking to connect to Hadoop and other analytics frameworks for compute, while decoupling from the storage substrate and deploying data on a seamlessly scalable elastic object store.
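In Apache Hadoop, S3A is configured through `fs.s3a.*` properties, and pointing it at an on-premises, S3-compatible object store is largely a matter of overriding the endpoint. The sketch below only assembles those settings as plain dictionaries; it does not start Hadoop or Spark. The property names are the standard S3A ones, while the endpoint URL, the credentials and the two helper functions are hypothetical.

```python
def s3a_properties(endpoint, access_key, secret_key, use_ssl=True):
    """Build Hadoop S3A settings for an S3-compatible object store."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Many non-AWS object stores expect path-style bucket addressing.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": "true" if use_ssl else "false",
    }


def as_spark_conf(hadoop_props):
    """Prefix Hadoop properties so Spark passes them through to Hadoop."""
    return {f"spark.hadoop.{key}": value for key, value in hadoop_props.items()}


# Hypothetical on-premises gateway and placeholder credentials.
props = s3a_properties(
    endpoint="https://objects.example.internal",
    access_key="EXAMPLE_KEY",
    secret_key="EXAMPLE_SECRET",
)

for key, value in as_spark_conf(props).items():
    # Avoid echoing the secret when printing the configuration.
    shown = "****" if key.endswith("secret.key") else value
    print(f"{key}={shown}")
```

With settings like these in place, an analytics job would read and write `s3a://bucket/path` locations as if they were an ordinary Hadoop file system, while the data itself stays in the object store.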

Open source SDS allows users to keep their existing analytics tools and to add new tools and apps as they emerge, without getting locked into a single vendor. It’s also highly scalable and flexible because it’s software-defined: the definition can be refined and expanded as needed to accommodate more data and more data streams. It can easily support web-scale big data analytics projects.

Irshad Raihan is a senior product team member at Red Hat Storage.