
New Tools Bridge the Data Lake-Data Warehouse Divide 


For 30+ years, we've had the notion of a data warehouse, a single, central store of vetted data that is used for authoritative analysis. But building the data warehouse is hard work – lots of discussion and negotiation on the right structure and meaning of the data, lots of technical heavy lifting to make the data represent what we want it to represent, and lots of work to keep the system available for use by a wide range of users.

Data lakes emerged over the past decade as a reaction to several factors that made data warehouses untenable for many projects: 1) the scale of data – datasets that exceeded the capabilities of traditional infrastructure and software, 2) the cost of data – systems whose total cost of ownership didn't represent a useful return on spend, and 3) the time to insight – how long it took for data to become available for analysis.

Data lakes are available under different brands, including cloud options such as Amazon S3, Azure ADLS, and Google Cloud Storage; on-premises offerings from storage vendors such as EMC and Cisco; and software offerings such as Apache Hadoop, distributed by Cloudera and Hortonworks.

These offerings address the limitations of the data warehouse by 1) providing a horizontally scalable solution aligned with modern hardware trends, 2) offering a cost model based either on open source software or on "pay by the drink" cloud services, and 3) requiring a much less rigorous "first step" for incorporating data into the system, through schema-on-read approaches that allow some workloads to be delivered to data consumers much more quickly.
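
To make the schema-on-read idea concrete, here is a minimal sketch in Python using PySpark, assuming raw JSON event files have already landed in cloud object storage; the bucket path and field names are hypothetical. The structure is applied when the data is read, not before it is loaded.

    # Minimal schema-on-read sketch with PySpark (illustrative only).
    # The S3 path and column names are hypothetical examples.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("schema_on_read_demo").getOrCreate()

    # No upfront modeling: read the raw JSON exactly as it landed in the lake.
    # Spark infers a schema at read time rather than requiring one at load time.
    events = spark.read.json("s3a://example-data-lake/raw/clickstream/")

    # Apply structure only where this particular analysis needs it.
    daily_counts = (
        events
        .withColumn("event_date", F.to_date("event_time"))
        .groupBy("event_date")
        .count()
        .orderBy("event_date")
    )

    daily_counts.show()

Nothing about the raw files had to be negotiated or modeled in advance; the tradeoff is that data quality and meaning are resolved at analysis time rather than guaranteed up front.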

While the talk in the market has been about replacing the data warehouse with the data lake, for most companies the data lake represents a complementary solution with its own advantages and tradeoffs. Most companies today have both, and are working to integrate the capabilities of both solutions to meet the needs of their data consumers.

The data lake paradigm grew rapidly across Global 2000 companies and beyond to become a mainstay of most enterprise architectures. However, the surrounding ecosystem of enterprise tools for data consumers remained largely incompatible with the data lake. As a result, the promise of this new approach was realized by only a small fraction of enterprise users – the 1 percent who are software engineers capable of working with these new systems through low-level interfaces in languages like Python, Java, and C#, as well as frameworks like Apache Spark.

Meanwhile, users of popular BI tools like Excel, Tableau, Power BI, MicroStrategy, Qlik, and others, who outnumber the 1 percent by a factor of 100 (at least), were left staring across a great divide between the world they know and a world of data they cannot reach without enormous support from the data engineers in IT. As a result, the vast majority of data consumers are unable to reach the data in the lake, and compromises are made that result in higher costs, slower time to insight, greater security risk, greater dependency on IT, and lower employee satisfaction. If those concerns sound somewhat generic, consider for a moment how many people in your organization rely on data for their day-to-day tasks, and how different their experience at work is from their experience answering questions on Google and other apps, where data is easy, instantaneous, and entirely self-service.

While data lakes provide an agile, low-cost way for companies to store their data, without the right tools to govern and access it – and without a clear goal for what it is supposed to achieve – the data lake can grow stagnant and become a data swamp.

Recently, a new class of solutions has emerged to simplify and accelerate how data consumers use data from the data lake. Data-as-a-Service (DaaS) platforms sit between the data lake and the many tools data consumers use – the BI tools mentioned above as well as data science platforms such as Python, R, and SAS.

DaaS platforms are designed to be self-service for data consumers, so they can do their jobs independently, without waiting on IT to make a copy of the data for their particular needs. These solutions also take care of some of the hardest data engineering problems – such as making the data interactive, securing it, and tracking data lineage – which have traditionally required enormous manual effort and expensive proprietary tools. In addition, DaaS platforms are open source, so they fit the way companies deploy and manage their infrastructure today.
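
How a consumer actually reaches that self-service layer varies by product, but DaaS platforms typically expose a standard SQL endpoint over ODBC or JDBC so that existing tools keep working unchanged. The sketch below is a generic illustration from Python, assuming an ODBC driver and a data source name of daas_endpoint have already been configured; the DSN, table, and column names are hypothetical.

    # Generic sketch of querying a DaaS SQL endpoint from Python (illustrative only).
    # Assumes an ODBC driver and a data source name "daas_endpoint" are configured;
    # the DSN, schema, table, and column names are hypothetical.
    import pyodbc

    conn = pyodbc.connect("DSN=daas_endpoint", autocommit=True)
    cursor = conn.cursor()

    # The same SQL a BI tool would issue; the DaaS layer resolves it against data
    # in the lake, so no extract or copy is made for this analysis.
    cursor.execute("""
        SELECT region, SUM(amount) AS total_sales
        FROM sales.orders
        GROUP BY region
        ORDER BY total_sales DESC
    """)

    for region, total_sales in cursor.fetchall():
        print(region, total_sales)

    conn.close()

The same connection would work from Tableau, Power BI, or an R or SAS session, which is the point: the data consumer issues familiar SQL, and the platform handles where and how the underlying lake data is read.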

Every company today is data-driven. As we move to a mixed model of data warehouses and data lakes to support the diverse analytical needs of the data consumer, it is essential to consider how the data will be used by the 99 percent. DaaS platforms bridge the gap between these two worlds and help companies get more value from their data, faster. In addition, these solutions make data engineers more productive and data consumers more self-sufficient. Because they are open source, DaaS platforms are a new class of technology that can be part of the enterprise architecture of every company.

Kelly Stirman is the VP of Strategy at Dremio.
