Datafold Raises $20M Series A To Help Data Teams Deliver Reliable Products Faster
SAN FRANCISCO, Nov. 9, 2021 — Datafold, a data reliability platform that automates the most tedious parts of data engineering workflows, today announced its successful $20 million Series A funding round. Backed by NEA and Amplify Partners, Datafold seeks to expand its proactive approach to data reliability to help companies unlock more growth using high-quality data.
“Poor data quality is the primary challenge for companies to become data-driven and a constant source of stress and overwork for data professionals,” said Gleb Mezhanskiy, Datafold founder and CEO. “Top tech companies pour millions of dollars into creating internal data reliability tools and processes, while the vast majority of data teams have to rely on tedious manual testing or risk shipping incorrect data to their stakeholders. We founded Datafold to enable every team that leverages data to make better decisions with tools that help them move fast with high confidence.”
Processing data at scale is more affordable than ever thanks to the modern data stack, but data teams are grappling with an explosion of data pipelines and BI assets — and the resulting lack of understanding, trust, and reliability of data. Data quality has become the number one impediment to leveraging analytics and expanding AI/ML for data-driven companies. This problem is exacerbated by the lack of adequate tools for data testing, monitoring, and observability, along with chronic understaffing of data teams.
As evidenced by top unicorn customers including Thumbtack, Patreon, Faire, and Dutchie, Datafold’s proactive approach to data quality is fundamental to building and maintaining the highest quality data for data-driven organizations. Datafold is on a mission to ensure that no data engineer faces sleepless nights worrying about a hotfix that broke the data or cost their company millions.
Contrarian Approach to Data Reliability
While data quality and observability tooling has been evolving for years, alternative solutions have focused primarily on detecting data anomalies in production. That has certainly been an improvement over no observability at all, but after-the-fact detection has limited value because it is reactive. By the time you learn about broken data, the damage is likely already done: executives have made decisions based on wrong dashboard numbers, or ML models have been retrained with bias.
Datafold’s contrarian approach stems from a different question: How can data practitioners prevent data bugs in the first place? By introducing automated data testing in the change management workflow and integrating it in the CI/CD and code repositories, data creators can catch most issues before they get into production. This is also when data developers have the most time and attention to fix those bugs.
“What started with Data Diff as a tool to prevent breaking changes from merging into production has evolved into a proactive philosophy for data reliability engineering,” explained Mezhanskiy. “The Datafold platform is designed to be an end-to-end solution to the biggest bottleneck for data teams delivering high-quality data products. When teams can develop quickly and with confidence, they are free to create truly revolutionary data insights.”
Datafold is built on the premise of integrating into the daily workflows of data professionals while shifting reliability “to the left,” catching issues as early in the process as possible. As data pipelines can vary as much as data team roles, the Datafold platform proactively mitigates issues across a variety of tools. For example, column-level lineage gives visibility into dependencies. This aids root-cause analysis during incidents and can be used to map out potential problems from changing data models or pipeline refactoring.
“Data-driven organizations are becoming the norm across all verticals — every company is a data company now,” said Peter Sonsini, general partner at NEA and incoming Datafold board member. “The proactive approach to reliability is standard best practice in software but is still nascent in the data space. I’m excited to be a part of Datafold’s future disruptions in the market.”
Long gone are the days when moving fast and breaking things was an option for modern, data-driven companies. Every aspect of business decision-making comes back to high-quality, reliable data. In order to move fast with confidence in the data, the best teams are going beyond best-effort code reviews and 2 a.m. hotfixes, focusing on comprehensive engineering solutions.
“Data quality is critical to make the right decisions and products. However, improving data quality is tedious and challenging,” said Sarah Catanzaro, partner at Amplify Partners. “In contrast, Datafold enables data teams to build reliable data products fast and well. We invested in the company because with their platform, data teams can maximize their impact by iterating quickly without compromising quality.”
End-to-End Data Reliability Platform
As the process of developing data products spans multiple workflows and is often shared by multiple teams, Datafold covers each step.
Change management is one of the slowest and most error-prone workflows for data teams. Datafold’s flagship feature, Data Diff, clearly shows data practitioners how a change in the data processing code will impact the resulting data and downstream products, such as BI dashboards and ML models. Such information is very hard to obtain manually, and avoiding breaking changes in production typically takes hours or even days of checking.
When integrated into the CI/CD process, Data Diff automates the data QA process to ensure that every proposed change (pull request) is tested before it gets shipped to production. This saves hundreds of hours that would otherwise be spent on manual testing, creates a standardized testing process across all code changes, and increases the productivity of data teams. It also facilitates data democratization: the desire of every organization to have people outside the specially trained and always shorthanded data teams build data products themselves.
“Datafold’s Data Diff is the missing piece of the puzzle for data quality assurance. When I first heard about Datafold, all I could think was ‘Finally!’ It’s an unspoken problem that we all know about and no one wants to talk about,” said Dave Wallace, staff data engineer at Dutchie.
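To make the idea of a data diff concrete, here is a minimal sketch in Python. It is not Datafold's implementation: the function `diff_tables`, the in-memory row format, and the sample `prod` and `branch` tables are all hypothetical, chosen only to show how comparing two versions of a table by primary key surfaces added, removed, and changed rows.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def diff_tables(before: List[Row], after: List[Row], key: str) -> Dict[str, list]:
    """Compare two versions of a table, matching rows on a primary key.

    Hypothetical helper for illustration; a real tool would run this
    comparison inside the warehouse rather than in memory.
    """
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    # Keys present on only one side are removed or added rows.
    removed = sorted(k for k in before_by_key if k not in after_by_key)
    added = sorted(k for k in after_by_key if k not in before_by_key)

    # For keys present on both sides, list the columns whose values differ.
    changed = []
    for k in sorted(before_by_key.keys() & after_by_key.keys()):
        old, new = before_by_key[k], after_by_key[k]
        columns = sorted(c for c in old.keys() | new.keys() if old.get(c) != new.get(c))
        if columns:
            changed.append((k, columns))

    return {"added": added, "removed": removed, "changed": changed}


# Made-up sample data: the table as built by production code vs. by a pull request.
prod = [
    {"id": 1, "amount": 10.0, "status": "paid"},
    {"id": 2, "amount": 25.0, "status": "paid"},
    {"id": 3, "amount": 7.5, "status": "refunded"},
]
branch = [
    {"id": 1, "amount": 10.0, "status": "paid"},
    {"id": 2, "amount": 20.0, "status": "paid"},
    {"id": 4, "amount": 3.0, "status": "paid"},
]

result = diff_tables(prod, branch, key="id")
print(result)
```

In a CI/CD setting, a check along these lines would run on every pull request and report the summary to the reviewer before the change is merged.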
Another fundamental problem that Datafold solves is how hard it is for data teams and data users to understand the dependencies in data. Simple questions such as “Where does this number come from?” or “Will anything anywhere break if we rename that column?” are difficult or impossible to answer, given that it’s not uncommon for analytical warehouses to hold tens of thousands of tables and over a million columns, all intricately connected.
Using its own SQL compiler, Datafold analyzes every query ever executed in the data warehouse to produce a dependency graph that shows how data is produced and consumed, covering even correlated subqueries, CASE WHEN statements, and other complex queries. Numerous Datafold clients also use these features when onboarding new data practitioners, letting them explore the data more quickly and easily, without additional resources or complicated knowledge transfer.
“Datafold’s column-level lineage gives confidence in the whole system. If my stakeholders ask, ‘Why is this dashboard out of date?’ I can answer in 25 seconds instead of digging through pull requests for hours. As a product owner, I can understand how the rest of the company makes decisions based on the data we produce. It brings data confidence and visibility to the company,” said Maura Church, director of data science at Patreon.
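As an illustration of what a column-level dependency graph makes possible, the sketch below answers the question of what might break if a column is renamed. It assumes the lineage has already been extracted as (upstream, downstream) pairs. The edge list, the column names, and the function `downstream_columns` are hypothetical and do not reflect Datafold's API or internal representation.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (upstream column, downstream column)


def downstream_columns(edges: Iterable[Edge], start: str) -> List[str]:
    """Return every column that depends, directly or indirectly, on `start`."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for upstream, downstream in edges:
        graph[upstream].add(downstream)

    # Breadth-first traversal; `seen` also guards against cycles.
    seen: Set[str] = {start}
    queue = deque([start])
    while queue:
        column = queue.popleft()
        for child in graph.get(column, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)

    return sorted(seen - {start})


# Made-up lineage edges for a small warehouse.
lineage = [
    ("raw.orders.amount", "staging.orders.amount_usd"),
    ("staging.orders.amount_usd", "marts.revenue.daily_total"),
    ("raw.orders.status", "staging.orders.is_paid"),
    ("staging.orders.is_paid", "marts.revenue.daily_total"),
    ("marts.revenue.daily_total", "dashboard.revenue.total"),
]

impacted = downstream_columns(lineage, "raw.orders.status")
print(impacted)
```

Reversing the direction of the edges in the same traversal would answer the complementary question of where a given number comes from.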
An end-to-end approach to data reliability means that when outside data sources break or ingestion pipelines go down, Datafold’s smart alerts can promptly notify those who need to be informed. This lets data owners take any necessary actions and inform their stakeholders, further building trust and confidence in data teams and their products.
“Datafold’s proactive features reduce the need for alerts, but some changes are out of our control and alerts can give a ping. Anomaly detection lets me know instantly that there was no Facebook spend the day before, instead of having to look at 20-plus Tableau dashboards, so that I immediately see and know that something is going on,” said Callie Davis, vice president of customer data and insights at Nutrafol.
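The kind of alert described here can be sketched with a simple statistical check. The example below swaps in a basic z-score test as a stand-in; the source does not say which method Datafold's anomaly detection uses. The function `is_anomalous`, the threshold of 3.0, and the spend figures are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history` (a plain z-score test)."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Made-up daily ad spend for the previous week.
daily_spend = [1020.0, 980.0, 1005.0, 995.0, 1010.0, 990.0, 1000.0]

no_spend_flagged = is_anomalous(daily_spend, 0.0)      # a day with no spend at all
normal_day_flagged = is_anomalous(daily_spend, 1008.0)  # an ordinary day
print(no_spend_flagged, normal_day_flagged)
```

A production system would likely also account for seasonality and trends, which a fixed z-score threshold ignores.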
Datafold is a data reliability platform that helps data teams deliver reliable data products faster. It has a unique ability to identify, prioritize, and investigate data quality issues proactively before they affect production. Founded in 2020 by veteran data engineers, Datafold has raised $22 million from investors including NEA, Amplify Partners, and Y Combinator. Customers include Thumbtack, Patreon, Truebill, Faire, and Dutchie. For more information, visit www.datafold.com.