Databricks on Mission to Build First Enterprise AI Platform
When Databricks emerged on the scene back in 2013, some people assumed it would follow in the footsteps of other commercial open source vendors making waves at the time. After all, Databricks was founded by the people behind Apache Spark, including CEO Ali Ghodsi and Matei Zaharia, who created Spark while studying at UC Berkeley’s AMPLab, as well as other folks associated with the AMPLab or Berkeley’s computer science department, such as Ion Stoica, Reynold Xin, Andy Konwinski, and Arsalan Tavakoli-Shiraji.
If Spark was the Next Big Thing to come after Hadoop, as it certainly was and continues to be, then it seemed to follow that Databricks would find some way to monetize Spark in a similar manner. Databricks would be to Spark what Cloudera is to Hadoop or what MongoDB is to NoSQL databases. But that’s not how the story played out. Instead of riding Spark as a one-trick pony, Databricks has gone in a different direction. While Spark plays a part in the company’s plan, it’s just one element of an increasingly diverse set of software as a service (SaaS) offerings that Databricks operates on behalf of its customers across all the major public cloud platforms.
Emergence of Enterprise AI
In an interview with Datanami, Ghodsi laid out his pitch and explained how these offerings are all supporting cast members in Databricks’ ultimate effort, which is to build the world’s first enterprise AI platform.
“Everybody is always wondering who’s going to be the first company that goes public that is an enterprise AI platform company. There hasn’t been one yet, right?” he said. “How do you help enterprises build AI into their existing software and solutions that they already have? Every software that’s on the planet that we know today – I mean exactly every software that exists – will over the next 10 years become much more intelligent. They will add lots of AI capabilities over the next 10 years, or they will go out of business and be out-competed by some other company that has those capabilities.
“We’ve already seen it with Uber displacing the medallions, Airbnb versus hotels, Amazon versus retailers,” Ghodsi continued. “It’s already happened. All those companies are heavily using AI. What about the rest of the Global 5000? They have software. They have customers that have been in business for many decades, maybe hundreds of years, and they have huge data sets. Can they leverage those and build an AI for their software? I think many of them will and they will survive and some will be displaced by new companies that leverage AI. What AI platform are they all going to leverage? Today there is no answer…So I think there’s a place where there will be platform company that builds that sort of AI platform that all these software companies will leverage in their solutions in the future.”
Enterprise computing being what it is, the industry will probably pick one, two, or three AI platforms that become the winners, the standards if you will, just as it picked relational databases to back the first generation of enterprise software (ERP, CRM, financials, HR, and so on), according to Ghodsi. There is currently no dominant enterprise AI platform, although there are players like Amazon Web Services, Microsoft, and Salesforce that are making clear strides to become one. In Ghodsi’s view, Databricks has the right elements in place to be a major player in the market, if not the dominant one. Software projects are part of the equation, but it goes beyond that, and reflects on Databricks’ core business model as well.
“The innovation around open source is absolutely key,” he said. “In our case there are four projects. [There’s] Spark, which everybody knows about. But the biggest innovation that we’ve done ever since we started on this is the Delta Lake project. Over 80 percent of our customers use that. In terms of value, that’s probably the most valuable open source project at Databricks. Even though it’s not as known as Apache Spark.”
MLflow is another core part of the Databricks playbook. The software, which is spearheaded by Zaharia when he’s not teaching at Stanford University, brings some standardization to the complex processes that data scientists oversee while building, testing, and deploying machine learning models. According to Ghodsi, MLflow is being downloaded 800,000 times per month, and has more contributors than Spark did at the same age.
The fourth core element for Databricks is Koalas, which is helping to bring the data science innovation in the Pandas community to Spark users. According to Ghodsi, Koalas lets data scientists go from programming in Pandas on a laptop to scaling the workloads up onto huge distributed Spark clusters with just a few API calls.
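A minimal sketch of that idea (the data and column names here are invented): the pandas version runs on a laptop, while the Koalas version of the same logic, shown commented out because it needs a running Spark installation, would distribute across a cluster.

```python
import pandas as pd

# Plain pandas: fine on a laptop, limited to one machine's memory.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
result = pdf.groupby("id")["value"].mean()

# With Koalas (since folded into Spark 3.2+ as pyspark.pandas),
# swapping the import is essentially the only change needed to run
# the same logic on a distributed Spark cluster:
#
#   import pyspark.pandas as ps
#   kdf = ps.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
#   result = kdf.groupby("id")["value"].mean()
```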
IP in the Delivery
But aside from the software itself, Ghodsi maintains that Databricks holds a key advantage with its business model. Ghodsi has consciously charted Databricks away from the classic commercial open source business model, where software is given away and the vendor charges for support and services. That model, which Ghodsi dubs the Red Hat model, works in an on-prem world, but it doesn’t have a solid place in the new cloud world.
“Our business model is different,” Ghodsi said. “Our business model is managed SaaS services in the cloud. Managing these open source projects in the cloud and renting them out to users is a much, much better business model. It has much lower churn. Customers are much happier and the revenue growth is just massive.”
The SaaS rental model also protects Databricks’ core asset: its intellectual property (IP). But Databricks’ core IP doesn’t exist in the software projects that it sponsors, which are open for all the world to see. Instead, Databricks’ most valuable IP exists in the tools and techniques it builds and uses to monitor and manage its customers’ software on the cloud at massive scale. That’s not something that leaks out as easily as bits and bytes do in the classic open source model.
“In the cloud things are very different,” Ghodsi said. “In the cloud you are renting the service from Databricks. We are on the hook to make sure it’s secure, that it’s reliable, and it’s available. We carry pager duty in the middle of the night and we monitor this stuff and we make sure that it’s always up and running. We make sure it’s always upgraded at any given time to the latest version. We are the ones who are responsible for all these things, not the local IT team from the company that bought the software.”
Databricks uses open source software like Kubernetes to help it scale the various data engineering, analytics, and machine learning workloads that its customers pay it for. It also develops its own proprietary software that it uses to keep the cloud services humming along day and night.
“Running the services is hard,” Ghodsi said. “Running a service at scale is just very hard. Just to give you a sense, we launch over 1 million virtual machines per day on Amazon Web Services. Launching 1 million VMs is not easy. Making sure that works flawlessly and it’s monitored and it’s secured and it’s reliable is hard. That’s one of the reasons people pay us.”
Companies like Uber, Airbnb, and Amazon (the retailer) have invested hundreds of millions of dollars into building their own advanced data engineering and AI systems, which has given them a competitive edge in their respective markets. Now Ghodsi wants to help all the other companies get rich building their own differentiated AI-powered services — or at least die trying.
“The key thing is we don’t want customers to have to worry about this stuff,” he said. “We’ll run it for them. They don’t have to worry about installing and managing and versioning the software. We want them focused on the AI problem they have, the business problems they have. …But I don’t think 10 years from now you want big Global 5000 companies doing this. I don’t understand why a pharmaceutical company which is trying to come up with drugs to cure things like chronic liver disease should have to focus on Kubernetes and managing Kubernetes clusters and configuring them. That should be under the hood and that should be happening behind the scenes, and that’s what we do.”
As far as business models go, it might just be a winner.
This article originally appeared in sister publication Datanami.