Advanced Computing in the Age of AI | Saturday, April 27, 2024

DataChat Delivers Data Exploration with a Dose of GenAI 

What if you could tell the computer how you want to explore a data set, and the computer would automatically execute the analysis and deliver you the results? That’s the idea behind DataChat, a generative AI-based data exploration and analytics tool that spun out of a University of Wisconsin-Madison research project and is now a commercial product.

Jignesh Patel, who is currently a computer science professor at Carnegie Mellon University and a co-founder of DataChat, recently sat down virtually with Datanami to chat about the nature of data exploration in the generative AI era and the new DataChat offering, which formally launched earlier this month at the Gartner Data & Analytics Summit.

The impetus for creating DataChat started back in 2016, when Patel was working as a computer science professor at University of Wisconsin-Madison and the CTO of Pivotal (now a part of VMware Tanzu and parent company Broadcom). The big data explosion was in full swing, Hadoop was the rallying point for new distributed frameworks, and data scientists were in big demand.

While the technology was evolving quickly, too many companies were spinning their wheels when it came to data analytics and exploration, and Patel sensed that something was missing from the equation.

“Every CTO, their first objective was to hire an army of data scientists. They couldn’t get enough of data scientists,” Patel said. “And what I had started to observe in the very early days is the way data scientists work. It’s all ad-hoc analytics. It’s unscripted, as opposed to the BI world, and you’re trying to get something from data in a non-linear path.”

Much of this data exploration work was done manually, using tools like Jupyter data science notebooks. Data scientists would explore a particular data set until something interesting popped out, then figure out a way to extract that particular piece of data, transform it into a more useful form, and pipe it into a machine learning algorithm, where it could be used in an application.

Data scientists are constantly in short supply (pathdoc/Shutterstock)

Patel recognized the pattern lent itself to some form of automation, one that was preferably more approachable by non-experts.

“Literally the way they were doing this is breaking the problem down, step by step, then trying to find code somewhere on the Web, and retrofit it inside. And that’s how a lot of cells get constructed in notebooks,” he said. “So we wrote a paper in 2017 to say, what if we could have this data science cell be filled up by the user just expressing that in natural language?”

This was in the pre-ChatGPT days, of course, and the state of the art in natural language processing (NLP) was nowhere near what it is today. While the NLP tech would improve, Patel and his University of Wisconsin PhD student, Rogers Jeffrey Leo John, did the hard work of constructing a compact control language that could sit between the user and the underlying SQL and Python code that would query data and call machine learning algorithms, respectively.

“The intermediate [language]… was great because now we could take any arbitrary language, convert that into that intermediate language, and now convert that into SQL and Python,” Patel said. “Because that’s what you need to do if you’re talking to a SQL database, doing ETL. If you want to build machine learning models, you really have to cross the two main languages of data science, which is SQL and Python.”
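To make the idea concrete, here is a minimal sketch of how a single intermediate step could be rendered as either SQL or Python. Everything in it is an assumption for illustration: the Aggregate step, the to_sql and to_pandas functions, and the customers table are hypothetical names, not DataChat's actual language or code.

```python
from dataclasses import dataclass

# Hypothetical intermediate-language step (not DataChat's real syntax):
# "aggregate one column of a table, grouped by another column".
@dataclass(frozen=True)
class Aggregate:
    table: str
    group_by: str
    column: str
    func: str  # one of the keys in _FUNCS


# Each supported function maps to a SQL aggregate and a pandas method name.
_FUNCS = {
    "avg": ("AVG", "mean"),
    "sum": ("SUM", "sum"),
    "min": ("MIN", "min"),
    "max": ("MAX", "max"),
    "count": ("COUNT", "count"),
}


def to_sql(step: Aggregate) -> str:
    """Render the step as a SQL query string."""
    sql_func, _ = _FUNCS[step.func]
    return (
        f"SELECT {step.group_by}, {sql_func}({step.column}) "
        f"AS {step.func}_{step.column} "
        f"FROM {step.table} GROUP BY {step.group_by}"
    )


def to_pandas(step: Aggregate) -> str:
    """Render the same step as a pandas expression string."""
    _, method = _FUNCS[step.func]
    return f"{step.table}.groupby({step.group_by!r})[{step.column!r}].{method}()"


step = Aggregate(table="customers", group_by="contract_type",
                 column="churned", func="avg")
print(to_sql(step))
print(to_pandas(step))
```

The point of the sketch is that the natural-language front end only has to produce the small, fixed step vocabulary. The SQL and Python back ends never see free-form text.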

A Natural Language for Data Science

The goal with DataChat was to create a data analytics and exploration tool that could follow simple English instructions, reducing the need for users to know SQL or Python to be productive with data. Users are able to type in simple commands such as “create a visualization for customer churn,” and the product will automatically produce a visualization based on the data.

Jignesh Patel is a DataChat co-founder and a computer science professor at Carnegie Mellon

The idea is for DataChat to be interactive, with a natural flow, Patel said. Sitting behind a spreadsheet-like interface, users can fire off questions at the data. Not every question posed to DataChat is going to immediately generate a reliable answer. But the give and take allows the product and the user to move forward in a predictable fashion.

“You ask and you get,” Patel said. “And when you get something back, we also tell you the steps. There’s a give and take. I’m going to ask you something, it didn’t make sense, and you ask in a slightly different way, but I’m making progress at every step.”

Business users, data analysts, and data scientists are the targeted users for DataChat. For business users and data analysts, the goal is to elevate their skills into the data science realm without a lot of training. Data scientists will often use DataChat just to give them an idea of what’s in a new data set.

“They might just be poking at it [in] DataChat and saying ‘Hey, how many null values do I have in three of my critical columns?’” Patel said. “Instead of writing a SQL query, they just point, click, or ask, and get that answer, and it’s just much faster. They could write it, but they’re getting the benefit of time from using this.”
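For comparison, this is roughly the query a data scientist would otherwise write by hand. The sketch below uses Python's built-in sqlite3 module; the customers table, its columns, and the count_nulls helper are made up for illustration and do not come from DataChat.

```python
import sqlite3


def count_nulls(conn, table, columns):
    """Return {column: number of NULL values}.

    COUNT(*) counts all rows while COUNT(col) skips NULLs, so the
    difference is the NULL count. Table and column names are assumed
    to be trusted identifiers, since they are interpolated into the SQL.
    """
    exprs = ", ".join(f"COUNT(*) - COUNT({c})" for c in columns)
    row = conn.execute(f"SELECT {exprs} FROM {table}").fetchone()
    return dict(zip(columns, row))


# Tiny in-memory data set with a few missing values (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (age INTEGER, contract_type TEXT, insured INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(34, "monthly", 1), (None, "annual", None),
     (51, None, 0), (None, "monthly", 1)],
)

null_counts = count_nulls(conn, "customers", ["age", "contract_type", "insured"])
print(null_counts)  # {'age': 2, 'contract_type': 1, 'insured': 1}
```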

A DataChat workflow can generate three artifacts from data sitting in anything from an Excel workbook to a data warehouse in Databricks or Snowflake: a report, a chart, or a machine learning model, including regression, classification, and time-series. Each workflow will be accompanied by an explanation of how and why it generated the answer that it did, which is an important feature of the product, Patel said.

For a model on churn, DataChat won’t generate “some crazy technical answer,” he said. “But it’s going to say, ‘Okay, these three things: the age of the person, the contract type, and whether they have bought insurance or not. And this is 60% of the influence or 20% and 10%, and here [are] the things that it’s not influencing based on the data.’”

That level of transparency is critical in data science, Patel said. “From day one, we’ve been thinking about solving data science, and science requires transparency, so that’s built into the philosophy of the product,” he said.

The Shifting Grounds of NLP

DataChat was first registered as a company in 2017, and raised $4 million in a seed round in 2020 (it has since raised another $25 million). Back in 2017, Patel and John slogged their way forward with the NLP technology of the day, which was neither as powerful nor as easy to use as today’s large language models (LLMs).

The DataChat interface lets users explore data using natural language (Image courtesy DataChat)

They built language parsers and delved into semantic understanding, “all of that crazy stuff,” Patel said. “But as part of doing that, we built the rest of the bottom of the stack,” he continued. “So important layers were all ready. They were scalable, they were cost-optimized, especially for cloud databases.”

When the LLM revolution exploded onto the scene a few years later, Patel and John quickly realized the superiority of the new approach, and jettisoned the top of the stack built on now-outdated NLP techniques. They replaced it with OpenAI’s Codex. When OpenAI killed Codex a year ago, they pivoted again to make the LLM component swappable in their stack.

“So obviously that was hell for us, but as part of doing that we redid our engineering framework in the LLM piece to make sure that next time that happens to us, we can plug and play LLMs out and make it as painless as possible,” Patel said.
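One common way to get that kind of plug-and-play behavior is to hide the model behind a small interface, so the rest of the stack never names a specific vendor. The sketch below is an assumption about the general pattern, not DataChat's engineering framework; LLMBackend, EchoBackend, and Translator are hypothetical names, and the stand-in backend makes no network calls.

```python
from typing import Protocol


class LLMBackend(Protocol):
    """Anything with a complete() method can serve as the model."""

    def complete(self, prompt: str) -> str: ...


class EchoBackend:
    """Stand-in backend for illustration; a real one would call a vendor API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class Translator:
    """Turns a user request into model output via whichever backend is set."""

    def __init__(self, backend: LLMBackend) -> None:
        self._backend = backend

    def swap_backend(self, backend: LLMBackend) -> None:
        # Replacing the model is a one-line change for the caller.
        self._backend = backend

    def translate(self, request: str) -> str:
        return self._backend.complete(request)


translator = Translator(EchoBackend("model-a"))
print(translator.translate("chart churn by contract type"))

translator.swap_backend(EchoBackend("model-b"))
print(translator.translate("chart churn by contract type"))
```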

The company now relies primarily on OpenAI’s GPT-4, which is generally considered to be the most powerful and well-read LLM on the market today. DataChat employs GPT-4 to learn and generate its intermediate language. GPT-4 is told about the type of data that the user wants to analyze in general terms, but customers’ actual data never touches GPT-4, Patel said.

“We will construct summaries of what is the structure of the schema, so we say ‘Here are the elements,’” Patel said. “I don’t need to give [GPT-4] the actual data values.”
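A schema-only summary of the kind Patel describes might look something like the following sketch. It reads column names and declared types from SQLite's table metadata and never selects a row. The summarize_schema function, the table, and the wording of the summary are assumptions for illustration, not DataChat's actual prompt.

```python
import sqlite3


def summarize_schema(conn, table):
    """Describe a table by column name and declared type only.

    PRAGMA table_info returns one row per column:
    (cid, name, type, notnull, dflt_value, pk). No data rows are read.
    The table name is assumed to be a trusted identifier.
    """
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    lines = [f"- {name}: {ctype}" for cid, name, ctype, *rest in cols]
    return f"Table {table} has columns:\n" + "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, contract_type TEXT)")
conn.execute("INSERT INTO customers VALUES (42, 'secret-plan')")

summary = summarize_schema(conn, "customers")
print(summary)
```

Only the summary string would be passed to the model. The stored values, such as 'secret-plan' here, stay in the database.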

LLMs are non-deterministic machines that can’t be fully trusted, Patel said, which is why DataChat uses LLMs only as “guides.” “They hallucinate, they do wrong stuff,” he said. “So they just give us stuff, we will convert that query to an intermediate language…and what we will generate for you is completely deterministic.”

A user can take a workflow generated by DataChat from one piece of data and run it on another piece of data, and it would run in the exact same way, he said. “So there’s no ambiguity.”
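The split between an untrusted model and a deterministic executor can be sketched as follows. Model output is treated as a suggestion, checked against a fixed vocabulary of steps, and only then run by ordinary code, so the same steps always behave the same way on any data set. The two operations, the JSON format, and the function names are hypothetical, chosen only to keep the example short.

```python
import json

# Hypothetical step vocabulary; anything else the model emits is rejected.
ALLOWED_OPS = {"filter_gt", "sum"}


def parse_steps(llm_text):
    """Validate untrusted model output into a list of known steps."""
    steps = json.loads(llm_text)
    if not isinstance(steps, list):
        raise ValueError("expected a list of steps")
    for step in steps:
        if not isinstance(step, dict) or step.get("op") not in ALLOWED_OPS:
            raise ValueError(f"unknown or malformed step: {step!r}")
    return steps


def run(steps, rows):
    """Execute validated steps with plain, deterministic code."""
    result = rows
    for step in steps:
        if step["op"] == "filter_gt":
            result = [r for r in result if r[step["column"]] > step["value"]]
        elif step["op"] == "sum":
            result = sum(r[step["column"]] for r in result)
    return result


# Pretend this string came back from the model.
suggestion = (
    '[{"op": "filter_gt", "column": "age", "value": 30},'
    ' {"op": "sum", "column": "spend"}]'
)
workflow = parse_steps(suggestion)

first = [{"age": 25, "spend": 10}, {"age": 40, "spend": 20}, {"age": 55, "spend": 5}]
second = [{"age": 31, "spend": 7}]

print(run(workflow, first))   # 25
print(run(workflow, second))  # 7
```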

It’s been a long road for Patel and John, but the Madison, Wisconsin-based company is finally accepting orders for DataChat. With the product formally launched at the Gartner show, Patel is ready to see what the next chapter in his fourth startup will bring.

“When we started and wrote that initial paper, everyone thought it was crazy in the database world,” Patel said. “But we got, in some sense, lucky that the GenAI piece landed where it was now a lot more usable. But that’s the fun thing about technology: It moves around, and if you’re willing to move around with it, good things can happen.”
