Companions, Not Replacements: Chatbots Will Not Succeed Search Engines Just Yet
The internet has been abuzz with news about the AI-powered search engine wars between Google and Microsoft. Microsoft has been collaborating with OpenAI to integrate an upgraded version of ChatGPT within its Bing search engine. In response, Google announced Bard, a competing chatbot that it will incorporate into its popular search engine.
The very public battle between these tech giants for chatbot supremacy is building excitement for innovation in search. As the hype train keeps rolling, it has prompted the question: Will chatbots actually replace search engines? Some experts are skeptical, citing inherent flaws in the technology in its current form.
The underlying large language models (LLMs) powering these chatbots, like OpenAI’s GPT-3.5 and Google’s LaMDA, are trained on massive datasets of text aggregated from the open internet. While chatbots have the advantage of delivering answers in a humanlike, conversational format, they often deliver inaccurate or totally fabricated information. A search engine returns a list of verifiable links that lets a user evaluate sources; chatbots typically do not cite their sources and give no evidence for their conclusions.
“These things understand language. They don’t understand truth,” said Jonathan Rosenberg, CTO and head of AI at Five9, in an interview with Datanami.
Rosenberg says LLMs are one of the most transformative technologies in the last 10-20 years, and it is hard to estimate their future impact. But he has his reservations due to the technology’s current limitations: “It’s taught to understand how to construct language, and that language is from the entire corpus of the internet. So, the things it can speak about just represent what it has encountered on the internet, composing this together in ways that make sense as a language. And that means it’s going to get stuff wrong,” he said.
Chatbots are plagued by inaccuracies and hallucinations. Both Bing and Bard have already made high-profile mistakes. Google’s parent company, Alphabet, lost $100 billion in market value in one day after Bard inaccurately claimed that the James Webb Space Telescope took the first picture of an exoplanet. In a recent demo, Bing analyzed earnings reports from Gap and Lululemon, but a comparison with the actual reports showed that the chatbot mixed up some of the numbers and, in some cases, made them up entirely.
“One of the major challenges with these large language models is that they can’t say ‘I don’t know’ and instead produce a response they deem accurate. And often, the response falls short of the truth,” said Alexander Ratner, co-founder and CEO of Snorkel AI, in an email interview with Datanami. “This is because, in the end, the model is as good as the data it is trained on, and a lot of the data we produce in the world is unstructured—meaning, it’s unlabeled and unclassified. And a model isn’t discerning like a human; it can’t tell the difference between good data versus, say, data containing toxic content. This is also one of the major reasons why large language models need a lot more data curation than search.”
Data curation is the process of creating, organizing, and maintaining datasets to make data usable for a specific purpose. Since the largest LLMs have been trained on general, uncurated datasets, they are not currently ideal for finding domain-specific information the way a search engine does, by indexing documents and retrieving them by keyword.
In a 2021 research paper, experts at Google explored the question of whether LLMs could one day replace search engines, noting that pre-trained language models are currently dilettantes holding surface-level knowledge rather than domain experts. The paper proposed a unified approach in which a single model both retrieves and ranks results for a query: “To accomplish this, a so-called model-based information retrieval framework is proposed that breaks away from the traditional index-retrieve-then-rank paradigm by encoding the knowledge contained in a corpus in a consolidated model that replaces the indexing, retrieval, and ranking components of traditional systems,” the paper’s conclusion states.
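For readers unfamiliar with the "index-retrieve-then-rank" paradigm the paper wants to move beyond, the three stages can be sketched in a few lines of Python. This is a toy illustration only, not how any production search engine or the paper's proposal works: the three-document corpus, the function names, and the crude term-count score are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical three-document corpus, for illustration only.
DOCS = {
    "d1": "chatbots generate fluent language",
    "d2": "search engines index pages and rank results",
    "d3": "search engines rank pages by relevance to the search query",
}


def tokenize(text):
    """Lowercase and split on whitespace (real systems do far more)."""
    return text.lower().split()


def build_index(docs):
    """Index step: map each term to a Counter of {doc_id: term count}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index


def retrieve(index, query):
    """Retrieve step: every document containing at least one query term."""
    candidates = set()
    for term in tokenize(query):
        candidates.update(index.get(term, {}))
    return candidates


def rank(index, query, candidates):
    """Rank step: order candidates by summed query-term counts."""
    terms = tokenize(query)

    def score(doc_id):
        return sum(index[t][doc_id] for t in terms if t in index)

    # Highest score first; ties broken by document id for determinism.
    return sorted(candidates, key=lambda d: (-score(d), d))


def search(query, docs=DOCS):
    """Run all three stages and return ranked document ids."""
    index = build_index(docs)
    return rank(index, query, retrieve(index, query))


print(search("search rank"))
```

Every result here is a pointer back to a source document a user can inspect, and a query with no matching terms returns an empty list. The paper's proposal would fold all three stages into one model.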
While a unified model sounds promising, a lot of research and work needs to be done before an LLM can handle all search tasks on its own. “It’s understandable why people are excited. They think [chatbots] could be a truly disruptive opportunity because they could replace not just the interface of search, but also the business model. And they are not wrong; but they are miscalculating the time it will take to make this really operational,” said Ratner.
Ratner says these chatbots will remain companions for the foreseeable future, not replacements: “The next big reality check is companies will quickly realize there’s a lot of work to make these models work for actual enterprise business use cases. Out of the box, large language models give low-quality outputs on proprietary data and anything beyond simple tasks,” he said. “Enterprises on the other hand have to contend with small, noisy, and non-AI-ready datasets, and require AI solutions to solve complex tasks. The main challenge for enterprises when adopting AI comes down to: developing training data.”
For these new AI tools to be truly usable for domain-specific search, the foundation models powering them will require more efficient ways of fine-tuning and data labeling, which is what Ratner’s company, Snorkel AI, specializes in. Simply hoovering up the internet’s unstructured textual data to use as training data for LLMs is not an ideal approach for ensuring long-term data quality.
“Data quality plays a fundamental role in building and adapting models for production use. And data-centric workflows are critical to achieving transparency, auditability, and governance for AI applications,” said Ratner.
Rosenberg also believes that chatbots will help search but will not replace it. He sees the technology as an accelerator of what humans can do, one that still requires human supervision.
“I think there’s room for innovation in how you combine what [a chatbot] says with traditional search results,” he said. “This is the ultimate curation machine that generates much closer to the thing you want than the copy/paste of old, but it still needs human supervision to correct it, fix it, verify it, and make it your own. And so, you have to take it and add human oversight.”
At Five9, Rosenberg and his colleagues offer a cloud-based call center platform that enables agents to communicate with customers via phone, email, and chat. For contact centers, one possible use for chatbots would be assisting customer service agents, not just in finding information, but with anything language intensive, such as processing call transcripts for quality assurance: “Basically, in any industry where people are in the business of processing written language and responding with written language, it’s going to be impactful,” he said.
Rosenberg agrees that getting past current hallucination rates must involve curating the training set and achieving better data quality. When asked about the possibility of an endless feedback loop of AI-generated content riddled with inaccuracies, Rosenberg is hopeful that there will be more content with human oversight than not, but the underlying technology of LLMs may continue to be an issue: “I think there’s always going to be some risk, because fundamentally, a large language model is doing the most simple, stupidest thing you could ever imagine. All it’s doing is saying, given the prior X number of words, what is my highest probability for the next word?”
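Rosenberg's description of next-word prediction can be made concrete with a deliberately tiny sketch. Real LLMs use neural networks over subword tokens and far longer contexts, so this count-based toy is an assumption-laden stand-in, not how GPT-3.5 or LaMDA is built. The training text, the function names, and the two-word context size are all hypothetical.

```python
from collections import Counter, defaultdict


def train(text, context_size=2):
    """Count which word follows each run of `context_size` words."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        counts[context][words[i + context_size]] += 1
    return counts


def next_word(counts, prior_words, context_size=2):
    """Pick the most frequent follower of the last `context_size` words.

    Returns None when the context never appeared in the training text.
    """
    context = tuple(w.lower() for w in prior_words[-context_size:])
    followers = counts.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text, for illustration only.
TEXT = (
    "the cat sat on the mat . "
    "the cat sat on the sofa . "
    "the cat ate the fish ."
)

counts = train(TEXT)
print(next_word(counts, ["the", "cat"]))
```

In this text "the cat" is followed by "sat" twice and "ate" once, so the sketch picks "sat". Nothing in the procedure checks whether the continuation is true; it only reflects what was most common in the training data, which is the limitation Rosenberg is pointing at.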
Ratner is also concerned about the possibility of continued data quality issues for chatbots: “Given most large models such as ChatGPT are trained on data from the internet, the AI-generated content creates a vicious circle of increasing degrading data quality leading to less accurate outcomes and slowing down mass adoption.”
This article first appeared on sister publication Datanami.