Language Model Training Gets Another Player: Inspur AI Research Unveils Yuan 1.0
Hot on the heels of recent GPT language model product news from SambaNova, Microsoft and Nvidia, China-based Inspur AI Research has announced its Yuan 1.0 language model, which has 245.7 billion parameters and was trained on a 5TB dataset.
What makes Yuan 1.0 different, however, is that it was built from the ground up as a model for the Chinese language, which is complex and required a unique development approach compared to English, according to an Oct. 21 announcement by Inspur AI Research. Yuan 1.0 was previously unveiled Sept. 28 at a large-scale AI model workshop in Beijing, but the company is now announcing its creation to a broader audience.
Inspur said that the parallel GPU computing used in Yuan 1.0 and its impressive performance in zero-shot and few-shot learning give it the ability to generate language content that is difficult to distinguish from human-generated content.
To make Yuan 1.0 possible, Inspur built a large-scale distributed training system with 2,128 GPUs to provide the needed processing power.
In an accompanying paper written about the project – Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning – researchers said that Yuan 1.0 has a different approach from the OpenAI GPT-3 project, "which requires a huge amount of computational resources which makes it challengeable to researchers."
The large-scale distributed training in Yuan 1.0 combines three parallelism strategies: tensor, pipeline, and data parallelism. To make the most of its computational resources, the model identifies the parameters that offer optimal results and prioritizes computational resources for them, according to Inspur AI Research. These architecture optimizations allowed the 245.7-billion-parameter model to be trained using 4,095 petaflop-days of processing power while reaching a training loss of 1.64.
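To put those figures in perspective, the short Python sketch below converts the reported 4,095 petaflop-days into a total operation count and shows how a 2,128-GPU cluster can be divided among the three parallelism strategies. It assumes a petaflop-day means one petaflop per second sustained for a full day, and the 8 x 38 x 7 split is an illustrative assumption rather than a configuration stated in the announcement.

```python
SECONDS_PER_DAY = 86_400
PETA = 10**15


def petaflop_days_to_flops(pf_days):
    """Total operations for 1 petaflop/s sustained over `pf_days` days."""
    return pf_days * PETA * SECONDS_PER_DAY


def gpus_needed(tensor, pipeline, data):
    """In combined (3D) parallelism each GPU sits at one
    (tensor, pipeline, data) coordinate, so the cluster size
    is the product of the three degrees."""
    return tensor * pipeline * data


# Figures reported for Yuan 1.0.
total_flops = petaflop_days_to_flops(4095)
per_gpu_flops = total_flops / 2128

# Hypothetical layout: 8-way tensor, 38-way pipeline, 7-way data parallelism.
layout_gpus = gpus_needed(tensor=8, pipeline=38, data=7)

print(f"total training compute: {total_flops:.3e} operations")
print(f"average share per GPU:  {per_gpu_flops:.3e} operations")
print(f"GPUs for an 8 x 38 x 7 layout: {layout_gpus}")
```

The point of the product rule is that the three strategies multiply rather than add: raising any one degree of parallelism scales the number of GPUs required by the same factor.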
In addition, Inspur developed a Massive Data Filtering System (MDFS), built upon Spark, to clean and filter raw data, and trained a BERT-based model to select high-quality text samples. MDFS consists of three stages: data collection, coarse filtering and fine filtering. MDFS built the 5TB corpus used by Yuan 1.0 by filtering 850TB of raw data collected from the internet, running on a high-performance cluster with 36 nodes. The resulting corpus is the largest high-quality Chinese corpus in the world, according to Inspur.
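The general idea of a cheap rule-based pass followed by a model-based pass can be sketched in a few lines of Python. This is a minimal illustration, not Inspur's MDFS: the function names, the thresholds, and the toy scoring function (a stand-in for a learned classifier such as the BERT-based model) are all hypothetical, and the sketch runs in plain Python rather than on Spark.

```python
def coarse_filter(docs, min_chars=20):
    """Cheap rule-based pass: drop short fragments and exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        text = doc.strip()
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


def fine_filter(docs, score, threshold=0.5):
    """Model-based pass: keep documents whose quality score clears the threshold."""
    return [doc for doc in docs if score(doc) >= threshold]


def toy_quality_score(text):
    """Hypothetical stand-in for a learned classifier: the share of
    characters that are letters (including CJK) or whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)


raw = [
    "Large language models are trained on filtered web text.",
    "Large language models are trained on filtered web text.",  # duplicate
    "click here",                                                # too short
    "$$$ 1234 5678 !!! #### 9999 @@@@ %%%% 0000 ****",           # low quality
]

after_coarse = coarse_filter(raw)
corpus = fine_filter(after_coarse, toy_quality_score)
print(len(raw), "->", len(after_coarse), "->", len(corpus))

# Retention implied by the reported sizes: 5 TB kept out of 850 TB collected.
retention = 5 / 850
print(f"retention: {retention:.2%}")
```

Running the inexpensive rules first means the costlier scoring step only sees data that has already survived basic cleaning, which matters when well under 1 percent of the raw text is ultimately kept.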
A spokesperson from Inspur could not be reached by press time today.
Dan Olds, chief research officer for Intersect360 Research, told EnterpriseAI that the Yuan 1.0 project is a significant accomplishment given its high performance, especially because it works with the Chinese language.
"According to reports and the paper they produced, Yuan 1.0 scored almost 20 percent better on Chinese language benchmarks and took home the top spot in six categories, such as noun-pronoun relationships, natural language inference, and idiom reading comprehension," said Olds. "It is a huge computational problem to do this with any language, but Chinese is even more difficult because of the complexity of the language and the lack of a large-scale and accurate Chinese corpus [a structured set of machine-readable texts]. In the process of creating Yuan 1.0, Inspur built the most comprehensive Chinese language corpus in the world, more than twice the size of the largest existing Chinese corpus, and used all 5TB of it to train this new model."
Yuan 1.0 is "quite an advance over [OpenAI's] GPT-3 in that Yuan 1.0 uses 245 billion parameters vs. GPT-3 at 175 billion. This means that the Yuan 1.0 model is much more sophisticated when it comes to handling more complex language structures, sentence comprehension and the like," said Olds. "What is even more impressive is that Yuan 1.0 was able to increase performance over GPT-3 while using less hardware. GPT-3 training typically requires a cluster with 10,000 GPU processors, but Yuan 1.0 was able to train their model in a reasonable time using only 2,128 GPUs. This is due to Inspur radically reducing bottlenecks in their training code."
Olds said he expects further improvements on the performance of Yuan 1.0 as Inspur improves the code.
"This is a very impressive debut for the new model, and it might just have a large enough performance lead to become the Chinese NLP standard for at least the foreseeable future," said Olds. "There is a sizable business opportunity for someone who corners the market on Chinese NLP, but other large players are working on this problem as well, so Inspur needs to keep pushing the boundaries of NLP."
Another analyst, Jack E. Gold, the president of J. Gold Associates, LLC, agreed that Yuan 1.0 is a major milestone in terms of what Inspur was able to model, but argued that it still took a lot of resources that might only be available to researchers and not to enterprise GPT users.
"While this might be possible for a research organization without major constraints, the ability of a commercial operation to do something similar would be cost prohibitive," said Gold. "One thing they did not highlight is how much this modeling effort cost in data resources and compute resources. I would bet it is quite a lot."
While the work shows that major AI models can be built for difficult problems like the Chinese language, "mass-scale deployment of AI requires a tradeoff between the cost of doing something and the return on that investment," said Gold. "That is what companies look at. This illustrates that there is still a lot of work to be done to make large AI problems capable of being modeled on a reasonable amount of hardware resources and the need for more efficient modeling frameworks and hardware acceleration."
Karl Freund, the founder and principal analyst at Cambrian AI Research, said he finds it notable that there have been two other significant language model announcements in the last several weeks: the Oct. 11 news about Nvidia and Microsoft collaborating on a 530-billion-parameter model trained using 4,480 Nvidia A100 GPUs, and SambaNova's Oct. 18 news that it has incorporated GPT into its flagship Dataflow-as-a-Service platform for enterprise users.
"There is a lot of interest in big models, but we should expect a series of similar announcements for a while, approaching 1 trillion parameters," said Freund. "But soon, it will take a different hardware and software approach, something like Nvidia Grace or Cerebras MemoryX, to scale to 100 trillion parameters for brain-scale AI."
Ultimately, though, one must ask if there is a market for these innovations, he said. "We think so, but it is just emerging," said Freund. "The models to date are error-prone and can promote bias and misinformation. So, the use of these models remains a bit scary."