
Cerebras Systems and Barcelona Supercomputing Center Train Industry-Leading Multilingual Spanish Catalan English LLM 

SUNNYVALE, Calif., Jan. 31, 2024 -- Cerebras Systems, a pioneer in accelerating generative AI, today announced that the Barcelona Supercomputing Center (BSC-CNS) has completed training FLOR-6.3B, a state-of-the-art English, Spanish, and Catalan large language model. FLOR-6.3B was trained in just 2.5 days on Condor Galaxy 1 (CG-1), the massive AI supercomputer built from 64 Cerebras CS-2s by Cerebras and G42. FLOR-6.3B continues Cerebras’ leading work on multilingual models, a trend that started with the introduction of Jais, the leading Arabic-English model.

Because Catalan has only a fraction of the data typically needed to train a model, the team developed new AI training techniques. Relative to English, Catalan and Spanish are low- and mid-resourced languages, respectively. As explained in a recent post, BSC sought to create a model that was stronger for having three languages together, as each language is commonly spoken in Spain. In partnership with Cerebras, the BSC team explored a technique that takes a fully trained LLM and adjusts its embedding layer to achieve the same result as if the model had been trained on a large data set.
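One way to picture adjusting an embedding layer for a new vocabulary is sketched below. This is a minimal illustration, not the BSC or Cerebras implementation: the function name `transfer_embeddings`, the toy vocabularies, and the choice to start unseen tokens at the mean of the trained embeddings are all assumptions made for the example.

```python
import numpy as np

def transfer_embeddings(old_emb, old_vocab, new_vocab):
    """Build an embedding matrix for new_vocab from a trained old_emb.

    old_emb:   array of shape (len(old_vocab), hidden_size)
    old_vocab: dict mapping token string -> row index in old_emb
    new_vocab: dict mapping token string -> row index in the new matrix
    Returns the new matrix and the fraction of new tokens found in old_vocab.
    """
    # Assumption: tokens the old model never saw start at the mean embedding.
    mean_vec = old_emb.mean(axis=0)
    new_emb = np.tile(mean_vec, (len(new_vocab), 1))

    shared = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Shared tokens keep the vector the model already learned.
            new_emb[new_id] = old_emb[old_id]
            shared += 1
    return new_emb, shared / len(new_vocab)

# Toy data (hypothetical): four old tokens, three new tokens, hidden size 3.
old_vocab = {"the": 0, "de": 1, "la": 2, "ing": 3}
new_vocab = {"de": 0, "la": 1, "ció": 2}
old_emb = np.arange(12, dtype=np.float64).reshape(4, 3)

new_emb, overlap = transfer_embeddings(old_emb, old_vocab, new_vocab)
print(new_emb)
print(f"overlap: {overlap:.0%}")
```

In this toy case two of the three new tokens reuse their trained vectors, and the remaining token starts from the average and would be learned during further pretraining.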

“Even though Spanish is one of the most commonly spoken languages in the world, there is a shortage of data available on the Internet for training – and we’ve found this to be a common problem for many languages beyond English,” said Andrew Feldman, CEO and co-founder of Cerebras. “In collaboration with our partners, we have been committed to developing new methodologies for creating models where training data is underrepresented. We are proud to work with BSC on FLOR 6.3B, which is multilingual at its core and performs significantly better than competing Spanish LLMs thanks to our novel training techniques.”

FLOR is a new family of open-source models, ranging in size from 760M to 6.3B parameters, that are based on publicly released checkpoints of BLOOM. These checkpoints were pre-trained on 341B tokens of multilingual data, including 46 natural languages and 13 programming languages.

BLOOM-7.1B was taken as the initial checkpoint of the continuous pretraining due to its multilingual nature. To better adapt to Catalan and Spanish, a new tokenizer was trained and used in the continuous pretraining process. The new tokenizer has a reduced vocabulary of 50,257 subwords, of which 66% overlap with the BLOOM vocabulary; the rest are subwords that are more prevalent in Catalan and Spanish. The smaller vocabulary also means that FLOR-6.3B has fewer parameters than BLOOM-7.1B, which directly reduces the cost of inference by more than 10%.
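The link between vocabulary size and parameter count can be checked with rough arithmetic. The sketch below assumes figures that do not appear in this announcement: a hidden size of 4,096, 250,880 embedding rows, and 7,069,016,064 total parameters for BLOOM-7.1B, taken from the publicly documented BLOOM configuration, plus the assumption that the embedding matrix is counted once (tied input and output embeddings). Only the 50,257 vocabulary size comes from the text.

```python
def embedding_params(vocab_size, hidden_size):
    """Parameters in one embedding matrix of shape (vocab_size, hidden_size)."""
    return vocab_size * hidden_size

# Assumed BLOOM-7.1B figures (not stated in this announcement).
HIDDEN_SIZE = 4096
OLD_VOCAB_ROWS = 250_880
OLD_TOTAL_PARAMS = 7_069_016_064

# From the text: the new tokenizer has 50,257 subwords.
NEW_VOCAB_ROWS = 50_257

saved = (embedding_params(OLD_VOCAB_ROWS, HIDDEN_SIZE)
         - embedding_params(NEW_VOCAB_ROWS, HIDDEN_SIZE))
new_total = OLD_TOTAL_PARAMS - saved
reduction = saved / OLD_TOTAL_PARAMS

print(f"parameters removed: {saved:,}")
print(f"estimated new total: {new_total:,}")
print(f"relative reduction: {reduction:.1%}")
```

Under these assumptions the estimate lands at roughly 6.25 billion parameters, a reduction of about 11.6%, which is in line with the model's name and the "more than 10%" figure. Parameter count is only a proxy for inference cost, so the match should be read as a sanity check rather than a derivation.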

The FLOR family of models was trained using subsets of the Condor Galaxy 1 AI supercomputer. The smaller models were trained on single Cerebras CS-2 systems, while FLOR-6.3B was trained on 16 CS-2s. Cerebras completed the entire training of FLOR-6.3B on 140 billion tokens in 2.5 days. FLOR-6.3B is open source and available for use in both research and commercial applications.
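For a sense of scale, the figures above imply an average training throughput that can be worked out directly. The short calculation below assumes that the 2.5 days is wall-clock time and that all 16 systems were in use for the whole run; neither detail is confirmed in the announcement.

```python
# Figures from the text.
TOKENS = 140_000_000_000   # 140 billion training tokens
DAYS = 2.5                 # reported training time
SYSTEMS = 16               # CS-2 systems used for FLOR-6.3B

# Assumption: 2.5 days of continuous wall-clock training.
seconds = DAYS * 24 * 3600
tokens_per_second = TOKENS / seconds
per_system = tokens_per_second / SYSTEMS

print(f"aggregate: {tokens_per_second:,.0f} tokens/s")
print(f"per CS-2:  {per_system:,.0f} tokens/s")
```

This works out to roughly 648,000 tokens per second in aggregate, or about 40,500 tokens per second per CS-2, as an average over the run.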

Condor Galaxy is one of the largest AI supercomputers in the world. Built by Cerebras and its strategic partner G42, Condor Galaxy 1 comprises 64 CS-2 systems, creating a 4-exaFLOP AI supercomputer with standard support for models of up to 600 billion parameters. Condor Galaxy 1 is simple to program and entirely avoids the complexity of distributed computing. This enables customers to train large, ground-breaking models quickly, greatly reducing the time from idea to trained model.

The FLOR family of models continues Cerebras’ leadership in multilingual models. In 2023, Cerebras and Core42 co-developed Jais 13B and Jais 30B, the best bilingual Arabic models in the world, now available on Azure Cloud. Condor Galaxy has also been used to train BTLM-3B-8K, the number one 3B model on Hugging Face, which offers 7B-parameter performance in a light 3B-parameter model for inference. Med42, developed with M42 and Core42, is a leading clinical LLM, trained on Condor Galaxy 1 in a weekend and surpassing MedPaLM on performance and accuracy.

For more information on the Condor Galaxy AI supercomputer, please visit https://www.cerebras.net/condor-galaxy-1.

About Cerebras Systems

Cerebras Systems is a team of pioneering computer architects, computer scientists, deep learning researchers, and engineers of all types. We have come together to accelerate generative AI by building a new class of computer system. Our flagship product, the CS-2 system, is powered by the world’s largest and fastest AI processor, our Wafer-Scale Engine. It makes training large models simple and easy by avoiding the complexity of distributed computing. Cerebras CS-2s are clustered together to make the largest AI supercomputers in the world, which are used by leading corporations for proprietary models and to train open-source models with millions of downloads. Cerebras solutions are available through the Cerebras Cloud and on premises. For further information, visit https://www.cerebras.net.


Source: Cerebras

EnterpriseAI