Advanced Computing in the Age of AI | Thursday, April 18, 2024

Lightning AI Today Announces ‘Thunder’ to Speed Up AI Model Training 

Lightning AI today announced the availability of Thunder, which it hails as a new and powerful source-to-source compiler for PyTorch. This tool is designed for training and serving the latest generative AI models across multiple GPUs at maximum efficiency.

Lightning AI, the company behind PyTorch Lightning, created Thunder over the course of two years of research into deep learning compilers in collaboration with NVIDIA. The problem Lightning AI set out to solve is the vast amount of resources organizations put toward large language model (LLM) training, costs that can balloon into billions of dollars. Most developers simply do not have the high-performance optimization and profiling tools that large companies like OpenAI or Google do, and Thunder is meant to help keep those costs down.

Although there is no set starting size for LLMs, most models typically have at least one billion parameters. What’s more, these models are growing at a rapid rate. OpenAI’s GPT-3 LLM has 175 billion parameters, while its successor, GPT-4, is reported to have 1.76 trillion. Optimizing this training work is therefore of the utmost importance.

According to a press release from Lightning AI, Thunder achieves higher speeds for training LLMs compared to unoptimized code. The company states that these efficiencies save weeks of training and substantially lower training costs.

In 2022, Lightning AI recruited a team of experienced PyTorch developers with the audacious objective of developing a next-generation deep learning compiler for PyTorch, one able to take advantage of the best available executors and software, such as torch.compile, NVIDIA's nvFuser, Apex, and cuDNN, and OpenAI's Triton. Thunder was designed to let developers employ all of these executors simultaneously, so that each executor performs the mathematical operations for which it was best built. NVIDIA is assisting with the integration of its executors into Thunder.
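The dispatch idea described above can be illustrated with a small conceptual sketch. This is not Thunder's real API; the registry, executor names, and `run_op` helper below are hypothetical, and stand in for a compiler that routes each operation in a traced program to the executor best suited for it.

```python
# Conceptual sketch only (hypothetical names, not Thunder's actual API):
# a source-to-source compiler can inspect each operation in a program and
# hand it to a specialized executor, falling back when none is registered.

# Hypothetical registry: op name -> (executor name, implementation)
EXECUTORS = {
    "relu": ("triton", lambda xs: [max(0.0, x) for x in xs]),
    "scale": ("nvfuser", lambda xs, k: [k * x for x in xs]),
}

def run_op(name, *args):
    """Dispatch an op to its registered executor; error if none exists."""
    entry = EXECUTORS.get(name)
    if entry is None:
        raise NotImplementedError(f"no executor registered for {name!r}")
    executor, fn = entry
    return executor, fn(*args)

executor, out = run_op("relu", [-1.0, 2.0])
print(executor, out)  # in this sketch, "triton" handles relu
```

A real compiler would additionally trace the program, fuse adjacent operations, and pick executors based on measured performance, but the routing principle is the same.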

“What we are seeing is that customers aren’t using available GPUs to their full capacity. They are throwing more GPUs at the problem, which provides diminishing returns with current software,” says Luca Antiga, Lightning’s CTO, in a press release. “Thunder, combined with Lightning Studios and its profiling tools, allows customers to effectively utilize their GPUs as they scale their models to be larger and run faster.”

Dr. Thomas Viehmann is leading the Thunder team. He is described by Lightning AI as a pioneer in the deep learning field best known for his early work on PyTorch, his contributions to TorchScript, and for making PyTorch available on mobile devices.

“I couldn't be more thrilled for Lightning to lead the next wave of performance optimizations to make AI more open source and accessible. I’m especially excited to partner with Thomas, one of the giants in our field, to lead the development of Thunder,” says Lightning AI CEO and founder Will Falcon. “Thomas literally wrote the book on PyTorch. At Lightning AI he will lead upcoming performance breakthroughs we will make available to the PyTorch and Lightning AI community.”

Currently, only top teams at OpenAI, Meta AI, NVIDIA, and Lightning AI have the expertise to apply such optimizations to accelerate AI workloads. With Thunder, the open-source community will gain access to the highly specific optimizations discovered by these expert programmers.

Thunder will be released under an Apache 2.0 license, with no restrictions. Lightning Studios will provide first-class Thunder support and native profiling tools, making it simple for developers and researchers to identify GPU memory and performance bottlenecks.

About Lightning AI

Lightning AI is the company behind PyTorch Lightning, the deep learning framework of choice for developers and companies seeking to build and deploy AI products. Its flagship product, Studio, focuses on simplicity, modularity, and extensibility to streamline AI development and boost developer productivity. The company's aim is to enable individual developers and enterprise users alike to build deployment-ready AI products.


For more information, visit: