Advanced Computing in the Age of AI | Saturday, July 20, 2024

Spelunking the HPC and AI GPU Software Stacks 

As AI continues to reach into every domain of life, the question remains as to what kind of software these tools will run on. The choice of software stack – the collection of software components that work together to enable specific functionality on a computing system – is becoming even more relevant given the GPU-centric computing needs of AI tasks.

With AI and HPC applications pushing the limits of computational power, the choice of software stack can significantly impact performance, efficiency, and developer productivity.

Currently, there are three major players in the software stack competition: Nvidia's Compute Unified Device Architecture (CUDA), Intel's oneAPI, and AMD's Radeon Open Compute (ROCm). While each has pros and cons, CUDA continues to dominate, largely because Nvidia's hardware has led the way in HPC and now AI.

Here, we will delve into the intricacies of each of these software stacks – exploring their capabilities, hardware support, and integration with the popular AI framework PyTorch. In addition, we will conclude with a quick look at two higher-level HPC languages: Chapel and Julia.

Nvidia's CUDA

Nvidia's CUDA is the company's proprietary parallel computing platform and software stack meant for general-purpose computing on their GPUs. CUDA provides an application programming interface (API) that enables software to leverage the parallel processing capabilities of Nvidia GPUs for accelerated computation.

CUDA must be mentioned first because it dominates the software stack space for AI and GPU-heavy HPC tasks – and for good reason. CUDA has been around since 2006, which gives it a long history of third-party support and a mature ecosystem. Many libraries, frameworks, and other tools have been optimized specifically for CUDA and Nvidia GPUs. This long-held support for the CUDA stack is one of its key advantages over other stacks.

Nvidia provides a comprehensive toolset as part of the CUDA platform, including CUDA compilers like Nvidia CUDA Compiler (NVCC). There are also many debuggers and profilers for debugging and optimizing CUDA applications and development tools for distributing CUDA applications. Additionally, CUDA's long history has given rise to extensive documentation, tutorials, and community resources.

Support for the PyTorch framework is also essential when discussing AI tasks. PyTorch is an open-source machine learning library based on the Torch library, and it is primarily used for applications in computer vision and natural language processing. Its CUDA support is extensive, well established, and highly optimized, which enables efficient training and inference on Nvidia GPUs. Again, CUDA's maturity means PyTorch has access to numerous libraries and tools.

In addition to a raft of accelerated libraries, Nvidia also offers a complete deep-learning software stack for AI researchers and software developers. This stack includes the popular CUDA Deep Neural Network library (cuDNN), a GPU-accelerated library of primitives for deep neural networks. cuDNN accelerates widely used deep learning frameworks, including Caffe2, Chainer, Keras, MATLAB, MXNet, PaddlePaddle, PyTorch, and TensorFlow.

What's more, CUDA is designed to work with all Nvidia GPUs, from consumer-grade GeForce video cards to high-end data center GPUs – giving users a wide range of versatility within the hardware they can use.

That said, CUDA is not without flaws, and Nvidia's software stack has some drawbacks that users must consider. To begin, though freely available, CUDA is a proprietary technology owned by Nvidia and is, therefore, not open source. This situation locks developers into Nvidia's ecosystem and hardware, as applications developed on CUDA cannot run on non-Nvidia GPUs without significant code changes or the use of compatibility layers. In a similar vein, the proprietary nature of CUDA means that the software stack's development roadmap is controlled solely by Nvidia. Developers have limited ability to contribute to or modify the CUDA codebase.

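Developers must also consider the cost of committing to CUDA. The CUDA Toolkit itself is free to download and use under Nvidia's license agreement, but applications built on it require Nvidia hardware, which can be expensive, and some of Nvidia's enterprise software and support offerings are licensed separately.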


AMD's ROCm

AMD's ROCm is another software stack that many developers choose. While CUDA may dominate the space, ROCm is distinct because it is an open-source software stack for GPU computing. This feature allows developers to customize and contribute to the codebase, fostering collaboration and innovation within the community. One of ROCm's critical advantages is that code written for it can target both AMD and Nvidia GPUs, which allows for cross-platform development.

This unique feature is enabled by the Heterogeneous-compute Interface for Portability (HIP), which gives developers the ability to create portable applications that can run on different GPU platforms. While ROCm supports both consumer and professional AMD GPUs, its major focus is on AMD's high-end Instinct (formerly Radeon Instinct) and Radeon Pro GPUs designed for professional workloads.

Like CUDA, ROCm provides a range of tools for GPU programming. These include C/C++ compilers like the ROCm Compiler Collection, AOMP, and AMD Optimizing C/C++ Compiler, as well as Fortran compilers like Flang. There are also libraries for a variety of domains, such as linear algebra, FFT, and deep learning.

That said, ROCm's ecosystem is relatively young compared to CUDA's and needs to catch up regarding third-party support, libraries, and tools. Being late to the game also means that documentation and community resources are more limited than the extensive documentation, tutorials, and support available for CUDA. This situation is especially true for PyTorch, which supports the ROCm platform but, because of that shorter history, trails its CUDA counterpart in performance, optimization, and third-party support. However, AMD is making progress on this front.

Like Nvidia, AMD also provides a hefty load of ROCm libraries. AMD offers an equivalent to cuDNN called MIOpen for deep learning, which is used in the ROCm version of PyTorch (and other popular tools).
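
Because ROCm builds of PyTorch reuse the same `torch.cuda` interface as CUDA builds, it is not always obvious which stack a given installation sits on. The following is a minimal sketch, assuming a reasonably recent PyTorch, that reports the build type from `torch.version.hip` and `torch.version.cuda`. The helper name `describe_backend` is made up for illustration, and the function simply reports that PyTorch is missing when it is not installed.

```python
def describe_backend():
    """Return a short label for the GPU stack this PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"

    # ROCm builds populate torch.version.hip; CUDA builds populate
    # torch.version.cuda. getattr guards against builds lacking either one.
    if getattr(torch.version, "hip", None):
        return "rocm"
    if getattr(torch.version, "cuda", None):
        return "cuda"
    return "cpu-only"


if __name__ == "__main__":
    print(describe_backend())
```

Note that this only describes how PyTorch was built; whether a usable GPU is actually present is a separate question answered by `torch.cuda.is_available()`.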

Additionally, while HIP code can target both AMD and Nvidia GPUs, its performance may not match native CUDA when running on Nvidia hardware due to the overhead of the portability layer and optimization challenges.

Intel's oneAPI

Intel's oneAPI is a unified, cross-platform programming model that enables development for a wide range of hardware architectures and accelerators. It supports multiple architectures, including CPUs, GPUs, FPGAs, and AI accelerators from various vendors. It aims to provide a vendor-agnostic solution for heterogeneous computing and leverages industry standards like SYCL. This feature means that it can run on architectures from outside vendors like AMD and Nvidia as well as on Intel's hardware.

Like ROCm, oneAPI is an open-source platform. As such, there is more community involvement and contribution to the codebase compared to CUDA. Embracing open-source development, oneAPI supports a range of programming languages and frameworks, including C/C++ with SYCL, Fortran, Python, and TensorFlow. Additionally, oneAPI provides a unified programming model for heterogeneous computing, simplifying development across diverse hardware.

Again, like ROCm, oneAPI has some disadvantages related to the stack's maturity. As a younger platform, oneAPI needs to catch up to CUDA regarding third-party software support and optimization for specific hardware architectures.

When looking at specific use cases within PyTorch, oneAPI is still in its early stages compared to the well-established CUDA integration. PyTorch can leverage oneAPI's Data Parallel Python (DPPy) library for distributed training on Intel CPUs and GPUs, but native PyTorch support for oneAPI GPUs is still in development and is not ready for production.

That said, it's important to note that oneAPI's strength lies in its open standards-based approach and potential for cross-platform portability. oneAPI could be a viable option if vendor lock-in is a concern and the ability to run PyTorch models on different hardware architectures is a priority.

For now, if maximum performance on Nvidia GPUs is the primary goal for developers with PyTorch workloads, CUDA remains the preferred choice due to its well-established ecosystem. That said, developers seeking vendor-agnostic solutions or those primarily using AMD or Intel hardware may wish to rely on ROCm or oneAPI, respectively.

While CUDA has a head start regarding ecosystem development, its proprietary nature and hardware specificity may make ROCm and oneAPI more advantageous solutions for certain developers. Also, as time passes, community support and documentation for these stacks will continue to grow. CUDA may be dominating the landscape now, but that could change in the years to come.

Abstracting Away the Stack

In general, many developers prefer to create hardware-independent applications. Within HPC, hardware optimizations can be justified for performance reasons, but many modern-day coders prefer to focus more on their application than on the nuances of the underlying hardware.  

PyTorch is a good example of this trend. Python is not known as a particularly fast language, yet 92% of models on Hugging Face are PyTorch exclusive. As long as the hardware vendor has a PyTorch version built on their libraries, users can focus on the model, not the underlying hardware differences. While this portability is nice, it does not guarantee performance, which is where the underlying hardware architecture may enter the conversation.
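
In practice, this hardware independence often comes down to a few lines of device-selection code. The sketch below is one possible approach, not a canonical recipe: the helper name `pick_device` and the preference order are assumptions, and the `getattr` guards are there because the Intel (`torch.xpu`) and Apple (`torch.backends.mps`) backends are only present in some PyTorch builds.

```python
def pick_device():
    """Return the name of the best available PyTorch device, or "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # nothing to accelerate without PyTorch

    # torch.cuda covers both Nvidia (CUDA) and AMD (ROCm) builds.
    if torch.cuda.is_available():
        return "cuda"

    # Intel GPUs: torch.xpu exists only in newer builds, hence the guard.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"

    # Apple-silicon GPUs are exposed through the MPS backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print(f"selected device: {pick_device()}")
```

The returned string can be passed to `.to(...)` on a tensor or module, so the model code itself stays the same across vendors.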

Of course, PyTorch is based on Python, the beloved first language of many programmers. This language often trades performance (particularly high-performance features like parallel programming) for ease of use. When HPC projects are started with Python, they tend to migrate to scalable high-performance codes based on distributed C/C++ and MPI or threaded applications that use OpenMP. These choices often result in the "two language" problem, where developers must manage two versions of their code.

Currently, two "newer" languages, Chapel and Julia, each offer an easy-to-use, single-language solution that provides a high-performance coding environment. These languages, among other things, attempt to "abstract away" many of the details required to write applications for parallel HPC clusters, multi-core processors, and GPU/accelerator environments. At their base, they still rely on the vendor GPU libraries mentioned above but often make it easy to build applications that can recognize and adapt to the underlying hardware environment at run time.


Chapel

Initially developed by Cray, Chapel (the Cascade High Productivity Language) is a parallel programming language designed for a higher level of expression than current programming languages (read as "Fortran/C/C++ plus MPI"). Hewlett Packard Enterprise, which acquired Cray, currently supports the development as an open-source project under version 2 of the Apache license. The current release is version 2.0, and the Chapel website posts some impressive parallel performance numbers.

Chapel compiles to binary executables by default, but it can also compile to C code, and the user can select the compiler. Chapel code can be compiled into libraries that can be called from C, Fortran, or Python (and others). Chapel supports GPU programming through code generation for Nvidia and AMD graphics processing units.

There is a growing collection of libraries available for Chapel. A recent neural network library called Chainn is available for Chapel and is tailored to build deep-learning models using parallel programming. The implementation of Chainn in Chapel enables the user to leverage the parallel programming features of the language and to train Deep Learning models at scale from laptops to supercomputers.


Julia

Developed at MIT, Julia is intended to be a fast, flexible, and scalable solution to the two-language problem mentioned above. Work on Julia began in 2009, when Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman set out to create an open technical computing language that was both high-level and fast.

Like Python, Julia provides a responsive interactive programming environment (REPL or read–eval–print loop) using a fast, just-in-time compiler. The language syntax is similar to MATLAB and provides many advanced features, including:

  • Multiple dispatch: a function can have several implementations (methods) depending on the input types, which makes it easy to write portable and adaptive code
  • Dynamic type system: types for documentation, optimization, and dispatch
  • Performance approaching that of statically typed languages like C
  • A built-in package manager
  • Designed for parallel and distributed computing
  • Can compile to binary executables
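
Multiple dispatch is probably the least familiar item in this list. Julia provides it natively; the sketch below only emulates the idea in Python with a small hand-rolled registry keyed on the argument types. The names `dispatch` and `combine` are illustrative and not part of Julia or any library, and the lookup matches exact types only (no subclass handling).

```python
_registry = {}


def dispatch(*types):
    """Register an implementation of a function for one tuple of argument types."""
    def decorator(func):
        _registry[(func.__name__, types)] = func

        def wrapper(*args):
            # Choose the implementation from the runtime types of ALL arguments.
            key = (func.__name__, tuple(type(a) for a in args))
            try:
                impl = _registry[key]
            except KeyError:
                raise TypeError(f"no method {func.__name__} for {key[1]}") from None
            return impl(*args)

        return wrapper
    return decorator


@dispatch(int, int)
def combine(a, b):
    return a + b          # numeric addition


@dispatch(str, str)
def combine(a, b):
    return f"{a} {b}"     # join with a space


@dispatch(list, int)
def combine(items, n):
    return items * n      # repeat the list n times


if __name__ == "__main__":
    print(combine(1, 2))
    print(combine("GPU", "stack"))
    print(combine([1, 2], 2))
```

In Julia, the same effect comes from writing several methods of one function with different type annotations, and the compiler specializes each one, which is part of how the language stays both generic and fast.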

Julia also has GPU libraries for CUDA, ROCm, oneAPI, and Apple hardware that can be used with the machine learning library Flux.jl (among others). Flux is written in Julia and provides a lightweight abstraction over Julia's native GPU support.

Both Chapel and Julia offer a high-level and portable approach to GPU programming. As with many languages that hide the underlying hardware details, there can be some performance penalties. However, developers are often fine with trading a few percentage points of performance for ease of portability.