
How HPC+AI Can Play in the Renewable Revolution 

Wind and solar are now the least expensive means to produce electricity[1]. As a result, tens of gigawatts of new renewable energy have come online in the last decade, and tens of gigawatts will come online in the next decade[2].

Wind and solar energy have two significant challenges that must be overcome in order to incorporate them at scale into the electrical grid: variability and congestion. Variability is obvious: the wind does not always blow and the sun does not always shine. Congestion refers to the fact that electricity has to “travel” from where it is produced to where it is consumed, and the wires are often at capacity and cannot carry all of the available power to where it is needed. In other words, renewable energy production is not proximate to the population centers where most energy is consumed. This leaves terawatt-hours of energy stranded due to insufficient transmission capacity.

What is needed are industries that can rapidly, and under control, vary their electrical load (to deal with variable generation) and can operate in sparsely populated regions with very little infrastructure, near where the energy is produced. High Performance Computing/High Throughput Computing, with the right additions, can be just that industry. Let’s break it down into infrastructure requirements and controllable variable load.

Anybody who has been in a lights-out data center knows that the number of people in most data centers is not very large. For the most part, machines are remotely – and indeed mostly automatically – managed. There is no need for hundreds of skilled people on site. When nodes do fail, automated management systems can identify nodes, switches, etc., that need to be swapped out. The disabled equipment can either be diagnosed and replaced locally or sent to an urban area for repair or final disposition. All that is really needed from an infrastructure perspective is power and high-speed networking.

The more interesting issue is electrical load management. In order to reduce load, one is going to need to reduce node power consumption drastically (possibly to near zero). How? Assume for the moment that one could checkpoint and restart arbitrary programs, and that the programs are batch applications that do not interact with humans; for example, the individual training jobs in a hyperparameter tuning job set. Checkpointing a job running on a machine allows us to turn the machine off without loss of state, and thus reduce our electrical load.
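
For concreteness, here is a minimal application-level checkpoint/restart sketch in Python, assuming a batch job whose state fits in a small dictionary and a hypothetical site controller that sends SIGTERM before powering a node down; real deployments would more likely use system-level checkpointing (e.g., DMTCP) or a framework’s native checkpoint support.

```python
import os
import pickle
import signal
import sys

CKPT_PATH = "checkpoint.pkl"   # hypothetical location; real jobs would write to shared storage

def save_checkpoint(state):
    """Atomically persist job state so the node can be powered off without losing work."""
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint():
    """Resume from the last checkpoint if one exists, otherwise start fresh."""
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

def main():
    state = load_checkpoint()

    # If the site controller asks us to power down (assumed here to arrive as SIGTERM),
    # checkpoint and exit so the machine can be switched off without loss of state.
    def on_power_down(signum, frame):
        save_checkpoint(state)
        sys.exit(0)
    signal.signal(signal.SIGTERM, on_power_down)

    for step in range(state["step"], 100_000):
        state["step"] = step          # stand-in for one unit of batch work (e.g., a training step)
        if step % 1_000 == 0:
            save_checkpoint(state)    # periodic checkpoints limit lost work to one interval

if __name__ == "__main__":
    main()
```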

The ability to checkpoint/restart gives us the ability to migrate running jobs in time and space. By migrating in time, we mean deferring the job’s execution to a later point in time, as described above. That’s obvious. To migrate a job in space, we take the job’s checkpoint state (including file system state), move it to another site that has idle resources and more power, and restart the job at the other site.
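
A sketch of what migration in space might look like at the job level, assuming each job keeps its checkpoint and working files in one directory; the hostname and the restart_job.sh script are placeholders, and rsync/ssh stand in for whatever data movement and scheduling machinery a real deployment would use.

```python
import subprocess

def migrate_job(job_dir: str, dest_host: str, dest_dir: str) -> None:
    """Move a checkpointed job's state to another site and restart it there.

    job_dir   -- local directory holding the job's checkpoint and file system state
    dest_host -- hostname of a site with idle resources and spare power (placeholder)
    dest_dir  -- path on the destination site that will receive the job state
    """
    # Ship the checkpointed state to the site that currently has power to spare.
    subprocess.run(
        ["rsync", "-az", job_dir.rstrip("/") + "/", f"{dest_host}:{dest_dir}/"],
        check=True,
    )
    # Restart the job from its checkpoint at the destination; restart_job.sh is a
    # placeholder for whatever restart mechanism the remote scheduler provides.
    subprocess.run(
        ["ssh", dest_host, f"cd {dest_dir} && ./restart_job.sh"],
        check=True,
    )

# Example: drain a power-constrained site by pushing a job to one with headroom.
# migrate_job("/jobs/hpo-trial-17", "site-b.example.net", "/jobs/hpo-trial-17")
```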

Using migration in time and space, one can ramp the electrical load at any given site up and down at will. This turns data centers into electrical grid balancing machines: soaking up energy when it is abundant, and shedding load (effectively releasing that energy back to the grid) when the grid is stressed and the energy is needed to keep the grid stable.
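
To make the balancing behavior concrete, here is a simplified control-loop sketch that sets the number of powered-on nodes from a grid signal. The price feed, the thresholds, and the linear ramp are illustrative assumptions, not a description of any particular operator’s system.

```python
import time

TOTAL_NODES = 1_000          # nodes at the site (assumed)
STRESS_PRICE = 100.0         # $/MWh above which we treat the grid as stressed (assumed)

def grid_price() -> float:
    """Placeholder for a real-time price or curtailment signal from the grid operator."""
    raise NotImplementedError

def target_active_nodes(price: float) -> int:
    """Ramp load down as the grid becomes stressed, and up when energy is abundant."""
    if price >= STRESS_PRICE:
        return 0                        # shed essentially all load: checkpoint and power off
    if price <= 0:
        return TOTAL_NODES              # negative prices: soak up surplus renewable generation
    return int(TOTAL_NODES * (1 - price / STRESS_PRICE))   # linear ramp in between

def control_loop(set_active_nodes):
    """Periodically checkpoint/power-off or restart nodes to track the target."""
    while True:
        set_active_nodes(target_active_nodes(grid_price()))
        time.sleep(300)                 # re-evaluate every five minutes
```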

There are potential downsides to migrating in space and time, in particular reduced hardware utilization and longer execution times. With respect to reduced hardware utilization, the thing to keep in mind is that with careful load management and co-location with other loads that are less valuable per MWh, one can expect 90%-95% up-time per node at a site. This drives capital costs per core/GPU-hour up a bit, but is compensated somewhat by the much lower power costs and lower carbon emissions.

With respect to execution time: if one focuses on HTC, as we have done, there need not be any impact on job set completion time. For a typical HTC job, like a data-parallel job, the user does not typically focus on how long each individual task takes to complete; rather, the focus is on how long the set of tasks takes to complete. Modeling a node that is down 10% of the time due to power availability as a node that is 10% slower, we simply allocate roughly 11% more nodes to the job set in order to realize the same completion time. This is analogous to data parallelism in GPUs; the individual cores are slower – there are just a lot more of them.
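
The arithmetic behind the 11% figure, under the assumption above that a node powered off 10% of the time behaves like a node that is 10% slower:

```python
# A node that is off 10% of the time delivers 90% of the throughput, so matching
# the original job-set completion time requires 1/0.9 times as many nodes.
availability = 0.90
overprovision = 1 / availability - 1
print(f"{overprovision:.1%}")   # 11.1%, i.e. roughly 11% more nodes
```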

To summarize: placing data centers in congested load zones, proximate to renewable energy resources, provides three clear advantages. First, shifting computation from regions where the marginal MWh is likely generated by fossil fuels to congested, renewable-rich regions where the marginal MWh is near carbon-free reduces the overall carbon load of computation by hundreds of kilograms of CO2 per MWh. Second, the ability to rapidly ramp load up and down provides valuable “balance” to the grid, mitigating the risks that renewable energy variability poses to the grid. Third, providing a market for excess renewable energy vastly improves the financial calculus for further renewable projects in otherwise congested load zones. This increases not only the total amount of renewable energy available, but also the amount of renewable energy available on the least productive energy days, e.g., when the wind and sun are both weak, reducing the need to spin up fossil fuel generation to meet society’s basic energy needs.

[1] https://www.weforum.org/agenda/2021/07/renewables-cheapest-energy-source/.

[2] Globally: https://www.irena.org/newsroom/pressreleases/2021/Jun/Majority-of-New-Renewables-Undercut-Cheapest-Fossil-Fuel-on-Cost. Just in Texas: ERCOT’s Resource Adequacy reports (ercot.com).

About the Author

Dr. Andrew Grimshaw is President of Lancium Compute. In his role he oversees Lancium’s work to offer affordable, sustainable high throughput computing solutions through our Clean Campus. He previously served as Lancium’s Chief Software Architect. Andrew has spent his entire career in high performance computing. Prior to joining Lancium, Andrew was a tenured Professor of Computer Science at the University of Virginia; co-founded and served as Chief Technology Officer of Avaki, a computer software company; and served as the Vice President of Engineering for Software Products International. Andrew serves on the Scientific Advisory Board, Field of Information for the Karlsruhe Institute of Technology. He holds a Bachelor of Arts in Economics and Political Science from the University of California, San Diego and a Doctor of Philosophy in Computer Science from the University of Illinois, Urbana-Champaign.
