The race to ever-better flops-per-watt and power usage effectiveness (PUE) has, historically, dominated the conversation over sustainability in HPC – but at SC22, held last month in Dallas, something felt different. Across a bevy of panels and birds-of-a-feather sessions – both sustainability-focused and more general – the message became clear: the conference’s eyes had shifted to carbon emissions and energy costs.
Perhaps the most boisterous session, “Addressing HPC’s Carbon Footprint,” featured seven participants: Jay Boisseau, HPC and AI technology strategist at Dell; Andrew Chien, professor at the University of Chicago and senior computer scientist at Argonne National Laboratory; Andrew Grimshaw, president of Lancium Compute (and moderator of the session); Dieter Kranzlmüller, director of the Leibniz Supercomputing Center (LRZ) in Germany; Vincent Lim, executive director of NSCC in Singapore; Satoshi Matsuoka, director of RIKEN in Japan; and Alan Sill, managing director of the High-Performance Computing Center (HPCC) at Texas Tech University.
Moving past PUE and flops-per-watt
Grimshaw’s company, Lancium, had first come to our attention during this panel’s predecessor at SC21. The company’s pitch, essentially, is to position cheap, hot datacenters in west Texas, where an overabundance of variable, congested renewable energy leads to frequent negative energy pricing (where users are paid to accept energy load) and low prices for most of the rest of the time. For the remaining sliver – roughly 5% of the time, when demand spikes, fossil plants kick in and prices rise – Lancium stops running workloads in its datacenters. The result: fully renewable datacenters with bargain-bin energy prices – as long as you can stomach putting your workloads on pause every now and then.
“If we want to do low-carbon … computing and be really inexpensive, we need to move computing to the load,” Grimshaw (pictured in the header) said. “We’re well-suited to do that in HPC because, believe it or not … we have loads that we can pause.” HPC workloads, he continued, tend to operate in batches and typically don’t have a human in the loop. “If you stop it for 20 minutes, at the end of the day, nobody will really know.”
“If you flex with the grid, it’ll allow you to access low-cost, low-carbon power.”
Grimshaw was joined in this refrain by Chien, who hosted the panel last year. “What we’ve learned over the last five years of studying this problem is that actually, there’s an opportunity for greater-capacity, lower-cost HPC if you think about these things in the right way,” he said. “If you flex with the grid, it’ll allow you to access low-cost, low-carbon power. … This doesn’t have to be a downer, doesn’t have to be a restriction and doesn’t have to be a more expensive [option].” The macro-level electric grid trends, Chien said, were also headed in this direction: the grids need it just as much as HPC does. And: “This notion of stranded power or excess renewable power is much more widespread than west Texas,” he said, “so don’t think this doesn’t exist in your geo.”
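The "flex with the grid" operating model Grimshaw and Chien describe can be sketched in a few lines. This is a hypothetical illustration, not Lancium's actual system: the price threshold and the hourly spot prices below are invented for the example.

```python
# A minimal sketch (not Lancium's actual scheduler) of flexing batch HPC
# workloads with the grid: pause jobs whenever the spot power price
# crosses a threshold, resume when it falls back. All numbers are made up.

PRICE_CEILING = 30.0  # $/MWh above which we pause; an assumed threshold


def schedule(hourly_prices):
    """Return a 'run'/'pause' decision for each hour's spot price."""
    return ["pause" if price > PRICE_CEILING else "run"
            for price in hourly_prices]


# A day with negative pricing overnight and an evening demand spike.
prices = [-5.0, -2.0, 8.0, 20.0, 95.0, 120.0, 25.0]
print(schedule(prices))
```

Only the spike hours come back as "pause" – a batch job with no human in the loop simply flexes around them, which is the property Grimshaw argues makes HPC unusually well-suited to this model.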

This premise is about as total a rejection of the flops-per-watt and PUE metrics as you can get: cheap, inefficient hardware and infrastructure that instead derives sustainability from its location and operation. Grimshaw wasn’t shy about that. “We’ve been using these metrics in the community [for] at least the last 10 or 15 years – power efficiency and flops-per-watt,” he said. “Flops-per-watt isn’t the right metric, because if watts are free – including carbon-free – why focus on that? What we really care about is flops per dollar (and power is our problem), or flops per kilogram of CO2 if we’re concerned about that. … Flops-per-watt was really a proxy for what we really cared about, which was CO2 or energy cost.”
Boisseau similarly cast doubt on the energy efficiency regime. “I spent years and years,” he said, “trying to figure out how to build a power-efficient system and then a datacenter where I could get the PUE from 1.3 to 1.2, or [from] 1.2 to 1.1 – and yet, it was in Texas, it was all fossil-based fuels anyway, so I really needed to address the 1.0 with sustainable energy at some point.” Boisseau said he liked what Lancium was doing, adding that Dell – while a leader in hardware – had “not been an innovator in delivery models” and would be “taking a more proactive approach on that going forward.”
“I spent years and years trying to figure out how to build a power-efficient system … and yet, it was in Texas, it was all fossil-based fuels anyway.”
“I’d really much rather see a metric that is carbon efficiency than PUE,” Boisseau later added. This was a sentiment he would reiterate the following week at one of Dell’s webinars: “I’ve actually proposed that it should be measured in carbon per flop instead of flops-per-watt. Flops-per-watt you want to be an increasing number, but what you really want to zero out is carbon, so trying to reduce carbon per flop to zero would be great. And it really changes the way you think about energy if you’re using 100% green energy.”
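The metric shift Grimshaw and Boisseau argue for can be made concrete with some arithmetic. The sketch below uses invented numbers (both systems and both grid carbon intensities are hypothetical) to show why a machine with better flops-per-watt can still emit more carbon per flop than a less efficient one on clean power.

```python
# Hypothetical illustration of carbon-per-flop vs flops-per-watt: two
# made-up systems, one more hardware-efficient but on a fossil-heavy
# grid, one less efficient but on near-zero-carbon power.

def carbon_per_flop(flops_per_watt, grid_kg_co2_per_kwh):
    """kg of CO2 emitted per flop, given hardware efficiency and grid mix."""
    joules_per_flop = 1.0 / flops_per_watt   # watts are J/s, so 1/(flops/W) = J/flop
    kwh_per_flop = joules_per_flop / 3.6e6   # 1 kWh = 3.6e6 J
    return kwh_per_flop * grid_kg_co2_per_kwh


# System A: 60 gigaflops/W on a fossil-heavy grid (~0.5 kg CO2/kWh assumed)
a = carbon_per_flop(60e9, 0.5)
# System B: 30 gigaflops/W on near-fully renewable power (~0.02 kg CO2/kWh assumed)
b = carbon_per_flop(30e9, 0.02)

print(f"A: {a:.2e} kg CO2/flop, B: {b:.2e} kg CO2/flop")
print("The less 'efficient' system emits less carbon per flop:", b < a)
```

Under these assumed numbers, the system with half the flops-per-watt emits roughly an order of magnitude less carbon per flop – flops-per-watt was, as Grimshaw put it, only ever a proxy.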
Chien also suggested that the kinds of high-efficiency hardware measures that reduce PUE might not mesh well with efforts to shape workloads to the grid. “Several people were very proud of the fact that they were able to drive down their PUE by using hot-water cooling,” he said. “I think that was a good idea; I think it’s the wrong idea going into this world.” If you want to flex your capacity up and down, he said, you need to be able to increase your heat-carrying capacity out of the datacenter. “The way you increase the heat-carrying capacity out of the building is by dropping the temperature of the water and increasing the flow rate, both of which increase PUE.”
Lim shared the difficulties of managing energy efficiency in Singapore’s high-humidity, high-heat environment, and noted that relocating workloads across international borders raised problems of its own. NSCC was open to hosting in other countries, he said, but “the most challenging thing about that is to take care of the data sovereignty issue.”
Rising energy costs
Elsewhere in the world, rising energy prices caused paradigm shifts for two of the other panelists. Matsuoka, speaking for the #2-ranked Fugaku system, shared that the prices had forced a dramatic move for RIKEN, showing a graph of Fugaku’s operation over the course of the year with a steep drop in the final months. “That’s when we had to turn off 30% of the nodes because we were facing financial crisis due to this sudden surge in electricity prices,” he said. Matsuoka was, however, reluctant to fully endorse variable capacity as a solution. Amortizing the billion-dollar cost of Fugaku over five years, each year represented $200 million in capital expenditures; even against $40 million a year in energy costs, shutting the system down represented a net loss on the investment.
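Matsuoka's amortization argument can be checked back-of-the-envelope, using only the figures from the panel (the idling fraction is illustrative):

```python
# Back-of-the-envelope version of Matsuoka's argument: a ~$1B system
# amortized over five years vs a ~$40M/year electricity bill. Any time
# the machine sits idle, the stranded capital dwarfs the energy saved.

CAPEX_PER_YEAR = 1_000_000_000 / 5   # $200M/year of amortized capital cost
ENERGY_PER_YEAR = 40_000_000         # $40M/year energy cost


def net_cost_of_idling(fraction_of_year):
    """Net cost of pausing the machine: capex wasted minus energy saved."""
    energy_saved = fraction_of_year * ENERGY_PER_YEAR
    capex_wasted = fraction_of_year * CAPEX_PER_YEAR
    return capex_wasted - energy_saved   # positive => idling is a net loss


# Idling 30% of the year saves $12M in energy but strands $60M of capex.
print(f"Net loss from idling 30% of the year: ${net_cost_of_idling(0.3):,.0f}")
```

With capex five times the energy bill, every hour of downtime costs far more in stranded investment than it saves in power – which is why RIKEN turned to demand-side efficiency instead of variable capacity.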

Instead, he said, RIKEN would be pushing its user community to pursue energy efficiency. “Starting next year, we plan on allocating people energy instead of runtime hours,” Matsuoka said.
“We plan on allocating people energy instead of runtime hours.”
Kranzlmüller, meanwhile, shared that the rising energy prices had opened doors for managing the heat from LRZ’s SuperMUC-NG system. While SuperMUC-NG operated at a very low PUE – 1.06, thanks to hot-water cooling – they had been unable, until recently, to engage in the kind of waste heat reuse seen with the LUMI system. “We wanted to use the heat from the system just for heating – a very simple, straightforward thing. We wanted to do this for ten years, and nobody wanted to take the heat because it’s too much effort, you need to connect it to the loops and so on,” he said. “Today, with the strange global situation, suddenly they want our heat. … You see how stupid that is? We could have done that years ago.”
In response to an audience question about moving workloads to more sustainable colocation sites that used these kinds of technologies, Sill mentioned Quebec-based startup QScale, which is intending to use heat from its ultra-renewable HPC datacenters to help warm greenhouses in the Canadian winters. HPCwire had a chance to visit QScale’s first datacenter in October, and the company – like Lancium – made its booth debut at SC22.
A different conversation
This session was just one of many to discuss these themes at SC22: another carbon footprint session was hosted the following day (sadly, we were unable to attend), along with several other sustainability-oriented sessions and meetings. Discussions of carbon emissions and power savings pervaded vendor announcements and general panels, and the ACM announced that climate change research will replace Covid-19 research as the subject of the Gordon Bell Special Prize in the coming years.
Curiously quiet amid the sustainability news at SC22 were the Green500 and its associated birds-of-a-feather session. There was news, of course: as we covered during the conference, Nvidia’s H100 GPU debuted in a small system named Henri (operated by the Flatiron Institute), achieving unparalleled flops per watt in its Linpack run. Henri dethroned the Frontier-style HPE/AMD systems that now occupy even more of the top ten than they did when they debuted on the May list.
Experientially, though, discussion of these efficiency achievements was relatively muted at the conference (despite AMD’s ubiquitous, tree-lined banners advertising that its hardware powered the world’s most energy-efficient supercomputers). Instead, the conversations around flops per watt and PUE seemed to fade against an ever-louder, increasingly urgent awareness that the climate and financial costs of powering supercomputers have finally become too hefty to ignore – and hardware isn’t enough to stop it.