Advanced Computing in the Age of AI | Tuesday, August 16, 2022

Advanced Scale Computing in the Public Cloud: It’s Exploding, for the Right Workloads 

One of the most significant takeaways from SC15 last month was issued by industry watcher IDC: HPC in the cloud is exploding. More than 25 percent of the hundreds of organizations surveyed by IDC use the cloud, a jump of nearly 100 percent in four years. For sites that use clouds, the share of HPC workloads in clouds has risen to 31 percent.

But in any discussion of HPC and public clouds, the conversation quickly moves to cloud-appropriate workloads. For the right jobs in the right organizations in the right business sectors, there are instances of HPC in the cloud implemented on a massive scale. Right-sizing, in fact, is one of the key attributes drawing more enterprises to the cloud for advanced scale computing workloads.

The promise of cloud computing is to enable user organizations to tailor the technology they use to their specific needs by selecting the cloud services provider with the right instance type – the right amount of memory, the right number of cores, the right kinds of processors, the right network topology – for the user’s workload.

To be sure, even cloud enthusiasts agree that public clouds are not appropriate for high-end workloads requiring 150,000 or even 50,000 cores. It’s the bread-and-butter workloads requiring between 2,000 and 8,000 cores that can be ideal for the public cloud.

“The dynamic that cloud changes and that people finally are getting their arms around,” said Tim Carroll, vice president of sales and ecosystem for cloud computing specialist Cycle Computing, “is it gives us the ability to fit the compute to the science or to the line of business as opposed to the traditional way, which is the person with the workload trying to figure out how to fit their problem to the compute they have available. The demand for compute and data is growing at a pace that’s going to far outstrip the ability of people with conventional internal infrastructure to meet those demands.”

Take MetLife, one of the largest providers of insurance, annuities, and employee benefit programs with 90 million customers in more than 60 countries, all of which adds up to data-intensive computing needs. For Brian Cartwright, assistant vice president who supports MetLife’s actuarial area, the cloud is the right solution to many problems he’s tasked with: running large numbers of actuarial calculations, financial projections and frequent financial reports – all of it while complying with regulatory requirements under fail-safe deadlines.

“It’s all about time with us,” Cartwright said at an SC15 panel discussion on HPC and the public cloud.

MetLife's Brian Cartwright

And it’s about workload cycles. After reports are submitted and deadlines temporarily recede, some on-premises compute capacity at MetLife goes unused. But as deadlines near, capacity becomes overextended, calling for a burst into the cloud.

“The cloud was a great opportunity for us to be able to expand our workloads, add capacity and not have to build up an on-prem infrastructure that would sit idle for other times of the month,” Cartwright said. “This way we only paid for what we were using in the cloud.”

For MetLife, that means Azure, Microsoft’s cloud computing platform.

Jeff Baxter, Azure Inside Opportunity Manager at Microsoft, joined Cartwright on the SC15 panel. He said the emergence of the cloud has profoundly impacted Microsoft. “We don’t see ourselves as a Windows company, we see ourselves as a cloud provider.” That includes supporting customers running Linux or any other platform. “We are very much agnostic, and we’re supportive of people who want to run workloads that aren’t on our platform.”

Baxter said that for Azure, HPC-class workloads are defined for now as traditional engineering, oil and gas, computational fluid dynamics (CFD) and finite element analysis (FEA) applications topping out at about 8,000 cores. He said Microsoft is pushing out hundreds of thousands of new cores to Azure each month, and that it is this steady increase in capacity that will bring more organizations into cloud computing.

“We quite firmly believe the future of HPC for many organizations is in the cloud,” Baxter said. “There are a number of workloads people would like to be running. For example a number of the major automotive companies can’t do side impact testing and other crash testing because of lack of ability to get on their clusters. There’s a large number of workloads that are somewhat cyclical. So having a set of resources sitting round 20 percent utilized is not an economic decision.”
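Baxter's utilization point can be made concrete with a little arithmetic. The sketch below is illustrative only: the two hourly prices are hypothetical placeholders, not figures from Microsoft, MetLife or IDC, and only the 20 percent utilization level comes from the quote above.

```python
def effective_cost_per_used_hour(owned_cost_per_hour: float, utilization: float) -> float:
    """Cost of each core-hour that does useful work on an owned cluster.

    An owned cluster costs the same whether it is busy or idle, so the cost
    of a useful hour is the hourly cost divided by the utilization.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return owned_cost_per_hour / utilization


def break_even_utilization(owned_cost_per_hour: float, cloud_price_per_hour: float) -> float:
    """Utilization below which pay-per-use is cheaper than owning."""
    return owned_cost_per_hour / cloud_price_per_hour


# Hypothetical prices, chosen only to illustrate the arithmetic.
OWNED_COST = 0.03   # dollars per core-hour, amortized, paid busy or idle
CLOUD_PRICE = 0.10  # dollars per core-hour, paid only when used

used_hour_cost = effective_cost_per_used_hour(OWNED_COST, 0.20)
print(f"Owned cluster at 20% utilization: ${used_hour_cost:.2f} per used core-hour")
print(f"Break-even utilization: {break_even_utilization(OWNED_COST, CLOUD_PRICE):.0%}")
```

Under these assumed prices, the owned core-hour looks cheaper on paper, yet at 20 percent utilization each useful hour costs more than the pay-per-use rate. With different prices the break-even point moves accordingly.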

That fits the cloud model at MetLife.

Cartwright said MetLife began using grid processing more than 10 years ago and started experimenting with cloud computing in 2012. By last year his team had begun moving some production workloads out to the cloud and set a goal of having 50 percent of its processing run on the cloud in 2015, a goal they exceeded. On average, Cartwright’s team runs about 1.2 million cloud-based hours of financial calculations per month.

“We did this to manage all that high demand,” he said. “There’s a lot of cost review at our company. We needed to know how we can do this cheaply, and the cloud was really the answer.”

Cartwright said that because of the reporting cycles his group must work within, utilization of on-premises hardware was relatively low. But by moving many financial calculation workloads off-site, in-house compute utilization is up about 20 percent in 2015 over the previous year.

“We didn’t have to take a hit on our fixed cost capacity,” he said. “We were still able to meet all of our [service-level agreements], we didn’t have any delays, it worked out well for us.”

A major cloud advantage is quick and inexpensive computing upgrades. In the past, improving capabilities meant replacing or refreshing on-premises servers, along with setting them up – a time-consuming process. “What’s great about the cloud is we’ve been able to reimage to a better compute node really quickly,” Cartwright said.

The major challenges in moving to the cloud, Cartwright said, were the technical work of complying with company rules, government regulations and security requirements while porting applications during the initial set-up phase, and managing use and cost within a pay-as-you-go pricing model.

“I have business customers that have a thirst for doing different computations; they like to see these calculations run over and over, we’re running thousands and thousands of scenarios, they would love to have more and more capacity available to them,” Cartwright said. “They think: ‘The cloud is there, we can do whatever we want as much as we want, as long as we want.’”

But there’s a price: Having reduced MetLife’s fixed computing costs while adding more compute capability, “now the trick is to throttle it back and make sure there’s accountability for what’s used out in the cloud.”

The team manages this by placing priorities on jobs while monitoring who is using them “so we can tell who’s running what. This kind of accountability is important when you have users who want to just go out and run anything.”

At Schlumberger, HPC in the cloud offers the elasticity, the pay-per-use utility and the technology flexibility that helps the oil services giant deliver real-time results to its customers. It’s a highly complex task involving integration of IoT, big data, analytics, machine learning, visualization and security. But to Kanai Pathak, advisor and technology platform manager at Schlumberger and another member of the SC15 panel, HPC in the cloud also is important for maximizing the expertise of the company’s computing and oil field talent.

“The excess capacity I have to manage is the experts we have on staff,” said Pathak. “If I can have a tester go to a cloud and try all his test scenarios all in parallel, now I get more out of the tester community. If I get the developers to try five different hardware configurations and benchmark it without trying to acquire different new hardware configurations, now I have a better understanding how much faster I can run what I need to do. That’s where the cloud started making a lot of sense to us. The excess capacities in human capital and the expertise we have in the company is basically what we can tap into because of the cloud.”

Yet for all the rapid uptake of cloud, there remain a number of barriers to more widespread adoption, ranging from cost and set-up complexity to, yes, security.

But Steve Feldman, senior vice president for IT at CD-adapco, a CFD, computer-aided engineering (CAE) and FEA software company that claims 8,000 users at 3,000 companies, believes cloud security concerns are overblown.

“The biggest reason in my industry for not going to the cloud is security worries,” said Feldman, another SC15 panelist. “People have to place their IP on the cloud and it scares them. But I think it’s completely unfair. I think most clouds probably have better security than internal networks, cloud people are experts in security. It’s one of those things where they don’t want to do it [move to the cloud] they start picking on reasons for not doing it.”

He said a more legitimate concern is moving data to and from a remote cloud, “especially if you’re a small organization, you probably haven’t paid for large network bandwidth to your site, so it’s expensive.”

Feldman said he’d like to see a modification in cloud pricing models in which users are charged on a pay-as-you-go basis for moving data. “I’m not aware of any service provider that lets you burst and pay for a burst. Instead, you have to pay for that 100MB/second line all five days you’re on the cloud whether you’re using it or not.”
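Feldman's complaint about flat-rate links can be sized with a rough calculation. The sketch below takes the 100MB/second rate and the five days from his example, reads "MB" as decimal megabytes, and assumes a hypothetical 2 TB data set. It estimates how much of the paid-for line capacity such a transfer would use.

```python
LINE_RATE_MB_PER_S = 100   # from Feldman's example, read as megabytes per second
DAYS_ON_CLOUD = 5          # from Feldman's example
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000      # decimal units


def line_capacity_tb(rate_mb_per_s: float, days: float) -> float:
    """Total data the line could carry if it ran flat out for the whole period."""
    return rate_mb_per_s * days * SECONDS_PER_DAY / MB_PER_TB


def transfer_hours(data_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move data_tb at the full line rate."""
    return data_tb * MB_PER_TB / rate_mb_per_s / 3600


def fraction_of_capacity_used(data_tb: float, rate_mb_per_s: float, days: float) -> float:
    """Share of the paid-for capacity that the transfer actually consumes."""
    return data_tb / line_capacity_tb(rate_mb_per_s, days)


DATA_TB = 2.0  # hypothetical data set size, not from the article

print(f"Line capacity over the period: {line_capacity_tb(LINE_RATE_MB_PER_S, DAYS_ON_CLOUD):.1f} TB")
print(f"Time to move {DATA_TB} TB: {transfer_hours(DATA_TB, LINE_RATE_MB_PER_S):.1f} hours")
print(f"Share of capacity used: {fraction_of_capacity_used(DATA_TB, LINE_RATE_MB_PER_S, DAYS_ON_CLOUD):.1%}")
```

Under these assumptions the transfer occupies the line for only a few hours out of five days, which is the gap between flat and burst pricing that Feldman describes.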

EnterpriseAI