
Prodigal Cray CTO Looks Ahead To Converged Analytics 

At the end of the summer, Steve Scott, the chief technology officer who brought many of Cray's interconnects and systems to life over the past two decades, came back to the company after taking key roles at graphics chip maker Nvidia and hyperscale datacenter operator Google. In the years he was away, Cray pushed beyond its supercomputing roots into the enterprise, particularly for high-end analytics workloads. Scott has come back to Cray in part to help drive the next generation of system architectures, which will have to run complex analytics as well as simulation and modeling workloads.

Scott sat down with EnterpriseTech recently to review the path he took in recent years, what he has learned from that journey, and what kinds of technology Cray is pondering as it looks to the future.

Timothy Prickett Morgan: First, an observation. The circuit you have made is an interesting one, and it kind of mirrors what EnterpriseTech is all about. I am curious what you have learned being at Nvidia for two years looking at GPU acceleration and then working at Google on systems design for a year. EnterpriseTech is interested in hyperscale and its needs, HPC and its needs, and enterprise and its needs, and like many people, we think there is something going on – I don't want to say convergence because that makes it sound all lovey dovey and it is not necessarily like that. But nonetheless, there is a need to build systems that look something like HPC but which run enterprise workloads. And obviously the future "Shasta" effort is interesting to me in that it could be a machine that is created to handle multiple workloads and dataflows. To my eye, storage sits at the center of it and various kinds of computation hover around it.

So that is my observation. Now tell me about the things you learned outside of Cray in the past three years.

Steve Scott: Nvidia was still in the HPC industry and I didn't feel like I was leaving the fold. I had spent 20 years at Cray and wanted to go off and try things on the processing side, and I had become a believer in heterogeneous computing. I am not sure how much unique perspective I got at Nvidia, but I came to Nvidia with an agenda to make GPUs easier to program and to develop a long-term roadmap to make them more integrated.

I am still a believer in heterogeneous computing because of the technology constraints we face: tight power budgets and Amdahl's Law. We need fast serial processing and we need efficient parallel processing. So I am still a believer in the basic idea. It is nice to be back at Cray and to be a little bit more open about how you get there, and not to just assume that GPUs are the answer. I still think GPUs have a bright future and a role to play.
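
Amdahl's Law is the arithmetic behind that point: the serial fraction of a workload caps the overall speedup no matter how many parallel units you add. Here is a minimal Python sketch of the law; the 95 percent parallel fraction is an assumed illustration, not a figure Scott cited.

# Amdahl's Law: speedup of a workload in which a fraction p is
# parallelizable across n processing elements and the rest runs serially.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95 percent of the work parallelized, 1,024 parallel units
# deliver only about a 20X speedup, which is why fast serial cores still
# matter alongside efficient parallel ones.
for n in (16, 256, 1024):
    print(n, round(amdahl_speedup(0.95, n), 1))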

TPM: This reminds me of conversations you and I had many years ago, back when the current system was still codenamed "Cascade." The idea back then was to make a machine that could use any type of computing element: X86 processors, multi-threaded vector engines, DSPs, FPGAs, or whatever.

Steve Scott: There are still reasons to use specialized processors for different sorts of workloads. One thing that is common across all of them is the need to move data well, the need to deal with locality, and to worry about the interconnect. High performance computing really is about dealing with data versus performing flops. That's one of the things that ties everything together – what we were doing at Nvidia and what we are doing at Cray now as we bring in all of the data analytics.

TPM: What is the plan for interconnects? When you sold the interconnect business to Intel, you retained the rights to use Aries, with the potential for an Aries upgrade and then Shasta after that. That was pretty much all that was said publicly. To put it bluntly, you are the expert at this interconnect stuff, so your coming back to Cray makes me think that Cray might be interested in new interconnects.

Steve Scott: Interconnect definitely matters, and I wish I could be more forthright about future plans. I can only say that we are looking at all sorts of options. We have a great relationship with Intel and Intel has some pretty aggressive network plans. We will build particularly good systems based on those interconnects, but we are looking at other alternatives as well. There are commodity networks and other things that could be done. But it is a little too early to say anything.

TPM: I like the idea that you are looking around and thinking about it. Just as there is not one kind of compute for every job, I don't think there will necessarily be one kind of interconnect for every job. There may be a system or hierarchy of interconnects – and you tell me if that is stupid or not. Today, for instance, enterprises have Ethernet with a smattering of InfiniBand where they need higher bandwidth or lower latency or both.

Steve Scott: There is a reason why there are Ethernet systems and InfiniBand systems and custom interconnects. They all have a place in the ecosystem. One of our jobs going forward is to build the kind of infrastructure that gives us the flexibility for different interconnects and processors. So that story doesn't really change from five years ago and the Adaptive Supercomputing vision from Cray.

TPM: So what lessons did you learn from Google? They have a slightly different set of issues they are wrestling with.

Steve Scott: Google is an incredible company to work for. I like the company a lot. The scale they are working at is just mind boggling. So Google is very impressive from that perspective.

Their systems are not custom-designed for high performance computing, and they are not custom-designed for high performance enterprise computing. They are primarily designed for user-facing Web applications. Google has invested in an incredible amount of software that provides a set of services, and those services very quickly let it deliver new user-facing capabilities, taking advantage of all of the ways of moving, saving, and accessing data with different kinds of resiliency and performance. So they have an incredible global infrastructure that makes it easy to do things across different geographies.

They are not, again, interested in building systems for high-end enterprise or high performance technical computing like Cray does. I certainly gained an appreciation for cloud computing and big data applications of a certain type at Google.

At Cray, we are focused on data analytics that require high network performance, everything in memory, and low latency – aggressive analytics, as opposed to just dealing with large amounts of data.

You can think of a pyramid of data analytics. One piece of the pyramid involves lots of data where the analytics is not performance intensive, and there is another sort where you want to keep all of the data in memory and do aggressive analysis, whether it be Hadoop-style or graph analytics. MapReduce can be used for problems where you don't need a strong interconnect, but there are also problems where you do. If customers are doing Hadoop analytics on commodity clusters, there is a limit to the kinds of analysis they can do. We have customers that are interested in doing higher performance MapReduce as well as graph analytics.
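
The split comes down to communication patterns. A toy word count in Python illustrates why MapReduce-style analytics tolerates a weak interconnect, and the closing comment notes why graph analytics does not; this is an illustrative sketch, not Cray or Hadoop code.

from collections import Counter

# Word count, the canonical MapReduce example. The map phase is
# embarrassingly parallel: each node counts its own documents with no
# communication. Data only moves during the shuffle/reduce phase, so a
# commodity network is usually good enough.
def map_phase(doc: str) -> Counter:
    return Counter(doc.split())

def reduce_phase(partials) -> Counter:
    return sum(partials, Counter())

docs = ["big data big analytics", "graph analytics moves data"]
print(reduce_phase(map_phase(d) for d in docs))

# Graph analytics is the opposite case: each step follows edges to
# neighbors scattered across the whole machine, producing many small,
# random remote accesses. That traffic pattern is what rewards a strong
# interconnect and keeping the entire dataset in memory.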

It has been nice to come back to Cray after a few years and to finally see some traction in the commercial space. Part of what attracted me back is that we are branching out from core HPC and moving into high performance analytics. I am pretty excited by the potential there. We will leverage the same strengths, and I think there is a play for a company that is focused on really high performance and deep collaboration with customers – one that can roll up its sleeves and help customers do things they would not otherwise be able to do.

Just as there is capacity high performance computing and then the capability high performance computing that Cray has focused on, the same split will happen in analytics. We are at the beginning of something that is just going to get larger and larger. There is just so much data: all of the sensor data, all of the social media data, all of the sales data. Businesses are starting to track everything and they are just buried in data.

TPM: So what is the first thing you need to do as Cray's CTO?

Steve Scott: I am spending my time talking to a bunch of people and getting up to speed on what we are doing. We reorganized in the summer to have a very strong and unified product group that has responsibility across traditional supercomputing as well as data analytics. I will be putting together the CTO office and reshaping it a little bit.

TPM: Are you interested or intrigued by ARM processors and moving in that direction in terms of heterogeneous compute?

Steve Scott: I am absolutely intrigued by ARM processors. I don't view ARM as having some inherent advantage over X86. It is not an inherently lower power instruction set, and I do not feel that ARM chip vendors can do something with ARM processors that Intel can't do with X86.

But what ARM brings is a very open ecosystem that allows innovation and lots of players to come and try different things. Over time, an open ecosystem at high volume tends to produce interesting results, and I think it is too early to say that there is a strong market segment where ARM will win. But it is definitely something worth looking at in the context of the flexibility it might provide. ARM in HPC was not going to get anywhere until 64-bit came along. If we talk a year from now, I don't anticipate our processor mix being any different. I don't know if it ever will change, but it is worth tracking.

TPM: Do you anticipate that GPUs or other kinds of accelerators will get a larger piece of the mix? I think that systems are going to be built as you described, with both serial and parallel elements.

Steve Scott: That's right, and Nvidia knows that as well. I still think that GPUs are a good technology. There are other ways of building good, power-efficient parallel processors, and Intel understands that. I think that GPUs will continue to grow market share, but from a relatively small position. The jury is still out as to what fraction GPUs might have five years from now.

TPM: What about OpenPower? At least it is an option now. Is there a place for an OpenPower system at Cray?

Steve Scott: I think the idea of Cray building systems with IBM processors is unlikely to come to pass. I can understand why IBM is doing what it is doing with Power, and its partnership with Nvidia could be interesting. But we talked earlier about volume, and from a volume perspective it is going to be very difficult to turn that around with OpenPower. It is not high on my list of things to be concerned about.

TPM: What else are you thinking about when it comes to future systems?

Steve Scott: There are some very interesting things happening around non-volatile storage. We are likely to see at least two more levels of the memory hierarchy come into play. The first is stacked, on-package memory that has very high bandwidth and very low power but limited capacity at a higher cost; it will sit between the last level cache and main memory. The second is SSDs, currently built from NAND flash and later from interesting technologies such as PCM or others, where you can get down to very low latencies; these will sit between main memory and storage.
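
Laid end to end, the deepening hierarchy Scott describes looks roughly like the enumeration below. The latency and capacity figures are order-of-magnitude estimates added for illustration, not Cray specifications.

# The memory/storage hierarchy described above, fastest tier first.
# Figures are rough order-of-magnitude estimates, not Cray specs.
tiers = [
    # (tier,                     ~latency,  bandwidth,    capacity)
    ("last-level cache",         "~10 ns",  "very high",  "tens of MB"),
    ("stacked on-package DRAM",  "~100 ns", "very high",  "a few GB, costly"),
    ("DDR main memory",          "~100 ns", "high",       "hundreds of GB"),
    ("NAND flash / future NVM",  "~100 us", "moderate",   "terabytes"),
    ("rotating disk",            "~10 ms",  "low",        "tens of TB"),
]
for name, latency, bandwidth, capacity in tiers:
    print(f"{name:26} latency {latency:>8}  bandwidth {bandwidth:10} capacity {capacity}")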

From a software perspective, you want to be able to make non-volatile memory transparent to most users, but also to have the hooks that allow advanced users to manipulate it. You want efficient data motion all the way up and down this increasingly deep stack of memory, with rotating disk at the very end of it. So compiler technology and data movement technology need to come together in the software to manage that memory hierarchy.
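
One way to picture that "transparent by default, with hooks for experts" model is an allocation interface with optional placement hints. The Python sketch below is purely hypothetical: allocate, Tier, and the placement policy are invented names for illustration, not any real Cray API.

from enum import Enum
from typing import Optional

class Tier(Enum):
    ON_PACKAGE = "stacked on-package memory"
    DRAM = "main memory"
    NVM = "non-volatile memory / SSD"

def allocate(nbytes: int, hint: Optional[Tier] = None) -> bytearray:
    # Hypothetical allocator. By default the runtime picks a tier and
    # migrates data automatically (the transparent path); advanced users
    # pass a hint to pin hot structures high in the hierarchy or stage
    # cold ones down to non-volatile memory (the expert hook).
    tier = hint if hint is not None else Tier.DRAM  # stand-in for an automatic policy
    print(f"allocating {nbytes} bytes in {tier.value}")
    return bytearray(nbytes)

buf = allocate(1 << 20)                        # transparent path: runtime decides
hot = allocate(1 << 16, hint=Tier.ON_PACKAGE)  # expert hook: explicit placement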

The two things that are particularly interesting to me are dealing with this non-volatile memory and how it fits into the stack, and figuring out how we combine data analytics and traditional simulation. That will keep us busy, I think.
