HPC Roundtable – Turnkey HPC Solutions

Webinar Synopsis:

Speakers:

  • Gregory Kurtzer,  CEO, CIQ
  • Jeremy Siadal, Senior Software Engineer, Intel
  • Patrick Roberts, Technical Director, Design Automation, Skyworks Solutions, Inc.
  • Dave Godlove, Solutions Architect, CIQ
  • John Hanks, HPC Administrator, Chan Zuckerberg Biohub
  • Gary Jung, Lawrence Berkeley National Laboratory
  • Jonathon Anderson, Solutions Architect, CIQ
  • Trevor Cooper, HPC Systems Programmer, University of California San Diego
  • Forrest Burt, High Performance Computing Systems Engineer, CIQ

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Full Webinar Transcript:

Gregory Kurtzer:

Hi everybody. We don’t have Zane today. Zane typically leads these, so all of you are stuck with me. We’re going to do something different today. As you can see by the number of people and guests we have here, we’re going to cover a number of different topics around high performance computing, specifically turnkey high performance computing. We will go around the table and discuss various aspects of some of the basic use cases we need to solve and what it takes to create and build a turnkey HPC solution. I’m going to introduce everybody by calling out their names and asking for a quick introduction, in the order on the screen. Jeremy, welcome, and tell us a bit about yourself.

Jeremy Siadal:

My name’s Jeremy Siadal. I’m a senior software engineer with Intel Corporation. Today, I am representing OpenHPC, an open source distribution specifically for HPC.

Gregory Kurtzer:

That’s perfect. Thank you, Jeremy. Patrick.

Patrick Roberts:

My name is Patrick Roberts, and I’m with Skyworks. I specifically work in HPC related to EDA, which is electronic design automation and designing chips and building supers to facilitate that.

Gregory Kurtzer:

Very cool. Thank you, Patrick. Dave.

Dave Godlove:

Hey, everybody. I’m Dave Godlove. Initially, my background was as a basic research scientist in biomedical research at the National Institutes of Health. I have since gotten into high performance computing and worked at Biowulf, the NIH’s intramural high performance computing resource. I’ve also worked with Greg extensively, helping to develop what was previously Singularity and is now Apptainer. Now I’m here representing CIQ.

Gregory Kurtzer:

Welcome, Dave. Thank you. John, aka Griznog.

John Hanks:

I’m John Hanks. I am an HPC administrator at the Chan Zuckerberg Biohub and manage clusters. Typically, it’s always been life science clusters, although I’ve dabbled in CFD, climate modeling, and other things along the way.

Gregory Kurtzer:

Very cool. Thanks, John. Depending on if there are other Johns, I’ll typically go back and call John, Griznog. Since there is a Jonathon, you may hear me call him Griznog. Gary.

Gary Jung:

Hi, my name’s Gary Jung. I work at Lawrence Berkeley Laboratory and have an appointment at UC Berkeley. I lead the scientific computing group and run the institutional HPC for both institutions. I’ve worked with Greg for many years, and he and I built the HPC programs at both institutions.

Gregory Kurtzer:

Very cool. Thank you, Gary. Jonathon.

Jonathon Anderson:

Hey, my name’s Jonathon Anderson, and I’ve worked as an HPC sysadmin at a number of sites in the US national lab and broader academic space. Today I’m happy to represent CIQ as an HPC systems engineer and solutions architect.

Gregory Kurtzer:

Thank you, Jonathon. Trevor.

Trevor Cooper:

Hi, my name is Trevor Cooper. I’m an HPC systems programmer with the data enabled scientific computing group at the San Diego Supercomputer Center. I’ve been working with Greg for a while now in the Rocky project, and I’m here to lend my input on this panel.

Gregory Kurtzer:

Excellent. Thank you, Trevor. Forrest.

Forrest Burt:

Hey everyone. My name is Forrest Burt. I’m a high performance computing systems engineer here at CIQ. I do a lot here at CIQ, but primarily with our Fuzzball system working on HPC 2.0. I come out of the academic sphere. I graduated college about a year ago, and while in school, I worked as the student sysadmin at my institution. I also got to work a little bit in the national lab space around high performance computing. I’m very excited to be here, having a blast at CIQ, and glad to be on the panel as well.

Solving Problems with Traditional HPC Systems [00:05:26]

Gregory Kurtzer:

We had a dry run yesterday where we all got together and talked about things, and I wish I could have paused that dry run and just superimposed it on what we’re going to talk about here. There was so much information there that was just so fantastic that I wish it would’ve been live. What that told me is we are going to have this group back multiple times, I think, and we may even make panel discussions like this recurring because it was so cool to hear. We’ll try to replay some of that and focus on different topics. Everybody’s still fresh on it and bringing their ideas like they didn’t just say them the day before.

We’re going to start talking about what problems we need to solve for a traditional, sweet spot HPC system. When I say sweet spot system: many of us have experience with very large institutional-level systems, but that’s not the primary use case for most people in the community. Most people will buy a 50 to 100 node system from a vendor, and they will be looking at how to easily manage that system. In many cases, these are postdocs, grad students, and undergrad students. Sometimes it’s the most junior member of the team, the newest person to come in, or whoever lost the bet on who has to install and maintain the HPC system.

It’s our job to think about our experience and figure out what we need to be thinking about and providing to make their job more successful. We have so much collective experience in this room, a virtual room. How can we help with that? Let’s start with talking about some of those general purpose jobs that we all have experience with in high performance computing. Let’s talk a little bit about those requirements. I’d like to hear from the panel about what types of jobs we are seeing. This historically used to be a lot of MPI and tightly coupled jobs. Is that still the case, or are we starting to move to a broader diversity of jobs at this point and maybe less MPI? What are people’s thoughts on that?

Patrick Roberts:

In my arena, it depends on the software vendor we rely on, or even in some instances, the specific task we’re trying to accomplish. I’ve seen a lot of MPI and a lot of non-MPI jobs, and a lot of MPI jobs where the vendor never explains that they’re MPI jobs, which is always fun because we use many commercial tools in my industry. Many of those have underpinnings which are not fully disclosed or explained, and you have to use a crystal ball to figure out what exactly they’re trying to accomplish on the back end, then profile it and figure out how to optimize for specific tools in those environments.

So it’s really a mixture of MPI-based jobs and jobs that want to grow to encompass all of one physical machine. You can sometimes lie to it, hide that you’re on one specific physical machine, and grow beyond that. But then you get into licensing issues, which I’m not sure everybody here has to deal with, but one thing I deal with a lot in my arena is that vendors tend to juggle the license requirements quite a bit to optimize their cash gain. That can really change what you have to provide the software. Sometimes you will pay a more significant price if you’re utilizing a GPU; you’ll pay a different price if you’re using multiple sockets or multiple cores. It’s constantly a juggle to meet the needs and demands of, at least, my customers.

Gregory Kurtzer:

Patrick, what percentage of the applications you’re dealing with in EDA are traditional MPI versus serial or embarrassingly parallel?

Patrick Roberts:

Only about 25% utilize some form of MPI on the back end. Sometimes that’s Open MPI; sometimes that’s Intel MPI; sometimes it’s whatever implementation of MPI is required. And again, it’s not always disclosed that that’s actually what’s happening. A lot of vendors like to package that and then hide it, as I said. From my experience, about 25% use MPI, and outside of that it’s a mix, so it’s hard to quantify.

Gregory Kurtzer:

Is that about a normal percentage in terms of what we’re seeing in other scientific disciplines? What about bio, for example?

John Hanks:

Yeah, the graphic that Trevor had yesterday pretty much sums up what I see on my systems and what I’ve seen throughout all my life science-related clusters. Almost everything is a single node and a lot of it’s single core.

Trevor Cooper:

I’d agree with that. We talked about this yesterday a little bit too, and I want to reiterate that a big change I’ve seen in the few years that I’ve been working in this area is the number of resources available on a per-node basis. When I first came to SDSC, our biggest system had 16 cores per node; I think it had 32 or 64 gigs of RAM. I’m installing nodes in our system today that have 128 cores and 512 gigs of RAM each. And those aren’t the large memory nodes – the large ones have terabytes of RAM. On our old system, we would actually run software that would build virtual SMP systems that appeared to the users to have a lot of memory.

Nowadays, we don’t need to do that. We can deploy nodes in the cluster that have more cores than, say, 10 or 20 nodes had in our old system, not even ten years ago. The resources on the nodes now eclipse what was available before. There isn’t a need to support something like multi-node MPI for the vast majority of our users, at least. We have codes that bundle MPI inside under the hood, and users bring those; they used to run multi-node before, but now they can run those codes with MPI locally on the same machine. They don’t change anything. A 32-core or 64-core job that used to take four or eight nodes now all runs within a single node, and they don’t do anything. They just run the application, which runs MPI under the hood, because that’s how the developers coded it.
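The shift Trevor describes can be sketched as a hypothetical Slurm batch script; the application name, core counts, and node sizes below are illustrative assumptions, not anything from the panel:

```shell
#!/bin/bash
# then.sbatch — sketch of the old 16-core-node era: 64 MPI ranks
# spread across 4 nodes, with rank-to-rank traffic crossing the
# cluster fabric (hypothetical application name).
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
srun ./my_mpi_app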

MPI vs. Per Node Systems [00:13:32]

Gregory Kurtzer:

That’s a really interesting point. When I first started getting into HPC, there were no multi-core chips. As a result, if you wanted to get up to a hundred processes and a hundred cores, you had to have 50 nodes, generally speaking, in dual socket systems. It’s a very different situation right now. Gary, I know you run some big systems at Berkeley and support many scientific disciplines. What are you seeing in terms of tightly coupled MPI versus per-node systems or jobs?

Gary Jung:

That’s a really interesting question. I pulled some numbers last night and looked at it, and as most people would surmise, it depends on the domain. We see people like the climate and materials science, nanoscience folks still doing a lot of traditional MPI, multi-node jobs. Altogether, at the lab and at UC Berkeley, we manage about 80,000 cores across many disciplines, and I was surprised to find that the average number of cores a job uses is 15. They’re not just single core jobs, but it is a mix. I thought that number was interesting because we were talking about fat nodes, and we’ve always been buying fatter and fatter nodes. But maybe that’s not what we need to do to get better utilization. We will have to move scheduling from exclusive nodes to shared nodes. That’s kind of what we’re seeing. If you look at it by number of jobs, the large majority are single node jobs.
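Gary’s utilization point is easy to quantify with a toy calculation. The job mix below is invented for illustration (it is not Berkeley’s data); it only shows why exclusive-node scheduling wastes cores when the average request is around 15 cores on 128-core nodes:

```python
# Toy illustration with a made-up job mix averaging 15 requested cores.
cores_per_node = 128
job_core_requests = [1, 1, 2, 4, 8, 15, 16, 32, 56]  # hypothetical jobs

requested = sum(job_core_requests)
average_request = requested / len(job_core_requests)

# Exclusive scheduling: each job holds an entire 128-core node,
# so the cores actually held far exceed the cores requested.
cores_held = cores_per_node * len(job_core_requests)
utilization = requested / cores_held

print(f"average request: {average_request:.1f} cores")   # 15.0
print(f"exclusive-node utilization: {utilization:.1%}")  # 11.7%
```

Under shared-node scheduling, the unrequested cores go back into the pool for other jobs, which is the motivation for moving away from exclusive nodes on fat hardware.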

Gregory Kurtzer:

Got it. So maybe this is a little bit discipline-specific, Gary. I heard you say that the climate scientists are still doing very tightly coupled, parallel, MPI-focused workloads. And I know we have some additional realms here that we could be talking about, but I’m going to wager that it’s really specific to the different types of science going on within a system, to the point where the system we’re talking about in terms of a turnkey system may actually be dictated a little bit by the type of research they’re doing. So let’s expand this a little bit, right? We talked about MPI, which leads right into the interconnect fabric you need. If you have tightly coupled MPI going multi-node, you may need a higher performance interconnect fabric. If you don’t need that, the interconnect fabric may not be quite as important. But what about GPUs? Are we still seeing a trend of increasing GPUs?

Jeremy Siadal:

I would say yes. We’re certainly looking at more instances of hybrid systems: traditional HPC mixed with GPUs, mixed with scale-out to alternate platforms, such as extending jobs out to the cloud as needed. I was also going to bring up one thing: I don’t have a lot of visibility into end user applications in my job, other than what I see through OpenHPC, which is very MPI-focused, traditional HPC focused. But with the other groups I work with, there’s certainly a lot more focus on visual computing and hybrid traditional HPC and AI. If you look at an organization, you don’t have infinite money to buy multiple systems tailored for each discipline you want to run; you look to get a good hybrid system that can cover both. Greg, I totally agree that in traditional MPI, fabric is going to be extremely important, but I will also say that fabric is important in any large data set application where you’re going to have to run interconnected storage over that same fabric.

John Hanks:

In my current environment, InfiniBand is primarily a storage network. We’re rolling it out aggressively at 100 and 200, mainly for a storage network but also as a hedge for if and when the day comes when we are doing multi-node GPU jobs, and we need that bandwidth.

Geoscience and MPIs [00:18:47]

Gregory Kurtzer:

So Mystic Knight is mentioning: any insights for geosciences, such as oil and gas imaging? Gary and I actually have quite a bit of experience working on those sorts of workflows at Berkeley. Some of the bigger systems, some of the early big systems we created, were specifically around oil and gas and imaging. In terms of the insights that I have, and I’m going to ask Gary to jump in as well as anyone else – Jonathon, you may have experience in this as well – but typically, I’ve seen that for that sort of imaging, they’re going to break that into a lot of smaller three-dimensional pieces that they would run over MPI. And typically, those sorts of jobs scale extremely well over large systems using MPI. Gary and Jonathon, any additional comments or thoughts on that?

Gary Jung:

I think a lot of the geoscience codes work well on MPI. Geophysics and geochemistry, those folks still do that type of thing. And many of the climate people have gone on to also use national systems, leadership class systems, but we still see a lot of the other people in that domain using what we’re calling the sweet spot HPC systems.

Jonathon Anderson:

My experience is that in a domain like geoscience and oil and gas, these are HPC bread and butter applications. These applications grew up and co-evolved with HPC as a medium. What we’re seeing with the prevalence of these single core, single node, non-MPI workloads is really an entirely new set, a new community of users who are more likely to have grown up on cloud resources, where things like parallel computing weren’t as much of a focus. That’s not how their applications were developed; they were developed more around cloud paradigms. And now we’re seeing that get re-ingested into what we’ve traditionally thought of as Beowulf cluster style HPC, with a low latency fabric and MPI. I think that’s a big part of where the impedance mismatch comes from: these are applications that evolved in different environments.

CXL [00:21:18]

Gregory Kurtzer:

That is a fantastic point. Speaking of the fabric, Mystic Knight, great question: “Any predictions for usage of CXL, will it replace fabrics like InfiniBand?” I would actually expand that to basic PCI switching and leveraging the PCI fabric directly, running message passing over that fabric instead of going further down the stack to something like an InfiniBand interface – could we do PCI directly and do that messaging? I’ve seen a little bit of this personally. I’m curious if anyone has seen some uptake of this and maybe some cool ideas coming out of this space.

Patrick Roberts:

I’ve seen some examples. I’ve never seen anything in the wild, but the science projects I’ve seen with it are pretty neat. They’re just science projects, though, because they’re generally not practical due to cost. The cost of something like that at this time is so astronomical that, in my opinion, you are only going to see it in some of the showpieces that are out there. You’re not going to see that boil down just yet. It’s not quite ready for prime time, and there’s not an economy of scale to bring the cost down to where it’s feasible to implement at this time. But what I am seeing, and I’ve seen quite a bit of this recently, is push back on traditional interconnect speeds.

There’s a huge push to get away from 20 and 40-gig links to 25, 100, and 400-gig links, coming from different directions. It’s mostly coming from certain hardware vendors; sorry, I don’t want to name names. It’s causing problems because the switch vendors want to go in a different direction. They want to stay with the 40 gigs, the 80s, and the 160s, and that sort of thing. There are competing technologies at play. I don’t know if it’s due to supply chain shortages or what, but it is causing havoc with getting nodes built out, in my experience.

The Flexibility of a Turnkey HPC System [00:23:51]

Gregory Kurtzer:

I’m hoping CXL helps with much of that by creating a standard interface. Regarding scale, I think that having smaller CXL or PCI switching fabrics and then clustering those smaller fabrics may help achieve an economy of scale, as opposed to having a very large single flat network like what we typically see with InfiniBand or Ethernet. I think there may be ways to help manage that. But I’m hearing that there may not be one size that fits all; I even push back on that notion. Maybe this will be a little more about, if we’re considering some turnkey HPC system, it needs to be fairly flexible about what must be part of it. Let’s start moving in that direction: what is needed and, Jeremy, your experience. Sorry, was there another thought?

Dave Godlove:

Well, I was going to say, along those same lines, if we’re thinking about sweet spot sized systems, as you’ve said, 50 to 100 nodes: we’re talking a lot about MPI, what workloads use MPI, and things of that nature. The types of applications we’re talking about, the materials sciences and things, are applications I would contend probably aren’t going to run very often on sweet spot sized machines. When we’re talking about those, we’re talking about individual labs, or maybe small groups of labs, that are buying this hardware and running it themselves. Big oil and gas is not going to be running workflows on machines of this size.

Use of MPIs [00:26:00]

Gregory Kurtzer:

That’s a really interesting point from Ramin – “I guess 15 cores on average is a result of many small jobs with a short wall clock runtime, maybe an optimisation, decomposition, or post-processing job” – and I’d be curious about Jeremy and Forrest’s take on that. Jeremy, I know OpenHPC focuses a lot on the turnkey distribution of this. What kind of use cases are you seeing? Are you still seeing a lot of MPI uses?

Jeremy Siadal:

Of course. As I said, it’s in terms of what I’m seeing and supporting. Intel is moving into hybridization: adding additional components like GPUs, as well as being able to support different workloads like AI and machine learning. You have a lot of systems available. The MPI applications are out there, and people are using them. You’re looking for these systems to be at least dual purpose, between MPI and running PyTorch or other jobs. But there’s a lot of background discussion going on right now about Ethernet, because in terms of turnkey, what I still think most people want is Ethernet: it’s there; it works. And for most jobs, it’s fast enough, especially if you’re talking about coarse-grained applications. They’re not going to make use of a high-speed fabric. That’s more money put into faster systems instead of high-speed fabrics.

Jonathon Anderson:

For a system this size, this sweet spot, Ethernet has evolved to an extent where you can put it in, and if you don’t have any cross-node parallelism, you just have a network. But a couple of steps later – let’s forget about the supply chain for the moment and pretend we can buy what we need – you can get a RoCE-capable switch, and you get low latency too. Maybe it’s not as good as the best InfiniBand right now, but it’s a pretty good middle ground for this cluster size.

Gregory Kurtzer:

And doing PXE booting over Ethernet is much easier than doing it over InfiniBand.

Patrick Roberts:

That’s where you come in, where sometimes the best approach is multiple layers of networking: you have your command and control layer, your general interface layer, and your background disk layer. Sometimes you want to segment that out. There’s the whole concept of – I hate to use this buzzword because it’s painful to me – hyper-converged, which sometimes doesn’t work when you’ve got narrow pipes. Breaking it out into multiple segments is sometimes very useful; it adds to administrative overhead, but Ethernet is so common and so inexpensive for the most part these days. You’ve got four ports minimum per node in most instances. Why not LACP two of those off for the management and interface sides, and two of those off for storage, or something along those lines? You can come up with a fairly elegant solution that works without too much management overhead.
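Patrick’s four-port split can be sketched with NetworkManager; the interface names, bond names, and the two-plane layout below are assumptions for illustration:

```shell
# Management/interface plane: LACP (802.3ad) bond over the first two ports.
nmcli con add type bond ifname bond0 bond.options "mode=802.3ad,miimon=100"
nmcli con add type ethernet ifname eno1 master bond0
nmcli con add type ethernet ifname eno2 master bond0

# Storage plane: a second LACP bond over the remaining two ports.
nmcli con add type bond ifname bond1 bond.options "mode=802.3ad,miimon=100"
nmcli con add type ethernet ifname eno3 master bond1
nmcli con add type ethernet ifname eno4 master bond1
```

The switch ports on each bond would need matching LACP port-channel configuration for the aggregation to come up.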

State of ICR [00:29:38]

Gregory Kurtzer:

That’s a really good point as we start to think about this again. I know it’s not one size fits all, but if we were to think of it like that, maybe we’re starting to lean towards a simpler fabric and more robust, more capable nodes – which is just the evolution of nodes. It sounds like GPUs are still going to be a major aspect of this. We’ll need to be able to support GPUs on some percentage of those resource nodes, though maybe not all. But now I’m curious, and we do have a question from InsaneDuck that, Jeremy, I think is aimed at you. What’s the state of Intel Cluster Ready (ICR)?

Jeremy Siadal:

I started with the ICR program quite a ways back. Intel Cluster Ready is not a program at Intel anymore. However, it was replaced with the HPC platform specification, which encompasses much of what Intel Cluster Ready was trying to accomplish: the standardization of a base platform for HPC that provides a lot of what we’re looking for in a turnkey system in terms of how it’s built and the applications. The other thing that came out of Intel Cluster Ready was the Intel Cluster Checker, and that is still very much a product. When you get into turnkey systems, having a tool that allows you to rapidly check that everything is working correctly is definitely helpful. Of course, the benefit now is that all of the Intel tools can be installed directly from online repos.
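As a rough sketch of the workflow Jeremy describes, Intel Cluster Checker is driven from a list of nodes; the node names are placeholders and the exact flags vary by version, so treat this as an assumption-laden example rather than a reference invocation:

```shell
# Hypothetical health check of a small cluster; node names are made up.
cat > nodefile <<EOF
node001
node002
node003
EOF
clck -f nodefile   # run the default checks and print a pass/fail summary
```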

Gary Jung:

Can I jump in here for a minute? I wanted to jump back to that point about simpler fabrics. Would this be a good time to do that? We’re seeing researchers for whom the sweet spot system is one of several; a lot of people typically have a portfolio of computing resources they can access. The sweet spot system we’re building is not only for their own use; one of the other things they use it for is on-ramp purposes to other systems in their company, or national systems if you’re in the academic world. If you want to be able to on-ramp your application onto other places, or even the cloud, then perhaps you want that fast fabric so that you have a similar environment. We even get people who off-ramp onto some of these because the high-speed systems may have architectures that are no longer suitable for their codes.

Applications That Support Containers [00:33:03]

Gregory Kurtzer:

That’s a very good point. One thing I wanted to come back to, as we’re talking about applications and use cases: I’m kind of curious. I have a really big perspective and lots of opinions about containerization, as you might all know. What percentage of users and applications are we now seeing that support containers? If we were to guess, is approximately 50% of the workload in containers, or 20%, or 80%? I’m going to jump through the room quickly; tell me, in your experience, if you were to wager a guess, what that would be. Trevor, let’s start with you.

Trevor Cooper:

UCSD and SDSC were among the first HPC national systems to put a containerized solution into the general space for users. We’ve got a long history of supporting initially Singularity and now other solutions. Our users are typically of two flavors: either they’re investigators who don’t have a huge amount of resources or software development support to build containers, or they’re what we call users but are really gateway service providers. Those gateway users typically do have full-time staff who can work on these things. We’ve been seeing that several gateway-type use cases are deploying their software stacks with containerized solutions.

Some of our biggest users are gateways that support hundreds or thousands of other users outside of our system. We don’t count them directly on our machine, but they get counted in the big pool. We’ve seen cases where they’re containerizing so that they’re more able to move their software stacks between systems. They may have allocations at multiple national systems; this is all grant-funded stuff. They’re also starting to branch out and work on a pay-for-use model, where their gateway customers come in and provide funds, and they’re able to run on our systems or in the cloud and spin up resources on behalf of those users. Those kinds of cases are growing. It’s not typical for, I’ll call them, small-time or long-time users to be doing that, because they don’t generally have access to systems where they can build their containers, or the expertise to do that. Our user services people build containers and provide them on our systems. I think on the order of 20% to 30% of our users are now running containerized on our big systems.

Gregory Kurtzer:

Forrest, what’s your take on this?

Forrest Burt:

While I was at my previous institution, it was a very small university. We were associated with a national lab and a major energy company for the state. We had a few outside institutional users also interacting with our clusters. What I by and large saw was that we were doing standard module file management of applications. While I was there, starting around 2019 or so, we were looking into solutions like Singularity and Spack, that type of thing. That was where we were starting to make our first forays into actually using containers to deploy stuff for users. I would say that, for the most part, I saw just traditional module file stuff.

It’s probably on the order of 5% of the users with containers and stuff. It was mostly things I was building out that were just kind of custom environments for certain researchers that would otherwise be difficult to represent within a module file system. So yeah, in my experience, at least from a small institution, it’s mostly still on the module file system. I would say it’s 0-5% of people deploying containers regularly in that environment as far as users go.

Percentages on Containerization [00:37:37]

Gregory Kurtzer:

Thank you. I want to run through the rest of the group here real quick. I’m going to ask you to spit out the percentage that you would guess. Jonathon, what percentage of containerization have you seen on systems?

Jonathon Anderson:

I don’t know about actual running workloads, but at my last academic appointment, when new workloads and new requests came in for help getting software set up, our support guys turned a corner. Instead of trying to fold all of those dependencies into a central module stack, it was easier to build it up as a container and hand that to them. I’d say well over half, maybe even 90%, of the workloads people were bringing in – if it was more than make install in their home directory – they were putting in a container.

Gregory Kurtzer:

Got it. Very cool. Gary, what about Berkeley?

Gary Jung:

I don’t know the percentage of users, but down at UC Berkeley we see a lot of people in the life sciences making the transition off their laptops. Then we’re also seeing it on the other side. We just recently started hosting the computing for the Joint Genome Institute, and they’re also containerizing their work, as Trevor mentioned, so they can make use of multiple resources. It’s almost a type of high throughput computing, the way they’re using it, and it actually uses up a lot of computing resources now. They have a whole automated thing going.

Gregory Kurtzer:

Got it. Very cool. Griznog.

John Hanks:

I’d say less than 10% or 15%, and probably the vast majority of that containerization is driven by the container being the best way to install a specific application or the application being delivered as a container.

Gregory Kurtzer:

Got it. Dave.

Dave Godlove:

As you know, the NIH was also an early adopter of containers. I think it’s good to separate container users into consumers versus producers. Maybe 30% of users or more are either creating their own containers or downloading and running pre-built containers from Docker Hub or some other registry. The other thing, as others have mentioned, is that the NIH staff supports around a thousand applications, and growing. They’ve really started to install applications primarily as containers, and to do so in such a way that users don’t even realize they’re using containers when they use these applications. Maybe 70% or 80%, even more, of new applications being installed on the system by staff members are installed as containers underneath Lmod, still as part of the module system.
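The pattern Dave describes, where the module system hides the container, can be sketched as an Lmod modulefile; the tool name, version, and image path here are invented for illustration:

```lua
-- blast/2.13.0.lua — hypothetical Lmod modulefile hiding a container.
-- `module load blast` gives users a `blastn` command that actually
-- runs inside an Apptainer image; they never see the container.
help([[BLAST 2.13.0 (runs inside an Apptainer container)]])

local image = "/opt/containers/blast-2.13.0.sif"

set_shell_function("blastn",
    'apptainer exec ' .. image .. ' blastn "$@"',  -- bash form
    'apptainer exec ' .. image .. ' blastn $*')    -- csh form
```

Staff can swap the image for a new version without users changing anything beyond the module version they load.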

Adoption of Containers in EDA Space [00:40:59]

Gregory Kurtzer:

So it sounds like a similar use case to what Jonathon was saying as well. Patrick, how is the adoption of containers in the EDA space, just percentage-wise? 

Patrick Roberts:

Percentage-wise, there are only two major vendors that are adopting containers aggressively, and that’s only because they’ve been forced to by the industry. The EDA industry wants containers because the overlapping needs and requirements of the software are so painful. It’s hard to give an accurate percentage simply because what you’re going to be pulling in from a software vendor perspective can depend project to project, chip to chip, and process to process. That can change, but from the administration and architectural side, there’s a huge push toward containers. Again, for that same reason: EDA traditionally has to keep ancient Unix systems running for years and years, well past their expiration date, simply to have an environment that’s fully supported from the vendor perspective.

And if you can do that through the usage of a container, then that’s amazing, because then your actual compute environment can proceed and move forward with performance enhancements and security fixes and things like that. Then you have the container to hold the antiquated OS inside of it, which works great. The only problem is that container-to-container communication can sometimes be difficult, which I’m sure we’ve all experienced to some extent when you’ve got output from one tool needing to be consumed by another piece of software. How do you have a container that calls another container and then feeds back to that first container, especially when you have an interactive UI inside that first container? The only way to truly do that well is through using a job scheduler and network pipes or something along those lines, or stacking containers on top of each other with Singularity stacking.
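
[Editor’s note: the “network pipes” workaround Patrick mentions can be illustrated with a toy sketch. Two processes, standing in for two containers that share a bind-mounted host directory, hand data off through a named pipe (FIFO); the tool names and payload are hypothetical, and the example assumes a Unix host.]

```python
import os
import tempfile
import threading

# A FIFO on a shared (bind-mounted) path lets two otherwise isolated
# processes exchange data without either one calling the other directly.
fifo_path = os.path.join(tempfile.mkdtemp(), "tool_a_to_tool_b")
os.mkfifo(fifo_path)

def producer():
    # "Container A": write the result for the downstream tool.
    # Opening a FIFO for writing blocks until a reader attaches.
    with open(fifo_path, "w") as f:
        f.write("netlist-v1\n")

t = threading.Thread(target=producer)
t.start()

# "Container B": block until the producer writes, then consume it.
with open(fifo_path) as f:
    payload = f.read().strip()
t.join()
print(payload)  # → netlist-v1
```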

There are weird ways to get around it, not to dive too far into the nitty-gritty of the problems with adopting containers. But that is what it comes down to: from an architectural and administrative perspective, it’s difficult to adopt when you don’t control the entire code base. With homegrown applications, I think it’s fairly easy, and highly recommended, to adopt containerization, simply for repeatability, reusability, and infrastructure-as-code needs. But then, when you have to deal with vendor software, that’s when things get interesting, because then you have to push those vendors to make things work right inside a container. And that’s kind of a mixed bag depending upon the vendor.

Once the industry fully pushes hard toward that, we’ll see a change come over the software vendors. Over the last two to three years, that’s really started to happen in the EDA space. Some of the big names are starting to wake up and realize that we had better start doing this container thing. It also enables hybrid cloud, on-premises and off-premises: when you’ve got containers that work regardless of what the underlying operating system is or has, things become a lot easier.

OpenHPC Cluster [00:45:14]

Gregory Kurtzer:

You nailed a really important point: so much of this is dictated by software vendors. I’m hoping that over time, they will see that containerization is a very good mechanism for transporting their applications around and then supporting all of the interfaces they need underneath it. It can be used as a solution that enables bit rot, in a manner of speaking, but in a good way. Jeremy, what are you seeing on the OpenHPC side?

Jeremy Siadal:

OpenHPC has had container availability for a while in terms of its software stack, and Intel has made containers available for a number of applications, as well as its tools. If I were to take a wild guess at a number, I would say probably between 10 and 20% of applications are containerized right now. I’m also supporting a team building a large Warewulf 4 OpenHPC cluster, which will be 100% containerized. That is how they keep the compute nodes relatively clean: they expect people to come in with a containerized application, run it, and then clean up after themselves.

EFA on AWS [00:46:43]

Gregory Kurtzer:

I was hoping that containerization would be the magic pill to help enable turnkey HPC, but we still have a ways to go before we get there. We’re still going to be thinking about locally installed applications and how we are going to manage them; tools like Spack and EasyBuild, merging with things like environment modules and Lmod, are still critical tools that we’re going to need to be leveraging. I’m going to close, and I know that we still have like 15 minutes-ish, but before I close, there’s a great question, by the way, and I think we may even have a few more questions. Patrick, you mentioned hybridization with containers from on-prem to cloud. Here’s a specific question regarding EFA on AWS: “What would you guess to be the percentage of applications supporting Elastic Fabric Adapter (EFA) from AWS?” I’m not incredibly familiar with use cases running on EFA, specifically in AWS, and I don’t know if anyone else is more familiar than I am. I’ll call out Forrest because I think you’ve been working on some of this, but if not, tell me to be quiet and pass the mic.

Forrest Burt:

So for anyone not familiar, the Elastic Fabric Adapter is something that AWS provides to give RDMA capability to their compute nodes and the operations over them. You can get fast networking either way, but by my understanding, the EFA adds the remote direct memory access component and leads to a lot of the performance gains that you expect out of it. It’s not InfiniBand under the hood; it’s a little bit of a different setup that Amazon has put together. But by my understanding, it still presents to a system as if it were based on kind of an InfiniBand interface, with verbs and things like that. This is still something that I myself am working on here.

I would say tentatively that most applications that can utilize an InfiniBand interconnect would be able to use the EFA, because, on a programmatic level, I believe the way they’re accessing it is very similar. For example, I’m working with LAMMPS at the moment. All it took to get LAMMPS up and running within a container in an environment on AWS with EFA was installing the EFA software into the container and then building against it. And I’ve done something similar with GROMACS as well. The bigger consideration, I think, with regard to containers is making sure that this is the network stack that’s accounted for, making sure that you have a cloud version of the container with the EFA installed in it.

This touches on something I’ve started to find with existing container registries. Many existing containers are made for HPC, but these containers are built with just InfiniBand stacks in them. So there actually becomes a modification that you have to do to get the EFA running inside of some of these available prebuilt containers. And in a lot of cases, they’ve compiled their applications against, say, a different version of Open MPI than the Elastic Fabric Adapter will want to install and build against. A lot of the time, it becomes necessary to either modify these containers or, more reliably, rebuild the applications from the ground up in a configuration that expects they will end up being run in the cloud with the EFA. There are a lot of considerations there. In general, I would say most applications can support it; it’s just a matter of installing them in such a way that that’s what they will be expecting to use as their interconnect and network stack.
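
[Editor’s note: one practical way to check whether a node’s software stack accounts for the EFA is to probe the libfabric provider list before launching a job. This sketch assumes the libfabric `fi_info` utility, which ships with the EFA software, and simply reports False when the tool is absent; it is an illustration, not part of any vendor’s documented workflow.]

```python
import shutil
import subprocess

def efa_available() -> bool:
    """Return True if libfabric reports an EFA provider on this host.

    Relies on the `fi_info` tool from libfabric; if the tool is not
    installed (e.g. on a non-EFA machine), report False instead of failing.
    """
    if shutil.which("fi_info") is None:
        return False
    result = subprocess.run(
        ["fi_info", "-p", "efa"],  # query only the EFA provider
        capture_output=True, text=True,
    )
    return result.returncode == 0 and "efa" in result.stdout

print(efa_available())
```

A container entrypoint could run a check like this and fall back to TCP when no EFA provider is found.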

Gregory Kurtzer:

Does OFED support EFA?

Forrest Burt:

I’m not sure, actually. Okay. I’d have to look into that a little bit more.

Cost of HPC in Cloud vs. On-prem [00:51:12]

Gregory Kurtzer:

Another question: “Thoughts on costs of on-prem versus cloud HPC solutions?” There are two questions there, the first one and then the second one: “Will there be a point where it’ll be cheaper to do HPC in the cloud versus on-prem?” I want to take a stab at the second one, and then I’ll open it up to anyone who wants to jump in. In my role at Berkeley, and it’s funny with my previous manager sitting here on the panel with me, going back about 15 years ago, we started seeing a lot of requests from upper management to figure out: what is our cloud strategy, specifically for high performance computing? This has come up a number of times over the years, and I’ve heard many people describing that it continues to come up.

Sometimes it makes sense to do HPC in the cloud versus on-prem. Typically, in my experience with the models that I’ve seen and run, it mostly has to do with whether you’re running dedicated resources. If you’re running dedicated resources, it makes more sense to do that on-prem, but if you’re doing bursting or inconsistent high performance computing use cases, in many cases that makes more sense to do up in the cloud, because you don’t have the equipment purchases and the ongoing management of that equipment to maintain. You can run your HPC for one month, and then for the next 11 months you’re not paying anything. That’s my take on it. Are there any other points? Does anyone else have a counter-opinion to that?
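
[Editor’s note: the dedicated-versus-bursty distinction Greg draws can be made concrete with a toy break-even model. Every dollar figure below is invented for illustration; real modeling would also have to cover storage, data egress, staff, and facilities.]

```python
# Toy annualized cost comparison: cloud is pay-per-month-of-use, while
# on-prem amortizes capital expense plus fixed operating cost.
def annual_cost(months_used, cloud_per_month=30_000,
                onprem_capex=400_000, onprem_opex_per_year=60_000,
                lifetime_years=5):
    cloud = months_used * cloud_per_month
    onprem = onprem_capex / lifetime_years + onprem_opex_per_year
    return cloud, onprem

for months in (1, 6, 12):
    cloud, onprem = annual_cost(months)
    cheaper = "cloud" if cloud < onprem else "on-prem"
    print(f"{months:2d} months/year of use: "
          f"cloud ${cloud:,} vs on-prem ${onprem:,.0f} -> {cheaper}")
```

With these made-up numbers, one month of bursting per year favors the cloud, while sustained use quickly favors the on-prem system, which matches the intuition above.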

Jonathon Anderson:

I don’t have a counter-opinion, but we had a similar kind of top-down mandate to investigate whether we should be doing HPC in the cloud, and we did financial modeling for it. You could make an argument for it with compute, but it was always storage that made it completely untenable. The minute you started talking about having as much storage as we had available on-prem, for long-term capacity as well as, in particular, scratch, it just didn’t make sense. What I like to say is that the cloud is really good not because it scales up, though there are reasons for that, especially in business use cases, but because of how it scales down. You get the full benefit of a crazy production environment even if you’re only running a single-core job, and you don’t have to deploy any support infrastructure. Exactly as you said, bursting is a great use case for it, but the minute you start talking about high utilization, even on compute, it’s always going to cost less to deploy locally, as long as you have the staff for it and the floor space for it and everything that goes into it.

Data Is a Limiting Factor [00:54:13]

Gregory Kurtzer:

And I think you also touched on the first part of that question, too, which is the data. In many cases, the data will be the limiting factor for where you will run. You’re going to follow the data. Griznog, I think I accidentally cut you off there.

John Hanks:

I have a long list of biases against the cloud, but I’ll pick my absolute favorite one, the one that I think gets the most overlooked: every dollar spent in the cloud is a lost opportunity to have something locally that you can use later. That ties into the thought process behind the cloud. In our environment specifically, and I’m involved in a big migration back from the cloud right now, that’s super painful. The Biohub pays for, and does a good job of, recruiting the best scientists possible to do research. I’m involved in a big data migration back, where I realized, after listing all the files we have in Amazon, that I had just spent probably a thousand dollars just listing files. We have a lot of data up there, and since then, everything I do in this migration, move a file, bring a file back from Glacier, anything, has this back-of-my-head context of: how much money did I just spend?

When you’re trying to get the best researchers possible to do the best science possible, you don’t want them devoting their brain power to wondering how much money they just spent. You want them to focus on the science, and the minute you move into the cloud, because of the cost model, whether it was cheaper or more expensive doesn’t matter: you’ve now made that a thought process that everybody who uses your resource has to go through. It’s why I hate cost recovery in general for scientific clusters, for clusters supporting research. Some of it’s necessary, I get that, but in general, using the cloud means everything is done based on “how much money did I just spend?” Not “should I do this experiment?” but “can I afford to do this experiment?” And spending that money on-premises, you put the hardware in the closet, corner of the lab, whatever, you can run on it for the next ten years, and it’ll still do what it did when you bought it. And you don’t have to ask yourself how much you just spent on that experiment when you run it.

Gregory Kurtzer:

It’s interesting to look at it from the scientist’s perspective in terms of managing costs. Gary, I didn’t mean it in a bad way at all. Gary was the best boss to work for.

Gary Jung:

Oh, no, that’s fine. It’s fine. 

Gregory Kurtzer:

You never forced a cloud mandate on us, but we definitely had to deal with this together.

Gary Jung:

I just wanted to chime in, because it had been a while since we’d done that cost analysis, and I had to update it about six months ago. We have 20 years with the data center, so I know all the investments: chillers, cost of the maintenance people, and the hardware that’s gone in there. Just costing our workload in the cloud, if I were to give it to you in round numbers, including academic or institutional discounts, it’s roughly about three times as much to run in the cloud. Then we modeled it taking into account things like reserved instances and every kind of trick you can do to get the discounts by committing more, and then you can get down to maybe about two times as much.

There’s a paper out there from Fermilab; people can look for it. They did things like running when the spot prices dropped low enough, and you could see that the best they could do, even with preemptable jobs, was about one and a half times. That reconfirms what most people on this panel know. But it’s not to say that we don’t use the cloud, because our use of GCP has tripled in the last two years. There are a lot of easy-to-use tools that attract people, and it works well for those things. So I’m not against the cloud. All I’m doing is just stating the cost.

Running Gaussian in a Container [00:58:58]

Gregory Kurtzer:

Great question from Brent: “Does Gaussian run in a container yet?” I actually haven’t tried to run Gaussian in a container yet. Has anyone else? Brent, we’re going to have to get back to you on that one.

John Hanks:

I can’t see any reason why it wouldn’t run in the container. 

Jonathon Anderson:

My only concern would be if it’s licensed to run in a container.

Hybridization [00:59:22]

Patrick Roberts:

There you go. One thing I wanted to make a point of regarding the cloud push for HPC, and I think that Griznog kind of hit that point home about data movement, and others also brought up some really good points regarding this, is that hybridization matters. Being able to run your workload when you’ve got available hardware, available instances, and available licenses locally is great. But it’s those burst instances when you need, well, shoot, I really need an extra 10,000 cores. Well, what’s the infrastructure cost of that? What’s the supply chain on that? How long will it take me to get that? How much power do I have? How much cooling? How many racks do I have open? Do I have all of the logistics to get that additional 10,000 cores?

When those additional 10,000 cores aren’t going to be needed three months from now, that’s when it makes sense to use the cloud for that burst need. If you have the facilities for hybridization set up, whether it be block-level or on-demand block-level data transfer between on-prem and off-prem, that is key to fixing the data movement issue: essentially, local scratch, and then output is block-level transferred back to your environment on demand. That’s really what makes the cloud work for you in an HPC sense rather than against you, because you don’t use it for your day-to-day jobs. You scale your on-prem capacity to meet your day-to-day jobs, but then you use the cloud to meet your burst needs, your burst functionality. I think all of us have probably had that instance where a developer, or in my case a designer, or in many of your cases, I’m sure, a scientist, says, “Well, I have this thing, and it’s going to take this amount of resources, but it’s really important, and I have to get it next week.”

How do you get at that? How do you take care of that while not bursting your budget for the year on hardware and then having hardware sit idle? And that’s the worst: when you’ve got a beautiful infrastructure that you’ve built that’s just amazing, and then it’s sitting idle. Or licenses that you’re paying millions of dollars a year for, that your designers or developers aren’t utilizing, but that you’re indebted and contractually obligated to those vendors to hold, because if you give them up, getting those licenses back in the next fiscal year is so painful. Burst matters a lot, but then the downside is: how do you get vendors to license that intelligently to you so that you can do it without breaking your bank? Again, those additional 10,000 cores, you can easily spin them up in the cloud, but then how do you pay for the licenses to run the software on those 10,000 cores? When you’re on a six-month or a 12-month license cycle with a software vendor, that’s one of the other things to bear in mind: how do you get the vendors to work with you on burst capacities?

Leveraging Cloud Resources [01:03:18]

Gregory Kurtzer:

Great points. A comment from Stanford: having a researcher say, “‘I’m too scared to ask a question about my data cause it will cost too much’ is not a good way to do research, maybe.” Griznog, this is a hundred percent exactly what your point was right there. I completely agree: not a great way to do research and to encourage research. But to Patrick’s point and Gary’s point, I think there are good ways that we can be thinking about how we leverage cloud resources for high performance computing, and that’s part of what we learned over the last hour. Well, first, we learned I’m not a great panel moderator, because we’re over time, and we didn’t even get to a quarter of the questions that I wanted to get to. So, we learned a lot of other stuff as well.

This is a bigger topic than just one panel discussion, so what we’re going to do is extend this. This will become a multi-part series, and we’re going to work with everybody on this panel and see if we can have a few more follow-ups on this topic. We got to what problems we are trying to solve; we still have a few more aspects, so let’s continue this discussion, maybe next week, or perhaps two weeks from now, whenever everyone’s schedules align. And I’m going to channel Zane here: like and subscribe, follow, keep track of what we’re doing, so you are notified of the updates for this. Zane will hopefully be back soon, because he’s much better at this than I am. Until then, I want to thank all my panelists and everybody for watching this huge amount of awesome information we went through. I’m expecting lots more to come with this group, so thank you very much, everybody. We’ll see you next time. Bye.