Research Computing Roundtable – Turnkey HPC: Software Stack

Webinar Synopsis:

Speakers:

  • Zane Hamilton, Vice President of Sales Engineering, CIQ
  • Glen Otero, VP of Scientific Computing, CIQ
  • Jonathon Anderson, Sr. HPC Systems Engineer, CIQ
  • David Godlove, Solutions Architect, CIQ
  • John Hanks, HPC Systems Administrator, CZ Biohub
  • Jeremy Siadal, HPC Systems Engineer, Intel & OpenHPC
  • Alan Sill, Managing Director, High Performance Computing Center at TTU
  • Chris Stackpole, Turnkey Solutions Integrator, Advanced Clustering Technologies

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Full Webinar Transcript:

Zane Hamilton:

Good morning, good afternoon, and good evening wherever you are. We welcome you back to another round table with CIQ. Today we are going to be talking about software and the HPC environment. We have a great panel of people with us today. Go ahead and let them join. Excellent. We will probably have one more here in a minute. Maybe. Maybe not. I’m going to go ahead and start and have everybody introduce yourselves. I know I think everybody has been here before, but let’s start with Jeremy. If you would introduce yourself again. Oh. Can’t hear you, Jeremy.

Jeremy Siadal:

There we go. Can you hear me now?

Zane Hamilton:

There you are. Yeah, absolutely.

Jeremy Siadal:

Okay, great. Jeremy Siadal with Intel Corporation. I also represent the OpenHPC project. I’ve been an HPC systems software engineer for going on twenty-some-odd years now.

Zane Hamilton:

Excellent. Thank you. Chris?

Chris Stackpole:

I’m Chris Stackpole with Advanced Clustering Technologies. We are a turnkey solutions integrator for HPC systems.

Zane Hamilton:

Thank you, Chris. Dave?

Dave Godlove:

Hey everybody. I’m Dave Godlove. I am a solutions architect here at CIQ. I’ve got a background at the NIH working as a scientist and also as a support scientist.

Zane Hamilton:

That’s excellent. Thank you. Alan, welcome back.

Alan Sill:

Thanks. Alan Sill, managing director at the High Performance Computing Center at Texas Tech University. One of a group of co-directors of the National Science Foundation’s Cloud and Autonomic Computing Industry University Cooperative Research Center. My background is in particle physics. I just got too close to the computing flame.

Zane Hamilton:

Thank you. Jonathon?

Jonathon Anderson:

Yeah. My name is Jonathon Anderson. I am also a solutions architect here with CIQ. I have a background in HPC systems admin and Systems engineering. Mostly at central services organizations.

Zane Hamilton:

Thank you, Jonathon. John? Griznog?

John Hanks:

John Hanks. HPC sysadmin, mainly in the life sciences. Mostly here to tell the kids to get off my lawn.

Zane Hamilton:

Appreciate it, John. Let’s go ahead and dive in. I think there are a couple different ways you can look at software and the HPC environment. Ah, we will let Glen introduce himself. I think everybody has seen Glen before, but since he slides in late, we are going to make him sing a song or do something fun. I’m kidding. Glen?

Glen Otero:

I was going to introduce myself with a performance dance, but you probably don’t want that. Glen Otero, recovering scientist in the life sciences and bioinformatics, scientific computing director at CIQ for genomics, AI, and ML. I am a borderline boomer, so you can all just get off my lawn.

How is Software Changing? [2:56]

Zane Hamilton:

Appreciate it, Glen. All right. Back to what we started before Glen interrupted this and came late. There are a lot of different ways you can look at HPC software. From a systems environment, from a system admin perspective, and from a user perspective. I think it’s interesting the way that things are progressing. I know Jonathon and I talked about this quite a bit. The way containerization might be helping, changing, or making it more complicated. That’s kind of what I wanted to talk to you guys about today is how do you see that landscape? What does that look like? We have been doing the same thing for a long time, but things are definitely changing. You guys have different experiences than all of us. I’m curious to hear from each of you, like Jeremy. From a software perspective, what are you seeing today?

Jeremy Siadal:

Well, actually you brought up a really interesting point on containerization, because one of the systems I’m supporting right now is built entirely around containerization. The expectation is that the users will provide their own containers with their own jobs. They package up their own software, they bring it, and what they get is a set of nodes that are really absolute bare-minimum installations. They are expected to run their containers on top of that, and then the nodes are completely wiped. Of course, they are running with Docker containers, so in order to keep the security intact it’s not only just wiped: you wipe the systems and reboot them.

Zane Hamilton:

That’s very interesting. Thank you. Chris, what are you seeing today when it comes to software?

Chris Stackpole:

I have an interesting perspective on it, because a lot of our customers are looking at the life cycle of their cluster, and the research they are doing matters now. They don’t have a whole lot of long-term stuff. It’s built for the cluster they have right now, they will run it for that life cycle, and then for the next cluster they build a new system. But I’ve also worked a lot with researchers writing papers that may take 10 years before the life cycle of the paper really starts to be challenged and reviewed. In that case, I’ve been the admin who’s been asked to install applications and software 10-15 years old on a modern HPC system so they could verify the results they ran back then, because those results are now being challenged. Can they get the same results from that paper? That is a very hard problem to tackle when you’re being asked to pull in libraries that haven’t been supported since el4 or el5. A lot of the containerization focus that I’ve done in the past has been with some of those long-life-cycle applications, with the hope that 10 years from now, when those same researchers are being challenged on the papers they just wrote this month, it’s going to be a whole lot easier for us to verify those results.

Legacy Libraries [6:09]

Zane Hamilton:

Sure. Before I move to Alan and John. I want to ask you, Chris. Whenever that type of thing is taking place today, is it really your responsibility to maintain and keep those legacy libraries around? Or are you pushing that back on the researcher and the end user? How does that go?

Chris Stackpole:

Well, the number of times I’ve thought, “man, past me was such a jerk. Why didn’t he do this better?” is pretty high. So a lot of it is: how do I make my job easier in the future? That is definitely a motivator. Working with a lot of these researchers, we are seeing the long tail of HPC growing longer and thicker. People who 10-15 years ago didn’t even know how to spell HPC are now being asked to run their jobs on HPC, because that’s the only place big enough for their data. They aren’t computer scientists. They don’t really have an interest in learning HPC. What they are good at, though, is their job and their specialty.

They know their code, and they know their research really well. From an admin point of view, the question is: how do I help them so that when they have an issue 10-plus years from now and I have to revisit this code, it can still run? Can I help them, or anybody who’s trying to verify their code, ensure that it’s actually working? One real example we had was researchers being challenged on some of their papers because they were very ahead of their time when they wrote them years ago. I was trying to rebuild library sets from el5, and we were actually finding some variation in results depending on whether we ran in a 32-bit VM or a 64-bit VM. The libraries did not behave well because of the state of the libraries at the time.

Trying to get the exact same version of the library to get exactly the same results was very challenging. So the hope is that I don’t really care what it looks like in 10 years, but if it’s packed up inside an image-type container, there is a high likelihood that I will be able to better match that library set for them. It took me weeks and weeks to recompile these old library sets that haven’t been supported in so long and to get them to match exactly right. For the end-user researcher, that’s too big of an ask. And we hear this all the time, especially over the last two years: the reproducibility of papers and science. Can we actually reproduce these papers? It keeps coming up, because somebody makes all these crazy assumptions and then nobody else can reproduce the results. Yet that influences the media, that influences decisions. We’ve seen a lot of cases where rushed or partial research led to bad decisions that impacted a lot of people. We want to make sure that when we are doing reproducible science with reproducible data, it’s as reproducible as actually possible. A container really helps in those environments for those researchers.
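As a concrete sketch of the approach Chris describes, freezing an old userspace in a container so it can be rerun years later: the following hypothetical Apptainer definition pins an old base image and builds the paper’s code against the libraries of that era. The base image, package names, tarball, and paths are all illustrative, not from the talk.

```shell
# Hypothetical recipe: freeze an el7-era environment in a single image file.
cat > legacy-paper.def <<'EOF'
Bootstrap: docker
From: centos:7

%files
    analysis-1.0.tar.gz /opt/

%post
    yum install -y gcc gcc-gfortran make
    # Build the paper's analysis code against the libraries of that era.
    cd /opt && tar xf analysis-1.0.tar.gz && cd analysis-1.0
    ./configure --prefix=/usr/local && make && make install

%runscript
    exec /usr/local/bin/analysis "$@"
EOF

# Build once; archive the resulting .sif alongside the paper's data.
apptainer build legacy-paper.sif legacy-paper.def
```

The point is that the resulting image is one archivable file: even after the base distribution’s repositories disappear, the already-built image still runs.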

Containerization and Software [9:49]

Zane Hamilton:

That’s great. Thank you for that, Chris. Alan, I know we have talked about containerization, some. We have talked about software, some. So give us your view.

Alan Sill:

Yeah, so I’ll make a food analogy. When I travel to Italy, I’m always shocked how quickly Italy totally reprograms you as to what you think food and drink should be. Inside of a week you’re discussing the fine points of Ligurian versus Tuscan olive oil and think that coffee should come in those little cups. When you start to use containers, you start to think: why isn’t everything in a container? Let’s tick off very quickly some of the reasons, some of which have been mentioned. Reproducibility, though let’s come back to that. You can even get a DOI for your image. You can store an image like a single file; it’s storable, as opposed to a long list of build instructions.

It’s portable: you can run on different machines with the same image. You can run images with different operating system versions than your host operating system, as long as they share certain kernel characteristics. Keep in mind that the fundamental design concept of a container is unsharing shared namespaces and memory, right? It’s an unsharing technology. It lets you isolate yourself from some details as long as the kernel is compatible. We can come back to that and talk about MPI, for example. You can bring your own software; you can build it. You look at all these advantages and you think, “why isn’t everything in containers?” Maybe CIQ has that in mind, but surprisingly, not all food is Italian and not all software is in containers. Despite some obvious advantages, there are things holding people back, and it’s worth examining what those are. I won’t make the full list, but I’ll just hint that vendor-specific software often doesn’t support installation in containers, and so forth. I’ll stop there.

Zane Hamilton:

Absolutely. Thank you. John, I think you’ve probably got an interesting perspective on this as well. Managing some large and dispersed environments.

John Hanks:

Yeah. My current environment has been completely different from all my environments in the past. In the past I’ve had modules set up with 1,600-plus packages in a module tree to support environments. But I’ve been running this cluster for about a year, and we have had almost no software requests whatsoever. Pretty much everything people have done, they’ve brought their own. It’s been a mix of containers and software installs, mostly Anaconda environments. The extent to which all software is going into Anaconda and containers has flabbergasted me; I never would have expected it at the beginning. Because my view of containers is: you tell me you swam across the English Channel. Well, fantastic, you swam the English Channel. Then somebody else comes along and says, “I got on a boat, got in a lap pool, and swam laps while the boat drove across the English Channel.”

Well, good for you, but that seems like a lot of extra work. You could have just swum the English Channel, right? That’s been my historical view. But from my perspective, being an admin on the software side is getting easier, because more people seem to be self-supporting than they used to be. On the occasions when I do have to get involved with installing software, though, software quality overall seems to be going down. Things are really complex. There’s a great xkcd cartoon where he’s basically got all of civilization stacked up, and down at the bottom there’s a little tiny piece that says, “a PyPI package that hasn’t been maintained in 12 years,” or something. All this stuff is getting jammed into containers today on the idea that things are going to be reproducible, but I don’t think people are admitting that time actually flows forward and things change. You’re containerizing bugs, security holes, and all this other stuff. Is it really going to be valuable? To the extent that doing it today gets you a paycheck, great, but I don’t really see people dragging these things out in 10 years to try to reproduce something. And even if they did, it’s not scientific reproducibility. It’s digital reproducibility, and to me that’s a pointless exercise. I don’t really care about digital reproducibility.

Zane Hamilton:

Interesting. Thank you. Glen, I’ll go to you next. You’ve seen some stuff.

Glen Otero:

Yeah, it’s been interesting, actually, just thinking about it. I’ve been of the mind that we need to use containers for all the reasons people have just stated, right? The reproducibility aspect, the fossilization of a moment in time that you can go back and reproduce. But seeing containers used by the bioinformaticians, it’s actually the reverse. They’re trying to live in the future. They’re always downloading the latest and greatest bioinformatics app, or the latest and greatest release, to rerun their pipeline: find one more variant, validate the research a little bit more, make sure they’re on the right track. They’re trying to pull down and run something that someone built on Fedora while they’re running on Rocky, that kind of thing.

So they’re using it as a time machine to go forward on a production cluster, but without much thought of, “oh yeah, I should keep this container around and give it a DOI, because this could be there for reproducibility for other people.” I don’t think they really see past their own lab, their own work. They just try to pull down the latest and greatest app and get it to run where it normally wouldn’t: on their laptop or on the clusters they have access to. One of the reasons I think cloud has also caught on a lot there is being able to run all this latest, greatest stuff. Again, they don’t think about security, don’t think about what’s going to be supported in a couple of years. Ditto for all the Anaconda environments, right? You had better keep that YAML file around, but good luck rebuilding that environment in even a year. So it seems to be a bit of a two-edged sword. I think it’s necessary; I don’t think it’s sufficient.

Zane Hamilton:

John, did you have something you wanted to add?

John Hanks:

Yeah, Glen actually reminded me. I left out my absolute favorite aspect of containers. To me, the biggest positive of containers is that software developers will often refuse to write decent installation instructions, but if they containerize their software, by definition they have written installation instructions, which I can steal and use to install the stuff outside of a container. From that perspective, I do like containers, because they motivate people to write decent install instructions that will actually work.
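John’s point is easy to see in any container recipe: the build steps are, in effect, the install documentation. A hypothetical example (the package list, repository URL, and build commands are illustrative placeholders, not a real project):

```shell
# A Dockerfile doubles as install instructions you can follow by hand
# outside any container. Everything below is illustrative.
cat > Dockerfile <<'EOF'
FROM ubuntu:22.04
# The dependency list is spelled out, whether the author meant to or not.
RUN apt-get update && apt-get install -y \
        build-essential cmake git libhdf5-dev
# So are the exact build steps.
RUN git clone https://example.org/some-tool.git /opt/some-tool \
 && cmake -S /opt/some-tool -B /opt/some-tool/build \
 && cmake --build /opt/some-tool/build \
 && cmake --install /opt/some-tool/build
EOF
```

Reading the RUN lines tells you which packages and build steps the software needs, container or not.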

Zane Hamilton:

That’s true. Thank you. Dave, Mr. Container around here. I know you’ve been spending a lot of time working on containers lately and trying to containerize some stuff. I mean, we have been through this before and from a researcher perspective as well. What are your thoughts on all of this?

Dave Godlove:

Yeah, there are so many interesting thoughts that everybody just had to riff on. I’m taking notes here, like, “oh, I want to get back to this and I want to get back to that.” I don’t really know where to start, but I guess the two main threads I’m hearing are reproducibility, which we got to later, and, a little earlier, helping users: empowering users and allowing them to install their own software. I’ll comment a little bit on that. Coming from the NIH, we, the team there, had this really great attitude about the users, about the scientists: these scientists are brilliant and they know so much about their domains. There is so much to know and so much work to do that they can’t possibly be expected, on top of all that, to become computer scientists too.

So it is our job to take these biologists and these folks who are deep into all these different sciences, who know tons and tons of stuff that we will never know, and help them with the stuff that they might not know and shouldn’t have to. So I wonder, are containers helping with that? They can help and they can hurt, right? Containers might make scientists’ jobs harder down the road, because we might say, “okay, here’s a cluster with a container runtime, have at it,” right? Some scientists might be fine with that and fine with building all their own containers, but it’s offloading a lot of the sysadmin work, or a lot of the application support work, onto the scientists.

Is that okay to do? Is it not okay? I guess it depends on the situation. Another thing, too, is that when containers first hit HPC, we thought that that’s what they were going to be: all scientists were going to end up building their own containers, using package managers and such inside the containers to build all their own software. That’s not how it has turned out at all. How it has turned out is that other folks, third parties, build containers and put them up on Docker Hub, and end users just download containers. They don’t actually build them. Maybe in that respect it actually is helping the end users quite a bit. Just a lot of interesting things to talk and think about. Then, on the reproducibility side of things, one thing to note for scientific reproducibility, distribution, collaboration, and things of that nature: often another lab doesn’t want to take somebody’s workflow and data, press a button, and see the same figures pop out.

It might be reassuring to some extent, but what they want to do is take their own data, maybe get the code base, modify it a little bit, and push it forward, right? They want to be able to look at this stuff, edit the code, use their own data, and do things like that. Do containers help with that? They might help from the perspective of being able to push the button, run the workflow, and get the result. But then after you’re done, you’re like, “okay, now I want to crack it open and see where the code is, what’s going on, and how I modify it.” If you’re not a container expert or a file system expert, it can be a pain to get inside that container, start digging through it, and figure out where all the bits are and what makes it run. Just random thoughts based on everything we’re talking about. I’ve got a million more, but I’m going to cut it short in the interest of letting other people talk.
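For the “crack it open” problem Dave raises, Apptainer does provide a few entry points. A sketch, assuming the image is named `workflow.sif` (a placeholder):

```shell
# Show the build recipe embedded in the image, if one was recorded.
apptainer inspect --deffile workflow.sif

# Get an interactive shell inside the image to poke around.
apptainer shell workflow.sif

# Or unpack the image into a writable directory tree you can browse
# and edit with ordinary tools.
apptainer build --sandbox workflow_dir/ workflow.sif
ls workflow_dir/opt   # hunt for where the code actually lives
```

None of this is as approachable as browsing a source tree, which is Dave’s point, but it is possible without being a file system expert.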

Zane Hamilton:

That’s interesting, Dave. I hadn’t thought about having to go back and actually change the code. Once you have to crack open a container. That is an interesting point. Thank you, Jonathon. I saved you for last.

Jonathon Anderson:

Well actually I see Chris raising his hand. I would like to hear.

Zane Hamilton:

Oh, sorry Chris.

Chris Stackpole:

No worries. Just on that point: something I’ve witnessed a lot with researchers is that when they are trying to reproduce a result, they do want that first round to be, “can I reproduce it exactly as it was published?” Because there’s a huge difference between attacking a paper because the data or assumptions are wrong versus because the methodology was wrong. If I can show that the methodology with that data matches correctly, then all of a sudden I can say, “okay, your ideas and concepts, these are what we need to actually focus on. What if you had looked at this problem from this view instead?” That makes a huge difference compared to just saying, “well, I got different results, and therefore your whole paper is worthless,” which actually does happen quite a bit when people can’t reproduce the results. Being able to do that initial click of a button, download it, and get the exact same results means that I’m now at least mimicking exactly what the paper’s author had, and I can reproduce their results. Now I can go ahead and critique what it is about the paper that actually matters. How do I adjust this? How do I tweak this? Can I look at this problem from a different view and get a different result?

Zane Hamilton:

That’s a good point, Chris. Thank you. Okay, Jonathon, we have talked about this quite a bit lately. I know you have some opinions on this and on why containerization is good.

Jonathon Anderson:

Yeah, so I compare this moment in HPC and scientific software containerization, and this isn’t necessarily my original thought, to the advent of data center automation and infrastructure as a service. We saw this with the large cloud providers: AWS, Google, Azure, and all the others. Virtualization is an interesting technology, but the game changer is that it takes something that historically only a subset of people could do, the sysadmins in your organization who had rights to the data center or who knew how to do it, and turns it into something that anyone can do. Not that everyone does, but it simplifies the workflow of setting up a server to the point where sysadmins, if they are in a virtualized environment, are doing things very similarly to any random person who might want to spin up an instance in an infrastructure-as-a-service environment.

I see containers the same way in this conversation. I’ve been doing a lot of work figuring out the best way to containerize MPI applications, for example. My thought process is not “now the sysadmins or the support people will never install another piece of software again,” but “now anyone can.” Someone mentioned earlier that it hasn’t been the case that all the scientists build their own containers; third parties do, because anyone can. That doesn’t just mean your users. It means you can run containers from anywhere, and now you can draw value not just from the people running your cluster, not just from the people who work on your cluster, but from anyone who’s building any container of a certain type. You can pull it down and run it.

One of the things I highlight in some of my MPI container demos is that there is nothing special about the environment I’m running the container in. I can just pull it down, like I pushed it up to Docker Hub, because why not? I can pull it back down and run it, and as long as I had Apptainer on a different cluster, I could run it there too. There’s no dependency. I totally agree, by the way, with John’s point that a big value is the containerization of the install instructions, the self-documenting of how a container works. That breaks my brain a little, because I’m a systems guy and I really want things to be declarative and very well specified. The fact that containers are really a kind of fancy shell script dumped in a file bothers me a little bit.

But when I get over myself, I realize that’s how most people share sysadmin knowledge when they are not sysadmins anyway. So it flattens that landscape and lets everyone talk the same language: here’s just the raw stuff that it needs to do, and then you package that up, do it once, and you can hand it around. I think that’s really cool, and I’m looking forward to more of it. Not that I think all of the users on all of the clusters will always containerize all of their software, but since anyone can, it only has to be done once for anyone to be able to use it and get value out of it.
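The portability Jonathon describes, in the common hybrid model where the host’s MPI launcher starts ranks that each execute inside the container, might look like the following sketch. The image name, rank count, and application path are placeholders, and the host and container MPI implementations must be ABI-compatible for this to work:

```shell
# Pull a public image from Docker Hub; Apptainer converts it to a .sif.
apptainer pull mpi-app.sif docker://someuser/mpi-app:latest

# Hybrid model: the host mpirun launches the processes, and each rank
# runs inside the container. This is what makes the same image usable
# on any cluster that has Apptainer and a compatible MPI.
mpirun -np 64 apptainer exec mpi-app.sif /opt/app/bin/solver input.dat
```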

Alan Sill:

Before we go on to the next topic, Zane, I wanted to highlight a couple of things that have been mentioned. There are rocks under the stream if you’re kayaking, right? The MPI situation is getting better, but in the Monty Python “she turned me into a newt... I got better” sense: it’s still a problem. The MPI standard, across the whole range of implementations, is trying to converge toward interchangeable MPI, but we are not there yet everywhere. As I said before, for anything that makes kernel calls, you have to ensure compatibility between the kernel and the containerized version. If you’ve had better luck moving Apptainer containers around, I want to hear more about that. Maybe you guys can post about that.

Through this thread there’s been a running line about reproducibility, and I put into the private chat a comment that I’ll just say publicly: friends of mine who study reproducibility say that it’s far inferior to transparency. As Jonathon said, the great thing about a container is that the recipe is right there in front of you; the transparency is there. For example, there have been a lot of security concerns around software bills of materials, SBOMs, right? Sort of a bad acronym, but a software bill of materials is supposed to ensure that if you have a security problem in a software supply chain, you can figure out what depends on it. But usually that concentrates entirely on the built software and not on the operating system dependencies hidden underneath, like glibc and so forth, which the software is equally dependent on but which won’t show up in an SBOM. A container in principle can capture that, but you have to come back to that rubric: are you getting transparency, or are you getting researchers downloading black boxes because they say they have a piece of software in them? If you can do what Jonathon said, great, because you have both reproducibility and transparency. But if people are just treating this as a black box, you’re carrying security problems along with you, and you haven’t won.

Simple Route to Containers Via SPACK [29:36]

Zane Hamilton:

Thank you for that, Alan. We do have a question; I think we are going to pop it up. I think Greg responded to it, but Martin asked, “can a researcher get a simple route to containers via something like SPACK?” Jonathon, we have talked about this as well, and Greg has answered yes. I think this leads into the next question I have for you guys, but I’ll let you answer this one first. Jonathon, since we have talked about this recently, I’ll let you go.

Jonathon Anderson:

Yeah. I’ve experimented with this a little bit. Absolutely, you can start from a base container and, as part of your build, install SPACK in it and build software with it. I have definitely done that. One of the downsides is that you lose a lot of the caching benefit of SPACK. Building software takes a long time sometimes, especially these scientific applications, and part of building containers is often rebuilding them over and over again. If you’re throwing away your SPACK build cache every time you rebuild a container, you’re losing a lot of the benefit you would get out of something like SPACK. More recently I have personally been using the OpenHPC project for this instead, using the software that’s already compiled there rather than building software unless I absolutely need to.

I also, as an experiment, nothing production has come out of this yet, took advantage of Apptainer’s support for sandbox containers, where you can have a mutable directory instead of a single zipped-up image, and used that as a base for my SPACK builds, then built sub-containers off of that. So you can, but if the point was to simplify things, you’re off in the weeds if you’re doing something as bespoke as that. My general recommendation would just be to use what people have compiled already.
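The sandbox approach Jonathon describes could be sketched roughly as follows. The base image, package choice, and paths are illustrative; the idea is that the writable directory preserves SPACK’s install tree between builds, instead of discarding it on every image rebuild:

```shell
# Sandbox base: a writable directory tree instead of a single .sif,
# so SPACK's installs survive from one build iteration to the next.
apptainer build --sandbox spack-base/ docker://rockylinux:8

# Install build tools and SPACK inside the sandbox (root needed for dnf).
sudo apptainer exec --writable spack-base/ \
    dnf install -y git gcc gcc-c++ gcc-gfortran make python3
sudo apptainer exec --writable spack-base/ \
    git clone https://github.com/spack/spack.git /opt/spack

# Build software with SPACK inside the sandbox (zlib as a toy example).
sudo apptainer exec --writable spack-base/ /opt/spack/bin/spack install zlib

# When satisfied, derive a distributable immutable image from the sandbox.
apptainer build my-app.sif spack-base/
```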

Alan Sill:

Let me push back on that just a little, because we use SPACK a lot here. I don’t know if people have noticed this, but the Exascale Computing Project has been publishing a single container, built automatically through CI/CD processes, regularly. I don’t know if it’s nightly, or what frequency, or if it’s on demand, but it contains all their software. They have this thing called E4S, the Extreme-Scale Scientific Software Stack. You can actually go there, hit the download button, and download a whole container. I’ve done this. The software, some of which is extremely complex and difficult to install yourself, will run out of the box on your machine. Okay, what are the downsides? Well, it’s a 50-gigabyte container.

Beyond that, there is a very strong qualification process for any software project to get into the E4S stack. I would say three-quarters of the software that our researchers depend on daily here isn’t in that stack; this is software that’s been qualified to run on exascale machines. Let me wind back just a little and say it’s an interesting idea to build such a container simply for versioning and testing, as you just hinted. Suppose we are going to change the SPACK version or move to a different operating system. It would be great to have a CI/CD process that built all of our apps in a container we could use for testing, whether or not we actually intend to download that one single container to all of our worker nodes for any given revision.

So I’ll just summarize by saying that one of the big potential advantages of containers, especially if we can use the sandbox-style ones, would be for the difficult job of version testing. If we have a certain set of built software that we support on our cluster, it would be nice to get a head start on the next operating system release, the next SPACK version, the next MPI, whatever. You can do this with single pieces of software, but even with SPACK it’s harder to do the whole thing.
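For reference, E4S images are published to Docker Hub by the project, so trying one out is a short command sequence. The exact image tag varies by release and base OS, so the tag below is illustrative, and as Alan notes the download is tens of gigabytes:

```shell
# Pull a published E4S image (tag is a placeholder; check the project's
# registry for current release tags) and explore the prebuilt stack.
apptainer pull e4s.sif docker://ecpe4s/e4s:latest
apptainer shell e4s.sif
```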

Why is Deploying Cluster Software Complicated? [34:02]

Zane Hamilton:

Thank you, Alan. I’ll come back to a point you made in just a second, but we have a question I want to ask real quick. After 30 years of Beowulf-style HPC, why is deploying cluster software still complicated? Or is it really just the management over time and image drift that’s complicated? Anybody want to jump in first?

John Hanks:

Oh, I’ll offer my opinion on that. I don’t necessarily think it is complicated. All of my software stacks over time have always been just: I built it and installed it, including all the complex stuff back when I did real HPC back in the day. It wasn’t complicated to do, but it required a lot of patience and knowledge that I learned by getting smacked in the face over and over again by things like Fortran linking conventions: one underscore, two underscores, whatever you’re going to do to get things to link. For somebody who climbs the learning curve, software is not necessarily hard to install, other than to the extent developers do a really bad job of documenting it and giving you the information you need; you have to reverse-engineer a lot of it. As for making that simpler with SPACK, EasyBuild, and such, I’ve avoided those, because every time I’ve looked into them, I’ve wound up feeling like they were hampering me from installing software as fast as I could on my own.

The extra layer. If something will compile with configure, make, make install, I don’t need EasyBuild and Spack. If something is so complicated that every release changes the build options, then again I don’t need EasyBuild and Spack, because the recipe has to be fixed there every time too, right? I don’t need those. But as a general overview of this kind of thing, tools that attempt to take the complexity out of it, I think, give people a false impression that they know what they are doing. If you give a doctor a scalpel, that makes them a surgeon; but if you give some random person off the street a scalpel, that makes them a butcher. Maybe not everybody should be installing software. I’ll just throw that out there as a possible option for the way the world should work.

Zane Hamilton:

That’s a great point, John. Thank you. Jeremy, you’re nodding. I’ll pick on you.

Jeremy Siadal:

Yeah, I was going to say, having installed cluster software for the last 20-some-odd years, I don’t think it’s more difficult. In fact, I think it’s a lot less difficult now, and it’s not just experience; there has been a lot of work put into simplifying a lot of the tasks. Where I still stumble continuously is the combination of fabric software, MPI, and the application all working together. I think, as John pointed out, a lot of that is software quality. Are the software developers following the standards set forth? Are they making those standards transparent? Is their software fully tested before they actually put it out? Because I certainly encounter that a lot. HPC is still very much an open source software project.

I represent a very large open source software project, and I certainly feel a lot of software goes out before it’s ready. That, I think, is where a lot of the issues come in. But certainly, having worked on the Warewulf project, with that in place I think it’s very easy to install software across multiple systems, because that’s really what it is: instead of installing software on one system, you just need to install software across a variety of systems. Then, obviously, getting it all working is difficult if you’re talking about multiple pieces distributed by different entities and software developers.

Zane Hamilton:

Sure. Makes sense. Thank you. Yeah, Chris. Absolutely.

Chris Stackpole:

I have mixed opinions on this. I think in a lot of regards it is a lot easier these days than when I first started, because there are tools such as Spack to help through some of the really complicated builds. There are also much better communities, so even when the software itself is not as simple as just doing configure, make, make install, the communities provide a lot better documentation. At the same time, there are also a lot more challenges, especially if you’re not dealing with the latest and greatest. For example, I was recently trying to help a researcher confirm whether his code was still functional. He had dug it up, and the last time it was compiled was with the 2016 compilers. You can’t find those. Spack lists them, but Spack fails, because when Intel released oneAPI, they killed all the download links for the older versions.

You can go pester Intel, which is what we had to have them do. Then you’re in a support loop, because none of the first-level support people have any clue where to find a download for the 2016 compilers. You’re just hosed. A lot of that complexity is still really frustrating. There’s also an aspect of: if you are using some pre-compiled stuff, it’s fairly easy. Say you want to spin up Kubernetes on a cluster for some work. Well, great. You look at the docs and they say, “oh, grab K3s or Minikube,” and you’re like, “well no, I actually need something more than that.” If you’ve tried building Kubernetes, it’s a pain. There are so many pieces. Part of that is because it’s a complex piece of software, and I think a lot of HPC tends to get that way.

You do the configure, make, make install, and it’s great if you’re running it on a single node. When you want it to scale across multiple nodes, working with InfiniBand and a CUDA driver, all of a sudden the complexity of the build goes way up. So when the community has figured out some of these build scripts and integrated them into Spack, it makes things a lot easier. I think that while the target audience may in principle be the guy off the street with a scalpel, anybody can do it, I have a feeling that the vast majority of the time they are giving it to a doctor to help with it. It’s still the admin who needs to understand how it’s compiled, where it’s compiled, and what options are best to pass to it. Spack has a lot of variables that you can pass to optimize, and a lot of that you need to know about your system in order to optimize and build it correctly. It’s just a helper to get through some of the really complex stuff, and that’s where the community really has stepped up. I think this is one of the things that makes compiling and dealing with software these days so much easier than it was before: there’s a bigger community that is more eager to help.

Zane Hamilton:

That’s great, Chris. Thank you. Jonathon, I’ll let you have the last part on this and then I kind of want to go back to something Alan mentioned earlier.

Jonathon Anderson:

Yeah, I think one of the things that’s driving improvement in this area is the broadening of the audience for HPC software. Some of that is coming from the supply side, where the tooling is getting better and the software is getting better, so the software is applicable to more people. But the other is that applications we wouldn’t historically have considered HPC applications and use cases are becoming HPC use cases, so we’re getting more eyes on it. Open software is antifragile, in the sense that it improves the more people experience the pain that is in it. Scientific software has been in a bad niche for a while, where a research scientist or a computer scientist got something working in their environment and they are like, “great, it works.”

They publish it, and now it’s on the staff at all the different HPC centers to get it to work in their environment. The only people feeling that pain are the initial developer and then the people at the sites. Containerization, among other things, drives the need to be able to show how it worked, how you got it to work, and to make someone else able to get it to work reliably. When other people are using this software, they are having to do it too. The more people use the software, the better the installation processes are going to get, and the easier the management of it is going to get. I think that’s going to be the solution: more eyes and more hands on it, rather than one tool that will fix it.

When Are Containers Too Much? [43:08]

Zane Hamilton:

Thank you, Jonathon. We lost Alan, but back to what he said earlier: he mentioned a large container, and I think it was like 40 or 80 gigs. I’ve heard of some research scientists, or some admins, actually going and using something like Spack and telling it to compile every different version of everything they could possibly think of, on every different compiler, to try to cover everything, so it doesn’t matter what environment it goes into. I’ve actually heard of people ending up with 250-gig containers this way. Then you’re stuck with this massive container. At what point in time is it too much? That goes to John’s point of, you’ve made them butchers. Now they have this massive thing with everything in it. Do they need it? No, but they have it. What does that do? I mean, where does that line need to be drawn, and how can we help with that?

John Hanks:

I can share something that I butcher frequently: the R module in our software stack. We build every package from CRAN and Bioconductor that will build non-interactively. If you load up our R and count the libraries on any given install, it’ll be somewhere between 1,700 and 2,200 packages in our R module. It’s mostly because we want to make R, to the extent R can be made painless, completely painless for the people that use it. We don’t want anyone to have to ask us to add a package, or have to install a package themselves in their local directory. That all leads to problems. It also makes our software stack absolutely enormous. That’s before we even get into Anaconda, but a single R install for us will sometimes be 300 or 400 gigabytes.

Zane Hamilton:

Is that something that you think about? I mean, when you look at a container, it becomes difficult to move at some point, right? And when is that?

John Hanks:

That is not containerizable, I don’t think. That never goes in a container. That has to be an NFS mount to a software stack with a module in front of it.
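John’s pattern, a shared software tree on NFS with a module in front of it, is usually expressed as a modulefile. A minimal Lmod-style sketch; the paths and version are hypothetical, not from the discussion:

```lua
-- Hypothetical Lmod modulefile for an NFS-hosted R stack.
-- Paths and version are illustrative.
local base = "/nfs/apps/R/4.2.2"
prepend_path("PATH", pathJoin(base, "bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(base, "lib64"))
-- Point R at the site library holding the CRAN/Bioconductor builds:
setenv("R_LIBS_SITE", pathJoin(base, "site-library"))
```

Users then get the whole multi-hundred-gigabyte stack with a single `module load R`, without anything being copied to the node.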

Alan Sill:

Yeah. We’re sort of glossing over several related issues that center around what happens with the software, regardless of whether you put it in a container or not, depending on how you’re building and deploying. Let’s just pick, for example, Spack, as we have talked about it. One of the things that happens when you build software in Spack is that you lose the old paradigm of LD_LIBRARY_PATH-style environment variables. Everything goes in an RPATH. They are getting better at detecting things that have been built already, but basically everything that’s a dependency of that software goes in the RPATH, and therefore has to be on the file system for that software. It’s pretty hard to distribute that software to worker nodes, whether they are stateful or stateless. It’s actually hard to pick up the Spack libraries and copy them over.

What that means is it puts pressure on your cluster-wide file system, or at least the portion of it that you’re using for distributing software. We actually have a separate setup for that; we’re just implementing BeeGFS. We had a cluster file system before, but it wasn’t strong enough. NFS is popularly used, and that thing gets slammed: millions of open file handles, because every single thread of every single MPI application is opening dozens of file handles to stuff that’s in your RPATH. We actually found that the cluster file system couldn’t handle that at all, even though regular NFS can. There are implications for how you build and distribute the software, independent of whether it’s containerized. One thing about a container is that it’s a single SIF image when you move it. You’re moving it into essentially local memory, and those dependencies go away. It can be a way of circumventing that dependence on the cluster-wide file system, but at the cost of chewing up your memory, or local disk if you have enough space to turn on swap files. Local swap files can help. We don’t have that; we have very tiny disks.

This is just one example among many. Python has similar considerations, independent of building libraries. Like R, depending on how you deploy the software, you get not only huge libraries but huge numbers of file accesses: every single Python execution points to the .pyc files in there, across your cluster. It can be a huge bandwidth issue.

Zane Hamilton:

Thank you, Alan. Dave, I think you had some thoughts on this as well before I go back.

Dave Godlove:

Yeah. So going back to your original question, you’re basically asking how big a container should be before you draw a line in the sand and say, “okay, that’s too big; let’s figure out how to make your container smaller.” I think what that gets to is best practices. I have a pet peeve when people start talking about best practices, because best practices come with an assumption: best practices for what? For your application, for my application, for what? Containers are a very powerful tool. Whenever you build a really powerful tool and give it to somebody, they might use it for something you didn’t have in mind. That actually shows how powerful your tool is: somebody can adopt and adapt it to something totally different that you never thought of.

For instance, one of the things that I containerized in the past was a TensorFlow container. The TensorFlow container needed to be compiled two different ways, one with AVX instructions and one without, so that depending on which node it landed on, it would run intelligently. It would either run with AVX or without AVX, not just bomb out and not work. So we called that a fat container, like a fat binary, right? One that has multiple different ways of running. So I just hesitate to put limits on things and tell people that they ought to build their containers this way or that way, because you don’t know what they are trying to do. Maybe a best practice for your application is not a best practice at all for somebody else’s application. I don’t know. That’s just how I feel about the whole really big fat container thing.
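The fat-container dispatch Dave describes amounts to a small runtime check when the container starts. A hedged sketch of what such a runscript might look like; the /opt/tensorflow-* paths are hypothetical stand-ins, not Dave’s actual layout:

```shell
# Pick the right build for the host CPU, as a fat container's runscript
# might. The install paths are illustrative assumptions.
if grep -q -w avx /proc/cpuinfo; then
  variant="avx"
else
  variant="noavx"
fi
# In a real container this would exec the chosen build, for example:
#   exec "/opt/tensorflow-${variant}/bin/python" "$@"
echo "selected build: ${variant}"
```

The same image then runs correctly on both old and new nodes, at the cost of shipping both builds, which is exactly the size trade-off under discussion.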

Zane Hamilton:

Sure. Thank you.

Difficulty Installing Software in Scientific Environments [50:06]

Dave Godlove:

I guess one more really quick thought too. We were talking a little bit about why it is still so hard to install software in scientific environments, and that makes me think of a story. I brought my nephew one time to the Air and Space Museum in Washington, DC. We were looking at this really early satellite, and you could tell that it was handmade. It was put together by hand; it had writing on it, people had stamped things on it, and so on. My nephew was looking at this and he’s like, “this is supposed to be high tech?” He’s thinking about television programs like CSI, where you’re bringing up computer screens in the air and doing stuff. I’m like, “so, I used to be a scientist,” and I said to him, “look at this.”

This is what science looks like. You don’t often manufacture science on an assembly line. It’s not shiny and pretty. It’s one person or a small group of people hand-building stuff: prototypes that look ugly, but they work. This is why it’s so hard to install software in scientific environments: a lot of times your software is just a prototype. There’s no packaging involved. These are people who don’t necessarily write code as their primary job. They are in a lab trying to make their code analyze some data; they get it to work, and then they publish it. It’s your job to take it and figure out how to install it in your scientific environment. It’s handmade. So that’s my hot take on why it’s still hard to install code in scientific environments.

Zane Hamilton:

That was great. Thank you. John, I think you had something you wanted to add.

John Hanks:

I did, but Dave just prompted something that I remembered. I’m definitely guilty of making fun of biologists who write software. Biologists are terrible at writing software, and I’ve had to support a lot of really bad software. The one thing I will say in their defense is that at some point they stop; they are done with that piece of software. An actual developer will never stop making it worse. Developers won’t stop. You will never find a developer who will put on their performance goals for the year: “I’m done, this is perfect. I don’t need to do anything. I’m just going to sit here and draw a paycheck.” Right? They will always keep developing. Somewhere between the grad student who finishes their master’s and never touches the software again, and the developer who develops forever until they make the package impossible to use or work with, there’s a sweet spot, and I wish we could find that sweet spot.

Gentoo and Ebuild [52:48]

Zane Hamilton:

That’s a good point. Thank you, John. I have several questions coming in. I know, Todd, is that the one we are going to pop up first? Oh, so Martin asked, “Have any of the panel members ever used Gentoo, like ebuild?”

John Hanks:

Used it in the past. Haven’t in a while.

Jonathon Anderson:

Yeah, same here, though I did think of it in this particular context years ago, when we were trying to build software and find a way to automate it. I wondered why someone didn’t just take either ebuild or ports from BSD and port it for use in scientific software applications. But the requirements are quite different, particularly the multi-versioning and multiple builds of the same thing, which is why EasyBuild and Spack exist now.

Can Ansys be Containerized? [53:36]

Zane Hamilton:

Thank you. Had another one from Todd. There we go. Can commercial software like Ansys be containerized? Hot topic around here lately.

Chris Stackpole:

I would say: talk to your legal team and talk to the vendor. I don’t know about Ansys specifically; I know of them, I just don’t know if they would allow containerization. There are third-party commercial products that have specific rules for how you can do containerization. Some of them are very public, like, “hey, here’s what you have to do to comply.” Others, like MathWorks when I was working with them on MATLAB, were very concerned about having a MATLAB license that anybody could run. We had to work with our legal team and their legal team to basically establish, “yes, you can, but you can’t include a license number.” That has to be something the end user passes through. I don’t know if that’s gotten better.

It’s been a couple of years since I worked with them on that. But there are commercial vendors out there that are doing stuff, and then there are others where you mention containerization and they just kind of gloss over it. One of the reasons I really like Supercomputing is that I can go see all the vendors I work with, and I very often go and harass them at their booths: “why can’t I containerize your software?” Because they are at SC, you tend to be dealing with people who know the field you’re coming from, versus a straight cold call to somebody who’s like, “I don’t know.” At least there, they know what HPC is and they have a better idea of why. So that’s a great place to meet up with some of those vendors. But yeah, that’s a tricky one, because it’s all legal at that point.

Closing Thoughts [55:27]

Zane Hamilton:

Yeah. Thank you, Chris. That’s something we have talked about quite often: how do you do licensing for that with some of those vendors? That’s exactly what we have been discussing. Chris, thank you. I don’t know if any more questions have come in, and we’re getting close on time here. Glen, I feel like you’ve been quiet for a while. I’ll give you first shot at closing thoughts, then.

Glen Otero:

I think Alan brought up a really good comment, or thought, or theme: that reproducibility is much less important than transparency. Now that I think about it, I tend to agree, because the other thoughts swirling around in my head with regard to reproducibility actually originate from other sources, right? And I’ll pick on biologists again too, because I am one. It’s a problem that started a long time ago, right? Not telling us how you prepared your data or cleaned your data and things like that. The program you ran almost didn’t matter if you weren’t telling me how you cleaned your data: what you threw out, the averages you took, the statistics you did.

Also on the publishing side, right? What’s accepted for publication, where the bar is set. Publishing is going through this huge renovation: hopefully not just being open, but having to publish your data, like GigaScience makes you do. The elite journals, what they accept with regard to reproducibility, and so on. With regard to containers, I think transparency is a little bit more important there, because reproducibility has issues that originate from other places in science, and just giving us containers is not going to be a silver bullet.

Zane Hamilton:

Thank you, Glen. John, I’ll go to you next. Closing thoughts?

John Hanks:

I would just throw out a suggestion for admins, especially people who are beginning HPC admins: stay away from tools that try to do your job for you, and learn how things work. Invest the effort in learning how things work. It will pay massive dividends down the road. There are no shortcuts to knowing what you’re doing.

Zane Hamilton:

That’s good advice. Thank you, John. Jeremy?

Jeremy Siadal:

Well, two thoughts. One: I think it was made pretty clear that if you are an author or a researcher publishing papers, and any of your results are based on software, keep a copy of that software and put it on the shelf, because it might be needed. It will save your sysadmins a lot of reproduction work. The other thing, and I like to think that part of my job is trying to make life easier for sysadmins and end users: there was this interesting comment that came up that said, look, by moving to a containerized model, we are not shuffling all the work off to the user. I think the difference is that users come to you now and say, “I need this piece of software. I need you to set up the licensing server. I need you to configure it this way.” The change is going to be that in the future they’ll start coming to you and saying, “can you help me build my container?” To me that’s just a shift in work, and I think in the long run it’s going to be a lot easier.

Zane Hamilton:

I hope so. Thank you, Jeremy. Chris?

Chris Stackpole:

Yeah, I fully agree with a lot of what has just been said. I think it is very important to know how the tool works so that you can fix it when it breaks, because it inevitably will break. There will be an update or something that breaks it, and you’ll be responsible for fixing it. I caution against using tools whose workings you don’t really understand, because of that situation. They are great tools, especially if you are the only admin for the HPC system and everything falls on you. The better you can reproduce your work, so that when a new version of a package comes out you can easily rebuild it and deploy it for your users, the better off you are. That’s all really important. Understanding the process of how you got there really does help with updates, and even when you need to reproduce something and rebuild an old version.

Zane Hamilton:

Great. Thank you Chris. Dave?

Dave Godlove:

Yeah, it’s been a really interesting conversation. Just to riff really quick on something that Jeremy said: I totally think that containers could be used for evil, to put all the onus on the users, but I don’t think anybody is really doing that. I do think that users ultimately might come to admins at some point and say, “can you help me build my container?” But I also think that admins, support scientists, application scientists, and so on are taking that step for the users. A lot of times users come to you and say, “can you help me install this complicated thing? I don’t know how to do it.” Then as an admin you might say, “sure, I’m going to do it in a container, and by the way, I’m going to do it in such a way that you might not even have to know there’s a container there to use it.” Another great use case for containers is just to make the application engineer’s, the scientist’s, or the admin’s life easier as they support the user.

Zane Hamilton:

Thank you. Jonathon?

Jonathon Anderson:

My main thing right now is just to encourage people. This is an area of innovation; there are cool new tools and new techniques being developed right now. If you are interested in that kind of thing and looking for ways to get involved, maybe you’re an open source person and want to do more of that, or you have a pain point: there’s activity here, and there are a lot of eyes working on this problem. I think there’s a lot of opportunity for people to contribute.

Zane Hamilton:

Thank you, Jonathon. Alan, I’ll give you the last word.

Alan Sill:

Well, really fantastic job. You all focused on some great points. There’s not enough time to make all the observations that could be made, but I just want to advocate for a couple of things. One is, as I mentioned earlier, to Greg’s horror, the Exascale Computing Project’s E4S single container. It’s not intended to be deployed that way; it’s just a package for the built stuff that you can use for testing. I think there would be value in us doing something like that aimed not at the exascale stack, but at a typical HPC stack. What I’ve done to try to facilitate that is I’ve gotten hold of all of the CI/CD steps that they use. Some of them are internal, but we could reproduce that.

We could actually pick up Jonathon’s point and get a community effort going to build a typical HPC stack. If you just wanted to get started, you could download the container and try it out, and then go back to the recipes and build what you want, because we would include the recipes. I think there’s value for the community in trying to reproduce what E4S has done at the local scale. Then I just want to bemoan the loss of Singularity Hub. In practice, because we only support Singularity or Apptainer on our clusters, a lot of people’s workflow is to go pull the Docker container and convert it into Singularity. We are relying on Docker Hub. Is that the right way to do things? I don’t know. I think we should look at having a community resource for building and maintaining Singularity containers as well. I’ll stop there.
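The pull-from-Docker-Hub workflow Alan describes is usually either `apptainer pull docker://<image>` or a definition file that bootstraps from a Docker image. A minimal, hypothetical sketch of the latter; the base image and package are illustrative, not from the discussion:

```singularity
# Hypothetical Apptainer/Singularity definition bootstrapping from Docker Hub.
Bootstrap: docker
From: rockylinux:9

%post
    # Illustrative build step; a real recipe would install the application here.
    dnf -y install python3

%runscript
    exec python3 "$@"
```

Built with something like `apptainer build app.sif app.def`. The point Alan raises still applies: the `From:` line leaves the build depending on Docker Hub being available.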

Zane Hamilton:

Oh, that was a great point, Alan, and we have been talking about that quite a bit internally. I know Dave has been talking about it a lot. Really appreciate it. Thank you all for joining. Our time is up; it’s actually a little bit over. So thank you for joining this week. Looking forward to seeing you next week. Please like and subscribe, and we will see you soon.