Fuzzball HPC 2.0 Workflow Demonstrations
- What is QMCPACK?
- Creating a QMCPACK workflow
- Can You Egress To Any Store?
- Data Ingress and Egress
- Running a Dask Workflow
- How to Pre-Process Data
- Task Arrays
- Is It Possible to Pre-Spool Resources?
- Reusing Spooled Nodes
- Creating A Quantum Espresso Workflow
- What Does a Workflow Do If a Node Fails During a Run?
- QMCPACK Results
- Dask Output
- Quantum Espresso Output
- How Do NUMA Architectures Interact With The Workflow?
- Quantum Espresso Output
- Multiple GPUs Per Node
- Zane Hamilton, Director of Sales Engineering, CIQ
- Gregory Kurtzer, Founder of Rocky Linux, Singularity/Apptainer, Warewulf, CentOS, and CEO, CIQ
- Forrest Burt, High Performance Computing Systems Engineer, CIQ
- Michael Young, IT Manager, CIQ
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Full Webinar Transcript:
Good morning, good afternoon, and good evening. Welcome to another episode with us from CIQ. We appreciate you spending time with us today. Today we’re going to talk more about Fuzzball and workflows. Forrest has spent a lot of time coming up with some great examples, use cases, and topics for us to talk about today. So we have Forrest and Michael, and Greg to join us. Welcome, fellows. Good to see you. Forrest, I know you put in a lot of time over the last few days to put together another set of good examples of what Fuzzball can do and what use cases we’re seeing. People are asking many questions about it, so why don’t you go ahead and tell us what we’re going to see today.
What is QMCPACK? [00:00:48]
Today, we have a couple of different workflows that we will be showing; the first example is QMCPACK. It is a piece of software people use for doing Quantum Monte Carlo computations – some very low-level quantum interactions, different types of very specific simulations. It’s a fully-featured suite; there’s a lot of different things you can do with it, so we’ll be taking a look at that. We also have Quantum ESPRESSO queued up, which is essentially quantum dynamics software used for very nanoscale materials modeling, nanoscale electrical calculations, that kind of stuff.
Both of those workflows are more focused on an extremely small scale. We will also be looking at a random example of using what we call a task array in Fuzzball, which allows for embarrassingly parallel runs to be done. We’ll be looking at that in the context of something you could view as an example of quantitative finance. Without further ado, I will go ahead and start off QMCPACK and go through what we’re looking at there. I’ll go ahead and share my screen.
For those who are watching, feel free to ask questions. Let’s keep it live and interactive. Whatever you have, throw it in the chat. I also want to point out that Forrest is doing this live. It’s always fun and exciting to do live demos.
Yes, it is indeed. As you can see, I’m here on my VS Code terminal. That’s where we’re going to take a look at the YAML definition that’s going to power this QMCPACK one.
Creating a QMCPACK workflow [00:03:03]
Let’s move forward. You can see that I’m here on the command line. We’re just going to do a quick Fuzzball workflow start, and then we’ll provide a name for it. We’re going to call it users/[email protected]/qmcpack-demo. Then we just provide the path to where that workflow definition I’m about to go over is at, so we’ll do this: /qmcpack.yaml. Perfect.
Okay, we’ve just gone ahead and started that workflow. I’m going to put its status up here so we can monitor it, kind of switch back and forth. You’ll see we have a few different things going on here. I will explain everything. Let’s switch back quickly to that VS Code window that I was on, and we’ll run through what everything is doing there. This QMCPACK workflow has a few different jobs in it and also has a few images and a few volumes. It uses more than just one of those things like maybe we’ve shown in some of our past demos. Just beginning at the top of this QMCPACK workflow, the first thing we’re doing is creating two different volumes.
We’re creating one for some data to be ingressed into, and then we’re creating one for some data to be egressed out of. This is done just for demonstration purposes. There’s no specific need to create an entire directory to do an ingress or an egress. You can do it with one volume just as easily, but to illustrate moving data between volumes and that type of thing, I’ve chosen to do this with two. You can see that I’m reaching out to an S3 bucket on AWS. I’m going to pull down this file for QMCPACK to use from that bucket. Then we’re going to place it into this first QMCPACK data volume at this location. Both of these are ephemeral volumes, meaning they only exist for the lifetime of the workflow.
This data will be ingressed. The jobs will be able to operate on it in a persistent manner, relative to other jobs. Then once the egress here is done, the ephemeral volume will essentially be dropped on the floor, and the data in it lost. So anything not egressed will be lost.
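A pair of ephemeral volumes with an ingress and an egress might look something like this in a workflow definition. This is an illustrative sketch only – the field names, URI schemes, and bucket path are assumptions modeled on what Forrest describes, not necessarily the exact Fuzzball syntax:

```yaml
volumes:
  data:
    reference: volume://user/ephemeral        # exists only for this workflow run
    ingress:
      - source:
          uri: s3://example-bucket/qmcpack-input.tar.gz
        destination:
          uri: file://qmcpack-input.tar.gz    # lands inside the volume
  egress-data:
    reference: volume://user/ephemeral
    egress:
      - source:
          uri: file://qmcpack-results.tar.gz  # produced by the final tar job
        destination:
          uri: s3://example-bucket/qmcpack-results.tar.gz
```

Anything the jobs write into either volume that is not covered by an egress is dropped when the workflow finishes.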
So, like I said, we have two volumes. We’ll move to the jobs now. We have four jobs in this case. The first one is just an untar. This is basically just untarring the tar.gz that we ingressed in this first step. We’re mounting that data volume that we ingressed into, with this mounts field, to this location inside the container. As we’ve noted, in Fuzzball, everything is a container. All the jobs that this computational workflow will be doing are run out of a specific container. In this case, because we do not have any other need for specific tooling, we’re just using a small Alpine container pulled from Docker Hub to provide tar for this initial untar job.
For that job, you can see that we have eight cores requested and 30 GB of memory – the command is just a tar command – so this is just spinning up a standard CPU compute node to do an untar on. Next, we have the run QMCPACK step, which is a little bit more complex. You can see that the image we’re using is something we’re pulling from NVIDIA’s NGC. This is something that they’ve put together to run especially well on GPUs. In this case, as far as GPUs go, we’re requesting one of them.
We’re also requesting four cores and 60 GB of memory. This is going to match, more or less, the definition that I’ve set up, which will pull us down a node off this public cloud provider with one NVIDIA V100 GPU in it. With this MPI field, we’re requesting four nodes with the Open MPI implementation of our MPI wrapper. You can see that we’ll end up getting four of those compute nodes, so we’ll have four V100s in total, with our MPI wrapper managing parallelizing that over all those nodes. This step takes 15 to 20 minutes to run through all of that. We have an environment variable that we’re setting to ensure that we’re using the right process management library relative to Open MPI.
Then you can see the command we’re running here is just QMCPACK based on input data. You’ll notice that we’re mounting the QMCPACK directory to this host PWD directory inside of the container. We’re also saying that to be our current working directory. We’ve done that because we untared in this step all that data into that host PWD directory. In this case, we need to specify our working directory and then what file in there we’re going to be feeding into QMCPACK.
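Put together, the run step Forrest walks through might be declared roughly like this. Again, a hypothetical sketch: the image tag, input filename, environment variable, and exact field names are illustrative rather than copied from the real definition:

```yaml
jobs:
  run-qmcpack:
    image:
      uri: docker://nvcr.io/hpc/qmcpack:latest  # pulled from NVIDIA's NGC; tag illustrative
    mounts:
      data:
        location: /host_pwd                     # the directory the untar job populated
    cwd: /host_pwd                              # run from inside that directory
    env:
      - OMPI_MCA_example=value                  # placeholder for the process-management setting mentioned
    resource:
      cpu:
        cores: 4
      memory:
        size: 60GB
      devices:
        nvidia.com/gpu: 1                       # one V100 per node
    multinode:
      nodes: 4                                  # four nodes, so four GPUs in total
      implementation: openmpi
    command: ["qmcpack", "input.xml"]           # input filename illustrative
    requires: ["untar"]                         # the data must be unpacked first
```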
The next couple of jobs are pretty simple. One is just an “ls,” so we can have visibility into what data was produced from this run, once it’s done. There’s nothing too special there. It’s, once again, just Alpine doing an “ls -larth.” You can see what these require – we’re building up a directed acyclic graph of execution here. We do that because the data has to be untarred and the simulation has to be run, so we have a path of execution that we’re trying to follow here. This “ls” command, or this “ls” job, will show us what’s in that host PWD directory. Our last job is tar outputs, into the volume’s egress directory.
We’ll notice we’ve now got three different containers being orchestrated and pulled from somewhere and managed for this workflow. These tar outputs will use something very minimal to tar up what’s in the host PWD and place that into egress. Once this job is done, the workflow itself will be complete at the bottom of this egress and this file, which we tarred up in that step. You can see that this is the directory that we’ve created, that tarball there, and you can see that we’re going to egress that same directory up to this S3 bucket right there. Let’s move to the other screen and see how this is working.
I’ll show you this running, and we might get into a couple of the others because this will take about 20 minutes or so. So just moving back over to this quickly, you can see that we’ve been watching this status output here. To map what we’ve just seen in that workflow to what we’re seeing here in the status output, we have all these lines here along the left. Each represents a different part of this workflow. You can see we have ‘workflow’ appearing at the top. That’s just delineating the workflow as a whole. It represents the time that overall it was started and finished and then its name. We have the two volumes that we set up right here that finished instantly. Those are just setting up the QMCPACK data and egress directory that we will use to store some data for these jobs. So those are the two volumes being set up.
We then have the two file transfers. One of them says finished, while the other is pending. You’ll notice this one shows something moving in from S3. This one shows something moving out of the cluster and onto S3. This one is finished because this represents the data ingress that we showed there. This one is pending because it represents the data egress that won’t happen until all these jobs are finished. We have these images here that have been downloaded as well. Each one of these image lines is a container that has been downloaded for this workflow to use. As I mentioned, we’re using Alpine to provide basic tar stuff; we’re using QMCPACK for QMCPACK. We then have the four jobs here, untar, run QMCPACK, “ls,” tar outputs. You can see that the untar finished and ended about three minutes ago. Right now, run QMCPACK will be spooling.
Can You Egress to Any Store? [00:12:06]
Forrest, I may be jumping ahead a little bit, but the egress on that can be to anything. It doesn’t have to be S3. It can be to whatever store you want it to go to, right?
I mostly use it in the context of S3. If we had set it up as a persistent volume, we’d be able to perhaps move it out into a physical architecture. But at the moment, I mostly use that to push to S3. You’ll notice that in that workflow I had there, I have very specific parameters that I’ve set.
You’ll notice that in the secret section of this egress and the ingress, we have the S3 access key ID, S3 access key, and S3 region. These allow you to specify standard S3 API-compliant credentials to reach out to any S3-compliant API object storage out there. Sometimes, the S3 API isn’t just provided by Amazon S3; it’s also provided by some other storage platforms. You can utilize the same commands and such that you would use. You can treat it as if it was an AWS endpoint. It’s not shown here because I am hitting AWS S3 here, but we do also have the ability to set an S3 endpoint that allows you to specify that custom endpoint if some other cloud storage provider happens to have a different endpoint that they need you to use.
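The credential fields described above might appear on an ingress like the fragment below. The secret names, URI scheme, and endpoint field are assumptions for illustration; only the three credential parameters (access key ID, secret access key, region) and the optional custom endpoint come from the discussion:

```yaml
ingress:
  - source:
      uri: s3://example-bucket/input.tar.gz
      # Standard S3 API credentials, resolved from the secret store:
      secrets:
        access-key-id: secret://user/s3-access-key-id
        secret-access-key: secret://user/s3-secret-access-key
      # For a non-AWS, S3-compatible object store, a custom endpoint
      # could also be specified:
      # endpoint: https://s3.example-provider.com
    destination:
      uri: file://input.tar.gz
```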
I want to point out here that as this is running, we should be able to see this differently now. This is kind of cool. You’ll see that this job was pending, and if I go to the Fuzzball provision resource list, which gives us the list of all of the compute resources currently running on the cluster, you’ll see that we have four of these creating. I won’t bring up the more verbose output just because I’m unsure if I want to have that shown here, but if we were to look at it, we’d see that these are running p3.2xlarge instances for each of these. You’ll notice that these were created when it started running and when it was pending. That shows that this job is now running and that these are the resources it’s running on. I don’t think it uses the GPU for every bit of the computation, so if this doesn’t show an incredible amount of usage, that’ll be why, but if we were to do something like “fuzzball workflow exec”…
Data Ingress and Egress [00:15:32]
If you don’t mind me jumping in real quick, I have a slide I can share that’ll graphically describe what we’re seeing with regard to the data ingress and egress and whatnot.
So what this is demonstrating is, on the left side, you’ll see the user’s workflow coming into Fuzzball Orchestrate. That workflow is what Forrest was describing. Within that, it has the context of ingress, and the context of the job pipeline, and then the context of the egress: that comes into Fuzzball Orchestrate. Now, Forrest is running this up in the cloud right now. When that workflow comes in, it will start doing a bunch of things simultaneously. You can see some of these things depicted on the right side. For example, you can see the data lake. The data lake could be any form of an S3 data store; it could be an HTTP endpoint; it could be anything. We’re going to ingress all of the bits of data we need for this job. You’re going to see the ingress that was under the volumes in that. You’re going to see the containers coming in from that as well. And then we’re going to set up the volumes underneath that.
In parallel to that, because we’re running this up in the cloud, Fuzzball will be provisioning cloud resources and instances to run that job pipeline. It will satisfy all the hardware requirements of that workflow in order to run that job. If you’re running an MPI job, if you’re running task arrays, if you need GPUs, it will provision that resource out and compose that resource as well, if you need and if you have composable hardware. Once that is done, it’ll start running that job pipeline on top of this volume. And then, at the very end, we’re going to do that egress. What you saw Forrest sharing was a lot of these pieces being depicted by the monitor, basically by what is running it. You can follow these different steps. So let’s go back to Forrest’s screen now, and he can actually show exactly what that is and what stage that is in.
As I pointed out, we have all these GPU nodes running. If I go ahead and do a “fuzzball workflow exec,” and then I provide the name of the workflow, the job that the workflow is running, and then an arbitrary command – if I were to specify TTY here, I could do /bin/bash and get a shell into this. I’m just running nvidia-smi, though, in this case. We have the output from nvidia-smi coming from one of these compute nodes. You’ll see that at the moment, we don’t have any utilization. Like I said, it’s not going to use the GPU for every part of it. We wouldn’t expect that, but we do have processes running there, and we do have some amount of memory being used. If we go down here, we can see we have 4%, so not much, but starting to use the GPU, and that memory changing – so you can see that we have some GPUs that we’re running on here. Moving back to the status of the workflow, we’ll see where that’s at. You can see it’s still on this run QMCPACK step. Why don’t we follow the logs for a second and see what this is outputting.
We can see that we got this pending message when nothing was being printed to the log. But we can see what’s been run as a part of this simulation: total number of MPI groups, one; total ranks, 16; ranks per node, four; accelerators per node, one. We see that we have one GPU per node there. You can see that we’re finding accelerators and stuff there. We get a memory usage report from it as it runs. Like I said, we’ll get this pending message – we might want to stifle that, and I’ll make a note – but we can see that this is running. That’ll take another few minutes to run, so if we wanted to, we could start something else right now and go through that.
Running a Dask Workflow [00:19:54]
I am going to hop over to my Dask example, really quickly. This is a basic example. This shows Dask running in general within this platform and more of the purpose here is to show how a task array works and how it can allow you to process many, many input files at once in an embarrassingly parallel manner.
We have this workflow here. Once again, we’re going to reach out to my S3 bucket and pull down more-dates.tar.gz. more-dates.tar.gz has a hundred input folders in it, each looking like it’s got a random date attached to it. Inside each one is a file, prices.txt, that has 20,000 random numbers between 50 and 100. We’ll bring that down, and it will get pulled into this data volume we have here, v1. This is an ephemeral volume, so it will last for the lifetime of the workflow but be destroyed once the workflow has finished.
We’ll pull down this bit of data from that S3 bucket. We’ll do a similar untar job to what we did before. I’m using a private Docker registry to pull this down, just showing you how you can pull directly from a private repo on Docker Hub itself. I have an access token configured within our secret system and such to be able to get in. I’m pulling down this Dask compute image that I made, which includes Dask and some other components. Once that’s pulled down and the data is untarred, we’ll pre-process the data with this little script right here called munge-folders.py. And there are a few different ways you can do this.
How to Pre-Process Data [00:21:58]
This is a Pythonic way. When we have a task array, the interface that we get for assigning different parts of that task array – or assigning the resources that the task array brings up to different subtasks of the overall computational job we’re trying to run – is this FB_TASK_ID variable here. I’m getting a little ahead of myself because I have to explain this script right here, but this FB_TASK_ID is set with the task number relative to what’s set here in this task array. Essentially, what munge-folders does is go through the 100 input folders that we have and append to the front of each one a running count from 1 to 99, so that the task array then has a way to link a specific input file to a specific computational resource that’s brought up.
Task Arrays [00:23:01]
This will take that dates file, munge it a little bit, and add those numbers to the front. Once that’s done, you can see we’re using very minimal resources; that’s a quick task. We then get to the last one, which is ‘run Dask.’ As I have mentioned, this is a task array. I don’t think we’ve shown one of these before, so I’ll elaborate. A task array enables the embarrassingly parallel model of high performance computing. We’ve gone over MPI quite a bit, which is actual core-to-core and node-to-node communication happening in a high performance computing or parallelized simulation.
Embarrassingly parallel is the contrasting case to MPI, where incredible effort is required to write an MPI application. With the embarrassingly parallel model, little to no effort is necessary to separate the input into discrete computational tasks that can then be run independently of any other task. In this case, we have a hundred input files. I might have an off-by-one error here, but I’m going to ignore that and say it’s a hundred. I will start numbering them at task one, I’ll stop numbering them at task 99, and I’ll run 10 of them at once. What we’re going to see this do is bring up ten nodes at once to run this. Then this will do the processing right here in this Dask compute file. We are essentially opening that prices.txt file, reading everything in from it, and then feeding in everything that we had in it.
This will take those 20,000 input numbers from each of those 100 input files, raise each one to the power of two, sum all 20,000 of those now-squared numbers together, and then divide by 20,000, so we essentially get the average of the squares of those 20,000 numbers.
Moving back to this task array here, with those 100 input files, it’ll map to that. We can see that Python 3 runs, and then in this file, we take in the first member of the arguments array. That’s why we pass it to it right there. In this case, you can see how we link the task array into some actual code. I’m going to go ahead and run this. In the meantime, we’ll also return to QMCPACK because it looks like it is doing something.
As you can see, QMCPACK is still working. The run QMCPACK step is finished, the “ls” step is finished, and the outputs are actively being tarred. Once that’s done, this egress will continue. In the meantime, let’s open this up real quick. We can see here that QMCPACK execution completed successfully in the run QMCPACK job, and in tar outputs we’re gradually getting an output of what’s being tarred in that directory. We’ll have that on S3 in just a little bit. Let’s go ahead and start Dask before we run out of time.
It’s ‘dask-demo’ and then ‘dask/numbers/number-task-array.yaml.’ You can see some other stuff in here. That’s essentially just testing other scripts and things like that – just the mess from getting this working that I forgot to clean up. We’ll go ahead and watch this status for a moment because this will be fun to watch unroll. This workflow has started, and we’ve got this S3 ingress happening right now. We’re bringing in that ‘more-dates.tar.gz’ file from S3. We’ve already got the image cached because I’ve been running this, and in here, in just a moment, we’ll start to see these jobs kick off. We’ll watch this here for just a moment. When the task array kicks up, it’ll essentially unroll 10 tasks, and this output will become much more complex. I want to make sure that we capture that here. Once we see that, we’ll let this run for a second. Then I think we will pivot quickly to Quantum Espresso because that takes about 15 minutes to run.
I love seeing the output changing as it goes through the different tasks and watching it. Something’s started; something’s finished; something is happening over here now. You can see all of the dependencies of the job graph being resolved in real time and implemented in real time as you watch this output. It’s super cool.
As you’ve seen here, this job here, ‘run-dask,’ has started, and it’s unrolled 10 tasks out of it. This has started this top-level job for the task array. At the moment, it’s spooling 10 nodes to run these on. So these are all pending. We’ll see those come up here within the next few minutes. I’ll make a Fuzzball provision resource list so that we can see.
Is It Possible to Pre-Spool Resources? [00:29:35]
Is it possible to pre-spool or pre-provision some of these resources if you know you will have some resources you want to run, so the cluster is ready for it?
Yeah, we do Fuzzball provision resource create. I think we would do something like users/[email protected]/default_cpu_node. That spooled up a resource, so if we were to look closer at this–and this is a bit of a mess at the moment; I think a few different workflows are conflating here, so that’s why we see some are running; some are creating. I suspect most of the ones creating are the 10 that are still spooling. We’re just waiting a couple of minutes for those to come up.
Hey. Sorry for throwing you off on that.
That’s fine. It’s a good point that you can pre-provision those. I’m not sure how sophisticated at the moment the linking of those created resources is, like whether or not it’ll still want to provision some. I’d have to investigate a little more, but there we go. We see this starting up there.
If this was a physical cluster that was on-prem and all the resources and nodes were already there, there would be no spooling up. All this happens instantly, right?
Exactly. Because the resources are already there, it does not have to negotiate with the public cloud provider, anything like that, to bring these up with whatever playbooks. We see these starting here. This will start to finish these tasks, then we’ll begin to see more starting. That concurrency, in case I didn’t make it clear, means that we’ll have 10 nodes running at once. So we’ll have 10 Dask tasks maximum running at one time. We can see we have, at the moment, 7, as we’re waiting for a couple more nodes to spool.
Reusing Spooled Nodes [00:32:00]
It looks like it’s reusing those nodes, so nodes that have already spooled up and been provisioned. It’s not shutting those down and then restarting them. We’re seeing them come back up much faster now.
I want to point out that there are nuances to that, depending on where you’re looking. Still, in the case of a task array, yes, it will just continually land those tasks on the resources that are already up. You can see the usefulness of that: if the instance you’re trying to use costs $30, you wouldn’t want to have to spin up 30 or a hundred of those $30 instances. You want to spin up 10 of them and then have the tasks run across those. We can see that this will go through and process those files pretty handily. We can go ahead and take a look at the log output as this goes. As I mentioned, this will print the average of those numbers. They were all generated from about the same random space, so we should see that the numbers being printed out will all be within some small amount of each other.
What we see here is all of these tasks gradually printing out their outputs as they go. We can see the task name right there. We sometimes get these failed-heartbeat messages, but if you look, we’ve got one on task 22, and then task 22 was completed successfully right there, so we’re not worried about those. You can see that we’re essentially just printing out the name of the file we derived this data from. In this case, this is meant to represent January 30th, 1999, and we assigned it to task four. We munged this little four-dash here onto the front to make sure we can map that to a task. If we come down to the bottom, we see that this is still running. I’m going to check the resource counts a little closer to see if everything is cleaned up.
While Forrest is bringing that up, if you have any questions, please start. Enter those into the chat so we can answer those as we go.
While that finishes up. I’m going to move over really quickly and show the Quantum Espresso workflow because we still need to show some of the AWS stuff.
Creating A Quantum Espresso Workflow [00:35:12]
In this Q Espresso workflow, you can see that we have one volume we’re creating here, Q Espresso data volume. We’re going to bring down, once again, some data from that S3 bucket, just these QE inputs. The inputs for QMCPACK and Quantum Espresso are just available out there. I believe that they’re the examples that are on the Nvidia NGC, if you visit the container pages for these pieces of software there. So you can see we’re going to do an ingress. We’ll bring that bit of data in here, and then once that data is done, we’ll egress that out to this S3 bucket, once again.
We have three jobs in this case. The first one is an untar, not like anything we haven’t seen before, so that’s pretty simple. The second job here will do the same thing resource-wise as QMCPACK did. It’s going to bring up four nodes with the Open MPI implementation of the MPI wrapper, looking for four cores, 60 GB of memory, and one GPU on each one of those nodes. In total, we’ll have four V100s coming up. You can see that we are making our CWD the inputs directory that we untarred in the prior step, and then we’re just running the Quantum Espresso executable over that input data. Two files get unpacked from that: this tarball right here, and then you can see the image pulling down from the NGC, and there’s not much else to see on that one. Then up here at the top, we, once again, tar up the outputs, and then we place those into that data directory that we have so that we can egress those out to S3 afterward.
We also know we have requires here, run Q Espresso, untar, so we’re creating a basic directed acyclic graph for this to follow. We’ll move over really quickly to the command line once again. Then we’ll go ahead and run that one.
What Does a Workflow Do If a Node Fails During a Run? [00:37:27]
We do have one question for you, Forrest: what does the workflow do if a node fails during the run?
If a node fails during the run, it depends upon the behavior you defined in your workflow. We have the ability to set policy on a workflow job. You can set two types of policies: a timeout policy and a retry policy. A timeout policy defines how long the workflow job should be allowed to run before it gets terminated due to timeout. Retry tells Fuzzball how many times a job should be retried if it fails for any reason. And so it’s just as simple as adding that retry policy to your workflow.
So pretty simple. Just add a line and it’ll retry.
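In the workflow YAML, that might look something like the fragment below – a sketch only, since the exact policy field names aren't shown in the demo:

```yaml
jobs:
  run-qmcpack:
    # ...image, command, and resources as before...
    policy:
      timeout:
        execute: 1h     # terminate the job if it runs longer than an hour
      retry:
        attempts: 3     # re-run the job up to three times if it fails
```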
To link what we’re doing here to that definition we just looked at, you can see that we have the workflow up here, the start and finish time of the workflow, and the one data volume we created in that workflow definition. We have the file ingress that we did. That’s already finished. We have the file egress that we will do pending because these jobs haven’t been finished. We have the untar. We have one image we’re using, just Quantum Espresso in this case. Then we have untar here and just wait on Quantum Espresso. This will take, similar to QMCPACK, a few minutes to spool up those nodes.
QMCPACK Results [00:39:12]
In the meantime, let’s hop back over to QMCPACK and see what it’s done. It’s finished, including our egress and everything. If I hop over and do a little “aws s3 ls” on our s3://burt-storage bucket, we can see, timestamped just about 10 minutes ago or so, qmcpack-results-full.tar.gz, and then do an “aws s3 cp.” We’ll give this a moment to download, and then I’m just going to “tar -xf” that and show the results we have there. I don’t have any fun graphs or anything I can bring up and show, but we should be able to see the output files here easily and how we’ve just recovered those.
You’re just showing off what a reasonable internet connection looks like.
That was amazing.
Yeah. I definitely like my one gigabit down. They tell me that the local fiber place can get me one gigabit up, which is very useful for container pushes. So I’m exploring that as well.
I’m not going to tell you what mine is, where I’m currently at. Sometimes I get 20.
Another question: how do NUMA architectures interact with a workflow, and are there some examples?
Because that tarred up the host PWD directory, that’s what it, unfortunately, gets decompressed as. But if we go in here, you can see that we have all the files we wanted to recover. The initial data files that we brought in are still tar.gz-ed, so we may want to avoid doing that if those are large, because we wouldn’t want to just be copying those over. We have all of the information that we provided to this. I believe this is the actual input file itself. We have all these scalar.dat files that represent the outputs. I don’t have a little QMCA demo spooled up to pull, like I said, some graphs out of this. You can see those are the results, and if we go to this and print out all the logs again, as I showed earlier, we can scroll back through this. We’ve got device memory allocated via CUDA allocator – we were talking about allocating GPU memory. We’ve got the “ls” here that we ran, so that’s just showing us what’s in that data directory that we ended up tarring up and bringing down.
It might be Quantum Espresso’s output that has a very obvious point where it says using CUDA to accelerate MPI or something like that. Ah, here it is: running on an NVIDIA GPU via CUDA acceleration.
So that’s QMCPACK; once again, the execution was completed successfully, and we have our timing and information like that here at the output. While Quantum Espresso is probably still working, we’ll hop back over to Dask to finish that out and show what that looks like.
Dask Output [00:43:17]
We can see in this output we have the single run-dask job and then the 100 (or 99) tasks coming off of that. All of them start at 11:28; that’s the whole job itself. We expected this to take a few minutes, with all of these finishing fairly quickly: 11:31:15, 11:31:48. As you can see, as we progressed through all of these input files over seven minutes or so, we processed all 100 of them. We can extract all that data if I do a log -f. Once again, as I noted, we get this odd error sometimes, but we ultimately still get results. You’ll notice these are all somewhere in the vicinity of 5,840. All the random numbers here were generated from the same range, between 50 and 100, so we expect to see about the same result on each one of those, and it’s good that we do.
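The per-task computation isn’t shown in the webinar, so this sketch of its fan-out shape (one driver spawning ~100 independent tasks, each summing random numbers between 50 and 100) uses the standard library in place of an actual Dask cluster; the task body, the 78-draw count, and all names here are assumptions chosen to land results near the ~5,840 seen on screen:

```python
# Sketch of the demo's fan-out: ~100 independent tasks, each summing
# random integers drawn uniformly from [50, 100]. The real demo runs
# on a Dask cluster; concurrent.futures stands in for it here.
import random
from concurrent.futures import ProcessPoolExecutor


def task(seed: int) -> int:
    # One "array" task: 78 draws is an assumed count that puts the
    # expected sum near 78 * 75 = 5,850, like the results on screen.
    rng = random.Random(seed)
    return sum(rng.randint(50, 100) for _ in range(78))


def main() -> list[int]:
    with ProcessPoolExecutor() as pool:
        return list(pool.map(task, range(100)))


if __name__ == "__main__":
    results = main()
    print(min(results), max(results))  # all clustered around 5,850
```

Since every task draws from the same distribution, the results should cluster tightly, which is the same sanity check being done on the log output here.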
That’s Dask and QMCPACK. Let’s move back to Quantum Espresso and see how it’s doing. We can see it’s running right now. Why don’t we take a look at that while it goes?
Quantum Espresso Output [00:44:28]
We are still getting that pending message, and I’ll have to note that it should be suppressed so it’s not just constantly printing. We can see that we’re starting to get some output here, and I imagine we’ll get some more in a second. If we look through this: the GPU acceleration is active, GPU-aware MPI is enabled. We can see 16 MPI processes, four threads per MPI process, and the MPI processes distributed across four nodes. Then we get information about the simulation. We can see this is printing, except when it’s pending. I think this does 20 iterations, and it’s on iteration one, so it will take a little bit to run. We’ll move back over to the status.
You can see that at the moment, we only have five nodes running compared to the very large amount we had earlier. All of the nodes that we spooled up for Dask and those that we spooled up for QMCPACK have been spun down and are no longer visible in our resource list. We have five; we expect four for this Quantum Espresso run, so the fifth is probably just taking a little bit to spin down. I’m sure it’ll disappear here in a moment. This will probably take about 12 minutes or so to run, and once that’s done, I think we should hop over, take a look at that S3 bucket, and show how those results have been uploaded there as well.
How Do NUMA Architectures Interact with the Workflow? [00:46:11]
Did we address the NUMA question?
Regarding the NUMA question, let’s hop back over really quickly. I will bring up my text editor because that’s the easiest spot. To pick one of these, we’ll use the run-qmcpack job, because it’s the one most likely to actually have some use for NUMA awareness. You’ll notice that where we set the CPU cores we want here, and whether or not we want hyper-threads to be enabled, we can also set a NUMA affinity. There are two other possible affinities; I can never remember the name of the one for the whole node, but I believe they are NUMA, socket, and core at the moment. The affinity we’re looking at here determines where on the node the CPU cores we’re asking for will be chosen from.
If we have 32 cores on a node, depending on what’s going on with the server, we may want to land all of the cores we’re using within a certain subset of the total cores available on the node. This is typically done when we have what’s called a non-uniform memory architecture underlying the server, i.e., the compute node. NUMA is essentially where, say, you have a 32-core machine with 200 gigabytes of memory on it. If you have four NUMA nodes, you would probably have something like eight cores per NUMA node, and each one has 50 gigabytes of that memory that it can access fundamentally faster than any other cores in the system can. On a NUMA architecture, being NUMA-aware in your simulations is a really easy way to instantly add efficiency: say we only need eight cores of that 32, it becomes highly advantageous to land all of those cores on the same NUMA node, so they all share a space in memory that they can access faster than any other core.
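The arithmetic in that example (32 cores and 200 GB split across four NUMA nodes) can be made concrete with a tiny helper; this is purely illustrative and assumes an even, contiguous core numbering, which real topologies don’t always have (on Linux, check /sys/devices/system/node or "numactl --hardware" for the real layout):

```python
def numa_split(total_cores: int, total_mem_gb: int, numa_nodes: int):
    """Per-NUMA-node core lists and local memory, assuming an even,
    contiguous split (illustrative only; real topologies vary)."""
    per_node = total_cores // numa_nodes
    mem_per_node = total_mem_gb // numa_nodes
    return [
        (list(range(n * per_node, (n + 1) * per_node)), mem_per_node)
        for n in range(numa_nodes)
    ]


# The example from the webinar: 32 cores, 200 GB, four NUMA nodes,
# giving eight cores and 50 GB of "close" memory per NUMA node.
for cores, mem in numa_split(32, 200, 4):
    print(cores, mem)
```

An eight-core job whose cores all fall in one of those lists keeps every memory access local, which is exactly what the NUMA affinity setting is asking the scheduler to do.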
That’s the NUMA affinity. In this case, with four cores, I don’t think it’s going to be too important; we’re not under-subscribing this compute node, we’re using all the cores available on it, so the affinity doesn’t matter much here. But you can see how, if we were using a subset of a node or something along those lines, that affinity would immediately be a way to add efficiency to your simulation with just a couple of keystrokes.
As I noted, socket and core are the two other possible affinities. Socket confines the chosen cores to as few sockets as possible. If you’ve got a compute node with two 16-core CPUs in it, specified that you wanted 16 cores, and set the socket affinity, it will land all of those cores on one socket. So if you’re looking for socket-based affinity, that’s how you do that. Then core is the loosest option; it just picks cores from anywhere on the node. That’s how we do NUMA awareness.
Another interesting, related option is ‘by-core.’ ‘By-core’ lets us specify whether the amount of memory we tell it to look for is meant per core. In this case, we’re saying ‘false’ because we don’t want to ask for 240 GB in total, which would be 60 times four. With ‘by-core: false,’ we’re only looking for 60 GB for all four cores in total. I’m going to switch this, and we’ll look back on the command line and see what’s still running. As I mentioned, this will take a few minutes, but let’s see what it’s doing now.
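The ‘by-core’ semantics just described amount to a single multiplication; here is a tiny sketch of that rule (the function name and signature are mine for illustration, not Fuzzball’s actual API):

```python
def memory_request_gb(mem_gb: int, cores: int, by_core: bool) -> int:
    """Total memory a resource request implies under the 'by-core'
    toggle, as described in the webinar (illustrative, not a real API)."""
    return mem_gb * cores if by_core else mem_gb


# The example from the demo: 60 GB specified for a four-core job.
print(memory_request_gb(60, 4, by_core=True))   # per-core: asks for 240 GB
print(memory_request_gb(60, 4, by_core=False))  # total: asks for 60 GB
```

Flipping the toggle changes the request by a factor of the core count, which is why the demo sets it to false for a single shared 60 GB allocation.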
Quantum Espresso Output [00:50:03]
It looks like it’s finishing. These are just random warnings at the end that I presume happen because this is literally written in Fortran 90; I’m sure we’ll get some interesting output from anything written in Fortran 90. As I noted earlier, when we look through these results, we can see that it is very obvious we’re running on four nodes, and very obvious we’re using some GPU acceleration. We can see we have 56 atoms of gold that we are simulating in this case, if I read all of this correctly. You’ll see that we get a little bit of information here.
This is the amount of RAM that we’ll need, which is one of the reasons why we’re using this node type; well, this is a little scaled down from the example, but it doesn’t matter. I said this does 20 iterations; we actually did 13. My test runs did 20, but maybe I changed something. But you can see we get all the numbers we expect and all the output. We have a note here saying an output data file is being written, which we’re going to push to S3 in just a second. You’ll see we have the job done: this run was terminated at this time. We get CPU and wall-clock time there, and we also get GPU time printed out there. Then we’re just getting the tarring up of the outputs here before those are uploaded. Once tarring the output is finished, we’ll see that data egress start. Then, just like before, I’ll hop back over to AWS S3 “ls” and show it there. While I’m here, let me bring up my AWS console so I can also show the bucket view as well.
This is just the AWS bucket interface. You can see that I’m on my burt-storage bucket that I’ve been referring to this whole time. You can see we have more-dates.tar.gz, which is the quantitative finance data that we pulled in. You can see we have S32-example.tar.gz, which is what we pulled in for QMCPACK. And you see, we have qe-inputs.tar.gz, which we pulled in for Quantum Espresso to use as an input. Then, we have qmcpack-results-full.tar.gz here, which is the QMCPACK results we just uploaded. It looks like the Quantum Espresso workflow is just finished in the background. If I refresh this, there we go, qe-outputs-ausurf.tar.gz, which is exactly what we expected there. If I go back over to the command line, you’ll see that this workflow is finished. The data egress finished there while we were looking at that bucket view.
I am just going to make a quick little temporary directory to move this into, just to make sure I don’t conflict with anything. That might take a little bit to untar, but we’ll be able to look at those results in just a moment. So that’s about what I had as far as demos.
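To avoid clobbering anything, you can also peek inside an egressed results tarball before extracting it; a small standard-library sketch (the example filename is just the one from this demo):

```python
import tarfile


def list_results(path: str) -> list[str]:
    """Return the member names of a results tarball, e.g. one pulled
    down from S3 with "aws s3 cp", without extracting anything."""
    with tarfile.open(path) as tar:
        return tar.getnames()


# e.g. list_results("qe-outputs-ausurf.tar.gz")
```

Listing members first also reveals surprises like the host working directory being the top-level folder, which is exactly what happened with the QMCPACK tarball earlier.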
Excellent. Thank you very much, Forrest. I know you put in a lot of time getting this together for us.
It’s very cool to get to show off. As you can see here, now we can go into this qe-inputs directory. It doesn’t preserve exactly what the volume looked like when it was attached, which I’m not concerned about, but if we go in here and keep going down these directories, we have our output files there.
Fantastic. Thank you very much, Forrest, for that. Greg, I don’t know if you guys have anything you want to add before we close.
I just wanted to say that was super awesome, Forrest. We can tell you put a lot of work into that, and it really demonstrates the use cases of Fuzzball and, more generally, how an HPC 2.0 cloud-native federated computing platform can work and how it can help users and researchers. I think you did a great job demonstrating that, so thank you so much.
Multiple GPUs Per Node [00:56:24]
Thank you. If I can just share my screen one more time. I can’t show this live, just because there will likely be capacity issues on the public cloud provider’s side with having enough of these instances available; nothing to do with our system. But this right here is something that I ran yesterday. We’ve featured LAMMPS examples before, and we had our most advanced one yet yesterday. I’ll pull the workflow off of this in just a moment as well, but from this run I did yesterday (they don’t always have the capacity for these instances, so it was a bit of a crapshoot getting it running, which is why I only have results here), we can see that we have this four by two by four MPI processor grid and that we are using four GPUs per node. This is also running over MPI, as we can see from the timing and communication breakdown here, and also from the workflow itself.
We’ve done a lot of single-GPU node examples thus far, but I wanted to make sure I point out that we have no problem running across nodes that have multiple GPUs on them as well. The results you’re looking at here were generated from a run of LAMMPS done over two nodes at once, with four GPUs on each node. In total, these results come from a run of LAMMPS over eight V100 GPUs at once, across two nodes on this public cloud provider. I can’t easily demo that live, but I wanted to share those results because it did work; they were able to issue me one of those allocations at least once. So this is multi-GPU.
We’ve seen this several times when we’re trying to provision resources in the cloud. The idea of the cloud is that you have an infinite number of resources available whenever you need them; we’re actually not seeing that. Depending on what availability region we’re in and which cloud we’re on (because we’re testing on all the major clouds at this point), we’re getting mixed results regarding where we can spin up big resources and big jobs. We’ve hit the top of capacity in the cloud a number of times now. This is one of the reasons why it’s so incredibly important to have a multi-availability-zone, multi-cloud solution, such that you can find where those resources exist and bring your job to them. Because if you’re running in just one, you’re probably going to hit scalability issues; the cloud is not actually infinite, especially when we’re doing hardcore compute-focused jobs. Being able to distribute that as widely as you can, find the best prices and resources, match that up, and automatically have that job run in the correct location is kind of a big deal.
Thank you, everyone, for joining and tuning into the webinar. It was fun to put all this together and show it off. I hope it pushed the envelope for what we’ve shown Fuzzball can do.
That’s great. And guys, we always ask that you like and subscribe so you can stay in touch with us, and we can keep putting out things for you to watch and ask us questions on. We appreciate it, and we will see you again next week. Thank you very much. Thanks, guys.