Fuzzball HPC-2.0

Webinar Synopsis:

Speakers:

  • Zane Hamilton, Director Sales Engineering, CIQ
  • Forrest Burt, High Performance Computing Systems Engineer, CIQ
  • Gregory Kurtzer, Founder of Rocky Linux, Singularity/Apptainer, Warewulf, CentOS, and CEO of CIQ
  • Ian Kaneshiro, Software engineer, CIQ
  • Robert Adolph, Chief Product Officer & Co-Founder, CIQ

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Full Webinar Transcript:

Zane Hamilton:

Good morning, good afternoon, and good evening, wherever you are. We appreciate you joining us for another webcast with CIQ. I am Zane Hamilton, and I just want to say thank you if you’re coming back, if we’ve seen you before. If not, welcome, go ahead and subscribe so we can stay in touch and you can keep up with what we’re doing. Today we’re going to talk about Fuzzball. I know there’s a lot of questions out there about what Fuzzball is, where it came from, and what it means. We have a nice host of people today to bring on. We have Forrest, Ian, Robert, and Greg. Welcome, guys. 

Forrest Burt:

Hello, Zane.

What is Fuzzball? [00:00:36]

Zane Hamilton:

Glad to be back for another week, another webcast. Let’s jump into it. I’m going to ask the question: what is Fuzzball?

Ian Kaneshiro:

This whole thing was your idea, Greg, I think you should kick it off.

Gregory Kurtzer:

Thanks. What is Fuzzball? Fuzzball is the next generation of high performance computing or HPC 2.0. What inspired us to create Fuzzball really had to do with cross-pollination between what we’re seeing in terms of capabilities coming out of enterprise, cloud, and hyperscale, and how do we leverage those capabilities in traditional high performance computing architectures? 

One of the major complexities that we found is that the HPC architecture is leveraging a 28-year-old Beowulf-style architecture. Pretty much every HPC system today is still based on that 28-year-old model. I’m not ripping on the model in a negative way at all, because it’s been incredibly advantageous for us to build large massively scalable systems and support massive amounts of science through these systems. But they’re flat; they’re monolithic. Because of this architecture, it’s been very difficult to incorporate a lot of the additional capabilities and innovations that have been coming out of enterprise, cloud, and hyperscale. Containers were the first, and I call containers ‘Pandora’s box,’ in a manner of speaking, because it was the first time that HPC really looked at what other industries and other sectors of the ecosystem are doing and said: how do we make use of that? There’s value there; we could be leveraging that. How do we use that? That was really just the first step in terms of modernization and looking at HPC through another lens.

So what is Fuzzball? It is a cloud-native, cloud-hybrid federated meta-orchestration platform for compute-focused workflows and data. That’s what it is.

Zane Hamilton:

That’s a lot to take in. Can we break that down a little bit in terms of: why is that important and what does that really mean? 

Gregory Kurtzer:

Why is it important? Well, there’s a number of reasons. First off, traditional HPC has been really looking into: what are hyperscalers doing? What are clouds doing? What do people need to do in terms of computing in the cloud? And what is enterprise doing? And how do we make good use of these capabilities in high performance computing? But it definitely doesn’t end there. Going back now, maybe about three or four years ago, right when I was in the middle of Singularity and building up containers, I started getting a lot of calls from enterprises who all of a sudden started to see that they are going to be HPC-like consumers. They even said, “We need HPC.” How to actually bring HPC into the enterprise is a little tricky.

I heard some really funny quotes about it, one of which was, “We’re trying to get our system administrators off of SSH, and here you’re telling us that all of our users and access into this system requires SSH?” No, it’s completely the opposite direction from where we’re trying to go with our infrastructure; we need a cloud-native style system, an API-based system to do this. We need a system that’s going to fit in with our existing infrastructure and whatnot. The HPC legacy architecture that we’ve been leveraging within the computing industry pervasively, again, has been fantastic. But due to its legacy way of operating, it just doesn’t fit with more modern ways of looking at how to compute. That was really what started us going in this direction. 

The Importance of Fuzzball [00:05:28]

Zane Hamilton:

You said a lot of words in the beginning of what Fuzzball is from a high level. It obviously integrates a lot of different things; it has a lot of different parts and pieces, and it can do the orchestration amongst HPC and cloud, on-prem, that hybrid cloud model. It’s a lot of things to a lot of people. But why is it important?

Forrest Burt:

One of the big things that makes Fuzzball important is that it’s an automation-native solution. We see in HPC that, as technology progresses, we have a need to bring these modern development practices into it, and we also see the practice of HPC itself growing more complex in terms of what workloads we want to be able to run and what pipelines we want to be able to represent. On both sides of that, Fuzzball gives the opportunity to codify and formalize these different pipelines, as complex as they are, in a way that can both be easily modified to account for other uses for that data pipeline, as well as easily managed through modern CI/CD practices. From the user side of Fuzzball, it improves things immensely, in both the management of high performance computing work, as well as the actual doing and performance of high performance computing work.

Gregory Kurtzer:

Many HPC centers have been asked as well, going back 15 or so years: how do we make use of the cloud? You’ll see that many HPC centers have had a very tough time really making good use of cloud resources. Part of that stems right back into that base fundamental architecture being very monolithic and flat. How do you extend that? How do you create some sort of infrastructure that can span between regions, clusters, and infrastructures? How do we build all of that in such a way that we can pull it all together and then make very informed and intelligent decisions on where jobs should run? When we think about where jobs should run, we get to think about some really cool policies; things not only like architecture and resource availability, but cost of computing, including the data into the model as well.

In high performance computing, we need to be thinking about the data as a first-class citizen in terms of scheduling and orchestration. We haven’t done a very good job up until now on how to do that. When we think about data, we have to think about data locality: where is that data? Data mobility: can you move that data, and how easy is it to move? At some point, the data just gets too big to move and you end up with data gravity, where the jobs get sucked into the data. Then you can also have data security policies that pervasively need to be applied across the entire realm of both movement of data and where everything is going. How do we put all of that into a policy and think about that from a meta-orchestration federated perspective? And how do we pull all of that and unite all of those resources together in a way that makes sense?

We haven’t had the ability to make those sorts of decisions before, so that’s super cool. I just want to double down on what Forrest said: the diversity of jobs and workflows has been massively expanding. HPC used to be something very specific with regards to tightly coupled parallel applications, and anything that was not tightly coupled parallel applications, some people wouldn’t even call it HPC. Now, HPC is pretty much anything that needs to run fast, and whether it’s a single process, single thread, or you’re scaling out, or you’re doing some sort of parallel training, or you have a service that’s running, that’s doing inferencing, or some sort of data stream or pipelining of data and ingestion of that, and being able to compute in real time – all of this is now high performance computing. The scope and realm of what is high performance computing has increased massively. Our traditional HPC architectures are not very good at dealing with this now.

Use Cases for Fuzzball [00:10:04]

Zane Hamilton:

It’s interesting because when I used to work in the airline industry, one of the things that they did a lot of every month, they had to come up with a schedule for planes and it had to make sense for where planes were, where the crews were, how much the fuel was, how efficient the plane was. It was something they typically just ran on, it used to just be on a bunch of HP boxes, and then it became a small Linux cluster. Those jobs would run for 30, 40 days until they were optimized enough. It sounds like that would be a really good use case for enterprise IT actually using an HPC type environment. Is that something that Fuzzball would help with?

Gregory Kurtzer:

Absolutely. Because when we started thinking about, how do we modernize high performance computing? The first stab that we took, when we were first thinking about this was: what do we need to put on top of traditional HPC? What capabilities, what APIs? It was very difficult because we’re basically trying to put a saddle on a cow. It’s not exactly the same thing, because it’s more like putting a saddle on a high performance computing race car. Maybe that’s a better analogy in this particular case. It became overly complicated; it became full of duct tape and shoelace to pull everything together. Really what we needed to do is think about this from the ground up.

So we went the other direction. We started with Kubernetes, and we said: what do we need to do to make Kubernetes good at this thing? Maybe that’s a little bit more like putting the saddle on the cow now. It’s the wrong tool for the job, especially when we’re talking about high performance computing and trying to run that through an architecture that was designed for microservices. It was not a good fit. We had to come to the rationalization and the understanding that we have to rebuild this; we have to think about this from the ground up and reinvent the architecture for high performance computing into something that is cloud-native, cloud-hybrid, federated meta-orchestration for performance-critical and performance-intensive workflows, jobs, and data. That’s what we basically set out to solve. It’s a big problem to solve, to be honest. And that’s why we were founded, that’s why we were funded, to create that. 

We’ve been working on it for about two years now. It’s been fairly hush-hush in terms of what it is we’ve been building. We’re just coming out to market now, just starting to do some announcements and letting this out in terms of what we’ve been working on. We have a number of organizations that are currently in proof of concepts right now, early access beta users. We are going to be releasing this around late Q2, maybe early Q3, and bringing this GA. In the meantime, we are working with early use cases and early individuals and organizations who want to be part of this, want to see what it is, and have a strategic interest in terms of bringing ideas back into the platform and really helping us to make sure that we are solving all of these problems. The use case that you just described, Zane, is massive. That’s just a perfect example of how these sorts of architectures can now support organizations who were never doing high performance computing before, but all of a sudden are looking at HPC and thinking of themselves as being HPC consumers.

Zane Hamilton:

Let’s see, we got a question that popped up, but I wanted to pick on Robert before we answer that question, just real quick. I know that you talked to a lot of different companies, a lot of different people out there. What are the dominant use cases that you’re running across right now? The people that are interested in Fuzzball, what are they trying to solve? What are those use cases they’re trying to solve with Fuzzball today?

Robert Adolph:

It’s very varied, like Greg was saying. It’s all the way from traditional high performance computing, which we do extremely well and can do on any hardware and any cloud now and tie that together into one unified resource. It’s also machine learning, AI starting to come into play in a lot of different places. That’s where the flexibility of the solution really comes into play more; that’s what we’re starting to see more and more. Even down to running Jupyter Notebook from the HPC resources, VDI, we’re starting to run into a lot more persistent jobs as well. It’s flexible enough to solve all those varied problems at bare metal speed when it’s deployed on hardware itself, but also has the flexibility to be dynamic in the clouds as well. That’s what we’re seeing over and over again.

What are the Components or Technologies in Fuzzball? [00:15:12]

Zane Hamilton:

Excellent. Tron, your question is: what are the components or technologies in Fuzzball? I think that’s an interesting question and it could be a big one.

Ian Kaneshiro:

This might take a second, but we can start at the base level. We have several layers, which is a typical way of describing software stacks to what we would call a full Fuzzball cluster. The first level is Fuzzball Substrate, which is really just a single node resource manager that is able to spawn containers with a leasing model that allocates resources to containers based off of an API. That allows us to have a set of nodes, manage all the resources on the nodes with a higher level orchestration tool, and run containers on all of those nodes in order to run basically any application stack that is suitable for the hardware resources available for that node. Just to give a little bit of a parallel, for traditional HPC with a batch scheduler, if you’re familiar with container technologies, like Apptainer or others that are available, what you would typically do in those environments is have a batch scheduler that spawns a process running as your user on that node, and then you have that process call out to a CLI tool that spawns your container. Fuzzball Substrate takes the direct job request for running a container and immediately spawns that user container for us. We don’t have to have another tool in the middle. This is part of what Greg was talking about, about reducing the complexity in the stack and building it from the ground up so that we don’t have as many integration points, and we can manage everything basically from a single control plane. 
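To make the leasing idea concrete, here is a minimal Python sketch of a single-node resource manager that grants resource leases before a container would be started. It is purely illustrative: the `Node` and `Lease` classes and the `acquire` and `release` methods are hypothetical names chosen for this example, not Fuzzball Substrate's actual API.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class Lease:
    """A grant of node resources to one container (hypothetical model)."""
    lease_id: int
    cores: int
    memory_gib: int


@dataclass
class Node:
    """Tracks the free resources of a single node and hands out leases."""
    cores: int
    memory_gib: int
    _leases: dict = field(default_factory=dict, init=False, repr=False)
    _ids: count = field(default_factory=count, init=False, repr=False)

    def acquire(self, cores: int, memory_gib: int) -> Optional[Lease]:
        """Return a lease if the request fits in the free resources, else None."""
        if cores > self.cores or memory_gib > self.memory_gib:
            return None
        self.cores -= cores
        self.memory_gib -= memory_gib
        lease = Lease(next(self._ids), cores, memory_gib)
        self._leases[lease.lease_id] = lease
        return lease

    def release(self, lease: Lease) -> None:
        """Return a lease's resources to the node's free pool."""
        del self._leases[lease.lease_id]
        self.cores += lease.cores
        self.memory_gib += lease.memory_gib
```

In this sketch a higher-level orchestrator would call `acquire` for each container request and `release` when the container exits, so the node never promises more than it has.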

On top of Fuzzball Substrate is what we call Fuzzball Orchestrate. That’s actually a set of microservices that run in Kubernetes. We use a Kubernetes cluster to manage our microservice stack because Kubernetes is a very good tool for that job. We could use another orchestration framework for running microservices if we wanted to, but that’s what we chose because it’s best in class at the moment, at least in our opinion.

That will basically allow us to manage the inventory of compute nodes that are available in the cluster. We have a set of control plane nodes that run our Kubernetes cluster with all of our microservices on top of it that manage our compute nodes. Then we have our compute nodes that just run Fuzzball Substrate. We basically can start up our container on Fuzzball Substrate with a certain set of resources, and then get out of the way and let it run as it would normally run for any other type of batch work that you might be familiar with. With that Fuzzball Orchestrate stack, we can do things like manage container images, run what Forrest will show us with our workload definitions where we execute directed acyclic graph of steps in order to complete a computing pipeline, and other things like manage data or manage policies around users being able to use certain resources or resource quotas, things like that. That’s the second level of the stack.

Then on top of that, I think we called it Fuzzball Federate. That layer is essentially taking all of our Fuzzball Orchestrate clusters and connecting them together so we can make high level decisions about how we want to schedule workflows. At the Fuzzball Orchestrate level, we’re doing things like making decisions about how we schedule directly to nodes, whereas at the Fuzzball Federate level, we’re making decisions about how we schedule to data centers. We might have several different clusters in several different locations, maybe some on-prem, maybe some in the cloud, and we want to make decisions about which cluster or data center is the best to run a particular workflow based off of different types of suitability, whether that’s resource amounts, or if it’s actually resource classifications, or overall policy about what resources that user is allowed to access. Greg, you just pulled up a slide, or someone did. Would you like to talk through that slide a little bit? You do this a little bit more than I do, from the slides perspective at least.
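The cluster selection Ian describes here can be pictured as a filter over the available clusters. The Python sketch below checks the three kinds of suitability he names (resource amounts, resource classifications, and access policy). The `Cluster` fields and the `suitable_clusters` function are assumptions made for illustration and are not taken from Fuzzball Federate itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical summary of one cluster as a federation layer might see it."""
    name: str
    free_nodes: int
    resource_classes: frozenset
    allowed_groups: frozenset


def suitable_clusters(clusters, nodes_needed, resource_class, user_group):
    """Return the names of clusters that pass all three suitability checks."""
    return [
        c.name
        for c in clusters
        if c.free_nodes >= nodes_needed            # resource amounts
        and resource_class in c.resource_classes   # resource classification
        and user_group in c.allowed_groups         # access policy
    ]


clusters = [
    Cluster("on-prem", 64, frozenset({"cpu"}), frozenset({"physics", "bio"})),
    Cluster("cloud-a", 256, frozenset({"cpu", "gpu"}), frozenset({"bio"})),
]
print(suitable_clusters(clusters, 16, "gpu", "bio"))      # ['cloud-a']
print(suitable_clusters(clusters, 16, "cpu", "physics"))  # ['on-prem']
```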

Gregory Kurtzer:

You were talking already to this, I just wanted to put up the visual. This is what a federated meta-orchestration platform using Fuzzball would look like, where you basically have workflows coming into this federation layer. This is where you can make policy decisions based on things like resource and architecture availability, cost, as well as data location, data gravity, data mobility, and data security. We can actually say, for example, if a workflow comes in and this workflow is requiring data that exists in, let’s say S3, we can make the decision: do we want to run this workflow up in AWS or in a particular availability region or zone, so we don’t have to incur any sort of egress cost of moving that data?

But now we’re actually incurring some sort of cost to actually run the compute. If we move the data to an on-premises cluster, maybe in that particular case we’d actually get better performance, maybe it would be less expensive. We get this ability now that we can start thinking about it at this level. To go back a little bit in terms of what Ian was describing, in terms of components and what the architecture looks like, you’ll see, if you’re familiar with traditional HPC, this is a traditional HPC compute cluster. You have some sort of base provisioning, a small operating system, and you have Fuzzball Substrate running across all of your compute nodes. This could be tens or thousands of compute nodes.
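Going back to the S3 example at the start of this answer, the trade-off between egress cost and compute cost can be sketched in a few lines of Python. All prices below are made-up numbers in integer cents, and `placement_costs` and `cheapest_site` are hypothetical helper names. A real policy would also weigh performance, data security, and availability.

```python
def placement_costs(compute_cost, data_site, data_gib, egress_per_gib):
    """Total cost per site: compute, plus egress if the data must leave data_site."""
    return {
        site: cost + (0 if site == data_site else data_gib * egress_per_gib)
        for site, cost in compute_cost.items()
    }


def cheapest_site(costs):
    """Name of the site with the lowest total cost."""
    return min(costs, key=costs.get)


# Assumed compute cost of the same job at each site, in cents.
compute = {"cloud": 12000, "on_prem": 4000}

small = placement_costs(compute, data_site="cloud", data_gib=500, egress_per_gib=9)
large = placement_costs(compute, data_site="cloud", data_gib=2000, egress_per_gib=9)

print(cheapest_site(small))  # on_prem: moving 500 GiB is cheaper than cloud compute
print(cheapest_site(large))  # cloud: moving 2000 GiB costs more than it saves
```

The second case is the data gravity effect in miniature: once the dataset is large enough, the job goes to the data.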

On this side, you’re going to end up with some sort of scratch file system: a parallel file system such as Lustre or GPFS, or just a high performance NFS type system to share out data and volumes across this compute cluster. Now, as Ian was saying, Fuzzball Orchestrate is a microservice platform in itself. We’re going to want to run it on top of some sort of microservice platform architecture or service provider like Kubernetes. In this particular case, this is how we would ingest a workflow via an API here in Fuzzball Orchestrate. This workflow will come in; we’ll be able to ingress data as the workflow defines, create the volumes, run the job pipeline, and then egress any data. What that means is, this data lake takes the place of what we’re using today as home directory storage in high performance computing.

We still need that scratch file system that’s running here because we’re running a bunch of volumes and we’re shuffling data across a bunch of nodes; we need something highly performant here. In terms of that persistent storage, we don’t need to use some very expensive enterprise grade NFS home directory storage, which is what we have been using. Now, we can use a data lake. The nice thing about the data lake and object stores is they are much better at doing distributed and redundancy, and spreading out across geographies. This gives us the ability to leverage that data store as a distributed data lake, and now we can run jobs anywhere. We are working with partners as well in terms of organizations and providers of a distributed data lake, to figure out what’s the best way of doing this.

I will talk a little bit about the workflows. I think, Forrest, you have some demonstrations that we’re going to be running. But just to give a quick introduction in terms of what is in a workflow and how it looks. We basically need to support three different aspects in a workflow. A workflow basically consists of defining your data in volumes and describing your job pipeline. In this case, the volume we’re going to set up, a volume named V1, will be an ephemeral volume. We’ll have some ingress, data we will pre-populate this volume with, and Fuzzball will automatically manage the movement of this data and manage the caching of this data. So subsequent jobs that require the same data will land on the same resources and be able to leverage that. At the end of the job pipeline, we’ll be able to egress things that we want to persist and retain after running this job, or the workflow.
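The lifecycle of an ephemeral volume (ingress, run the jobs, egress, then discard) can be mimicked on a single machine with a temporary directory. This Python sketch is only an analogy under that assumption: `run_with_ephemeral_volume` is a hypothetical function, and it copies local files, whereas Fuzzball manages data movement and caching across nodes.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_ephemeral_volume(ingress, job, egress):
    """Run job against a scratch directory that exists only for this call.

    ingress maps a source path to a file name inside the volume.
    egress maps a file name inside the volume to a destination path.
    """
    with tempfile.TemporaryDirectory() as scratch:
        volume = Path(scratch)
        for source, name in ingress.items():   # pre-populate the volume
            shutil.copy(source, volume / name)
        job(volume)                            # the job pipeline works in the volume
        for name, destination in egress.items():  # persist only what was asked for
            shutil.copy(volume / name, destination)
    # Leaving the with-block deletes the volume and anything not egressed.
```

Anything a job writes into the volume but does not list under egress disappears with the volume, which is the behavior described above.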

Now, jobs are a full directed acyclic graph of individual jobs. Each job has a number of attributes associated with it. NCL requires UFS. UFS requires untar, and untar has no requirements. That one will run first. Every job is going to run completely within a container. The first thing we’re going to do is pull the Alpine container out of Docker, spin up that container, and run this command inside of that container. When we spin up this container, we’re also going to mount up V1, this volume, inside that container at /data.
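The dependency chain just described (NCL requires UFS, UFS requires untar) determines the run order through a topological sort. The short Python sketch below uses the standard library's `graphlib`. The lowercase job names are an assumption for readability, and this is not how Fuzzball itself is implemented.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it requires.
requires = {
    "untar": set(),
    "ufs": {"untar"},
    "ncl": {"ufs"},
}

# static_order() yields every job after all of the jobs it requires.
order = list(TopologicalSorter(requires).static_order())
print(order)  # ['untar', 'ufs', 'ncl']
```

If the requirements ever formed a loop, `TopologicalSorter` would raise `graphlib.CycleError`, which is why the graph has to be acyclic.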

Once this is done, which is basically a data prep, we’re going to untar what we downloaded up here, then we’re going to run the UFS weather model. This is an MPI job. We’re going to run it across 128 nodes and do the complete MPI wire-up using Open MPI. Historically, with running containers and MPI on high performance computing systems, you’re typically running in a hybrid environment where you have part of the MPI outside the container, and part of the MPI inside the container. In this particular case, we’re actually running 100% within the container, so this container would basically run across 128 nodes. If you’re running InfiniBand, that will automatically be managed and leveraged. We’re going to run this command within it, we’re going to start that command in this directory, at /run_dir, and that’s because that’s where we’re mounting volume one, from up here at /run_dir.

Then NCL is a quick visualization program that we’re going to run to take the output of this and now turn that into a couple images. Then at the very end of this job, we’re going to take those two images, and here you can see we’re just putting them to a local file system, but we can very easily upload that into something like S3, where it can be visualized globally, or elsewhere. We have a graphical environment that we’re working on as well, that will be released shortly, or when we release Fuzzball. This graphical environment can be used to both create and make workflows. It can also be used to templatize workflows, such that each workflow now becomes a very simple compute-focused application that you can now create, almost like a web form, to go create an application and run workflows where the inputs, outputs, or any other templates that you put into that workflow will be automatically replaced with the contents of what’s in that web form.
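The templating idea, where values from a web form replace placeholders in a workflow, can be sketched with Python's standard `string.Template`. The template text and the field names `nodes` and `output_uri` are placeholders invented for this example. They are not Fuzzball's workflow syntax.

```python
from string import Template

# Hypothetical workflow fragment with two placeholders.
workflow_template = Template(
    "job: ufs\n"
    "nodes: $nodes\n"
    "output: $output_uri\n"
)

# Values as they might arrive from a form submission.
form_values = {"nodes": "128", "output_uri": "s3://example-bucket/run-01/"}

rendered = workflow_template.substitute(form_values)
print(rendered)
```

`substitute` raises `KeyError` when a placeholder has no value, which is a reasonable default for a form that must be filled in completely before a workflow is submitted.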

There’s a lot of options and things you can do with a system like this. Keep in mind, even when we go up to this level, that workflow will come in and we will figure out where it makes the most sense to run it, based on the policies that the organization requires and has set forth. One other quick note, there is no SSH anywhere in this system, so all of this is going over APIs that are governed by IAM policies across the entire platform. Every aspect of this can be secured tightly, and system administrators and organizational policies can define what, where, and how it is allowed to run, and what users or groups of users are allowed to do with that. I’ve been talking for a while, anything else anybody wants to add to that?

Administrative Standpoint [00:27:50]

Zane Hamilton:

From the workflow perspective, that file that you showed, that’s what a user would create at some point; there will be the GUI to help them create that. So that’s the user experience. I know Forrest is going to show us that in a little bit. But from an administrative standpoint, setting up those policies for where things land, how hard is that? From an administrative perspective, what does that look like?

Ian Kaneshiro:

Basically, we have sets of curated policies where, for initial deployments and proof of concepts, we’re working with organizations to understand what their needs are and then building the policies with them in order to properly set user roles and manage resource access and do things like that. The policy system is very expressive, which means you can do a lot with it, but it also means that there’s a lot there to document and to understand. It’s basically a learning process where we want to set up specific roles and policies so that it’s easy for a standard deployment to work how administrators would expect based on what they need.

Then if they want to dig in and do very specific things and allow very specific things to happen, they can go ahead and do that as well. It’s really about making sure the common use case or the common deployments that administrators want to make is straightforward and you get the roles and abilities that you would expect. If they want to go on and tweak things, then they have a very expressive system to do that, they can get it exactly as they’d like.
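As a rough picture of role-based policies like the ones Ian describes, the Python sketch below maps roles to permitted (resource, action) pairs and denies anything not listed. The role names, the pairs, and the `is_allowed` function are all hypothetical. Fuzzball's actual policy language is not shown in this webinar.

```python
# Hypothetical roles, each a set of (resource, action) pairs it permits.
ROLES = {
    "researcher": {("workflow", "start"), ("workflow", "status")},
    "admin": {("workflow", "start"), ("workflow", "status"), ("policy", "edit")},
}


def is_allowed(user_roles, resource, action, roles=ROLES):
    """Allow the request only if one of the user's roles grants it (default deny)."""
    return any((resource, action) in roles.get(role, set()) for role in user_roles)


print(is_allowed(["researcher"], "workflow", "start"))  # True
print(is_allowed(["researcher"], "policy", "edit"))     # False
```

A curated default like `ROLES` covers the common deployment, and an administrator who needs something more specific would add or narrow entries.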

Zane Hamilton:

Excellent. Thank you. Now we are right at half an hour. Forrest, I want to leave you enough time to make sure that you get to show us what you brought. If you want to go ahead and show us your demo.

Forrest Burt:

Sounds good, everyone. I’ll go ahead and share my demo, or my terminal screen here really quickly. Give me just one moment.

Migration Path [00:29:50]

Zane Hamilton:

While he is firing that up, we got a couple of other questions. The first one is: “What does a migration path from existing end users from a traditional HPC environment look like?” That’s probably another question for Ian and Greg.

Ian Kaneshiro:

The migration path would really be about using either a proof of concept environment that you might set up. You would set up a side environment from your production and give users access to that to start to test it, or you would work with us and give users access to our cloud deployments of Fuzzball so that they can get familiar with the platform and get a feel for how it works from a user aspect. For an administrator managing infrastructure, it would be a hard switch in terms of having your nodes be managed by Fuzzball versus whatever your current infrastructure is. From a user aspect, there’s definitely paths to migration where users can get familiar with the platform, understand how the workflows work, and even use the same containers that they’re currently using, if they’re using Apptainer containers on the Fuzzball platform as a part of their workflows, that Forrest will demonstrate, and get familiar with the system.

Zane Hamilton:

Great. Then the other one real quick. Yes, Fuzzball is cloud vendor agnostic, it doesn’t really matter. Whenever you say IAM policy, Greg, it sounds like an AWS term. I know we’re not talking about AWS terms.

Gregory Kurtzer:

Sorry about that. We are using the same acronym for identity and access management, but it is a policy that we have created specifically for Fuzzball. It is not related to AWS IAM at all – just in name.

Robert Adolph:

However, I would add that one of the design principles is that the end user and the administrator have the exact same experience, whether it’s in AWS, GCP, Azure, on-premise, or connected as a unified platform. It does not matter; IAM policies do transfer. No matter where they’re running them, the end user and admin experience will be unified. That’s extremely important to what we’re trying to accomplish.

Zane Hamilton:

Thanks, Robert. Thanks for the questions. All right, Forrest, back to you.

Demo of Workflows [00:32:09]

Forrest Burt:

If everyone can see that alright, I’ll go ahead and start into this. I’m here on my command line. As Greg mentioned, I’m not SSH’d into anything; I’ve just gone ahead and done the Fuzzball login process, which is a web-based flow, essentially. I’m going to go ahead and run a couple of workflows here. The first is going to run a Jupyter Notebook, and I’m going to show you how you can interact with that while it’s running on a compute node. Then the second we’re going to do is more traditional batch processing with a molecular dynamics suite called LAMMPS. I’ll go ahead and start this workflow off here, and here on the CLI, we do Fuzzball workflow start, and then I provide the name of the workflow and my account information. We’ll do Jupyter-ai-training, and we’ll just say that this is the first one of those we’re going to run. We’ll specify the path to the workflow YAML file that we’re going to be using that contains the workflow that Fuzzball is going to run.

You’ll see that this workflow has now started. We can go ahead and do Fuzzball workflow status, and then we’ll watch this to watch it as it runs. 

Just to explain what exactly we’re looking at here, you’ll see this wraps around a little bit, but this last column is basically just any errors that happen with any of these workflow components as they run. We have a few different things represented here: workflow, volume, image, and job. Workflow over here is just giving the actual status of the entire workflow itself. Volume and image are giving the status of setup. The volume is the data volume that’s going to be attached to this workflow; we’ll be able to store data in it, bring data out of it when the workflow is finished, and persist data in between workflow jobs. They can each access that same data that’s been pulled into this workflow or generated from somewhere else in it.

You’ll see that we have also an image here that’s being pulled, this is coming down from a GitLab registry. I’ve provided some secrets inside of this workflow that are essentially my credentials to access this GitLab registry, which is what gives me the ability to be able to pull that container down, because otherwise you wouldn’t have access. You can also see one job here and that’s the job that, here in a moment, I’m going to do some stuff with so that I can get access to the Jupyter Notebook that’s running in it. At the moment, as we run status, and as mentioned, we can watch this, but with the width of the terminal it’ll print a little weird. For clarity, I’m just going to keep it like this.

You can see that at the moment, this job is basically just sitting here pending. What’s going on is that we’re essentially provisioning this node right now so that we can launch this job here on it and be able to have access to it within this notebook that I’m going to show you in a moment. We’re essentially just waiting for this job to start. While we’re waiting, do we want to go through the workflow definition that backs this really quickly?

Ian Kaneshiro:

I’ll just jump in and say that the reason why this is taking some time to start is because on the back end, we’re actually provisioning a VM within a cloud provider in order to run this workflow based off of the resource constraints specified as a part of the workflow definition. When you show the workflow definition, we’ll see the resource constraints. I think we should do that next, but I just wanted to say that if we’re working with a static node pool of on-premise infrastructure, it’ll start running immediately, unless there is a queue for those nodes.

Forrest Burt:

Thank you, Ian. Very good to point out there. Really quickly, I’ll take you through what this workflow definition here looks like. You can see, this is basically just a flat YAML file. As Greg showed with that example earlier, we have a few different sections. We have basically just a version header that tells what version of our DSL this is based on, and then we have a list of jobs that are going to be executed when this workflow runs. In this case, we basically just have one job and that’s named Jupyter. You can see we have a bunch of different things inside of this job that are being done in order to properly set it up. First off, we have the container image that this job is being pulled from, and that is landing on that compute node and that the actual computational job here is being run inside of.

You can see, as I mentioned, this is pulling from GitLab. This is just a custom container that I’ve put together that has everything we need for the notebook that we’re going to run in a second. You can see I’ve also provided some secrets, basically some credentials there. I’ve also stored some of those secrets inside of our Fuzzball cluster for easy templating when that workflow runs. That way, I don’t have to manage those credentials in an unsafe manner myself, and it’s easier to use. You’ll also see that after the image, we have this command here; this command is what’s going to actually run inside this container when it all starts up on the node. In this case, we’re basically just launching a Jupyter Notebook instance on port 88, with a couple of other options there.

We also do a few things to set up the other parts of the environment here; we have some policies that are set. We basically just have a timeout policy, that means that this workflow will stop automatically after 30 minutes. Then we also have some environmental variables here that’ll be injected into this container so that anything running inside the container can access them as well. You’ll see that we are mounting some type of data to this. You’ll notice that in addition to the job section, we have a volume section. This is actually allowing you to bring up a storage place, as I mentioned, for workflows to use. This is an ephemeral type volume that will disappear and basically only be active for the lifetime of this workflow. This is a scratch space to bring things into and use as a space for jobs to operate on the same data within.
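To make the shape of the definition described so far concrete, here is a minimal sketch written as a Python dictionary rather than the real YAML. Only the overall structure comes from the walkthrough (a version header, a jobs section with one job named Jupyter, a 30-minute timeout policy, environment variables, and an ephemeral volume). Every key name, the image URI, the secret name, the variable, and the mount path are illustrative assumptions and not the actual Fuzzball DSL. The helper `check_mounts` is also hypothetical; it simply checks that each job only mounts volumes that the definition declares.

```python
# Hypothetical layout: key names are assumptions, not the real Fuzzball DSL.
workflow = {
    "version": "v1",  # assumed version string
    "volumes": {
        "scratch": {"type": "ephemeral"},  # lives only as long as the workflow
    },
    "jobs": {
        "jupyter": {
            "image": {
                "uri": "registry.example.com/team/notebook:latest",  # placeholder
                "secret": "gitlab-registry-credentials",  # placeholder secret name
            },
            "command": ["jupyter", "notebook"],  # port and other options omitted
            "policy": {"timeout_minutes": 30},  # workflow stops after 30 minutes
            "env": {"EXAMPLE_VAR": "value"},  # placeholder variable
            "mounts": {"scratch": "/data"},  # placeholder mount path
        },
    },
}


def check_mounts(definition):
    """Return the (job, volume) pairs whose volume is not declared."""
    declared = set(definition.get("volumes", {}))
    missing = []
    for job_name, job in definition.get("jobs", {}).items():
        for volume_name in job.get("mounts", {}):
            if volume_name not in declared:
                missing.append((job_name, volume_name))
    return missing
```

Calling `check_mounts(workflow)` on this sketch returns an empty list, because the only mounted volume, `scratch`, is declared in the volumes section.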

You can see we set that volume up there and then we attach it to this job right here at this location, which I’ll show you in just a moment, once we actually hop into this notebook. Then here, we also have the resource counts that this is going to be looking for. This allows you to set on the fly exactly what resources you want available to a workflow. Then the scheduler will do all the logic in order to figure out where the best place for that workflow to run is. So you can see, we have CPU, memory, and devices here. CPU allows you to specify the number of cores that you want, and then how you want those cores to be selected from the compute node. In this case, we’re doing NUMA affinity, so this will minimize the number of NUMA nodes that are being used on a given compute node, if it supports a NUMA architecture.

That’s efficient, trying to land things all in a space where all those cores are going to be able to access the same memory location with NUMA. We have other affinities that you can choose that are a little bit more broad and less granular there. We also specify the amount of memory that we want in GB; that’s going to give us 12 GB worth of RAM that this workflow will be able to use. We also specify that we want an Nvidia GPU through this devices specification. This basically is just going to reach out and look for a node that says it has a GPU attached to it, and then serve that back to us so we can use that in our workflow.
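The idea behind NUMA affinity, as described here, is to satisfy a core request from as few NUMA nodes as possible. The Python sketch below shows one greedy way to do that: take cores from the NUMA nodes with the most free cores first. It is an illustration only and not Fuzzball’s scheduler; the function name `pick_numa_nodes` and the input format (a mapping of NUMA node ID to free core count) are assumptions.

```python
def pick_numa_nodes(free_cores, requested):
    """Choose how many cores to take from each NUMA node.

    free_cores: dict mapping NUMA node ID -> number of free cores.
    requested:  number of cores the job asks for.
    Returns a dict of node ID -> cores taken, or None if the request cannot fit.
    """
    if requested > sum(free_cores.values()):
        return None
    allocation = {}
    remaining = requested
    # Visiting the nodes with the most free cores first uses the fewest nodes.
    by_free = sorted(free_cores.items(), key=lambda item: item[1], reverse=True)
    for node, free in by_free:
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            allocation[node] = take
            remaining -= take
    return allocation
```

For example, with 2 free cores on node 0 and 8 on node 1, a request for 4 cores is placed entirely on node 1, so all of its cores share the same local memory.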

As this is provisioning live, sometimes it can take a little bit of a variable amount of time for the node to come up. I’ll status this again and hopefully we’ll see that this job is started. That means that the node is brought up, but it hasn’t yet, we’re waiting on this. This typically takes about 5 to 10 minutes. We’ll probably be waiting on this for a second longer, but it shouldn’t be too awful much longer.

Gregory Kurtzer:

I can jump in on this question – “where is the ‘thing or setting’ that determines on which cloud environment all of this magic will happen?” while we’re waiting for that.

If you have put a cluster in AWS, for example, we could be leveraging that cluster. If you have a cluster that’s on-prem, you can run a federation or Fuzzball Federate on top of both of those. Then Fuzzball Federate will decide which cluster it goes to. There is no infrastructure within this demo that is like a cloud SaaS type solution, although we have been asked to do this. This is going to come online, where we are going to be hosting a Fuzzball cloud for people if they wish to run on top of that. Then we would basically make determinations in terms of where things would run, again, based on policy. For the moment, it is going to wherever Fuzzball is currently installed.

Forrest Burt:

It looks like we are still waiting on this.

Of course, my tests are usually five minutes or so. We’ll see how long this ends up going for here.

Gregory Kurtzer:

There is an option as well that we’ve used in the past, and we’re not doing it now apparently, where you can actually provision resources to be spinning and be up and running on standby. Most organizations, when we’re talking about the cloud compute side of this, really like to keep all of their costs down to a minimum, and that’s one of the nice things about cloud. A lot of organizations that we’re talking to, especially if they have multiple resources, have an on-prem resource, they’ve got a cloud resource. The on-prem resource is statically defined; it already exists. Typically, you’re going to have all of that running and available and ready. 

On the other side of this, if you have a cloud side, a cloud cluster, you don’t want to keep any compute nodes running just for the sake of keeping them running. That will automatically shrink down to just the bare minimum requirements for Fuzzball to operate, which is really just that Kubernetes cluster. We’d sit on top of EKS, or whatever you happen to be using for your Kubernetes infrastructure up in the cloud, and run there. Then as soon as a job comes in, we would automatically provision a new cloud instance for that job meeting the needs and specification of the workflow coming in and whatnot. Hopefully that answers the question. I see that it has started, I’m going to hand the mic back over to you, Forrest.
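The behavior described above (run on a node that is already up if one fits, provision a new instance otherwise, and shrink idle cloud capacity back down) can be sketched as a single scheduling pass. This is a simplified illustration under stated assumptions, not how Fuzzball is implemented: the `Resources` type, the `plan` function, and the first-fit matching are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cores: int
    memory_gb: int
    gpus: int = 0

    def fits(self, request):
        """True if this capacity covers the request."""
        return (self.cores >= request.cores
                and self.memory_gb >= request.memory_gb
                and self.gpus >= request.gpus)


def plan(pending_requests, idle_nodes):
    """Decide, for one scheduling pass, what to run, provision, and release.

    pending_requests: list of Resources requested by queued jobs.
    idle_nodes: dict of node name -> Resources for nodes with nothing running.
    Returns (assignments, to_provision, to_release).
    """
    available = dict(idle_nodes)
    assignments = []   # (request index, node name)
    to_provision = []  # requests that need a new instance
    for index, request in enumerate(pending_requests):
        match = next(
            (name for name, cap in available.items() if cap.fits(request)),
            None,
        )
        if match is None:
            to_provision.append(request)
        else:
            assignments.append((index, match))
            del available[match]
    # Anything still idle is released so the cloud side shrinks to the minimum.
    to_release = sorted(available)
    return assignments, to_provision, to_release
```

With no idle nodes, every request lands in `to_provision`, which corresponds to the wait seen in the demo while the cloud instance comes up. With a static on-prem pool, requests that fit an idle node are assigned immediately.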

Forrest Burt:

We’re good to go. That node is provisioned, as you can see by the status over here listing as started. That means that this job is actively running on the provisioned node, and that we should be able to access it and do some different things with that. The first thing that I need to do in order to get the link and token that I need in order to access this Jupyter Notebook is just to open up the log of this workflow really quickly. You can see we have some output there, we’re going to be looking at this, namely. Because this is running at a separate data center on a public cloud provider, I actually have to port forward this workflow back to my laptop here in order to be able to get onto this notebook that’s being run here.

I’ll just go ahead and do the Fuzzball workflow port forward, and then we’ll provide the name of the workflow. I’ll provide the name of the job in that workflow that we want to connect to, and then a local and remote port to forward from. We’ll run this and you’ll see we’re listening. I’m going to go ahead and switch over my sharing really quickly, so you can more easily see what I’m doing here.

Zane Hamilton:

It’s important to bring up that it’s not an SSH port forward, either. You get asked that question quite often, whenever Forrest is going through this. It’s actually API TLS.

Ian Kaneshiro:

We’re using gRPC as the foundation of our API, and so we can use a streaming RPC setup to pass the data from your container process all the way across a couple of proxies, when we do these federated setups, to your terminal. That’s how port forward works. When Forrest shows exec’ing into the container as well, that’ll also be done in the same way. We’re not opening up SSH ports to the node itself; we’re doing everything over our API.
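For readers unfamiliar with port forwarding in general, the sketch below shows the basic relay idea: a listener on the local machine accepts a connection and copies bytes in both directions to a remote endpoint. It uses plain TCP with Python’s `asyncio`, not gRPC streaming, so it is a stand-in for the concept rather than a picture of Fuzzball’s API. The function names `pipe`, `start_forwarder`, and `round_trip` are hypothetical, and the "remote" side here is just a local echo server so the example is self-contained.

```python
import asyncio


async def pipe(reader, writer):
    """Copy bytes from reader to writer until the reader reaches EOF."""
    try:
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def start_forwarder(remote_host, remote_port, local_port=0):
    """Listen on localhost and relay each connection to the remote endpoint."""
    async def handle(client_reader, client_writer):
        remote_reader, remote_writer = await asyncio.open_connection(
            remote_host, remote_port
        )
        # Relay both directions at once: client -> remote and remote -> client.
        await asyncio.gather(
            pipe(client_reader, remote_writer),
            pipe(remote_reader, client_writer),
        )
    return await asyncio.start_server(handle, "127.0.0.1", local_port)


async def round_trip(message):
    """Send a message through the forwarder to an echo server, return the reply."""
    async def echo(reader, writer):
        await pipe(reader, writer)

    echo_server = await asyncio.start_server(echo, "127.0.0.1", 0)
    echo_port = echo_server.sockets[0].getsockname()[1]
    forwarder = await start_forwarder("127.0.0.1", echo_port)
    local_port = forwarder.sockets[0].getsockname()[1]

    reader, writer = await asyncio.open_connection("127.0.0.1", local_port)
    writer.write(message)
    await writer.drain()
    reply = await reader.readexactly(len(message))
    writer.close()
    forwarder.close()
    echo_server.close()
    return reply
```

Running `asyncio.run(round_trip(b"hello"))` sends the bytes to the local listener, which relays them to the echo server and relays the reply back. In the setup described in the talk, the relay in the middle is the API’s streaming RPC rather than a raw socket, which is why no SSH port needs to be open on the node.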

Gregory Kurtzer:

I think it’s also cool to note as well, that port forward is what we’re using for Jupyter, but you could use that for pretty much anything. I mean, we have our customers that are using that for remote desktop, accelerated desktop, visualization, and VDI. You can use it for all sorts of stuff. Anything that does network I/O access, you can actually do that port forward. That gRPC API, the streaming API, gives us the ability to not only do ports, but we can also do pseudo TTY. You can actually get a shell in one of the containers where, from a user perspective, you may not even know if that’s running up in AWS, or what availability zone it’s running in within AWS, or if it is running on-prem. All of that is completely transparent, but it all just works, bringing it right back to the user’s workstation, exactly as they would expect. All of this can also hide behind a graphical environment, which we’re working on right now to simplify.

Ian Kaneshiro:

One thing I’ll add is, because it’s through our API, this directly interacts with our IAM system. If an administrator wants, they can completely turn off port forwarding or access to running containers and put a curtain up, essentially. If you have requirements around what machines can view, like the data being operated on by a container, you could completely turn off these types of features and only allow workflow submission and nothing else. 
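Because every operation goes through the API, gating features like this comes down to a permission check on each request. The sketch below shows a deny-by-default check in Python. The action names and the two example policies are hypothetical; the talk does not show the real Fuzzball IAM model.

```python
# Hypothetical action names; the real Fuzzball IAM model is not shown in the talk.
SUBMIT = "workflow.submit"
PORT_FORWARD = "workflow.port_forward"
EXEC = "workflow.exec"


def is_allowed(policy, action):
    """Deny by default: an action is permitted only if the policy lists it."""
    return action in policy


# "Curtain up": users may submit workflows and nothing else.
locked_down = frozenset({SUBMIT})
# Interactive use: port forwarding and shells into containers are also allowed.
interactive = frozenset({SUBMIT, PORT_FORWARD, EXEC})
```

Under the `locked_down` policy, `is_allowed(locked_down, PORT_FORWARD)` is false, so a port-forward request would be rejected at the API before any data from the container is streamed.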

Forrest, you’re probably ready, but real quick, I think we should answer Stefano’s question. In order to get started with Fuzzball, we need a Kubernetes cluster and compute nodes. We could have those compute nodes within the Kubernetes cluster, but that isn’t optimal for performance or context switching, so outside is better. We also need a storage system that allows all the compute nodes to have the same view of data. That’s all you need to get started. If you’re interested in looking at more, I would have you talk to Robert about a potential POC, or anything like that. Go ahead, Forrest.

Forrest Burt:

To make sure that we’re all clear on exactly what we’re doing here, we’ve had this workflow that’s run here; I’ve gone ahead and got the log from it. Then I went ahead and port forwarded that workflow back to my laptop. When I control click this, you see that it comes up here in my web browser, the homepage of this Jupyter Notebook. This is a full interface into this compute node that I can basically use for whatever I want to do in Jupyter. I’ll go ahead in this case and upload a notebook that I have from my local machine here, and then we’ll go ahead and run that. We have a PyTorch demo notebook here.

We’ll go ahead and push upload there, and you’ll see that it has gone ahead and uploaded. This is now sitting here on this compute node. I’ll go ahead and click this to open it. Well, first off, I’ll grab a terminal into this so we can show you something else I’m about to do here in a moment, but I’ll go ahead and open this up and we’ll see that this notebook opens as you would expect. This is a terminal that’s sitting on that compute node that we were just on. I’ll run nvidia-smi, and you can see that we have a GPU available, as we’re going to be doing some GPU-based AI training here. I’ll also swap back over to this really quickly and show you our empty data directory. This is where that volume that we attached to the workflow was landed inside of the container.

This can be written to, and we can persist data between workflow jobs inside this directory. I’ll go over here and we’ll take a look at this notebook that we have. This is basically just a PyTorch tutorial notebook that does some AI training. I’ll go ahead and run through the code in it. We’ll do some quick imports. Fuzzball has a very robust ingress system that allows you to bring in data from different places and object storages. It also supports custom data movements from within jobs themselves, if that’s something you need to do. 

This should be a quarter of the way done by now, so we’ll give this a moment. Maybe we have another question or something we can discuss in the meantime.

Can Warewulf help with Provisioning? [00:51:20]

Zane Hamilton:

This is a statement and a question. It’s talking about provisioning; it sounds like setting up Fuzzball could be tricky. The next question that follows is, can a tool like Warewulf help with that, or are there existing CloudFormation templates to assist with that provisioning?

Gregory Kurtzer:

This is not a talk on Warewulf, but thank you for bringing this up because it’s super important. One of the things about Warewulf that allows, that really facilitates how you would basically use Warewulf to stand up a custom resource is being able to pull in…

Because Warewulf basically takes OCI-based containers and we provision them out to bare metal, it’s very easy for us to basically set up things like node images, things like even Kubernetes images and be able to splat that out to bare-metal hardware, spin that up in such a way that we can facilitate how it basically sets up these resources. There is a significant amount to actually set up a Fuzzball cluster. Aside from the Kubernetes piece, there’s nothing that you would expect that is more than a traditional HPC system. You still have a shared file system; you still have to deal with the network, InfiniBand, provisioning out those resources, and Warewulf will help with that. We will be offering images for Fuzzball that you can just basically download, import those into Warewulf and then blast those out to compute nodes. If you’ve got a thousand plus compute nodes, Warewulf will be able to provision those out very, very efficiently and get you going with Fuzzball, at least on the compute section of that cluster, very easily.

For the resource management of that cluster, you will have to have a Kubernetes set up. As Ian mentioned earlier, it does not have to be in direct proximity. As a matter of fact, we do have one use case at an HPC center right now going on, where they basically stood up the Fuzzball microservices, Fuzzball Orchestrate basically, within a VMware resource. They’re within IT’s VMware setup. They basically had Kubernetes running there. They installed all of the Fuzzball requirements for Orchestrate there, and then they basically brought in Substrate on a system that’s even on a different routed network from that VMware system. They’re basically able to bring up the compute nodes over there. Then the compute nodes are talking to the Orchestrate cluster that’s back over in VMware. It’s a cool way of being able to build this. If you do already have a Kubernetes cluster, we can definitely leverage that. 

Forrest Burt:

It started moving a lot faster than it was there, so we’re good to go. We’ll continue on with this. We’ll show a couple of those training images that we just downloaded here, so we can have a little bit of an idea of what our data set looks like. We’ll go ahead and actually set up our neural net that we are going to be using for this. I’m going to run some of the other code that we need. As I mentioned, this is sitting on a GPU node. You may have noticed in the workflow that I showed there, I was specifying that I wanted an Nvidia GPU. As I indicated over here, I can also use nvidia-smi, and we have one of these Tesla T4s on this node. I’ll go ahead and start this code right here, which is actually going to train this neural net on that GPU. 

This should begin giving us some output here in just a second. You can see this is starting to print the status of training. We have a loss value over here that basically represents how well trained our model is, and we’re going to do two epochs. We’ll see this stop here in about a minute. If I move over here, I can run nvidia-smi again. You’ll notice that because of quirks of how containers work, we don’t see exactly what process is running here, but we do have an indication that there is a process running, because it no longer says there are none running. We also have 1142 MB of GPU memory taken up, which is essentially our model sitting on that GPU. We’ll move back over here. We’ll run this one more time. You can see our utilization over here is at 24%, indicating that this GPU is actually running at the moment. You can see that this is still training on that. 

You can see that from this top level view of everything, if I go into the data directory within this code, that’s essentially where I told this to download to. This is now in that volume that we set up at the start of this, and you can see that data that we spent a little bit ingressing there from within our code. We’ll finish training. We’ll go ahead and save off the trained model that we just worked with. If we go back over here, you’ll see that model right there is available.

I can also do things like download that directly to my laptop from this interface as well. If I want to move things around there, I can do that. If we go over here, we’ll show a few images from the test set. We’ve got a cat, a ship, a ship, and a plane there. We’ll do some code really quickly to get these test set images to the model, and then we’ll go ahead and actually run essentially some live inference over our test data set. You can see our model has predicted that this is a cat, a car, a ship, and a plane, which is not too bad for something pretty basic like this. We’ll also go ahead and run the neural net that we just trained over the entire test data set.

We’ll get some information on how accurate the model is over the entire data set in general. You can see it’s about 55% accurate. We’ll go ahead and run this as well, which will give us the output per category that’s in the CIFAR-10 data set we’re using to train this. You can see that for each one of these different categories, we have a value for exactly how well the model was able to recognize that type of object. This is all running via Fuzzball workflow. This is sitting on a node at a major cloud provider’s data center, and you can see that we’re able to just, with a container that I’ve created, instantly bring it up and then upload material and stuff from my own laptop. I could ingress things in from the internet to it, but as you can see here, I’m able to open this workflow, open the Jupyter Notebook inside of it, and then upload something from my own laptop and be able to run that effectively using a GPU as well. That’s the Jupyter Notebook through Fuzzball.
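The two numbers discussed here, overall accuracy and accuracy per category, are simple to compute from paired lists of true labels and predictions. The Python sketch below is not the notebook’s code (the notebook uses PyTorch); it is a small standard-library illustration, and the function name `accuracy_report` is an assumption. The sample data is the four test images from the demo: the true labels were cat, ship, ship, plane, and the model predicted cat, car, ship, plane.

```python
from collections import Counter


def accuracy_report(labels, predictions):
    """Return (overall accuracy, per-class accuracy) for paired label lists."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal length")
    totals = Counter(labels)
    correct = Counter(
        label for label, guess in zip(labels, predictions) if label == guess
    )
    overall = sum(correct.values()) / len(labels)
    # Accuracy for a class is measured over the images that truly belong to it.
    per_class = {name: correct[name] / count for name, count in totals.items()}
    return overall, per_class


truth = ["cat", "ship", "ship", "plane"]
guesses = ["cat", "car", "ship", "plane"]
overall, per_class = accuracy_report(truth, guesses)
```

On these four images, `overall` is 0.75, and `per_class` is 1.0 for cat, 0.5 for ship, and 1.0 for plane. The roughly 55% figure in the demo is the same calculation over the whole test set.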

Why is it named Fuzzball? [00:58:51]

Zane Hamilton:

Excellent. We’re getting close to time here, guys, but there is one question that I get quite often, Greg, and I’m going to direct this one to you because it is something I get asked pretty much every time I talk about Fuzzball and is: why is it named Fuzzball? Where did the name come from? What inspired it?

Gregory Kurtzer:

Robert named Fuzzball.

Robert Adolph:

It was essentially a play on Pandora’s box to get us to Fuzzball, as containers and Singularity, which Greg founded, are a piece of that. Obviously, Fuzzball is a singularity as well, but in string theory. Then also it ties together everything into one. That’s really what we’re doing here, tying together multiple different resources into one and giving end users and admins an extremely efficient and similar experience across everything, as well as providing applications the freedom to run wherever it is best for them to run. This is what’s really important about Fuzzball. It gives the power back to the researcher, to the administrators of those researchers, and gives them the ability to really do the science that they want to do. That’s really what we want to do, is really enable those folks to do what they do great. At the end of the day, that’s really where Fuzzball came from, and it’s really our ethos of what we’re trying to do as a company.

Gregory Kurtzer:

I’d only add as well, we’ve got the cutest little mascot too. It’s a little fuzzy that you’ll see come out here shortly.

Zane Hamilton:

Tron’s asking, how can he get his hands on Fuzzball? I will reach out to you, Tron, and I will work with you and we’ll figure out how to get you access to Fuzzball. I appreciate you guys, I know we are actually about a minute over. Thanks for the time today. We appreciate you joining us again. Remember, like and subscribe, and we will see you next time. Thank you very much.

Gregory Kurtzer:

We are growing fast. We have a whole slew of open positions right now. If this looks like something that somebody is interested in being part of, please go to our website and check that out.

Zane Hamilton:

Absolutely. Thanks again.

Gregory Kurtzer:

Bye.