Adding a GPU Node to a K3S Cluster

I recently wanted to add a GPU node to my K3S cluster and found the documentation a little lacking, so I wanted to quickly capture how I did it, so that should I need to do it again I can refer back to it. And if anyone else finds it useful too, then all the better.

Installing the node.

This is my first dive into working with AI to build software. Like everyone else I have been very impressed with ChatGPT and all the other buzz around AI for the last year or so, and have used it quite a lot. What I have not done, though, is try to integrate it into my own apps. When I started to look at doing this I was disappointed with the performance of the GPU in my laptop due to its limited VRAM, I was not keen on watching the costs of using the ChatGPT API shoot up, and I knew, based on what I wanted to do, that I would hit the rate limits quickly. Instead I decided to build a dedicated server in my home lab to experiment with running my own Large Language Models.

I began by scrounging an old desktop from a friend. It is an 8-core i7 at 3.6GHz with 16GB of RAM. I then purchased an Nvidia RTX 4070 Ti Super graphics card with 16GB of VRAM, which seemed to be about the best value for money in terms of performance and available VRAM. Finally, I installed Ubuntu 22.04.4 and began setting it up to join the K3S cluster.

Installing the Nvidia software.

After setting up SSH, the first thing I did was install the Nvidia software. This was the easy part, as you can just follow the official Nvidia documentation. Start by setting up APT:
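Something along the lines of the following, taken from the NVIDIA Container Toolkit install guide, adds Nvidia's APT repository and its signing key (check the current docs in case the URLs or paths have changed since this was written):

    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
      sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list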

And make sure you enable the experimental features.
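The experimental entries in the repository list just created are commented out by default; uncommenting them enables that channel (assuming the file path from the previous step):

    sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list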

Update apt and add the required packages.
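Roughly the following; the driver version shown here is only an example, so pick whichever current driver package supports your card:

    sudo apt update
    # the driver version (550) is an example – any recent driver that supports your GPU will do
    sudo apt install -y nvidia-driver-550
    # the container toolkit provides the nvidia-container-runtime binary used by containerd later
    sudo apt install -y nvidia-container-toolkit
    sudo reboot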

Setting up K3S

Next, install K3S and join it to the cluster:
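The standard K3S agent install does this; the server address and token below are placeholders for your own cluster's values (the token can be read from /var/lib/rancher/k3s/server/node-token on the K3S server):

    # run on the new GPU node
    curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<node-token> sh -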

Next we need to update the containerd runtime to recognise the GPU:
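The usual way to do this (and what I'm assuming here) is to copy the containerd config that K3S generates to a .tmpl file and add an nvidia runtime entry. Note that recent K3S releases will add this entry automatically when they detect nvidia-container-runtime on the node, so check the generated config first:

    sudo cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml \
            /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

    # append to config.toml.tmpl:
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
      runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
      BinaryName = "/usr/bin/nvidia-container-runtime"

    # restart the agent to regenerate config.toml from the template
    sudo systemctl restart k3s-agent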

At this point it is worth verifying that you can run a container directly on the containerd runtime:
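A quick smoke test along these lines, using K3S's bundled ctr and a CUDA base image (the image tag is just an example), should print the nvidia-smi output:

    sudo k3s ctr image pull docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    sudo k3s ctr run --rm -t \
      --runc-binary=/usr/bin/nvidia-container-runtime \
      --env NVIDIA_VISIBLE_DEVICES=all \
      docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 gpu-smoke-test nvidia-smi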

This should prove that the container can see the GPU. Now all that is left is to configure K3S to use it so that we can run our AI containers in Kubernetes. Start by creating a RuntimeClass manifest:
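The manifest is small; the handler name must match the runtime name added to the containerd config above:

    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: nvidia
    handler: nvidia   # matches the "nvidia" runtime in the containerd config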

And deploy it to your cluster:
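Assuming the manifest above was saved as nvidia-runtimeclass.yaml (the filename is arbitrary):

    kubectl apply -f nvidia-runtimeclass.yaml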

Now we need to deploy the nvidia-device-plugin DaemonSet. Because this is the only node in my cluster that has a GPU, I did not want the DaemonSet to be deployed to all nodes, so I first added a label to my new node:
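Something like the following; the gpu=true key and value are just an example label:

    kubectl label node <nodeName> gpu=true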

I then edited the nvidia-device-plugin manifest (available from the NVIDIA k8s-device-plugin project) to have an affinity matching my label, and to use my new runtime class:
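The exact edits aren't reproduced here, but they amount to something like the excerpt below in the DaemonSet's pod template, assuming the example gpu=true label from the previous step:

    # excerpt of the nvidia-device-plugin DaemonSet pod template
    spec:
      template:
        spec:
          runtimeClassName: nvidia        # the RuntimeClass created earlier
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: gpu
                        operator: In
                        values:
                          - "true"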

Once this is deployed and running on the new node, you can verify that your node is ready by running ‘kubectl describe node <nodeName>’ and checking to make sure that ‘nvidia.com/gpu’ is listed under both Capacity and Allocatable. If it is, you are ready to deploy a pod that has access to your GPU.

Deploying your first GPU pod

I used the following simple test manifest to check that pods could access the GPU. Note that the runtime class is referenced again:
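A manifest along these lines works; the pod name and image tag are only examples:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      runtimeClassName: nvidia      # the RuntimeClass created earlier
      restartPolicy: Never
      containers:
        - name: cuda
          image: nvidia/cuda:12.4.1-base-ubuntu22.04   # any CUDA base image will do
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1     # request the GPU advertised by the device plugin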

The pod should be scheduled onto your new GPU node, and upon inspecting the logs you should see the output of ‘nvidia-smi’.
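For example, assuming the manifest above was saved as gpu-test.yaml:

    kubectl apply -f gpu-test.yaml
    kubectl logs gpu-test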
