
Talos Linux 1.13 features and upgrade process
Added: 2026-05-06

SideroLabs. Original Release: 2026-04-29

Talos Linux 1.13 adds node-level debugging with talosctl debug and image signature verification. This stream walks through those features, the new dashboard resource viewer, and the upgrade process on a home lab.

[00:00:01]All right, there we go. Hello, everyone.

[00:00:03]I am Justin Garrison, the field CTO at Sidero Labs.

[00:00:07]Thanks for joining this live stream or watching the recording of this, whatever you're doing. If you are here live, leave a comment. I should show up here on screen for everyone.

[00:00:17]And I know there's a delay; streaming is always fun. I'm still not sure if I will see the LinkedIn comments. If anyone's watching this over on LinkedIn, let me open the event over there, because sometimes it isn't great.

[00:00:34][laughter] Let's just say I don't understand how LinkedIn routes comments around. So, yeah, if you're here from LinkedIn, please leave a comment. I've found the event.

[00:00:46]There it is. I might be having an echo here in a second.

[00:00:49]Um cuz I always have to go over there to find the comments.

[00:00:54]There's comments. Okay.

[00:00:56]But no one's no one's commenting yet.

[00:00:58]Uh sweet. Okay.

[00:01:05]Thanks for joining. No, that comment said it's coming from YouTube; it gives me the icon there, so I know it's working. All right, Peter, thank you. I see it now. I don't see the comment on LinkedIn, but I see it here, and you have the LinkedIn badge. We're good to go. So, I wanted to take this time to honestly learn some of the new features myself. We're all involved in building out Talos features and testing them and whatnot, but we don't test everything. We can't validate everything on every piece of hardware in every environment. And so, I was like, "Hey, Talos came out yesterday. Happy Talos release day."

[00:01:41]And I just wanted to spend some time looking at the features myself and testing them in my home lab. I have spare computers that I can test this on with various hardware. And so, I wanted to highlight a couple of the features.

[00:01:53]Ones that I know I can test, and some where I'm going to learn here on the stream how they work, how you would validate them, and what they do. So, yeah, I just wanted to do it. And for anyone watching the stream afterwards, we should be able to share this around so that people can see what's in Talos and whatnot. But let's dive right in.

[00:02:15]And oh, I don't like that layout. I'm going to go up here. There we go.

[00:02:19]Uh get me a little out of the way and and you should be able to see our docs pages.

[00:02:24]Um if you have a Talos cluster and you're watching live, let me know what what version you're on.

[00:02:30]I am kind of curious to know: are people keeping up to date? Or are most people still on 1.11 or 1.10 and catching up? I would just love to know.

[00:02:40]So, if you're in the docs, the main place to look whenever a new Talos version's out is here, what's new in Talos.

[00:02:47]We'll list some of the highlights. This isn't everything, but it does give you some things, obviously. And then obviously, you can go to any version here in the drop down and then get to that what's new still, which shows you kind of when things came out, when they landed, what the configurations look like, all that sort of stuff.

[00:03:04]And then always jump to the latest here.

[00:03:07]1.12.4, sweet. You're late: 1.12.7 came out yesterday, too. [laughter] Oh, yeah, and Pi 5 support. That did land in the 1.12 range, but it wasn't in a specific release. It was more of an overlay feature that we added in Image Factory rather than a specific Talos thing. Oh, and sweet: moving from 1.10 to 1.12, you're going to have a bunch of new features. There's a whole bunch of stuff from 1.10 to 1.12, and again, you can just look at that from these changelog notes.

[00:03:42]So, one of the first things I wanted to show off... Actually, I have a test cluster here. I'm going to do most of this through Omni, just because it makes it visually more appealing and easier to troubleshoot some of this stuff and see what's going on. All of this, again, is standard Talos features. This is Omni, our paid service for managing Talos.

[00:04:07]So, if you have a bunch of clusters... This is my personal one. This is my production home lab cluster.

[00:04:14]Some components are here, some components are over here, some are in another network closet. So, I've got them all over the place, but there are four nodes in my main cluster. And then I spun up a brand new Talos cluster with six nodes. Again, there's one right here next to me on my desk. There's a laptop back here; it's a Framework 12. And then there are machines around. So, I just wanted to get that up and running.

[00:04:41]Um so, yeah.

[00:04:43]Oh, and someone else: we got 1.11 with the unofficial Talos. Start with 1.12.4, sweet. So, yeah, let me actually just show that real fast for everyone who doesn't know: Pi 5 is supported. If you go into Image Factory (let me bump that up a little bit), pick single board computer, select your version, and there is a separate Pi 5 series.

[00:05:10]So, if you have a Pi 4, you want this one. If you have a Pi 5 here, we would love to, you know, make that easier.

[00:05:16]I don't know all the details on that overlay, but there are things that only apply to the 5 that we don't want to apply to the 4, so we kept them as separate things. So, if you have a Pi 5, build off this image. I am running it on a few Pi 5s.

[00:05:31]I haven't done extensive testing with it. I was just verifying that things boot, that I could provision a cluster, that sort of stuff. You can't do things like boot from NVMe, because U-Boot doesn't support that. So, there are things there that don't work.

[00:05:47]I don't know if anyone's buying new Pi 5s with the the new price hikes.

[00:05:51]They're like $150, $200 now for something manageable. There's a bunch of other boards in here, and I have tried a few of them. I like some of them a little more than Pi 5s. The board on my desk is an Iota.

[00:06:05]It's a LattePanda Iota, which I really like. I did a review of it on my personal YouTube.

[00:06:11]It's an x86 board, and I really enjoy how that's working. So, one of my nodes here is an Iota. I don't know which one. But let's start by actually kicking off a quick upgrade, because I don't get any of these features until I do the upgrade process. So, I'm going to go all the way up to 1.13.

[00:06:33]And we're going to do that Talos upgrade.

[00:06:35]It's going to go through and just cycle all of them all of the nodes in the cluster.

[00:06:40]And yeah, it should be done here in a couple of minutes. While it's doing that, we will look at some of these what's new posts. Can you all see the comments on the stream? Because of the background here, is it confusing? I could take it off if it just looks weird. I know on the docs we usually have that sidebar. I wish I could change the background to be solid or something.

[00:07:03]I don't think I can.

[00:07:05]So, let me know [snorts] if that's confusing, or if you like seeing the comments or not. I know that, depending on where you view it, you don't always see the comments, so sometimes it's weird seeing me reply to something. If I forget to pop in the comment like this, it's hard to keep up.

[00:07:21]One of the features we shipped is talosctl debug. I love this. This was something that we worked on with some people who were doing similar things. They wanted to figure out how to debug Talos outside of a Kubernetes cluster. Like, if I can't run a privileged container from Kubernetes, if I don't have a Kubernetes API, how do I debug and get a shell on Talos to poke around and see what's going on? Oh, yeah, sorry, I did not clarify: you can use NVMe for ephemeral storage or for extra storage.

[00:07:58]Absolutely, that works. You just can't boot from it. So, all of my Pi 5s, I boot from SD card. Talos is pretty light on writes to those SD cards, so I don't expect them to die. And then I put my ephemeral storage on my NVMe drives,

[00:08:12]just so that all of my disk images, my containers, all of that stuff, go onto NVMe. So, thank you for pointing that out. Yeah, it's just that you can't boot from it.

[00:08:23]So, yeah, talosctl debug is a new subcommand that lets you run a privileged container directly on the node from talosctl.

[00:08:35]And there are two ways of doing this.

[00:08:37]One is you can do talosctl debug and give it an image and a command.

[00:08:42]It just runs it, right? It pulls the image down, runs a container, and gives you an interactive shell, just like kubectl debug. It's similar, I guess, to kubectl exec, but kubectl exec does it in an existing container.

[00:08:54]The other way is an air-gapped mode. If the node itself, if containerd, can't pull down an image, if for some reason it's failing to pull from your upstream registry or something like that, you can actually tar the image and push it directly to the Talos API. So, locally, from your talosctl, you say, "I need you to get this container," and it will get it from the command line itself. It looks like this one probably already started running and is already up to date on 1.13. So, let's see if I can try it.

[00:09:28]Um where does my version show?

[00:09:31]>> [laughter] >> Kubelets.

[00:09:34]Why am I missing my version here? I have 18.

[00:09:39]Oh, no, this is 1.12.6. Let's see the other one. One of these has to be updated already.

[00:09:45]Here you go, this one.

[00:09:47]GVA.

[00:09:49]So, GVA is already updated. Oh, I want that.

[00:09:53]Click this, copy machine image, and now let's see if this works.

[00:09:57]So I have talosctl version. I have 1.13 already there.

[00:10:04]So if I do debug and --help, I will see the command I'm looking for. This is our new subcommand. We get some examples there. The tar is for running it directly. Let's run this command.

[00:10:20]Oops. I need to copy it first.

[00:10:23]That's the node I want.

[00:10:29]I don't know if I can put the node here on the end. Let's try it.

[00:10:39]What? Method not allowed.

[00:10:42]This is something that I thought was only happening on... Let's see. Let's try a different node.

[00:10:50]Can I do it by name?

[00:10:51]I should be able to.

[00:10:58]No. Okay, this is what I was actually concerned about. I do not know for sure if this works on an Omni-managed cluster.

[00:11:08]I was curious about that because Omni does restrict some of the things you can do. Like, you can't read from the file system, because Omni has certificates that are protected. We don't want you to ever leak those.

[00:11:23]Let's see if it's only a control plane problem or if it's a worker as well. Are any of these upgraded?

[00:11:29]Nope.

[00:11:31]Nope. Nope. They're not upgraded yet.

[00:11:33]Okay, my next option here is... Where is my JetKVM? This one is booted into 1.13 in maintenance mode. Which node is this? 130. This is my Nvidia Spark.

[00:11:50]So my Nvidia Spark is running. I wanted to show off the Nvidia stuff later anyway.

[00:11:55]So I'm going to convert this into a single-node standalone Talos cluster, and we will run off that and see how it works.

[00:12:03]So, talosctl gen config, spark, 10.1.1.30, 6443. Install disk is /dev/nvme0n1. You can tell I've done this before.

[00:12:23][laughter] Just a couple times. I think this will automatically give me 1.13.0, or 1.13. I don't think there's anything else I wanted to apply to this machine.

[00:12:40]Architecture is going to be arm. I don't need the GPU operator yet. I just want to debug it. Let's see if this works.

[00:12:51]1.13.0. I'm going to get kubelet 1.36, so that's going to be the default for this version.

[00:13:00]Install image is correct and we're getting 1.13 for the installer. Okay, so I think that's going to work. It's going to be a big old 4 TB ephemeral drive, but we should be fine.

[00:13:13]apply-config, insecure, to node .30, file controlplane.yaml.

[00:13:23]That should give me a full install on this node.
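
A minimal sketch of the install steps just described, wrapped in a function so nothing runs until you call it. The IP address, disk path, and file name are reconstructed from what is said on the stream (and from the default `controlplane.yaml` that `talosctl gen config` writes), so treat them as assumptions and adjust for your own node.

```shell
#!/usr/bin/env bash
# Values below are assumptions reconstructed from the stream.
NODE_IP="10.1.1.30"          # the Spark's address, as spoken later in the stream
CLUSTER_NAME="spark"         # cluster name passed to gen config
INSTALL_DISK="/dev/nvme0n1"  # assumed spelling of "dev nvme 0 and 1"

install_single_node() {
  # Writes controlplane.yaml, worker.yaml and talosconfig to the current directory.
  talosctl gen config "$CLUSTER_NAME" "https://${NODE_IP}:6443" \
    --install-disk "$INSTALL_DISK"

  # A node in maintenance mode has no trusted certificates yet,
  # so the first apply has to use --insecure.
  talosctl apply-config --insecure --nodes "$NODE_IP" --file controlplane.yaml
}
```

Calling `install_single_node` against a node booted from the ISO should start the install to disk; the node still needs to be bootstrapped afterwards.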

[00:13:26]Oh, and one thing to point out: debug doesn't work in maintenance mode. So if I try it against the node... I don't think it'll work if I say just Alpine.

[00:13:48]I think I might need the whole thing.

[00:13:50]Let's see.

[00:13:51][snorts] So if I do it with insecure node access, there's no certificate, no trust store. It's not actually installed.

[00:14:03]What? Oh yeah, it doesn't even know about the insecure flag.

[00:14:07]It's like, "That's not going to work," right? So I do have to do an install first.

[00:14:13]We do have another feature. I'll show it after I do the install. Let's just apply this.

[00:14:20]Let's apply this and get this going first cuz it will take a minute.

[00:14:26]>> [snorts] >> So if I go back over to there it's an installing phase here.

[00:14:36]We see all our components show up. So we know it's it's applying. There you go. There's our We did a disk format.

[00:14:51]And this is also the node that I'm going to test all the new Nvidia stuff on.

[00:14:56]Which is the the most difficult one to do this testing on.

[00:14:59]Um my virtual media is not plugged in.

[00:15:02]So yeah, that's going to be a standard 1.13. It's installed to the disk.

[00:15:05]>> [snorts] >> I am going to have to change kernel arguments to make this work and also upgrade the image. So we're going to have to build a custom image and upgrade it and we'll show that process.

[00:15:16]Um so we're booting here. We should get the error or the the notification to run the bootstrap command.

[00:15:24]I'm going to switch back to my terminal and export my TALOSCONFIG.

[00:15:33]I also generally just set the endpoint, 10.1.1.30. Oh, oops, I need to say endpoint.

[00:15:49]What?

[00:15:50]Oh, why did I have an e flag? I don't know.

[00:15:54]Oh, cuz yeah. Never mind.

[00:15:56]Endpoint: I'm writing this to my config just so that I don't have to specify that IP address. It's a single-node cluster. So now if I do, like, talosctl dashboard... What?

[00:16:07]What?

[00:16:08]Hold up.

[00:16:11]Why is it unknown? Did I not export it properly?

[00:16:22]Oh, look at that. That's not at all correct. What?

[00:16:30]Oh, I just saw the node next to me reboot. It's going through the upgrade process. Like, what happened over there? I didn't touch that. Oh yeah, Omni's still doing that in the background. Okay, wait. So why doesn't this have certificates?

[00:16:54]What?

[00:16:59]Let's do it again.

[00:17:01]Did I >> [snorts] >> Did I break something?

[00:17:06]What am I going to have to do here? I'm going to have to uh Okay, that has certs.

[00:17:13]I must have I must have broke something when I Okay, let's get back to our virtual media. Let's mount this.

[00:17:20]Restart it.

[00:17:21]1.13 metal arm64, mount.

[00:17:25]Virtual keyboard. We got to wipe it.

[00:17:30]Always fun times.

[00:17:32]I didn't create a proper config. Really?

[00:17:34]It didn't because I because I had a wrong error. Is that?

[00:17:38]Uh delete.

[00:17:40]Um Override next boot to that.

[00:17:47]And now reset. Okay.

[00:17:50]>> [sighs and snorts] >> I I knew it was something I messed up.

[00:17:52]It was just confusing.

[00:17:53]>> [laughter] >> Now this is going to come back up. I should be able to run that endpoint command and node command and apply config. We're doing insecure again.

[00:18:13]What happened to my HDMI? That's not cool.

[00:18:16]>> [laughter] >> Oh, it's rebooting again. Yeah, there you go.

[00:18:19]Should have my virtual media.

[00:18:25]Is this Is that right? We'll see We'll see what it comes up with.

[00:18:38]Okay, good. It's maintenance mode.

[00:18:40]So where we want it. We can unmount that.

[00:18:44]And then we'll apply this again.

[00:18:48]Okay.

[00:18:49]It's going to There. Installing again.

[00:18:50]Let's check on the cluster.

[00:18:55]This is all upgraded. Everything over here is upgraded, running, ready to go.

[00:19:01]Let's do Kubernetes while we're at it. I have no idea what's in 1.36 except for user namespaces.

[00:19:09]But sure, it's a test cluster. We're We're going to upgrade that as well.

[00:19:13]And I'm double checking that I still can't do a debug on them. Let me switch my shell here.

[00:19:21]Get a temp directory.

[00:19:23]And I have shortcuts for like getting Omni config and stuff like that.

[00:19:30]So if I get clusters... I probably have to authenticate.

[00:19:34]Nope, it's already authenticated.

[00:19:36]Oh, talosconfig, cluster talos. That gives me a talosconfig here.

[00:19:48]I'm going to export TALOSCONFIG, which is what ET does for me.

[00:19:51]That should set my talosconfig, and now I can run dmesg against the node I copied. Just checking that the talosconfig works. Okay, the talosconfig works there. So now I should be able to debug.

[00:20:13]That's the old one.

[00:20:15]I do have authentication. I'm just checking.

[00:20:18]Okay, yeah, it doesn't work on Omni clusters. I wanted to verify, after everything got upgraded, that it will not work. Where did my Spark go? Okay.

[00:20:32]How did it get a host name? Oh, must have got it from network. I think I defined that on my network.

[00:20:39]Awesome. Thanks for joining. Yeah, everyone's a beginner at some point, and this stuff is different and hard to learn. So I absolutely want to jump on, and I'm learning a lot of these features, too. I should have a... Yeah, talosctl bootstrap. It's telling me right there, "Hey, you need to bootstrap this, dummy."

[00:20:57]Um it didn't really say dummy, but I did.

[00:21:00]>> [laughter] >> Uh do I have my config is set?

[00:21:06]So I should be able to bootstrap.

[00:21:13]Started the task for etcd. So etcd is starting. We're in a running state now.
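
The client setup and bootstrap from this part of the stream can be sketched as follows. The `talosconfig` path assumes you are in the directory where `talosctl gen config` wrote its output, and the IP address is the one spoken on the stream; both are assumptions.

```shell
#!/usr/bin/env bash
NODE_IP="10.1.1.30"   # assumed: the single node's address from the stream

bootstrap_single_node() {
  # Point talosctl at the client config generated alongside controlplane.yaml.
  export TALOSCONFIG="$PWD/talosconfig"

  # Store endpoint and node in the config so they need not be repeated per command.
  talosctl config endpoint "$NODE_IP"
  talosctl config node "$NODE_IP"

  # Bootstrap etcd once, on one control plane node only.
  talosctl bootstrap
}
```

After `bootstrap_single_node` returns, `talosctl dashboard` should show the node moving into a running state as etcd and the control plane start.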

[00:21:16]Everything's going to start pulling in.

[00:21:18]Now I have authentication against this.

[00:21:21]I should be able to run my debug.

[00:21:26]Oh, is it actually labeled spark?

[00:21:31]Because it got that from the Yeah, it's it's not that That isn't the actual endpoint. It's weird that this shows spark in my config.

[00:21:43]This is 30.

[00:21:47]Woohoo! We got a shell.

[00:21:49]The weirdest celebration: "Look, I have SSH."

[00:21:54][laughter] So this is the debug container. I'm in Alpine on this node here, outside of Kubernetes. Again, I can still do this with a Kubernetes debug command as well.

[00:22:08]I won't do it right this second.

[00:22:12]I'll show you this, because it works exactly the same way as a kubectl debug node command would,

[00:22:19]where I have the host mounted and I can get to the host. It's the same thing, but this is actually on the node, right? This top ls is in my container. This is the node itself. So if I cat /etc/os-release, this will be Alpine, because that's in the container. And then if I do the same thing under the host mount, that is Talos. And so you can see I have access to the Talos node from there.

[00:22:49]Now if I look at host mount No, where would it be?

[00:22:54]Uh media?

[00:23:00]>> [snorts] >> I'm curious their system.

[00:23:07]I think it's /run. I'm just curious how far I can go down this rabbit hole, because this is one of the things that we generally don't recommend: getting an interactive shell on these hosts to poke around on them, since there is sensitive information. If this is a control plane node, it has secrets and certificates for the root of trust for the node and everything else. So these are all areas that we usually cordon off, and we put them in separate partitions. We try to only mount them when needed. There's a bunch of stuff that we do, and I was kind of curious if I can still see those things

[00:23:45]in 1.13, because we've been removing more and more of that access.

[00:23:50]Again, on nodes managed with Omni, you don't have this access because you don't have an admin credential to the Talos API. You have an operator role with limited scopes on what you can actually do.

[00:24:02]Um I don't remember where exactly That's machine sockets.

[00:24:11]Also, if I installed talosctl into this container, or I had a container that had talosctl, I could call the Talos API from within the node itself.

[00:24:22]Um so kind of cool.

[00:24:24]I'm not going to poke around too much.

[00:24:26]If I exit out of that, again, if I do something like talosctl kubeconfig, kube... that would be the file name. That should be right. Yeah. EK, export that kube file, get nodes. So I can see my node is ready, and then I can also do a very similar thing.

[00:24:53]I think I have to say node, then that.

[00:25:00]And then I give my Alpine image.

[00:25:04]Let's do the same thing.

[00:25:08]Let's make sure.

[00:25:11]Unknown command debug for kubectl.

[00:25:17]1.14? Whoa.

[00:25:20]That is so old.

[00:25:22][laughter] Hold on, let's get a new kubectl.

[00:25:29]I rebuilt this machine with Nix and I'm amazed at a lot of things.

[00:25:39][laughter] Hey, fun fact: this curl command was the first PR I ever sent to Kubernetes. I mean, it's evolved since then. But when I first ran Kubernetes, there was no ready-made curl command to download kubectl. And so I went through and added it to the docs.

[00:26:01]It was always something that >> [laughter] >> I was like, "Why isn't this here? I need I need a curl command to give me this."

[00:26:06]Um Where is that going? That's going to go here in this local directory, I think.

[00:26:12]Yes.

[00:26:19]Uh let's put it in bin, I think would work.

[00:26:23]hash -r, which kubectl. Yeah, that gets me the right one, and now let's see.

[00:26:35]Oh.

[00:26:39]Yeah. I have a I have a wrapper that I wrote called K.

[00:26:42]Um So it does some slightly different things.

[00:26:46]But I do see that my kubectl client version is now 1.36.0 and my server version is 1.36.0. So now that debug command should work.

[00:26:56]Yes. Okay. I just want to -it into that image. Okay.

[00:27:11]Oh.

[00:27:12][laughter] We block it by default. Talos blocks this by default in the default namespace. So you would have to run it with something like --namespace kube-system.

[00:27:27]Here you go. That's the security posture of Talos by default: you can't run these privileged pods. So now I have, again, a shell in Alpine on the host.

[00:27:40]All the same stuff is there, right? I can still see the root file system of Talos. The difference here is I had to go through the Kubernetes API. So if Kubernetes is not working, talosctl debug gives you access that you did not have before.
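
For comparison, the Kubernetes route shown here can be sketched like this. The node name is a placeholder (take the real one from `kubectl get nodes`), and the choice of `alpine` as the image simply follows the stream.

```shell
#!/usr/bin/env bash
NODE_NAME="${NODE_NAME:-spark}"   # hypothetical node name; check `kubectl get nodes`

debug_node_via_kubernetes() {
  # Talos rejects this pod in the default namespace, so run it in kube-system.
  kubectl debug "node/${NODE_NAME}" -it --image=alpine --namespace kube-system
}

# Inside the resulting shell:
#   cat /etc/os-release        # the Alpine debug container
#   cat /host/etc/os-release   # Talos; kubectl mounts the node's root filesystem at /host
```

This path depends on a working Kubernetes API server, which is the gap the node-level debug subcommand is meant to cover.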

[00:27:56]Um I need to find where I was at. Okay.

[00:27:59]So that's the basic idea of debug. Really cool feature. I'm glad we have the air-gapped option for it, too, because it makes it easier for us to say, "Yeah, if you don't have access to this thing, or don't have network, or it's failing to pull, whatever, you can still do some stuff on it."

[00:28:18]I'm going to come back to this image signature verification. I do want to run it, but I want to test the GPU support and I want to show a new dashboard feature. I know both of those exist and work because I've tried them. Where is it?

[00:28:38]Dashboard. Interactive dashboard.

[00:28:40]So we have a resource viewer in the dashboard. If I go back to my JetKVM, there's this new F4 down here. You're probably used to this one, which shows you resources.

[00:28:54]We've optimized this quite a bit in 1.13, so it refreshes less frequently than it used to. How it draws things also looks a little different. It should use fewer resources in 1.13 than in anything before.

[00:29:12]>> [clears throat] >> Yeah, a lot of people don't know that you can shell into the host. Um it's it's like I uh madarao I can't spell.

[00:29:24]blog. Uh if you search for I think I I called it SSH. It was one of the first things that I was like, "Hey, more people should know about this." It's like a an option to Wow, do I not Oh, we say SSH a lot.

[00:29:38]Um >> [laughter] >> I have to find my my banner for It was before that before that. Here you go. How to SSH into Talos.

[00:29:47]>> [laughter] >> This was this was the the blog post that I wrote about 2024.

[00:29:53]Oh, it was literally 2 years ago when I was like, "Hey, this is something that people should know about, because they don't know how this works." And it's just a kubectl debug command with a node instead of a pod. So, you can debug directly into a node. It gives you that sort of shell access through Kubernetes.

[00:30:16]Yeah, 1.13 came out yesterday.

[00:30:18]Stick around. I'm going to show off the Nvidia Spark stuff right now.

[00:30:23]Just want to show a couple things here.

[00:30:25]Again, that's the normal monitor that people know. There's the static config for setting up networking if you're booting a system. This only works in the metal ISO. It doesn't work if you're booting off of something like VMware.

[00:30:39]In those cases, you want to use the VMDK and then configure it through cloud-init. It's a much better way of managing this. So, sometimes this network tab is hidden.

[00:30:50]This resource viewer is the new one.

[00:30:52]And this one shows you basically all the stuff that you can get from talosctl; you can get it here,

[00:30:58]in case you can't access it for whatever reason. If I do talosctl get disks,

[00:31:04]Shows me my disks here.

[00:31:06]Get links. Shows me all of my links, right?
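
The same resources can be read from a workstation when the API is reachable. A small sketch, assuming the endpoint and node are already set in your talosconfig; the function name is only for illustration.

```shell
#!/usr/bin/env bash
inspect_node_resources() {
  # Table view of the block devices Talos discovered on the node.
  talosctl get disks

  # Table view of network links (physical NICs, bonds, CNI interfaces, and so on).
  talosctl get links

  # Full YAML for each link, comparable to opening an entry in the resource viewer.
  talosctl get links -o yaml
}
```

The dashboard's resource viewer covers the case where none of this is possible, for example when the node has no working network.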

[00:31:10]And I can go into here and I think I can hit forward slash for a filter.

[00:31:16]Oh, you know, is it not?

[00:31:19]I don't think my keyboard is mapping properly into the into the into here.

[00:31:23]Let's see.

[00:31:25]Forward slash. There we go. Now I have a filter.

[00:31:27]Like yeah, see I have a Dvorak keyboard and so it's [laughter] like freaking out.

[00:31:32]It's like, "I don't know what you type."

[00:31:34]But I type links.

[00:31:36]Ah, go up here.

[00:31:38]I can't hit arrow keys.

[00:31:40]>> [laughter] >> Um Maybe not the best example. Wait, can I even do anything here? I can't type anything.

[00:31:47]Um that's fun.

[00:31:50]Why can't I do it at the host anyway?

[00:31:52]It all of the stuff you can browse. I want to I want to get this one.

[00:31:56]>> [laughter] >> Do I Wait, do I have to escape the filter? No. I don't have an escape key either cuz it's a I don't know. I my keyboard's stupid. Um don't do stupid things. Let's see. Escape.

[00:32:07]Ah.

[00:32:08]Oh, see that works. I just couldn't scroll in the in the filter. Okay, so let's go all the way where was links?

[00:32:16]K L I got to say the alphabet in my head.

[00:32:19]Um links link status. There you go. Look at that.

[00:32:23]So, I have my links and I can dig into any of them. Let's see which one actually has... There's bond, CNI, dummy.

[00:32:36]The VXLAN one, that's flannel. This one, maybe?

[00:32:41]So, you can see the information directly from the device, which is super fun and even better if you have an escape key on your keyboard.

[00:32:49]>> [laughter] >> Cuz I can't do that.

[00:32:52]So, it's another way of answering, "How do I get this information from a host if I can't do anything else?" You can see your disks, you can see your network configs.

[00:33:03]You are looking at a bunch of YAML on a console. This is an emergency access thing. You should not be relying on this as "this is how I want to do all my config."

[00:33:13]Uh but yeah, it's it's something you can do.

[00:33:15]Let's go over to the factory and build a custom image for this node. This is arm64.

[00:33:26]And I need Nvidia.

[00:33:30]One thing we noticed as we were building this out for 1.13... Let's see, the production driver is 595 now. Usually Nvidia recommends the open drivers for the Spark, and we found they weren't working very well. So, we still recommend the non-free kmod drivers, and there is one thing we're going to have to do here.

[00:33:58]Nvidia GPU support, not upgrade notes.

[00:34:06]This is an arm box and we have Why did that not work?

[00:34:10]Click.

[00:34:13]What?

[00:34:14]Okay, that was weird.

[00:34:16]On the arm CPUs, these Blackwell chips, there is a security flag. I do not remember what BTI stands for. Linux kernel BTI. It was branch... indirect branch tracking.

[00:34:33]It's a security feature that we enable that doesn't work with the GPU operator. I don't know why. Some of our devs figured it out and I'm like, "That's cool." So, specifically on the Blackwell architecture, you do need to set this argument in your kernel args.

[00:34:48]So, with that, no BTI is set, and we don't have the branch detection thing.

[00:34:57]Yeah, and so once that's set, I should be able to click next. And now I have my images that I can do an upgrade from. So, here's my upgrade, which will have the system extensions and that extra kernel arg baked in.

[00:35:12]So, I'm going to do upgrade here. So, I'm going to copy this image and I'm just going to do a straight upgrade.

[00:35:18]Which node am I Wait.

[00:35:21]Okay, yeah. That's the other one. Okay: talosctl upgrade, --image, with the image directly from the factory.

[00:35:28]And go back over here. We should see it come in once I... There are also my logs from the debug container. You can see it does log to dmesg. So, if someone's getting debug containers on your node, you should see them.

[00:35:47]Oh, is it not going to... I had a weird issue yesterday where it literally errored out on me. It was still building the image or something on the Image Factory, and this didn't work. So, let me see. My docker pull should also pull the same image.

[00:36:13]It's a fairly large image, too, which sucks. It's like 300, 400 megs with the Nvidia drivers.

[00:36:19]So.

[00:36:22]What are you? Go away.

[00:36:26]Uh can I get the ISO?

[00:36:38]Come on.

[00:36:42]Get it in a second.

[00:36:44]Let me cancel this and actually see what my get extensions shows.

[00:36:53]So, I don't have any system extensions right now, and you can tell I'm not using an image from the Image Factory, because normally there'd be a fake system extension that shows my schematic.

[00:37:08]And so, I'm not doing that yet. This is a standard release ISO, the Arm release ISO, that didn't do anything. This build's not working.

[00:37:18]Oh, we had maintenance on this earlier.

[00:37:21]I should have checked our status.

[00:37:26]Are we still doing maintenance on it?

[00:37:28]I would have picked a terrible time.

[00:37:30]Uh Oh, no.

[00:37:34]We're doing scheduled maintenance.

[00:37:37]>> [laughter] >> It might be what's caused... Oh, look. The download started. Okay, cancel that.

[00:37:44]Did my docker pull start?

[00:37:50]It's a bad time. I should probably coordinate this better [laughter] in the future.

[00:37:54]We'll see if this starts pulling that image. If I get a pull, then I will go back and revisit this. Let's go back to my what's new, and I want to look at image signature verification.

[00:38:09]So, this is something where, at the Talos level, we will verify images that we pull for all of the system resources, like the kubelet and API server. All of that stuff. Normally you can put this in the Kubernetes layer: you can say, "Give me Kyverno or some policy so that all of my workloads need to be signed."

[00:38:29]We can do that one level deeper, and we need to, because all of our resources run in containers. So, I need to set up a patch here and verify it. Let's see.

[00:38:40]This is something I have never done before.

[00:38:42]This was a brand new feature.

[00:38:44]Yeah, kubelet is one that we build and run. That's how our kubelet runs on the nodes.

[00:38:50]Um installer image. Oh, it'll actually do it for the installers, too. That's cool.

[00:38:54]Policy is configured.

[00:38:58]And so, this is probably just our... Yeah, so it looks for a signing key from Sidero Labs.

[00:39:04]So, let's just apply that. Let's just straight up apply it.

[00:39:08]Should work. Should be fine.

[00:39:11]I usually do it this way and just do the full thing with a patch. Oh, no.

[00:39:20]I need the patch first.

[00:39:21]You can also apply the patch separately. This merges them together. So, applied without a reboot.

[00:39:44]Go back. Nope.

[00:39:46]Not that one.

[00:39:47]This one.

[00:39:48]Refresh TUF trusted root, right there. So, we see that we actually get the verification applied. Now, I should have a way to verify this as well.

[00:40:06]We can skip certain images.

[00:40:16]Image labels.

[00:40:18]Oh, okay. So, I'll get a Talos verified label there.

[00:40:22]It was just pulled by Talos. Let's try this.

[00:40:32]Do I need a... So, signature verification should be checkable by looking at image labels. So, I look at one of the images.

[00:40:43]YAML? Nope.

[00:40:52]Image Is there an image inspect? What am I looking for?

[00:40:59]List, pull, remove, talos-bundle. Nope.

[00:41:07]Am I missing it? Where's that label?

[00:41:13]Maybe these aren't verified.

[00:41:19]Oh, see, look at that. This shows labels.

[00:41:21]Wait, why does that one show labels?

[00:41:27]Oh, I do have labels. Hold on.

[00:41:30]Which column is this? Digest, size, labels. Labels is second to last.

[00:41:38]Yeah. So, it's probably blank. Let's do this. There we go. Labels is definitely blank because it has not verified any of these. I pulled all of these before I set that up.

[00:41:52]It looks like my docker pull finally worked. So, I should be able to do that upgrade command again. Sorry this is so tiny, but when this goes through for the second time, the new images should have verification on them.

[00:42:07]There we go. Now, it's doing the download. Now, we're going to be doing our upgrade here. I do need to set a patch to load the kernel modules.

[00:42:20]So, I'm going to get that prepped for as soon as this reboots. It's a 500 meg image.

[00:42:25]It's extracting.

[00:42:26]It's all fancy now. We didn't get all those details before.

[00:42:30]Let's go back into the docs, and I don't need debug.

[00:42:39]GPU... Why does that click not work? I don't know why I always have to search and then click.

[00:42:47]That's a weird bug. Okay. So, I can set these kernel modules in a second patch.

[00:42:55]I'm just going to do it in here. Vim... Let me just do it in the same patch.

[00:43:02]So, that's going to load those kernel modules for me, and I'm going to apply this patch once the node's back up.
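For readers following along without the docs open, here is a rough Python sketch that renders a kernel-modules patch and the `talosctl patch machineconfig` call that would apply it. The module names are an assumption based on earlier Talos NVIDIA guides (the stream never reads them out), and the node IP and file name are hypothetical.

```python
# Assumed module list; confirm against the NVIDIA guide for your Talos version.
NVIDIA_MODULES = ["nvidia", "nvidia_uvm", "nvidia_drm", "nvidia_modeset"]


def render_modules_patch(modules):
    """Render a machine config patch that loads the given kernel modules."""
    lines = ["machine:", "  kernel:", "    modules:"]
    lines += [f"      - name: {name}" for name in modules]
    return "\n".join(lines) + "\n"


def patch_command(node, patch_path):
    """Build the talosctl invocation that applies the patch file to one node."""
    return ["talosctl", "patch", "machineconfig",
            "--nodes", node, "--patch", f"@{patch_path}"]


patch_text = render_modules_patch(NVIDIA_MODULES)
print(patch_text)
# Hypothetical node address and file name, for illustration only.
print(" ".join(patch_command("10.0.0.10", "nvidia-modules.yaml")))
```

Keeping the modules in their own patch, or appending them to an existing one as done on the stream, gives the same merged machine config.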

[00:43:15]So, I'll do the exact same thing I had before and now we'll get both things. Is this the Spark already?

[00:43:21]Yeah, wow, it already rebooted.

[00:43:24]No.

[00:43:26]Did it reboot?

[00:43:29]Oh, yeah, it's waiting for NVIDIA to be up. Geez, that was fast. Okay. Waiting for etcd health.

[00:43:37]Wait. I'll let it do its thing. It's only a single node, so it doesn't need to coordinate anything.

[00:43:50]I should also be able to, if I look at... Oh, I don't want to cancel that yet.

[00:43:57]I can also go into the Omni cluster and I could set the image verification cluster-wide.

[00:44:09]So, it would always do that verification.

[00:44:15]It's probably unhealthy because of NVIDIA.

[00:44:21]Yeah, waiting for NVIDIA to be up. So, I need to apply that patch.

[00:44:24]So, we will get my talosconfig out of here and apply with a patch.

[00:44:37]And look at that. Immediately goes healthy and we're ready to go.

[00:44:41]>> [laughter] >> And now... Oh, I should be able to image ls.

[00:44:51]So, okay. I see three of these images were verified this time, because I didn't change my Kubernetes version. That makes sense. These are the images that were used for basically the install and probably the... What else?

[00:45:08]It's so hard to read on this. It would be this image.

[00:45:14]I think.

[00:45:18]Metal installer is the top one. Oh, I should also now have... If I get extensions, I can see both my NVIDIA ones. There's my fake schematic. So, that's what it tracks whenever it's doing upgrades.

[00:45:33]So, we should be good to go.

[00:45:36]What was the... You said dash... Oh, -A is all namespaces. Actually, we should probably have that. If I look at another namespace, I will see other images. And none of them are verified because none of them have refreshed, since I didn't change Kubernetes. I'm already on 1.36, so I can't even do an upgrade.

[00:46:02]I wonder if I could... I don't want to break it right now, >> [laughter] >> but I probably could talosctl remove one of the images, and it would redownload it and then be verified.

[00:46:13]Maybe we'll try that at the end. So, let's go back here, and we already did this patch. Now, we should be able to do the standard NVIDIA GPU operator install.

[00:46:30]Create the namespace, label the namespace so it can run privileged.

[00:46:36]Pretty standard.

[00:46:38]And then install the operator. In this case, we're going to disable a bunch of stuff. Like, don't try to write the toolkit. Don't try to build a driver.

[00:46:46]And our install path is there.
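As a rough illustration of the "disable a bunch of stuff" step, this Python sketch assembles a `helm install` command with the driver and toolkit components switched off. Only those two switches are mentioned on the stream; the release name, repo alias, and namespace are assumptions, and the driver install path value is left out because it is not read aloud.

```python
import shlex

# Components Talos already ships in the image, so the operator should skip them.
DISABLED_COMPONENTS = {"driver.enabled": "false", "toolkit.enabled": "false"}


def helm_install_args(release, chart, namespace, values):
    """Build a `helm install` argument list with one --set flag per value."""
    args = ["helm", "install", release, chart, "--namespace", namespace]
    for key, value in sorted(values.items()):
        args += ["--set", f"{key}={value}"]
    return args


# Release name, chart alias, and namespace below are illustrative assumptions.
args = helm_install_args("gpu-operator", "nvidia/gpu-operator",
                         "gpu-operator", DISABLED_COMPONENTS)
print(shlex.join(args))
```

Building the argument list instead of a single string keeps quoting safe if the command is later passed to `subprocess.run`.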

[00:46:58]What is this?

[00:46:59]These are warnings, not errors, but it is violating policy because not all Helm charts are built with the best security practices.

[00:47:11]But those are just warnings. I could ignore them in the future if I wanted to.

[00:47:19]How do I verify that with the official...?

[00:47:23]Run a CUDA app, of course.

[00:47:26]So, we can write this out. What do they call it? CUDA vector add, .yaml.

[00:47:35]CUDA... Did they hyphen it?

[00:47:43]I always try to make everything as copy-pastable as I can.

[00:47:47]So, now we should be able to run it once this is installed.

[00:47:54]Let's export my kubeconfig.

[00:48:01]Let's see if they're running. So, we have a pending here.

[00:48:05]Node feature discovery. The other ones look like they're running.

[00:48:09]Do I need to do something for that? I don't know.

[00:48:17]That's debug data, which is actually super helpful. If you're working with NVIDIA on troubleshooting cards, they often want you to get information from the node itself. And so... Ah, Kevin, thank you. Yep.

[00:48:41]You are 100% correct. Go all the way down.

[00:48:49]Allow scheduling on control plane.

[00:48:51]Something that, again, I'm used to Omni doing for me.

[00:48:55]Oh, I don't have that.

[00:48:59]Talos config apply with a patch.

[00:49:14]There you go. Now it worked.

[00:49:21]And now we get other things running. So, there's my CUDA validator, DCGM exporter, and operator validator. Okay.

[00:49:35]Do I have viddy installed?

[00:49:42]Oh.

[00:49:46]I do. There we go.

[00:49:48]Okay. So, that is running. Let's see.

[00:49:55]That completed running. So, I should be able to k... CUDA... Why can't I tab-complete that?

[00:50:07]Creating up there.

[00:50:10]What else do they tell you? They just tell you to look at the logs. Okay.

[00:50:16]It completed. Done. Test passed.

[00:50:19]This is something where it is surprising how much work that took.

[00:50:24]>> [laughter] >> For a variety of reasons here. One, because the DGX Spark architecture, with Arm and everything, needed its own stuff. The GPU operator is a very heavy-handed operator that tries to do a bunch of stuff for you, mostly because all the other distros suck.

[00:50:43]I'm just going to call it out. The GPU operator requires DKMS for your kernel, with kernel headers. It's going to try to install a kernel module for the NVIDIA driver on the fly and inject that into your system. So, everything the GPU operator usually does is there because every other operating system requires a lot of maintenance to actually install NVIDIA cards. And because we do all of that at compile time with Talos, we build an image that has it all built in. You don't need to do any of that stuff, and so we have to disable all that after the fact. And then there's also the fact that Talos is built with musl instead of glibc, and NVIDIA drivers are glibc. So, there are all these things that we had to be aware of and make sure we worked well with. And thank you, Noel, for doing a ton of... which, by the way, a JetKVM in dev mode. I don't know which one I had that was in dev mode.

[00:51:44]I don't think I have it turned on anymore.

[00:51:46]If you go into your JetKVM: advanced, developer mode.

[00:51:52]This one has it.

[00:51:54]Oh, yeah. Yeah. Yeah. KVM terminal.

[00:51:57]You get a terminal directly on the KVM, and you can SSH into it and you can install Tailscale. And super cool, you can also install any other binary you want, like talosctl.

[00:52:10]>> [laughter] >> Directly on... I don't know why it's not running.

[00:52:13]There it is. Directly on the JetKVM.

[00:52:16]So, I was able to expose this, share it over Tailscale, and put talosctl on it. So, not only could you mount ISOs to it, but then you can run commands against the Talos API. Very amazing. Love that feature. What was the other thing I was going to do in here? GPU operator is already deployed. GPU debug.

[00:52:38]I thought there was something else with the NVIDIA stack that I wanted to do, but I don't think so now. Now, this is just running on top of the Spark and has the GPU operator. I can schedule any of my normal workloads. So, if I want to install something like Ollama or vLLM... there are a bunch of things happening in Linux kernel space and runtimes for LLMs that are very hard to get right right now. I spent a lot of time going back and forth on how this stuff works. And it's not all there yet. It's not all easy. Things like vLLM, I think, have a custom build for the Spark. It's, yeah, fun times.

[00:53:21]What was the other thing in here? Oh, this also shows up: CDI is now enabled by default, which is the Container Device Interface. So, containerd implemented this extra shim so that you don't need an NVIDIA runtime to run the workload. If we go back and look at this from even 1.12, you'll see I had two other patches to do. Wait, I'm on the wrong page.

[00:53:55]This one. If I go back to 1.12 on the NVIDIA drivers, you'll see I had one more sysctl thing to set up. And then I also had to set up my NVIDIA runtime class.

[00:54:05]These were all things that you used to have to do because CDI wasn't there. So, you had to say in your workload, I need a GPU, I need to use this runtime class.

[00:54:16]All of that sort of stuff. Now, you don't need to. With CDI, and I think there was one other feature, you don't have to. If you say I need a GPU, containerd should figure it out.
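To make the CDI point concrete, here is a small Python sketch of a Pod manifest that only asks for a GPU through resource limits and sets no `runtimeClassName`. The resource name `nvidia.com/gpu` is the standard device plugin name; the image reference is a hypothetical placeholder, so substitute a CUDA sample image built for your architecture.

```python
import json

# Hypothetical image reference, for illustration only.
SAMPLE_IMAGE = "registry.example.com/cuda-vectoradd:latest"


def gpu_pod(name, image, gpus=1):
    """Return a Pod manifest that requests GPUs without a runtimeClassName."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


pod = gpu_pod("cuda-vectoradd", SAMPLE_IMAGE)
# kubectl accepts JSON manifests, so this output can be piped to `kubectl apply -f -`.
print(json.dumps(pod, indent=2))
```

On the 1.12 setup described here, the same manifest would also have needed a runtime class entry in `spec`.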

[00:54:27]And it should say, I have an NVIDIA card on here. So, actually I should be able to do something like get node.

[00:54:36]This one.

[00:54:41]And I get all these oh exporters.

[00:54:48]NVIDIA. I see my driver version, CUDA driver. All this stuff just comes for me automatically because of the GPU operator. And I should have my... Where's my capacity? There's... oh, shoot.

[00:55:00]I went too far.

[00:55:02]My capacity was somewhere in here.

[00:55:05]There it is.

[00:55:06]I have one GPU available.

[00:55:07]>> [laughter] >> So, that's the stuff we're getting from the operator. It's automatically going to label the machines. All that stuff is going to work. If I look at my existing home lab cluster, every one of my nodes, I think, has a GPU on it.

[00:55:25]And I look at my patches.

[00:55:29]I got a bunch of them for various things.

[00:55:32]But some of them are going to be... There's my NVIDIA patch. That stuff applies.

[00:55:39]We were using the NVIDIA device driver, or k8s driver, in 1.12, and now we can use the operator. So, I don't have to do some of the labels. I don't have to do some of the other stuff that I needed to before.

[00:55:52]Oh, that was the other thing I was going to show: DRA.

[00:55:55]Dynamic Resource Allocation.

[00:55:58]This one.

[00:55:59]Let's apply this patch, and then we have an example here for NVIDIA. I haven't run this yet. I want to run it.

[00:56:07]Let's test it out.

[00:56:10]Let's get to my patch. Actually, I need to be up here.

[00:56:17]Kubelet. That's all cluster stuff. Let's enable that.

[00:56:26]Applies that patch. And then I should be able to run these to get my DRA driver up and running. And the DRA driver just got donated to the CNCF. So, it's an official CNCF open source thing now, which is pretty cool.

[00:56:52]First, we need to disable the device plugin components.

[00:56:58]Helm upgrade. Oh, this is for the GPU operator. I see. So, if you have DRA, you also have to disable the device plugin on the GPU operator.

[00:57:07]Makes sense. Historically we were just using that.

[00:57:11]The operator includes that. Now, we can disable that, as DRA is the future.

[00:57:26]Let's run viddy again. See what's going on.

[00:57:35]Could delete that. Where is... This is completed. Those are all running.

[00:57:38]What are you waiting for?

[00:57:45]Is Helm doing a full roll out? I don't know.

[00:57:50]There we go. Whatever you just did, good job.

[00:57:54]Bunch of warnings about, again, the operator not being the most security-conscious thing. Now we get to install the DRA GPU driver. There it goes.

[00:58:07]The DRA stuff is cool for not just video cards. There's a bunch of things... Oh, no. Don't give me CrashLoopBackOff.

[00:58:15]We'll let it cook and see if it goes.

[00:58:18]It also lets you share other devices, like 10 gig NICs or 100 gig NICs. DRA is meant to be a generic sharing utility for any sort of resources that are lower level on the system. Not a lot of things support it yet, but it's moving in that direction, that this is the way to go. All of the stuff in the past was kind of hardcoded. It was like, this only works with NVIDIA; anything else, you're on your own. So, DRA is that way forward. We'll see what that error actually looks like.

[00:58:57]Whoa. What?

[00:59:01]That's my wrapper failing.

[00:59:04]Uh I didn't get a dash. Where's my dash?

[00:59:22]I don't want to debug this completely cuz I still have other things I want to look at.

[00:59:29]Fail with status unmounts. Can't unmount.

[00:59:34]Invalid argument. Yeah, the workload shouldn't be able to mount things.

[00:59:41]Cuz not privileged maybe?

[00:59:44]I'm not going to debug that one right now.

[00:59:47]But I will mark it for future.

[00:59:50]I need to look at that, because I do want to move my Spark to DRA and everything else in the future. So, I'm going to keep looking at that, just not on the stream. We've already been going for... Where's my timer? An hour? Yeah, we're right at an hour.

[01:00:03]So, thanks everyone for coming. This is fun.

[01:00:06]Let me know if there's something that you want me specifically to try in this cluster.

[01:00:15]Signature verification we showed.

[01:00:18]Flannel now has CNI policy.

[01:00:22]So, this runs as a sidecar as part of Flannel. I don't know, if I add it after the fact, whether it gets applied or if I have to do it during install.

[01:00:38]Let's try it in Omni.

[01:00:40]Let's try it here.

[01:00:43]Config patches, do the whole cluster.

[01:00:48]Flannel.

[01:00:50]Get network policy true.

[01:00:54]That should patch everywhere.

[01:01:01]Uh do I have it exported? No, I don't.

[01:01:07]kubeconfig, cluster, Talos.

[01:01:11]There you go.

[01:01:12]Export kubeconfig.

[01:01:18]Now, I need to authenticate.

[01:01:20]What? Oh, I don't think I have kube OIDC installed with Nix. Is it called kubelogin?

[01:01:38]Is it kubectl?

[01:01:42]No, I know this works because I can kubectl other things.

[01:01:48]Let me try this.

[01:01:50]I might also have something weird in my browser.

[01:01:53]Skip open browser.

[01:01:59]Yeah. That's probably it.

[01:02:05]Find the right browser.

[01:02:07]Great. There we go.

[01:02:11]I wrote a shim for my browser opening.

[01:02:14]Yeah, because I'm stupid. Don't do these things.

[01:02:17]Okay, so this is my standard.

[01:02:22]But again, I don't think applying that works if the cluster's already been deployed.

[01:02:31]I am curious, though, if I remove a node (destroy, confirm) and re-add it. What?

[01:02:38]No, because it would affect the DaemonSet.

[01:02:41]So, I would have to redeploy the cluster because the DaemonSet itself is going to fail.

[01:02:50]Let me see if these show any differences on my bootstrap manifests.

[01:02:57]Um owning inventory.

[01:03:00]That all looks the same.

[01:03:06]What was this? This is Flannel here.

[01:03:09]Okay, here we go.

[01:03:10]So, if I apply my bootstrap manifests, I should get the new change.

[01:03:15]So, let's apply, because I did upgrade Kubernetes... Did I upgrade Kubernetes on this one? I think I did. Yeah, because I started with 1.12. So, we're going to go through and apply that change, and I think all of these Flannel pods will now have two containers in them, one for the network policies.

[01:03:33]Did that go?

[01:03:34]Updated?

[01:03:36]Why does it still show 21?

[01:03:40]Um I can add that node back later.

[01:03:45]There you go. See, I have two containers... Why only the one?

[01:03:51]>> [laughter] >> That's weird.

[01:03:58]Get daemonset, namespace kube-system.

[01:04:03]kube-flannel.

[01:04:06]YAML.

[01:04:12]Containers.

[01:04:16]Image is there.

[01:04:27]Yeah, so I have two. I have flannel and I have network policies. So, every one of these, if I roll them, would give me network policies, now that the bootstrap manifest was applied.

[01:04:40]So, that's good to know.

[01:04:42]The Flannel policy should work in this case. You want to do it during bootstrap... Oh, it says right here. I should have just read, you know?

[01:04:49]If it's already running, sync the bootstrap manifests after applying the patch.

[01:04:56]Great. It was already written down.

[01:04:58]Cool. Just do what the docs say. Don't listen to me.

[01:05:01]>> [laughter] >> Inventory-backed server-side apply. I'm going to show that in Omni in a minute.

[01:05:12]Upgrade flow. This is enabling us in the future to do some more stuff where we have different life cycles for doing upgrades and whatnot.

[01:05:21]It'll be very powerful once we start using these APIs. Instead of having a flat install API, we actually have an upgrade API. We have a full life cycle for those APIs.

[01:05:30]I don't really know how to show that off until it's being used. Yeah.

[01:05:36]Max volume size is cool. It now works with a negative value. So, you can say, I want to use the rest of this disk minus 20 gigs, or 25%, if they're different sizes.

[01:05:48]Which is great, because if you have machines that have different size drives and you always want to leave some percentage of the disk for logs or some workload or something like that, you can do that a little easier. Before, you had to manually edit all of them. If I look at my k8s, this one.

[01:06:10]I look at my patches and I look at, let's say, mini. I will see I have a manual declaration for how big my partition should be.

[01:06:25]10 gigs for this local path, 100 gigs for ephemeral, for disk images.

[01:06:32]Which is going to be different than if I look at patches for Z, because Z has a different size drive. So, I have 50 gigs here versus 100 gigs for ephemeral. So, you can do some percentage matching here for machines that have different sizes. Since I put the Spark in this cluster, and it has 4 terabytes on it, I'm like, well, it's way more space than I need for any of this. So, I can separate that out. Link alias config: we have templates now for if you have a bunch of NICs on a server. We had people asking for that.
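As a worked example of the negative and percentage max sizes described a moment ago, here is a small Python sketch of the arithmetic. It assumes a negative value means "the whole disk minus this amount", which matches how the feature is described on the stream; the function name and input format are hypothetical and are not the Talos config syntax.

```python
GIB = 1024 ** 3


def resolve_max_size(disk_bytes, spec):
    """Resolve a max-size spec against a disk of disk_bytes.

    spec is a byte count or a percentage string such as "25%".
    A negative spec means "the whole disk minus this much".
    """
    if isinstance(spec, str) and spec.endswith("%"):
        amount = disk_bytes * abs(float(spec[:-1])) / 100
        negative = spec.startswith("-")
    else:
        amount = abs(spec)
        negative = spec < 0
    size = disk_bytes - amount if negative else amount
    # Clamp so the result never exceeds the disk or drops below zero.
    return int(max(0, min(disk_bytes, size)))


for disk_gib in (512, 4096):
    disk = disk_gib * GIB
    minus_20 = resolve_max_size(disk, -20 * GIB) // GIB
    minus_quarter = resolve_max_size(disk, "-25%") // GIB
    print(f"{disk_gib} GiB disk: -20GiB -> {minus_20} GiB, -25% -> {minus_quarter} GiB")
```

The same two specs give 492 and 384 GiB on the 512 GiB disk but 4076 and 3072 GiB on the 4 TiB one, which is why a single patch can now cover machines with different drives.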

[01:07:06]Resolver config.

[01:07:09]What else What was the other things I wanted to look at here?

[01:07:15]Image decompression should be a lot faster now because we have some better tools to decompress images.

[01:07:24]Oh, and imager. If you're running imager to build your images, you no longer need privileged and you don't need /dev. So, you can run it with a standard docker run, or I think it might work in Podman as well now, rootless.

[01:07:42]And it's also fully reproducible. That was another thing that was really cool.

[01:07:46]If I build an ISO with imager and you build an ISO with imager, we should have the same SHA. It should just match.

[01:07:53]They should be fully reproducible, except for some things that aren't reproducible. There are some exceptions there, but a lot of them will be reproducible.
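One simple way to check the reproducibility claim is to hash two independently built artifacts and compare the digests. The Python sketch below does that with SHA-256; the helper names are illustrative and it says nothing about which outputs fall under the exceptions mentioned above.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_build(path_a, path_b):
    """True when two independently built images are byte-for-byte identical."""
    return sha256_of(path_a) == sha256_of(path_b)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare.py mine.iso yours.iso
    print("match" if same_build(sys.argv[1], sys.argv[2]) else "differ")
```

Reading in chunks keeps memory flat even for multi-hundred-megabyte ISOs.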

[01:08:01]Oh, this is talosctl image list. That was cool.

[01:08:08]Oh.

[01:08:09]Still crashing.

[01:08:11]Image list.

[01:08:17]Whoa, which one is... Let me... I'm looking very closely here to see if I can remove this.

[01:08:24]This is the scheduler.

[01:08:28]Can I image remove?

[01:08:35]Should that work?

[01:08:38]Image list, image... Oh, it's not rm, it's remove.

[01:08:46]So, I do need the full...

[01:08:53]Where are you? Oh, wait. No, probably just need the tag.

[01:08:56]Let's try that.

[01:09:03]>> [snorts] >> Okay, did it break?

[01:09:09]It's still running.

[01:09:11]So, maybe that wouldn't... Where's my scheduler? There it is.

[01:09:27]Delete pod cube.

[01:09:34]I just deleted it.

[01:09:40]Yeah, there are images verified. I was trying to find one that wasn't verified, to see whether, if I deleted it from the host, it would then come back as verified.

[01:09:55]To refresh it.

[01:09:58]Verified via legacy signature bundle, verified true.

[01:10:05]Controller... Like, controller manager's not verified.

[01:10:14]Remove, and if I do pull, what does that do?

[01:10:21]Okay, it deleted and pulled, supposedly. That seemed really fast for a pull. I don't know about that.

[01:10:27]List. And now, where is my... which one did I do? Controller manager?

[01:10:34]And yeah, now it's verified.

[01:10:36]So, on the pull again, it shows up verified. So, you could patch it after that. You can do it during initial install, which will get everything from the get-go.

[01:10:45]Or you can do it after the fact, and you could manually delete images that were not verified, and they should be verified on the next pull.
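The "remove whatever is not verified and let it re-pull" step can be sketched in Python as a filter over an image inventory. Everything about the data shape here is an assumption: the label key `verified`, the dict layout, and the image names are hypothetical stand-ins for whatever the image list output really contains.

```python
# Hypothetical label key; the stream shows a "verified true" label but not
# its exact name.
VERIFIED_LABEL = "verified"


def images_to_refresh(images):
    """Return names of images that carry no truthy verification label."""
    return [
        image["name"]
        for image in images
        if image.get("labels", {}).get(VERIFIED_LABEL) != "true"
    ]


# Illustrative inventory; names and tags are examples only.
inventory = [
    {"name": "registry.k8s.io/kube-scheduler:v1.36.0",
     "labels": {VERIFIED_LABEL: "true"}},
    {"name": "registry.k8s.io/kube-controller-manager:v1.36.0", "labels": {}},
]

for name in images_to_refresh(inventory):
    print(f"remove and re-pull: {name}")
```

Images pulled before the verification policy was applied would land in the returned list, matching the controller manager case shown above.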

[01:10:53]Cool.

[01:10:55]What was the other thing? There was another thing here that I was looking at, and I did that. I did the Flannel.

[01:11:04]Bootstrap manifest. That was the other thing I was going to show off. Okay, so I want to show this bootstrap manifest.

[01:11:09]Yeah, don't open that. Where is this? It's in Omni. This is an Omni feature for syncing Kubernetes manifests. There it is.

[01:11:26]Okay. So, this is part of Omni 1.7. It kind of relates to the Talos stuff as well.

[01:11:35]And it's a nice way to replace the inline manifests and external manifests that are normally part of Talos. And so, if I go into here and I export this. Where's my export command?

[01:11:48]There it is.

[01:11:49]Export.

[01:11:52]That's the Spark cluster.

[01:11:55]Let's export into a file.

[01:12:01]If I can spell right. cluster.yaml.

[01:12:05]And now I should be able to come in here and look at my manifest sync. And we have this new top level... So, we have different sync modes: a one-time and a full.

[01:12:17]And a full is kind of like a GitOps controller.

[01:12:20]This isn't meant to replace Flux or Argo or something like that. It's meant for the lower level in the stack, where Flux and Argo are great for workloads.

[01:12:33]You put all your workloads in there, you sync them, and it has a bunch of great features for that. If I want to look at something that's below that, if I have Datadog or something at the system level, how is this going to work? You can now put this Kubernetes as a top level... Oh, no, sorry. Manifests as a top level, and it will keep these in sync. These have to be rendered manifests. They can't be templated. They can't be Helm charts.

[01:13:01]They can't do anything else there. But yes, exactly, for a Cilium install, for these sorts of things. Normally this is a bootstrap manifest, where you say external, go apply this one time. But then things like upgrades don't work well. And if you want to change a version, you have to do a bunch of work to do that. So, this is a way that you can do that without needing to rely on Flux or something else. Again, for the applications, absolutely do that in Flux.

[01:13:31]I usually call these cluster services, or something that is required for you to know that a cluster is valid. I'm just going to take this example here.

[01:13:43]And paste it right here at the bottom.

[01:13:47]Do I need it at the top? It probably needs to be at the top.

[01:13:51]There you go.

[01:13:53]And so, now, if I apply this back, let's see.

[01:13:59]Get pods.

[01:14:01]Nothing in default. So, cluster template sync, cluster.

[01:14:09]Arg. What happened?

[01:14:13]I have an old version of Omni, I bet.

[01:14:20]Come on.

[01:14:22]Double V. There you go. Now it's 1.7. Failed to sync cluster.

[01:14:27]Manifest.

[01:14:31]It should be under Kubernetes. Thanks, Kevin.

[01:14:33]Um And that's probably exactly what this shows.

[01:14:41]It's still top level though, which is weird.

[01:14:43]I would think that that would still... Let's try it.

[01:15:00]Nope.

[01:15:01]Do I need it indented? It's not indented here.

[01:15:12]But also, this is like most patches.

[01:15:15]If I go look at my uh do I still have this folder?

[01:15:23]So this is my home lab.

[01:15:25]Most of this stuff is file-based, right?

[01:15:27]So I can have a Cilium file and and I can just import that. You don't have to do it inline here.

[01:15:33]I'm just showing. So it does need to be indented.

[01:15:44]Yep, you are 100% right. Now... Tada.

[01:15:51]Super boring example there. But it is a huge difference in, A, how we have server-side apply on Talos itself.

[01:16:02]So, we can have this sort of ownership of how the manifests get there and who owns them. And then B, Omni does that for me.

[01:16:11]We do not have any way in the UI that shows that.

[01:16:15]>> [laughter] >> This is something where, if you are bootstrapping clusters with templates, which you should be if you're doing this a lot, you'll definitely gain benefits from it. We also moved things like the workload service proxy, which was originally a straight Kubernetes apply. Actually, maybe it was an inline manifest. I don't remember exactly. If you check this box, it enables it, and it ends up being part of the manifest sync as well. So, if I export this again, we should see... Oh, maybe it hasn't applied yet.

[01:16:55]Patches.

[01:16:58]Oh, it's still just a feature. Ha. Under the hood, it's a manifest sync. But this is still just a cluster feature. So, it doesn't have to show you the full rendered version of it. But that is one thing that we already moved into this method of working. So, two things there.

[01:17:15]Um I think that's it.

[01:17:18]Right?

[01:17:20]Where do go back here.

[01:17:23]The big things are the debug and GPU stuff. Those were big changes for us. Um a bunch of small stuff uh behind the scenes for just better security, new build tooling, um things like that. Network policies is one. Like a lot of peo- people use Cilium partially just for the net- network policies. They're like actually just want really good network policies.

[01:17:41]Um Flannel is still the default Kubernetes network policies. Cilium adds a bunch of stuff on top of that.

[01:17:47]Um so it's not as as full policy featured as Cilium would be. Uh but it also is portable across CNIs. Cilium will will implement all of the Kubernetes policies and then they add their their own.

[01:18:01]Um I know I saw a bunch of stuff for like storage.

[01:18:05]Um Cube span. That was the other thing I was going to look at here. Um exclude advertised networks. Okay.

[01:18:12]I don't want to dive all the way into this. A because I've never done it before. And B, um I don't have another node that's not in my not Could I set one up? I probably could.

[01:18:24]Um I'm just going to talk about this one for right now. Exclude advertised networks is is part of Cube span which we now have Cube span is its own document. So if I go down here and look at reference uh go to configuration, go to network, I will have a new Cube span document Cube span config.

[01:18:44]Um It has this excluded or exclude advertised networks. This is for uh performance in a net in a LAN. In a layer two these nodes should be able to route to each other. Uh Cube span by default is setting up a WireGuard mesh between all of the nodes. And it doesn't matter where the node exists, it would always route traffic through that Cube span WireGuard connection. And now with this excluded networks anything in these networks don't go through Cube span. And the benefit of this is WireGuard has a has a pretty big overhead for performance. So if you have 10 gig NICs, I probably could do this testing cuz I now have 2.5 gig NICs.

[01:19:27]You will be bottlenecked by the speed of your CPU on your node because the WireGuard encryption is a single-threaded process that's going to like encrypt all my UDP packets and and it will have a bottleneck compared to raw network traffic with no encapsulation.

[01:19:44]Um so, if you have KubeSpan turned on, if you have multiple locations, uh you'll want to set up these excluded advertised networks just so that your local traffic stays local and your remote traffic can transparently happen over that encapsulation. You're going to get less throughput, but you probably between two locations you probably have less throughput anyway. Um plenty of people have, you know, fat pipes between uh two data centers, uh so you don't even need that, especially if you can route directly to them. But, it's something that's uh it actually happened again like in that like middle of the the Talos 1.12 uh framework where it's like, "Oh, this is an easy feature that we can just ship and get it out there."

[01:20:22]And people wanted it. And it allows you to uh get around You can still get spanned clusters without the KubeSpan overhead for local traffic. And again, for like network storage, um if you have like KubeSpan or or Kubernetes node sharing NFS or something that's pulling from different nodes, all of that stuff was always going through WireGuard, and now it doesn't have to. Um so, big feature here uh if you're using KubeSpan for your clusters.
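
The routing decision described above can be sketched in a few lines of Python. This is only a conceptual illustration, not how Talos or KubeSpan implements it: the CIDR `192.168.1.0/24`, the `EXCLUDED_NETWORKS` list, and the `route_for` helper are all hypothetical names chosen for the example. Destinations inside an excluded network go direct, and everything else goes over the WireGuard mesh.

```python
import ipaddress

# Hypothetical list standing in for the "exclude advertised networks" setting.
# The CIDR below is an assumed local LAN, not a value from the stream.
EXCLUDED_NETWORKS = [ipaddress.ip_network("192.168.1.0/24")]


def route_for(destination, excluded=EXCLUDED_NETWORKS):
    """Return "direct" for destinations in an excluded network, else "kubespan"."""
    addr = ipaddress.ip_address(destination)
    for net in excluded:
        # Only compare addresses and networks of the same IP version.
        if addr.version == net.version and addr in net:
            return "direct"
    return "kubespan"


if __name__ == "__main__":
    for dest in ("192.168.1.20", "10.5.0.4"):
        print(dest, "->", route_for(dest))
```

In this sketch a node on the same LAN (`192.168.1.20`) skips the tunnel, while a node at another location (`10.5.0.4`) still goes over the encrypted mesh, which matches the "local traffic stays local" behavior described in the stream.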

[01:20:51]Link aliases I already talked about. Probes, resolver config. Uh yeah, there's a bunch of like great improvements that people have been asking for for a while. Um some of it is specific to like hardware use cases and and deployments. Um the things that I I'm going to figure out why the GPU operator uh the DRA isn't working on the Spark. And then I'm going to go through and upgrade my uh my main cluster because that's where most of my workloads run. And uh all of them All of these nodes have GPUs. The Spark has a GPU. And I think I have one more GPU node um that I'm adding. And honestly like I I use them all. It sucks.

[01:21:29]>> [laughter] >> Um it's it's been difficult uh to kind of manage that stuff in the past.

[01:21:35]Let's pop out here. So, there we go.

[01:21:38]Um so, yeah, thanks everyone for for hanging out.

[01:21:43]This is a long stream. Uh I am going to try to do some of these around these release cadences um partially for myself for education, partially just so we can like if you have a feature you want to look at, if you have specific hardware, we can even like get temporary hardware to like test this stuff out if we have features in the future that do things. Um yeah, it's always a it's always fun just to give it a try and I like to spend a block of time to kind of go through it. And I was going to do this anyway, so I figured why not just show people uh all the things that fail as I go through it. So, yeah, thanks everyone for coming out, and we'll talk to you again soon.
