SUSE Labs Conference - Wed May 20 - Track 2 (Svet)
Added: 2026-05-21

145 views | 67:53:48 | suselabs6148 | Originally published: 2026-05-20

Linux kernel tracing employs multiple mechanisms: kprobes use breakpoint instructions (int3) for dynamic instrumentation with high flexibility but significant overhead; ftrace uses compiler-inserted NOPs patched at runtime for lower overhead but limited to function entries/exits; tracepoints provide static instrumentation with fixed event schemas for structured data; perf events enable sampling-based tracing for high-frequency events; and BPF (Berkeley Packet Filter) provides a safe, programmable layer that can attach to any event source with verification ensuring kernel safety. Each mechanism offers different trade-offs between flexibility, overhead, and programmability, with BPF offering the best balance of flexibility and safety through its verifier system.

[00:02:16]Okay.

[00:02:17]So good morning, everybody. I hope you had a good breakfast. I'll slowly start as people are coming in. I'm Jan Kara from SUSE Labs, where I work on the performance team. In this talk I'll mostly be speaking in my capacity as an upstream maintainer of the fsnotify subsystem, and I'll be speaking about what's new in file system notification. First I will quickly go through a bit of history and the standard fanotify functionality to give some context, and then move on to the new features, so that people know what I'm speaking about. If you are more interested in the history of file system notification, you can check some of my older talks: I was speaking about this two years ago at Plumbers, and even more about the history at the Labs conference in Tábor. So you can find those talks if you are really interested in the development and history, because this talk is more focused on what's really new, what features have been added. Anyway, the general idea of fsnotify, or file system notification, is this: if you want to check for changes in a file, you can take the stupid approach of polling the file with stat, seeing whether the timestamps on the inode have changed and, if yes, rereading the file to learn about the new content. That's really stupid, it burns a lot of CPU and so on. So people have been trying to come up with interfaces which give you a message when the file changes. First, in 2.4 times, quite a few years ago already, they came up with dnotify, which had multiple problems: you had to keep the file open to receive notifications, and you were getting signals when the file changed, which was inconvenient for many reasons. So then in 2.6.13 people came up with inotify, which was in 2005, so about 20 years ago. That is mostly okay for a lot of use cases.
People still use it today quite a bit. It has some problems, mostly around watching large directory hierarchies or really identifying the object which has caused the event, the one which was really modified, because the directory structure can be changing under you as you are getting the event. So it's not always possible to identify which object has generated the event, but for a lot of use cases it's good enough. And then in 2010, so about 16 years ago, people came up with fanotify, which is, well, not really new these days, but another interface for receiving notification events. It was motivated mostly by antivirus vendors, but since then it has grown a lot. So now, how fanotify works. There are two system calls which are important for fanotify. One is fanotify_init, which creates a so-called notification group, which is kind of an object which encapsulates all the information you want to receive. This system call returns a file descriptor which identifies the notification group, and you use it for all the operations you want to do with the group. To be able to use the full fanotify functionality you have to be a system administrator.

[00:06:00]However, we have added some ways around that. You can currently use fanotify even as an ordinary user and you will get some limited functionality, basically equivalent to or slightly better than what you get from inotify. So in principle, functionality very similar to inotify is available even for unprivileged users, and then there is something in between for administrators in a user namespace.

[00:06:30]Now this system call takes two sets of flags. One is about how the file descriptor itself, which describes the notification group, should be created. Those are not very interesting. What is interesting is the second set of flags, which controls how the notification group behaves.

[00:06:55]I will not go through the individual flags because that would be too much, but basically you specify the class of the notification group, which for us will not be very important: whether the group just receives normal events or whether you also want to be able to receive some of the, let's say, more privileged events I will be speaking about later, like permission events or pre-content events for hierarchical storage management. Then there are flags like FAN_REPORT_FID, FAN_REPORT_NAME and so on, which control which information you actually receive with the notification events; I will be speaking about this in a bit. And then you can also control whether the queue of events is unlimited, which is of course available only to administrators because you can in principle run the kernel out of memory, or whether the number of notification marks is unlimited. By default it is limited to some number, so that you cannot spend too much kernel memory that way either.

[00:08:08]Okay. So now that we have the notification group created, there is a second system call, fanotify_mark, which really tells what events you want to receive. fanotify_mark places a notification mark on a file, on a directory, on a mount point or on a superblock. Or we can say on a file system, maybe that's more understandable for users. Now you can ask what's the difference between a mount point and a superblock or file system. Well, remember that you can have things like bind mounts. So you have one file system, but you can have several views into the file system which are mostly independent, pointing to different parts of the directory hierarchy, or they can be somehow overlapping. Especially with containers and user namespaces this happens a lot. The mount point mark will receive only the changes or events that are happening through this particular mount point. So if you have a single file available through multiple mount points, then a mount point mark will receive only changes done through one mount point, while a superblock mark will receive any change happening, regardless of the mount point which is used for it. And for different use cases you actually want different things.
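The two system calls described so far can be sketched from Python through ctypes. This is only an illustrative sketch under a few assumptions: the numeric constants are copied by hand from `<linux/fanotify.h>` and should be checked against your own headers, the helper names (`create_group`, `mark_flags`, `add_mark`) are hypothetical, and mount or file system marks need administrator privileges, as the talk notes.

```python
import ctypes
import os

# Constants copied by hand from <linux/fanotify.h> (assumption: verify locally).
FAN_CLOEXEC = 0x00000001
FAN_CLASS_NOTIF = 0x00000000
FAN_MARK_ADD = 0x00000001
FAN_MARK_MOUNT = 0x00000010
FAN_MARK_FILESYSTEM = 0x00000100
FAN_CLOSE_WRITE = 0x00000008
FAN_OPEN = 0x00000020
AT_FDCWD = -100


def _libc():
    # Looked up lazily so that merely importing this sketch works anywhere.
    return ctypes.CDLL(None, use_errno=True)


def create_group():
    """Create a notification group and return its file descriptor."""
    fd = _libc().fanotify_init(FAN_CLOEXEC | FAN_CLASS_NOTIF, os.O_RDONLY)
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd


def mark_flags(whole_filesystem):
    """Choose between a superblock (file system) mark and a mount mark."""
    scope = FAN_MARK_FILESYSTEM if whole_filesystem else FAN_MARK_MOUNT
    return FAN_MARK_ADD | scope


def add_mark(group_fd, path, whole_filesystem=False):
    """Ask for open and close-after-write events below the given path."""
    libc = _libc()
    # The event mask is a 64-bit argument, so declare the C prototype.
    libc.fanotify_mark.argtypes = [ctypes.c_int, ctypes.c_uint,
                                   ctypes.c_uint64, ctypes.c_int,
                                   ctypes.c_char_p]
    rc = libc.fanotify_mark(group_fd, mark_flags(whole_filesystem),
                            FAN_OPEN | FAN_CLOSE_WRITE,
                            AT_FDCWD, os.fsencode(path))
    if rc < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

With sufficient privileges, `add_mark(create_group(), "/mnt/data")` would watch a single mount, while passing `whole_filesystem=True` would watch the whole file system regardless of which mount an access goes through (the path is just a placeholder).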

[00:09:32]Now, originally fanotify supported these six events you have written there: access and modify, which are generated on read and write.

[00:09:44]Yeah. Open and open exec, which happen on open or open for execution. Then there are two close events: one generated when a file which was not open for writing is closed, and one generated when a file which was open for writing is closed.

[00:09:59]Later we added these directory events, which allow you to learn about changes to the directory hierarchy with fanotify. So you can now ask for events such as when a file is created in a directory, removed from a directory, deleted. These are basically equivalent to the similar inotify events, so you'll learn about creations, deletions, renames and so on, or attribute changes, which is the FAN_ATTRIB event. Now, for compatibility with inotify we had the moved-from and moved-to events, which are generated when a file is moved from a directory somewhere else or moved into a directory. If you really want to learn about rename events, which some applications do, then you have to somehow match these moved-from and moved-to events, and that's a bit tedious and, with fanotify, not even really possible reliably. So you have this FAN_RENAME event, which tells you both where the file is moved from and where it is moved to. That gives you all the information about the rename, which is kind of convenient. Now, when you are placing a mark on a directory, you can set the FAN_EVENT_ON_CHILD flag, which means that you are not interested in the events on the directory itself but in those happening to its children. So for example you can place a notification mark for open with FAN_EVENT_ON_CHILD and you will be getting events whenever something in the directory is opened.

[00:11:51]Now, it works only for direct children of the directory. So it doesn't tell you anything if something happens deeper in the directory hierarchy below it.

[00:12:02]Okay. fanotify also has the concept of ignore marks, which is the flag below.

[00:12:14]An ignore mark is basically a kind of inverse mark. You are telling which events you don't want to receive.

[00:12:22]Yeah. So maybe that sounds stupid at first sight, but actually it's pretty useful. For example, imagine you are watching a whole directory but you are not interested in events from a particular file which is frequently accessed. What you do is place a notification mark on the directory, saying you want to receive events from all the children, and place an ignore mark on the child you are not interested in, so that you don't have to deal with the noise in the notification stream.

[00:12:57]By default this ignore mark actually gets cleared when the object it is placed on sends a modification event. See my previous talks for the historical reasons why it was done like this. But the interesting thing is that you can set the FAN_MARK_IGNORED_SURV_MODIFY flag. Then ignore marks survive the modify events and are not removed. Okay, so that's these days mostly standard fanotify functionality, which has been there for quite a few years already, at least six I believe. Now, when you have placed the notification mark, you are ready to receive events. You receive events by reading from the notification group file descriptor. So you can read from the descriptor you got from fanotify_init and you will be receiving the events that you have asked for. Each event you receive will have this structure, plus there can be some additional information after it, but I will speak about that in a moment. The object which is generating the event is normally identified by this fd field, which is an open file descriptor pointing to the object which is generating the event.

[00:14:43]Now, this works for the traditional fanotify events, for permission events and other things, but specifically it doesn't work for directory events, mostly for kernel-internal reasons which are very difficult to overcome. So for directory events this fd is not used, and there is a different way to identify where the event happened, which we will talk about in a while. Question about that?

[00:15:16]>> Yeah, maybe go to the mic so that they hear it in the stream.

[00:15:21]>> So, file descriptor or process. What is the meaning of that field if one process does the lookup of the events but the event is generated by another process?

[00:15:34]>> Oh okay, that's a good question. This open file descriptor is actually generated at the moment you read the event from the file descriptor.

[00:15:45]Yeah. So the kernel keeps some internal references to the object which is generating the event, a reference to the inode actually, and to the mount point. And when the process reads this event from the file descriptor, at that point we open the file descriptor and put the reference to it into this reported field. So the object is opened in the context of the process that is receiving the events.

[00:16:12]That's one of the reasons why administrative privileges are required, because otherwise you could open arbitrary files on the file system.

[00:16:19]Yeah. So that's one of the reasons why this general functionality is restricted to administrators.

[00:16:31]But yeah, we'll see in a while how this, let's say, permission problem has been overcome to allow fanotify for unprivileged users, because unprivileged users cannot receive file descriptors like this. Okay. Now, in the initial part of this fanotify event metadata there are four fields that identify what the event looks like. There is the version field and the metadata_len field, which are actually always the same and which are not really useful. Originally people thought they would be useful so that you could modify the format of the events in the future: the application can check the version and, depending on the version, use the appropriate layout of the structure. But the reality is that no application in user space ever actually checks the version, so you cannot really rely on this. This is simply not an API design that is really useful, so don't version API structures, because applications in user space are not going to care. And metadata_len is the length of this structure. Again, in theory you could expand it, but you cannot really change the layout because the application will just get confused and crash. What actually works by design, and what has proven very useful for the ability to extend the fanotify event structure, is the event_len field.

[00:18:09]So this basically says how long the whole event is.

[00:18:13]Initially it's the length of the structure, which is 24 bytes, but when we are extending the API we basically increase the event_len field. So this is how extending the API works: when you want to receive some additional information with the event, you set the appropriate flags when you are creating the notification group. By this you are acknowledging: yes, I understand the new format, this is what I'm going to receive. And then from event_len you learn exactly how much additional information you are getting with this event, and you can parse it and get the information from the event. So this is an extensible way. Each application has to explicitly opt in to receive some newer, modified version of the API, and even then event_len is telling it exactly what it is receiving.
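To make the role of event_len concrete, here is a minimal sketch of how a reader might step through a buffer returned by read() on the group descriptor. The field layout follows `struct fanotify_event_metadata` (24 bytes); the function name `iter_events` is hypothetical, and the buffer in the usage note is synthetic, not real kernel output.

```python
import struct

# struct fanotify_event_metadata: event_len (u32), vers (u8), reserved (u8),
# metadata_len (u16), mask (u64), fd (s32), pid (s32) -> 24 bytes in total.
METADATA = struct.Struct("=IBBHQii")


def iter_events(buf):
    """Yield (mask, fd, pid, extra) for every event in a read() buffer.

    `extra` holds whatever supplemental information follows the fixed
    structure; event_len says where the next event starts.
    """
    offset = 0
    while offset + METADATA.size <= len(buf):
        (event_len, _vers, _reserved, metadata_len,
         mask, fd, pid) = METADATA.unpack_from(buf, offset)
        if event_len < METADATA.size or offset + event_len > len(buf):
            break  # truncated or malformed event, stop parsing
        extra = buf[offset + metadata_len:offset + event_len]
        yield mask, fd, pid, extra
        offset += event_len
```

An application that never opted in to extra information simply sees `extra` empty for every event, while one that did can hand `extra` to a record parser; in both cases the loop advances by event_len rather than by a hard-coded structure size.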

[00:19:20]Okay, so now I have been speaking for a while about the additional information. What exactly does it look like? For directory events there are four pieces of supplemental information you can ask for: FID, directory FID, target FID and name. They are appended after the initial structure. Each piece of supplemental information which is appended begins with this header, which specifies the type of the information and the length of the information record. So you can have multiple information records appended after the initial event. Yeah.

[00:20:08]And basically the event length tells you the total length of all the information.

[00:20:13]And with this header you can then go through the information records one by one to learn what is there.
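Walking the information records could look roughly like the sketch below. It assumes the header layout of `struct fanotify_event_info_header` (type, pad, length) and the FID payload layout of an fsid followed by a `struct file_handle` and, for name records, a NUL-terminated name. The record type numbers are copied by hand from `<linux/fanotify.h>` and should be verified; the function names are hypothetical.

```python
import struct

# struct fanotify_event_info_header: info_type (u8), pad (u8), len (u16).
INFO_HEADER = struct.Struct("=BBH")

# Record types as recalled from <linux/fanotify.h> (assumption: verify).
FAN_EVENT_INFO_TYPE_FID = 1
FAN_EVENT_INFO_TYPE_DFID_NAME = 2
FAN_EVENT_INFO_TYPE_DFID = 3


def iter_info_records(extra):
    """Yield (info_type, payload) for each record appended to an event."""
    offset = 0
    while offset + INFO_HEADER.size <= len(extra):
        info_type, _pad, rec_len = INFO_HEADER.unpack_from(extra, offset)
        if rec_len < INFO_HEADER.size or offset + rec_len > len(extra):
            break  # malformed record, stop parsing
        yield info_type, extra[offset + INFO_HEADER.size:offset + rec_len]
        offset += rec_len


def parse_fid(payload):
    """Split a FID payload into (fsid, handle_type, handle, name).

    Layout assumed: fsid (two s32), then struct file_handle
    (handle_bytes u32, handle_type s32, opaque bytes), then an optional
    NUL-terminated name for the *_NAME record types.
    """
    fsid = struct.unpack_from("=ii", payload, 0)
    handle_bytes, handle_type = struct.unpack_from("=Ii", payload, 8)
    handle = payload[16:16 + handle_bytes]
    name = payload[16 + handle_bytes:].split(b"\0", 1)[0]
    return fsid, handle_type, handle, name
```

The file handle bytes are opaque to the application: they are only meant to be compared with other handles or passed back to the kernel, not interpreted.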

[00:20:24]Okay. So now, what does the supplemental information look like in particular? I want to speak a bit about this FAN_REPORT_FID flag. FID is basically a file ID; that's an alternative way to identify which object is generating the event, which file or which directory or whatever. As I was saying shortly before, normally you get a file descriptor pointing to the object that's generating the event.

[00:21:04]But that requires privileges, and it requires opening the file descriptor when you are actually receiving the event, so it's kind of expensive. So we have come up with this FID concept. The FID basically contains an FSID, which is a 64-bit identifier uniquely identifying the file system, and then a standard file handle as used, for example, by NFS.

[00:21:36]So this uniquely identifies the object in the kernel, and it has the advantage, if you have seen Mikuel Cotney's talk on Monday, that it doesn't really pin the object in memory. It also doesn't give you any privilege for accessing this object. So you can safely give out this identifier even to an unprivileged user, and if he is not privileged to access the object, he will simply not be able to. He just knows that some object like this exists, but he's not able to infer any useful information from this. Still, he can use this identifier to match different events, to tell whether they are generated by the same object or not, and it's useful in other ways too, even for unprivileged users.

[00:22:27]So for directory events we generate identifiers like this. Now, besides this identifier for the object, with directory events we have additional information. The directory FID supplemental information which you can receive is the direct parent directory of the object which is generating the event. This is useful because if, for example, a file generates the event, you may want to learn its parent directory, because with that you can then reconstruct the full path to the object which is generating the event. Target FID is kind of the other way around. Normally for directory events, for example for create, the object that is generating the event is the directory itself. Yeah.

[00:23:32]But you may want to learn about the file that's actually being created in the directory, and target FID gives you the identifier of that. So with target FID you will get the identification of the file that's actually being created in the directory, for example. And report name: this is basically the name, in the directory, of the object that's generating the event.

[00:23:58]Okay. Now let's move on to some additional supplemental information. Each fanotify event actually contains a pid field, which tells you the process ID of the process which is causing the event.

[00:24:16]Now, this PID has the disadvantage that it can get recycled. By the time you read the event and get to processing it, the PID can already point to a different process, and that is a problem for some use cases. So we have the FAN_REPORT_PIDFD flag, with which, in addition to getting the PID, you will also be getting a supplemental information field which contains the pidfd, which is a kernel concept. Basically you get a file descriptor which uniquely points to the process. If the process exits, the pidfd will tell you about that, that it is pointing to a process that has already exited, and you can still use the pidfd to learn something about the process even though it has exited.

[00:25:12]So fanotify now supports this.

[00:25:17]Okay, another concept we have added is so-called evictable inode marks. Normally, when you place a mark on a file or directory, it pins the inode in memory, and inodes are relatively large things; for a common file system the inode easily takes 1 kilobyte or so. So consider monitoring large directory hierarchies and placing notification marks on a lot of these inodes and directories, in particular ignore marks. There are applications which, for example, watch the whole file system and then place ignore marks on files they have already processed and don't care about anymore. They can easily place tens of millions or hundreds of millions of notification marks in the file system. And if you multiply this by the size of the inode which is pinned in memory by such a notification mark, it adds up to gigabytes of kernel memory. So an evictable mark is a flag which tells the kernel: I am happy to lose this mark, just free the inode if you need to. When reclaim decides to free the inode, the inode is freed and the notification mark is destroyed by this, so the memory overhead then doesn't really matter.

[00:26:43]Yeah. And this is useful in particular in combination with ignore marks, because you place an ignore mark on an inode, and if the kernel decides to remove the inode from memory, it can happily do so. If the inode is used again and is pulled back into memory, then you will start receiving events from it again. But generally you don't care. You will just quickly figure out that you are still not interested in this inode, so you just place an ignore mark on it again. Now, we also have patches in flight, which are sitting in my tree and which I still need to finish and submit the next version of, so that no notification mark pins the inode in memory at all. Basically how it works is that when the inode gets removed from memory, gets reclaimed, we keep the notification mark in memory and just disconnect it from the inode, and when the inode gets cached again, we reconnect the notification mark with the inode. Obviously there are some technical difficulties with this, but the concept is simple. Okay. Another feature was added relatively recently by Gabriel Krisman Bertazi while he was still at Collabora, but now he's with us. It's notification about file system error events.

[00:28:23]So this is the FAN_FS_ERROR event you can request to receive, and this event is generated when the file system hits some failure like an I/O error, file system corruption or a similar issue. This is used mostly by system management tools, because what they were doing so far was scraping the kernel logs, and if they learned there was some suspicious message in the kernel log, they were notifying system administrators that there is something wrong, migrating the load away and taking the machine down, for example. Now of course, scraping the kernel logs is not really great and is not 100% reliable, so this fanotify error notification is much more reliable. Darrick Wong has recently extended this, so with the event you can now also ask for additional file ID information. It will tell you which file has actually hit the error. Not every error is associated with a file, of course; it can be some general shared file system metadata that is having problems. But if it is some file which is really hitting the error, then you will learn which file it is. And this is actually used by XFS, because XFS now has online file system check functionality, and it uses this information, plus some additional information it is also sending, to guide the online file system checker. Okay. Now, permission events. Permission events are an ancient thing.

[00:30:36]They have basically been there almost since the beginning of fanotify. But I will speak about them to prepare the stage for the hierarchical storage management events. So originally fanotify had four permission events, which were used by antivirus scanners. These are generated on open, on open for execution, or on access. Oh sorry, three events.

[00:31:04]Yeah. And they allow mediating access to files. So here is how a permission event works: when you try to, let's say, open the file, the permission event is generated and the system call is paused until user space replies to this event, and from the reply you learn whether you should continue with the system call or whether you should stop it. If you get accept, then the system call simply continues. If you get deny, then you basically return an EACCES error to user space. Now of course, because the progress of the system call now depends on user space replying to you, it is a source of interesting deadlocks we have been dealing with over the years for various customers. What sometimes happens is that the application which is responsible for processing the events does something which generates an event on the file system it is watching. So the application which is responsible for replying gets blocked itself, waiting for its own reply.

[00:32:18]Yeah. So that is frequent enough that recently Miklos Szeredi from Red Hat has added this permission event watchdog, which checks for pending fanotify permission events which have already been reported to user space but which have not yet been replied to. If they are sitting there longer than a configured period, we start to report into the kernel logs: we have these pending events here, likely user space did something stupid. So it's mostly to make things easier for support guys. If they get a report of some deadlock related to fanotify, they can just check the kernel logs and see, okay, user space did something stupid, and there is some information with the reported events which can ease debugging, so you don't have to analyze kernel crash dumps to learn this. Okay.

[00:33:24]The response which user space has to write, and I don't think I have said this yet, is sent by writing to the notification group descriptor, and you write there a structure like this.
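The reply described here can be sketched as follows, assuming the layout of `struct fanotify_response` (the event's fd followed by a 32-bit response code). The FAN_ALLOW and FAN_DENY values are copied by hand from `<linux/fanotify.h>` and should be checked against your headers; `build_response` and `reply` are hypothetical helper names.

```python
import os
import struct

# Response codes as recalled from <linux/fanotify.h> (assumption: verify).
FAN_ALLOW = 0x01
FAN_DENY = 0x02

# struct fanotify_response: fd (s32), response (u32).
RESPONSE = struct.Struct("=iI")


def build_response(event_fd, allow):
    """Encode the verdict for the permission event identified by event_fd."""
    return RESPONSE.pack(event_fd, FAN_ALLOW if allow else FAN_DENY)


def reply(group_fd, event_fd, allow):
    """Write the verdict to the group descriptor, then drop the event fd."""
    os.write(group_fd, build_response(event_fd, allow))
    os.close(event_fd)
```

Until such a reply is written, the system call that triggered the event stays blocked, which is why a handler that touches the watched files itself can end up waiting for its own reply.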

[00:33:40]Okay. So now hierarchical storage management. That's a feature that was merged about a year and a half ago.

[00:33:47]So first maybe I'll start with what hierarchical storage management is. The idea, or how we actually use it here, is that you have some local file system, but you fill the content of this local file system from some remote storage, for example cloud storage like Amazon storage or whatever these cloud vendors have. So you fill in the content of the local file system from this slow cloud storage on demand. And it's used by the usual large cloud vendors like Amazon and Facebook and so on. It's used to speed up the creation of a container, because you don't want to pull the whole container image before you can start the container. They want to pull only the bits the container is really going to need, on demand, so that they can start the container as fast as possible, and only if the container needs something is it pulled from remote storage.

[00:34:57]Now, you want the access to the slow storage to be transparent. Locally it has to appear like a local file system, and there are several approaches to filling in the content. The simplest one is that when the file is opened, you simply load it from the remote storage and store it locally, so that further accesses are fast. You can also decide on a more fine-grained approach where you fill in on read or write. So only at the moment when the user asks to read some data do you fill in the data from remote storage. Or you can go even further and, for example, not even create the directory hierarchy beforehand; only when, let's say, the user does a read of a directory or a lookup in a directory do you fetch the content from remote storage.

[00:35:53]So these different approaches to hierarchical storage management then require slightly different approaches.

[00:36:03]Anyway, in Linux you can already do this kind of thing; there are several approaches you can already take.

[00:36:11]One is obviously FUSE, file systems in user space. That is very flexible. Basically the kernel exposes a file system, and for anything that happens in the file system a user space daemon gets called, and it can do whatever it wishes to provide the content. So that is very flexible. You can definitely do this. The problem is that there is a certain overhead with the round trip to user space, to the daemon to provide the content, and back to the kernel to provide the content to the application. So it's difficult to achieve the performance of a real local file system if you do it with FUSE. Especially for metadata-intensive loads the overhead and additional latency is very visible.

[00:37:00]You can also use FS-Cache, which is a kind of local caching for a remote file system, but it's tied basically to networking file systems: only NFS and Samba know about FS-Cache and can use it. EROFS was actually using FS-Cache as well. EROFS is this read-only file system; it's also used for container storage, and it's implementing a so-called container FS mode. And there is exactly this requirement there. Huawei, I think, uses it, and maybe other Chinese cloud vendors, ByteDance and the like. They use it exactly to pull only the bits of the container image that are needed for their container. So they were using FS-Cache as well for this partial pulling, but because FS-Cache is designed around network file systems, it doesn't really work well for EROFS. So they are getting away from it. They are removing FS-Cache support and instead they are using these fanotify hierarchical storage management events I'll be talking about in a moment.

[00:38:20]So how are these events working? Actually it's only one event, but anyway, it's similar to permission events. You generate an event and you block the system call until user space replies that everything is ready, and then you let the system call proceed, or you deny it if user space denies. It requires a special class of notification group, which in particular means it basically requires administrator privileges, but I guess that's not surprising. Currently we have only one pre-access event, and this is an event that is guaranteed to be generated at least once before some file content is accessed for the first time after the file system mount. It's deliberately specified in this not very exact way because sometimes it can happen that we actually generate the event more times. Sometimes we generate the event even before it is really needed by user space. For example, due to readahead or similar things the kernel can decide to fetch some data in advance, and we don't want to deny that.

[00:39:42]So that's why it is so vaguely specified, but it's exact enough to be useful for hierarchical storage managers. We generate these pre-access events on read and also on write; with each event you also get the range that is accessed by this read or write or readahead or whatever. You also get a pre-access event on open, but that basically just tells you that the file is being opened, without giving you a range. We also generate pre-access events on mmap, and there the event covers the whole range that is being mapped. That's somewhat inconvenient. In the past we actually had it implemented so that the page fault generated the event just for the part of the file that was really accessed, but we reverted it because there were problems with deadlocks with file system freezing. We decided it's not worth the bother because the real users are not that much interested in this functionality, so they can do without it, at least for now. But it's still an open problem which would be nice to address some time.

[00:41:06]So maybe now how the daemon, how the hierarchical storage management is going to work in the big picture. Basically there is a daemon which is watching for these pre-access events, and it internally keeps track of what it has already downloaded from the remote storage and what not.

[00:41:28]Whenever it gets the pre-access event, it fetches the data from the remote storage, stores it locally, and marks locally that this part of the file is now downloaded. When the download is done, it replies to the event: you can now proceed with this system call. The daemon has to be somewhat careful to fill in the file content without generating new events itself, because that would deadlock. There is an elegant way to do this: you can create a separate private mount for the HSM management application and place an ignore mark on this mount. Whatever access happens through this mount will not generate new events, and this way the daemon can safely access the file system and fill in the content without generating new pre-access events.
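The bookkeeping side of such a daemon can be sketched in a few lines of Python. This is only an illustration of the idea described in the talk, not code from any real HSM daemon: the names `RangeTracker`, `handle_pre_access` and the `fetch` callback are hypothetical, and the actual fanotify event reading and replying is left out. Because the event may arrive more than once for the same range, the handler is written so that repeated events fetch nothing new.

```python
from typing import Callable, List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


class RangeTracker:
    """Tracks which byte ranges of one file are already present locally."""

    def __init__(self) -> None:
        self._have: List[Range] = []  # kept sorted and non-overlapping

    def missing(self, start: int, end: int) -> List[Range]:
        """Return the parts of [start, end) that have not been fetched yet."""
        gaps: List[Range] = []
        pos = start
        for s, e in self._have:
            if e <= pos:
                continue          # range lies entirely before the cursor
            if s >= end:
                break             # range lies entirely after the request
            if s > pos:
                gaps.append((pos, s))
            pos = max(pos, e)
        if pos < end:
            gaps.append((pos, end))
        return gaps

    def add(self, start: int, end: int) -> None:
        """Record [start, end) as present, merging overlapping neighbours."""
        merged: List[Range] = []
        for s, e in self._have:
            if e < start or s > end:
                merged.append((s, e))
            else:
                start, end = min(start, s), max(end, e)
        merged.append((start, end))
        merged.sort()
        self._have = merged

    def is_complete(self, size: int) -> bool:
        """True once the whole file [0, size) is populated."""
        return not self.missing(0, size)


def handle_pre_access(tracker: RangeTracker, offset: int, count: int,
                      fetch: Callable[[int, int], None]) -> bool:
    """Fill in whatever is missing for one event; True means 'allow'.

    `fetch(start, end)` is a hypothetical callback that downloads the data
    and writes it through the daemon's private, ignore-marked mount.
    """
    try:
        for s, e in tracker.missing(offset, offset + count):
            fetch(s, e)
            tracker.add(s, e)
    except OSError:
        return False  # deny the blocked system call if the download failed
    return True
```

Once `is_complete()` returns true for a file, the daemon could place an ignore mark on that file so that no further events are delivered for it.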

[00:42:32]And once the full file is populated, you can place an ignore mark on it so that you don't receive new events. This is already used by Meta and a couple of other cloud vendors in various shapes and forms. Okay, some challenges. Since we have only five minutes, I will now speed up a bit. There are issues in that the data consistency of the file system relies on the daemon running. So we are now working on a scheme which allows safe updates of the daemon and safe handling of crashes. It uses the fdstore functionality in systemd.

[00:43:20]So you can basically place the notification group FD into the fdstore. Even if the application crashes, the notification group still exists, so new system calls just get blocked. When the new daemon is started, either a new version or a restart after the crash, it fetches the notification group descriptor from the fdstore and continues replying to events.
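The restart half of this scheme can be sketched as follows. systemd hands stored descriptors back to a restarted service through the same environment variables it uses for socket activation (`LISTEN_PID`, `LISTEN_FDS`, `LISTEN_FDNAMES`), with descriptors numbered from 3. The function name `find_stored_fd` and the descriptor name `fanotify-group` are assumptions made for this example; storing the descriptor in the first place (an `FDSTORE=1` notification with an `FDNAME=` field, plus `FileDescriptorStoreMax=` in the unit) is not shown.

```python
import os
from typing import Mapping, Optional

SD_LISTEN_FDS_START = 3  # first inherited descriptor number used by systemd


def find_stored_fd(name: str,
                   env: Mapping[str, str] = os.environ,
                   pid: Optional[int] = None) -> Optional[int]:
    """Return the inherited descriptor stored under `name`, or None.

    `env` and `pid` are parameters only so the lookup can be exercised
    without actually running under systemd.
    """
    pid = os.getpid() if pid is None else pid
    if env.get("LISTEN_PID") != str(pid):
        return None  # the descriptors were not meant for this process
    try:
        count = int(env.get("LISTEN_FDS", "0"))
    except ValueError:
        return None
    names = env.get("LISTEN_FDNAMES", "").split(":")
    for index in range(count):
        if index < len(names) and names[index] == name:
            return SD_LISTEN_FDS_START + index
    return None
```

A restarted daemon would call `find_stored_fd("fanotify-group")` first and only create a fresh notification group when it returns `None`.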

[00:43:46]Deadlocks with file system freezing I have already talked about. And now about some more crazy stuff which is also happening. With fanotify these days you can also receive notifications about mount events. So you will receive mount attach and mount detach events.

[00:44:06]You basically have to place a special notification mark on the mount namespace, and then that mark receives these mount attach and mount detach events.

[00:44:20]What is in the works is general notification about namespaces. You can place a notification mark on basically an arbitrary user namespace and you will be getting information about creation and deletion of namespaces under this user namespace. Patches are in the works; we will get there soon, I guess. Now people are putting out even wilder thoughts, like placing a notification mark on a process, because with pidfd a process can be identified by a file descriptor.

[00:44:59]So in principle it is accessible to fanotify. You could place a notification mark on the process and learn about various events that are happening with the process, like getting an event on fork or on exit, or on thread creation, clone, or whatever. People also want to place notification marks on cgroups, again to receive notifications about various events happening with a cgroup, like a new cgroup being created under this cgroup, or some cgroup getting throttled, or whatever events. Generally people are interested in these events in management tools.

[00:45:36]Actually, for cgroups we are already generating these events to some extent using kernfs, which is the underlying file system behind the cgroup file system. But it is currently kind of awkward: in some cases where you would expect an event to be generated, it is not, because of how kernfs is implemented and because what is visible in the cgroup file system does not exactly match the internal representation of the cgroups. So currently it's somewhat awkward. It works for most use cases, but not quite all. If you could place notification marks directly on cgroups and generate the events directly, that could perhaps provide a more consistent experience. But these are just wild thoughts at this point; probably we'll get there at some point. Okay, so conclusion.

[00:46:34]So I would say third time is the charm.

[00:46:36]The third version of the notification API seems to be quite successful. After 15 years, we are still able to extend the API, and although there are some, let's say, ugly quirks with the API, we were able to mostly overcome them and we can extend the API as we need. In recent years it's not only about file system notification; there is more functionality, and it seems to be more or less transforming into a general notification mechanism for the kernel. This is allowed by the concept that everything is kind of a file descriptor in Unix, and you also have file IDs, file handles and so on, which are a very strong concept in the kernel that has proven very useful. It's an actively evolving area: we are getting a lot of new functionality, like storage management, file system error notification, and namespace events, and more patch sets are in flight. So thank you for your attention. Maybe we have time for one quick question, but yeah, thanks. >> Are you aware of work on event notification relating to device files?

[00:48:03]device would be one any device. So something happening to the device you get.

[00:48:12]>> Well, it could be done. Nobody has asked for it yet. We would have to really talk about the exact semantics, which events you want to receive, so that performance makes sense and so on. But >> you're you're used for that before.

[00:48:34]error. We just didn't implement it because the customer lost interest, but we had a request for tracking errors in the block device.

[00:48:46]>> Well, in our case it is more about the new features that you have on HDDs where you can depop heads without completing the drive. So that's only applicable to SMR, for zones. Essentially what happens is that we have a few zones that go offline. So, >> okay, maybe we can take it offline because we are kind of running out of time, but we can discuss this. Yeah. Okay.

[00:49:10]Thanks for your attention. If anybody has more questions, I'm still around, so feel free to catch me and we can talk.

[00:52:55]Looks like there is like something to speak something.

[00:53:02]>> Uh testing one, two, three.

[00:53:05]>> Okay.

[00:53:10]All right. Thank you for being here. My name is Ken Johnson and I'll be presenting this session to you.

[00:53:19]We're about six minutes late, so I'm going to skim through a lot of this pretty quickly up front. Here's the agenda for the presentation.

[00:53:32]As an introduction, I'm part of the partner engineering team. We are now part of the quality engineering Linux team under Ralph, and the IHV side of our team is responsible for administering the YES Certification program for IHV partners.

[00:53:54]We're also responsible for developing the System Certification Kit that partners use for certifying their hardware, and we help partners throughout that process to debug issues that they come across.

[00:54:12]So the YES program has a long, storied history. It started back with Novell in 1988. When Novell acquired SUSE, they merged the YES Certification program with the certification program that SUSE had at the time, and they developed and released a new self-certification program for SLE. This is just a few of the partners that we have right now. There are many others that you can find on the YES search, and at this point in time we have just short of 21,000 YES certifications since SLE onwards.

[00:54:56]So why YES certification? It enables sales of solutions that help customers, partners, and SUSE reduce costs, manage complexity, and mitigate risk. We help customers easily identify and purchase hardware solutions that have been tested for compatibility, are approved by SUSE, and are fully supported. A key point of this is that they're tested, approved, and fully supported by SUSE, and the partner is contractually committed to support them. That is part of the agreement they sign up for when they join our YES Certification program.

[00:55:37]So at a high level, to do a certification a partner needs to join the program and do the certification testing, they work with us to create a certification bulletin, it's released, and then it's available on our public YES search web page. These are just a few links for how a partner goes about joining. Once they join the program, we do some testing for some partners, some of our tier one partners that provide hardware to SUSE. They physically provide the hardware to us and we will actually do certification on their behalf for the systems that they ask us to do. Otherwise, the vast majority of the certifications are done by the partners themselves through self-certification, where they set up the test environment, execute the tests, and then submit the test results to us to be analyzed.

[00:56:45]Once they've been analyzed and approved, we go ahead and generate the bulletin with them. The contents of that bulletin show what the tested configuration was during testing and any additional configuration notes that need to be recorded for issues or situations that were run into and that might help customers or support if they are supporting this system.

[00:57:17]This is just a diagram to give you a visual reference for when we talk about certifying. It actually requires two physical systems. You have your system under test over here, and on the other side you've got your test console system. This is our test harness that runs our test console application, which manages the testing process for the system under test. The depiction here is also meant to call out the fact that the system under test could have four adapters, could have 16 adapters, and the test console needs to have an equal number of adapters, a 1:1 ratio, so that we can test each network connection independently on its own network segment. I also wanted to call out the fact that we test a serial port if a serial port is in the system, which nowadays is becoming not so common, it seems.
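The 1:1 adapter requirement described above can be expressed as a small consistency check. This is a hypothetical sketch and not part of the actual test kit: the function name `check_pairing` and the input format (one `address/prefix` string per adapter, listed in pairing order) are assumptions made for the example.

```python
from ipaddress import ip_interface
from typing import List


def check_pairing(sut: List[str], tc: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the layout looks fine.

    `sut` and `tc` hold one "address/prefix" string per adapter on the
    system under test and on the test console, in pairing order.
    """
    problems: List[str] = []
    if len(sut) != len(tc):
        problems.append(
            f"adapter count mismatch: SUT has {len(sut)}, "
            f"test console has {len(tc)}")
        return problems
    seen = set()
    for i, (a, b) in enumerate(zip(sut, tc)):
        net_a = ip_interface(a).network
        net_b = ip_interface(b).network
        if net_a != net_b:
            problems.append(f"pair {i}: {a} and {b} are on different networks")
        elif net_a in seen:
            problems.append(
                f"pair {i}: network {net_a} is already used by another pair")
        seen.add(net_a)
    return problems
```

For example, `check_pairing(["10.0.1.1/24", "10.0.2.1/24"], ["10.0.1.2/24", "10.0.2.2/24"])` reports no problems, while two pairs sharing one subnet would be flagged because they are not on independent segments.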

[00:58:29]This is an example of a YES certification bulletin that actually gets published. This is by our partner Lenovo, and it is a bulletin for a SLES 16.0 system that they did just recently, earlier this month. You'll see the operating system up here at the top, calling out which OS it was certified with. Then you also see there's an other product listed. If our partner agrees to also support SUSE Linux Micro for a SLES 16 bulletin, since it shares the same codebase with Micro 6.2, we allow them to also list that product as an other supported product.

[00:59:16]Then we move on down to a product description that the partner fills in. And then we have the tested configuration, where, as you'll notice, we call out what sort of a computer it is, the form factor, so in this case a rack mount, and the motherboard revision. We document the BIOS or UEFI and call out whether it was booted in legacy BIOS mode or UEFI mode. In this case it was UEFI mode.

[00:59:45]It gives the version string and a date string. We also allow testing with secure boot, and if a system was tested in secure boot mode, we will also call that out on that BIOS/UEFI line.

[01:00:05]Then we go on and show things like the CPU, RAM, video adapter, host bus adapters, and storage devices, and then the test kit that was used to certify it. A couple of things don't show up here because they weren't present in this system. If they had a GPU accelerator and ran our GPU test, that would be called out here under an accelerator line. Also, if they had a system with persistent memory that was configured either in AppDirect mode or memory mode, that would be called out here as well, along with how it was configured when it was tested.

[01:00:42]We then also have this other required section down here of configuration notes. We require that they document how the system was installed: whether it was physical media, virtual media through a BMC, or whether they did a network boot and installed via the network. Because the bulletin is so long, I did another screenshot here, so this is the follow-on to it. We list the drivers and adapters that were loaded during testing, and at the bottom we list contact information for the partner.

[01:01:22]I won't spend a lot of time on this one. This one is a little bit different: it is a SLES KVM one. So we also do virtualization testing, with KVM, or with Xen if we're back on the SLES 15 releases. The difference you'll notice on this bulletin is that it says SLES 16 for AMD64 Intel64 with KVM, showing that the system was acting as a host hypervisor. You'll also notice we list the virtual machines that were configured and tested.

[01:02:02]In this case, they tested with just three SLES 16 guests. This could be a SLES 16.0 guest, a SLES 15 SP7 guest, and a Windows Server guest. It's up to the partner whether they want to test with a Windows Server guest or not. That's optional.

[01:02:24]There are some partners that religiously do that on all of their virtualization certifications, and other partners just stick with SLES guests. The other thing to call out here is that in our config notes with virtualization testing, we require the testing of an SR-IOV device, and we make the partner document which network interface was tested using SR-IOV. If the platform or the network device doesn't support SR-IOV, we allow them to fall back on PCI passthrough and document, again as a PCI passthrough config note, which network interface was used.

[01:03:13]So the System Certification Kit. This is just at a high level, over the last three years, the releases we've done and the release cadence. Our releases are always done in coordination with a SLES release. So we release the kit at the same time that SLES 15 SP7 or SLES 16.0 is released, or with Micro 6.0 or Micro 6.1, since we tested those individually.

[01:03:40]I'm not going to go over each of these. I'm going to focus on this last one, our current release that we're working on to support SLES 16.1.

[01:03:53]That's our SCK 10.1, which is currently in beta. Again, it is slated to release in November alongside 16.1.

[01:04:01]Here it also lists the OSes that can be tested with this version of the kit: SLES 16.0, SLE Micro 6.2, 6.1, 6.0, or SLE 15 SP7, so it could be SLES or SLED, or SLES with the workstation extension. I also wanted to call out a couple of the enhancements that we have done with the different releases. In this one, we've enabled the kit to support SLES 16.1, and with that, immutable mode is the big thing. We already had support for immutable mode within the Micro project for certifying Micro 6.0 and 6.1.

[01:04:48]But now we're integrating it into our server project, which is the project that 90% of our bulletins are done with. We've added support for a TPM 2.0 test, based on feedback we got internally; I'll talk a little bit about that later. We've also enhanced the GPU compute test to handle AMD accelerators at this point in time.

[01:05:19]And then we've also added a BCI container workload to our stress testing.

[01:05:26]As I mentioned, our release cadence follows the SLES 16.x release cadence. Some of the challenges we have with this kit: we're testing both standard and immutable mode, and we're trying to understand how we are actually going to have partners certify, whether we have them certify just in immutable mode. That has been discussed, but there are still ongoing discussions about whether we will be able to have them certify just with immutable mode or whether we'll need some standard mode testing as well. The other thing is that our test kit needs to be ready for partners in the RC1 and RC2 time frame. This is because we allow our tier one partners to start doing official certifications on SLES 16.1 when RC1 or RC2 is released, depending on when the kABI is locked. With SLES 16.0 that wasn't until RC2. At that point in time, we let our partners start testing and queue those certifications up for SLES 16.0 FCS, so that when SLES 16.0 was released at FCS, we released a few hundred bulletins at that time.

[01:07:00]So the key focus is hardware touch points, and primarily we're focused on compatibility testing, not functional or performance testing.

[01:07:11]We've got 10 minutes, so I'm going to try and hurry to the deeper part. This is just a block diagram giving an idea of some of the touch points. Again, we're focused on the kernel and the operating system to the hardware, and then virtualization.

[01:07:31]So here's really where I wanted to get to and go a little more in depth with everyone: what are the tests that our test kit actually runs, and what is it we do? For the server project, which is what I'm focusing on here and which is the main focus for our partners, we focus on the OS boot process and the installation; how that is done is documented on the bulletins that we release. We also test secure boot, if that was enabled during testing.

[01:08:14]We also test network configuration, so configuring all the network devices for IPv4 and IPv6 testing, and we test time synchronization. We do component validation and collect all of the drivers and devices within the system. Then we focus on three main areas of testing: some manual tests that the partner will run, some automated tests, and a stress test part. The manual tests all require some amount of interactivity by the tester, so we group those together. The first is a graphics adapter test, making sure that the graphics adapter actually works in the system and displays the GNOME or Wayland interface. The reason why this is highlighted in blue is that some of our vendors, IBM in particular for Power or mainframe, operate in a headless environment, so that's not a required test for them. The optical verify and write test is obviously only done if the system has an optical device. For the server project, Hibernate (S4) and Sleep (S3) are optional. Those tests are not optional when it comes to our workstation project, our laptop/tablet project, or our point of service project. There the partners have to run those tests, and if they fail, that needs to be documented so that a customer knows that those power management features are not available with this particular hardware platform.

[01:10:15]The next one is persistent memory.

[01:10:17]Again, this applies if the system is configured with persistent memory, which today not many systems are. I think it's been a few years since the vendors were selling systems with persistent memory. But if it is, we allow them to test in AppDirect mode, in memory mode, or in a mixed mode, where we have them test partially in AppDirect mode and then switch over to memory mode and complete the testing in memory mode.

[01:10:45]We require them to pass a Kdump test, verifying that on the platform they could actually capture a crash dump. Then there's the GPU compute test, which until now has covered just Nvidia-based accelerators and has now been enhanced to also include AMD accelerators.

[01:11:06]For our automated tests, we've got a serial port test, which is optional, partly because a number of systems no longer have serial ports. Then CPU frequency scaling, so the OS's ability to scale the processor frequencies, CPU hot plug, memory hot plug, watchdog timer, and a firmware BIOS test, which is the upstream FWTS project that we run.

[01:11:36]That is optional. The majority of our partners run it; some do not, though. Then we do some network validation.

[01:11:48]That is essentially verifying that the interfaces they have configured between the SUT and the test console, the test harness, are properly configured and that they have matching bandwidth, so a 25 gig adapter or a 100 gig adapter matching on the SUT and the TC side. We do some baseline throughput testing for a minimal amount of time, because once we get down to our stress testing we'll be doing 12 hours of stress testing of those devices. Then we also have a watchdog timer and the TPM 2.0 device test. That is something I wanted to call out as new. It was brought up to us when we worked with Thorston, who helped us with the full disk encryption functionality. We found that certain TPM 2.0 devices didn't actually have the functionality needed to support what we required for full disk encryption.

[01:12:52]Within TPM 2.0, instead of incrementing the version from TPM 2.0 to something else, they just have different specification levels. So we've identified with Thorston a way to check and verify that we actually have a TPM 2.0 device that will support full disk encryption. Again, that is new with our new test kit for 16.1.

[01:13:17]Then we have our 12-hour stress test run, which does memory and CPU I/O. We do I/O across all storage devices in the system, network IPv4 and IPv6 stress testing, optical read, again if there's an optical device in there, and USB storage. To test the USB ports, we have our partners attach two USB storage devices and we do I/O to those devices. Then there's time synchronization and error checking, and that runs for 12 hours. Usually we have them run that overnight: they run the other tests during the day, and when they are finished for the day and head home, they start the 12-hour test, come back the next day, and it should be finished for them. Once those tests are done, they can run another test in our test console that gathers up all the test logs and all the system configuration. It creates a submission file that they provide to us, and we evaluate those test results and either approve or reject.

[01:14:31]Because of time I won't focus too much on the virtualization. This is one I wanted to go over a bit, but I'll just call out here that with virtualization we essentially have two main legs of testing. We have the partners set up multiple VMs, which needs to be three or more VMs, and we test them running at the same time. We make sure that they do CPU overcommit on the system for that, and they run the stress testing for 12 hours in the multi-VM configuration. Then they tear that down, set up a max-VM configuration, and run through an hour of stress testing with that.
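The constraints on the multi-VM leg can be captured in a short check. This is a hypothetical sketch rather than kit code: the function name `check_multi_vm_layout` is made up, and CPU overcommit is assumed here to mean that the guests together are assigned more vCPUs than the host has CPUs, which the talk does not define precisely.

```python
from typing import List


def check_multi_vm_layout(host_cpus: int, guest_vcpus: List[int]) -> List[str]:
    """Return a list of problems with a planned multi-VM test layout.

    `guest_vcpus` holds the vCPU count of each guest that will run
    concurrently during the 12-hour stress run.
    """
    problems: List[str] = []
    if len(guest_vcpus) < 3:
        problems.append(
            f"need three or more VMs, found {len(guest_vcpus)}")
    total = sum(guest_vcpus)
    if total <= host_cpus:
        # Assumption: overcommit means more vCPUs than host CPUs in total.
        problems.append(
            f"no CPU overcommit: {total} vCPUs on {host_cpus} host CPUs")
    return problems
```

With a 16-CPU host, three guests of 8 vCPUs each would pass, while three guests of 4 vCPUs each would be reported as not overcommitted.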

[01:15:22]Here are just a couple of screenshots of our test console application from our test harness. This is from a SLES 16 beta 3 immutable mode run that I did just last week on a system I have. As you'll see, currently the only two tests that are failing on this particular system are the Kdump test, where right now there's an issue with the transactional-update Kdump setup not actually completing,

[01:15:52]so hopefully that'll be addressed with the open beta, and the GPU compute test, which we're investigating and which very well might be an issue with our test kit at this point in time. And here's an additional virtualization one. But we've got one minute left. Part of the reason for doing this talk is wanting input and feedback from you. Some of the things that we have been discussing internally with other stakeholders are immutable mode and the testing for SLES 16.1, and whether it is sufficient for partners to test just in that mode. The other is newer technologies like CXL. Fujitsu, for example, is working closely with us on CXL and they have expressed interest in having us test that. We've also been working with HANA, and he has outlined for us some tests that we could run there. We have been limited on hardware, so we need to actually get that, but this is kind of a stretch goal for our 10.1, or maybe a post-10.1 thing that we can test for. Anyway, that's the end. We're out of time, but if there are any questions, I'll go ahead and try to take some.

[01:17:28]So would you be expecting the OEM vendors to be running the YES certification tests, or would you expect it to be for instance running those tests in their ring apps and that sort of thing? Is it Supermicro? Was it us? Is it >> so whomever?

[01:17:49]>> Yeah. No, that's a great question. We actually have some partners that run that just as part of their validation when bringing up a platform and so on. We can talk offline about that because there are some interesting caveats to it, of potentially being an ODM for another supplier and other IHVs then being able to leverage a certification that you've done. But yeah, we can maybe talk about that offline.

[01:18:25]>> How long do you expect the PMEM tests will remain part of the certification? Because NVDIMMs are basically gone.

[01:18:34]>> They are. Yeah. But we still have systems; in particular Lenovo has systems that they are still supporting and still certifying with newer releases, so they are still running those tests. But yeah, at some point in time it is no longer going to make sense to have that as part of our test kit, and we'll just have to wait and see when that is, when there are no longer any platforms being certified with it, I guess. We're out of time, but thank you for your attention. I appreciate it.

[01:23:18]Okay. So this is supposed to be a discussion about Kbuild. This is our internal build system for the kernel, and I would like to invite discussion about how it works for people and how it doesn't work. At the start I will give some overview of what it is.

[01:23:44]It doesn't look so great; it looks like the screen somehow cut off part of the presentation.

[01:23:59]This is an overview of the Kbuild system. The AONUS machine is what you talk to when you are using the Kbuild system. It has the Git repositories, which are accessible over SSH or over the web, and the web has some additional resources available. This is now a new machine which is quite powerful, so most of the features of Kbuild are running on this machine. There is only one part that is running on a dedicated machine, and it's the repository expansion that converts the quilt repository into the expanded repository that you can directly build.

[01:24:54]There is a number of builders that are still used for the test builds, but part of the test builds is also happening on Apollonius now.

[01:25:08]So there were recent changes to the Kbuild system. There is an ongoing effort to rewrite the legacy code so that it's easier to maintain.

[01:25:22]There is the new server hardware.

[01:25:26]The deployment on the new server hardware was done with Ansible, which means that should we need to deploy on a different piece of hardware, it should be much easier than in the past.

[01:25:39]We changed the storage which holds temporary data used by the build from NFS to WebDAV, which means that these intermediate results are now also available on the AONUS machine, because it does pretty much everything. So if you knew where on the machine the data is, you could look at it, and there is some plan to make this available in the web UI so that you have a link there and can look at the data.

[01:26:20]Oh yes, and we had a migration from user DB to SUSE ID. So now we are using the new modern authentication system, and there are still people coming and asking why they can't log in, because the authentication system changed. And yes, there are remaining problems. A lot of the code is legacy code which is written mostly in Perl and shell and is really difficult to understand, and because we have no tests for it, any time it's changed something else breaks. We are not using containers because the code is not structured in a way that has easy-to-separate tasks that could be executed separately in a container.

[01:27:18]So that is the end of the overview. The reason why we are using this, and not for example GitLab or the Gitea that we have now, is that there is a significant difference between the features provided by these new CIs and the way we are using the Git repository on kerncvs. Part of it is that it's really old. It's named kerncvs because originally it used CVS, because there was no Git yet. And part of it is that our workflow is very different from most projects for which these new forges are designed. We have a lot of maintainers and a lot of branches that are independently maintained, and the current features of things like GitLab or Gitea aren't working for that. There is an attempt to supplement Gitea with bots to achieve similar results. In OBS, we have this new Git workflow, and you see how easy it is to make it work and how well it went so far.

[01:28:50]So, does anyone have some input on kbuild, any suggestions for improvements, or comments on what works and what doesn't work?

[01:29:13]Yeah. Okay. One of my big wishes for Santa Claus is that kbuild provides the review results on the web page or something like that. Right now it just sends the email and that's forgotten.

[01:29:34]Can we keep the latest result of each branch, each pull request, stored somewhere on the server, so one can see that result again from the web page on kerncvs? >> Yes, there is a feature open to store the build log and make it available through the web interface. It is implemented, but it is not merged yet.

[01:30:07]>> Okay.

[01:30:08]>> And it doesn't provide the full result so far. It only provides the build log.

[01:30:13]So that could be expanded.

[01:30:15]>> Yeah. So I don't necessarily mean the build, but the whole review results.

[01:30:23]>> Yeah.

[01:30:23]>> Yes. The build log is needed for another feature, because we have the problem that if there are new warnings, the build log is only downloaded from OBS, so you get the warning result only after the pull request is merged into the product branch, the product branch is uploaded to OBS, and it builds there.

[01:30:51]The build log is downloaded and then compared to the build log from the previous build result, and then you get the notice about new warnings. This could be accelerated by saving the build logs from the build tests and then comparing those.
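
Editor's sketch: the comparison described above (new warnings are those present in the current build log but absent from the previous one) could look roughly like the following. This is not the kbuild implementation. The warning pattern and the function names `extract_warnings` and `new_warnings` are assumptions made for illustration only.

```python
import re

# Hypothetical pattern for compiler warnings in a build log, for example
# "drivers/foo.c:12:5: warning: unused variable 'x'". Real logs may differ.
WARNING_RE = re.compile(r"^(?P<file>[^:\s]+):\d+(?::\d+)?: warning: (?P<msg>.*)$")


def extract_warnings(log_text):
    """Return the set of (file, message) pairs for all warnings in a log."""
    warnings = set()
    for line in log_text.splitlines():
        match = WARNING_RE.match(line.strip())
        if match:
            # Line numbers are dropped on purpose, so that unrelated edits
            # that shift code up or down do not make an old warning look new.
            warnings.add((match.group("file"), match.group("msg")))
    return warnings


def new_warnings(previous_log, current_log):
    """Warnings present in the current log but not in the previous one."""
    return sorted(extract_warnings(current_log) - extract_warnings(previous_log))
```

With logs saved from the test builds, the same comparison could run before a branch is merged rather than after the OBS build, which is the acceleration the speaker mentions.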

[01:31:08]>> Yeah. Yeah. I I see.

[01:31:14]>> So you mentioned that kbuild creates these expanded kernel trees. We now have rapidquilt, which is super fast. So do we still need to create these expanded trees?

[01:31:27]>> Uh these expanded trees are >> so there are users available to Yeah.

[01:31:34]>> Yes. One thing is that there are users of these trees, and another thing is that rapidquilt only expands one commit, whereas this creates a Git history in which each of the commits in kernel-source is expanded to an expanded tree, connected as a Git history. So you can see how the code changed by applying the changes in kernel-source.

[01:32:03]>> Yeah. It also helps with git bisect. If you want to bisect the changes from one point to another, the expanded tree helps a lot: just git checkout, copy your config, and run the bisect.

[01:32:20]So another thing. First of all, I want to say thank you for doing all this work on kbuild, because it's very helpful.

[01:32:33]Yeah, and I know it's not fun to work on very old code in languages you do not like. But one thing on my wish list: a lot of the checks that are run by kbuild are hard to replicate locally. I would like to be able to run as many of the checks as possible locally. I know it's difficult because you need infrastructure to run some of them. Sometimes I'm in a hurry and I look at the build queue and there are so many people ahead of me, and I need to get my things done. So I don't want to submit, wait for the result, and then see that I missed something. So that's on my wish list. >> The thing is that you don't actually need infrastructure for running the tests. What you need is some well-structured code, so that you can run the one test that you are interested in, in isolation. That's the problem we are facing when trying to move to running the tests in containers: the code is not structured in a way that lends itself to running the tests individually.

[01:33:41]>> Yeah. Yeah. Fair enough.

[01:33:52]One comment on what Patrick just said: I was very happy when I learned to run git-fixes locally, for example. It's just a little thing that made my life a little easier, and it's not impossible, because I did it. So one can look into how to run it. One of the pain points is that you push, and oh, there's one git fix you're missing, and it's not that hard to run git-fixes on your machine, I think. >> There is a git-fixes script in kernel-source which aims at running the same thing that kbuild does, but it doesn't always work, and I'm not the author of the git-fixes tool, so I don't know why it fails. It sometimes works and sometimes doesn't. >> You kind of have to install it, because you need to have the fixes command in git, and that part is separate, I think. So you need to install the git-fixes thing and then you can run the script. >> The git-fixes thing is available from the kernel tools repository in OBS.

[01:35:10]One thing that is on my wish list, and that relates to what Patrick said: you submit something and then you're waiting. There's a queue of submitted branches, and it would be really helpful to see how the queue currently looks. Am I tenth in a row? Can I expect it to finish in an hour's time, or do I have to go and take a break? Especially in situations where you really want a quick turnaround, knowing where you are standing right now would be really helpful. I know that it cannot be really precise, but at least seeing that there are five people ahead of me would be really helpful, if we can display that information somewhere.

[01:35:59]>> Um, yes. It should theoretically be possible to display the queue, but the information is difficult to interpret because the ordering is implicit. There is no first place, second place. There is some algorithm that selects which job to run next, but it doesn't tell you which will run next.

[01:36:37]If I recall correctly, for each currently pending branch there's a display of how long that branch has been queued and waiting to be built and tested. And for me, that is a good enough metric to see how long I have to wait, how many people are ahead of me.

[01:37:00]>> Yeah, there's this pending pending list.

[01:37:04]So, you can look per I think per branch.

[01:37:07]Yes, but I'm not sure, I don't remember: do we prioritize for-next submissions over random branch names?

[01:37:16]>> It is planned, but it is not done. The scheduling works like this: there is a giant scheduling lock, and when the lock is free, the scheduler (it's not really the scheduler, it's a job) collects all the branches that are not yet built and schedules some job for each branch.

[01:37:40]When the job finishes, it schedules another bunch of jobs for each branch.

[01:37:46]And when these finish, it again schedules another bunch of jobs for each branch. And you can see the length of the queue, so you can see how many branches are currently being built.

[01:38:00]But you cannot see how many branches are waiting to be scheduled, because those will be scheduled only once this process, which is scheduling the branches it knows about, finishes and then again reads the branches that are not built yet.

[01:38:20]>> Would it make sense to have a way to bribe the scheduler to prioritize my stuff because I really need it quickly?

[01:38:27]also my stuff.

[01:38:30]>> I mean, a typical situation: last week, for example, we were doing submissions for those emergency updates, and it took really quite a long time to get those built and tested. At that time I know that I'm heading into a submission, but I need to make sure that all the kABI checks are done properly, and that takes some time. So if I had a way to, let's say, send a magic encoded namespace saying that this is really a priority because I need it as soon as possible (of course not subject to abuse, right), something like that would be really helpful, especially in those times when you are under time pressure to submit very quickly. >> It is understandable, and the current scheduler architecture could do it theoretically. But the way it is used prevents that, because you have this giant scheduling lock, and when your branch arrives, it is in the list of branches that the scheduler doesn't know about at all. They are in the list that is waiting to be scheduled in the next batch.

[01:39:52]One question. Do I understand correctly that you don't test one branch at a time, but you try to batch a little bit? Is that what the scheduler is doing?

[01:40:10]>> Well, the idea was to test all branches in parallel. Yeah, I wanted to get there, to kind of explore that thing, because it's not as easy, but it's also not as hard.

[01:40:24]The problem is that there is state which is not available anywhere, only in the process that schedules the individual jobs for each branch, and this state is protected by the global lock. That creates this batched processing: the process learns about the branches that are not built at the time it started,

[01:40:52]then processes all the tests for these branches, then finishes, which frees the global lock, and then it can learn about new branches.
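
Editor's sketch: a toy model of the batching effect described above, compared with scheduling each branch as soon as a worker is free. It is a simplification and not the kbuild scheduler: each branch is a single job given as a hypothetical `(arrival_time, duration)` pair, the list is sorted by arrival time, and the function names are invented for illustration.

```python
import heapq


def continuous_starts(jobs, workers):
    """Start time per job when a job runs as soon as a worker is free."""
    free_at = [0] * workers  # min-heap of times at which each worker is free
    starts = []
    for arrival, duration in jobs:
        start = max(arrival, heapq.heappop(free_at))
        starts.append(start)
        heapq.heappush(free_at, start + duration)
    return starts


def batched_starts(jobs, workers):
    """Start time per job when new arrivals wait for the whole batch to end."""
    starts = [None] * len(jobs)
    now, done = 0, 0
    while done < len(jobs):
        # The lock is free: snapshot every branch that has arrived by now.
        now = max(now, jobs[done][0])
        batch = [i for i in range(done, len(jobs)) if jobs[i][0] <= now]
        free_at = [now] * workers
        batch_end = now
        for i in batch:
            start = heapq.heappop(free_at)
            starts[i] = start
            end = start + jobs[i][1]
            heapq.heappush(free_at, end)
            batch_end = max(batch_end, end)
        # The lock is released only when the slowest job of the batch ends.
        now = batch_end
        done += len(batch)
    return starts
```

For example, with two workers and jobs `[(0, 10), (0, 60), (1, 10), (2, 10)]`, the continuous model starts the last two branches at times 10 and 20, while the batched model starts both at time 60, because they arrived after the snapshot and must wait for the 60-unit job even though one worker is idle from time 10.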

[01:41:05]So maybe we're looking at this from the wrong angle. Many of us use kbuild as our way to test whether we've done everything correctly. Maybe it should just be the test that doesn't let anything broken through. So, like I said previously, I would be very happy if I could run the exact same tests locally. Then that would be what I always do, and when I submit I know it's fine, and what kbuild does is just make sure nothing bad gets through. Then we would get around a lot of these issues, that we need to prioritize, and that it's a shared resource, and so on. True >> maybe. Yeah, you probably wanted I had it.

[01:41:53]Yes, the basis for both of these is to structure the tests so that they can easily be scheduled independently, so that either you can run them locally on your machine, or the scheduler can abolish this giant lock, because the tests are truly independent and, if it learns about new branches, it can schedule more tests for these branches without problem. Yeah.

[01:42:23]>> So essentially what you're saying is, I just want to replicate kbuild on my machine very easily. >> No, exactly. >> Not the whole thing, just the worker side of things. I don't like that. I have a comment, which you probably want to reply to, on the principle. So there is a shared resource, which is the builder, right? And that kind of remark goes on the side of the morally good thing: not to abuse the shared resource, etc. And so I make more effort on my own in order not to kind of wait on the commons. All right. No.

[01:43:06]Right. Let's make the the shared infrastructure more powerful. All right.

[01:43:10]Let's not make it my problem. You see what I mean? Like um I don't like when you start say okay I should I should work harder myself. No, let's make a more powerful central um that that would be I think if I had to design the thing and if I have infinite money, infinite people, infinite art, whatever, I would um that would be the road that I would pursue and not more like guys, you should behave like Yeah, guys, you should run on your computer before running on the No, no, no. I like fire and forget. It's it's great, you know.

[01:43:39]Let's not go go the guilty tree, Pavo. You shouldn't do that.

[01:43:44]>> No, but but it's not about you shouldn't do that. It's about that we use it for a purpose that I just want to see where I'm at. So I throw it away and it comes back and >> very legitimate.

[01:43:54]>> Yeah. Ex. Exactly. But then we need a prioritized way that okay I'm in a hurry. This needs to go in. It's high priority. So we need that. But I mean it's not about it's more about if I run it locally I'm more in control. I know that okay it takes this long. I can plan my day better. and you prefer to.

[01:44:17]>> So, no, no, no. But I but that's fine.

[01:44:19]But it's just that it's it is a preference and I think that preference should be an option.

[01:44:24]>> Yeah, that's what I'm saying.

[01:44:27]>> Yes. And when the tests are easy to separate from the environment, you have this option.

[01:44:34]>> But unfortunately, that's not the case yet.

[01:44:38]>> Patrick is saying that he's testing his work in the middle. He's not done yet, but he just wants to see where he is: will the kbuild checks fail at the state I am in, so that I can fix this before continuing? >> Is that right? Yeah. >> And that's the difference from the final submission, where you have to make sure this goes in, it needs to be correct, and that check is important, but it's a bit of a check. >> So I'm from the QA department. I'm not using kerncvs at all, but your discussion rings a bell, because basically we have a similar situation. We have openQA, which runs tests, and then there are different parties, and sometimes again there is some urgent fix that needs to be released, and the people responsible are saying, "Hey, please run my jobs first, before the others, because this is super important and needs to be done." I just want to elaborate on how it's done in openQA; maybe it would ring some bells for you. Each set of tests has priorities which are predefined, but then there is a web interface where you see that this job has this priority, and a person can change the priority of a certain job at runtime. All users have this power. And yes, there is a side effect that sometimes there is some fight. The default priority is 90, and lower runs first, and some people start to put minus 100, minus 200, to be really first on the list. But this at least gives you the power, when you need to release something urgent, to click the minus button as much as you want. So, just saying. >> Then you can see that someone is always >> Yes. And then you can come to this person and say, "Hey man, why did you do it?"
And sometimes it works, sometimes it doesn't, but at least it gives some mechanism to really speed up your task. >> Maybe you can have a monthly quota, so people can prioritize something two or three times per month.
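
Editor's sketch: the priority scheme described for openQA (a default priority of 90, lower values run first, and any user can change a job's priority at runtime) can be illustrated with a small priority queue. This is not openQA code. The class `JobQueue` and its methods are hypothetical names, and only the default value 90 and the "lower runs first" rule come from the talk.

```python
import heapq
import itertools

DEFAULT_PRIORITY = 90  # value mentioned in the talk; lower runs first


class JobQueue:
    """Toy priority queue with runtime re-prioritization (illustration only)."""

    def __init__(self):
        self._heap = []
        self._entries = {}  # job name -> its live heap entry
        self._counter = itertools.count()  # FIFO order among equal priorities

    def submit(self, name, priority=DEFAULT_PRIORITY):
        entry = [priority, next(self._counter), name, True]
        self._entries[name] = entry
        heapq.heappush(self._heap, entry)

    def reprioritize(self, name, priority):
        # Mark the old entry as stale instead of searching the heap for it,
        # then queue the job again with the new priority.
        self._entries[name][3] = False
        self.submit(name, priority)

    def pop(self):
        while self._heap:
            _priority, _order, name, live = heapq.heappop(self._heap)
            if live:
                del self._entries[name]
                return name
        raise IndexError("queue is empty")
```

If jobs `a`, `b` and `c` are submitted with the default priority and someone then calls `reprioritize("c", -100)`, the queue hands out `c`, `a`, `b`. A quota or credit system like the one suggested in the discussion would be a check added inside `reprioritize`.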

[01:47:25]>> Yeah. Partially as a joke, but partially seriously, there were discussions about creating some kind of currency: you get some credits which you can use to speed up your job, but it's a limited resource. Currently you can do it endlessly, really.

[01:47:45]>> Okay.

[01:47:47]>> It would be nice if that can be solved as well.

[01:47:50]I mean, that's a social problem, and technical solutions for social problems rarely work, really. So I think that the best way forward is just to make sure that people talk to each other and can agree on something. I think that we are mostly reasonable people, so that shouldn't be a major problem. But having a priority, an "I really need something right now", would be really great if that was a planned feature for the future. I'm not saying we need to have it right now, but >> Yes, it's a planned feature, and as of now we encountered the problem that it cannot be easily implemented because of this structure of the code that causes the jobs to be executed in batches, which means that you can set the priority to minus 1,000, but you still will not be scheduled until the current batch ends.

[01:48:57]We don't have priority at this moment, but even if we did implement it, it would have limited results.

[01:49:08]>> Speaking of the local build, I think that most of it can be done by just running the osc wrapper build locally. What's missing is git-fixes and also the patch metadata checks, but the kABI should be checked by the standard build process, and, well, build compile errors of course would be caught there, and I don't know what else is missing. >> Would that be the correct place to insert the rest of the checks? >> Patch metadata checks are already done by pre-commit checks that are duplicated between kernel-source and the kbuild checks. I don't know. Yeah. >> There are the patch metadata checks that are checking the headers. These are duplicated. And there is the review for kABI fixes, which is only done on kerncvs.

[01:50:25]>> Yeah. Then that means that if we run the osc wrapper build, most of the things can be checked by that.

[01:50:33]>> Yeah. And just git-fixes is missing, but it's pretty easy to run locally.

[01:50:39]>> Yeah. One practical problem with the osc build is that it takes significantly longer than the build itself, and the other is that the packaging part, in other words the difference between the build as such and the package build, does not parallelize nearly as much as the build itself. So you do not get as much benefit from running it on a strong machine like Kunon or >> This is why people rather avoid that way, but it's a way to check the build locally.

[01:51:22]I also tried to create a container for building the kernel locally. It's in a branch on kerncvs, in the scripts: there is a scripts branch that is not merged that has this. The problem is that there is fairly complex logic for selecting the compiler for non-x86 builds, which is not implemented. But if you want to build x86, you can merge this branch and run the build in a container, without the packaging stuff.

[01:52:11]>> One comment regarding the kABI checker.

[01:52:13]The kABI checker can be run locally, but of course you need to have a build first before you can run the kABI checker.

[01:52:22]It can be run locally, but it's complex to run, so it's difficult to do it correctly, and we would need some script that has the build result somewhere and does the correct invocation.

[01:52:39]So it could be piggybacked on top of this, because the result of the build is the built kernel, and then you have the input to the kABI checker.

[01:52:55]Do you have more material? I wanted to comment on the testing, something from a different project from my past job, like 15 years ago, and it's relevant. Okay. So, same stuff: you have your code, you send it, and then you have to wait because there is an implicit order. You could say, "Oh, I test them all in parallel at the same time." But then you say, "Uh-huh, that's not exactly the same, because you are testing the branch from that guy, the branch from that guy, the branch from that guy." But then you have to merge them together, and have you tested the final merge? So parallel testing is generally dismissed, and I think it's dismissed by us as well, I believe, or not? >> We don't test merging all the branches together. We only test merging each of the branches to the product branch. >> Yeah, but one after the other, right? In a sequence, right? You have >> No, it's independent.

[01:54:10]>> Independent. So we we could test n at the same time.

[01:54:15]>> Only limit is hardware. Is that right?

[01:54:18]>> Yes, you could test all at the same time. If you had a large cluster, you could schedule all the tests.

[01:54:25]>> So we don't have a large cluster. This is the problem. Why are we waiting?

[01:54:29]Because we don't have enough machines. Am I catching that right?

[01:54:36]>> One problem is that the number of machines is finite. I think we have about eight now. But the other problem is that >> So that's at least eight people that can be tested at the same time.

[01:54:49]>> No, you can test each branch for four platforms. So it's two people that can be tested at the same time.
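
Editor's sketch: the arithmetic behind this answer, written out. The figures of eight machines and four platforms per branch come from the discussion. The function names, and the assumption that every job takes equally long, are illustrative only.

```python
def concurrent_branches(machines, platforms_per_branch):
    """How many branches can be tested at once if each needs one machine per platform."""
    return machines // platforms_per_branch


def rounds_needed(pending_branches, machines, platforms_per_branch):
    """Rounds of equally long jobs needed to test all pending branches."""
    jobs = pending_branches * platforms_per_branch
    return -(-jobs // machines)  # ceiling division
```

With eight machines and four platforms, `concurrent_branches(8, 4)` gives the two branches mentioned above, and five pending branches would need `rounds_needed(5, 8, 4)`, that is three rounds, under the equal-duration assumption.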

[01:54:55]>> Right. So there is a multiplying factor, which is the number of... That's right. So instead of eight, if we had eight... machines, that's the only limiting aspect, right? >> That's one limiting aspect. The other limiting aspect is that if there is one test that is taking a long time (and it can happen that, because of a bug in a test, one of the branches gets stuck for a long time), then because it's done in batches, this one long-running test blocks scheduling the next batch.

[01:55:33]>> Right? This batch is multiple people, multiple branches, multiple PR. Is that is that what a batch is?

[01:55:41]>> The batch is uh any branches that were not tested at the time that the script started.

[01:55:49]>> Okay.

[01:55:50]>> It can be one, it can be 10.

[01:55:53]>> Okay. Yeah.

[01:55:55]>> Right now, I didn't I didn't catch that.

[01:55:58]We don't care about this sequence aspect. We test the PR as it is. If it's good, then we do the merging. Okay, that part I didn't catch. So indeed our bottleneck is hardware, I believe, plus this one test taking longer, but I see the hardware as more of the >> Both are important. The batched scheduling is really wasteful.

[01:56:27]If we break down the global lock, we could make much better use of the hardware we have.

[01:56:35]>> But of course if you have more hardware you get more power.

[01:56:39]>> I see. I was trying to bypass the problem of priority, of saying "I go first, you go first". How about it's fast and it doesn't matter who goes first? I was trying to think in that direction, but it's not really possible. >> So that job scheduler that we are talking about seems to be a great limitation. Is that something that we have developed internally and are now hitting all those limits? I mean, there must be thousands of job schedulers implemented all over the place, right? So is that something that we are just maintaining because it was written some years ago and we are just keeping it rolling, or is it something that we are maintaining because it's something that can run our code, and there are thousands of schedulers out there but they cannot run our code?

[01:57:41]>> Yeah. And and when we are saying that it's running our code, what do you mean exactly by that?

[01:57:47]It's the idiosyncratic test that does a merge in the middle and, based on the result of the merge, decides which path it takes in the remaining half of the test, and stores this information in memory and never writes it down anywhere. So you need to protect this state, which is only held in memory, with the global lock, which then prevents scheduling other tasks and so on.

[01:58:20]>> The test that the test needs to know.

[01:58:25]>> So what you're saying is that it's not just a scheduler. >> I would say that the scheduler is fine. The problem is that we have a test that is using it in a complex and inefficient way.

[01:58:40]Okay. Yeah. But to me, it seems like we're hitting a major scalability problem by using the wrong technology for something that we desperately need to be more efficient. So maybe, rather than keeping up with something that doesn't really work, just change it for something else and make it just schedule. >> You say "just" a lot. Those are difficult things, but I understand that it's not just very hard. Well, I guess before going down that path, we should probably first investigate how much of a problem it actually is. Which probably means: do we have some data telling us how often, due to this inefficient design, it happens that one of the builders is idle when there are branches waiting?

[01:59:33]Jeff B >> because if it's say 5% 10% it's probably not much of a problem. If it's 30 40 well that's a different story.

[01:59:42]>> Usually you just wait at moments when it's the least helpful to be waiting. And also, if we are spending quite some money on hardware and we are not really using all of the capacity, then it's just wasted money.

[02:00:02]I can't resist but do the quote from Jeff Bezos.

[02:00:06]If you have data and anecdotes and they don't agree, the data is wrong and the anecdotes are right. I mean, people are waiting on the build, and that's the last thing you want. If people are upset about that, that's enough, I think, to now let's check if >> Sure, but what I'm saying is that it's two different things. One is whether we could do a better job ordering the tests, running the tests, and the other is whether this particular detail, that some of the builders might be idle even if there are branches waiting to be tested, is really so much of a problem. >> Like we don't >> I mean, if it only happens in theory, or maybe 5% of the time in practice, that's different from it being, say, 30% of the time. Yeah.

[02:01:05]>> We don't have the data on what's happening in the build system. So we can't say, "This person complained that they were waiting for a long time, and this was happening, and it was caused by this." We don't have this data. We have anecdotes of people being annoyed by things taking too long, but we don't have insight into what the cause was at that specific time. But we have some other anecdotes, from inside the system, that yes, it can happen that, because of some failure, the use of the available resources is inefficient.

[02:01:52]>> But we also have to understand that there will be congestion points that days before um a milestone or submission deadline a lot more people will submit.

[02:02:03]Nobody will submit at night, because people don't work at night, hopefully. So it's not an easy problem. We will waste resources if we want to keep the queue low at congested points.

[02:02:24]>> I think I can imagine this working even with keeping the design processing by batches.

[02:02:32]uh what you need to do is tune the size of a batch the maximum size of a batch so that you reach a desired scheduling delay. Uh so there's this problem you would have to perhaps deal with tests getting stuck as you mentioned. Uh but that's that's another problem. Let's say let's assume most of the time it works.

[02:02:56]You'd have to deal with tests getting stuck. And the third problem is that there might be some builders or some architectures that are slower than others. I can imagine PowerPC, for example, might be slower in compiling and testing a kernel. Then the x86 machines would be idle for a long time while Power runs for twice as long. So >> Using cross compile.

[02:03:23]>> So I'm not sure. Okay. Okay. That's another factor if it's cross compiled.

[02:03:28]So if there were some weaker machines, we might resolve this by adding more of the weaker machines to the pool, so that there are more workers for that architecture. That would somehow balance the worker pool. And then tune the size of the batch to reach a desirable scheduling delay, let's say 20 minutes. That would be the average scheduling delay you get, or the maximum scheduling delay you get before your priority job starts running. And last, there's the piece that prioritizes certain submissions to certain >> We are at the end, so if you want to discuss this more, we can move outside. Okay, there is the thing that every switch of batches introduces inefficiency, because you need to wait for the last job to finish while the other workers are idle.

[02:04:35]>> Right. Right. That's understood. Yes.

[02:04:37]Okay. Okay.

[02:04:38]>> Great.

[04:31:30]Yay! Everyone here? Great. So I'm Marcella. I'm product owner of this project and it's working pretty well in my opinion. And this is my only engineer.

[04:31:41]>> Yes, I'm Enno. For those of you who weren't present on Monday, I do both upstream and, on the ops side, also downstream. I have done this for a couple of years now, and it's now the third time that we have this presentation. So yeah, welcome to our yearly roadmap update. >> And let's wake up after lunch. So who has ever logged into Orthos and borrowed a machine?

[04:32:07]Good. Great.

[04:32:09]Who found a working machine?

[04:32:13]Still good. Still good. Okay, we can continue.

[04:32:18]Um, so I'm product owner and I really wanted to know what people want more.

[04:32:24]So I sent you a survey sometime in November, and I was really hoping that I could decide whether it's more important to have working machines or to do more features.

[04:32:37]It was 50/50, but people were more interested in features. So, here we are.

[04:32:42]So you were mainly asking for specific hardware. You missed the latest release, of course, because it was November and 16.0 was almost out. So yes, you complained that machines are not always installable automatically, which can happen, and that manual installation can be hard, which again depends on the hardware. Of the hardware issues, you mostly complained that machines are not working, but this is a really wide topic; there could be different reasons. And then, in case everything was working great and Enno was bored, we asked for a wish list. So: make BMC access reliable. That's a really fun one, so that's at the end of the list.

[04:33:28]And if your machines are working and accessible, that's a nice goal. But if we also have machine owners here, it would be good if you check that the machine is at least green and log in from time to time to see if it's still working, because they stop working. Ensure the console works. I don't know, that's a hard one. >> Well, with the next release, we will have a button, at least for the machines that use IPMI, that resets the serial console. So if you get one of those dreadful "SOL already enabled" things, with the next release you will get a button in the web UI to say "now fix it", and then you will go back to the terminal and it will be magically fixed, because I found out how to automatically reset this SOL session.

[04:34:14]So that's a little bit of quality of life. But there will be scenarios, especially with those pesky IPv4/IPv6 things, because we have a lot of beta firmware where IPv6 is not working, where it will still be broken. In the long term there are planned fixes for that as well.

[04:34:30]>> So not "ensure", but a magical button. Ensure that CPU ID works: we will be talking about that. Provide current pre-releases and releases: well, it depends. We don't have that much space to store everything, but at least it should be done much faster.

[04:34:47]Machines should auto-reprovision before being offered for reservation. So that's happening. >> And it will happen again after I have fixed the bug. >> And reliable installation, of course, but with keys and credentials. So that's in the middle of the list. Restrict the number of reservations and long-term reservations. So if you can't borrow a machine and see that someone has had it for a year, that means he's the owner of the machine and doesn't want to give it up. You can try and ask personally, but it probably won't work.

[04:35:27]So, something from our survey.

[04:35:30]I was curious why you are borrowing hardware. It's mainly for reproducing bugs and for special hardware, and you can see there is some team access. Those are the machines that are borrowed for a really long time. And I forgot to say that 70 people responded, which I guess is quite good for SUSE.

[04:35:56]How hard is it to find the hardware you want? Well, it's hard to find functional hardware, but again, we would appreciate help with fixing the hardware.

[04:36:09]And now it got stuck.

[04:36:14]>> And how hard is it to find the desired hardware? People often complain that they can't find specific CPU parameters or network cards.

[04:36:28]Yeah, we are working on that. And many times you responded that the hardware is there but reserved, or there but not working. Not great.

[04:36:38]And what do you wish to install? Most people were happy. A lot of people were complaining about pre-releases, but that really hit at a bad time. >> But we're also working on that. At the moment I'm working on other features, but from the priorities we have right now, the situation with regard to pre-releases will also improve. It's a lot of automation, though, and the network segregation we have means that not all images are available on the different mirrors at all times. So there's a lot of sanity checking I need to do so I don't have random error messages kicking around in the back end. It's one of those things that just takes time to get right.

[04:37:28]>> So, what was done? I was hoping for many more features, because I'm probably optimistic or whatever. A new Orthos 2 was released, version 1.6, which you probably shouldn't care about, and there's a new Django, which you probably also don't care about, but that actually fixed some bugs on the web pages, so that's good, and we don't have to care about it anymore. And there is a NetBox to Orthos synchronization. Now, demo please.

[04:37:58]>> Yes, I can demo that. Let's see if... yeah, that works. So if we go to Orthos... >> Both demo? >> Huh? No, no, no. We start with production. Oh, wait. Can you hold on a second?

[04:38:21]Yeah, obviously we switched to SUSE ID. Yeah.

[04:38:29]>> Who did that?

[04:38:30]>> Don't know. Ask the admin.

[04:38:39]>> Yeah.

[04:38:40]>> Nothing like a live demo, right?

[04:38:42]>> Yeah. Nothing like a demo. So this is a view that you normally won't see, but for me it's a very interesting one. The issue in the beginning was that when we started to compare data manually between NetBox and Orthos, we were like, hm, it looks mostly identical, but this one field looks weird. So I thought it would be good not to lose the data in the Orthos database completely in case we overwrite it. Essentially, we just need a change log, and the change log is now shown in this format. As an administrator, or a power user with administrative privileges, you can see a comparison of what was saved in Orthos before, what was saved in NetBox, and whether the data is equal. You have this for all objects that we're currently syncing, which is not everything, because some data is still saved exclusively in Orthos. The most famous example is credentials for BMCs, but there is also other data that is not yet in NetBox. Of course, with every release we will try to increase the amount of data that we just pull down from NetBox. The idea, and this is also in the user documentation, is that if you want to register a machine with Orthos, we tell you to just create a NetBox entry. As an administrator I then have a very magic "add machine" thing where I give it the NetBox ID, tell it what system it is, say import, and then it works magically. That's already working for quite some machines, but not for all of them, because there are bugs that I still need to solve. But it's on a good way and it saves a lot of effort, because now you just need to enter data in NetBox, and Orthos is becoming a secondary system that's just acting behind NetBox. And we are back to the presentation.
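
The comparison view described here (old Orthos value, NetBox value, and whether they are equal) can be sketched in a few lines of Python. This is only an illustration of the idea, not the Orthos implementation: the `FieldDiff` type, the `compare_records` helper, the field names and the sample values are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class FieldDiff:
    """One row of the comparison view (hypothetical structure)."""
    field: str
    orthos_value: object
    netbox_value: object
    equal: bool


def compare_records(orthos: dict, netbox: dict) -> list[FieldDiff]:
    """Compare every field found in either record and keep both values,
    so nothing from the old record is lost when it is overwritten."""
    rows = []
    for field in sorted(orthos.keys() | netbox.keys()):
        old = orthos.get(field)
        new = netbox.get(field)
        rows.append(FieldDiff(field, old, new, old == new))
    return rows


# Hypothetical sample data, not taken from a real machine.
orthos_record = {"fqdn": "host1.example.com", "cpu_cores": 16, "ram_gb": 64}
netbox_record = {"fqdn": "host1.example.com", "cpu_cores": 32, "ram_gb": 64}

changelog = compare_records(orthos_record, netbox_record)
for row in changelog:
    marker = "==" if row.equal else "!="
    print(f"{row.field}: {row.orthos_value!r} {marker} {row.netbox_value!r}")
```

Keeping both values in each row is what makes the view usable as a change log: a reviewer can see what would be overwritten before trusting the NetBox side.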

[04:40:51]Yeah.

[04:40:52]>> Yeah. So those were the accomplishments, and there were many more things to do, but there was infrastructure work. As usual, something happened, some unscheduled work. So, the transfer from Salt to Ansible: well, I didn't like it, because I would like to use the existing infrastructure, but we were waiting for approvals for pull requests, sometimes for days, and it really got in our way. So now Enno owns Ansible and wrote the Ansible scripts. That's one thing. We moved away from dist because we are hearing that it will eventually go away. So we are away.

[04:41:36]>> Well, to not cause panic: not the complete dist will go away. Just the directory with the unpacked installation images will go away, which is /install/SLP, because the BuildOps team told me in the office in Nuremberg that it's a directory they populate manually, and from their usage logs there was not a lot of usage visible. So they were like, okay, we do this on demand now. And this was yet another thing thanks to the network separation, especially in Nuremberg (hello Yas, wonderfully helpful as always): the Cobbler server that is serving all the images could not access that, and we had timeouts, because some of the firmware was very latency sensitive, so it would just run into random timeouts.

[04:42:25]Then we switched to local ISOs that we mounted on the server, and everything was magically working. We didn't want to take TCP dumps until eternity, so we just moved away from mounting NFS over WAN and have moved to Terraform, which I talked about in depth on Monday. Another thing was the SUSE ID migration. We knew about Authentik for about a year and we were preparing, because it was a little scary. One thing is sure: we could enable serial console access even for people like me who joined recently, but you lost access to NFS homes. Nothing to do about it, that was decided a long time ago. What next? Fixing hardware. That's always a problem, because sometimes people just come and say, nothing is working, I can't find a machine, Enno, fix it. >> And so I did. The major thing I want to highlight here is a shout-out to Yas, because last year in October we were busy for, I think, one and a half months upgrading the HPE blade center to the new management blade. The blade center is essentially a machine that you can put other machines into, and it has a management server, and that was still on the old, outdated, unsupported version.

[04:43:47]We'd had the new, supported version lying around for like two years already, but no one had the time to fix it and install the new version, because it was of course a breaking change. The main reason this was not done was that IPv6 wasn't fully working. So we had those lovely flaky "connection works, connection doesn't" things with the management blade. And after some very in-depth investigations, with Yasik again here.

[04:44:13]Thank you. We managed to upgrade both the software and the hardware stack for the management UI and enabled IPv6 for all blades that are installed in the blade center.

[04:44:26]Yeah, doing that all at once was funny, because one thing was affecting the other, and of course the management blade also had firmware updates for the internal switches and stuff like that. So it was a long road, but we got it done, and again that took a long time.

[04:44:44]Yeah, one and a half months.

[04:44:48]>> Okay, so that's why we got stuck on some features. Then fixing network issues many, many times, and many other hardware issues, and the funniest was the unplanned migration to Prague 3, right?

[04:45:05]>> Yeah. Essentially the idea was: we learned from Prague 2, and we migrate to Prague 3 with a lot of planning up front, all the documentation and everything. I said, I need this. Then I came back from Christmas vacation and we were like, yeah, next week we migrate to Prague 3. I'm like, where is my Cobbler VM? What Cobbler VM? I see. Okay. So then we improvised, and again Yasik and Latos helped a lot and we worked around everything. Then, after some time, Peter approached me and was like, hey Enno, my machines don't get IPv6. I was like, hm, weird, let me check the logs, but for the love of God I couldn't see anything. So again, Yas, Peter and myself went debugging, and Yas and Peter discovered that there is a bug in the top-of-rack switches, and there is a pending firmware fix for them that will hopefully be rolled out by the end of this month.

[04:46:04]>> So, sorry: months without IPv6.

[04:46:09]>> Yes.

[04:46:11]>> And that's how we are working here.

[04:46:13]Yeah, we have a workaround, so it's working now, but it won't be working properly until next month.

[04:46:20]>> Yeah, everything's cool. So right now the Nazara project is being worked on, and the next release of Orthos, which should happen after you are back, right?

[04:46:32]So the main theme for this Orthos update is fixing the searches. Everyone will be happy about that. And we are removing the client. Not happy about that, right?

[04:46:43]>> No, exactly.

[04:46:44]>> But it's happening.

[04:46:45]>> Yeah, that would be a question.

[04:46:52]>> I'm product owner. I can say no. But yes, ask your question.

[04:46:55]>> Sure. The question is: I learned about this direction the project is going in, of removing the command line client. I'm Giovanni, from the performance team. We rely on that client for booking machines, so that Orthos knows who has the machine, of course. We use the client in a programmatic way: we had our own little script. And without the client we still want to do things in a programmatic way, right?

[04:47:33]So what I'm getting at right now: what did we do? Following the prescription of the project, we use the REST API, and because of that I essentially wrote my own client. While I was doing that, I was thinking that it won't end well, because while writing my client I was looking at your client in order to do the same. But as your client goes away, then when the REST API starts doing something else, and so on, I do not have a reference. Then I thought, what is the reference? Maybe the website. So I went to the website to look: does the website call the REST API somewhere? Exactly. So I'm confused.

[04:48:25]>> Yes.

[04:48:25]>> I understand the point. The issue is that Orthos is based on Django, meaning the web UI is rendered on the server side, so the web UI is definitely not using the REST API. The REST API, which was created a long, long time ago, is not following any reasonable standard whatsoever that you would expect as a developer working with REST APIs. To fix that, I would need to redevelop both the REST API and the command line client, and at the same time do the operations work for Orthos, implement the features and everything. So, I understand it, but there's another solution, which is just leaving the client where it is, you know. >> Seriously, that's not a solution, because it will break with 1.7. This is why it is being removed. This is why it was marked deprecated with 1.6, and this is why it will be removed with 1.7: one of the side effects of the search fixing was a lot of work underneath, in the back end, and that meant I would need to invest like three weeks in CLI compatibility just to give you the same search results.

[04:49:47]>> Okay. And... >> Ah, because you are making 1.7. Okay, I see. And this is why it will go away with 1.7. You can still use the client from 1.6, and it might happen to work, or it might not. And if you find a bug in the REST API, it will be a very high priority for me to fix it. But I... >> I don't want to find a bug, you know.

[04:50:06]>> Yeah. I also don't want you to find bugs. >> Like the mushrooms that I go to look for in the forest. I don't like bugs, you know.

[04:50:14]>> Yeah. But, yeah... >> But anyway, what I appreciate from this exchange is that you are trying to understand the other point of view, and I appreciate that there is an exchange where we both try to put ourselves in each other's >> shoes, you know. >> The idea is for normal users to use the web UI as much as possible. And if you have automation needs, I will supply whatever endpoints you need on an ad hoc basis. They will not be team exclusive, so there won't be a /api/performance-team; it will be a generic API, but I will add the endpoints that you need. Again, the web UI is sadly not using the REST API, so every REST endpoint I add, I add because someone has automation depending on it. So if you have needs, I'm very happy to fulfill them. And again, the command line client was never built for automation, so the fact that it was working well with automation was an incredible feat, a very happy coincidence for you. But it was never built with automation in mind. And now, to >> finish, the features currently underway.

[04:51:34]Cobbler 4.0.0 is what I'm currently trying to stabilize on the side, next to getting 1.7 out of the door, because, as you know, Agama is at the moment not installable headless, that is, without touching the installer yourself. That is why I will be working exclusively on Cobbler 4.0.0 after 1.7 is out.

[04:52:01]And that's also what's not yet deployed. So Agama support is still work in progress, as the highest priority, and, because it essentially came for free along the way, cloud-init support. We also have images available that do cloud-init, and that will come with the new Cobbler version 4.0.0 as well.

[04:52:23]>> Yeah. Hopefully finally nothing breaks and you can finish it up. Uh so what's next?

[04:52:30]SUSE ID migration. We have only 10 minutes. So these things were done, and now it's time for another demo, because lots of people complained a lot about putting their password, their SUSE ID password, into Orthos.

[04:52:48]So this is resolved in staging.

[04:52:52]>> Demo done.

[04:52:53]>> Yeah, demo done. It just works, after the effort that I put in, and thanks to IT. I was literally able to set it up yesterday afternoon at like 5:00 p.m., so this is fresh from Git. But production will take a while, because in the back end there are some issues that I need to fix, of course because of Python version inconsistencies between Tumbleweed and SLES. But that's not an issue, that's just work I need to do. Okay, so that's SUSE ID. That was a lot of fun, because I was cooperating with, what is he called? >> Jose. >> With Jose the whole year, looking at what is changing and how we will be working with it. So it was really helpful also for the rest of us who moved to SUSE ID later on. The Nazara project: that is important for machine owners, especially if they have a lot of machines, and it was started by our interns, I guess.

[04:54:04]So it's an application that should create or update NetBox devices and virtual machines, and it's executed manually. So you don't have to be afraid that you spend hours with NetBox and then everything is gone. It should be safe.

[04:54:20]>> Yes. It was started, as she rightly said, by interns and apprentices. We don't have a demo, because technically, if it works, nothing changes, and if it doesn't, obviously the data will be lost. That's why we only allow manual execution. I have, however, automated as much as possible, so that the only thing you need to do as a machine owner is add your NetBox machine or virtual machine ID to the configuration file, and then you can just run the tool. Once everything satisfies my quality desires, we will demo this in the monthly Orthos Zoom call that we have with all power users and stakeholders who are interested in the monthly update. And again, this is nothing that we will rush. It is developed outside of SUSE, so it is not something SUSE-specific; however, we have SUSE-specific integrations, so custom fields and everything specific to our SUSE NetBox instance are accessible. And yeah, that's underway.

[04:55:30]So this is a rough roadmap. I stopped planning, because everything is breaking all the time. But we should provide Agama and cloud-init; that should finally be done soon. Next, Enno would like to put Orthos into containers so we can deploy quickly and safely whenever he wants. And it should be easier to deploy new images.

[04:55:56]Well, even betas and RCs, which is not done very often. It's taking too much time.

[04:56:06]And we will move even more data into NetBox. The idea is that most of it will be there and only some of it will be visible in Orthos. We will see; that's far away, because definitely many things will get broken and he will have to fix many machines. It's just happening. So if you need help, you can contact us on help, and someone will probably give you advice or fix the machine. And that's it.

[04:56:48]Just as a preface: if we don't get the questions done in time, because we only have five minutes left, we will of course be able to continue this offline. The only thing I have to do today is drive home, so we will have plenty of time for discussions after the talk. >> So my question is regarding usage of the Orthos machines. Do we have any insight into how many of the machines sitting in data centers are being actively used, or have been broken for a year, or stats like that? >> Hard stats that you can check out yourself? No. What I can see at the moment is that we have roughly 150 machines permanently reserved, out of currently roughly 630.

[04:57:46]So that's that. Coincidentally, according to the Orthos monitoring that I'm checking periodically, those permanently reserved machines are the ones in the best state: they can be pinged on both v4 and v6, they have SSH access, and Orthos can scan them. The machines that are in the pool and unreserved at the moment are machines that are flaky most of the time, due to reason X, Y or Z. I don't have data on that at all. I'm trying to switch the availability monitoring to Grafana and Prometheus so we can have long history graphs and stuff like that, but from the side of priorities that is not very high at the moment. So I can't give you a dashboard where you can check that out yourself. I have a prototype that's technically working, but integrating it into Orthos is not yet done. >> Okay, so there is monitoring; it's not generally available for everybody, but there is some insight. I'm mostly asking with a clear intention here, because idle machines are expensive machines.

[04:58:58]They contribute to our power budget, which is really low, and that prevents installation of new hardware.

[04:59:05]We are hearing a lot of pushback from IT on getting new machines. So is there a time, or will there be a time, to do a major cleanup and get rid of the old rusty hardware that nobody has been using for some time?

[04:59:20]>> I am operating under the assumption that the firmware of the machines in the Orthos pool is broken, because many of the machines in the Orthos pool are Intel SDP machines or prototypes from our partners, meaning that we have no option to reliably shut off or turn on machines. We have issues regardless of whether we take a power bar that turns the power at the wall on or off,

[04:59:50]or if we use a BMC. Both methods are not reliable. This is why Orthos, when you click reboot, tries SSH, then the BMC, and then a PDU, in that order, because there's no reliable way to actually ensure that the machine is being restarted. We do have machines, especially in the Prague data center, that work reliably. There, a cleanup like this would be easier. But I would like to see this cleanup being done by the machine owners, because I provide the infrastructure and the framework for you to PXE boot and to install, but I don't have a political say about whether this machine is now being shut off or turned on. If you want to automate your own machines, feel free. But at the moment, at least with the bandwidth that I have, I don't see a larger option from my side to go over all machines and turn them on and off depending on usage.
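
The reboot order mentioned here (SSH first, then the BMC, then a PDU) is a plain fallback chain. The following Python sketch shows that pattern under stated assumptions: `reboot_with_fallback` and the three stub functions are hypothetical names invented for this illustration and do not come from the Orthos code base.

```python
from typing import Callable, Optional


def reboot_with_fallback(methods: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Try each reboot method in order; return the name of the first one
    that reports success, or None if every method failed."""
    for name, attempt in methods:
        try:
            if attempt():
                return name
        except Exception:
            # An unreachable BMC or PDU counts as a failed attempt.
            continue
    return None


# Hypothetical stand-ins for the real transports.
def reboot_via_ssh() -> bool:
    return False  # e.g. the host no longer answers on SSH


def reboot_via_bmc() -> bool:
    raise TimeoutError("BMC did not answer")


def reboot_via_pdu() -> bool:
    return True  # power-cycled at the outlet


used = reboot_with_fallback(
    [("ssh", reboot_via_ssh), ("bmc", reboot_via_bmc), ("pdu", reboot_via_pdu)]
)
print(used)
```

The least disruptive method goes first and the power cut goes last; a return value of `None` corresponds to the case the speaker describes, where no method can be trusted to restart the machine.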

[05:00:57]Um, is that a satisfactory answer or detailed enough?

[05:01:02]>> Good. Okay. And I think we have now run out of time, so we take it offline.

[05:01:08]>> Okay, that's fine. Good. Okay. Thanks everybody and see you around.

[05:04:13]Hello, hurry please, the session has started already. This is a discussion session. Microphone test.

[06:14:14]>> Microphone test.

[06:14:16]>> Does it work? Oh, yeah. I think it works.

[06:14:20]>> And one more note: don't go next to the speakers, because then you get feedback. >> And when you see the red light on the camera, >> we are actually recording and streaming.

[06:14:31]>> Yeah. like in DB.

[06:14:33]>> I see. I think it still doesn't work.

[06:14:36]So, can I use yours?

[06:14:38]>> Yeah. Yeah, sure. So, how about you?

[06:14:40]Sure.

[06:14:41]>> Yeah, just slice to >> Yeah, >> I shared it to you.

[06:14:59]>> So, does it work? Where can I find it? >> I sent it to you on Slack, not by mail. >> No, with Slack. >> If you don't have Slack, then I will send it to you through another channel. >> Embarrassing, the first time with the new setup. So it's not... >> Like... >> Yeah.

[06:15:50]>> Oh, what should Okay, let me make it public.

[06:15:59]done.

[06:16:06]>> And >> Mhm.

[06:16:11]>> All right. So, let me >> So, so, uh, yeah, I'll move this >> to the other screen. So, yeah. Uh, >> put this on there.

[06:16:22]>> Yeah, >> I'll just put it away.

[06:16:25]>> And then Okay, >> this is a long long is a long session 55 minutes >> and you 10 minutes left Mhm.

[06:16:57]>> 55 minutes.

[06:17:01]>> Is it great? Uh >> try.

[06:17:04]>> Uh I think it works. So yeah, >> make it larger.

[06:17:12]>> Yeah.

[06:17:14]So we'll do it myself.

[06:17:19]>> Yeah.

[06:17:21]Thanks.

[06:17:39]Sorry.

[06:17:50]>> So that's it.

[06:18:00]So is it running? Okay.

[06:18:04]>> Okay.

[06:18:06]So welcome. Today we're going to look into Linux kernel tracing internals. I'm going to talk about how the kernel event sources work and their execution paths. And finally I'll talk about BPF, which works on top of these event sources. So let's begin.

[06:18:25]I'm Huan Lee from Seoul, South Korea, and I work as a BPF system engineer in a hardware enablement type A team, working especially on Linux kernel BPF.

[06:18:37]Before we dive into the internals, let's start with a basic question: what is tracing? Simply put, tracing is observing what the kernel is actually doing.

[06:18:53]So, what can we use for Linux kernel tracing?

[06:18:58]As you might already know, we commonly use these major event sources: kprobe, ftrace, tracepoint and perf event. We can roughly categorize them as dynamic instrumentation, static instrumentation and sampling.

[06:19:19]And on top of these event sources we have BPF, the programmable layer that works on top of them. We'll come back to this later.

[06:19:32]In this talk, I'll walk through dynamic instrumentation with kprobe and ftrace, then static instrumentation with tracepoints, then sampling with perf events, and finally BPF. The scope of this talk is x86-64 only, and only the basic execution path of these tracing events.

[06:19:55]First, let's clarify one concept: instrumentation.

[06:20:01]Instrumentation means inserting an observation point into a program's execution. When the program reaches that point, we collect data at that exact moment and execution continues. This is the basic model behind most tracing mechanisms. The most primitive form of instrumentation is printk: we put a printk in the code, print something, and continue. It is pretty useful for debugging, but as a tracing mechanism printk has a problem: it does not scale. In particular, it is too expensive on hot paths, and worse, it perturbs the timing-sensitive behavior that we are trying to observe. And of course, if it is not already there, we need to change the kernel source and rebuild the kernel.

[06:21:00]So we need better instrumentation. It should have lower overhead, it should perturb less, and we want to enable or disable it at runtime without changing and rebuilding the kernel.

[06:21:15]For instrumentation-based tracing we have two practical choices: either we insert an observation point at runtime, which is dynamic instrumentation, or we use an observation point inserted at build time, which is static instrumentation.

[06:21:32]So let's start with dynamic instrumentation.

[06:21:37]The first one is the exception-based kprobe.

[06:21:40]On x86-64, the classic kprobe path uses int3, which is the breakpoint instruction. To understand kprobes better, let's start with a simpler question: what is a breakpoint, really? Here is a typical GDB session that starts by setting a breakpoint at main. How does GDB actually stop at main? Does GDB directly stop the process by itself? You might already know the answer: GDB patches the instruction stream.

[06:22:12]More specifically, on x86-64 it patches the first byte at the target address with int3, the breakpoint instruction. When the CPU executes the int3, it triggers the breakpoint trap, and then GDB stops the program. The entire debugging flow goes like this: the process executes the int3, the CPU raises a breakpoint exception and enters the kernel trap path, the kernel trap path sends a SIGTRAP through the ptrace mechanism, and GDB regains control. That's it. A kprobe uses the same breakpoint idea, but there is no GDB and no user-space signal involved: the kernel patches its own text and handles the trap entirely inside the kprobe framework.
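
The byte patching described here can be mimicked on an ordinary byte buffer. The sketch below is a toy model in Python, not debugger code: it operates on a `bytearray` standing in for the text segment, and `BreakpointTable` is a hypothetical helper invented for this illustration. The only real facts it relies on are the x86 encodings: `0xCC` is the one-byte int3 opcode and `0x55` is `push %rbp`.

```python
INT3 = 0xCC  # x86 one-byte breakpoint opcode


class BreakpointTable:
    """Toy model of debugger-style breakpoint patching on a byte buffer."""

    def __init__(self, text: bytearray):
        self.text = text
        self.saved = {}  # offset -> original first byte

    def insert(self, offset: int) -> None:
        if offset in self.saved:
            return  # already patched; keep the true original byte
        self.saved[offset] = self.text[offset]
        self.text[offset] = INT3

    def remove(self, offset: int) -> None:
        self.text[offset] = self.saved.pop(offset)


# Encodes: push %rbp; mov %rsp,%rbp; pop %rbp; ret
text = bytearray.fromhex("554889e55dc3")
table = BreakpointTable(text)

table.insert(0)
patched = bytes(text)   # first byte is now int3
table.remove(0)
restored = bytes(text)  # original byte is back

print(patched.hex(), restored.hex())
```

Only the first byte of the instruction is overwritten, which is why a single saved byte is enough to undo the patch.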

[06:23:08]Now let's bring this back to the kernel. We need a small target for the next few examples, so I brought the getuid syscall. It is a pretty simple syscall: it has no arguments, a simple implementation, and it is easy to recognize.

[06:23:25]Here's the disassembly of this getuid syscall. It's pretty short, and even the C definition is a one-liner: it just fetches the current UID mapped into the user namespace. We'll use this syscall for the next few slides.

[06:23:40]Now let's attach a kprobe to this getuid syscall. Traditionally, kprobes were often used from small kernel modules like this one. We define a kprobe structure and choose a target symbol and offset. Here we use __do_sys_getuid, which is basically the getuid syscall, with offset zero, which means I attach the kprobe to the first instruction of __do_sys_getuid. We also define the handler and call register_kprobe. Now let's see what actually happens under the hood.

[06:24:19]Before it patches the instruction, kprobe first saves the original instruction bytes. In this example, the original instruction at offset zero is a pushq %rbp. So this pushq %rbp is saved into the kprobe structure, in the ainsn field.

[06:24:39]After backing up the original instruction, the first byte at the target address is replaced by int3. The next time getuid runs, the CPU executes the int3, raises the breakpoint exception and traps into the kprobe framework. Within this kprobe trap handling, the first thing that runs is the pre-handler. This is where our observation logic can run: for example, we might inspect some registers, collect arguments, or update a counter. But after running the pre-handler, the original instruction still has to be executed. So kprobe executes the copied instruction through a step called single-step, which basically just runs the original instruction. After the original instruction has executed, kprobe can call the post-handler, which is basically the same as the pre-handler; the only difference is that it is executed after the single-step. You can also use this post-handler for your observation.

[06:25:53]And after the post-handler, execution finally resumes normally.

[06:26:00]From the function's point of view, the original instruction is still executed just the way it was. So everything is fine.
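
The order of events in this path (save, patch, trap, pre-handler, single-step, post-handler, resume) can be modelled with a toy instruction stream. This Python sketch is a simulation only and makes no claim about how the kernel implements it: `ToyKprobeMachine` and its methods are hypothetical names, instructions are plain strings, and the "trap" is just a branch in the interpreter loop.

```python
BREAKPOINT = "int3"  # stand-in for the one-byte 0xCC opcode


class ToyKprobeMachine:
    """Toy instruction stream with kprobe-style breakpoint patching."""

    def __init__(self, program):
        self.program = list(program)  # one mnemonic per slot
        self.probes = {}              # slot -> (saved instruction, pre, post)
        self.trace = []               # everything that ran, in order

    def register_probe(self, slot, pre=None, post=None):
        # Save the original instruction first, then patch in the breakpoint.
        self.probes[slot] = (self.program[slot], pre, post)
        self.program[slot] = BREAKPOINT

    def unregister_probe(self, slot):
        saved, _pre, _post = self.probes.pop(slot)
        self.program[slot] = saved

    def run(self):
        for slot, insn in enumerate(self.program):
            if insn != BREAKPOINT:
                self.trace.append(insn)
                continue
            saved, pre, post = self.probes[slot]  # "trap" into the framework
            if pre:
                pre(self)
            self.trace.append(saved)              # "single-step" the saved copy
            if post:
                post(self)


machine = ToyKprobeMachine(["push %rbp", "mov %rsp,%rbp", "pop %rbp", "ret"])
machine.register_probe(
    0,
    pre=lambda m: m.trace.append("pre_handler"),
    post=lambda m: m.trace.append("post_handler"),
)
machine.run()
print(machine.trace)
```

The recorded trace shows the original `push %rbp` still executing, sandwiched between the two handlers, which is the point made in the talk: the probed function sees the same instruction sequence as before.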

[06:26:10]Now let's use this mechanism for a simple example: counting getuid syscalls with a kprobe.

[06:26:18]Here's a small kernel module similar to the previous example. At the top we define a per-CPU counter, and handler_pre is a pre-handler that runs whenever this getuid syscall is hit. Inside the handler we just increment the per-CPU variable.

[06:26:39]So it's pretty simple. When the module exits, it sums up the per-CPU counters and prints to the kernel log buffer how many times getuid was called. The code looks simple, but we have to consider what we pay for this tracing.

[06:26:58]Every getuid call takes the full path that we just walked through: it hits the int3, raises the breakpoint exception, enters the kprobe framework, and goes through the pre-handler, single-step and post-handler.

[06:27:13]So it takes a lot of steps for every single getuid syscall.
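
The counting scheme in this module (one counter slot per CPU, incremented in the pre-handler and summed at exit) can be sketched outside the kernel as follows. This is a Python stand-in rather than the module from the slides: `PerCpuCounter`, the CPU count and the sequence of simulated hits are assumptions for illustration; only the name `handler_pre` follows the talk.

```python
NR_CPUS = 4  # hypothetical CPU count for this sketch


class PerCpuCounter:
    """One slot per CPU, so a hit only ever touches its own CPU's slot."""

    def __init__(self, nr_cpus: int):
        self.slots = [0] * nr_cpus

    def inc(self, cpu: int) -> None:
        self.slots[cpu] += 1

    def total(self) -> int:
        # Done once, at "module exit", rather than on every hit.
        return sum(self.slots)


hits = PerCpuCounter(NR_CPUS)


def handler_pre(cpu: int) -> None:
    """Stand-in for the pre-handler: count the hit on the current CPU."""
    hits.inc(cpu)


# Simulated getuid calls arriving on various CPUs.
for cpu in [0, 1, 1, 3, 0, 2, 1]:
    handler_pre(cpu)

print(f"getuid called {hits.total()} times")
```

Splitting the counter per CPU keeps the hot path to a single local increment and moves the cost of aggregation to the one place where it is cheap, the final read.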

[06:27:20]But so why did the K probe need break points like entry in the first place? It has a lot of overhead. But why did we use intry in the first place? Well, the kernel did not provide any built-in observation points. So uh well, if uh the kernel has provided any observation point, we could just use that. But since the kernel didn't provide it initially, so we just use patches the instruction with the inry and we could uh just put our observation point anywhere in the kernel code. So K probe solved this by dynamically creating observation points pre-endular single step postuler. But this comes with the another question but what if the kernel code itself already contain the observation points? So I'm not saying the the prek or not the real logic. I'm just saying just an instrumentation point that just do nothing by default but can be enabled when the tracing is needed. So that's the idea behind the frace. So instead of using the entry uh traps fra uses the instrumentation sites already built into the kernel and the kernel simply patches those sides at runtime.

[06:28:35]So what does this ftrace instrumentation site actually look like? On x86-64, the compiler inserts a five-byte NOP at the function entry, so it does nothing by default. You might wonder why five bytes: because five bytes is exactly the size of the relative call instruction on x86-64. So later, at runtime, the ftrace framework patches those five NOP bytes with a call to the observation handler.
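The size match described above can be checked with a little arithmetic. The following is a minimal Python sketch, not kernel code: it encodes an x86-64 relative call (opcode E8 plus a 32-bit little-endian displacement) and compares its length with the common five-byte NOP encoding. The helper name encode_call_rel32 and both addresses are made up for illustration.

```python
import struct

# Common 5-byte x86-64 NOP encoding: nopl 0x0(%rax,%rax,1)
NOP5 = bytes.fromhex("0f1f440000")


def encode_call_rel32(site: int, target: int) -> bytes:
    """Encode `call target` for an instruction located at address `site`."""
    # The displacement is relative to the address of the NEXT instruction,
    # i.e. the patch site plus the 5 bytes of the call itself.
    rel = target - (site + 5)
    if not -2**31 <= rel < 2**31:
        raise ValueError("target is out of rel32 range")
    return b"\xe8" + struct.pack("<i", rel)


# Hypothetical addresses, chosen only for illustration.
site = 0x1000
trampoline = 0x2000
patched = encode_call_rel32(site, trampoline)

print(len(NOP5), len(patched))  # both are 5 bytes
print(patched.hex())            # e8fb0f0000
```

Because both encodings are exactly five bytes, the call can replace the NOP in place without shifting any of the instructions that follow.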

[06:29:08]So to compare how this looks in practice, I brought two kernel binaries: on the left is one built without ftrace, on the right one built with ftrace. On the left, getuid starts directly with pushq %rbp, but on the right, with ftrace enabled, you can see five bytes of padding at the start, shown as a nopl with %rax, %rax operands, which is basically the x86 multi-byte NOP notation.

[06:29:43]So now let's enable the ftrace function tracer for this getuid syscall. As you already know, from tracefs under /sys/kernel/tracing we can set set_ftrace_filter to __do_sys_getuid, select the function tracer as current_tracer, and turn tracing_on. Now let's see what gets patched once ftrace is enabled. When ftrace is enabled, the five-byte NOP is patched into a call instruction. You can see that with the 0xfff...c06-something address. So instead of five bytes of NOP, the function entry now calls a function named ftrace trampoline. You can verify which function that is. Here I brought a GDB screen that confirms it directly from the live kernel image with /proc/kcore, and the fff...c06 address actually maps to the ftrace trampoline when you look it up in /proc/kallsyms. On the right is a simplified version of the ftrace trampoline, which is what is bound to that address. What it does is pretty simple: it saves minimal state, walks through the registered ftrace operations, calls the callback, and returns back to the original function.

[06:31:16]It's pretty simple.

[06:31:18]And in our setup the selected tracer was function, so it goes to the function tracer. The function tracer looks like this. What it does is pretty simple: it just records the function IP and the parent IP into the trace buffer. That's it. And this is the place where our observation can run: from the ops func callback you can put in your own observation logic. After that, the trampoline restores the CPU state it saved and returns, and execution continues with the real first instruction, which was the pushq %rbp. The one difference from kprobes is that it uses a trampoline, not an exception trap. At this point we have only looked at the function tracer path, but the same ftrace path can also be consumed by other tracers, such as the well-known function graph tracer, or by BPF trampoline fentry or fexit programs, which we'll revisit later in the BPF section.

[06:32:31]So now let's use ftrace for the same example again: counting getuid with ftrace. Here we use the function tracer like before: we set the filter to __do_sys_getuid and the tracer to function, same as before. Then every matching function entry generates a trace record in the trace buffer. For a quick check you can read the trace output to see how many times it was called, or you can simply use the word count tool. Here getuid has been called 36 times.
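The counting step can also be done by a small script instead of the word count tool. The sketch below is illustrative only: the sample text is hypothetical and merely has the general shape of function-tracer output (header lines starting with '#', then one record per line), and the helper name count_trace_hits is an assumption, not part of any tracing tool.

```python
def count_trace_hits(trace_text: str, func: str) -> int:
    """Count trace records that mention `func`, skipping '#' header lines."""
    hits = 0
    for line in trace_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or header, not an event record
        # Simplification: a plain substring match, so a record whose parent
        # function contains the same name would be counted as well.
        if func in stripped:
            hits += 1
    return hits


# Hypothetical sample, for illustration only.
sample = """\
# tracer: function
#
#           TASK-PID     CPU#     TIMESTAMP  FUNCTION
            bash-1234    [001]    100.000001: __do_sys_getuid <-do_syscall_64
            sshd-987     [000]    100.000105: __do_sys_getuid <-do_syscall_64
"""

print(count_trace_hits(sample, "__do_sys_getuid"))  # 2
```

On a real system the text would come from reading the trace file under tracefs, which normally requires root.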

[06:33:04]Before moving on, let's summarize the trade-offs of dynamic instrumentation.

[06:33:08]So kprobes are flexible: they patch the instruction directly, so they can attach almost anywhere inside a function. As long as the address is on an instruction boundary, you can patch it to int3. But every hit pays the trap cost, so it's quite expensive. On the other hand, ftrace is much lighter.

[06:33:29]Instead of trapping the CPU, it redirects execution through a trampoline call, so it's just a function call. But since the instrumentation site is compiler-inserted, the five-byte NOP, it is mostly limited to function entry and exit, compared to kprobes. So both are dynamic instrumentation but with a different trade-off: flexibility versus cost.

[06:34:00]So far, all of these mechanisms still intercept execution at runtime, which means there is still a runtime cost to intercept it. But what if the observation point was already there? That's the idea behind static instrumentation. The observation point already exists in the compiled kernel, and at runtime we do not need to invent a new place to observe; we just enable the existing site.

[06:34:30]The most well-known example is the tracepoint: a statically defined, low-overhead observation point designed for tracing and maintained by the kernel maintainers, per subsystem.

[06:34:44]Before the tracepoint, let's revisit why printk does not scale well for tracing. First, printk performs string formatting: it has to parse format strings, handle variable arguments, and convert values into text.

[06:35:00]That causes a lot of overhead with the printing. Also, printk writes into the kernel log buffer, which is a single global, shared buffer. So when many parts of the kernel want to print at the same time, they compete for the same logging path, which adds overhead. And finally, the output is just arbitrary text, so different code sites might print similar information in completely different formats. That makes filtering and structured analysis a bit harder.

[06:35:38]Tracepoints solve these problems differently. Instead of string formatting, tracepoints use fixed event schemas with lower overhead. And instead of the global kernel log buffer, tracepoints write event data into dedicated tracing buffers.

[06:35:58]And instead of unstructured strings, tracepoints produce structured event data, which makes events easier to filter, process, and aggregate.

[06:36:10]So where are the tracepoints located? Well, they are already built into the kernel execution path. For example, for the x86-64 syscall path, I brought an example like this. Execution goes like this: the actual syscall is called at the bottom green box, do_syscall_x64, and the tracepoint fires at the orange box, trace_sys_enter. The important point is that syscall tracepoints fire before the actual syscall handler runs. Now that we know where the tracepoints fire, let's look at what data they record. The tracepoint defines a fixed event schema, and for a syscall entry you can see that it records fields like the syscall number, as id, and up to six syscall arguments. Lower down you can see TP_fast_assign, which just fills those fields from the current register state.

[06:37:11]You can see the regs there. So instead of printing arbitrary text like printk, a tracepoint produces structured event data with a fixed schema into the per-CPU trace buffer.

[06:37:26]So again, let's use a tracepoint for the same example once more: counting getuid with a tracepoint. This time we'll use the trace event interface through tracefs. If you go to /sys/kernel/tracing, you can find events/syscalls/sys_enter_getuid/enable, and if you turn tracing_on you get trace events like before. You can manually count how many times it was called or use the word count tool again. Here that was 29.

[06:37:59]So far we have looked at kprobes, ftrace, and tracepoints for the same syscall.

[06:38:06]They are different mechanisms, but all of them are event-driven. So they're especially good when you want to see precisely how the kernel behaves.

[06:38:18]But this event-driven tracing has one important characteristic: every event runs the tracing path, which also means every event pays the tracing cost.

[06:38:31]So I brought a simple benchmark comparing these tracing mechanisms. The leftmost bar is the baseline without tracing enabled. The second one is the tracepoint, the third one ftrace, and the next one a kprobe. The baseline reaches about 15 million events per second, tracepoint and ftrace are roughly the same at about 11 million per second, and kprobe is about 9 million per second. So you can see that the kprobe has the most overhead, followed by ftrace and the tracepoint. Here we used the raw tracepoint, but it's roughly the same.
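The throughput numbers above can be turned into an approximate per-event cost. The following Python sketch only does that arithmetic with the rounded figures quoted in the talk, so the results are rough estimates: about 24 ns of added cost per event for the tracepoint and ftrace, and about 44 ns for the kprobe. The function names are chosen for illustration.

```python
# Throughput figures quoted in the talk (events per second, approximate).
rates = {
    "baseline": 15e6,
    "tracepoint": 11e6,
    "ftrace": 11e6,
    "kprobe": 9e6,
}


def ns_per_event(rate: float) -> float:
    """Average time spent per event, in nanoseconds."""
    return 1e9 / rate


def added_cost_ns(name: str) -> float:
    """Extra time per event compared with the untraced baseline."""
    return ns_per_event(rates[name]) - ns_per_event(rates["baseline"])


for name in rates:
    print(f"{name:10s} {ns_per_event(rates[name]):6.1f} ns/event "
          f"(+{added_cost_ns(name):4.1f} ns)")
```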

[06:39:12]So event-driven tracing pays a cost per event, and as the event frequency increases, the overhead goes up too. For example, if a function is called millions of times per second, observing every single call may be too much. So sometimes we need another mode of observation: not tracing every event, but just enough events to understand the kernel's behavior.

[06:39:46]That is sampling.

[06:39:48]Sampling replaces "observe every event" with "observe when a counter overflows". Each event increments a counter. I brought a diagram here.

[06:40:01]Events increment the counter, and sampling happens only on an overflow. In this example the sampling period is seven.

[06:40:08]So we only sample when the counter reaches seven, 14, and so on.
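The counter-overflow model just described can be simulated in a few lines. This is a plain Python model of the idea, not the perf implementation; the function name sample_points is made up for illustration.

```python
def sample_points(num_events: int, period: int) -> list[int]:
    """Return the 1-based event indices at which a sample is taken."""
    samples = []
    counter = 0
    for event in range(1, num_events + 1):
        counter += 1               # every event increments the counter
        if counter == period:      # "overflow": one full period has elapsed
            samples.append(event)  # take exactly one sample
            counter = 0            # re-arm the counter for the next period
    return samples


print(sample_points(20, 7))  # [7, 14]
```

With a period of seven, 20 events produce only two samples, which is why the cost no longer scales with every single event.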

[06:40:14]In Linux, this sampling model is exposed via perf events.

[06:40:21]So what does a perf event actually define? You can see it in the perf_event_attr structure, which defines three major things:

[06:40:31]what event to count, when to sample, and what data to observe.

[06:40:36]First we choose what event to count.

[06:40:39]It can be a hardware event, like cycles, instructions, or cache misses, counted by the PMU, which is a hardware counter unit inside the CPU. Or it can be a software event for semantic events like context switches, page faults, or the CPU clock. Or you can use tracepoint-backed or kprobe-backed events. In this example we use CPU cycles, which is why the type is PERF_TYPE_HARDWARE and the config is PERF_COUNT_HW_CPU_CYCLES. Second, we choose when to sample. Perf can sample on counter overflow, which is the most basic mode, or you can use a clock-based event for timer-based sampling. Here the sample period is 100,000 cycles, so we take one sample roughly every 100,000 CPU cycles. I drew the scale in units of 10,000, so it goes 10, 20, 30, and up. Third, we choose what to observe. A perf event sample can include many kinds of data: the instruction pointer, user and kernel stack traces, PID and TID, CPU number, register information, call chains, et cetera. Here we use a sample type of PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN, so each sample records the current instruction pointer and the call chain.

[06:42:22]To sum up the example we defined with perf_event_attr: every 100,000 CPU cycles, sample the current IP and the call chain. From these samples we can identify the frequently executed code paths.
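To get a feel for what a period of 100,000 cycles means, the sample rate can be estimated from the clock frequency. The sketch below assumes a CPU that is busy the whole time and runs at 3 GHz; the clock speed is an assumption for illustration and was not stated in the talk.

```python
def samples_per_second(cpu_hz: float, sample_period: int) -> float:
    """Samples generated per second on one fully busy CPU."""
    return cpu_hz / sample_period


# Assumed clock frequency (3 GHz); the period is the one from the talk.
print(samples_per_second(3e9, 100_000))  # 30000.0
```

Under that assumption a busy CPU produces about 30,000 samples per second instead of one record per event.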

[06:42:42]So now let's use perf for a different kind of question than before. Instead of counting events, let's ask: where is the CPU spending most of its time?

[06:42:54]Here's the perf event setup that we've seen previously: we count CPU cycles, sample every 100,000 cycles, and collect the instruction pointer and the call chain. But in practice we don't really write code like this to use perf. We usually use the perf tool for this.

[06:43:17]The command below has basically the same behavior as the C code, and after we run perf record we can look at the collected samples. For example, we can ask which process appears most frequently in these samples. As you can see, rsync is at the top and accounts for about 81%, which suggests most CPU time was spent on rsync-related work. And since we also recorded the call chains with the samples, we can ask which execution path was hot. In this example many of the samples, about 60%, were flowing through the x64 write syscall. So we can conclude that the major hot path in this system was the write syscall path in rsync. Well, that's obvious. These sampled call chains can also be visualized as a flame graph, developed by Brendan Gregg. Instead of reading through the text report, you can easily identify hot paths with this graphical tool.

[06:44:35]So far we have covered various event sources and how they enter the tracing logic.

[06:44:41]We've covered event-driven and count-driven sources, and we can separate them into dynamic instrumentation, static instrumentation, and sampling. But that's not the whole story for tracing. The event source decides where and when to trace, but the missing part is who consumes the event. That's where the programmable layer kicks in. BPF is not a new tracing event source. Instead, it is a safe and flexible programmable layer that can attach to the existing event sources: the kprobes, ftrace, tracepoints, and perf events that we covered before.

[06:45:22]So far we have covered multiple event consumers. For ftrace we covered the function tracer and the function graph tracer; for tracepoints, the trace event subsystem; and for perf events, the perf_event_attr interface or perf tool commands like perf record. But all of these consumers have one thing in common: they are fixed consumers. They are configurable but not programmable, right?

[06:45:54]Ftrace does the same printout, and the tracepoint just prints out the event when it is called. Perf also does the same.

[06:46:05]But real tracing questions are often more specific. Previously we only asked how many times the getuid syscall is called. But often we want to ask more complex questions, like: list the top processes that call getuid more than 100 times.

[06:46:26]These questions are pretty hard to answer with the fixed consumers. Right?

[06:46:32]This is where BPF programmability becomes useful. You can write your own observation code in BPF, and it runs on top of these event sources. Let's see how it's done. First we define something called a BPF map, which is basically a key-value store that can be used inside the BPF program. Here we use BPF_MAP_TYPE_HASH, so I drew a hash map like this. Then we program our observation handler like this. It first fetches the current process ID with bpf_get_current_pid_tgid, then does a map lookup for the entry matching the PID. If it's found, it updates the counter, like the red one here; if it's not, it sets the count to one and updates the entry. So the program is pretty easy to write with BPF. After writing the program logic, we annotate the function with a section: you can see SEC with kprobe and a function name, which means we use a kprobe as the tracing event source, like we've seen before. But this time, instead of a kernel module's pre-handler, our BPF program acts as the event consumer. It still goes through all the kprobe mechanisms, since we're using kprobe trace events: the process executes the int3, that raises a breakpoint, it jumps into the kprobe framework, and the pre-handler is called, but instead of a kernel module pre-handler, the BPF program is executed as the event consumer. I've just covered the kprobe here, but you can also use different event sources like ftrace or tracepoints. If you use ftrace with fentry, as we saw earlier in the ftrace trampoline, there are the registered ftrace operations, and from there you can use BPF as a consumer. Or with a tracepoint, with trace_sys_enter, you can attach BPF there to consume your tracepoint events.
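The handler logic described above (look up the PID, increment if found, otherwise insert one) can be modeled in user space. The following Python sketch is only a model of that logic with an ordinary dict standing in for the hash map; it is not BPF code, and the function names and the event stream are hypothetical.

```python
def on_getuid(counts: dict[int, int], pid: int) -> None:
    """Model of the handler: look up the PID, then increment or insert 1."""
    current = counts.get(pid)      # stands in for the map lookup
    if current is not None:
        counts[pid] = current + 1  # entry found: bump the counter
    else:
        counts[pid] = 1            # no entry yet: start at one


def top_callers(counts: dict[int, int], min_calls: int) -> list[tuple[int, int]]:
    """PIDs with more than `min_calls` calls, busiest first."""
    rows = [(pid, n) for pid, n in counts.items() if n > min_calls]
    return sorted(rows, key=lambda row: row[1], reverse=True)


counts: dict[int, int] = {}
# Hypothetical event stream: (pid, number of getuid calls).
for pid, calls in [(42, 150), (7, 101), (9, 3)]:
    for _ in range(calls):
        on_getuid(counts, pid)

print(top_callers(counts, 100))  # [(42, 150), (7, 101)]
```

The second function answers the earlier question about processes that call getuid more than 100 times, which a fixed consumer cannot express directly.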

[06:48:50]So with BPF, just by changing this annotation you can attach to a tracepoint or fentry instead. It's pretty flexible: you can use other trace event sources with a similar BPF program.

[06:49:07]Now let's briefly go back to the programming perspective. Couldn't we write the same program with the original kprobe interface? Kprobes use a kernel module, so it can implement anything BPF can, right?

[06:49:25]Yes, you can do it. But now look at the amount of kernel code we needed. It's not that long, but compared to the BPF version it's quite a lot. If you skim through it, we need to define our own hash table and hash entries, look up and iterate over the hash table, update it, and allocate memory for our own entries. And when the module is cleaned up, you also have to free the allocated memory.

[06:50:00]So the problem is not that a kernel module cannot do it. The actual problem is that with a kernel module the responsibility becomes much larger. If you want to write your own kernel module just for tracing, you have to consider memory allocation, lifetime management, locking, and cleanup, and all of these have to be handled by the developer.

[06:50:29]And because we are now executing custom logic inside the kernel, we also have to think about safety. We have to consider that the program may not terminate, or it might perform an invalid memory access, or it might corrupt kernel state.

[06:50:47]BPF takes a very different approach. When a BPF program is loaded into the kernel, it first goes through the BPF verifier. The verifier checks whether the program is safe to execute in the kernel: does the program always terminate, is every memory access in the program valid, and does the program interact safely with the kernel? If these cannot be proven, the program is rejected before it is loaded into the kernel.

[06:51:26]So BPF tracing provides three important characteristics. Programmability: it allows you to write custom observation logic. Flexibility: you can attach your BPF program to tracing event sources other than kprobes, perf events for example. And safety: it checks whether your BPF code can run safely inside the kernel.

[06:51:53]So finally, one last time, let's count getuid with BPF tracing. The BPF code here is the same as we have seen before. We have to compile it into BPF bytecode; here we use clang to compile, and the result looks something like this, just BPF instructions. Then, to load the BPF program into the kernel, we can use the bpftool prog load command. The loaded BPF program is checked by the verifier for whether it is safe to execute, and once it passes, it is attached to the kprobe event, since we use a kprobe. We also use the autoattach option, so it attaches to the kprobe automatically. Later, when the getuid syscall is called, our BPF program is eventually called through the kprobe framework, along the same path the kprobe takes, and from inside the handler the BPF program is executed and works as the event consumer.

[06:53:02]So it goes through all the kprobe mechanisms that we've seen before, and afterwards we can see the getuid count per process, which we collect through the BPF map that we used before. I've used bpftool map dump and formatted the output with jq, and you can see we can list the top processes that called the getuid syscall and the number of times.

[06:53:37]So to sum up, BPF does not replace event sources. Instead it reuses the existing tracing infrastructure: kprobes, ftrace, tracepoints, and perf events. Today we explored several tracing mechanisms, from kprobes to perf events and BPF, and each mechanism has different trade-offs in flexibility, overhead, programmability, and safety. But in the end the key is asking the right question and choosing the right event source and consumer. I hope this talk helped you better understand how Linux kernel tracing works, and thank you for your attendance. Any questions?

[06:54:38]>> I have a question. Do you by chance know how the single step is done in the kprobe framework?

[06:54:45]>> So, that's the part I left out because it might be too much. So, let's go back.

[06:54:52]>> It's not. Yeah.

[06:54:55]So int3 is an exception that causes a breakpoint, right? Where was it? Yeah. So when we do the single step, it actually sets a flag in EFLAGS called TF, the trap flag. When you set EFLAGS.TF for the single-stepped instruction, the CPU automatically raises an exception again, not the breakpoint but, I guess, a debug exception. So it is similar to int3, but after you execute the pushq %rbp, it goes through the debug exception and control comes back to the kprobe handler, which can then eventually call the post-handler. That is the single-step mechanism: it triggers an exception again to pass execution to the post-handler.

[06:56:19]>> Hi. So I was wondering: you can place this int3 basically everywhere.

[06:56:27]Can you then redirect to an eBPF program from there, or can you do that only for defined hooks?

[06:56:36]So I kind of used the word everywhere, but it's not actually everywhere. There are some places in the kernel that don't allow a kprobe to be attached. So on the everywhere point, it actually isn't. But with BPF, are you asking whether a kprobe can trigger a BPF program, some kind of question like that?

[06:57:07]>> Yeah, basically whether you can trigger an eBPF program from an int3 or kprobe.

[06:57:19]So the mechanism here just uses the kernel trace event mechanism. Basically, the BPF tracing program that I've covered here is triggered when the kernel event triggers. I think you're asking whether there is any mechanism to trigger BPF program execution other than the kernel tracing events, right?

[06:57:47]>> Yeah, or maybe through the kernel tracing events; if they have a hook, that would work too, I guess.

[06:57:53]>> Can I try to repeat the question?

[06:57:56]>> I haven't answered the question.

[06:57:57]Sorry. So there is a BPF program type kprobe, so yes, you can attach one. And there are some functions in the kernel which are annotated as not traceable; that's where you cannot attach a kprobe. But basically, if you can attach a kprobe to it, you can attach a BPF program.

[06:58:14]>> Thank you.

[06:58:15]>> Sorry can I Yeah, there were other people queuing up.

[06:58:18]>> I don't care.

[06:58:19]>> I'm true. Anyway, >> no, it's regarding the answer to >> Jit indeed. I believe in your talk you're explaining that there are event sources, and then there are things you can do >> upon those >> sources, >> sources, and one of the things you can do, an action, is run a BPF program, for example. >> And any of those sources >> Mhm. >> can trigger any of the actions, more or less. Well, maybe not.

[06:58:59]Okay, let's do this. Many of those sources can trigger a BPF program >> since it is attached to the handler, so it is triggered automatically, >> right? And you can attach, as Nikolai explained, a BPF program to a kprobe.

[06:59:16]I think >> you could attach a BPF program to a kprobe just like this.

[06:59:20]>> Exactly. >> Yeah, I think, is that the >> Well, >> maybe >> Nicholas explained it well enough to me, so >> right. >> Okay, great. >> All right. There is an attachment type for BPF programs called a raw tracepoint, and it's kind of nice because it gives you BTF type information on the arguments. Do you happen to know if there is any other difference between a raw tracepoint and a tracepoint? >> So, when the tracepoint records the trace data, I'm not sure about the exact details or behavior, but the raw tracepoint just dumps the registers in a specific format, while the normal tracepoint does some of the >> I'm not really sure. I don't know much about it. Sorry.

[07:00:28]>> I have an opinion. I have an opinion.

[07:00:29]Can I try to say maybe you also have opinions?

[07:00:33]>> Let me try for go.

[07:00:36]>> Yeah. So for a raw tracepoint you only have the register values and pointers, but you don't have the content behind the pointers. So for example, if you have a >> user space buffer, you won't have the content from the raw syscall tracepoint. >> What you can access >> the sys tracepoints but not the raw syscalls, >> you only have the pointer value, which is completely useless, right? >> Sometimes you don't care about the dereferencing. >> Yeah, exactly, sometimes. >> The way I understand raw >> And it's faster, you're right. >> So the way I understand raw tracepoints, and you or anybody will correct me, is that if there is a tracepoint, there is also a raw tracepoint behind it.

[07:01:25]But there are some points in the code where there's only a raw tracepoint and not the refined tracepoint, right? So if you have a tracepoint there's also a raw tracepoint, but sometimes not the opposite. That happens when the maintainers don't want to maintain a tracepoint there, because with a tracepoint it's like, oh, >> yeah, I can't take it out. But a raw tracepoint without an actual tracepoint is like a stealth one. >> You know, it's not exposed, only to the people in the know. For example, I know there are some in the scheduler, but there may be others that I don't know about. >> Also, just another remark: basically, raw tracepoints are much faster because the kernel doesn't do any pre-processing of the arguments. So use ordinary tracepoints unless it's in some really critical place where you see the performance is not suiting you. >> Yeah. You use the tracepoints because you need the data. So, oh yes, I have this, it's fast, but there's no data.

[07:02:29]Yeah, but I kind of need the data. So, I don't know. It depends on the use case, of course, right?

[07:02:36]>> So, two small questions. One: when you were benchmarking kprobes versus tracepoints, et cetera, did you have kprobe jump optimization enabled?

[07:02:49]And for the second question, BPF is quite elaborate.

[07:02:55]Does it have overhead when you're using BPF versus native filtering?

[07:03:03]>> So to answer the second one: BPF has a JIT compiler, so it can be JIT-compiled to the target architecture's instructions. So comparing kernel module performance with JITed BPF performance, I do not think there is that much difference. And the first one: you mean kprobes on ftrace? So you mean, did I use CONFIG_KPROBES_ON_FTRACE for the benchmark?

[07:03:49]>> I mean there's a config, optprobes or something, where the kernel will optimistically try to jump directly instead of trapping for kprobes.

[07:04:03]>> I guess I wasn't aware of that. So yeah.

[07:04:08]>> Yeah, it's a config option, and it's enabled by default. You can turn it off.

[07:04:14]>> Yeah, in every Linux kernel.

[07:04:16]>> So maybe, if it's on by default, he was using it. What do you think? I don't know.

[07:04:22]>> Maybe. >> I was curious about the actual int3 performance, if you had it.

[07:04:31]>> So, >> if you search for kprobe jump optimization, you'll find it.

[07:04:40]So it means that basically, if you try to attach a kprobe at the beginning of a function, then it's basically using the ftrace machinery: it attaches an ftrace callback that calls the kprobe callback. >> Yeah, yeah. So it doesn't have to do the single-stepping and int3 and so on. int3 is slow because basically there is one place where you have to decide who is going to handle this int3: kprobes has to register that address, and it has to look at where it was called from and see, okay, the int3 is at this address, so I'm going to use this callback, and so on. So it's slow compared with just the jump or call which is used by ftrace. >> And kprobes can use the jump thing when you are attaching a kprobe to the beginning of the function, to the same place where the ftrace site is, but also >> we got the mic too late.

[07:06:10]>> So it's a different option than CONFIG_KPROBES_ON_FTRACE, right? >> Or is the one that you're mentioning that config? >> Two different optimizations. >> I'm going to repeat: two different optimizations. One is the one that Pedro mentioned, which is kprobe jump optimization, and the name of the second optimization that you mentioned? >> The one that Pedro mentioned. And the other one is >> optprobes.

[07:06:41]>> I've never So yeah, >> the same as mine.

[07:06:44]>> So how many optimizations are there? One or two?

[07:06:51]>> I'm only familiar with >> Okay. There's this particular optimization that >> Yeah. I was just aware of the config.

[07:07:02]>> Yeah.

[07:07:08]So I know absolutely nothing, or very little, so I'm just reading the documentation here. It seems like optprobes only works if you set the sysctl debug.kprobes-optimization on.

[07:07:26]>> So maybe not.

[07:07:29]>> It's not compiled by default, but you still may need, this is, so, basically the >> By default it's on.

[07:07:50]>> Oh, I see.

[07:07:55]>> Okay. So, compiled and enabled.

[07:07:59]So for your information, this benchmark can be run from tools/testing/selftests/bpf. It has benchs/run_bench_trigger.sh, and it gives you all the benchmarks, from the baseline to the raw tracepoint, the plain tracepoint, fentry, fexit, kprobe, kretprobe, and so on.

[07:08:23]You could just try it yourself. But yeah.

[07:08:29]This is not the current.

[07:08:31]>> No, no, this one is from the BPF selftests. So yeah.

[07:08:36]So it contains the benchmarks.

[07:08:38]So it just gives you the sources.

[07:08:41]>> Yeah.

[07:08:45]>> Yeah. Oh, sorry.

[07:08:51]>> Any other questions?

[07:09:00]So far kprobes, ftrace, and perf events seem to have APIs that you can hook into if you're writing a kernel module. But is there something similar for tracepoints? Say I have a kernel module that wants to trigger a handler on a tracepoint. Can I register such a thing?

[07:09:29]>> So are you asking whether we could write a consumer that can be attached to a tracepoint?

[07:09:37]>> Yes. Sort of like how you attached the kprobe.

[07:09:40]>> Yeah. So, I never tried writing a consumer for a tracepoint.

[07:09:47]But as far as I know, there is a way to write a kernel module for that. I'm not sure whether it is exposed to kernel modules or not, but I believe you could do so.

[07:10:23]Thank you.

[07:12:55]test one two. Hello. Hello.

[07:13:00]>> Yeah. And in any chance don't go close to the uh just because there is a you know >> I try to stay here. Okay. People should speak on the microphone. So we have it on the record.

[07:13:11]>> Yeah.

[07:13:12]>> And uh we start in time which should be right now already.

[07:13:19]>> Yeah.

[07:13:19]>> Kind of. Yeah.

[07:13:20]>> Kind of. Thank you. Welcome to yet another session at this very successful SUSE Labs Conference 2026. I will be talking about virtualizing SAP workloads with SUSE KVM. We'll skip that. For those who do not know me, my name is Anna Wolf. I'm an SAP solution architect in the SAP core engineering team at SUSE. The agenda: this is the first time I'm doing this presentation, so I'm a little bit unsure about the timing. I will rush through the first two topics, which are SAP virtualization and roadmap and outlook, and then I will try to take some time for the third topic, achieving optimal performance, hoping to have some time left at the end of the session for feedback and Q&A.

[07:14:55]SAP virtualization with SUSE KVM, with a focus on HANA. To my knowledge, these are the current options that a customer has. The leader is VMware. Nutanix is still in play. Red Hat not so much. And then there is the best option, SUSE KVM, in SUSE Linux Enterprise Server for SAP Applications.

[07:15:30]About virtualizing SAP applications in general: there are two SAP Notes you may want to read in this context, and if you want to read more about virtualizing SAP HANA with SUSE KVM, there are a couple more SAP Notes that you want to read. I'm not going through them. SAP HANA is a little bit special. I will talk about this at length in a little while; this is just for completeness. SAP HANA hypervisor validation is a little more extensive than the general hypervisor validation, as we have to prove performance. A hypervisor validation is always only valid for a certain hypervisor version, a CPU architecture, a number of sockets, the maximum amount of memory, three different storage types, and a validation scenario. The validation scenarios are shown on the right side here. The single-VM scenario is obviously the first one to validate.

[07:16:57]On top of that, one can validate multiVM scenarios, little D for creation. Um, high availability on hypervisor level and scale out scenarios.

[07:17:09]Um, what's the big deal?

[07:17:13]Well, it's obviously always about customer satisfaction. We want to make sure that a customer can calculate the risks and predict the performance associated with virtualizing SAP HANA on top of SUSE KVM. I talked about the hypervisor validation already. It's always a comparison to bare metal. We'll be talking about this a little bit more soon.

[07:17:47]What we gain from that: the challenges and the gains are basically the same. It's very hard to carry near-bare-metal performance through the abstraction layer. But by doing that we create a very well-designed, very specific and very well-defined environment that we can give to partners and customers, to give them the possibility to create validated virtual machines, to create a validated environment.

[07:18:25]Usually there are four benchmarks that we have to use for the validation.

[07:18:36]I'm not going through them. The first one is a storage test that has absolute values. The rest run on the database, and on all three of them we have to show that the virtual machine is close to a bare-metal system of the same size and the same architecture: the performance deviation between both worlds must not be too big.
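The comparison against bare metal described above can be illustrated with a minimal Python sketch that computes the relative deviation of a VM result from a bare-metal result. The 10% threshold, the function names and the sample numbers are assumptions for illustration only; the talk does not state the actual acceptance limits.

```python
def relative_deviation(bare_metal: float, virtual: float) -> float:
    """Return the VM's deviation from bare metal as a fraction of the bare-metal value."""
    if bare_metal <= 0:
        raise ValueError("bare-metal result must be positive")
    return abs(virtual - bare_metal) / bare_metal


def within_limit(bare_metal: float, virtual: float, max_deviation: float = 0.10) -> bool:
    """Check a VM result against a hypothetical maximum deviation.

    max_deviation is an assumed value, not a documented validation limit.
    """
    return relative_deviation(bare_metal, virtual) <= max_deviation


if __name__ == "__main__":
    # Made-up numbers: a bare-metal score of 100 and a VM score of 92.
    print(relative_deviation(100.0, 92.0))
    print(within_limit(100.0, 92.0))
```

A relative measure fits the three database benchmarks, where only the gap to bare metal matters; the storage test, which the speaker says has absolute values, would be checked against fixed numbers instead.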

[07:19:09]A hypervisor configuration from a bird's-eye view looks like this. Hyperthreading is enabled. We use CPU pinning. We make the virtual machine NUMA-aware. We can use slightly above 90% of the physical memory. And we are doing PCI passthrough for Fibre Channel storage and direct-attached storage, and SR-IOV for networking, to achieve the performance that is asked of us.

[07:19:42]Details about what I just said very, very quickly are in our SUSE Best Practices guide for SAP HANA on KVM, which is available on documentation.suse.com. As you will see soon, we have currently validated SLES 15 SP7. The guide that is currently available covers only SLES 15 SP5, as we are still working on the documentation for the latest one. Roadmap and outlook: this is where we are currently.

[07:20:20]Just recently, as I said, we have validated 15 SP7 on Cascade Lake and Sapphire Rapids systems. We are currently working on Emerald Rapids and Granite Rapids simultaneously.

[07:20:34]The two lighter boxes on the lower right side are what we would like to do next, but they are not yet planned. If everything works out as we want, these two would be the next validations that we do after Emerald Rapids and Granite Rapids.

[07:21:01]Same thing, just a different layout. Achieving optimal performance. In this chapter I will be talking in detail about what we do to achieve the performance we need. But I also want to talk about a struggle, more that of a customer or partner who actually wants to use virtualized SAP HANA, to make this audience aware of a problem that we see in the field. And I personally hope that we all can work together to circumvent or eradicate this problem. Going on: chasing performance. The general pain point of virtualization in high-performance, low-latency environments is, as the name already says, latency: on CPU level, on network level, on storage level. On CPU latency, we are not only talking about unpredictability with activated boost, but also about the latency deviation that comes in with emulation, for example. On network latency, it is the latency that comes in with virtual network environments.

[07:22:40]Storage latency: same thing. If you virtualize the storage in some way, you have effects that, in our case and in this environment, would prevent a successful validation. And this is already talking about the pain, a pain that we as the people who validate don't feel, but customers and partners do.

[07:23:09]About CPU pinning: I don't want to go into too much detail here, just visualize what CPU pinning is.

[07:23:19]And just as a side note, all the pictures on this slide and the next three or four slides I have generated with AI. I have tried to be as accurate as possible, but if you see an error, please tell me. CPU pinning: the performance way is to pin virtual CPUs to actual physical cores or their hyperthreads, and the flexible way is not doing that. Host utilization would be very much better on the flexible side, versus performance being very much better on the performance side. The same goes for the NUMA topology cloning that we do for multi-node VMs. In an environment that is laid out for performance, you would configure the virtual machine in a way that it is aware of the NUMA topology underneath. Multi-socket servers have by definition multiple NUMA nodes. So you would clone that topology to the virtual machine, make it aware, and keep the paths from the virtual CPU to the physical CPU and from virtual memory to physical memory as short as possible, to get as much performance out of it as possible. Versus the flexible way: the hypervisor is in charge of scheduling everything, placing the memory wherever it makes sense at a specific moment in time, and so on, making the hypervisor much more aware of what is happening in the virtual machine and scheduling it as needed. In a multi-VM environment where latencies may go up and down, this is the way to go.

[07:25:43]Emulation is probably quite clear. Again, performance versus flexibility. If you clone the CPU architecture into the virtual machine, or give the virtual machine direct access to the CPU, you skip the layer where a possible translation of the instruction sets has to take place, which again means performance. The other one is more flexible, making things like live migration and so on very much easier, but it brings latency into the system.

[07:26:29]Just for completeness, this is how it looks in the XML file. For CPU pinning, you see that we always pin two virtual CPUs to a physical CPU and its hyperthread. This is an Emerald Rapids two-socket system, so 56 cores per CPU equals 112 physical CPU cores, each with a hyperthread.

[07:27:08]That's why the second virtual CPU is pinned to the 112th physical CPU, which is actually a hyperthread. On this side you see the NUMA and no-emulation config. We are passing through the host CPU. We are creating a NUMA topology inside the virtual machine that is exactly like that of the physical server. In this case, this is a single-VM scenario: the VM has more or less the exact same size as the physical system underneath. We are telling the system which CPUs belong to which NUMA node.
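The pinning scheme described here can be sketched in Python as a small generator for libvirt `<vcpupin>` elements. This is a minimal sketch under stated assumptions, not the configuration shown on the slide: it assumes that the hyperthread sibling of host core N is host CPU N + 112 (as the speaker's "112th CPU" remark suggests for this two-socket, 56-core system) and that both vCPUs of a pair share the same two-CPU cpuset. The function name is hypothetical.

```python
def vcpupin_lines(sockets: int = 2, cores_per_socket: int = 56) -> list[str]:
    """Generate libvirt <vcpupin> elements, two vCPUs per physical core.

    Assumption: the host numbers the hyperthread sibling of core N as
    N + total_cores (core 0 pairs with CPU 112 on a 2 x 56-core host).
    """
    total_cores = sockets * cores_per_socket
    lines = []
    for core in range(total_cores):
        cpuset = f"{core},{core + total_cores}"
        # Both vCPUs of the pair may run on the core or on its hyperthread.
        for vcpu in (2 * core, 2 * core + 1):
            lines.append(f"<vcpupin vcpu='{vcpu}' cpuset='{cpuset}'/>")
    return lines


if __name__ == "__main__":
    # Print the first four of the 224 generated elements.
    for line in vcpupin_lines()[:4]:
        print(line)
```

On a real host the sibling numbering should be read from the system (for example from the CPU topology files under /sys) rather than assumed, since it differs between platforms.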

[07:28:01]And the memory that is available. We even tell it that the distance to the second NUMA node is longer than the distance to itself.
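As an illustration of the NUMA part, the following Python sketch builds a libvirt `<numa>` element with one cell per node and explicit distances. The memory size per node, the distance values 10 (local) and 21 (remote), the vCPU ranges and the function name are assumptions for illustration; the actual values on the slide are not given in the talk.

```python
def numa_cells(nodes: int = 2, vcpus_per_node: int = 112,
               mem_kib_per_node: int = 512 * 1024 * 1024,
               local: int = 10, remote: int = 21) -> str:
    """Build a libvirt <numa> element with per-cell vCPU ranges and distances.

    All default values are hypothetical; local/remote follow the common
    convention that a node's distance to itself is 10.
    """
    out = ["<numa>"]
    for node in range(nodes):
        first = node * vcpus_per_node
        last = first + vcpus_per_node - 1
        out.append(f"  <cell id='{node}' cpus='{first}-{last}' "
                   f"memory='{mem_kib_per_node}' unit='KiB'>")
        out.append("    <distances>")
        for other in range(nodes):
            # A node is closer to itself than to any other node.
            value = local if other == node else remote
            out.append(f"      <sibling id='{other}' value='{value}'/>")
        out.append("    </distances>")
        out.append("  </cell>")
    out.append("</numa>")
    return "\n".join(out)


if __name__ == "__main__":
    print(numa_cells())
```

In a domain definition this element would sit inside a `<cpu mode='host-passthrough'>` element, matching the speaker's remark that the host CPU is passed through rather than emulated.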

[07:28:13]Again, just FYI. Network latency: I'll go through this real quick. It's basically the same message over and over again. You always have to make an on-and-off decision between flexibility and performance.

[07:28:31]Here it's not exactly on and off. Off would be: I choose flexibility, I choose all the options and opportunities that a virtual network gives me. Versus: I choose performance, which basically is attaching a physical Ethernet device directly to the virtual machine, either by PCIe passthrough or with virtual functions.

[07:29:08]Please don't pay too much attention to the numbers here. These are suggestions by the AI tool I used. I have not verified them, but from my experience they're not too far off from what actually happens if you choose between these three possibilities.

[07:29:33]Storage latency. Um, and this is the last one where I again show the flexibility versus performance. Um, it's all just to prove a point.

[07:29:46]File-backed storage, qcow2. Again, the most flexible way, with a lot more overhead, but it makes life easier from an administration perspective.

[07:30:01]Um, attaching block devices directly is kind of a middle way.

[07:30:06]Not the best of both worlds, but a compromise between those two. And the third one is a direct-attached NVMe. Again, attached on PCI level to the VM, the actual disk or disks disappear from the host system and are managed via the IOMMU directly inside the virtual machine.

[07:30:37]Summary again.

[07:30:40]If you want to have maximum performance, you need to pin your CPUs. You need to clone your NUMA topology. You cannot use CPU emulation. You need to PCI pass through your storage: either the NVMe devices if you use local storage, or the HBA, also for local storage, or even a network card or Fibre Channel controller that then accesses network storage of some kind. The same goes for the network.

[07:31:11]And to stay on the maximum flexibility side, you don't do all of these things.

[07:31:16]You emulate your CPU, you use file-backed storage, you use a virtual network, you possibly do not create a NUMA setup of any kind, and you do not pin your CPUs.

[07:31:32]The trade-offs, well, they are clear. On the performance side you have a very much reduced host utilization, of course, because you are pinning: you are dedicating CPUs and memory to that one virtual machine only. If that one virtual machine is idling, the physical CPU cores and the memory are not doing anything and are not available for other workloads on the same machine.

[07:32:02]Live migration is hard. If you do all of that, it's impossible, at least to my knowledge. There's no way to scale.

[07:32:17]So you define your virtual machine and that's it. That's what you get, at least as long as you don't shut it down, and you have no way of doing any snapshots on hypervisor level. Thank you. The trade-off on the flexibility side is latency. That's the whole point: a higher latency on CPU and memory level, a higher latency on network level, a higher latency on storage level, and jitter. So you cannot even say, okay, that's the level of performance you're going to get. You just have a vague assumption about the performance you're going to get, depending on what other workloads are running in other virtual machines on the same host.
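To make the distinction between latency and jitter concrete, here is a minimal Python sketch that summarizes a list of latency samples. Using the population standard deviation as the jitter measure is an assumption of this sketch, and the sample values in the usage lines are made up; they are not measurements from the talk.

```python
import statistics


def latency_profile(samples_us: list[float]) -> dict[str, float]:
    """Summarize latency samples (microseconds): mean, jitter and worst case.

    Jitter is taken here as the population standard deviation, which is
    one common choice among several.
    """
    if not samples_us:
        raise ValueError("need at least one sample")
    return {
        "mean_us": statistics.mean(samples_us),
        "jitter_us": statistics.pstdev(samples_us),
        "max_us": max(samples_us),
    }


if __name__ == "__main__":
    # Made-up samples: same mean latency, very different predictability.
    print(latency_profile([10.0, 10.0, 10.0, 10.0]))
    print(latency_profile([8.0, 12.0, 8.0, 12.0]))
```

Two configurations can share the same mean and still differ in jitter, which is why the speaker treats jitter as a separate cost of the flexible setup.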

[07:33:10]And that's the message I want to transport. The customers and partners need the best of both worlds, or at least something in between. From another angle: looking at the summary again, on that double arrow, we are validating there. We are not on the max performance side.

[07:33:35]There are a few little things we could avoid doing in order to get the validation done.

[07:33:42]But most of that is necessary to create a virtual machine that brings the performance that we need for a hypervisor validation. And that is a rough estimation of what the customers and partners ask for. We had a lot of talks lately with customers and partners, or possible customers and possible partners, who sometimes ask for something like this, or this, or this.

[07:34:26]Some who are accustomed to, and maybe even like, pain could go here, but they can never go there, because they need backups. Even if they can avoid live migration, they need some way of managing the virtual machine on top of the host, to scale or whatever. It's just not practical to go to that red dot there. So what they do, in those configurations that I have seen that are currently already in place: well, they just go with something in between, and are prepared that if something happens they may not get full support for what they are doing, which is a risk not every customer is willing to take. So what I propose, or what my ask here is: can we reduce the size of this gap between what the customer, the partner, the market needs, and what we are actually doing, what we are validating, and what we are telling our partners and customers is validated and supported?

[07:35:49]This here is just me giving some ideas where we may be able to narrow the gap. Optimize code for performance. I'm no developer.

[07:36:07]Maybe it's not possible to optimize the code. Maybe we are already at the performance bleeding edge here. But if there is a KVM developer in this room or on the video, have a look. Storage performance: if you can boost storage performance with file-backed storage in KVM, I would be a great fan of you for sure. Management solution: by that I don't only mean that we should have a management solution to manage bigger KVM environments, though we need this as well. What I've seen at competitors is that they deal with some of the complexity on top of, thank you, on top of a management platform, with something like a bullet point where you can check or uncheck whether your virtual machine is a performance machine, a low-latency machine, and that does all the CPU pinning and all the NUMA topology cloning somewhere underneath, in the best case even doing some of it live. I know that you cannot do the NUMA node config declaration live, but CPU pinning you can do live. For now, customers and partners may need to run their own scripts and develop their own stuff to have this done, which could be done in a solution maintained by us.

[07:37:57]Third point, agree on more flexible, substantial or scalable support statements.

[07:38:02]Currently, there is a combined support statement from SAP and SUSE saying: this virtual machine, this hypervisor environment is validated.

[07:38:13]This is what we are supporting.

[07:38:16]We support more. We don't usually block a customer that comes to us and has a problem. But still, from a legal perspective, this is the support statement. Is there a way that either we as a company ourselves go out there and say, hey customer, hey partner, we are supporting you beyond that point, or, even better, a support statement between SAP and SUSE together saying, hey customer, this is the validation, this is the supported environment, but if you do it differently we still give guaranteed support for this and that, or whatever. This is very vague.

[07:39:05]Again, it's just my ideas on how to make our customers and partners happy with SUSE KVM, and especially SAP HANA on SUSE KVM.

[07:39:21]Okay, I talked very fast. I'm sorry, but I needed it because I'm already out of time. Um, 3 minutes left for feedback and Q&A.

[07:39:32]Thank you.

[07:39:41]So one question: why wasn't SLES 15 SP6 validated? In all your slides, you had SP5 and SP7.

[07:39:48]>> Yeah. The why... I'm going back to the slide.

[07:39:54]The why is straightforward. SP7 has the longer support cycle.

[07:40:04]And when we started validating these, SP7 was already out. So it made little to no sense to do it with SP6 and then SP7. So we decided to go directly from SP4 or SP5 to SP7.

[07:40:23]I'm wondering about the case with multiple VMs per hypervisor, whether that wouldn't be more flexible in the sense that you could bind the vCPUs of the VM to a single NUMA node. The question is, what about the memory? Would you be able to make do with just the memory of that NUMA node, or would you need the memory of the other NUMA nodes as well, because HANA... Well, but if you run four VMs on that host, if it had four sockets, you would partition it equally. That would give more freedom. You would not have to have strict pinning.

[07:41:03]Would that be meaningful for customers?

[07:41:07]>> Yeah, it's depending on the use case.

[07:41:09]From a validation perspective, if we validate a multi-VM scenario, we always validate in whole numbers of sockets. So you can use one socket or two sockets. You cannot use half sockets for now. There is a validation for that, but we have not done it yet.

[07:41:29]>> Right.

[07:41:30]>> But to answer your question: in that scenario you cannot break out of the configured socket. So you cannot use one socket CPU-wise and then use two sockets memory-wise. That wouldn't work, or would not be... >> It would perform poorly.

[07:41:52]>> It would, that's in addition, but it would also not be supported. But you can run four virtual machines on a four-socket server if multi-VM is validated.

[07:42:06]>> So and that does not strike the balance between uh the performance part and flexibility part. It's still too far towards performance.

[07:42:15]We have validated the multi-VM scenario.

[07:42:17]So this is possible. You can run two VMs on a two-node system. You can run four VMs on a four-node system.

[07:42:25]>> Yes, I'm questioning that. Okay.

[07:42:27]>> Okay. Sorry, I didn't I didn't get it.

[07:42:30]>> We are done, right? Okay. Thank you again.
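The whole-socket rule from the Q&A above (a VM gets one or more complete sockets, and CPU and memory stay on the same sockets) can be sketched in Python as a simple allocator. The function name, the VM names and the first-fit placement are hypothetical illustrations, not a description of any SUSE tool.

```python
def assign_sockets(host_sockets: int,
                   vm_socket_requests: dict[str, int]) -> dict[str, list[int]]:
    """Give each VM a contiguous set of whole sockets (CPU and memory together).

    Raises ValueError for requests below one socket or beyond the host size,
    mirroring the "no half sockets, no breaking out of the socket" rule.
    """
    assignment: dict[str, list[int]] = {}
    next_socket = 0
    for vm, count in vm_socket_requests.items():
        if count < 1:
            raise ValueError(f"{vm}: at least one whole socket is required")
        if next_socket + count > host_sockets:
            raise ValueError(f"{vm}: not enough free sockets on the host")
        assignment[vm] = list(range(next_socket, next_socket + count))
        next_socket += count
    return assignment


if __name__ == "__main__":
    # Four single-socket VMs on a four-socket host.
    print(assign_sockets(4, {"vm1": 1, "vm2": 1, "vm3": 1, "vm4": 1}))
```

Because each VM's vCPUs and memory are confined to its own sockets, no VM reaches across to another socket's memory, which is the case the speaker describes as both slow and unsupported.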
