Install our extension to search inside any video instantly.

[PLDI 2026] Flatirons 3 - LCTES (Jun 16th)

Added: 2026-06-17

191 views08:55:44acmsigplanOriginal Release: 2026-06-16

Metric learning addresses the fundamental limitation of classification models by ensuring distances in feature space correspond to semantic similarity, where objects of the same class cluster together and objects of different classes are separated. Quantum geometry provides an ideal metric space for this purpose because it offers exponential dimensionality (2^n for n qubits), native distance calculation through measurement, and continuous differentiability through unitary operators. A qubit's internal state is a 2+1 complex vector (alpha, beta) with |alpha|^2 + |beta|^2 = 1, representing a point on the Bloch sphere. Quantum gates are unitary matrices that rotate states while preserving length and angles, and all quantum operators are infinitely differentiable, satisfying the requirements for effective metric learning.

[00:52:10]computing at the Arizona State University um where he established and heads the make programming simple lab.

[00:52:16]Uh he completed his PhD and u MS in information computer science from the University of California Ivine and bachelor's in computer science and engineering from IIT Delhi. Um so his research is on making programming simple for embedded and cyber physical systems and as I mentioned before many of you may know him because he's been very active in LCTS and related events. So uh Avidal the floor is yours.

[00:52:43]>> Thank you Heronimo uh and thank you JJ for uh inviting me here. I think it's my true honor to uh present at this uh location.

[00:52:57]So um I know this is a conference on embedded systems and I am talking about quantum computing. So this may seem like a mismatch but uh you will see as we go through this uh this talk that actually uh quantum systems are inherently embedded very resource constraint and embedded in that sense. And so we can apply a lot of our learnings and technologies to to this domain.

[00:53:28]Okay. So uh let me start by first uh thanking all my co-authors here who have who are the real brains behind this work and I'm just the face of them. So I hope I can do justice to them this work and to you guys who have uh thankfully woken up early in the morning and come to the talk. So thanks a lot for being here.

[00:53:52]Okay. So this is the outline of my presentation and uh essentially I have broken this down into two parts.

[00:54:03]The first three sections are kind of tutorial style and so they should be really easy to follow and uh and the next three sections are a bit more researchy. Right? So if so the first three sections I really want you to understand completely and if you have any questions please feel free to ask this uh small workshop style nature of this conference allows us to be interactive right. So please feel free to stop me anytime and ask a question.

[00:54:35]So in my mind if we just understand the first three topics I think we are in very good shape right so that would be a great achievement so with that let me start so I'll start slow in uh and with the background that at most people have uh at least designed some ML models right so if you designed even a few ML models most probably you have made a classification model right so classification models are the ones that you give a dog image and cat image and it can tell you whether it's a dog or a cat so how do they work uh so it there is a a neural network which is essentially just matrix multiplications and then layers of some nonlinear functions right like soft max or relu and then uh The size of the input layer is equal to the size of your input. So for example for this this is for MNEST right. So for MNEST has these images of these numbers. So they are 28x 28 pixels. So that is 764 uh values. So the input size input layer size is 764 and the output size is equal to the number of classes that you have.

[00:56:01]Right? So there will be 10 outputs and uh so after this computing uh you will get 10 outputs and then we run a softmax layer on top of them and what that does is that it takes these values and converts them into a probability distribution. So the order the relative ordering of the data remains the same.

[00:56:21]The higher number is still higher and lower is still lower but they just start adding up to one. Right? And then on this we compute this cross entropy loss.

[00:56:34]This is the key for classification models. Right? So the cross entropy loss is the loss function is the one that actually drives the whole training and what does the network learn.

[00:56:47]So cross entropy loss function is actually just trying to check. So you generate 10 values but uh and each value says like what is the probability that the input data is zero or one or two or three to nine. So the cross entropy loss function forgets about every other output. It only looks at the output of the correct class. So if you gave a three right the correct class was three then it looks at the output of the three class and uh one minus that that value is essentially the loss function. So what you want is that if if your probability that if your network is predicting that the probability that this is a three is 0.1 that means you are very far away from truth and what you want is that the probability should be close to one and then the loss is low and then your network has learned. So essentially this is the basics of uh a classification model right. So no questions till now right.

[00:57:55]So this works well for classification but there is a lot in terms of geometry that is left uh to be desired and this is a crucial point here. So one is uh if you can see at the top right diagram uh uh cross entropy loss function uh what it does is that all it is trying to do is to move is to create boundaries between these data points uh so that the data is in the right region. Right? So that is one way to think about it. But actually the right way to think about it is that uh this cross entropy loss function is mapping data from one vector space to another vector space and so it is it every layer of this neural network keeps mapping to a new vector space until those points are in the right region. In fact for a cross entropy loss function the boundaries are already predecided. So if the probability is greater than 0.5 then it is that class right. So actually the cross entropy cross function is not trying to define boundaries. It's trying to map the data to the right place and that's what the weights mean right the weights are actually transforming from one vector space mapping data from one vector space to another vector space. So I think that is that is clear. So uh the cross entropy loss function has a major limitation that actually it only cares about moving or moving the data to the right region. it has it has no concern about where the data is in the region, right? And this creates some weird problems in the sense uh that actually uh a a yellow point which is from uh one class can be closer to a green point than to another yellow point, right?

[01:00:05]Which in a sense which doesn't make a lot of sense because all of those yellow diamonds are one class, right? for example number three while greens may be number one. So a three should be closer to three than to one. And this concept is actually the key of metric learning.

[01:00:26]So metric learning is the whole idea of metric learning is that distances in the feature space should correspond to semantic similarity.

[01:00:39]And that's the key distinction that we want to achieve. And what we would like to have is a space like that on the right one. Geometrically we would like these points to be arranged like in the right image where the the mappings of objects of the same class are close to each other and they are away from the mappings of another class and uh and and actually this this has a lot of benefits if you can do that. For example, one very strange bene uh problem of classification models is that the classification model actually ma divides the entire space that there is among the classes that you have defined.

[01:01:26]So if you've defined 10 classes then entire space will be divided into these 10 regions. So now what happens is that if you give any other image right it will even if it is not from one of these classes it will map to one of these classes right because there is no space left.

[01:01:43]So and uh what you can see is that what we want is the geometry on the right where objects of the same class are mapped close to each other in clusters and so there is actually space outside the clusters that is free. So if a new kind of object uh you give as input then most probably it will be mapped to an empty space and then you can detect that as a out of distribution data. So so the left side is the kind of geometry that we achieve in the latent space or in the feature space by classification models. But on the right side is what we actually desire.

[01:02:32]Right? And what we desire the fundamental key desire is that the distance between the mappings of same objects right should be less and the distance between mappings of different objects should be more in the sense that distance should reflect semantic similarity.

[01:02:53]So this key idea is called metric learning and uh it is achieved by uh this loss function and there are many other loss functions as well but this is a key loss function that uh that achieves this uh this property and you can see the first thing you can see from this loss function is that this loss function is not defined in terms of the class of an object. It's defined in terms of distances between the mappings of the objects.

[01:03:29]Right? So distance between the mapping of X A and the mapping of XB.

[01:03:36]Right? So how a metric learning pipeline works is that so this uh this special case is this triplet learning. uh in this we take a data from one class X A and then we take another data from the same class we call it XP positive and then we take a data from the negative class we call XN and we pass them through the network and this network then maps them to a latent space right so we can call them ZA ZP and ZN and then we calculate the distances between this ZA ZP and ZN and actually uh these x's should be z's right so that is clearly wrong but what how this uh triplet loss function allows us to achieve that kind of clustering you can see from here is that uh so loss uh so one term in this loss is distance between the a the anchor and the positive right so if this is low then your loss will be low. If this is high then loss is high. So that means what it what this lo cost function or loss function wants is to reduce the distance between the anchor mapping and the positive mapping. And then the second term is m minus distance between the anchor and negative. Right? So, so that means you are trying to reduce the increase the distance between the anchor and negative to at least m. So, if the distance is less than m then you have a loss. But if there the distance is more than m then actually you get a negative number. M minus d is a negative number and then your uh but because you're taking max with zero so then your loss becomes zero.

[01:05:33]So this causes a pull and push effect where the mappings of the same class get clustered together and they are clustered away from map from objects of different class. And this fundamental rule that the distance between the mappings reflect the semantic similarity works both inside a cluster as well as outside the cluster.

[01:06:02]So you can say that the green objects are more similar to the red objects rather than the blue objects. Right? 10 and in practice it would be like the mappings of cats is closer to the mappings of dogs than to mappings of elephants.

[01:06:21]Right? So okay so we want to do metric learning.

[01:06:29]Uh so how do we do this? What is the requirement to do metric learning? So let's go back to class classification problem first. So in classification problem if you want to solve the optimization then you need to work in vector spaces. Why vector space? Because vector space means that every element is a vector and you are allowed to subtract these vectors and also multiply them by a constant scaler. So you will need that to compute to do the gradient descent.

[01:07:04]Right? Otherwise you cannot compute wt + 1 minus wt right and you cannot multiply by uh scalar uh the ea the learning parameter. So to be able to do this you need uh this to be a vector space but that is not enough. We also need it to be continuously differentiable in the sense that at all places you should be able to differentiate the loss function with respect to the weights. So this kind of a space is is a a continuously differentiable vector space.

[01:07:39]Now by by definition this vector space does not have a definition of distance.

[01:07:46]So definition of distance comes from metric spaces and that's why this name of deep metric learning the metric comes from this this definition of metric spaces which is essentially a set where there is a definition of distance and uh there can be many definitions of distance. Many are allowed. The only requirements for a definition of distance is that distance should always be positive. And if you take a vector and multiply it by a scalar, the length of that vector is equal to the uh absolute value of the scaler times the length of that vector. And it should follow the triangle inequality. The triangle inequality is the one that actually gives space to structure to the manifold of the data because what it says is that the shortest distance, the straight line distance is always the shortest. So zi plus zj will always be less than the length of zi and length of zj. So this so th this so you can think of this acting acting in terms of that if you have two red dots and a blue dot and the blue dot is farther away from the red dots right uh so let's say it is at least m distance away from the red dots then what this triangle inequality does is that it creates a circle in 2D space it creates a circle around this red dot of m radius and the blue dot can only exist outside this circle, right? While another red dot could be closer. So essentially this triangle inequality is super important and it creates the structure in the space.

[01:09:38]Okay. Uh next point. So before we uh formally define the requirements of what is required for metric learning, one very important part is that is the need or mandate for high dimensions and this comes from this covers theorem and it's actually very intuitively explained by this uh experiment at the top. So what this is that you have a 2D space and you have these green points and these red points and this is very classical data set. It's called concentric circles. So in this two-dimensional uh data if you want to classify them that means you need a one-dimensional line right to to separate them. Uh but there is no one-dimensional boundary that you can create that will classify this data correctly. So what covers theorem says is that it becomes easier to classify if you project this data in a higher dimension. Right? So what we do is create another z-axis and the zcoordinate of every point is equal to the radius uh the distance from the center of that point.

[01:10:56]So if you can map it this 2D data into 3D then you can very easily now create a horizontal plane and all points below this are green and all points above this are red and essentially that is what this covers theorem is saying that it becomes easier and easier to classify data or find a separating boundary if uh you project the data in higher dimensions. So this is key. So coming back to what are the requirements for metric learning.

[01:11:32]So metric learning uh what you want is higher number of dimensions. The higher number of dimensions you can have the better the separability of the data you can achieve. You want a distance metric, right? And because as you saw the loss function actually contains distances, right? You so you have to calculate the loss, you have to calculate distances and it's a function of distances. So if you can have a quick way to calculate distances, that would be great. And third is that it must be differentiable everywhere.

[01:12:08]Right? So these are the key requirements for metric learning. and metric learning is now the cornerstone of all ML uh technologies. So for example the word toe clip right all of these are already pre-esigned uh metric learning models and actually mappings that uh that generative AI is built on.

[01:12:38]Right? So this part uh to conclude this part as in this is uh classification models are good but they do not enforce very good geometry on the latent space. So the geometry that we desired can come from metric learning models. But metric learning needs a distance function. And so it needs a continuously differentiable normed vector space.

[01:13:08]And so what you would what you would desire for good metric learning is high number of dimensions, continuous differentiability and a native or quick distance calculation function.

[01:13:24]Right. So this finishes the first part of my talk which took double the time that I thought.

[01:13:33]Okay. But I hope this part is clear. All right. To everyone. Then no questions.

[01:13:40]Okay. So now uh let me delve a bit into quantum geometry. the promise of quantum geometry. So the whole point of this section is that I'm trying to show you that quantum geome geometry actually provides very good solutions to all the three requirements of a good metric learning approach. So it gives you very high number of dimensions. It has a native distance calculation metric and it is continuously differentiable. So that's the goal we want to reach after the end of this section of the talk.

[01:14:22]Right? Because those are the three desires for good metric learning approach. So this is uh where we are trying to reach. Okay. So let's start simple again. Uh what is a cubit?

[01:14:35]Right? So uh and to understand that let's backtrack one more step. What is a bit? So bit is a piece of information and it can be in state zero and one.

[01:14:48]Most crucially uh you can think of an internal state of a bit and external state of the bit. So externally when you read a bit it is you will read it as either zero or one. An internal state of the bit is same as the external state of the bit. So if the internal state is zero you read zero. If the internal state is one you read one.

[01:15:10]A cubit is different in the sense that its internal state is different from the external state. So internal state of a cubit is a 2 + 1 vector of complex numbers.

[01:15:24]So it is alpha beta where alpha and beta are both complex numbers with a limitation that alpha squar plus beta square is equal to 1. So what is alpha square? Alpha squared means where alpha is a complex number. Alpha square means alpha time alpha star is alpha star is the complex conjugate of alpha.

[01:15:48]Right? And so then you can say that if a and what is complex conjugate? So if alpha is a plus i then alpha complex conjugate is a minus i and then you can see that alpha star alpha is a square + b square which is a positive real number.

[01:16:07]Right? So uh so alpha squared of a complex number is always a positive real number.

[01:16:17]Okay. So the internal state of a cubit is actually very very rich. It's not just binary 0 or one. It is two complex numbers. So one can be 3 + 2 i, another can be 4 + 6 i. Right? As long as they uh follow the normalization constraint.

[01:16:35]But you can see it is two complex number. Each complex number is comprised of two real numbers. So there are four real numbers inside a cubit state. So it can have infinite possibilities. I mean even one real number can have infinite possibilities.

[01:16:53]But these are four real numbers. So there are it's a very rich state. But the challenge is that whenever you read it, you will read it as either zero or one. Right? So the external state is still very limited while the internal state is very rich.

[01:17:09]Okay. So we understand that cubit is a 2 + 1 vector. Uh and like any vector space there is a basis of that vector space.

[01:17:20]So the simplest basis of this vector space is this 1 0 and that other one should be 0 1 right 1 0 and 0 1. And then you can see that any vector in this vector space can be represented as a linear combination of the basis states right and uh in quantum computing this direct notation is very popular and and actually once you get used to it you realize it's brilliant in terms of when you are operating with vectors. So it's very easy uh the this is a weird symbol which is uh so this is called the cat of zero and this is a column vector 1 0 and k of 1 is uh 0 1 and we write an arbitrary state like alpha beta as a k of s which is equal to alpha * 0 + beta * 1 right so you can see that uh so this k of 0 and k of 1 are the basis vectors and you can represent any general vector as a linear combination of these basis vectors except that the coefficients of the linear combination can be complex numbers right so that's the uh that's a bit of a complexity so then along with this uh k we also have a bra notation so bra of s is actually the conjugate transpose of the cat vector. So alpha so this s was a 2 + 1 vector alpha beta. So how so we need to take conjugate transpose. So first transpose it so it becomes a 1 +2 vector and then you conjugate it. So it becomes alpha star beta star right? So a row vector of alpha star beta star. And then what you can easily see is that if you multiply these two together right with bra first and k later then that comes out to alpha* beta star time alpha beta and that is alpha star alpha plus beta star beta is equal to 1 which is the same as the normalization constraint at the top.

[01:19:39]Right? Okay. So okay so where are we? So we understood that the states of a cubit are a 2 + 1 complex vector, right? So if you think about it, it's a 2 + 1 complex number. Each complex number is made up of two real numbers.

[01:20:04]So that means there are four real numbers and there should be four degrees of freedom, right? But actually uh because of normalization constraint one degree of freedom goes away and there is a unique property of quantum systems where a state and a scalar multiple of the state actually means the same thing.

[01:20:26]So a state s times is the same as another scalar constant c * s right. So that because of that you lose another degree of freedom. So essentially what it comes so that means there are only two degrees of freedom left that means that this uh this state of a cubit is a surface right so actually it's a surface uh all the states reside at the on the surface of a unit sphere right and you can uh write a general state as uh in this form cos of theta by 2 0 k0 and e ^ i sin thetax 2 k1. Why?

[01:21:20]Because as you can see you can see this in two ways right? One is that this cos theta by 2 is the uh so 0 and 1 are the basis vectors and cos thetax 2 is the coefficient of zero and e ^ i5 sin thetax 2 is the coefficient of 1. So this is a linear combination of zero and one right and you can also see that actually uh that because this is a point on this unit sphere right uh so uh the so the coefficient of 0 can be cos thetax 2 and the coefficient of uh one can be sin thetax two so that uh when you square this it will become one right so that cos cos square theta* 2 and plus sin square thetax 2 will be equal to 1.

[01:22:15]Okay. So the point is that the uh that all the cubits live on the sphere on the surface of a unit sphere and their general state can be written as as this.

[01:22:28]So uh most crucially uh theta can have a range from 0 to pi.

[01:22:36]And what is theta? Theta is the polar angle. Right? So the angle from the z-axis and phi is the angle from the x-axis.

[01:22:47]So theta can have a range from 0 to pi and phi can go around the entire xy plane. Right? And by choosing any value of theta and phi you can access any point in this block sphere.

[01:23:03]Right?

[01:23:05]Okay. So uh this makes sense.

[01:23:09]Uh now uh so so this is a vector space and now we need a metric of distance here right so how do we get a metric of distance here so actually uh there is a inner product that is defined on this space because this is a actually the cubit state is not only a inner space but actually it's a hilbert space and there is one more step that I skipped it's actually a projective hbert space but let's not go into that complexity but in a inner space there is a inner product that is defined and inner product of a of two vectors s and phi is actually the bra of one vector times the k of another vector so you can calculate it like this and then you can see that the inner product of a vector with itself is always one right this is we saw this in the last slide what it means is that the length of all vectors is one right and that's what that's why they are all on reside on the surface of a block sphere.

[01:24:18]So then uh but using this inner product we can define a fidelity function which is uh actually this inner product square and that gives us also an definition of angle between two states or fidelity between two states. uh fidelity. Uh so but by the nature of this actually fidelity is opposite of distance in the sense if the fidelity between if you take the if the states are same s and phi then fidelity will be one and if the states are orthogonal then the fidelity will be zero because that's how cos cos alpha works. So we can actually define a distance metric by subtracting it from one by reversing that function. And essentially with this distance function which is defined using fidelity uh distance between similar states is less and distance between different states is more right and the distance is in the range of 0 to one. That's it. So this gives us a distance metric.

[01:25:26]Okay. The next so we now know that the uh cubits are are the cubit states are actually vectors uh right in on a unit sphere. So now what do quantum operators do? Quantum operators actually just change the state. So they rotate the state. So you can think of all the gates just in terms of these three fundamental gates. RX gate, R Y gate and RZ gate. RX gate rotates the state around the X-axis.

[01:26:01]R Y gate rotates the state around Y-axis and RZ gate rotates the state around Z-axis.

[01:26:08]So there are many more gates and all that but you can think of all that the gates do is just rotate the states and you can uh rotate from anywhere to anywhere using just these three gates.

[01:26:20]In fact actually you can rotate with just two of these three gates. You can pick any two, right?

[01:26:28]Okay. And another thing that I want you to see is that uh so what are these gates? These gates are actually just matrices, right? Uh so and how do you apply uh so for so our vectors were 2 + one vector.

[01:26:47]So these gates quantum gates are two +2 matrices. And how do you apply a gate to a ve to a state? You multiply it from the left. So uh so essentially uh where is the pointer? Here it is.

[01:27:05]Yeah. So if this u is the gate which is a 2 +2 vector and s is your state then you multiply u from the left and that gives you the new state s prime. Right?

[01:27:20]And uh there are some restrictions that not any matrix can be a gate. Uh only unititary matrices are allowed. And what it uh what it means is that uh there there are two big implications of that.

[01:27:35]Unitary matrices mean that u u dagger should be identity. Again dagger what is dagger operation? Dagger is transpose conjugate. So you take a matrix you take its transpose right so make rows into columns and columns into rows and then take the conjugate so that's the so that means take complex conjugate of each number so that's how you get u dagger and u * ud dagger should be identity so the two implications of this being a unitary matrix is that one that it preserves the length of a vector so when you apply U to S you get C prime. The length of SI prime is same as the length of S, right? And this is very useful because our states were unit vectors, right? Vectors of length one. So after applying any gate, they still remain a vector of length one. So that means the state is still on the surface of the block sphere. Then second property that it has is that it is angle preserving. That means if you have two vectors you apply u on them then they both rotate together the so it preserves the angles between them right okay so this is one very important property that quantum operators only do rotation and they do like a solid rotation they don't rotate different vectors by different amounts they rotate everything by the same amount right then second important thing is that all quantum gates are infinitely differentiable. So a quantum gate like this RX gate is written in terms of sin theta and cos theta and you can differentiate it. Sin theta will become cos theta cos theta will become minus sin theta and you can keep differentiating it indefinitely. Right?

[01:29:31]So this satisfies our important property that we want a continuously differentiable loss landscape.

[01:29:40]Right.

[01:29:42]Okay. Now this is another uh slide which says how do we measure.

[01:29:50]So this is actually fascinating. So the what we saw is that the defining property of quantum of cubits is that the internal state is very rich. It's made up of two complex numbers but externally it is binary. Right? So how do we achieve that? By math right. So what happens is that all the measurement operators are actually hermission matrices.

[01:30:19]So measure when you have to measure a state, you can only measure it by a hermission matrix. So what is hermission matrix? Hermission matrix is one in which H is equal to H dagger.

[01:30:33]Right? So again what is the dagger operation? is the transpose conjugate or conjugate transpose. So first you take the transpose of the matrix and then conjugate every every element.

[01:30:45]So if a matrix has to be the same as its transpose conjugate that means at least its diagonal should be real numbers right and it should be a symmetric matrix. So this is one property of uh hermission hermission operators uh hermission observables.

[01:31:06]Okay. Now how do we measure?

[01:31:09]So if we take a operator and we uh give it a state say s. So we have an operator here z and we get a state s. Then the only values you can observe are the igon values of this hermissionian operator. So in specific this observable Z is 1 0 and 0 1 0 - 1 its igon values are 1 and minus one. So the only things that you can observe when you read any state whatever be alpha beta right whatever be s you can only read one and minus one and in and when we talk in computing we actually convert this to zero and one is converted to zero and when we say state zero that actually means that we read one and when we read minus one then we say it is state one right so but that's just the binary issue Okay. Now the other property of quantum mechanics is that as soon as you read uh measure a state right that state collapses to the corresponding igon state. So for example so now we can understand the whole measurement process. So the observable is Z.

[01:32:30]Its igon value is 1 and minus one. The corresponding igon state of one is 1 zero or zero and the corresponding igon state of minus one is 01 or one. Right?

[01:32:44]So if you get a state alpha beta and you read it you can only read one of these two values either one or zero. If you read a one the state will collapse to 1 zero or the zero get zero. And if you read minus one the state will collapse to 01. So the state so measurement is destructive in the sense right. So it it changes the state.

[01:33:12]Now so you can read these two values and it is statistical in nature right. So what do you read? Uh so you can read one or minus one but with what probabilities? So you will read one with a probability of alpha squared and you will read minus one with the probability of beta squared. So essentially to calculate that we calculate the fidelity. So the probability of reading one of the igon values is the is a measure of fidelity of that of your incoming state with the iggon state.

[01:33:52]Right? So you can see that measurement actually evaluates the fidelity right. So the probability with which you will read one is equal to the fidelity of the two states. One of them is your input state and other one is the igon state.

[01:34:12]So this again uh this is again towards our property that uh we want a matrix we want a system in which the fidelity or distance can be measured natively then this is probably the most fascinating aspect of uh uh of quantum spaces quantum geometry. So uh so what does the state of a multiubit system look like? So we saw that a state of one cubit can be represented by a 2 + 1 complex vector right. So to figure out the state of a uh two cubits we actually calculate the chronicer product of the two states. So which is the other the second vector is multiplied by each element of the vector. So you can see that in operation and then you get a 4 + 1 complex vector.

[01:35:08]Right? So and so what you can see is that uh so for a one cubit system you have 2 + 1 complex vector for a two 2 cubit system you would have a 4 + 1 complex vector for a 3 cubit system you would have a 8 + 1 complex vector and so for n cubit system you have a 2 ^ n cross one complex vector.

[01:35:34]So that means the dimensionality of your space is increasing exponentially with n. So you have 10 with 10 cubits you get a vector of,024 complex numbers.

[01:35:48]Right? And with 50 cubits you can have a state which is which has dimensions which are larger than the number of atoms in the universe. Right? So the dimensionality increases exponentially.

[01:36:00]So this is so we fulfill all the three promises of what we want in a very good metric space right. It quantum space provides exponential dimensionality. It provides a native distance engine because whenever you measure you you the probability of measuring a one is a measure of the fidelity right and it is infinite all the operators are infinitely differentiable. So you can calculate your loss everywhere in the landscape.

[01:36:34]Okay, this is a good time to stop. So any questions? Because these two are kind of heavy things although they look simple.

[01:36:46]Any questions? No. Okay. So clearly we can't finish the whole talk today. So we'll just do one more section and then stop there. Right. And I think which is also very instructive as I said I mean if we can just cover the first three sections that would be great right? Yes please.

[01:37:12]>> Uh yeah so distance is I already showed that uh equation.

[01:37:22]This is the distance between two states.

[01:37:34]That's a very good point. So actually for uh a metric space does not define a distance metric. It only requires a distance metric. And any distance metric is okay as long it satisfies three properties. One that the distance is always positive. Second is that uh scalar multiply works right and third is the triangle inequality. Any distance metric that satisfies these three properties can be used for metric learning. And this is just one of the distance metrics. There can be many others.

[01:38:15]Perfect. Thank you for a good question.

[01:38:18]Okay. So now the only other section that we'll go over is what how can we now create a quantum metric learning architecture right so this is also just the basics of how how how we can do this so we'll start from a naive approach what an ML engineer would start doing and then see what are the problems in that and see if we can solve them.

[01:38:45]Okay. So in my mind uh if if I were an ML engineer not an expert ML engineer just a beginner ML engineer then this is how I would design a quantum metric learning pipeline. So I'll take the input data but input data like this emnest image is classical right so I need to convert it into a quantum state right so that's the first step data encoding then second step would be to create a parameterized circuit right so with so uh parameterized circuit with parameters theta that we can train right and improve the separability of the data, right? And then we need a way to measure the distance, right? And then using those distances, we will calculate the loss. And then from that loss, then we can calculate the gradients, right? Del L by del theta, right? And then update the gra uh update the parameters theta, right? Does this make sense? This is how you would make a metric learning pipeline. So let's go into each stage and see what are the challenges and what we need to fix. So first is how do we do data encoding right? So actually we can use the rotation gates that we have right rotation gates take a parameter theta how much to rotate right so you can rotate by that. So typically we draw quantum circuits from left to right where left is the input. Uh then there is a gate and then the parameters come from the top and then the output is on the right side. So we can use this R Y gate for example or RZ gate or RX gate whichever gate rotation gate you want.

[01:40:40]Uh this will rotate and create a state and that state will be data dependent because it is X I X I is the parameter.

[01:40:49]The only challenge here is that the rotations have a have periodic nature right. So uh so if you are rotating for example if this was uh the r y gate then uh theta can only go from 0 to pi. So that means you must scale all your data to be between zero and pi and then you can create unique states for all your data.

[01:41:19]Right? Does that make sense?

[01:41:22]So no problem here. I think this is pretty straightforward.

[01:41:26]Okay. Second, so now we need a parameterized circuit. So now we have encoded this data and we need a parameterized circuit that will increase the separability as we train this theta.

[01:41:38]As we pick this theta.

[01:41:40]So here there is a fundamental problem actually. Whatever theta you pick, we saw that actually quantum gates are like rigid rotations. They cannot change the angles between the data points. So actually this parameterized circuit will never work. Whatever is the separation between the data will remain the separation between the data. It can rotate and all that but it cannot change the separation.

[01:42:12]So to separate the data we use a data re-uploading answers. So ANSATs is just a fancy name quantum computer people use for describing a circuit. So essentially it's a circuit right. Um and it's data reuploading in the sense that uh so there is a layer that is the parameterized layer and then there is a data layer. So there is some gates. So you again rotate using the x parameters.

[01:42:40]So your inputs. So by alternating this parameterized theta parameterized gates and data parameterized gate you can actually achieve separation of data.

[01:42:54]JJ.

[01:42:55]>> Yes.

[01:42:57]>> Yeah.

[01:43:03]>> Right. So we are doing metric learning.

[01:43:06]Right. What we want is that uh the data comes in, we want to separate them, right? So you can think of it as uh I think it may be this. Yeah. So I'm getting say maybe zeros and ones. Let's just think of binary binary, right? 0 and one. Then zeros and ones get encoded on this block sphere.

[01:43:30]If I have trained it correctly then all the zeros should come at the top and one should come at the bottom. So I want to increase the separability of the data right but if I don't have data dependent gates then you cannot change the separability they all move together. So how does data dependent gates actually enable separation? You can think of it like this. So for example uh your data is 10 and 20 in very simple terms degrees right. So the for so so the 10 one will rotate by 10° 21 will rotate by 20° right now now in the next step when you do it again then again the 10 will rotate 10 more degrees so it will go to 20° but the 21 will rotate 20 more degrees and go to 40°. So now the separation is doubled.

[01:44:26]So by interle these uh quantum parameters and data dependent layers you can actually change the separ increase the separability.

[01:44:38]Okay only five more slides left. Yes >> they are quantum gates. They are just rotation gates but they are parameterized by data rather than parameterized by trainable parameters.

[01:44:54]Right? So let me go back here again.

[01:44:58]Uh so you can see so our our layers are formed of W layers and V layers. W layers are parameterized by the trainable parameters theta right and X layers are parameterized by the data.

[01:45:15]So this that's why this is called data reuploading.

[01:45:20]Okay. Then our next part and this is where it gets more interesting. So our next part is that we need to calculate distances and we need to calculate loss right. Okay. So we said that uh uh distance in this in uh quantum geometry is the metric we are using is the fidelity right and so when we measure the probability of Z you read only one and minus one the probability of reading a one is a measure of the fidelity but the problem is that you cannot do one measurement and find out this probability finding out this probability is a statistical exercise. So you need to do several experiments. Imagine like doing 100 experiments and then you see that 70 times you get one and 30 times you get minus one and then you'll say oh pro probability of getting uh a one is 70%. And then you can say okay that is a measure of the fidelity between the two states right so that is the challenge and you have to do several shots of this execution typically more than thousand between thousand and 10,000 shots you have to do and then you read you calculate the number of experiments when you uh the output was zero and divided by the total number experiments that's your measure of fidelity right so I'm overlooking a couple of things for example uh when we when we do a measurement you calculate the fidelity between one of the iggon states of the observable and your given state so there is a swap test circuit that you can make and we can discuss this offline but essentially what it does is that given any two states it converts it projects the fidelity between these two states to the uh probability of you reading a zero in the first cubit.

[01:47:31]Right? So all you need to do is to make several observations and see how many times the first cubit is zero. Right?

[01:47:38]And that is your measure of fidelity between the two states.

[01:47:42]But so now you have to do these shorts, right? But once you compute the fidelity, you can compute the loss function pretty straightforward, right?

[01:47:54]Except you can see that in this I changed the signs of a n and a fidelity of a and a because fidelity is inverse of uh distance.

[01:48:05]So this is the correct um formula. Okay. Then uh gradient calculation, right? So how do we calculate gradients here?

[01:48:21]So uh so uh the problem is that yes the gates are infinitely differentiable but you cannot access the state inside the quantum computer.

[01:48:37]At the end of the day, you give this input X as the input and all you read is when you do the measurement, you can read what is the output state, right?

[01:48:47]So, you cannot do back propagation as we do in classical computers.

[01:48:53]But fortunately there is actually a simple way to calculate the gradient.

[01:48:57]Calculate the exact gradient delt by delta j by calculating the loss at the parameter plus p<unk> by2 and loss at parameter minus by2 and divide by 2. And intuitively why this works is because any loss function is can be actually written in terms of cos theta and sin theta. And so you can see the derivative of this is the cos is replaced by s and sine is replaced by cos and by doing theta +<unk> by2 and minus<unk> by2 also does the same thing. So you can calculate that function using these two functions.

[01:49:33]Right? So now to calculate uh the gradient you need to do two experiments.

[01:49:40]run this uh run this circuit with parameter theta j +<unk> by2 and parameter theta j minus<unk> by2 so this increases two more shots and now these two effects are multiplied with each other right so this is the final qmel flow that would make that would actually work right and and give results so essentially uh we take the data we scale that data We use the rotation gates to encode that data. Then we have a learnable circuit which is a data reuploading circuit and we run this uh for every pair of XA and XP and XA and XN. And for each parameter theta j right we have to run this and for each parameter we have to run theta j +<unk> by2 and theta j minus p<unk> by2 and uh do the swap test here and then collect the number of zeros that are that you get in the first cubit and that gives you the measure of fidelity between the encoded states of these two and that you can use to calculate the differentiate the the gradient the gradients and then you can update the parameters.

[01:51:06]Okay, so this is extremely compute intensive, right? This requires 4 into p into n shots to do one step of gradient update. So very very bad, right? Uh so n shots is about thousand and then p is the number of parameters that you have and four is because you have to do two times for each one and then you have to uh do for x a and xp and x a and xn. So I'll stop here.

[01:51:40]Uh so hopefully what you got is a insight of the that why what is needed for good metric learning? Why quantum geometry provides a very good geometry for quantum metric learning but there are many issues if you just want to do it directly.

[01:52:03]So actually the work that we have done enables us to do quantum metric learning without falling into any of these traps.

[01:52:12]So we didn't have time to go into the solution but you can read our paper and but hopefully you can re you reached here.

[01:52:25]Yeah, thanks. So, we still have questions.

[01:52:37]I don't know whether you make this original problems are relevant to clustering and also other partitioning problems of the data. Right? So is that on purpose or you uh somehow uh try to focus only in this talk based on that?

[01:53:06]>> Uh no I think uh we don't realize but actually all AI is almost metric learning is a very big part of AI. So metric learning is very important by itself. Even though we start learning from classification models >> the real learning is in metric learning.

[01:53:24]So, so that's why >> uh very good talk. I was uh very interested in this um the quantum computer. I'm new to this topic by the way. Um uh I was really looking forward to hearing more about your like research outcome. So can you just like deliver more like the sentences? So actually let me give you some highlights of our results. So actually there is a quantum metric learning tutorial on Google circ because Google makes a quantum computer and they have a programming language to actually program that quantum computer that is called circ. So there they do actually metric learning with 2x two images.

[01:54:14]So they just 2x2 image just four pixels right because they're this doesn't scale in contrast to that our approach actually we can do complete emnest we can do omniglot we can do multimodal learning all of this is enabled because we figured out a way to learn in quantum hilbert space while avoiding all these pitfalls So I think the biggest contribution of this work is in scaling quantum metric learning to realistic problems. It's not still comparable to classical learning. So that's not the approach. I think classical metric learning is very advanced and we are still learning how to do quantum metric learning. And if you look at all the papers that are there in quant quantum machine learning field, they all use like very limited data sets like the concentric circles data set or binary data sets and so on. So one of our biggest contributions is that we have been able to scale it up to at least realistic levels of problems.

[01:55:31]>> Right? I had a similar question maybe for us to get uh like a sense of how you do this first encoding. So you you you gain the dimensionality because of the cubits, right? So the amount of cubits gives you a space of two to the n, right? But then you start with an image like this one right here maybe with 256 bits, right? So an 8 by8 uh image. So you have 256 bits in the image.

[01:55:58]>> So this has 28 cross 28, right? 76 764 >> all right 28 * 28 that's classical bits >> yes >> uh but I guess you have to reduce it >> right >> into cubit so actually you reduce the dimensionality into the cubid encoding >> can you say like more or less how is this scaling more or less working so you >> so actually uh this in ultimately in our solution these images can be converted into six cubits >> six okay >> so they can be their state can be expressed in six cubits. Actually, if you think about it, cubit can hold a lot of information. One cubit could actually encode everything, >> right? Because it has infinite possible states.

[01:56:42]>> Yeah.

[01:56:43]>> Uh but yeah, I think the separation and your accuracy of measurement and all that come into >> I was wondering about like the scalability and yeah, with six cubits, I mean, you can imagine a a quantum computer with six cubits.

[01:56:53]>> Correct.

[01:56:53]>> And then maybe uh final question if nobody else. Uh so then you have to do all these rotations and and I guess there's going to be some physical constraints in the in the granularity at which you can do these rotation. Is that is that a problem at all or not or >> So that's actually a very important constraint and this is I think all embedded people can relate to is that quantum computers as you know uh they have very few cubits right so even the best quantum computer has only 50 usable cubits for now right uh and their circuits cannot be very deep because there is noise and decoherence that happens so you must compute in a very shallow circuit >> so in fact our quant encoding and the whole whole quantum circuit is just two gates in our solution.

[01:57:42]So that's that's a very important part of our solution.

[01:57:47]>> All right.

[01:57:49]So we are one minute over time but if somebody else has another question otherwise uh let's thank Aal again. Oh too late offline. Let's thank a again.

[01:58:00]Thanks. Thank you.

[01:58:28]Yeah, I I know who's giving the talk.

[01:58:30]Yeah.

[02:22:30]I'm just trying to locate this the third >> if the if either of the authors of the third paper today is around just identify yourself.

[02:22:43]Are you are you the hikami? Are you going to be No.

[02:22:47]>> No.

[02:22:48]>> No. Okay.

[02:22:51]>> Yeah. I'm locating the present. Yeah.

[02:22:55]>> This will be terrible if the mic there's there's a mic which >> they will we'll mic you up. Great.

[02:23:11]>> Yeah. No rush.

[02:23:14]I should you do you want to check if you want to quickly check if your project >> you did you did >> did you check if your things project >> oh >> uh think quickly do a check >> problem >> yeah shouldn't be a problem oh you're you're going to be the speaker okay >> yes >> all right so >> I just do a quick check uh the next couple of minutes.

[02:26:34]compiled to efficient and readable C code by Clemo Shavano and others uh mainly from uh Enria blah blah blah as and from KTH it's like four four affiliations over to you Clemo thank Thank you for the introduction. Uh is the mic working?

[02:26:57]Uh yeah. Okay. Thanks. Um so here hello everyone. I'm Clemens and I'm going to present work on a domain specific language.

[02:27:05]>> Yeah. Domain specific language that is uh aims at facilitating the verification of system code and that is compiled to both efficient and readable C programs.

[02:27:16]So this is a joint work between INRA and KTH.

[02:27:20]So the motivation behind this work is that verifying system code is challenging because uh those programs are written in low-level languages such as C for example. Um so you have to reason with uh pointerbased semantics and uh you have to maintain complex memory invariance uh when you reason about those programs. So what we usually do uh when we want to verify system code is to work with uh higher level abstractions. For example, using a refinement approach in which we are going to define several models of the system at different abstraction levels.

[02:27:54]Uh and we are going to do refinement proofs between those models which means that for a given model we are going to prove that it is a valid implementation of the IO level abstract model.

[02:28:05]Um and we have wonderful tools to do that mechanically nowadays which are proof assistants that are tools to uh that have a programming and a specific functioning language uh that uh are used to reasonable programs.

[02:28:21]However, when we use this approach uh there is a semantic gap between uh the very high level abstract model and the source code that we would like to run on the hardware. Uh so in this work we took the approach of compilation to automatically automatically derive uh source code from a model and the ultimate goal is to have a verified compiler. So this means that we want to check that the compiler do not does not introduce bugs when it compiles to source code and more specifically in our cases we uh want to start from rock programs. So work is one of those proof systems and we want to generate C codes but not any type of C code programs that are suitable for embedded systems means that fits the constraints of those systems uh memory wise and time wise.

[02:29:13]So uh rock um the semantic gap between rock and c is that rock is a very high level programming language purely functional. So it means that there is only constant values and uh it has to be executed uh under runtime. So it's not suitable for embedded systems. So that's why we come with uh our own domain specific language which is called Barack which is uh still pretty functional but very minimal. Uh and it can be seen as a rock subset with a particular programming discipline that I will talk about in a few slides. Um so to achieve uh certain efficiency we compile with in place updates. So basically we did not do copy um when we compile to the C code but values in the C programs are mutable and so that's what we call in place updates. We also made a very pragmatic choice to uh compile to readable C code uh because uh our goal is to bring uh those tools to the system programmers community. So we think that it's important to uh generate C code that can be reviewed or easily debug by system programmers. Uh finally because system codes requires fine grain control over the memory layouts uh we have some annotation in the language that allow to uh control this layout. And as a proof of concept we implemented the uh S3K uh separation canel which was originally created at KTH.

[02:30:40]Um so there is already successful stories uh in the literature uh that are using compilation to uh implement system codes for example store for cryptographic algorithms and cogent for file systems uh and uh what we wanted to do with barack is to kind of have having a compromise between uh those two uh works. So have a lightweight approach so that the language is uh uh minimal and easy to learn uh and having uh the possibility to generate idomatic C and readable C code for the reasons I mentioned uh in the previous slides and also we needed some features such as uh arrays and more specifically arrays of non-primitive values that were not supported uh by cogent for example.

[02:31:29]So the rest of this talk is divided in two main parts. The main part will talk about the bar language and its compiler and the second part will talk about the bar port of this uh micro kernel called S3K. So I will not talk about proof uh in this talk.

[02:31:45]So um to give an overview about the compilation architecture of Barack uh the first step is to compile uh the functional language Barack into an imperative one called impan. Uh so the compilation process is divided into multiple phases. uh to ease the proof of correctness of those uh of this compilation and the first uh three passes are inspired by a work uh of Bman published at UPSla two years ago.

[02:32:15]Um because combining uh functional programs with in place updates is incorrect for the majority of functional programs. We have to check that for a given program it's actually correct. So we have this staticalis that is performed on imp one that do this check and we finally introduce another intermediate representation on which we perform some optimizations to make the core the code more readable and idomatic. So those optimization are copy propagation and globalization. So globalization uh is the fact uh of replacing uh some function parameters by global mutable uh variables and sorry finally uh we generate C there's one slides missing here sorry um uh so it's very oh um yeah yes yeah sorry there is a a bang in the slide. Um uh so I'm going to talk about a little more of the language with an example. Um so this is the implementation in Barack of buffers. Um so we have a record type buffer with uh which define the length of buffers and the content. So the brackets here uh are um um are used for declaring an array of 64 and sign integers. And then we define the buffer write function uh that writes a value v into the buffer B. Uh so basically this function first checks that the buffer is full. If it's the case it does nothing and just return the buffer and if the buffer is not full it writes the value v uh at ind length of the buffer content.

[02:34:04]So to give you an idea of uh what how the static analysis work, I would like to um run a concrete execution of this program with the following um inputs uh in the two semantics. So in the copy semantics in which we only do copy and in the in place semantics in which we do in place updates which is the semantics that the compile cod have. So in the copy semantics uh the buffer B is associated with buffer value uh so just a functional value but in the in place semantics we introduce a memory uh we introduce a memory uh so uh the buffer B is a pointer to a buffer record and the buffer content is a pointer to an array of uh unsigned integers.

[02:34:52]So when we execute the first expression so we read the contents B.Contents contents and put this in a variable city. So in the copy semantics we do uh a copy of the buffer contents. Uh but in the inplay semantics what we do is that we have a pointer that is an alias to the buffer uh the input buffer content and when we do an updates in the copy semantics again we copy the content of CT and put to put it into city one and we update the value at index uh index two and we put the value uh sub Um but in the inplay semantics what we do is that we directly mutate the memory. So we replace zero here by seven and CT1 is also a pointer and is an alias to city and points to the buffer content.

[02:35:40]I can continue to execute this program and at the end we obtain those two final states. Uh so in the copy semantics there are all of these um local variables that were uh copied from each other. uh but in the inplay semantics uh B1 and BS uh are aliases to the input buffer and as I said previously CT and CT1 are alyses to the uh input buffer content.

[02:36:06]So if we look closely uh at the two states we can see that some memory access path evaluates this differently in the two semantics. So for example B.clank lengths evaluates to zero in the uh copy semantics but it evaluates to um three in the in place semantics because we did in place updates.

[02:36:25]So the static analysis try to compute those invalid path and then it checks that in the return value. So here it's the return buffer B res everything is current in the two semantics. So in this example this is a case uh so the length and the content of the buffer are the same in the copy and in the in place semantics. So we say that the in the in place semantics the program behaves the same as in the copy semantics and we accept the program however let's say that we made a mistake when programming uh this function and instead of p1.length we put uh we wrote b.length. So in this case it means that uh the uh result buffer here will not contain the updated content because it was contained in B1.

[02:37:14]So if we execute uh this new version uh of buffer right uh we obtain this final state and what we can see here is that in the copy semantics uh the value has not been updated here but in the in place semantics because we do in place updates it has been updated. So this memory path uh that as as origin B rest is does not evaluate the same in the two semantics and we reject this program.

[02:37:44]So to summarize how the static analysis works basically uh for a given function um and assuming that all of the arguments are valid um the staticist will compute this set of invalid paths that catches the incurrence between the two semantics and then it checks that the return value is associated with any of invalid path. If it is not the case then we can accept the program.

[02:38:12]So which in informally replace what the static static analyses an analysis enforces with the following programming discipline which is uh always use the most up-to-date memory path to access a value.

[02:38:26]Okay so let now let's see how this program is compiled to C. Uh so I put again the definition of buffers in Barack and the associated C type is the following. So barrack records are uh compiled to Cbuffers and uh C strict sorry and the contents of the buffer becomes a pointer to an array of ensign 6 64-bit integers and uh as in the example before the memory layout of this um of this uh allocated buffer is the following one with content being a pointer to an array. Um if we do not want if we want to remove uh this indirection here we can use what we call an endboxing operator which is the sharp operator here which say okay let's remove this pointer and flatten the buffer content inside the buffer record and uh as uh we also need to specify the length of the array uh so that it can be compiled to C and the memory layout of the uh this new type is the following.

[02:39:34]So everything is just uh flattened and we do not have indirection anymore.

[02:39:40]Okay. So now let's see how we compile the buffer write function. Um so uh I um highlight here the lines the lines that correspond uh in the bar codes and in the C code. So what we can see is that some uh assignments and some local variables are not present in the C code.

[02:40:02]uh this is an optimization that we do uh on the M2 um intermediate representation. So because in the in place semantics um um creating and reading uh modifying and reading uh the memory creates aliases uh it happens that um B1 and B rest and CT1 are useless variables. They become dead variables. So we are able to remove them and uh this is the result that we obtain after this optimization pass. So we directly uh mutate here uh the first variable CT and uh the length of the input buffer B. And also concerning the the function signature uh this is there is a onetoone mapping between the bur signature and the C signature. So uh the input buffer is a pointer in C and the return value is also a pointer uh which correspond to that uh unbar code.

[02:40:59]Um and now I'm going to talk more details about S3K which is a simple secure separation kernel. Um so it's a kernel developed at KTH. Uh it targets risk five embedded systems. Um and it's based on uh capabilities. Um so originally the implementation is in CN assembly and the goal of this kernel is to uh guarantee some security properties which are u related to special protection and temporal protection. So the the goal of special spatial protection is to isolate it each processes from each other and it enforces thanks to the PMP risk 5 mechanism and the temporal protection uh relies on first the temporary fence to flush the cache at its context switch and it allows to protect against side channels and they also add time padding to avoid the execution time of uh processes to uh interfere with the execution time of over processes.

[02:42:00]and a little bit more about capabilities. So capabilities are token and unforgeable tokens that grant access to a system resource and uh one of the core logic of S3K is to manage dynamically those capabilities. So the main two operations are capability derivation. So a capability is going to delegate its rights to children capability and the reverse operation which is a capability revocation. So the capability will reclaim the rights that it previously allocated to children.

[02:42:33]So the original implementation of Barack was about 2,000 line of C code and 200 line of assembly. The number of lines uh for the S for the Barack implementation of SVK is roughly the same. So 2,000 line of bar but we have remaining C code in this implementation. And this code uh is mainly uh used for interaction with the hardware and also some loops that could not be ported uh uh to Burke for the moment. And thanks to this unboxing operator that changes the memory layout, we achieved the same layout for the uh kernel uh tables.

[02:43:13]To uh test that the um that the the kernel the rubber implementation of S3K was uh reasonably efficient, we uh run some benchmarks on real hardware. Alo an FPGA and in this benchmark we have two processes a client and a server that are talking to each other and we took we measure the time uh that the kernel uh take to handle the message sent by the clients and uh the time that it uh took to handle the me the reply sent by the server. Um so in this setting we have um approximately 3% overhead introduced by by the bar code for the runtime trip which is the sum of those two times.

[02:43:58]We also uh rerun those benchmarks by uh adding a temporal fence uh just before sending uh the message for the client and the reply for the server. Uh so basically the temper fence pushes the cache and uh in these settings uh the overhead uh is higher and is about 8% for the runtime uh chip.

[02:44:21]Okay so let's now conclude. Um so what we uh did in this work uh is that we have a new uh domain specific language that is embedded in rock and that be compiled to C. Something that I didn't talk in this presentation but if you're interested you can read the paper is that we define a semantics for barack and we prove that the semantics is sound with respect to um how the rock uh interpreters interprets uh the bar programs. We have a compiler that generates human readable and reasonably efficiency we have an implementation of a state-of-the-art um separation kernel.

[02:45:01]And for the future works, we would like to uh dig into some proofs so of the compiler on S3K and adding some more features to the language and finding more cases for Barak. Thank you for the attention. I will take some questions.

[02:45:15]>> Thank you. Are there any questions?

[02:45:22]>> Thank you. Andreas Kio Dad. Um I I have a very minor question on this but we are struggling with the same topic in my group currently. So I'm wondering why did you compile using uh just - O2 instead of 03?

[02:45:37]>> Um it was an advice of one of the quarter that say uh it's safer to compile with O2. So but uh yeah >> but it's a performance difference. O3 will run faster quite quite a lot faster in some cases.

[02:45:54]>> Yeah sure. But um so I'm not the compiler C expert and so we reused uh how the previous benchmark on S3K was done and it was compiled with O2. Um but uh yeah I think the main reason why is was yeah it's was not that uh an advantage compared with O3 from what I understand but uh yeah >> okay thank you the reason >> thanks so uh your position is for micro kernels right but you have to also deal with assemblies so why is Is it not possible to translate your uh barock directly into assembly if you have a specification of assembly?

[02:46:42]>> Um so what you will do is to um define abstract function in Barack.

[02:46:51]So a function that do not have an implementation in Barack and that serves as the interface for uh the imple assembly implementation of the functions. Uh and does that answer your question? So let me yeah >> the answer is we already have a verified compiler from C to assembly.

[02:47:10]>> Yeah.

[02:47:11]>> Oh okay that's >> ah okay okay so let for everything can be written in bar including the uh all the stuff relevant to assemblies if I get it right. Yeah I mean the thing that are assembly codes remain assembly codes.

[02:47:28]>> Yeah >> it cannot be written in barack.

[02:47:32]Um >> so if I want to write a context switch >> for example then I need to write some assembly right yeah >> that's what we do nowadays >> but can we also do some stuff in bar uh for context switch >> no so for example context CPC will be written in assembly yeah >> I think my question was was answered already I wanted to ask about what is the subset that you compile to and if that is the subset corresponding to concert but I think that's you're using concert Yes.

[02:48:03]>> All right.

[02:48:04]>> So the the subset of work, right? Or the subset of C. Yeah. Um so we are targeting C syntax the source language of concerts and um I mean the subset is using the records I mean structures uh enumerations uh if then else switch statements and uh pointers and uh yeah will take you So I had one last quick question. Your the style in which you have to write your functional code has to look pretty much very idiomatically like C. It's this continuation of let let let let let yeah is what if you wrote it differently? Would you get very bad behavior?

[02:48:49]>> Um inefficient behavior. So what you will have is just uh not a onetoone mapping between the bar and the C code but it does not affect the performances of the of the code.

[02:49:01]>> Okay.

[02:49:01]>> Um yeah >> great. Our our next talk is on a pointer ownership model.

[02:49:07]>> Yeah. Let's give a round of applause.

[02:49:09]Sorry. Yeah. Thanks for um so it's a pointer ownership model for C. It's inspired by Russ David and uh his colleagues at Carnegie Melon Software Engineering Institute. David's going to give the talk.

[02:49:29]>> I'm going to give you a ping around 7 minutes before the end of the so that goes on the belt on him.

[02:49:52]Okay. So, uh I think I'm miked up. Good.

[02:49:57]All right. So, I'm here to talk to you about uh pointer ownership model for C inspired by Rust.

[02:50:03]So, I'm going to start off with a little personal anecdote. When I was a student, I first learned about formal models and I thought they're interesting academically, but they're tedious and expensive and uh well, they take a lot of effort to do So, they're not going to escape out of the out of the ivory tower. They're always going to be locked in academia. Five years later, I'd learned about neural nets and came to the same conclusion. They're not going to escape out of academia.

[02:50:29]And uh well, in my defense, I was right for several decades. It's only been the past five 10 years that neural nets and uh large language models have become popular. And the fundamental question I have which I don't have an answer to but the notion is can we use large language models to help us build formal methods and uh um make formal methods more accessible and usable. So uh with that the start of the uh I'll start the actual presentation. Um so of course C programs are not memory safe and we're going to focus in this project on temporal memory safety. So we're ignoring things like buffer rubber flows. We're instead focusing on things like use after free and double free.

[02:51:16]So the argument is that we have we are providing a uh model of managing pointer ownerships lifetimes. uh it's inspired by Rust and inspired a little bit by C++'s Ray and it's inspired by the discipline that every C programmer who's ever had a memory error in their program and spent hours debugging it as I had you know in my student days they quickly have to develop something like this so we have a model it covers lifetimes much like Rust um the uh model helps reduce the problem of of uh proving temporal memory safety to a static to a satisfiability problem. So it can be done statically without having to actually run your program. As you might guess, building a uh what we call a P model that that's the model applied to a specific program. Building a P model can be tedious and long because you have to represent every pointer, but LLMs can help do help uh um reduce the tedium and make it cheaper. And finally, we actually have test results. Uh we ran this on 4,000 cases from the Juliet test suite. So I'll start off with an example of code. Um for for my students, I like to ask you what's wrong with this code and let them figure things out. For here I'll simply point out uh what you can see pretty quickly which is that uh memory the memory from error message is freed by line 15 by the usage function and it's freed again by line 16 resulting in a double free. So this program is not temporarily memory safe.

[02:52:54]So the problem is in order to do that you have to look at the code for both functions. We call that whole program analysis and whole program analysis is uh it it's workable in small examples like this but it does not scale to real world code that has more than two functions. So the idea is that we construct a P model for each function which is what we've done here. And doing that you can look at the P model and that and each function individually. So when you're analyzing main, you can ignore the codebase for usage and just look at the P model and determine that uh error message becomes a zombie uh in line 15, meaning it points to freed memory and it's freed again on line 16.

[02:53:38]So in other words, we can we don't have to run this program to see that it's not actually memory safe.

[02:53:45]So the uh the main crux of the uh of the model hinges on responsible versus irresponsible pointers. Uh for you uh Rust experts out there, they correspond to box pointers and references in Rust.

[02:54:01]And uh so we're basically trying to apply the same model that Rust borrow checker uh has and apply to C. So responsible pointers are pointers that point to uh chunks of memory on the heap and when the responsible pointer goes away the heap me memory should also be freed. In particular each heap object should be uh referenced by exactly one responsible pointer and therefore you know that handles the aliasing problem. If we have a responsible pointer we know that no other pointer can be dreference that points to it. uh you know so that way we manage we safely manage heat safety.

[02:54:39]Irresponsible pointers uh are like reference pointers are the blue ones in this diagram and they can point to anything. They cannot participate in memory management and the main concern with them is that they might point to something that's been that eventually gets freed. So for example this box if this chunk of memory gets freed then the uh then we need to make sure that the pointer to it never gets dreferenced.

[02:55:02]you know, that would be a use after free error. Uh, I'm not going to go over these other threes. Uh, well, except for diligent pointers because diligent pointers are kind of the are kind of the boring, you know, readonly things that that um, you know, we'll see those a little bit in later on. And fortunately, most pointers are diligent. They don't participate in memory management.

[02:55:22]Okay. So, so this is what we wound up building. uh a combination of a builder and well this whole thing on the left is what we call a verifier.

[02:55:35]So the builder of course constructs P models and in our case it uses an large language model. In our case we used um GPT04 mini but you could theoretically use any any large language model today or you can build um you can build P models by hand. The rest of this is uh what we call a verifier and it either produces it either says you know the constraints you have are satisfiable or they're not satisfiable. We're using a SAT solver and so if they're not satisfiable you get an unsat core and uh at the base of it an unsat core is effectively a boolean proposition. It's kind of a proof that uh that things are not satisfiable in conjunctive normal form. So it will be something like A and B and C and not A which is contradictory.

[02:56:23]So when we first built this we kind of assumed that people would use the builder and the SAT solver together and the constraint generator that uh feeds constraints to the SAT solver would pull things from would pull constraints from the builder's podel and from it would infer some constraints from the LVMIR you know which is derived from the source code by clang. uh we actually realize that there's a second uh subset here where you can dispense with the P model that's why the P model line is dotted and you just infer constraints from the uh constraint generator or from the LVMIR we call that the whole program stat solver because it simply answers the question does there exist a P model that can that can satisfy this code or not you know is the code internally consistent or not um so that turned out to be useful for testing. Doesn't tell us what the P model actually is. It just tells us whether or not there exists one or none.

[02:57:22]Um, so we wind up using that for testing. You'll see in the testing slides in a little bit. We expect that most people will want to use the builder and uh and the LM to construct P models.

[02:57:33]Podels are actually good for representing the the uh the intentions of the programmer and they're good to maintain as the program evolves. uh you know if a P model says this pointer must be good and passed to a function and someone writes a writes code that passes a null pointer to the function that can be caught at compile time without actually you know and save yourself some debugging effort.

[02:57:57]So ultimately we have uh we have three tests that test the various components.

[02:58:03]The first test just tests the the builder and in particular the LLM. The second test tests the whole program sap solver and ignores the builder. And the third test puts things all together.

[02:58:16]So uh I'll go over the each of these in term. Uh for the first test, we simply construct a manual P model and then ask the builder ask the LM to generate a P model for us and did it get the answer right? For this test we used uh the DOSS Unix program. That's one of my favorite programs because it's fairly small. It has um you know it has uh um 169 pointers. It work it's actually designed to solve a real world problem converting new lines between DOSS format and Unix format and vice versa. So I won't claim it's the smallest but it's one of the it's one of the smallest bits of open source code that actually does something productive. It was not just written for someone's homework assignment or someone to teach themselves programming. So we kind of hoped that uh the system would garner would uh that the builder would correctly guess uh uh determine the p models for all 169 or or we were hoping for a 95% accuracy rate. We w up getting 159 pointers correct 10 pointers wrong.

[02:59:19]So that accuracy rate is just a little shy of 95 is 94.1%.

[02:59:24]You know which is um close enough. And what we have here is a error breakdown of the 10 pointers that were incorrect.

[02:59:32]We could easily, you know, put effort towards fixing one of these and uh to get ourselves to the 95%. But that 95% is an arbitrary uh uh limit that we set for ourselves.

[02:59:47]Okay. For the second test, we rely on Juliet. Um, for those who don't know, Juliet is a large test suite uh, released by NIST of about 50,000 small C and C++ programs. Uh, they're either good or bad programs. Bad programs violate a particular CPE. And each bad program says, you know, at this line, we're going to violate this CBE. And at the good and the good programs are like that, but they don't violate CBE. So, uh, um, you know, 50% good, 50% bad. And for our test, we are only concerned with the five temporal memory safety CVEes.

[03:00:26]So, uh, again, we ran just the whole program sat solver, no LLM, no builder, you know, were these CVEes or or were these programs were the good programs truly marked as memory safe and the bad programs marked as memory unsafe? And as you can see in these uh pie charts under both good and bad, there were 10% of programs that uh used features we did not support. There's some features that uh you know we haven't that um well note to self don't ever say that we can't do this. Just simply say it's future work.

[03:01:00]So uh for instance static pointers are future work. In fact uh static mutable pointers are even not supported by safe rust. So that's kind of something that we need to work on. So uh the 10% that is grayed out here are were things that used features that are future work. Uh of the things that we support all of the bad examples were recognized as bad by the uh whole program sat solver and all but 9% of the good programs were recognized as good and those that 9% is uh works out to 200 programs and we did an error analysis of these uh 200 programs.

[03:01:40]So what we discovered is about half of these programs were cases that we said they are memory safe but they violate palm they violate our pointer ownership model and in those cases 100% sorry in those cases uh I should add that uh one feature of our model is that it doesn't capture every uh temporally memory safe C program. It can certainly uh there are certainly things you can do in C code that's not uh palm compliant such as using reference counting or garbage collection or other more sophisticated uh memory management techniques. Uh and what Juliet does in a lot bunch of cases is it uh takes a pointer that points to the stack returns that pointer from the function. As you might guess, returning that pointer automatically makes it a zombie. And uh and dreerencing that pointer would be undefined behavior. And in real world code, you wouldn't return that pointer if you're not going to if you're not going to use it. But Juliet is not real world code. It's not natural. It's test code, kind of artificial. And Juliet therefore has lots of code examples that return, you know, zombie pointers, but never actually use them. So the the code itself is memory safe but it is not uh compliant with palm.

[03:02:58]Um a third of the examples were cases where you know the code was memory safe and comp and I guess you could say it complies with pom from a visual from a programmer standpoint but it uses it basically used analysis techniques that foiled static analysis. For example, if you allocate memory using Malo, you're supposed to free it exactly once using free. Freeing it no times is a memory leak. Freeing it two or more times is a double free. Well, what if your free function is buried inside a loop, a for loop? Then the static analysis will say, "Sorry, this loop may execute zero times or multiple times. Not safe, not compliant with palm." Juliet however said simply has a loop like what you see there where where the loop is designed to run exactly one time that u you know so the code is memory safe but um our our static analysis didn't recognize that the loop runs exactly once and trust me while this code may be memory safe if you ever submit to me a merge request a pull request with that kind of loop in it well you're going to get I'm not going to be happy with your code.

[03:04:07]So onto the third test where we put things together uh the builder, the LM builder and the SAT solver and these were the results. We get the uh ranges the LLM produced the correct P models between 72 and 95% of the time. This only happened occurred we only did 4,100 of the tests. Those are the ones that were that did not use unsupported features. So, so this is the results and I like to think of this as a uh glass half full thing. You know, instead of generating 4,100 P models yourself, the LM got, you know, between 72 and 95% of them correct. When it gets things incorrect, then you'll get a then you'll get an unsat core for a particular function and you can and you would have to debug that function. So ultimately, if you wanted to get correct P miles for all of Juliet, you would just have to fill in the tops of these things until they all reach up to 100%. That's a little bit of work for you, but but you know, the LM has saved you from a lot of work.

[03:05:08]So speaking of work, future work again, lots of future work to do. We'd love to do this uh um this this work in time, such as handling arrays of pointers, uh static pointers. We uh there are some there's a few functions such as the real path function at PZIX that uh simply are well they are fancier than what our model can support.

[03:05:31]Likewise uh we did all this work with uh GP04 mini. There are newer fancier models out there today that we can work with. To me the we have a sketch of a proof but uh making that form completely finishing the formal proof is future work. But to me, the most interesting future work is we've just used formal methods and an LLM to to to work on and partially solve temporal memory safety. Could we use this to do uh spatial memory safety? You know, can we identify buffer overflow statically?

[03:06:06]Could we do concurrency safety? Lots of interesting questions.

[03:06:10]So, uh I'll give a quick shout out to my team who put this together. Some of them went beyond the call of duty. Uh so, they did a great job. great team and uh so to conclude to conclude we have a formal semiformal model showing memory safety we can use an LM to build this model for any program if the LLM hallucinates then we what we wind is up getting is uh a positive which might be a false positive it might say this program that's temporal memory safe really isn't memory safe it does not ever give you false negatives it will never say this memory unsafe program really is memory safe. You know, hallucinations are caught by the by the SAT solver and produce unsat cores, you know, saying the program is not memory safe. Uh, and as you see, the Juliet case correctly recognized all unsafe programs and recognized correctly most of the safe programs. So, uh, in conclusion, we have a website available and a GitHub project. Uh, the GitHub code was used in the artifact, the reproduction artifact. And uh finally, last thing is we will be giving a presentation of pointer ownership to to the SEI summer series uh happening in Virginia on uh July I think yeah 23rd.

[03:07:27]So uh I welcome you all to come over there and and listen in on things. Thank you.

[03:07:33]>> Applause.

[03:07:36]Are there any questions uh that people have for David?

[03:07:41]Oh, I should quickly mention that there are three of the badges given for this paper. For the previous paper, there was the red and the green badge in the previous paper was a candidate for the best paper award for the uh conference.

[03:07:54]And the next p talk which is going to be given is also got two badges. Uh well just a question to David. Um how do you deal with uh you know you had your irresponsible pointers and if you did operations on them you know change what they're pointing to or uh dreference them or something. Uh how does does your uh analysis catch that?

[03:08:23]>> So the an the point of the analysis is to catch lifetimes of irresponsible pointers. So while a po while you can perform arithmetic within an irresponsible pointer, it is important that it not it not exceed its array.

[03:08:36]That is a uh that's a constraint in the C language. You can't like hop from this object to that object if these even if these objects happen to be consecutive in memory, you know, unless they're part of like a single larger object. So the main thing is to simply make sure that the irresponsible pointer doesn't outlive whatever it points to.

[03:08:53]>> Okay, >> does that help?

[03:08:57]a question here.

[03:09:01]>> Uh great talk. Um I have uh one question. Uh so we you know one of the the parts of this is that LLMs do hallucinate. Um and you found in your code around 94% correctness. Uh out of curiosity um does that change with larger code inputs versus smaller code inputs? Is it more likely to get like the smaller code inputs correct or does it scale poorly or well? Uh that's a really good question and finding and giving a more nuanced answer to that question is future work.

[03:09:30]>> Thank you. Perfect.

[03:09:31]>> DOS Unix is a fairly small bit of program. Juliet is a large test suite of small programs. So we haven't really gotten to answer that question on larger programs yet.

[03:09:46]>> Uh so just something that I don't think you mentioned in your talk. What's the time scale like on solving this model?

[03:09:54]How long does it actually take to analyze a model, run it through the SAT solver, etc.?

[03:10:01]>> Uh, very good question. I don't have any of that data down because because it always was um instantaneous.

[03:10:11]So SAT solvers of course could take uh long a long time if given really huge amounts of code but we have not managed to give it large enough code to to make it take a long time yet. So future work again.

[03:10:26]>> Thanks David. Let's give him a round of applause.

[03:10:29]>> Thank you.

[03:10:30]>> Our next speaker is uh Norimat Tanakasan from Sukuba University. He's going to be talking about Hikami, a lightweight hypervisor for emulating risk 5 emul extension semantics with salriven autogeneration.

[03:11:13]Sorry.

[03:11:14]>> Yes.

[03:11:15]>> Okay. Uh, thank you very much for introd Okay. Thank you very much for introduction. I'm Natakana came from University of Tuba, Japan. Today I'm going to present you Hikami a righte hypervisor for emulating risk extension semantics with say mode generation.

[03:11:36]So first I'll provide a brief introduction to risk 5. Risk 5 is a modern opensource instruction set architecture.

[03:11:47]Its royalty-free nature encouraation and allows for significant customization.

[03:11:54]A key aspect of this file is its modularity.

[03:12:00]ISA is divided into extension running hardware designers to include uh only necessary features.

[03:12:07]However, this modality has led to a significant trend.

[03:12:14]As shown in diagram, the number of offsh extension is growing extremely rapidly with nearly 150 standard extensions currently defined.

[03:12:26]In contrast, the number of extension actually available in silicon rs far behind.

[03:12:33]According to our empirical survey, only about 40 extensions have confirmed adoption.

[03:12:39]>> There is from your purple. Yeah, that's good.

[03:12:42]>> Yeah. Oh, sorry. Uh uh only about 40 extensions have confirmed adoption in commercial uh community hardware. This creates a sever adoption gap.

[03:12:56]This hardware gap is our first problem.

[03:12:59]Extensions uh range from simple instructions to compress features with a new CS and exceptions.

[03:13:08]It is impractical to build hardware for every combination remaining developers without target hardware for a reso validation.

[03:13:18]The second problem is exist stagnation at skin and egg cycle between the hardware and software development.

[03:13:27]where Mr. Rators like QM exist and they abstract role level interaction making a harder couple validation difficult.

[03:13:38]So how can we bridge this adoption gap?

[03:13:41]How can we enable a resource validation without waiting for uh expensive silicon to bridge this gap and we need a validation environment on physical hardware meeting a native speed for existing code.

[03:13:56]The key requirements are ease of use high reliability and uh near native performance.

[03:14:05]Next uh I will review existing approaches for extend extending the ISA and their limitations.

[03:14:15]One approach is hardware assisted validation using FBZAs.

[03:14:20]The ROCC interface allows extending the processor with accelerators that tightly coupled via the on bus.

[03:14:30]This setup receives the instruction as uh commands and returns uh result uh enabling hard revises for every configuration.

[03:14:49]limiting early stage software development.

[03:14:54]Another approach is a firmware via project by open space h using a regal instruction traps in M mode.

[03:15:05]While portable executing immersion in M mode uh which is a uh highest level mode uh violates the principle uh of risk privilege and incurs high overhead due to rock virtualization support.

[03:15:22]Uh to address this challenges uh I mentioned earlier we need to choose the right priority level for emulation.

[03:15:32]Existing em mode has a high overhead and security risks and u mode immation is a secure brow.

[03:15:41]Our research for uh focuses on a type of hypervisor in uh HS mode uh revering batteryization uh for a row overhead while keeping emulation logic is rated.

[03:15:56]Based on this approach uh we developed hikami. Hikami is a rightweight type of hypervisor specifically designed for risk five. Our primary design goal was to keep the trusted computing base as small as possible focusing only on essential function like a memory management and handling.

[03:16:17]This architecture all ros the guest OS to run a near native speed and provide a direct access to harder peripherals meaning we can use an a modified and bend drivers.

[03:16:30]Yeah.

[03:16:32]Uh emulation logic is implemented as a modular hypervisor modules corresponding uh one to one with a describe extensions and uh integrated using the rasex system.

[03:16:47]For example, uh supporting the ZBS uh extension requires only add only adding uh two lines uh to the configuration file.

[03:17:00]It drastically reducing a setup cost supporting an extension requires handling three uh core elements trapping new instructions uh intercepting new CSR uh or control and status registered accesses and injecting new exception causes for instruction.

[03:17:27]uh when the guest executes an an implemented instructions, it triggers an an illegal instruction trap to the hypervisor.

[03:17:37]The hypervisor uh decoded the instruction and execute its semantics in software by updating the guest registers context and finally returns the control the guest as the next instruction for C CSR emulation. The hypervisor intercept access to the new CSR and divert them to a B CSR state in a product memory area.

[03:18:09]For exception injection, we simulate a new exceptions by uh directly modifying the G system register forcing the guest to jump to its exception handler.

[03:18:25]While this architecture is powerful and implementing these modules manually remains a challenge.

[03:18:32]Developers must exhaustively read all new instructions and manually write complex bit m bit masking logic for decoders and implement a lot of boiler plate code.

[03:18:46]This manual process is not only reor intensive but also highly prone to s of bugs. So how can we automate this process and ensure its correctness?

[03:19:01]To solve the challenge of manual implementation, we propose an automated framework driven by former specifications.

[03:19:10]As I mentioned and manual implementation is a labor intensive and errorprone.

[03:19:17]Developers have to manually pass the PDF specifications and transcriber a bit patterns into code which is a major source of bugs and inconsistencies.

[03:19:29]Our solution is to automatically generate instruction decoders and hypervisor module templates directly from a sales specification.

[03:19:39]This ensures that our emulation logic is always in sync with the former ISA definition.

[03:19:48]So what is say? Uh say sale is a for rangage specifically designed for describing the semantics of instruction sets.

[03:19:58]It is already widely used by the risk community to generate officer documentation and test suite. By using sale we revisit an authoritative and ambiguous source of truth for all describe extensions.

[03:20:17]This is what cycle looks like. It defines a mapping between the instructions and the binary uh encodings.

[03:20:26]Our framework passes this definition to extract the necessary information for automatic generation.

[03:20:34]First uh we generate the hypervisor module templates.

[03:20:39]The hypervisor module are the core of our emiration emulation system. So it is critical to avoid implementation gap or logic errors. To achieve this uh our framework automatically extracts the extension name or defined instructions and all new CSRs directly from the sales source.

[03:21:02]This ensures that no part of the extension are overlooked during the implementations.

[03:21:11]The framework is a R template with a match expression for every instruction.

[03:21:18]The de the developer only need to fill in the semantics logic for each case by replacing a to the macro uh like this.

[03:21:31]Uh this draically reduce the boiler plate code and prevent the human implement from overooking any spec specific instructions.

[03:21:42]And the second part of our framework is automatic generation of instruction decoders to handle traps. Uh the hypervisor must convert a low binary sequences from guest memory into structed instruction object.

[03:22:00]Writing these decoders manually is extremely tedious and errorprone due to the complex bit feed mapping in risk.

[03:22:09]We automated this by analyzing the ank deck maps in sale.

[03:22:16]By identifying the construct op codes and variable plant bits, a framework can automatically derive the logic needed to identify and disassemble any instruction.

[03:22:32]A generation workflow uh follows a five steps uh from extracting instruction to constructing a deterministic decision tree. The final output is a set of optimized ras match expressions that provide fast and correct encodings.

[03:22:54]Our framework with a highical decision tree from a field data.

[03:23:01]This ensures that every possible bit pattern is mapped to either specific instruction or an error providing a systematic coverage of the encoding space without manual errors.

[03:23:18]Uh the emitted RS code is clean, efficient and guaranteed to match the sale specification.

[03:23:26]By automating this uh we eliminate a major cost of bugs uh commonly found in manual written emulators and finally uh we generated automated unit test to verify the decoders.

[03:23:43]The framework synthesizes instructions with a random operance encoding them according to the specification and then check if the generated decoder can correctly pass them back.

[03:23:57]This proes uh provides an array of assurance for our implementation.

[03:24:04]Now let's look at the evaluation.

[03:24:08]We we evaluated hikami from two perspectives.

[03:24:13]emulation performance on physical hardware and overall impact on the system including resource efficiency and real-time responsiveness.

[03:24:24]Experiments were conducted on the MI 5 meg a real platform supporting the hyper extension.

[03:24:33]For fundamental validation, we target the ZBS bit manipulation extension uh which is not natively supported by the hardware.

[03:24:44]We verified that the guest program using the uh using this instruction could run correctly on the physical board.

[03:24:54]This successfully demonstrated that Hikami can provide the correct semantics for unimplemented extensions on real SEC.

[03:25:04]Uh to measure the emulation overhead uh we conducted the number of instructions executed by the hypervisor to handle the single uh B set B set instruction drop.

[03:25:18]uh emulation overhead is approximately uh 300 instructions per trap about half comes from uh the modular dispatch mechanism at full age of use is lightweight uh with the core of about uh 7,000 running of code. More importantly, uh our generation framework provides a huge product regain in the database module. A nearly 75% of the code was generated automatically.

[03:25:51]The developer only had to write 28 lines of isetic semantics logic which is significantly reduced the potential for human error.

[03:26:04]Finally, we compared hikami with QM using quark to measure environment file performance. Hikami maintains 99.8% of native performance. In contrast, a QM's fation operates at only 8.5% of native speed.

[03:26:22]And in realistic fros such as the OS boot or driver initialization an implemented instruction typically account for less than 0.1% of total execution.

[03:26:35]Uh in this scenarios hikami is up to 3.4 times faster than QM. This makes the hikami a spial validation environment for complex and harder couple software stacks.

[03:26:48]And to conclude, we propose a system combining a batchization and automate automation to address the risk of hardware gap. First uh our rightway type hypervisor enables a near native execution with a mod support for new extensions.

[03:27:04]Second uh our saledriven framework automatically generates a decoder and templates uh eliminating manual errors and reducing implement implementmentation effort.

[03:27:16]Third and finally uh evolution on real hardware shows our system maintains a 99.8% native performance achieves a sub micro second latency and outperforms QM in realistic macros. Uh this rightway design provides an efficient foundation for testing and retrofitting new extensions. So thank you for very much for your attention.

[03:27:48]Norasan, there's lots of time for questions. So, do we have any questions?

[03:27:58]Uh, so while other people are thinking of their questions, uh, just wanted to ask you first, uh, you know, a light-hearted question, which is what does hikami actually mean in Japanese?

[03:28:11]Uh it >> does it is it mean something? Is it is does the word mean something?

[03:28:19]>> Uh it is uh distriction name.

[03:28:24]>> Okay.

[03:28:25]>> Yeah. All right. So uh one one of the things I was wor uh interested in knowing is you trap an unsupported uh instruction and then you're going to go to the hypervisor. What happens to the instruction decoding pipeline? Do you just sort of do you execute? Do you start the execution of the next the decoding of the next instruction or do you pause that?

[03:28:47]So suppose there you you take something from the extension you need you're not going to execute it in native mode but the next exe the next instruction in the pipeline uh is it you start decoding and start processing it at all if that's in native code >> if that's in in native mode >> we we code each instruction one by one.

[03:29:10]>> Okay.

[03:29:10]>> Yeah. you so so you you're not considering the possibility of you're going to interrupt the pipeline and uh okay are there any security implications of running things in HS mode versus native mode uh have you could could you comment on that >> uh I couldn't uh I couldn't solve the aspect so >> so so there could be some security implic applications, right? If you're >> simulate em emulating something in hypervisor >> versus how it would have run on if it was really supported on a machine. Uh how do you deal with those information security leakages and so on?

[03:29:59]Uh uh risk five risk five hex hypervisor extension supports u memory isolation.

[03:30:10]You have isolation memory isolation.

[03:30:12]Yes.

[03:30:12]>> Okay. All right.

[03:30:13]>> And you prefer the HS mode over the M mode.

[03:30:17]>> Sorry.

[03:30:18]>> Is the HS mode always preferred over the M mode for security reasons?

[03:30:22]>> Yes.

[03:30:22]>> Okay. All right. Any other questions?

[03:30:24]Sorry, I've been asking the questions.

[03:30:26]Yeah.

[03:30:27]>> Okay. Let's give Norimasan a big hand.

[03:30:31]>> Thanks so much.

[03:30:32]>> Thank you very much. And remember the remember to check out the artifacts, right? and play around with them. Uh the last talk and let me get this in my hand. This session is a work in progress talk. It's a scheduled partial credit RL for reliable code generation with small language models. Um and Surange Singh Sally is going to present it. You're from Penn State is that so thank you. So the mic is here.

[03:31:15]Thank you.

[03:31:19]Okay. So it's 10 minutes. What I'll do is I'll give you a finger saying two minutes or so to go. All right.

[03:31:47]Do you have an HDMI connector?

[03:31:51]>> I think so. Yeah.

[03:31:53]Let me try.

[03:31:56]It works. Yeah. I thought uh hello everyone. So to present this >> all right uh so I'll be presenting this checkpoint of our overall work uh that focuses on improving code generation processes in small language model. Uh so here we are focusing specifically on uh scheduled partial create RL pipeline and I have uh configured the presentation into answering four specific questions which are why, what, how and what's next.

[03:32:52]So answering why first. So why are we focusing on small language models specifically? Um we all know that LLMs are like pretty good at code generation.

[03:33:01]They're beating benchmarks every second month. So why are we specifically focusing on small language models? So the very direct answer would be that small language models can really fit inside of our um specific deployments inside of PCs inside of uh edge devices and if you go very small you can also fit them inside of some embedded systems. uh and what is the problem that we are really solving because there are small language models that are fine-tuned for code generation. Now the thing we found out was that these fine-tuned models usually produce only 50% of syntax valid code which is a really big issue because these models promise uh really good security or secure code but only 50% of them are actually syntax well. So there is a big loophole in there.

[03:33:57]Uh what does our work specifically entail? We also set out to solve the security paradigm. But the issue is not really security. The issue is compilation. Because a lot of these models actually promise more than 80% of uh security valid code, security clean code. But if you go in deeper and analyze the secure code, you find out that 50% of them are stops that don't really even compile. But since they don't have a vulnerability that can be caught by an SSD2, it comes out a secure clean, but the code actually does not even compile. So we found out that reliability is something that we have to focus on. We have to focus on code that actually compiles first and then we can focus on security. So we pivoted to producing code that actually compiles and is syntax valid first and for that we created a joint reward which basically prioritizes code functionality over code security.

[03:34:57]Uh so how does this solve the problem?

[03:35:01]We started off by uh selecting a baseline which is very low in terms of the number of parameters which is 1.3 billion. Uh this is a pre-finetune version. So deep coder is fine-tuned for coding. Uh but even so the baseline only produces about 20% uh syntax valid code. So we applied a lora based uh supervised finetuning uh pipeline and we used as plus data set as the baseline for the finetuning. This only fine-tuned about 43% of the total number of uh parameters. So total number is 1.3 billion. We finetuned close to like 60 million and this was only behavioral uh finetuning and we did not really do anything with uh how the model produces code and we saw significant improvement there but the main RL portion comes next. So after we got the super supervised finetune model, we ran it through a PO training pipeline which consisted of um a a double reward setup which included a functionality reward and a security reward. Uh the security reward consisted of the bandit SSD2. The functionary reward basically consisted of how many test cases it's passing and from that we got our main model that we'll be evaluating later and how did this solve the problem. So the main problem we face when evaluating or when training small language models is that the reward is too sparse. So the model finds it really difficult to learn anything based on the binary rewards. So we broke down the reward into partial credits so it can sort of have a ladder to climb. We set up like specific milestones for the functionality reward so it can be easier to train it. Uh if it's syntax valid that's like the minimal requirement it will get at least some reward and then if it's passing more and more test cases it's get the it's getting the full reward. And to maintain the security part of our reward, we had a guardrail using uh medium and hard security vulnerabilities. If there are any, it will basically uh not give any positive reward to the model and the reward will degrade. So the model is forced to maintain security while also focusing on uh reliability.

[03:37:27]And the result we got was pretty significant. So as I did mention that binary reward is really sparse when we go down the scale uh in terms of small language model training. So the binary reward actually is worse than what we got for our SFD model. Uh we only got 18% syntax valid which is like uh 26 percentage points less than even the baseline. This happens because just using a zero or one for functionality does not really teach the model anything at this scale because just think about it this way. If you don't know about a specific topic and you're attempting it, you won't be able to learn without any feedback what you're doing wrong, right? So it's difficult for the model to explore and gain any insight about what's wrong and what's right. with the partial credit it improves slightly but what really helps is a a schedule framework. So we give our model a warm start using the binary framework and then move on to partial credit. So it has uh a very rigorous exploration set up by the binary because it's not able to learn anything. So it has to explore by its own and with the partial C it's sort of smoothened and because of that we get a really high syntax validity score in our curriculum uh model.

[03:38:51]Now this is just what we have done so far. Uh since this is work in progress there's a lot we can do in the future as well. Uh the very first thing would be to tighten the reliability result. Uh that's because our current version does not really have a very bounded C a very tightly bounded CI. So we can probably run more seats uh run more steps to get it to like plus minus 2% I guess. And secondly we can also fix the hyperparameters and ensure that we are using the best ones. We can run uh a rank based ablation for our fine tuning platform. We can also play around with the learning rates. Uh there's a lot of other ablations we can do. We can generalize the small language models and try it on different RL algorithms. For this we use PO which is just one of the many. PO is not really memory efficient as well. So we can probably try GRPO.

[03:39:49]There's a lot we can do. uh but our primary end goal is to create a model that's both reliable and secure uh and generates good code in the end.

[03:40:02]So the key takeaway of this presentation is that sparsity is the failure mode in terms of binary uh credits in both security and uh functionality frameworks and to fix this in small language models we have to utilize some sort of curriculum or shoulduling to get around uh the failure mode. Thank you. Uh I'll take in any questions now.

[03:40:31]Thanks. That was very timely and okay any questions.

[03:40:39]Great.

[03:40:43]>> Hi uh thanks for the talk. Sorry if I missed it at the beginning but uh what language are you asking these uh LLMs to generate?

[03:40:48]>> So for this we focus on Python and we have extended this to uh three languages C++ and C as well. And do the results vary by language in terms of syntax validity percent? uh so earlier so syntax validity does not really uh matter as much in terms of how uh or what language we're using but for security it's mostly that Python is very highly uh compile and security is like if we take both of them into account Python is the most reliable and C and C++ DLM really or SLM really struggles with it for some reason >> so yeah >> by reliability you mean or as in you mean the C and C++ programs are less secure than the Python ones, >> right? Yeah, >> probably just because C and C++ are not memory safe is >> I believe. Yes. So, we tried 19 uh CWEs when we were uh testing for uh the security relevant uh prompts and eight of them were memory focused and those were the most erroneous ones too.

[03:41:55]So, yes, >> I guess that makes sense then. Thanks.

[03:42:02]>> So do we have any other questions?

[03:42:07]>> Yeah, just a short question. Thank you for the talk. So I because in the first time you show you uh the small model is deepseek, right? So I just curious about what's the reason to choose this model.

[03:42:17]Is there what's the problem for the other small models? Yeah.

[03:42:20]>> Uh the deepse 1.3 billion.

[03:42:23]>> Oh yes.

[03:42:27]>> Is it this one?

[03:42:28]>> Yeah. So yeah, this one you said you are using deepseek model and I just curious why why why you choose deepsek at at least points. Is there any problem with the other small models? Yeah.

[03:42:39]>> All right. Uh so 1.3 billion parameters is like one of the lowest ones we can have that is fine-tuned for coding. Uh that's one of the reasons. Uh it's open source. That's another reason. Uh so we started off with 7 billion parameter models but this actually beats them with the whole pipeline. So it's sort of like setting up the lowest point we can uh train the model from and get our results high. Yeah, >> I see. Sounds good. Thank you.

[03:43:08]>> Any other questions?

[03:43:10]If not, let's give Surange a big hand.

[03:43:13]And I'd like to thank all of you for participating in this session. Want to thank the speakers. Let's give them all another round of applause. I think we had a great and remember the first talk was a candidate for best paper in the conference award and there were two candidates and this was one of them and as organizer do you want to say anything so then we meet after lunch so enjoy yourselves thank you And I'm going to do lunch today was really good. So, I don't blame you if you fall asleep during my talk. Uh especially since I'm used to it when when I'm teaching classes, all my students are sleeping too. So it's okay.

[05:26:40]>> Oh, there you go.

[05:26:42]>> Let's get started. Uh welcome to session four specialized hardware and accelerator design LCTES and hope you guys enjoyed our lunch as uh Scott our first speaker Scott mentioned and Scott is going to introduce his uh research paper can fine grain multi-threading sub zoom vi thank you I appreciate it uh so yes I'm Scott Palmerville uh this was a paper in conjunction with Michigan Tech and Florida State as well. I'm from Northern. Um this is an homage to uh the title is an homage to uh can dataf flow subsume vonoyman computing. Uh so if you know the history of that um the answer ended up of of course now you know is no. So uh don't assume I'm promising the moon because the answer might be no. But we're going to try our best to to convince you that there might be a case here. So, uh, really quick, we'll we'll just jump through here. Um, and just to give a background because VIW, it's a thing we don't really think about every single day. Um, but this is kind of what a VIW looks like if you squint really hard. Um, so over here you can see an instruction stream. Uh, and this is a three- wide VIW. Oftent times VIWs are powers of two, but this one I'm just trying to illustrate it. And each essentially instruction is a bundle of operations. So uh you have an add a subtract and a load word as part of an instruction. Then the next instruction might be a store, a multiply and a noop.

[05:28:09]And each instruction is going to end up essentially getting fetched by this fetch unit. And then the first third might go into this sort of pseudo pipeline here. The second one goes into kind of this this pipeline here and the third instruction would go into this pipeline here or vice versa. So or or something like that. So the idea basically though is that a VAW processor is much simpler than what we consider our modern sup modern supercaler processors and these instructions are going to have to be uh essentially set up in a way or these operations are going to have to be scheduled by our compiler uh to maintain alignment. We oftent times pad these instructions out with noops uh although we try to then compress them afterwards with specialized hardware. Um, but that's the basic idea and very simple version of a VIW and we're really only going to be talking about the basic versions here today. So the motivation here is really um with VWs parallelism is is bundled with uh sort of this instruction representation. VIWs have these noops and compiled code really needs to match VIW width. So if I have a four wide or 8 wide VW, I need to make sure my code matches that width exactly or have again special hardware that's going to modify for instance the PCs or the PC increment to handle different width code which means we have to have specialized hardware for backwards compatibility.

[05:29:31]But what if we have a different encoding? What if we consider that this might not be the only way to represent parallelism statically?

[05:29:40]uh new representation might have some potential to handle some of these problems natively. Um and also while we're at it might give us some new avenues for research. So instead let well what's another way we we oftentimes do it? Oftent times we use multi-threading and that's really nice but requires some extra intervention. So instead of though what if we just take these these VIW packs and treat them as lock step threads instead. So here I have some instructions here and I've just sort of removed the actual instructions and replaced them with I1, I2, I3. And so you can see here or operations. What if instead we just have multiple PCs? Each one is going to run sort of in parallel and lock step. You notice each lane has its own PC. And then what happens if we then apply maybe some simple synchronization mechanisms, just the ability for lanes to just turn off and on again um when they're just told to. scenario, what we find is that we can basically eliminate most of the noops, if not all of them. At least that's kind of the hope. And so what we're going to do is essentially treat these sort of individual sort of pseudo threads as a different stream, but each one is going to represent sort of the different parallelism uh or the the parallelism that a VIW would by using long instructions instead using these synchronized threads. Um, so what we're going to do is really add just a quick piece of terminology here. This idea of a lane. And a lane is really just sort of this pseudo thread coupled with the corresponding hardware that it's going to run. Because if we remember that VW has one single fetch unit and then a bunch of essentially just pipeline simple pipeline processors that share a register file. The idea is is instead let's just make sure that each sort of pseudo thread has its own individual fetch unit and go from there. We're going to define lanes essentially as being in one of three states. An active lane is just going to fetch instructions from whatever its corresponding stream of instructions are and run them like we expect. There's suspended lanes. And a suspended lane is one where what we'll do is we'll just say, hey, if you're um running and you suspend, don't fetch or do any instructions there. And so that'll allow us to essentially not have to encode those instructions. And then in a halted or inactive lane, it's just a lane we never turned on. and it doesn't actually have a corresponding instruction stream. It's just the hardware that's shut off. To prevent deadlocks in this sort of pseudo multi-threaded um situation, we'll have lane zero be always active and every other lane we'll refer to as like oneplus essentially. And those ones can turn off. And this entire architecture, the corresponding hardware as well as sort of this instruction representation is going to be referred to as a synchronized lane architecture because I'm not really good at naming things.

[05:32:27]All right.

[05:32:29]So the most significant bit is actually going to be one of the most important parts and we're going to talk about that a little bit in a moment. The ISA that we provided is actually a from scratch ISA. It looks a lot like MIP because I might as well use something that mostly exists so that it doesn't have to do everything from scratch. Um but it is its own ISA. um we call the SLISA and it has a bunch of instructions all dedicated to spinning up instruction streams as well as halting instruction streams as we go entering functions and so on and so forth because a couple things are going to change. We won't be able to get into that but if you're curious you can read about it in the paper. We have several registers as well that are also just really there to administratively be like are you running or are you not running right for states as well as um if we enter a function normally we have to keep track of one PC now we have to keep track of a handful of PCs so we have extra registers for that.

[05:33:24]The main kind of important part though is this idea of being able to turn on and off lanes which is really just a single bit in the instruction set that we call the suspend resume bit or an SR bit. A suspend resume bit sits in as the most uh significant bit of any instruction. And that way we don't have to decode the instruction to know if that bit is set. We can simply use it.

[05:33:47]And the idea here is uh I'll just add a SR in my uh assembly to say hey this this bit is set or not set. So all instructions basically have two versions one where it's not set and one where it is set. So an addu would become addu.sr.

[05:34:02]And that just says when you fetch this instruction don't fetch any more instructions. You're done and you'll suspend. At least that's true for latence one uh one and so on. So if they see that that's just the last instruction you're going to run. It'll run through the pipeline and do what it does normally, but it will no longer fetch instructions and it suppresses any access to the instruction caches. And each lane has its own individual instruction cache. Lane zero has a SR extension, but since it can't turn off, all it says is all other lanes that are currently suspended but still have a PC resume execution. Halted and inactive lanes do not have a PC, so they won't continue. So that's the basic idea of these SR bits for branches and jumps because each lane has its own kind of uh thing going on.

[05:34:50]Um we need to make sure that when a branch is taken all of the lanes update their PCs but it would be terrible to duplicate all of those branches and have a whole uh branch taken buffer for each lane. So instead we offer just a single register called a prepare branch register that just stores a PC and whenever a lane encounters a special instruction called a PB instruction. It just puts a value in that register and whenever lane zero takes a jump all other lanes that are suspended update their PC to whatever that PC uh is in the prepare branch register. So that's the basic kind of concept there. When entering a function, all lanes uh kind of halt just to make sure that uh the processor is in a predictable state. Um they all lose their uh things except for lane zero and lane zero begins execution right away. So the basic idea is then lane zero will be in charge of starting up new instruction streams to regain that parallelism that we want. One advantage of this scheme is that since entering a function uh halts all lanes uh the processor if you have some let's say a four- wide processor and you hand it two wide code or a code that's designed for a two-lane processor um those two lanes that are never used simply never spin up. And so the code actually natively is supported by um for for code for smaller processors is natively supported by ones for larger processors. We just don't utilize the full hardware.

[05:36:22]So what ends up happening is we end up with a processor that looks like this.

[05:36:26]Should look very similar to the VIW processor from before. The only difference is we've broken up the instruction fetch stages so that each individual lane has its own instruction fetch and its own instruction cache. So each cache is smaller but this allows us to essentially uh allow certain lanes to shut off when they are not fetching and processing instructions. So it's a big idea. One thing you can see is after for instance list load word we added the SR mechanic or pneumonic to turn off this lane and you can see there is no instructions that are fetched during that time. the loadword.sr and this load word would be next to each other in the code because there's no need to have no ops there. So that allows us to actually compress the code while we're at it. And you can see that the lanes resume execution when we encounter a load word.sr in lane zero which just spins up the other lanes again. So that's the entire premise of this architecture. the front end. Uh these look really scary, but they're just standard pipeline processor frontends. The only difference is is in lane zero, whenever there's a branch that's taken, it'll send a message to the lanes that are not lane zero saying, "Hey, resume execution." So that's what's going on there. Or sorry, uh update your value to uh your PD register. Sorry, the resume execution is the other change which is right here.

[05:37:50]Whenever we enter the instruction cache, we peel off the most significant bit and just say, "Hey, if it's one, tell everybody to suspend or to to resume execution if you're suspended." Over here on the hardware for the other lanes, it looks very similar. There's a couple small changes. The first change is that I don't have to have a full BTB.

[05:38:12]I just have a register with a PC in it.

[05:38:13]That's a PB register. And I have a two-bit register called the lane status register that suppresses PC updates and instruction cache accesses whenever a um the lane is either suspended or halted.

[05:38:26]So that's the main difference there. And you can see I peel off that SR bit to suspend if I ever fetch an instruction from my instruction cache that has an SR bit. Those are the main differences.

[05:38:38]There's a couple other small wires there. They are described in the paper.

[05:38:44]Um, so one thing we had to do when we were, you know, we made this and we did all this really cool stuff was, um, figure out how to compare apples to oranges, VWs and, uh, VW processors as well as the, um, SLA processor, they actually, because we use a bespoke ISA, all of the ISA, the ISA is actually the same between the two processors, but they still execute a different number of operations. VIWs have extra noops to maintain alignment as well as SLA processors have some administrative uh uh functions or uh instructions. So those administrative instructions unfortunately um we don't execute those in the viw. So we have these apples to oranges comparison. So instead what we've done is essentially fi pinpoint the instructions that are unique namely the no ops and these administrative instructions and we just omit them from the results because what we find is all of the other instructions are identical right so what ends up happening is in the same number if you execute you know n functions they will execute the same number of operations and the same kinds of operations aside from the differences in these noops and these sort of what we all meta instructions. So whenever we are looking at statistics, what we're going to do is only count non-meta instructions for performance because the meta instructions which we are these noops and these non um these administrative functions don't really contribute to the core uh semantics of the program. They're just there for correctness.

[05:40:21]So um these meaningful instructions are part of the core control flow. They are shared between both processors and in any window they execute of functions or branches they execute the same amount.

[05:40:32]We verified that. Um I won't jump too much into that for time reasons but they're there. So you know I didn't make up a fake processor. Uh so we use spec benchmarks. If you're wondering why we use such old benchmarks, it's because it was very hard getting this thing running in the first place and making a new ISA and a new compiler and everything else.

[05:40:51]uh we warmed it up for two billion non-meta instructions and then ran for 10 billion non-meta instructions these mini flow instructions and we tried two SLA processors one where we give the lane zero uh which is always on a larger instruction cache and one where we kind of just equally distributed the instruction cache for the um processor.

[05:41:12]Um so uh we we've evenly distributed it with the 888 and then lane zero gets extra space with the 16844.

[05:41:21]Um we use a in-house uh assembly optimizer called as opt as well as fast ADL for our psych cyclic accurate simulator and cacti 7 for power calculations and we looked at both four wide and eight wide configurations but we'll only focus on the one. So meaningful IPC short version they're about the same. So that's really what it comes down to. Uh we do it slightly better than VIW and that comes from the fact that uh while we have extra lows and stores from lane state uh management, uh our instruction cache misses are slightly lower per lane. Now if you were to sum these numbers up, they'll actually be very close to that number. But one advantage is is after a jump, multiple lanes can miss at the same time. So we can interle requests to level two cache, reducing effective miss penalty. So what ends up happening is that we end up actually having very similar performance. The power the instruction fetch energy is actually reduced and this is primarily the instruction cache and we find uh because lanes shut off. If a lane shuts off or is suspended, we suppress the request to the instruction cache. Those instruction caches don't run. If all lanes are active, the SLA processor uses slightly more power. But as soon as a lane suspends at all, we start to gain significant power uh significant performance when it comes to power. Um total instructions encoded as well because we don't have to encode no ops.

[05:42:42]Uh what we find is that the SLA processor again significantly uh reduces code size. We have tried four and 8 wide. Um so the 8wide VIW of course is significantly larger. The SLA processors however use less instructions to encode the same amount of information since they don't have to encode no ops. So kind of in conclusion just so that I'm making I don't know how much time I have. Uh but trying to make sure I'm quick because I went over time over time but uh short version we made this proof of concept processor was meant to be not necessarily a full working processor.

[05:43:19]the VLAW and the SLA processor were pretty bare bones, but it's a nice proof of concept to show uh this new uh code generation corresponding hardware is something that's actually practical and doable. Uh we showed the SLA at least out of the box has similar performance to a VIW processor while really providing these significant reductions in in fetch energy and code size. Um and we also delve a little bit into the scalability in the paper. Just can't do it really quick here. And really there's two major uh things we want to work on in the future. First is uh meta instructions do have a non a non-negligible amount of overhead especially in function calls because you have to keep track of more things. We want to aim to reduce that overhead as we move forward in designing sort of this SLA processor. The second thing is we actually want to target more state-of-the-art systems using sort of this in quotes multi-threaded architecture because what we would like to do is really um uh in improve and work on this sort of idea of having these sort of parallel you know pseudo threads.

[05:44:18]Um so I I want to thank you all. I appreciate it. And do you have any questions?

[05:44:27]>> All right. I got a million. Great.

[05:44:30]>> Yeah. Thanks. Maybe a very quick one.

[05:44:32]Yes. So you said that lane cedo would activate all of them at the same time but that may not I mean you may have some >> right you might have some propagation delay for instance. Right.

[05:44:41]>> No but what I'm saying is that maybe in the schedule you just want to activate lane three.

[05:44:46]>> Yes. So this is actually where sometimes we have to deal with a noop. We just have a noop.sr to put one lane that activates for one cycle back to sleep.

[05:44:54]>> Um yeah it's it's one of the only times where we kind of still have to use noopops. A very quick clarification for the evaluation.

[05:45:02]>> Um, so you said that you only measured like the effective payload instructions.

[05:45:07]>> Yes.

[05:45:08]>> Would that like the results that you showed would they also if I if I now measure the end to end say the latency of executing the entire thing >> will the results be the same or will this extra meta >> instruction the meta instructions yeah we we based it off so so the yes they they will map one to one. So if you have in any window um of time, the meaningful instructions are the instructions that are consistent for the both both processors. And we we ended up we we I did a lot of work in actually testing that out to make sure that that measurement was correct. Yes.

[05:45:39]>> Okay. Yeah, I'll trust you.

[05:45:42]>> If you're curious, I can I can talk about it offline, too.

[05:45:45]>> Cool. Uh yeah, very very interesting uh idea. I guess the first thing that came to my mind is like um VIW with stop bits, right? something like itanium where you can compress noops and I'd be curious so specifically talking about iicash fetch is that one thing about having a contiguous instruction stream of bundles is that you know your memory your memory um width response is going to be like 512 bits or something and if you have a bunch of contiguous bundles you're going to have a lot easier time filling that compared to if you have to find you know if you have four lanes you need to find four 512bit you know like um unless you're doing like fetch fusion um without control control flow divergence. Does this make sense where it might be easier to to have better eyeash behavior if you have continuous instruction stream rather than four independent ones?

[05:46:33]>> Yeah. So I I I I completely understand where you're kind of getting at it is is they're all over the place. So uh one thing that's kind of useful is the because the instruction streams are they're contiguous in the lanes that they are in. So so lane zeros all of their operations are still contiguous in in memory. So what we found is at least wi within the simulations that we were testing it out um that there is there is what what we end up doing when we lay it out in memory is having lane you know zero's code followed by lane one's code followed by lane so oftent times especially in smaller functions you end up getting a lot of the same performance. it it ends up not being a significant impact especially because lane zero is always running and it ends up getting those benefits so that when you fetch something it takes longer for one of those lanes to actually end up missing if you are not having a branching control flow.

[05:47:24]>> That was at least kind of the I I think I might have answered the question.

[05:47:28]>> So if I could just just follow up a tiny bit. I think that makes sense. My question would be for when you do have a lot of control flow. Yeah. and you're trying to you're trying to at least, you know, be able to access everything in your lane before you you you branch off somewhere else. Having a contiguous instruction stream across all the lanes allows you to fill that better, right?

[05:47:45]>> Yeah, that's a good point. Um, that might take a little bit. If you would like, I'd love to chat with you after.

[05:47:50]That might be that's a longer uh but but very interesting thought. I think this is a cool idea.

[05:47:54]>> Yeah, >> we can take more question.

[05:47:57]>> Yes.

[05:47:58]>> One more one last.

[05:47:59]>> Yes. Um very briefly uh I'm not entirely clear how you're handling control flow.

[05:48:06]So you're saying only lane zero can handle a control flow. No problem. In VW often only obzero can handle control flow. But in a VW I only have one program counter to set. Here I have four or eight to set. How does it work?

[05:48:20]>> Okay. So short version um I'll come back really quick to uh let's go to the hardware and corresponding software. So the short version is is lane zero is the only lane that has access to the BTB.

[05:48:31]All right. So lane zero here uh if I access the BTB and let's say I I assume or I predict taken branch a taken branch there is we're going to broadcast that information to each lane and lanes say this is lane one. All that's going to happen is there's going to be incoming signal from lane zero here. All right.

[05:48:51]And what's what might happen is if I predict taken in lane zero, all other lanes have a instruction called the prepare bench instruction that before any branch is taken, they'll all get one. And if a branch is taken that says, "Okay, now copy the P the PB register into the PC and just start executing there naively. You don't need to know why. You just need to be told to do it."

[05:49:15]>> Okay, got it.

[05:49:15]>> Does that help a little bit?

[05:49:16]>> Yes, of course it does. Thank you.

[05:49:18]>> Glad to hear.

[05:49:19]>> Let's thank our speaker once again.

[05:49:21]Thank you.

[05:49:24]>> All right. Our next speaker is Louis from Mechill University. Uh he's going to talk about his research papers >> sir a small IR for HLS with parallel patterns.

[05:49:48]You're ready.

[05:50:15]>> Is it okay here?

[05:50:44]Hopefully it'll be okay.

[05:50:48]All right. Uh good afternoon everyone.

[05:50:50]Thanks for the introduction. Uh yeah, I'll be presenting my work on designing a small intermediate representation for highle synthesis uh with functional parallel patterns.

[05:51:00]All right. So uh since I'll be uh describing a new language, I guess I'd better start by um clarifying what sorts of programs am I looking to express with this language. Uh specifically, I'm looking to describe streaming accelerators that run on an FPJ.

[05:51:14]Um now FPJs are interesting because they're a nice middle ground between um application specific integrated circuits and general purpose processors like CPUs and GPUs. Um on one hand you can you have the flexibility to really customize your hardware for your specific app which helps with you know power efficiency and uh performance and so on.

[05:51:32]But on the other hand you don't incur the cost of manufacturing your own custom chip. You just reprogram the device.

[05:51:39]Um now let's say that somewhere on this FPJ you have a component that performs a dot product. Um and for illustration I'll just use uh you know these simple inputs 1 2 3 4 and 5 6 7 8. Um now in a streaming accelerator these inputs are provided piece by piece rather than all at once. Um so for example maybe at the first clock cycle we provide the inputs 1 and five and then at the next clock cycle 2 and six and then 3 7 and then 48 and then later we should get the answer which is 70 in this case.

[05:52:13]So what options do we have for u designing these sorts of streaming accelerators? Well, traditionally you would use a hardware description language like VHSL or Vera log um which gives you a lot of control over the result but it's very low level and verbose right it's kind of like writing your software in assembly like maybe sometimes we need that control but hopefully there's a higher level option right um and indeed there's been a lot of work on uh high level synthesis especially on converting languages like C++ to VHDL and verilog um this is and and this sort imperative um programming model that you have in C++ with you know mutable variables loops dynamic memory allocation uh that makes a lot of sense on a CPU but not as much sense if you're trying to design a circuit right so what I'd like to focus on uh for this talk is another approach in which we use uh parallel patterns from the functional programming paradigm like map and reduce and zip and so on um in order to build up these streaming accelerators All right. So, uh let me explain this uh functional style.product program uh in detail since I'll be referring to it throughout the presentation. Um we have two inputs here, U and V. And these are both streams of 16 bit unsigned integers. Uh which means that for each of these inputs, we'll be receiving one 16- bit unsigned integer per clock cycle. Right? Um, and they're both they both have a length of four. So I can reuse the same, you know, example inputs 1, two, four, and five, six, seven, eight. So probably the first thing we want to do is we want to combine these two streams of numbers into a single stream with pairs of numbers, which we can do with this parallel pattern zip.

[05:53:58]Um, so in this example, you know, one will be paired up with five and two with six and so on. Next, we want to multiply each of these pairs of numbers, which we can do with this parallel pattern stream map. uh stream map just applies a function in this case multiplication to every element of the input stream. So we should get as output from map 1 * 5 2 * 6 3 * 7 and 4 * 8.

[05:54:22]And then finally we want to add up all these numbers which we can do with the parallel pattern stream reduce. Um reduce just combines a whole sequence of numbers into um a single number with some sort of aggregation function in this case addition. So as before the result should be 70.

[05:54:40]Um and there's a very natural correspondence between this you know sequence of parallel patterns and a three-stage pipeline where the first pipeline just sort of zips the two streams. The next stage does the multiplication and the last stage uh you know performs this accumulation with an adder. Right?

[05:54:59]But now you might be wondering hey Louie why why should we have I jumped the gun.

[05:55:03]Uh or no it's fine. Um, okay. No, it's fine. Sorry. Uh, yeah.

[05:55:13]So, you might be wondering, hey, why do we have this register in the first pipeline? It's not in the first stage.

[05:55:17]It's not actually doing any work, right?

[05:55:20]Um, and indeed in prior works like sure for example, you can optimize these programs built out of these parallel patterns using uh simple rewrite rules.

[05:55:30]Uh so for example let's say our language has another parallel pattern called stream map 2 that's basically equivalent to zip followed by map but it does it in a single step. Well then in our compiler whenever we see zip followed by map we can rewrite it or in other words just replace it with a call to stream map 2.

[05:55:47]Um and that will have the effect of um combining our first two pipeline stages which eliminates the register. Right?

[05:55:55]Cool. So with this uh functional parallel pattern approach uh you know there's a nice correspondence between the source code and the hardware um and the compiler can kind of optimize the programs using these simple rewrite rules. So then what's the drawback other than I guess people being unfamiliar maybe with these parallel patterns.

[05:56:14]Well, one drawback is that if you want your language to be useful, you need a lot of these parallel patterns, right?

[05:56:20]Um, this is kind of a hassle. It's particularly a hassle to add new parallel patterns to your language because you need to um, you know, implement this operator in VHDL or verog or whatever your target language is, which is kind of a pain. You also need to explain how does this new parallel pattern interact with all the previous ones, right? How do I fuse this new one with map? How do I fuse it with zip? how do I fuse with reduce and so on. Um, and also if you think back to the previous example where we, you know, fuse zip with map to map 2, uh, if you don't have map 2 in your language, you're out of luck. You just can't implement that rewrite rule.

[05:56:58]Right?

[05:57:00]So, what I'm going to show in this presentation is that you can express um all the operators that I listed on the previous slide with just three stream related primitives. Um and also that this is a useful thing to do. Uh besides simplifying hardware generation and optimization, you actually get more resource efficient results at the end of the day.

[05:57:24]All right. How are we going to do this?

[05:57:26]Um well, we can take inspiration from prior works on array programming uh which used or which expressed array computations with just a very small IR.

[05:57:36]uh specifically there's one constructor for arrays. You can build an array of length N where the element at index I is given by some arbitrary function F.

[05:57:46]And there's also one extractor for arrays. You can ask for the element at the index I in some existing array, right? Um and using these two building blocks, you can express all kinds of higher level um operations on arrays. So for example um the E element in the output of map is just F applied to the E element in the input array V and likewise the E element in the output of zip is just the E element of V1 paired with the E element of V2 right so now I'd like to use the same idea but for uh describing streaming accelerators right so in prior works you know you've got this input language with you know stream zip stream maps stream reduce the compiler applies rewrite rules to try to optimize this and then prior works basically just translate this directly to a hardware description language um what I'm suggesting is that instead we should add a new um intermediate representation a small intermediate representation which I'm calling co uh to this compilation pipeline so we first lower the higher level language down to this intermediate representation and then we apply optimization and hardware generation on the smaller IR which is easier um and in this IR are there is um like with the array programming uh works there's one constructor for streams called s build there's one extractor for streams called s data which just gives you like the next element in the stream uh and there's also a mechanism for sharing a stream between multiple consumers all right now one more thing before I get into some examples of using this IR I'd better clarify um what kind of hardware uh we're trying to generate so specifically We're looking to generate a pipe or yeah a pipeline in which stages communicate with this handshake protocol. So every stage sends n bits of data to its consumer along with the signal saying whether the data is currently valid.

[05:59:45]Conversely, the consumer sends back a signal saying whether it's ready to receive that data. Um if the consumer is not ready to receive the data yet, the producer had better not move on to the next element in the stream yet.

[05:59:57]Um and within each of these stages you might want to have some internal state like you know the the accumulator inside of stream reduce right the running total.

[06:00:07]All right let me get into some examples.

[06:00:09]So probably u one of the simplest programs you can write in this IR is just building a stream of length four in which the output is always 42 and it's always valid right. Um this expression evaluates to the stream 42 42 42 42. Um and the hardware for it is very straightforward. you're just constantly sending the value 42 into the pipeline register.

[06:00:33]Yeah. What if you want to have some uh internal state in in your um in your stage? Well, we can build a stream of length four where the output is uh instead of 42, it's i.

[06:00:44]And here I is an accumulator. It's a twobit unsigned integer where the initial value is zero. And the value at the next time step will be the current value of i plus one. Right? Um in this case this expression will evaluate to uh zero which is the initial value one which is the next value and then two and then three right um and in hardware you'll have still the pipeline register at the output of the stage you'll have register for the accumulator I have an adder that increments the accumulator all the time right hopefully it's okay so far um now let's say you want to have a stage with uh some inputs that reads data from another stage right so like stream map.

[06:01:26]Um well in that case you can declare a producer called P. Uh continuing from the dot product example we'll say that P is a stream of pairs of 16- bit unsigned integers.

[06:01:38]Uh the stream itself can be some like any stream expression. It can be an S-build expression. It can be a variable whatever. Um and in this case for stream map we're always ready to receive the next element.

[06:01:51]the output of stream map then um again in the dot product example where we're doing multiplication um you'll compute that by reading the data from the stream from the producer and then multiplying element zero by element one right and the output is always valid um and in hardware what's going to happen is that uh the data from the producer stream is just going to be fed right into the multiplier and the output of the multiplier will be sent to the pipeline register right um stream zip is very similar in this case, you'll have two producers. Um, again, we're always ready for both of them. And the output of zip is just the data from the first producer paired with the data from the second producer.

[06:02:31]Right? Now, you might be wondering, hey Louie, uh, why is the output always valid and why are we always ready? Like, if one of the producers has validated but the other one doesn't, don't we have to wait? Um, great question. Um so if you're using VHDL then indeed you would have to take care to you know carefully follow the handshake protocol so as to not you know drop elements from one of the producers. Um but in the IR you can actually kind of pretend that as soon as I as soon as the ready expression value is to true the data will arrive immediately and then the SQL compiler will take care to insert some extra logic in the hardware to make sure that um you know a logical time step doesn't happen until the data is actually valid right which is I think quite quite a useful abstraction all right one more example to put everything together uh stream reduce so as before we have one producer or p always ready for it. We have an accumulator for the sum. Uh it starts at zero and the value at the next time step is the current value of the sum plus the data from the stream. Right? That's also the output of reduce. Um but when is when does reduce have valid output? Only once we've finished reading the stream, right? So we can add another accumulator I um which starts at zero and it counts up at every step. And the output of stream reduce is only valid once I is about to hit four which is the length of our stream right um in hardware again there will be you know the pipeline register there's going to be a register for the sum um and there's going to be an adder to to update the sum um and here I'm just omitting the register for i along with all the extra handshake logic that gets added for the the valid and ready signals that I mentioned on the previous slide um just for completeness here's the full uh dotproduct example so we've already seen uh two slides ago, how to implement zip.

[06:04:24]There it is again. This is used as the input for map which we saw three slides ago and this is used as the input for reduce which we saw on the previous slide. So nothing new here just kind of putting everything together.

[06:04:38]All right. So I've seen so we've seen how to uh implement these higher level you know parallel patterns like zip and map and reduce in this IR and you know hopefully you have some intuition about how it translates to hardware. Uh but now what about optimization? Can we still do things like uh fusion for example? Um and the answer is yes.

[06:04:56]Otherwise I I probably wouldn't have brought it up. Um so if I've explained myself clearly so far, hopefully you can imagine how instead of having these two producers pairing up their data and then multiplying them later, we can just multiply the data immediately, right? Um and that looks like this, right? So two producers multiply the data. There you go. Um, and exactly as before, this will combine our first two stages into a single stage um, and get rid of the first register, which maybe we don't want.

[06:05:31]All right. Um, now, is the hardware generated by SEO actually any good? Uh, well, to evaluate this, I compared the SEO compiler with three prior works.

[06:05:41]There's the Intel HLS compiler, which lets you write code in C++. Um and then there are two uh research compilers for u you know this functional hls. There's sure and there's etherling. On the x-axis you can see various benchmarks ranging from you know small examples like a dot product up to larger examples like matrix matrix multiplication uh so filter image sharpening stuff like that.

[06:06:03]Um and on the y- axis you can see there are two measures of hardware quality.

[06:06:07]There's the number of adaptive logic modules that tells you you know how much area is your circuit taking up how many FPJ resources does it consume? And there's the latency, the number of clock cycles from the first input till the very last output.

[06:06:20]Uh if we look at the latency first, you can see it's pretty much the same for every tool. So in other words, SQL is not degrading the performance in any way.

[06:06:29]And if we look at the number of ALMs, here it is for Intel HLS. Um here it is for the functional HLS tools. So you can see, you know, sometimes they're doing better than Intel HLS, sometimes they're doing a little bit worse.

[06:06:41]And here it is for SQL. Um you can see that in general it's doing better than any of the prior tools. Um and in fact it's it's really not obvious because it's on a log scale. Um but on average it's actually using um a lot much fewer ALMs than the prior works like 76% fewer than Intel HLS. Um 60% fewer for the other tools which I think is kind of cool.

[06:07:04]All right. So long story short, by adding this, you know, extra IR su into our compilation um pipeline, you can still express all the same uh highle programs as you could before. It becomes easier to do hardware generation and optimization, right? Because you have a smaller language to deal with and you actually get better or specifically more resource efficient results at the end of the day. Uh if you're interested in hearing more, uh I encourage you to take a look at the paper. Uh you can try the compiler for yourself. It's it's all open source. Um or you can ask questions right now.

[06:07:47]The program chair taking priority.

[06:07:49]That's not good. Yeah. Thanks. Very nice talk. Um I was wondering how does it compare to stream HLS like that work from from Shiru Shang and and Jason Kong. So they've been also working sort of DSLs >> uh for streaming applications down to HLS tools.

[06:08:09]>> Yeah. Uh oh man, I read about this a while ago, but I forget what the >> okay >> what their approach was exactly. So I'm I'm not sure. Sorry.

[06:08:17]>> Yeah. So so I'm wondering maybe um so their approach is similar to what like like companies are doing maybe building infrastructure with ML, right? where where the product is sort of a name operation that you know what it does.

[06:08:32]>> Yeah.

[06:08:32]>> And so they have >> they build optimizations from those um ML named operations.

[06:08:39]>> Yeah.

[06:08:40]>> So it's it's so it's a it's a different approach because then you have so the IR is say richer because it does not encode a a schedule like I think your IR scales encode a little bit of a of a way of putting things together. Right.

[06:08:56]>> Yeah. basically so I guess in in my approach you can kind of simulate um the IR to get yeah basically a cycle accurate >> exactly >> idea of what's going on >> whereas here is more say maybe declarative in a sense right because you have some name operations you have a convolution 2D with some stride and out of there you can generate streaming based accelerators so I was wondering how that compares but yeah maybe we can take that offline >> yeah sure I mean I I guess with this approach like again we're still starting with these sort of high level operations so you can um yeah add all these you know high things like zip and map and you could totally add a something like stream con 2D um which you know the programmer doesn't have to worry about how it works and then the compiler writer takes care of implementing that um which I guess eventually has to happen with stream hls as well right >> yeah exactly >> yeah thanks for the great talk uh this is from MIT I had a question about your results um did you measure the number of registers that you use up in your design or is alm I imagine is a measure of logic like lutz equivalent essentially Yeah, great question. So in so I'm doing all this on chorus and my understanding is that in chorus in ALM includes registers and logic >> and it doesn't break down for you the logic versus register.

[06:10:10]>> I can I I think it does show the number of registers. Um but I I didn't record that. Sorry.

[06:10:18]>> I see. Because traditionally tools like Etherling generate a lot more registers.

[06:10:22]So that's the that's the actual trade-off they end up having. they have more registers and therefore um less logic often times. Uh for your fmax, did you pin it to 175 for all the designs or did you do a search for the fmax on all of the designs based on the slack you had left?

[06:10:39]>> Um so I just used 175 in general because that's what Etherling used for their evaluation. um in general a lot of so I I didn't really search very carefully but from what I've seen like Cortis' estimates um a lot of the designs hit you know at least the you know 200 plus um you know yeah so it's yeah the fmax could could definitely be better um you know by inserting more stages and so on um but that I didn't do a lot of work on that >> understood and the latency is is wall clock time with respect to 175 MHz and the number of cycles.

[06:11:15]>> Yeah, sorry. The latency is in the latency is in clock cycles.

[06:11:20]>> I see. Okay. Thank you.

[06:11:22]>> Yeah. Sorry, I wasn't maybe it wasn't clear.

[06:11:26]>> Any other questions?

[06:11:29]Maybe I can have one more question. Very last question. Um I was very surprised because you said every operator could be represented with this syrup. So what was the uh the most challenging operator? um you uh you you like to build with your um syrup.

[06:11:54]>> I guess one that's maybe a little bit tricky is uh I don't know if I have it here. Yeah, repeat. If you want to repeat a stream, you kind of want to um kind of read the whole stream into memory and then read it back out from memory. um which yeah I had to do a little bit of work there to get uh quarters to kind of recognize um the the VHL code that I generated as something that can be a memory block. Um but yeah I hopefully it's kind of intuitive why um you can express all these operators in SEO like you kind of have all the ingredients that you need right you can read from some other stream you have some you know registers basically they can all be updated in basically arbitrary ways um in fact fun fact this language is this IR is turn complete >> um it's kind of cool >> yeah thank you >> hopefully that answers your question >> yeah sure any more questions let's thank to our speaker once again Our third speaker is um Zoom Hung Jang from Miel University. Um this paper is uh one of the uh best paper candidates.

[06:13:07]So um it'll be interesting and Jun Hong is going to introduce his research paper a functional approach to synthesizing roundable programmable accelerators for neural networks and he is in the final year PhD student and he said he's in the job market. So if you are interesting please do not hesitate to reach out to him. Uh, whenever you're ready. Floor is yours.

[06:13:46]>> Sorry, the screen does not show up. Oh, okay. Yes. Okay. Uh, just this.

[06:13:59]Okay. Yeah. So, so thank you for the introduction. So, I'm Zonghan. So, so today I will talk about a different approach compared to the previous talk.

[06:14:08]So, we are focusing on about functional programming and high level synthesis and with routable and programmable hardware.

[06:14:19]So, our target is APJ devices. So, you may ask why APJ? So you can see we can we have several accelerated like CPU, GPU, APJ SIC but for SS and APJ is typically very efficient especially for the power consumption because they we can customize the hardware architecture inside that but another problem for SPJ is there are a bit they are difficult to program compared to CPU or GPU because especially for S is it's a fixed architecture and for APJ you need to write the HDL first to configure APGA.

[06:14:54]So that's the problem we want to handle now. So we because APJ is low power, low B and it's hard to program. So basically this is the problem we want to solve how how to program. So let's look at the example here. So we have the we have the element sum for two arrays and this is a Python code. So this is very simple, right? We just get each element from two array one by one and sum it together.

[06:15:19]But how do we do the same thing in HDL?

[06:15:22]So this is another example. So you can see it's it's verose. So we need to talk about okay we have a clock and we need to count the number with with a register and then we need to handle the handshaking protocol. So it's much more complex. So that's a reason HDL is more difficult to program and people want to use high level synthesis to generate the the code instead of writing the doable language. So the a very typical method is H acts especially for Cbased. So there are very several tools and Intel's tools but these kind of tools they they can provide very simple program like this. So you can see this a C++ example for vector addition for vector addition. So but the problem is if you fit this into layer compiler directly you will get how the performance is pretty bad. is because we need to optimize them manually. So what kind of optimization we will do is something like okay we need to insert a pragma to say okay we want pipeline here we want to unroll here but all these thing are related to hardware architecture so it's very low level and another problem is usually it's very difficult to guess what's happening at the hardware level because the tool it works like a black box so you have no control you have you cannot predict the performance very accurately and that's a one of the reason we want to go this functional approach. So this is the previous as mentioned in in another talk before. So we we have several kind of primitive like map reduce and basically is to apply a function for example max apply a function to a sequence of input and we have zip to group two array together one by one and at this level we don't have the information about har interation. So the later stage what we will do is we will for example for map we will lower it to a new IR system and then it has more information about how like okay data comes in a streaming way.

[06:17:24]So in the hardware we know exactly clock cycle z we process the first element clock cycle one we process the second element and it's it probably is not good enough if we want it to be finished very fast we probably can further paralyze it. So we have another version that map vector. So it's basically okay we create several instance of function and then process it all at the same time. So we know at time zo we just process all at at once.

[06:17:50]And this is the compiler step we have.

[06:17:53]So it's also for privacy work. So we have something for very algorithmic to express the algorithm only. So we don't have hardware details. So we basically use the same example for array s. So sorry for vector sum addition. So basically we group two arrays together one array together and then do elements addition one by one and then we will lower and optimize the code and then we get a more paralyzed version and with this we exactly know what what's happening at the hardware. So we know this clock cycle what's happen cycle a certain things happening. So that's easier for us to do co modeling or some kind of optimization but this is just the start of the story. So the real problem is if we have application that neuron was how do we fit into a device like a PJ. So this is an example for VG16 and we will start with something simple like okay we look at three layers of land and these three layers not exactly in the same shape and we now have a device FPG device and FP device usually it has a limited amount of resources and multipliers the most expensive one. So as soon we have this amount of multipliers and want to feed this layers into the device then what will happen? So the most naive way is okay we probably reserve a certain multipliers for one layers. So for this sample we reserve most of the resources for layer one but we will run off resource for layer two because we just exclude this multipliers for a certain layer. So that's not what we want. And the more ideal approach is okay right now we we we can see this all these layers are less and they use the same amount of multiplication. So why don't we just share them across all the layers together to actually we can squeeze the these three layers into a single device.

[06:19:39]So this is what's going on in the prior world. So the prior world they are using that binding to to share a function and then use function code to handle handle each functions asset. So this is the example. So we have a function is declare at the at the beginning. So it's basically turn a string with an element is input and then output is a string of an an element as well. So we have a sure function f and then usually the data comes from memory interface. So what we will do is and we want to call this function several times. So what will happens? So the first thing is we will have a address we will we will we will get the address for the read data and the writing target for another address.

[06:20:23]So T1 is the our writing target. So basically what we are doing is we will combine map map stream with read to read the data one by one and then we will combine reduce and write to write the data one by one back to the memory and then in the middle we have the function code and it just grab the data from one source and then process it and then store the data to the destination. So this is the first function code and if we want to call it multiple times it will be something like this. So we have we want to use the data from T1 and the partial partial partial ratio from T1 is safe to target the destination T t2. So you can see we read T1 again and then do another function call and save to T2 and then we will do the same thing for the S function call and save the final result to out. So this looks simple but what's happening in the hardware is it will be more complex and it's the problem from previous work. So we will talk about what will actually happen on in the chip. So we have a share function and memory interface and then we have a data initial data in the memory. So for the first function code what is happening in the tree is we will allocate a space or the resources and this resources including this pass this blue line and then the block that function call one and then the red reduce and map and read this all these thing will occupy a space on the chip. So you can see there there's a problems because if we have multiple function codes so it will occupy another space and a set of function code will occupy another space.

[06:21:54]So you can see it's actually scale with the number of function codes we have and if we have something that okay rest net 50 it has 50 layers. So it means we have 50 function codes. So we definitely cannot fit into device then how to solve this problem. So we definitely cannot go with the model with function code style. So right now we switch to an primitive actually just seen. So it's reduce. So because reduce usually people use that to do array sound to sound to accumulate even one by one like this one. So basically we take aggregation function and it as as one of input another input is the streaming data. So right now what we want to do here is we we we still target at the same function and then we want to make this reduce to iterate over different different things. So it's actually the memories me the memory in the the memory address we have. So like this like this we right now we have we want to we we read the memory base memory from the from the memory space and then it actually has two part one is the read address one is the right address. So we read this and it will become a string of three elements and each element contains read and write at the same time and we will iterate over this in to model the function code. So we will see what's happening later. So right now we have a reduce and it actually set this instruction as the input and it will iterate over that and the the things here is we will need to feel out what's inside. So because basically we have a function and then this function will asset the input read from the memory and write and it will has a write primitive to write the data back and then the these two scene read and write you will asset the base address from the memory. So it's it it's similar to the instruction in CPU but for in each instruction it will take actually several thousand cycles to process because you know the target is convolution fully connected or something similar.

[06:23:59]So based with this we can you can see the ent the architecture inside the chip is much much concise compared to previous one. So we can actually make things more compact.

[06:24:11]But another problem is right now we only talk about okay we these three layers are the same but what about we want to we want to handle two layers they have different shapes. So if we look at the IR or the primitives we have you can see if a fun if a function take a string as the input it has size n it means it must all all the time it must take size n for example 100 size 100 it must take a string of 100 all the time it cannot take a string of 10 a string of 20 and the output is a string of 100 as well so it's not flexible so as so so it also happens to the the primitive map string so it must take a string with a fixed number it cannot be changed. So right now if we have two inputs that the string with size 28 and the string with size 14. So it means we must to determine the size right now. If we choose the large one it means we need to somehow reshape the small one like padding we make it larger but it means we waste the computation because because this padding 14 values are actually doing nothing and if we choose the small one 14 as the final size it means we need to somehow tile the larger one. But all both of these are not efficient at the hardware because it means you need to create some extra components to handle the shape.

[06:25:27]So what we want to do is to introduce a new type. So right now we only have a stream main type. It must handle a fixed size string. And then right now we have something new is called upper bounding string. So it means we just give you upper bound but the real string size will be determined at wrong time. So it means if we determine a up and down syn with 100 elements it means when you really run a program it can be one something between one to 100. So and this is something we want to introduce to handle the issue of different shapes.

[06:26:01]So let's go back to the the simple example. So right now we just make it a little bit harder. So right now you can see the function it t as the a string with size uh is as input and output. The opposite shape is string with size n minus one. So right now you can see we we start to make the shape a little bit different. So what we will do first is okay right now because we have this new type. So we can actually make instruction more flexible. So you so first we put we allocate space to to tell us okay how many instruction we want to have. So in this sample we put three but actually you can put one. So you will just do one one one commands at a time. And then what we want to do is okay we we have a reduce and you can see the reduce is and the map there are new version there are reduce upper boundary string and map upper bounding string and what what we have right now in the memories is okay we still have our base address but right now we will have the new thing because we this lens we still have not determined length.

[06:27:08]So at the next step we will fill it the function we will call a function and then feed the data from read and write the data with write but both of them it taps the base address but the problem size is changed so we also have something new is called u counter so it's a upper part string counter and it will asset a number from run time to and this is actually the final lens it will produce yeah so you it means okay sometimes we can produce a minor size and sometimes we can produce n minus one and then you can see here in the instruction encoding we need to include this information as well. But this this help us to handle okay right now we have several convolution layers the shapes they are actually different and then we we fix another problem for the shape and now we still have a problem because you know new level have several type of layers. So in this simple example we we we still have one more layer is fully connected layer because the handle convolution layer handle fully connected layer might be different the optimization target will be slightly might not be the same. So how do we handle this? So let's look at the so this is something from the previous slide. So we we have this architecture we have one single function there but right now we want to introduce a new function. So we have function one and function two. So how do we handle this?

[06:28:37]So it's and we use a new primitive call switch apply. It's similar to switch but it it has a pool of functions and then a input with certain type and then we have a selection signal to say okay which function we want to go with and with this we can introduce a new function. So we have function two and function one and then just just rep them out with switch apply. So solution apply now text two function f_sub1 f_sub_2 and then we have selection signal. So you can see here in the extraction we have a up code. So up code is basically okay determine which branch you want to go with. So with this we can make the accelerator to handle two two or two more than two problems at a time and this is basically our approach. So we will talk about we will show you a example about how is this expression looks like in in our real test case. So we want to show you the the net five because it's simple it's not too long.

[06:29:37]So basically we want to write the data back to somewhere else in the memor. So we have this right address and for the neural network usually we have several layers and we want to fuse that to get a live convolution layer is followed by relu quantization pooling. So the last day we for example we have a pooling there and you can see we have a switch apply here to to say okay we want to go with identity do nothing or just do ping and you can expect the shape are different but since for our upper boundary stream type then we can actually merge them together without problem and then we have bias and quantization. So basically it means we need to read something from for bias address and then control the quantization big width and and then we have the main computation. So it's basically okay we have because you know convolution and the fully connected layer they can be decomposed into a d product. So we have a product unit and then partial sum if and then to do partial sum if the layer is too large we need to somehow tile that and then we have the component to read the address for to read the data for image and wave for convolution and the same for for the fully connected there and then you can see we we somehow twist the the read address generator for re for convolution and fully connected layer is because they their shape are quite different. So we need to do something more. Okay.

[06:31:04]And then we evaluate our approach on area 10 FPGA. So we first show you the the previous world. So in previous world what they are doing is they use the that best sharing and use function code. So what they are trying to do is okay they have a fully they have the convolutional BG16 and then they synthesize it with a certain amount of DSP. DSP is a number of par parallel multiplication we we can allow and gaps is the performance number and P routing is the maximum routing conestion inside a chip. So you can see if they don't use the they they do not use too much DSP it's it's routable it can be synthesized but if they use too many like more than a thousand the rout the P routing congestion is over 100 so it's not synthesizable but with our approach we can actually synthesize that with the same amount of DSP of course we sacrifice some because you know we need to read the read the instruction first we have sacrifice something to and the performance to be tall.

[06:32:06]>> Okay. And for >> and for >> Okay, I will go. And for tiny yolo and another example is tiny yolo. For tiny we actually get a better performance because tiny yolo we have tiny has a larger input size. It means the each layer might have quite different sizes and because we introduce upper body string it means we can handle different shape more more properly.

[06:32:30]And the next next one is reset 50. So this is a large example we have and we compare to also compared to against a prior we just want to show okay we can perform better than somebody also use area 10 and then and then you can see you can see if if even we use different number of DSP the maximum number of routing is not changed too much so it means our approach is is kind of scalable and the next thing we talk in in our title is programmability because we mentioned that so Okay. So, sorry. Okay. Yeah. So, what we have is we have a accelerator and it can support two different they two different type of neon networks at a time like VG16 and VG 11. So, I will jump this and then and then yeah for conclusion. So conclusion we have reduced upper body string and solution apply to push things together to make things programmable and routable. And then this architecture is not new but our contribution is to to express this behavior with functional and that's all. Thank you.

[06:33:38]>> Yeah.

[06:33:42]>> Okay. And it's even a comment I have to make not a question. So um you have put uh put in a lot of interesting work on handling with these handling these streams. Um but you have been mapping them to FPGAAS.

[06:33:55]>> Yeah.

[06:33:56]>> I wonder uh why have you not used devices that support these streams natively like uh Xylinkes or AMD these days uh Versal FPGAAS which have this array of uh um 400 um VIW AI engines uh with hardwired streams between them. So I I think this would be a brilliant programming approach for such a system and it would be far more efficient than targeting FPGAAS. You can run these AIES at over one GHz.

[06:34:28]Yeah. So this is also a problem we facing before is because we need to compare against different different system. for example the the system you mentioned is I I think it's it's suitable for this problem and and then yeah because because we start our work with APJ so let so we work with we we are also considering to extend them but it it needs time so right now we are thinking about okay we we have our Intel device we are switching to that device but we are still working on something that because their interface are different so we need some other works to do that so that's a reason we don't talk about this here yeah but that's a very interesting point Our last speaker is uh uh he couldn't make it attending this conference in person.

[06:35:18]So he's going to deliver a talk on online from my understanding. So it is ready.

[06:35:24]>> I know >> the author is going to take the questions after the talk. The recorded video is going to be played on this screen >> online. Yeah, he is in the online. Yes.

[06:35:37]>> Uh, okay. Uh, good afternoon everyone.

[06:35:41]My name is Xiao Ya from the Institute of Automation, uh, Chinese Academy of Sciences. Today I am pleased to present my work loop hint a compiler assisted loop branch predictor for embedded DSPs.

[06:35:57]Uh digital signal processors uh known as their names are dedicated to proc as digital signals. The processors are application specific architecture. We study the uh stereotypical workload of digital signal processors and found that uh the application is are highly loop intensive. Uh our analysis shows that loop uh branch accounts for roughly 90% of uh misprediction in DSP kernels and more than 60% uh in EMB bench IoT applications.

[06:36:37]Um digital signal processors are utilized in real time cases such as ind uh industrialized uh automation and so on restricted by worst case execution time which means deterministic and agile is crucial for designing the system. So frequent prep line flushes due to the branch m prediction is not actually acceptable.

[06:37:05]Um high performance uh processors leverage sophisticated uh dynamic predictors such as tag uh rely on as heavy history uh tables um and speculative updates aiming very small portion of irregular control flow incurring significant power and area uh overhead are often prohibitive for uh embedded targets. uh some are even equipped with uh a dedicated uh loop predictor specifically handle loops.

[06:37:42]However, this components uh require warm up iterations and coverage may fail to track long iteration counts or nested loop due to the table alias and uh capacity um constraints. On the other hand, uh to eliminate branch overhead entirely, many architectures uh implement zero overhead looping. It it is actually efficient uh but highly restrictive. It primarily supports simple counted loops and often um fails when encountering data dependent exits, early breaks, function calls or complex nesting uh due to the uh control flow changes and this motivates a new solution. Uh can we exploit compiler known loop semantics to provide deterministic prediction without runtime learning? Uh they actually identify for deterministic loop patterns uh that can be easy uh discovered by compiler. Uh the first type is the constant bounded loops uh is bounded by a constant number.

[06:39:04]And the type two is invariant variable bounded loops and it's bounded by a invariant variable. Uh and the third type is computed invariant bounded loops and the fourth is the memory invariant boundary loops. Um the four uh the figure provides concrete example of those four categories uh directed from expressive DSP library and AM uh EM bench IoT uh benchmark. Uh while certain benchmarks such as the uh convolution uh float 32 or the FIR float 32 um are dominated by specific loop types. Other exhibits a more heterogenous uh distribution create uh crucially all four categories are prevalent across evaluated uh workloads and more importantly.

[06:40:04]Uh they are statically discoverable for program analysis.

[06:40:10]Uh this uh observation serves as the fundamentality for loop hint. Um a mechanism designed to extract this predictable patterns at compile time and uh eliminate their associated misprediction penalty in the hardware.

[06:40:30]So uh this leads to our system uh loopint cross the boundary of software and hardware. It consists of four parts and uh they are compiler analysis uh loop and ISA encoding uh hardware predictor uh and the runtime update. So let's uh turn into the first uh compiler design. Uh we use LVM uh scalar evolution analysis to an uh identify the structured loop uh described above.

[06:41:06]While SCEV excels at linear transformation, it often struggles with uh long nonlinear update patterns such as bit shift. Uh so we uh to address this we implement a heristic pattern matcher that inspects the loop latch block for specific shift based ind uh induction update. Once the loop is validated as predictable uh its structure comp uh comprising the loop bound the step type and step value is encapsulated into a custom intrinsics and then inserted into the preheader. Uh the next step is to uh lower the the intrinsics into the real uh real instruction. Um uh one uh to uh to ensure that uh the loop hint metadata survives the back end uh optimization without interfering with register allocation. We def uh defer our final instruction generation to the machine SSA stage. The transformation follows a three-step pipeline. First uh is the instruction selection. uh the I intric are uh initially lowered to sui instructions. This sudo instructions act as placeholder that uh carry loop metadata through the early backend passes and the second step is the um machine SSI uh SSA expansion and branches association. Uh the sud instruction is expanded into actual uh architectural loop hint instruction which are branch style and can reuse subsequent branch lowering pipeline. These instruction are physically p placed into the uh loop pre header a dedicated path associate the branch like loop hint instruction with the corresponding uh branch by traversing the control flow graph. The transformation uh is critical for calculating relative um branch offsets.

[06:43:23]And the last step is the branch lowering. By reusing the standard branch lower pipeline, we ensure that loop hand instruction accurately encodes uh the target branch distance and extract uh extracted semantics allowing for similarly uh integration with existing hardware prediction logic.

[06:43:49]So uh the the second part is the ISA design. Uh we implemented loopint by extending our uh the custom uh risk five ISA. Uh we proposed two distinctive uh instruction for format. Uh the first one is the loop hintter uh branch. uh as is showing on the figure it's a is a standard B type risk the loop bound rag and the loop iter rag uh occupies the RS1 uh register and R RS uh two register encoding space to indicate the bound value and uh the initial value of the loop eater.

[06:44:40]uh and the 12 bit uh offset immediate uh encoding uh indicates uh its association with the corresponding branches and the second is the loop in the step uh we designed a uh 20 bit in inter inter immediate uh type uh uh step val and uh it's uh compressed version CIW type uh 12 bit step. Well, this uh this encodes the um this encodes the uh the con constant step value. So uh these are two decoupled the dry instruction coding. uh if uh we have a severe uh restriction on the code size, we can use to fuse a fuse single instruction encoding. Uh this is the trade-off between the code size and uh the the coverage flexibility. Um so the uh so the choice between this two kind of in coding in involves a trade-off between the code density and architecture flexibility. The uh fuse approach minimize the instruction count but impose the severe constraints on the branch offset and step value since they uh share the uh 12 bit immediate uh encoding. Um so we ina uh investigate uh the max offset and the max step val across benchmarks and found that the draw in instruction encoding is more robust to embench uh IoT benchmark and all can fit into a compressed version of encoding as a more aggressive alternative. Fusing single instruction can use the coverage to trade for the po perfect code size and this is uh uh determined by by the users.

[06:47:00]Uh now we turn into the hardware part.

[06:47:04]Uh a lightweight uh loop table stores loop bounds, iterators, tax and prediction state integrating with the existing branch uh prediction unit sharing the same data path and update logic to minimize hardware overhead.

[06:47:19]When a loop hint instruction is encountered, the loop predictor creates an entry in the loop table uh with the information encoded in the instruction.

[06:47:29]The loop predictor then use this information to predict the outcome of the associated loop br exit branch when it is fetched. As for update uh when the uh loop iterator is automatically updated according to the state uh step type and step value in each iteration.

[06:47:51]Uh another highlighted feature for this loop branch predictor is the nested loop and the early uh break uh uh solution.

[06:48:02]The loop predictor is in implemented in the stack based LIFO structure which naturally supports nested loop by maintaining multiple entries in the loop table each corresponding to a specific nesting level. uh and uh early breaks are resolved through two uh specific uh scheme to ensure efficiency.

[06:48:30]Uh the first one is the stack invalidation. If a early break targets an outers group, all stack entry between uh the current stack pointer and uh uh the the matched loop tag are marked invalid and uh all between uh the entries all between become st. And the second uh approach is the age based uh retirement. This is to handle the cases where loop become unreachable. Each entry incorporates and two bit agent counter. The counter decremented when the stack uh pointer resides at the entries level and the refreshed upon a valid target.

[06:49:17]So uh experime experiments are conducted on gene 5 minor CPU use expressive DSP uh library kernel and EM bench IoT uh benchmark switch and the configuration is uh on the right. Um so um loop hint uh as for the result uh loop hint reduces the misprediction by average uh 60 uh7% and often exists 90 uh 98% uh of reduction in misprediction uh in DSP kernels especially for float 32 and fir filter float 32. Um, several other branch p uh branch predictors are evaluated for comparison uh including a simple local predictor and a more uh sophisticated tag based or tag SCL uh configuration uh in the DSP benchmarks. Loop hint exp uh exhibits near perfect prediction accuracy.

[06:50:33]um constantly achieving near zero mpki values in kernels such as uh the SPS fi or core. Um in contrast uh even high-end predictors like uh tag um can can can reports higher MPKI.

[06:50:54]The data reveals that while tag use complex history, it rem it remains sub skeptical to uh aliasing and uh historyless limitation with handling long period deterministic iteration typically of this uh DSP workloads. Uh another uh critical observation in the study is the comparison between loop hint and uh varants. Uh while tagcl is uh specifically designed to uh handle the loop uh loop structure with its loop predictor. The results show that many high highly regular uh benchmarks uh loop hint is either uh close to the uh results of tag SCL or uh just outperform it but with a significantly lower hardware overhead.

[06:51:59]uh a potential observation regarding the result in the table is that the absolute reduction in MPKI for certain benchmark appears rather small. However, in the context of uh embedded DSP uh the relatively improvement and the shift towards dep determinism are far more critical than the average case performance improvements. Traditional trai uh predictors like tournaments or tag rely on probabilistic uh history which can leads to misprediction due to table uh aliasm or um uh or or or efficiency history lens. But uh loop hint by contrast uh converts this uh sto stoastic uh guesses into deterministic fetches by leveraging the explicit loop bonds for real time systems. The elimination of uh uh uh of even a single misprediction.

[06:53:05]>> Hello at a loopation chair. Please cut to the conclusion please. Um we are running out of time.

[06:53:11]>> Oh. Oh okay. Uh so uh uh in in conclusion loop bridges the loop hand bridges the gap between the dynamic prediction and the zol. Uh here are some highlighted uh features and thank you for your uh attention. Uh uh that that's all. Is there any questions?

[06:53:41]>> All right. Hey, this is chair. Um, thank you for delivering a talk. Uh, we are sorry not having you in person. So, we're going to we are going to uh take any questions from now on. Um, any questions from the audience?

[06:54:00]>> I maybe I can ask a simple question. So for the evaluation you said um the you you evaluated uh your scheme with the uh hardware uh branch predictors like tournament uh tage predictors and so >> and but how did you compile such a bench publications? Did you optimize uh the uh the program like have you compiled the program like with 03 optimization or O2 something like that or just compile as you?

[06:54:38]uh the the the uh binaries uh is the same but uh we uh we use uh the different uh uh branch predict uh dictor configuration in the gym five uh with the uh running with the same uh binary.

[06:54:56]So, uh it's it's not optimized for for uh uh a dedicated uh uh it uh is not is not dedicated uh optimized for loop hint.

[06:55:14]Okay.

[06:55:16]uh and and the hint hint uh the hint instruction in the gym five is uh nullify the uh is uh is act as no op uh when the other other uh the other uh branch predictor uh is configured.

[06:55:38]>> Okay. Yeah. Thank you. Any other questions from the audience? Otherwise we can wrap up this uh talk. Uh we can thank you again. Please give another hand please.

[06:55:52]All right.

[07:37:10]Okay. So, welcome to the very last session of the conference. There will be three talks in this uh session. The first talk will be uh given by Yun Jun Kim and it will be about mespec memory aware runtime for adaptive draft scheduling in speculative decoding and edge devices. Please.

[07:37:56]Oh, thank you.

[07:37:59]Hello everyone. I am Unjang Kim from the computer systems and AI lab at Kungpong National University. Today I'd like to present our paper manspec memory a well longtime for adaptive drop scheduling in speculative decoding on devices.

[07:38:18]Let me start with some backgrounds. Most modern larger language models rely on auto reggressive decoding where tokens are generated one at a time. Since each tokens depends on the previous one, generating n tokens require nse sequential target model execution.

[07:38:38]This nature limits parallelism and becomes a major bottleneck for errorm imperance.

[07:38:47]One of the most promising approaches to address this bottleneck is speculative decoding. The idea is that a lightweight draft model first propose multiple candidate tokens and target model verifies them in a single execution.

[07:39:06]As a result, multiple tokens can be generated per target model execution leading to significantly higher throughput.

[07:39:17]However, the effectiveness of speculative decoding heavily depends on the choice of draft model. Domain specialized drafts such as a model find tunes for code or mass uh performs well within their target domains but open degrades on out of domain inputs.

[07:39:39]To quantify the impact of draft selection, we evaluated performance on JSON or nano. First, we compared two approaches, general static and Oracle static.

[07:39:52]General static use the a single general purpose DS model for every prompts. In contrast, Oracle static evaluates all candidate drafts or prime and selects the best one for each prompts.

[07:40:12]Oracle static improves token acceptance by 40.3% on average. This shows that choosing the light draft model can significantly improve speculative decoding performance.

[07:40:28]More interestingly, the optimal drafts can change during generation.

[07:40:33]Oracle dynamic, which adapts drap selection during generation, improves token acceptance by an additional 25.7% over static on average.

[07:40:47]This suggests that static selection is not sufficient to adapt selection during generation.

[07:40:56]Existing approaches rely on line exploration such as merant bending methods since the best model is not known in advance.

[07:41:07]They try different models and observe their performance.

[07:41:13]Over time they learn which ths works best. However, these approaches cause problems on devices.

[07:41:24]On edge devices, GPU memory is limited.

[07:41:27]So only a subset of DS model can reside in GPU memory.

[07:41:33]When exploration selects a non-legant drops, it must be loaded from a slower memory.

[07:41:43]On the JSON or in nano, loading a non-leant drops takes 2.7 times longer than a single drop and verify iteration.

[07:41:53]As a result, frequent drop switching can significantly slow down generation.

[07:42:02]We evaluated MAB based adaptive speculative decoding methods. As expected, they improve token acceptance by identifying better draft selection.

[07:42:16]However, as shown on the lights, a large fraction of execution time is spent on draft loading or pullback execution using a subops model. As a result, the gains from better d selection are largely offset.

[07:42:33]In other words, better direct selection does not necessarily leads to throughput on the devices.

[07:42:43]To summarize, dynamic D switching can improve performance.

[07:42:49]However, on H devices, switching cost are too high to fully realize those gains.

[07:42:58]The key problem is that the best threat may not be immediately executable if it is not legend in GPU memory.

[07:43:07]Existing methods focus on binding the bet strap but ignore whether that trap is actually available.

[07:43:16]This leads to our key insights on Et devices. We must jointly optimize drap selection and d availability.

[07:43:27]To address this challenges we designed the mespec around the two principles.

[07:43:33]The first is proactive draft legency management.

[07:43:39]Rather than reacting after the draft is needed, MEMS pack predicts promising drafts early and keeps them resident in best memory.

[07:43:50]This aligns digency with future demands.

[07:43:56]The second is decoupling drop solution from execution.

[07:44:01]Rather than blocking on non-residents, mespect continues decoding with the best currently resident drafts while preparing better candidates in the background.

[07:44:15]This allows decoding to make progress without waiting for model loading.

[07:44:23]MEMS pack implements these two principles through three components prediction engine draft model manager and longtime controller I'll briefly explain each component first the prediction engine instead of exploring multiple c multiple drugs model online memsspect predicts the best threat model for the current context using a input prompts and recently generated tokens. We implements these predictors using a lightway based model.

[07:45:05]Second, the direct model cion manager based on the predictors linking. It keeps high probability drafts in best memory and asynchronously prepatches promising candidates.

[07:45:19]This helps ensure that the light straps are available when they are needed.

[07:45:28]Finally, the runtime controller.

[07:45:31]It always learn the best available drops while preparing better candidates in the background.

[07:45:40]As a result, decoding never has to wait for model loading.

[07:45:49]Let me walk through how MEMS pack works at long time.

[07:45:54]We start by learning speculative decoding with the best available drugs already in best memory.

[07:46:04]Meanwhile, the prediction engine analyze the current prompts and recently generated tokens and length the candidate threats. In this this example, T2, T1, and D5 are the top three.

[07:46:24]Based on this linking, the cash manager then evicts lower priority DS and asynchronously prepatches the top length ones.

[07:46:39]Decoding continues throughout this process.

[07:46:45]Once prep patching completes the runtime controller switches to the better dress at the next scheduling points and this process lies throughout generation.

[07:47:03]Now let's move on the evaluation. For all experiments we used on Nvidia Jun nano.

[07:47:12]For models, we use llama 2 and quen 2.5 with seven billion target models contained to inpo using TPTQ and draft model around 0.5 billion parameters.

[07:47:30]We use prompts from diverse data set spanning five domains general code mass law and medical.

[07:47:44]We compare five draft solution strategies.

[07:47:47]First, general static use a single general purpose draft for all prompts.

[07:47:55]Second, Oracle static evaluates all candidate drafts of prime and selects the best one for each com each traps each prompts.

[07:48:08]Third, MABA sync is a state-of-the-art exploration based approach.

[07:48:16]And fourth, Oracle dynamic assumes perfect knowledges of the optimal draft at every scheduling points and serves as upper bound.

[07:48:29]Finally, we compare MEMS pack against all of these baselines.

[07:48:37]This graph shows normalized throughput where all results are normalized to general static mespec improved throughput by 40.7% over maba sync on average.

[07:48:52]The key reason is that mespec jointly optimized d selection and draft legendy.

[07:48:59]It not only selects better draps but also ensure they are legant when they are needed.

[07:49:07]Compared to Oracle dynamic mspec achieves 95 to 97% of its performance.

[07:49:14]This means near Oracle performance without expensive offline evaluation.

[07:49:23]To understand where the performance gaps comes from, we break down execution time into three components. desire execution using the prepared threats, fallback execution using a non-prepared threats and prediction overhead from learning the predictor.

[07:49:44]This graph shows execution time normalized to maba sync.

[07:49:49]MABA sync spends nearly half of its execution time on bbec execution.

[07:49:56]However, mespec reduce this to below 5% while increasing the fraction of desired execution.

[07:50:06]Prediction overhead remains below 3.9%.

[07:50:12]This confirms our key insights.

[07:50:16]Better drops availability leads to fewer bbacks and higher throughput.

[07:50:25]Finally, we analyze the contribution of comp each components.

[07:50:31]Prediction only use the predictor without proactive prepatching.

[07:50:37]Prediction alone improves token acceptance by a token acceptance, but a large portion of execution time is still spent onback execution.

[07:50:52]Once legity management is added, roll back execution drops below 3%.

[07:51:00]This result shows that prediction alone is not enough.

[07:51:05]The key is ensuring that the predicts are actually legend when they are needed.

[07:51:16]To summarize the real bottleneck in adaptive speculative decoding on edge devices in not just selecting the light drops but ensuring that it is light for execution when needed.

[07:51:31]Manspack addresses both through prediction guided legency management as your result. It improves throughput by 40.7% over MAB as sync and achieves 95 to 97% of Oracle dynamic performance.

[07:51:55]The key takeaway is that timely draft availability is the key to realizing adaptive speculative decoding on devices.

[07:52:08]Thank you for your attention. I'd happy to take any questions.

[07:52:20]>> Was there any questions?

[07:52:25]>> Yeah, maybe I'll ask something. Uh, thanks for the presentation. So, I was wondering like you evaluated this on this llama and the other model.

[07:52:32]>> Yeah. are these uh traditional edge uh workloads and and maybe you can comment to like if they if then you execute even smaller models how this will change because I guess this this this overhead of the availability uh depends largely on the size of the model or not >> uh size of draft model target models or >> yeah so you you evaluated on llama and another one, right?

[07:53:04]>> Yep. Oh, >> and I was wondering if if that if those results will change if you use smaller.

[07:53:12]>> Oh, for edge devices.

[07:53:20]>> Yep.

[07:53:22]>> Let me think for a moment.

[07:53:35]Okay, there's no more questions in DS model for CL 0.5 billion model is the smallest available models and we used similarly sized DS for llama. up and uh target model uh >> take that off.

[07:54:14]>> Thank you.

[07:54:18]Okay, thanks. Thanks.

[07:54:20]>> Thank you.

[07:55:21]So the title of the next talk is bridging the memory hotness gap in edge systems with hotness segregated object allocation and the talk will be given by rang from pecking university.

[07:55:39]>> Yes. Okay. Thank you. Uh good afternoon.

[07:55:42]I won. Uh this is Rel from the P University and uh today I'm also delighted to present our work the bridging the memory hotness gap in the edge systems with honest of segregated object allocation. Uh this work focus on bridging a stop but costly mismatch in existing memory management subsystems.

[07:56:05]uh and this work is a collaboration between the PI University and H technologies and we are honored to share it at the conference.

[07:56:14]Uh yes, the edge systems often run with limited DAM. So this makes memory one of the tightest resource in the real world deployment. Uh to use the memory more efficiently, the existing systems often use some hotness aware approaches for optimization. Let's give uh let me give you some example. Uh the me memory swapping will evict some code pages to slow the to to the slow storage to reduce the memory pressure and keep the hot pages into the uh fast RAM to achieve a better performance and the memory duplication prefers to uh share the code redundant pages to make more available memory and preserves other hot huge pages untouched to keep the performance. So both of these two mechanisms are sensitive to the same issue. The kernel must know which exactly know which pages are really hot and which pages are really cold. Uh if this uh page information is inaccurate then both of these two uh two systems will suffer from uh performance degradation.

[07:57:26]uh for example if the memory swapping wrongly evict a page to the slow storage and the application will suffers from the extra page faults because they need to bring the page back to the DM from the slow storage and as well as the corresponding uh uh expensive IO uh if the M duplication wrongly shares the uh pages and the system may incurs additional uh overhead from the comp copy right or as losing the memory saving opportunities.

[07:58:00]Uh however the page hotness is not always accurate because the kernel observes observe the hotness as a page level granularity but the memory objects in the in the in the same page can have very very different access frequencies.

[07:58:16]So this is a page honest crew where the hot and cold objects are coll located in the same page at the same time.

[07:58:26]So this figure shows the uh this pattern the dark and light regions are interle in the same page instead of forming a purely hot or purely cold regions. And this makes the layout will confuse uh confuse the page level classification because a page may looks very very hot even if it only contains only a few uh hot objects. Uh as a result in our microbenchmarks these chronies can significantly degrade the memory access throughput up to an order of magnitude.

[07:59:02]The root cause of the screwess comes from a mismatch between the system and the application because the kernel managed memory uh by the page hotness but the general purpose allocator will place objects according to their size classes. These two rules are are uh optimized for different purpose. Uh the kernel can only conduct page level uh operations through the pitch table. So it wants a cleaner hotness signal to achieve the page level operation. But the general purpose allocator used by allocation prefers to use a side regment strategy to achieve a faster allocation and reduce the me memory fragmentation as well.

[07:59:48]Uh this will increase the mix pages because the same size objects with the same uh with different hotness can be uh can be placed together in the same page but the objects with similar hotness can be scattered across different pages because they their size classes are totally different.

[08:00:08]Uh after the allocator forms this layout the kernel uh sorry the kernel has limited operations. They can only scan the page uh classify the page, split the huge page or just evate the page to the slow storage. Uh but they cannot change the page layout of the objects inside the page uh because it is already determined by the allocator instead of the kernel.

[08:00:35]So our insight to uh to address this problem is that we need to play the objects uh with similar hotness together uh together instead of according to their uh size classes. Uh therefore we present a new memory allocator hotman lock with the hotness aggregated allocation strategy.

[08:00:57]The hot meal will partition the uh entire heap memory into se several sub heaps and each sub heap will owns a continuous uh virtual memory pages to store the objects from the one uh uh hotness class and the object is allocated from the corresponding subhe that exactly matches its expected hotness.

[08:01:21]So with this layout the each page could becomes more hotness uniform. Therefore this can reduce the uh page hotness crew and gives the kernel a cleaner signal for the future memory optimizations.

[08:01:38]Uh the key challenge behind the hot meal log is that we need this hot hotness information before the allocation time before uh uh that is before the object is actually allocated and used and we don't expect to incur too much runtime overhead. Uh therefore hot use a profile guided optimization based method. It will profiles from the previous runs and learns the relation between the allocation as as well as the uh corresponding hotness and then it will use this historical information to guide the further allocation.

[08:02:15]So there are three key components in hot melo. Hot offline analytics will uh learn which allocation context creates hot or cold objects and uh synthesize an application specific allocator with hotness information uh inside the allocator and uh during the runtime the synthesizer allocator will identify the hot uh object hotness according to their uh allocation context and we allocate memory from the corresponding only sub heap to to achieve the hotness segregation and finally the allocator could expose the sub hotness directly to the kernel through our interface. So the colonel can directly use this exposed hist information instead of just relying on some uh online monitoring. And the the way uh we could leave the expensive profiling to the off uh sorry we could leave the expensive learning to the offline and just leave the only uh lightweight classification uh as the allocation time.

[08:03:28]Uh during the offline phrase the hot melo use the profiling to collected the allocation and access information. The profile will instrument one run and tell us which object which allocation creates which objects and how often of each object is accessed. Then it will map the allocation context to their corresponding uh honest classes and after this analysis the hot melo will generate an application specific allocator for the deployment usage.

[08:04:01]The key point is that the profiling cost is paid before the uh real world deployment and during the normal execution the hot melo does not need any uh any online page sampling and it also avoids the read and write barriers on the memory accesses. The learn hotness will become becomes a deployment artifact for the hotness estimation instead of a runtime text.

[08:04:31]And uh uh during the runtime the hot meal needs to identify the p uh object hotness according to the allocation context. The whole process is simple and efficient. The hot meal uses the calling stack frame to represent the uh allocation context. Yes. uh the calling stack frame will be encoded uh to generate an object ID and the hot melo will use this ID to find the target sub heap and finally after after this identification the hot melo will allocate memory from that from that sub heap to achieve the allocation.

[08:05:09]uh we can find that the placement decision only happens once and when the object is created and after the placements the object can be used normally. There is no relocation and no pro access monitoring overhead. This keeps the common memory access paths unchanged and therefore leave the hot meal transparent to any applications.

[08:05:35]Uh based on this horn created layout the hot melo also provides two simplified me memory management strategies for two specific uh scenarios uh that is for the swapping and duplication. For swapping our goal is that we need to protect the most frequently accessed objects. So we use the hot first strategy which groups the hottest objects into one hottest sub heap so that these pages in the hottest sub heap can be pres uh even under the memory pressure. And for the duplication, the goal is totally different because we want to uh find some pages that can be safely shared with uh with low performance risk from the uh unnecessary and additional command right. So we uh we use a code first strategy which groups the codest objects into the codest sub heap so that the system can just only focus on these code pages for sharing and other uh pages in the hot sub heap can uh can keep untouched to achieve a better performance.

[08:06:48]Uh beyond this the allocator could also provide a simple interface to expose the subhave hotness to the kernel. For example, for each sub heap between the edr and the edr plus dice, the h the hot meal could expose the hornness to the kernel through our uh newly added system call interface.

[08:07:08]Uh this is highly useful because the allocator knew knows more about the uh object hotness distribution than the kernel. so that the kernel can triply directly use this information instead of just relying on uh some online monitoring by itself.

[08:07:29]Okay. Uh for the evaluation we test both uh memory swapping and duplication these two scenarios and we compare the hot melo against the common lexy allocators such the one from jibcy jalo and the maloc and we will also compare the hot melo with other hornest aware approaches such as memory t from sosp store from the uh uh osti uh as well for the uh duplication scenario we also compare with the smart MD plus uh that are published in TOC.

[08:08:07]Uh our eval evaluation question is very direct. We want to know if we improve the page layout can the kernel mechanisms actually become more effective.

[08:08:19]Okay. Uh the first result we first checks whether the hot melo could uh reduce the page hotness crew and in our microbenchmark the hot melo could reorganize the memory so that the huge page become much closer to purely hot or purely cold regions and as a consequence uh the it can improve the memory access throughput by up to 7.6 six times.

[08:08:48]And for the real world applications, we also uh measure the maximum within page harness gap. That is the different access of the most frequently accessed objects and the least frequently access access objects in the same page. And this difference are dropped by 8 to 25 times.

[08:09:10]These results confirm that the hot meal lock changes the distribution of the hot and cold objects in the memory and as a result the page level hotness becomes more cleaner and so that the kernel can uh rely on this cleaner hotness signal for better optimization.

[08:09:30]Uh under the memory swapping we found that the hot melo improves the overall performance by up to 42.1%.

[08:09:38]The main reason main reason of this improvement is the fewer page falls. We found that the hot meal could reduce the page faults by up to 90%.

[08:09:49]This happens because the hot objects are grouped together in the same uh in some in the same huge pages and under memory pressure the kernel can keep these pages restant and just evate other coder pages more accurately. Uh in other words, we can more accurately evict the pages and we therefore could reduce the uh unnecessary page faults.

[08:10:17]The hot melo could also helps preserve the huge page in Linux. Uh we found that the swapping can inevitably involve the huge page splitting because when hot and cold objects are placed together in the same p in the same huge page the kernel must split the entire huge page into several 4 kilobits small pages and just evit these pages uh that containing the code objects to the slow storage and this splitting will lose the TRB benefits and as well as incur There's other additional overheads such as the TRB shutdown.

[08:10:55]Uh hot melo groups some hot objects together. So it makes the hot huge page more stable during the execution and as a result the hotmelo could keep keeps an average huge page of about 93%.

[08:11:14]The sim layout principle also helps the memory duplication with a code first strategy. The hot meal could groups the code object together and therefore the system can just focus on these code regions for sharing where other hot huge pages can be preserved for performance.

[08:11:34]On the contrary, smart plus use some online monitoring according to the access speed from the page table. So uh they need to they need this is a mechanism to determine the page hardness as well as which pages to share.

[08:11:50]However, this page level classification may be uh misleading because uh because of the page honest crew and therefore the system will lose some memory saving opportunities.

[08:12:03]uh compared to this that giving a more cleaner page hardness the hot melo could dduplicate dduplicates up to 40.6 times more memory while keeps the memory performance gap stay below 3%.

[08:12:19]Uh okay let me conclude the page hardness crew comes from a mismatch between the uh between object placements and the page level kernel management.

[08:12:30]The hot meal lock use uh reduces screwess uh by placing objects with a similar hotness together. It use the offline profiling and the lightweight allocation time uh classification as well as the simple kernel facing interface to expose the hip hotness and with this layout the swapping could becomes more accurate and the duplication and could save more uh memory. uh sometimes the best system optimization is not uh always a more complicated policy but could be a better interface between the different layers.

[08:13:08]So we have found that a better memory policy could start with the better memory layout. So uh this is end my presentation and uh thank you for your time and I'm happy to take your questions.

[08:13:21]>> Thank you.

[08:13:28]Thank you for the great talk. I was very interested in this uh research project.

[08:13:33]Um I I'm wondering about the uh the experimental setting by the way. So So you mentioned you evaluate your skin with what the mantis and sore those two papers were for from my understanding those are for CX memory. So do you assume this scheme may work for the embedded systems as well or what is your evaluation setting?

[08:14:00]>> Uh yes, we conduct our evaluation in the server server.

[08:14:08]Yes, we uh but we think our design is not uh is is not platform specific.

[08:14:16]We target for the resource limited scenario. So it will also at least it will also works in x uh x86 platform and and at least for some other ARM platform I think it may be work also but I do not try it yet.

[08:14:48]Hi. Um, so if I understand correctly, you have one sub heap for the hot objects and one sub heap for the cold objects.

[08:14:56]>> Uh, yes.

[08:14:59]>> Um, so what I'm But like of course in practice objects are not just hot or cold like binary. There's kind of a continuum, right? There's kind of a spectrum, right? So how do you decide where to draw the line like this object is hot or this object is cold? Um, >> uh, distinguish.

[08:15:16]>> Uh, sorry I didn't didn't get get you.

[08:15:20]>> What makes an object hot? Like what is the >> like like >> was the criteria to classify them >> or like here's a concrete example. So you look at the profile information from the last run and you see okay this object was accessed 100 times. Is it hot or is it cold?

[08:15:39]>> Uh so sorry.

[08:15:44]Hey.

[08:15:46]>> Oh. Ah, sorry. Maybe am I talking too quickly?

[08:15:50]>> Yes. Uh, >> sorry. Sorry. So, uh, for the you're using profileg guided optimization, right?

[08:15:56]>> Yes. Yes.

[08:15:57]>> So, the previous run I I assume you're seeing okay this object was accessed uh 100 times as an example.

[08:16:05]>> Yes.

[08:16:05]>> Right. So, this object that was accessed 100 times, is it hot or is it cold?

[08:16:11]>> Uh, yes. We classify the hot objects and code code objects not according to its absolute access frequencies but we uh we will uh order the entire objects and find the top k objects as hottest.

[08:16:31]>> I see. And and how do you choose the value of k?

[08:16:34]>> Uh sorry.

[08:16:35]>> Ah sorry. How do you so you choose the the top k objects are hot. What's the value of k? Uh the this key could be determined by the the administrator.

[08:16:49]This is userdefined.

[08:16:50]>> So it's configurable.

[08:16:53]>> You the yeah it's configur you can choose what the value is.

[08:16:58]>> Uh actually it can be determined by the uh volume of the DM.

[08:17:05]So you you you need to keep the top K the hottest object into DM. Therefore the DM volume must could keep these objects in the memory.

[08:17:15]So we can determine the K according to these objects the size of these objects and find the corresponding key.

[08:17:24]>> Okay, that makes sense. Thanks.

[08:17:25]>> Oh, thank you. Thank you.

[08:17:28]Uh have you noticed uh some objects if they might be cold in one part of the execution and hot in another part of the execution >> execution >> that there could be objects that start off cold but later become hot >> or vice versa.

[08:17:50]>> Okay. Okay. I I I and uh we do not uh we target for the uh the object hotness across the entire lifetime instead of some windows uh within within some window. Yes, we we uh we need to place some hot hot object together and to create a to provide a cleaner honey signal. So uh in case some there actually maybe some objects maybe uh become cold uh after some time. Yes. But uh during the during the entire lifetime it is actually hot if um if it is really hot but uh uh it actually become colder at some time but we focus on the entire lifetime. Yes.

[08:18:40]>> Thank you.

[08:18:41]>> One quick one.

[08:18:43]>> Um may maybe I missed this. This is for DM right?

[08:18:47]>> Uh we do not have specific swapping method. Uh the DAM could also works.

[08:18:57]>> All right. So I was wondering did you notice any degradation in terms of uh parallelism like for instance if you if you pack the DM pages then you would get many bang conflicts right. So whereas if you have hot accesses across banks then you would have more parallelism within the memory subsystem.

[08:19:20]>> Yes.

[08:19:20]>> Is that something that you noticed?

[08:19:23]Because you are packing the hot objects in a page. So you will res reduce the amount of parallelism.

[08:19:32]No, we need to uh prioritize packing other code objects.

[08:19:38]Therefore, this this should be a co-esign between the system and the application.

[08:19:43]>> So, as an example, right?

[08:19:45]>> Yes.

[08:19:45]>> If everything goes into one page because that's the hottest.

[08:19:49]>> Yes.

[08:19:50]>> Then every access will result in a bank conflict in DM will these objects will be put in the DM and will not be compact.

[08:20:02]>> Right.

[08:20:07]I think it's time to So the last talk we will tell us more on the origins of indirect jumps in embedded software And the talk will be given by Arian Nicolola from the University of Ren. So please sorry.

[08:21:17]Okay. Yes. Okay. So, hello everyone. I'm Ayan Nicolola. I'm a first year PhD student at Enri. And today I'm going to present you the paper on the origins of indirect jumps in embedded software which I co-authored with Isabel Pio Erin Ru and Ronald Lazerm.

[08:21:38]So embedded software often need to provide guarantees regarding correctness, execution time or security.

[08:21:46]Most of the time these guarantees rely on the presence of a control flow graph of the program which is the graph representing all possible execution path of the program. So here you have an example of resource code and the associated cfg.

[08:22:02]So cfg is used for example for data flow analysis worst case execution time estimations or control flow integrity mechanisms. So in order to provide solid guarantees for embedded software the control flow graph must be accurate.

[08:22:19]That means is it must faithfully represent all possible execution path and avoid invisible edges.

[08:22:28]In a CFG we encounter in general two types of control flow transfers.

[08:22:34]Direct jumps whose target is statically known. So here you have an example of a direct jump in risk five assembly coming from a function funk one and going to a label label one. So here the associated cfg is fully accurate because we know exactly where to go.

[08:22:50]The second time of control flow transfer is the in indirect jump. So this time the target address is stored in a register. That means it will only be determined at runtime. An indirect job may have multiple possible destination depending on runtime data and its target may even be completely impossible to determine statically for example if depending on user input.

[08:23:16]So consequently in the general case constructing an accurate CFG in presence of indirect jump is impossible.

[08:23:25]So in this work we are interested in indirect jumps origins. We try to answer the following research question. When and why are indirect jumps generated by the compiler? Are they due to programming constructs used by the developer or are they directly generated by the compiler? Maybe without the developer knowing it. Some other questions arise also. Are there differences in between binaries from different languages? What are what is the impact of optimization levels on indirect generation?

[08:23:56]So to answer these questions, we provide the two following contributions. First a fine grain classification of indirect gems origins gathering programming constructs and compi compiler transformation responsible for the generation and a systematic empirical study on indirect gem distribution in binaries from multiple programming languages compiled with two different compilers at multiple optimization level.

[08:24:21]So before going deeper into the presentation of these contributions, I must first of all introduce the architecture we are working on which is risk 5.

[08:24:31]So risk 5 risk 5 is an instruction set architecture which is open and extensible. We chose it because of its increasing adoption in embedded systems.

[08:24:45]So risk 5 provides 32 general purpose registers from x0 to x31 and for for our work uh some of these registers are relevant to know. So the zero register is hardware to zero. The x1 or ra register is dedicated to the return address in case of a function call and the t0 register may be used as an alternate return address.

[08:25:11]So risk 5 provides a unique indirect jump instruction which is jalr for jump and lake register. This instruction takes as argument the return address and jump to the address contained in the errors one register plus or minus an offset.

[08:25:29]So depending on the value taken by its argument the this instruction may have multiple different behaviors or four different behaviors that are defined by the AI convention of risk 5. So if the return address is properly set we want to to uh come back to the calling point.

[08:25:47]So we know it's an indirect call. On the contrary if the return address is set to zero we want to jump without returning.

[08:25:55]It's a plain indirect jump. If the target address is the return address, then it is a function return. And finally, R 5 provides a fourth type of indirect jump which is generated when a direct jump cannot reach its target.

[08:26:10]Instead, a direct jump is generated preceded by a AO AIPC instruction to construct a complete address.

[08:26:18]So now that I have presented risk 5, let's dive into our first contribution, the detailed classification of indirect jumps origins.

[08:26:27]So as said just before risk five specification allows to uh differentiate four different macro categories of jump.

[08:26:36]So indirect calls returns long run jumps and indirect jumps. Some of these categories can be detailed to further refine the origin of the jump. So here is the complete classification. I will detail a few of these categories in the next part but for now let's focus on this big picture. So some of these categories are due to the compiler. It's the case for returns, long run jumps, dynamic linking calls, term tables, and tail calls. The others are due to the developer.

[08:27:08]So namely dynamic dispatch, function pointers, callbacks, and computed gotos.

[08:27:13]So now I will present three selected categories of jump uh jump through dynamic dispatch, long run jumps and jump tables. So the other categories are detailed in the paper if you are interested.

[08:27:27]So first of all jump through dynamic dispatch. These this category of jump was present in two of the studied languages C++ and Rust and is due to programming construct used by the developer. So in these two languages the concept of polymorphism allows to define abstract interfaces which can be implemented by multiple concrete types.

[08:27:51]So here is a snippet of C++ code. Here we define a base class which contains two virtual methods, method one and method two. Each of them printing a small message. Then we define a derived class which inherits methods from base class and it overrides the method one now printing a different message. Then we define a call method one function which takes as argument a base reference. So here the polymorphism allows to give as argument either either the base class itself or any class inheriting from base. That means we don't know which exact class will be given as argument at runtime and which exact method one will be called. So this is determined at runtime through dynamic dispatch and through the use of a virtual table.

[08:28:43]So for each object the compiler will generate a virtual table which is an array containing pointers to each declared methods. So here for a base object we have an array pointing to method one and method two. Then when a class um overrides a method uh the pointer will be updated to point to the new method. Here for a derived object method one points to the new method one. Then at runtime the v table of the object will be read and a jump indirect jump will be performed to the right method.

[08:29:19]Okay. So now I'm going to present long run jumps. So in risk 5 we have a direct jump instruction which is jal and this instruction instruction can reach targets for uh in a range for of plus or minus one megabyte from the current PC.

[08:29:37]If the target is farther the compiler must must generate a long run jump. So it's it takes the form of a fusion of two instruction AUIPC and GI GR. So IIPC constructs the higher bits of the address and JR constructs the lower part of the address and performs an indirect jump. So here we can notice that even if we have an indirect jump, the address is statically known. So it is a special type of indirect jump that is spec specific to fixed size either.

[08:30:12]Um now jump tables. So jump tables are generated by the compiler to optimize long comparison sequences such as switch statements or large if else sequences.

[08:30:24]So here you have an example of a snippet of code uh of a switch containing three cases. So a jump tables table is an array sorry an array containing the addresses of each case cut blocks of the switch.

[08:30:42]So here for for this code the following gem table is generated pointing to the three cases c blocks. Then at runtime the switch input will be used would use to select which case code blocks to jump to resulting in indirect jump.

[08:30:58]Okay. So here it is for the three categories I presented. Now let's uh let's go to present the second contribution which is our quantitative study on indirect jump generation.

[08:31:10]So the objective of this empirical study is to quantify the number of indirect jump generated in risk five binaries and precisely the proportion of jumps in each identified category. We want to highlight the differences between binaries from different languages and the impact of optimization levels. So I will not present the latter here because of the timing.

[08:31:33]Okay. So um to obtain binaries here is our experimental setup. We as said before we target a risk 5 architecture and precisely a risk 5 32G architecture without compressed in structure. So we studied four different languages C, C++, Forron and Rust. uh we used the spec CPU 2017 benchmarks for C and C++ and for languages and added to it the gypsy and for us we made our own small suit of benchmarks uh using rip gp core utils and cargo. So we compiled these benchmarks using two compilers GCC and LLVM at six different optimization levels.

[08:32:16]Then once the binary were compiled, we disassembled them and perform static analysis on the disassembled code. The idea is to for each uh indirect jump instruction to obtain its category. Uh so we have different level of identification. Some categories of jump can be identified directly from X5 AI depending on the value taken by the registers. So it's the case for returns and long run jumps.

[08:32:45]Then we perform backward backward analysis on instruction preceding the jump uh to identify uh recognizable patterns. So at this level we can identify calls through v tables uh tail calls and dynamic linking calls.

[08:33:00]Then if the disass disassembled code doesn't uh give any int on the category we obtain the source code line using address to line and we perform pattern matching on it. So at this level we can identify function pointers, jump tables and computed gotos.

[08:33:18]Then if some jumps are left unclassified, we can perform a manual pass to identify the category manually.

[08:33:25]But this pass is optional.

[08:33:28]So here are the results we obtain uh after our empirical study. So on this graph are displayed uh benchmarks compiled using optimization level O2. So benchmarks are separated by language and ordered by size. Here are the smallest benchmarks and the largest benchmarks.

[08:33:46]This this way um the histogram bars represent the proportion of indirect jumps uh in each category over the total number of instruction. So I'm going to present the key insights on this graph.

[08:33:59]So first of all we can see that the overall proportion of indirect jumps is quite small only up to 8% of the instructions in R benchmarks.

[08:34:09]Then we can notice that smaller benchmarks exhibit less diversity in different categories of jump. Uh this can be explained because larger benchmarks require more abstractions and thus more indirect jumps.

[08:34:24]Larger benchmarks uh exhibit a higher percentage of long run jumps. This is expected because as the binaries are bigger, direct jumps are more likely to be too far from their targets and to be replaced by long run jumps. We encountered that it's especially the case in rust benchmarks. We encountered that it was due to its linker which by defaults uh generates for every call a long run jumps. Yes, this can be disabled using a specific option.

[08:34:54]Um in C we can see that there is a quite specific you systematic use of function pointers. we encountered that it was mostly function pointers stored in structures uh to serve as methods for the strcts and in C++ we can see that there is a wide use of dynamic dispatch calls which is expected also because of the object-oriented parading of the C++ okay so here it is for our results I will now conclude and open on future works so I presented here our work on the origins of indirect jumps We provided two contributions. A complete classification of indirect jumps origins and static statistics on the number of indirect jumps generated in each category in risk five binaries.

[08:35:44]So now that we have identified the different categories of indirect jumps, future works could um focus on how to manage them in embedded software. So uh development guidelines could be formulated to discourage the use of certain programming patterns or to encourage the use of specific compilation options to remove some categories of indirect jumps. And to go even further, a more radical approach could could think about entirely removing the indirect jump instruction from an ISA and providing compiler and hardware support to efficiently replace them with safer constructs. So this is the approach that I will be exploring during my thesis.

[08:36:29]So thank you for your attention. Feel free to ask a question.

[08:36:38]>> Are there any questions?

[08:36:49]>> Yeah. Hi, thanks for the talk. Uh kind of a small question. So when you say tail calls in the paper, I guess you mean as in like tail call optimization.

[08:36:58]>> Yes. Yes. Tail call optimization. Yes.

[08:37:00]>> Why does that lead to indirect jumps? I would have expected it to be basically like you know a Y loop where you just kind of jump back to the beginning of >> the function each time.

[08:37:09]>> A take call is an uh I mean at the beginning it was supposed to be an indirect call at the end of a function but it is replaced by a jump because it's the last instruction of the function. So you don't need to return to the calling point. It's only a jump which is performed.

[08:37:26]Does it answer your question?

[08:37:27]>> But then in that case there would be no indirect jump uh at the end of the day, right?

[08:37:32]>> No, the the actual the the the call at the beginning was supposed to be an indirect call. So it's replaced by an indirect jump instead of a call.

[08:37:43]I think the question is if it's a tail call there shouldn't be it should be a direct jump and not an indirect one.

[08:37:51]>> Yes, but the the indirect jumps are generated only when the last instruction was supposed to be an indirect call and not direct call. A direct jump will be generated as a tail call as an optimization if the the call at the beginning was a direct call. But here it's indirect calls that are optimized at as indirect jumps.

[08:38:15]>> Okay. Thanks.

[08:38:17]>> Okay.

[08:38:22]>> Yeah. About the radical u go back one slide. Right.

[08:38:26]>> Yes.

[08:38:26]>> So I think removing the the instruction isite quite radical. Yeah, >> you could still enforce compilers to generate server constructs or you could >> like instead of going and finding that in the assembly. Yes.

[08:38:39]>> Like for instance the switch case thing.

[08:38:41]So there's a lot of cases where the compiler could >> Yeah.

[08:38:44]>> produce >> extra debug information something so that they so that they because what the interest is to be able to reconstruct more accurate CFGs.

[08:38:52]>> Yes. Exactly.

[08:38:53]>> So if this information is known by the compiler >> because it is known.

[08:38:56]>> Yes.

[08:38:57]>> So this can be >> passed somehow. Yes.

[08:39:00]>> So that the code of flow graph can be more accurately produced.

[08:39:03]>> Yes. But there are some kinds of indirect jumps that are that are more difficult to to replace with something else.

[08:39:10]>> Yeah. No, that part I know. But I'm saying that you could increase the accuracy of the control flow.

[08:39:16]>> Yes, of course. That's why there could be development guidelines uh to >> No, but with the compiler as well.

[08:39:22]>> Okay.

[08:39:22]>> So the compiler could Yes.

[08:39:24]>> could help out in >> Yeah. Yes. The compiler itself could uh Yeah.

[08:39:31]Hi. Hi. Thanks for this uh analysis. Uh two questions. First of all, uh C++ supports lambda. Did that appear at all in in this analysis at all?

[08:39:44]>> So we used the spec CPU 2017 and we encountered that there were no lambda at this point. But I think yes, it would be an indirect jump.

[08:39:53]>> Okay. Other question is uh Rust has uh function blocks a lot like lambdas and did those and they use them a lot. I I guess those show up in the analysis.

[08:40:04]>> Yes. Uh you can read the paper if you want but uh yes but uh uh this kind of jump um it depends if they are under a trait under if they are given as argument like you you speak about closures that's it. Uh yes, closures can result in an indirect jumps if they are given as argument uh uh in a function.

[08:40:25]>> Thank you.

[08:40:29]>> In your presentation, you only use 32bit uh instructions. Have you looked at the uh applications with different bits, >> 64 bits, 128 bits?

[08:40:42]>> I think uh 64 instead of 32.

[08:40:45]>> Yeah, >> I think it normally doesn't change anything. It's only the addressing that's >> Yeah. But let's uh it would be interesting to see whether the analysis you have done would be >> Yes. Yes. Maybe >> the same or will become different >> less big jumps.

[08:40:59]>> Yes. Yes. For example.

[08:41:00]>> Yeah. For example.

[08:41:04]>> I have a question as well. So >> you mentioned uh you are using a static analysis.

[08:41:09]>> Yes. Could you tell us why you need uh you you need to perform the analyszis at the source code and at the binary level as well? So why do you need both and what do you do with this analysis?

[08:41:23]Um we wanted at the beginning to do uh only uh static analysis on the binary but it was missing we were missing information so we continue at the source code level. We could have used only the source code maybe um but yes it was also to see if some specific patterns were generated in the assembly. Uh so both were interesting for us.

[08:41:47]>> Okay.

[08:41:48]Okay. So let's thank the speaker again.

[08:41:58]Don't leave.

[08:42:33]Good. Uh final talk of the day, a a quick closing session. Um yeah, so let's hope that you guys enjoyed the program. I think it was was pretty cool.

[08:42:46]Um big thanks to the to the presenters.

[08:42:50]I think we're pretty much on time. I think there were only two people that over overshot. So most of the speakers were really great on on the timing. So thanks a lot for that. Um there was something that I already announced in the opening is that we decided to give um distinguished reviewer awards and so we have uh selected these two TPC members. Um only let me let me fetch the right one.

[08:43:21]Um yeah only uh so I I'll ask JJ maybe to come also in front for the for the picture taking. Uh so yeah the recipients were uh Claus Schneider and uh Junguk Choy. So let's give them a hand both of them.

[08:43:39]And since uh Junguk is here so maybe we can do this physically. Claus will receive his award. Uh purple. So maybe you can just take it. Yeah.

[08:43:49]Congratulations first of all.

[08:43:50]>> Oh thank you.

[08:43:51]>> And thanks for the service.

[08:43:53]>> So take this maybe stay in the middle.

[08:43:55]>> Is that okay with the light in the back?

[08:43:57]Maybe let's let's do something. Let's do this, right? All right.

[08:44:09]>> Yeah. Great. Thank you. Thank you.

[08:44:16]>> All right. We also said that we had u narrowed down uh the selection for the best paper award best paper award uh to these two candidates that was based on the um reviewers comment the discussion within hot crap system. Uh then we had some extra reviewers check those uh six papers that we selected and we got a ranking but we wanted to wait for the conference to then uh decide on the final winner and the winner is um towards uh verifiable system code using a DSL. Um so big congrats. Um, we actually have one for each author.

[08:45:03]So, you can pose and maybe you take all of them >> and and >> Yeah. Should I do this again? Yeah. Congratulations.

[08:45:14]>> Thank you.

[08:45:19]>> Right. Thank you.

[08:45:24]Congrats.

[08:45:24]>> Thank you.

[08:45:25]>> Yeah. Thank you. Congrats.

[08:45:28]All right. And uh and that's it. So, thanks thanks for attending LTS 2026. Um as I said before, thanks to the authors for submitting for for the speakers for sticking to the time uh to the session chairs which did an excellent job moderating uh sim committee members and of course the attendees. I think it was a very nice uh conference. No no no no ill questions uh and and question from the audience. So, so thanks a lot and yeah, see you next year. We don't know yet where somewhere in the globe. So, I wanted to take a picture, but of course, we cannot uh take that out like from the field organizers. They they should announce that. All right. Thank you very much. Enjoy the the rest of the afternoon.

[08:46:16]>> Yeah.

[08:46:22]I want to record uh that this uh our thanks to Geronimo and Christian for a fantastically conducted PC meeting. Uh thanks to JJ and the rest of the steering committee for making this a wonderful conference. Uh I was really I've been on many program committees.

[08:46:40]I've chaired conferences. I think we I learned a lot during the program committee meeting.

[08:46:45]>> It was very efficient, right? It was like >> you were very efficient and you were very I think the the quest the kind of questions that were discussed in the PC meeting was was really very good. Uh I also would encourage all the people here and your contacts your six degrees of uh connection to submit next year to LCT.

[08:47:09]>> Great comment that was not rehearsed. So thank you. Yeah. So I on behalf of ste committee I would like to second the uh program committee and also genomic one for the events if we plan to organize LCPS next year as a two full day event.

[08:47:24]>> All right >> and if you have any feedbacks or any suggestions welcome to reach me. uh we would like to have the LCPS uh being still part of PLI but with more uh activities not only the presentations from the authors we would also like to include uh further events >> maybe a panel. Yeah.

[08:47:45]>> Right.

[08:47:46]>> So and so we would like to also hear your opinions. Welcome to talk to me or write me a mail or message if you like.

[08:47:54]Yeah. And thanks a lot for attending the conference. Hope to see you next year somewhere.

Related Videos

Artificial Intelligence

AI Agent Mastery Certification Course: Lab 4 – Tools & MCP

arizeai

350 views•2026-06-16

Artificial Intelligence

Real-time Voice cloning, Kimi K2.7 CODE, GLM 5.2 and 3D reconstruction | AI News

kaiexplainsYT

111 views•2026-06-16

Artificial Intelligence

He Believes AI Could Replace Humanity Faster Than Anyone Expects

LondonRealTV

815 views•2026-06-15

Artificial Intelligence

General Session by Rami Rahim-The next generation of networking: From vision to self-driving reality

HPE

108 views•2026-06-17

Artificial Intelligence

Google DeepMind’s AI Halves UK Housing Planning Time

60secondsignals

467 views•2026-06-17

Artificial Intelligence

The Creators of Claude Code and OpenClaw don't Prompt Their Agents Anymore?!

ColeMedin

569 views•2026-06-18

Artificial Intelligence

Why prompt injection is AI's biggest fail

usemultiplier

1K views•2026-06-17

Artificial Intelligence

The End of Annoying AI Interruptions? LiveKit Turn Detector v1 Tested

livekit_io

190 views•2026-06-17

Trending

Nobel Scientist Creates Device to Harvest Water From Desert Air

DrBenMiles

2200K views•2026-06-16

GROW A GARDEN 2 UPDATE

KreekCraft

668K views•2026-06-20

Something's off about my cat...

griffingraue

4534K views•2026-06-16

উটের কুঁজের মধ্যে কি থাকে?

MrBonGrow

1861K views•2026-06-18