Do-calculus provides three rules for transforming causal queries that contain do operators into expressions that can be estimated from observational data: Rule 1 allows deleting (or inserting) observations that are independent of the outcome in the intervened graph, Rule 2 allows exchanging a do operator for ordinary conditioning when no open backdoor paths remain, and Rule 3 allows deleting (or inserting) a do operator entirely when the intervention has no causal effect on the outcome. The rules are sound (every valid application yields a correct expression) and complete (if no sequence of rule applications removes all do operators, the causal effect is not identifiable from observational data), so researchers can estimate causal effects such as average treatment effects without conducting experiments, provided the assumed graph is correct.
CAI - 04 do Calculus | Causality for AI & ML | TU Darmstadt | Winter Semester 2025/26
Welcome to the fourth lecture, this time on do-calculus. Last time we saw structural causal models: what we can use them for, how we define them, some applications in modeling, and what choices we can make, for example using linear functions for the structural equations.
So basically, if you remember, we have a set of endogenous variables, which are the ones we observe. We have a set of exogenous variables, which are usually noise variables or sometimes unobserved variables.
All of these variables are connected via a set of structural equations, which define how effects propagate through the network. For the exogenous variables we also have a distribution that we assume: we can't observe them, so we assume how they are distributed over their domain. This could be Gaussian noise, a uniform distribution, or whatever.
If we have all of this, this tuple defines a structural causal model, which then induces a graph where the variables are the nodes and the structural equations, from the parents to their children, define the edges. Just a brief recap, because there was some confusion, also from my side: we define parents, children, ancestors and descendants, and by children I will specifically mean the direct children.
Okay.
So we saw this plant-watering example. All the variables are defined on the Boolean domain. We have a Bernoulli distribution saying that Tom, our main character, remembers 50% of the time to message his friends, who then take care of the plant. The plant is healthy, for example, if A or B or maybe both care for it.
And the system basically behaves according to the structural equations.
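As a rough sketch of how such a model can be sampled, the snippet below encodes one possible reading of the plant example. The names (`N` for Tom remembering, `A` and `B` for the two friends, `H` for the plant being healthy) and the equations `A := N`, `B := N` are assumptions made for illustration; only the 50% Bernoulli noise and the rule "healthy if A or B" come from the lecture.

```python
import random


def sample_plant(rng):
    """Draw one sample from an assumed version of the plant SCM."""
    n = rng.random() < 0.5  # exogenous noise: N ~ Bernoulli(0.5)
    a = n                   # assumed equation: friend A waters if messaged
    b = n                   # assumed equation: friend B waters if messaged
    h = a or b              # plant is healthy if A or B cares for it
    return {"N": n, "A": a, "B": b, "H": h}


rng = random.Random(0)
samples = [sample_plant(rng) for _ in range(10_000)]
healthy_rate = sum(s["H"] for s in samples) / len(samples)
print(f"P(H = true) is roughly {healthy_rate:.2f}")
```

Under these assumed equations the plant is healthy exactly when Tom remembers, so the estimated rate should land close to one half.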
Then we talked about interventions. Interventions are the case where we intervene on the system, which means we artificially define or enforce a certain value for a variable. For example, we could say: A is always true, or always false.
What this corresponds to is that we replace the structural equation with a constant assignment. In the graph this means that A no longer depends on N, so we also cut the edges from all the parents. There are some variations. We can do soft interventions: in the Boolean case that doesn't make that much sense, but we could, for example, shift the values.
We could also take the original structural equation and invert it, or something like this.
But the main thing is: if you look at the graph, hard interventions, and we will mainly talk about them, cut the edges from the parents into the intervened variable.
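To make the edge-cutting concrete, here is a minimal sketch of a hard intervention as a replacement of one structural equation by a constant. The equation table and the helper names `intervene`, `sample` and `edges` are hypothetical and reuse the assumed plant equations (`A := N`, `B := N`, `H := A or B`).

```python
import random

# Assumed structural equations: variable -> (parents, function of the parents).
EQUATIONS = {
    "A": (("N",), lambda n: n),
    "B": (("N",), lambda n: n),
    "H": (("A", "B"), lambda a, b: a or b),
}


def intervene(equations, var, value):
    """Hard intervention do(var=value): constant assignment with no parents."""
    mutilated = dict(equations)
    mutilated[var] = ((), lambda: value)
    return mutilated


def sample(equations, rng):
    """Evaluate the equations in topological order for one noise draw."""
    values = {"N": rng.random() < 0.5}
    for var in ("A", "B", "H"):
        parents, func = equations[var]
        values[var] = func(*(values[p] for p in parents))
    return values


def edges(equations):
    """Edges of the induced graph: one edge per (parent, child) pair."""
    return {(p, child) for child, (parents, _) in equations.items() for p in parents}


do_a = intervene(EQUATIONS, "A", True)
print(sorted(edges(EQUATIONS)))  # contains ('N', 'A')
print(sorted(edges(do_a)))       # the edge into A is gone
```

The intervened model keeps every other equation untouched, which is why only the edges into the intervened variable disappear.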
Then we had a very brief look into causal inference: what it means and why we want to have it. The main task we want to solve is usually this query, which asks: what is the outcome of a particular variable, given that we intervene on some other variable?
It is important that we use an intervention here, because it means we decouple the variable from any other influence, from confounders or from variable selection. So we don't just want to observe the system under conditioning; we really want to know what it means to change a variable.
What are the total effects of going there and artificially altering the system? If we have this, then under some linearity assumptions and so on we can, for example, define average treatment effects, which compare setting X to one with setting X to zero, in the Boolean domain for example, and then observe how much Y changes.
There are a lot of assumptions here. If we are in a continuous domain, on the real numbers or whatever, we haven't talked about what happens if we leave X at zero here but change it to one there, and how that extends beyond those two cases. So you maybe have to assume that it's a linear continuation, that there are no nonlinearities in the system and so on, to get meaningful semantics for this operator. But in most cases the question is something like "do I want to treat a patient or not", and there this expression is sufficient.
So maybe let's start with the most well-known and maybe also one of the simplest adjustments we can do.
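Before moving on, the average treatment effect described above can be written out as follows. This is the standard textbook form for a binary treatment; the transcript does not show the slide, so the exact notation used in the lecture is an assumption.

```latex
\mathrm{ATE}
  = \mathbb{E}\,[\,Y \mid do(X = 1)\,] - \mathbb{E}\,[\,Y \mid do(X = 0)\,]
% For a Boolean outcome Y this reduces to a difference of probabilities:
%   P(Y = 1 \mid do(X = 1)) - P(Y = 1 \mid do(X = 0))
```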
Basically, our idea was to get rid of this do operator, which we first introduced because it gives us all the nice properties, and then to estimate its effects from a purely observational setting. This is important because many times we don't have interventional data. We see some data set that someone has collected for us, but it might not contain the particular intervention that we want to analyze right now. So we want to estimate it from observational data, which means we need to transform this query with a set of rules to make that possible.
The simplest case is that we have a single confounder and that we observe it. Here it's Z, which influences X and Y.
If we were to simply estimate the influence of X on Y without caring for Z, we would get a biased estimate, because Z influences both, and that messes with our regression.
So we somehow need to account for it, and this is where the backdoor criterion comes in. I will simply show you the definition and we will apply it; later on we will have the exact rules for why it works and how we can derive it. The idea is that we want to find an adjustment set Z, which is just some set of variables in the graph, distinct from X and Y. The first condition of the backdoor criterion is that no node in Z is a descendant of X, so X can have no effect on any variable in Z. The second is that Z blocks every path between X and Y that contains an arrow into X. Then we have this sort of formula, which is always the same, and we will see some examples.
Back to this very simple graph: we can check the two rules or assumptions that we need for this to work. Z here is not a descendant of X, because it's a parent and there is no other path that leads into Z. If we know it's a parent, it explicitly can't be a descendant, because then we would have a cycle, Z to X and X to Z. So the first rule is fulfilled. For the second one, we have this backdoor path, so there is an arrow into X, but if we condition on Z, then Z blocks this path that goes X, Z, Y, and there are no other paths that we need to block.
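The two checks just described can be automated for small graphs. The following is a minimal sketch that enumerates paths by brute force; the helper names are hypothetical, the edge set encodes only the single-confounder graph from this example, and the approach is not meant for large graphs.

```python
def descendants(edges, node):
    """All nodes reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            if a == cur and b not in found:
                found.add(b)
                stack.append(b)
    return found


def simple_paths(edges, start, goal):
    """All simple paths from start to goal, ignoring edge direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def walk(path):
        if path[-1] == goal:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])


def is_blocked(edges, path, adj):
    """True if the conditioning set `adj` blocks this path (d-separation)."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in edges and (nxt, mid) in edges
        if collider:
            # A collider blocks unless it or one of its descendants is conditioned on.
            if mid not in adj and not (descendants(edges, mid) & adj):
                return True
        elif mid in adj:
            # A chain or fork node blocks when it is conditioned on.
            return True
    return False


def satisfies_backdoor(edges, x, y, adj):
    """Check both conditions of the backdoor criterion for the set `adj`."""
    adj = set(adj)
    if adj & descendants(edges, x):
        return False  # condition 1: no descendant of x in the set
    for path in simple_paths(edges, x, y):
        starts_with_arrow_into_x = (path[1], x) in edges
        if starts_with_arrow_into_x and not is_blocked(edges, path, adj):
            return False  # condition 2: an open backdoor path remains
    return True


EDGES = {("Z", "X"), ("Z", "Y"), ("X", "Y")}
print(satisfies_backdoor(EDGES, "X", "Y", {"Z"}))  # adjusting for Z
print(satisfies_backdoor(EDGES, "X", "Y", set()))  # adjusting for nothing
```

For the single-confounder graph, adjusting for Z passes both checks, while the empty set leaves the path through Z open.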
Then it's kind of easy: we apply the expression we have seen earlier. We condition Y on X and additionally on Z, and in the end we weight by the distribution of Z and marginalize it out. So that's basically all, very simple.
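Written out, the adjustment described in words above is the standard backdoor adjustment formula (shown here for a discrete adjustment set Z; for continuous Z the sum becomes an integral):

```latex
P(y \mid do(x)) \;=\; \sum_{z} P(y \mid x, z)\, P(z)
```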
I think we will do a brief exercise on that in the next lecture. With Boolean variables it's quite easy to see the difference if you run a simple regression once without adjusting for Z and once with the adjustment. You should try that.
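A small simulation along these lines might look as follows. It is only a sketch: the probabilities in `simulate` are made-up numbers chosen so that the true effect of X on Y is 0.3, and the function names are hypothetical.

```python
import random


def simulate(n, rng):
    """Sample (z, x, y) from an assumed Boolean SCM with confounder Z."""
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5
        x = rng.random() < (0.8 if z else 0.2)      # Z influences X
        y = rng.random() < 0.1 + 0.3 * x + 0.5 * z  # X and Z influence Y
        rows.append((z, x, y))
    return rows


def mean_y(rows, x, z=None):
    """Empirical P(Y=1 | X=x) or, if z is given, P(Y=1 | X=x, Z=z)."""
    selected = [r[2] for r in rows if r[1] == x and (z is None or r[0] == z)]
    return sum(selected) / len(selected)


def naive_effect(rows):
    """Difference of conditional means, ignoring the confounder."""
    return mean_y(rows, True) - mean_y(rows, False)


def adjusted_effect(rows):
    """Backdoor adjustment: average the per-stratum effect over P(z)."""
    total = 0.0
    for z in (False, True):
        p_z = sum(r[0] == z for r in rows) / len(rows)
        total += p_z * (mean_y(rows, True, z) - mean_y(rows, False, z))
    return total


rows = simulate(100_000, random.Random(0))
print(f"naive:    {naive_effect(rows):.3f}")
print(f"adjusted: {adjusted_effect(rows):.3f}")
```

With these made-up parameters the naive contrast comes out near 0.6, about twice the true effect, while the adjusted estimate should sit close to 0.3.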
Maybe a not very different example, just to show you how it extends beyond the simple case. We can have two of these variables, simply duplicating the confounder. Now it's the same game again. We check that none of the variables is a descendant of X. Both together, so we now have a set of two variables, block all the backdoor paths from X to Y, and then we simply adjust for both of these variables. One thing here is that this joint distribution sometimes can't be pulled apart if there are additional edges, but here the two are independent, so that makes the computation a bit easier. But again, nothing really special. So maybe have a look at this graph, where it becomes a bit more complicated.
Again we want the effect of X on Y, and we need to find an adjustment set among all the other variables such that no node is a descendant of X and the set blocks every path between X and Y that contains an arrow into X. Maybe I will leave it as an exercise. Maybe someone has an idea what would be a good adjustment set.
Yeah.
>> A B Z C is a descendants.
>> Sorry again.
>> A, B, Z and C. So all of the above.
>> Yeah.
>> Yeah. This would actually be a valid adjustment set. Just as a hint: there are adjustment sets, a lot of adjustment sets, and there are minimal adjustment sets. So some of the variables might not be needed.
The main trick here is that we need to condition on Z, because it's a confounder. But if we do that, it opens this collider here. So we need to break this additional path as well, the one involving B, A and C. So maybe try it out.
>> Would you say B and C, because the effects of A are already seen in B and Z?
>> Yeah. I don't want to spoil the answer, but the minimal one is a two-variable adjustment set, because two variables are sufficient to block all the paths.
>> Yes. For this there is this very cool tool, DAGitty, which also has an online editor where you can draw the graphs that we just discussed.
I plotted those out here, and it basically tells you the minimal adjustment sets that you have. You can also click on a particular variable and say "I want to adjust for it", so that you can check which additional variables you can adjust for without breaking the criterion.
So this is quite cool. If you're unsure, or want to do some exercises or prepare for the exam, I can highly recommend going there and just playing with some of the graphs or checking your results.
So if there's a backdoor adjustment, there's a front-door adjustment, obviously. The main problem is that, particularly in economics or in very complex systems, we usually don't observe or don't know all the confounders. It's very hard to account and adjust for every confounder there is, to collect enough data, and to think of everything that is going on in these complex systems. For this there's another way. In particular, if we just don't have the data, so Z is unobserved, maybe we know it's a confounder but we can't adjust for it because we have no data, then we can try to apply the front-door criterion. It basically means that we have an intermediate variable M on the path from X to Y, and we use this variable because the path X to M and the path M to Y are individually not confounded by the confounder. Of course we again need to make the assumption that M is not confounded, so we still need to be sure that our model is correct. But maybe some expert can tell us: "I'm sure this M is independent of all the other confounders."
If we can do that, then we take the individual paths, here and here, and fuse them together to get the total effect in the end.
The first thing to note is that we again want to get rid of this do operator.
To be able to do this, we need to be sure that there are no open backdoor paths from X to M. In this scenario there aren't, because Z confounds X and Y but we don't condition on Y.
So this collider doesn't become active, and this path is not open. So we can remove the do operator and simply condition on X.
For the second part, again, we want to estimate this M conditioned on do(Y). Again we need to check that there is no open backdoor path, but here there is a backdoor path: it goes M, X, Z, Y. We can block it by conditioning on X. Sorry for the confusion, it should definitely be the other way around, Y conditioned on do(M). Sorry for this, I think it will be correct again in the later slides.
But yeah, you're right, this should be Y and this should be M. Thanks for pointing this out.
>> Yeah, it's fine. Okay, at least on the last slide, where we put everything together. We take M conditioned on do(X), this first part. Then we want to estimate Y conditioned on do(M), and then we take the formulas we got for the individual parts and fuse them together. So this is the correct formula: for the first part we don't need to do anything.
We can just replace the do, and for the second part we need to condition on X and marginalize it out. And here, be careful: this x and this x are not the same. Here we have the x from the do operator, which is a particular value, or maybe the distribution we want to observe. And this other one is an intermediate variable, which we only need for the second part to do the adjustment. So these are not the same. Don't get confused by that.
But once we have this, we can simply plug it into our favorite library or compute it by hand, and it will give us the unbiased estimate of the effect of changing X on Y for this particular scenario.
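For reference, the fused expression is the standard front-door adjustment formula. The primed x' below is the intermediate summation variable the lecture warns about; it is distinct from the intervened value x.

```latex
P(y \mid do(x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```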
So again, here are the formal criteria, where Z now denotes the set of intermediate variables (the M from the example).
Z needs to intercept all the paths between X and Y, that is, all the directed paths in that direction.
There cannot be an open backdoor path from X to Z. At least for this criterion it's not allowed. We will see the general case later, but here only this front-door path is what we are looking for.
And again, as we saw in the second part of our derivation, there might be backdoor paths from Z to Y, if we were to do this do(Z) on Y. They all need to be blocked by X. So this is the second part, telling us there might be backdoor paths, but we can block them. And this is then the general formula. Of course it extends if we get multiple variables in this set. The M here, as a single variable, could be multiple ones, and then we get multiple sums, because we need to sum out every one of those variables.
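The front-door formula can be tried out on simulated data in which the confounder is deliberately thrown away. This is a sketch under made-up parameters (chosen so that the true effect of X on Y is 0.32); the names `simulate`, `prob` and `front_door` are hypothetical.

```python
import random


def simulate(n, rng):
    """Sample (x, m, y); the confounder u is generated but never recorded."""
    rows = []
    for _ in range(n):
        u = rng.random() < 0.5                      # hidden confounder of X and Y
        x = rng.random() < (0.8 if u else 0.2)
        m = rng.random() < (0.9 if x else 0.1)      # mediator depends on X only
        y = rng.random() < 0.1 + 0.4 * m + 0.4 * u  # Y depends on M and U
        rows.append((x, m, y))
    return rows


def prob(rows, event, given=lambda r: True):
    """Empirical P(event | given) over the recorded rows."""
    selected = [r for r in rows if given(r)]
    return sum(event(r) for r in selected) / len(selected)


def front_door(rows, x):
    """P(Y=1 | do(X=x)) = sum_m P(m | x) * sum_x' P(Y=1 | m, x') P(x')."""
    total = 0.0
    for m in (False, True):
        p_m_given_x = prob(rows, lambda r: r[1] == m, lambda r: r[0] == x)
        inner = 0.0
        for x2 in (False, True):
            p_x2 = prob(rows, lambda r: r[0] == x2)
            p_y = prob(rows, lambda r: r[2], lambda r: r[1] == m and r[0] == x2)
            inner += p_y * p_x2
        total += p_m_given_x * inner
    return total


rows = simulate(100_000, random.Random(0))
naive = prob(rows, lambda r: r[2], lambda r: r[0]) - prob(
    rows, lambda r: r[2], lambda r: not r[0]
)
adjusted = front_door(rows, True) - front_door(rows, False)
print(f"naive:      {naive:.3f}")
print(f"front-door: {adjusted:.3f}")
```

The inner loop over `x2` is the intermediate x' of the formula, kept separate from the intervened value `x` passed to `front_door`.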
Maybe you remember the last lecture, or the first lecture actually, where we asked: what can we now do with it? For example, the tobacco industry claimed that smoking might not cause cancer, that it's some genetic factors or whatever going on. Of course in reality it's way more complicated, and there might still be a debate about whether the causal model or graphical model that you're drawing is correct.
But for example, you could say: I'm not really sure. We can sequence all the genes by now, but maybe we are not sure we really understand how things work, so we treat them as unobserved, or we don't know the effects exactly. But maybe we can measure the deposit of tar in the lungs. Now we have an intermediate factor which we can measure, and I think we can all agree that the tar does not appear because of any genes or whatever. So now we are exactly in the setting of the front-door criterion, where we can estimate the effect of smoking on getting cancer via this intermediate variable. We are freed from the counterargument where we would need to exclude every possible gene or environmental factor that could otherwise confound smoking and getting cancer. So this is a really nice example. Of course you can then say that maybe there are other effects going on and so on, but at least we are getting rid of all these weird confounders that people might bring up otherwise. Then it really becomes a discussion among the experts whether this model holds, whether there are some other intermediate factors that we didn't observe, but I think you can probably see quite a strong effect from just computing this.
The criteria that we have seen before, front-door and backdoor, are sound but not complete. This basically means that if we find some adjustment set Z that satisfies the backdoor criterion, we know we are computing the right thing, so we know we get an unbiased estimate. Same thing for the front-door criterion, although I wrote backdoor again. So for backdoor and front-door, if we find a valid adjustment set, then we know we get an unbiased estimate. But there might be cases where front-door and backdoor don't give us an unbiased estimate, or at least the conditions for applying them don't hold, and we could still get some sort of estimand that helps us identify the true causal effect.
This is then done via the do-calculus, and we will make a brief excursion into that.
This is maybe one of the most important sets of rules in causality because, unlike the front-door and backdoor criteria, these rules are not only sound but also complete.
This basically means: if you find a way to adjust for some variables according to the rules I will show you in a moment, then you are guaranteed to get an unbiased estimate. And conversely, if you don't find one, and by that I don't mean that you personally are not able to find it, but that you can show there is no such estimand, then you also know that the query is not identifiable from observational data. That is quite an important insight, because it tells you that you need to make some further assumptions.
Here there's this one remark that the do-calculus is for non-parametric models. What this means is that we don't assume any functional form for our structural equations. As we have seen before, we work with the distributions of the variables, how we adjust for them, and the normal rules of probability, but we didn't assume that there is a linear dependence between variables or something like this. It's the same here: we don't make any assumptions on the functional form of the equations between variables. But sometimes, and I will show an example later, if we do make such assumptions, we might still be able to identify some effects which are not identifiable in the general case.
So this might be important and help us in some cases.
So maybe let's jump right into it. These are graphical criteria, but I think the best way to understand them is to apply them later and really look at what they mean on particular graphs. I always try to give some easy-to-understand interpretation, but at least for the second and third rule it's a bit more tricky to figure out when we are able to apply all of these rules and conditions.
What all of these rules have in common is that we have sets of variables X, Y, W and Z, and these can all be sets of nodes. Usually for X and Y we are interested in a single node, but that doesn't have to be the case. Additionally W, and sometimes also Z, can be empty sets.
That covers cases where, for example, we don't need to adjust for anything. For the first rule, we want an independence between Y and some set of variables Z, and I think this is quite an easy rule to understand. It tells us under which conditions, if we have such an expression, we can delete observations. That is: if the intervention do(X) makes Y and Z independent in the intervened graph, then we can simply remove Z from our equation, because it has no influence on the probability or distribution we're trying to estimate.
I think that is quite easy to check. Just a small hint: if you look these things up in the original paper, or in the books Causality or The Book of Why, you might find that they do not use the do operator directly. Instead they say "we fix X to a particular value", and they work with the graph where all edges into X are cut, which is exactly the do operator in that case. So don't get confused if you see something else. I think it's an X hat that is used in some of the papers.
This is where the notation for the second rule maybe makes a bit more sense: here we are cutting all arrows into X, and here this underscore means that we cut all edges that go out of Z. This rule is similar to the one before, but it's not that we can delete an observation. It tells you under which conditions we can get rid of a do operator, that is, where we can exchange a do(Z) for a simple Z.
If you look at the interpretation, this can be done if these two variables become independent in this particular graph. Cutting the edges into X, that is again from this do(X) view, and this expression here tells us that we can only apply the rule if there are no open backdoor paths from Z to Y, because otherwise it would make a difference whether we cut the parents or not. If it doesn't make a difference, then we can exchange it. We will see an example later, but it can really be quite hard to figure out, for a given graph, which rules you need to apply. The tricky part is also that sometimes you want to go in the other direction. Sometimes you want to take an intermediate step and introduce a do operator on some variable in order to be able to apply some of the other rules later on.
And now the third and final rule, at least there are only three of them, is this one. It is a somewhat stronger version of the second one: it tells us that not only can we remove the do operator and just condition on Z, but we can remove the do(Z) altogether. Basically, we ignore this variable.
For this we again need to show independence between Y and Z.
And for this part we do a bit of a weird thing to the graph. Again, we cut all the edges that go into X, because we still have this do operator here with do(X).
But then we also cut all the edges that go into those Z nodes that are not ancestors of any W node. The idea is that, if we delete this do operator, we need to make sure that there is no causal effect traveling through Z, so that removing it has no effect on the quantity we are trying to derive.
So there are basically two cases. Either Z is not affected by X, or, if it is affected by X, then at least there is no collider or anything else through which the effect that goes through Z reaches Y.
So Z cannot have an arrow into Y, directly or indirectly, and we don't want it to be a parent of any of the W nodes, because those are still in the conditioning set: conditioning on them could open colliders and thereby open a path that then has an effect on Y. It's really a bit tricky to check on the graph, but what it means is that if we intervene on the variable and there is no causal effect traveling from Z into Y, then we're fine to remove it. That is the intuition, but really stick to checking the graphical criterion.
So here again are all the rules. As I said, the first one helps us remove some observational quantities.
The second one helps us remove a do operator and switch to conditioning. And the last one helps us remove a do operator altogether from our estimand.
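For reference, the three rules in their usual textbook form are given below. The notation is the standard one and may differ slightly from the slide: an overline on a node set means that all edges into those nodes are removed, an underline means that all edges out of them are removed, and Z(W) denotes the Z nodes that are not ancestors of any W node in the graph with edges into X removed.

```latex
\begin{align*}
\textbf{Rule 1:}\quad
  & P(y \mid do(x), z, w) = P(y \mid do(x), w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}} \\[4pt]
\textbf{Rule 2:}\quad
  & P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\[4pt]
\textbf{Rule 3:}\quad
  & P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
\end{align*}
```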
So this is the, well, easy derivation of the front-door criterion.
I put it here because it has a nice mix of applications of the rules. We won't go through it step by step now, but maybe you can sit down and, as an exercise, check all the individual steps. I'm pretty sure I haven't messed anything up, because this exact derivation is in Causality and, I think, also in The Book of Why. Again, they don't use the do operator but fix the values of M and X and so on. What it also tells you is that it's not only the rules of do-calculus: of course all the rules of probability apply as well. In this first step, what we really do is factorize or decompose the distribution we have here according to the causal graph. So we say we want Y conditioned on X and M. M would block the chain X, M, Y, but because of this confounder we still include the X here, since we aren't sure whether it might have some influence on the variable. And then this other part is easy; that is what we had before.
It's the direct effect of X on Y.
No.
Did I mess it up?
Oh no. Right. This is M given X. So this is the first part here.
But this is really the decomposition of the distribution according to this graph. And then, as the very first do-calculus step, we get this weird, or I'd say unintuitive, application of Rule 2, where we introduce a do operator in order to proceed further. In the third step we at least go the other way around: here we remove the do(X). This is, I think, what we argued before in the initial derivation of the front-door criterion.
And then we really just continue this chain. I figure there is no general rule of thumb for how to apply all these transformations. Sometimes you have to introduce a do operator. Here we can then remove this do(X), and so on. Here again we add this intermediate variable. Sadly, it's a thing where you just have to build up some intuition, maybe get a bit lucky, and do a bit of a search to find the right sequence that in the end gives you the estimand that comes without any do operators.
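As the slide itself is not part of the transcript, here is one standard route through the derivation, following the textbook version for the graph X → M → Y with an unobserved confounder of X and Y. The order of steps on the slide may differ.

```latex
\begin{align*}
P(y \mid do(x))
 &= \sum_{m} P(y \mid do(x), m)\, P(m \mid do(x))
    && \text{probability calculus} \\
 &= \sum_{m} P(y \mid do(x), do(m))\, P(m \mid do(x))
    && \text{Rule 2: introduce } do(m) \\
 &= \sum_{m} P(y \mid do(x), do(m))\, P(m \mid x)
    && \text{Rule 2: } do(x) \to x \\
 &= \sum_{m} P(y \mid do(m))\, P(m \mid x)
    && \text{Rule 3: delete } do(x) \\
 &= \sum_{m} P(m \mid x) \sum_{x'} P(y \mid do(m), x')\, P(x' \mid do(m))
    && \text{probability calculus} \\
 &= \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x' \mid do(m))
    && \text{Rule 2: } do(m) \to m \\
 &= \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
    && \text{Rule 3: delete } do(m)
\end{align*}
```

Each rule application has to be justified by checking the corresponding independence in the appropriately cut graph, which is the part worth doing by hand.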
Basically, you somehow need to know what you are looking for, have some intuition for what the formula needs to look like in the end, and then try to find a series of rule applications to arrive there. But don't worry, we won't ask this in the exam. It's just to show you what can be done. In The Book of Why and so on there are also other applications, which are sometimes easier. Instead of applying do(X) directly to get the effect on Y, you could also have cases where it might be unethical to intervene directly on X, but you still want to get the effect of X on Y. So there is also some sort of conditional adjustment, or the use of proxy variables that are derived from X and so on, which you can then use to get some information on X and from that estimate the effect of X on Y. This might or might not still include some do operation, and it can all be derived with the help of do-calculus.
As an exercise, which we won't discuss here but maybe at the start of the next lecture, you can try to derive the backdoor criterion. The hints I can give you: it's much easier, I would say. I think it's only three steps that you need to apply. The first step again is not a do-calculus rule but decomposing this expression according to the graph. Then it's Rules 1 and 2, which might be a bit easier to apply, so this weird Rule 3 doesn't play a role here. Maybe you can give it a try and have a look at it. Of course you know what the result should be, so you can also try to work your way backwards from the result to this formula. I think both ways are fine.
As a last part: we have now seen a lot of graphs which are identifiable.
But there are also a lot of cases where graphs are not identifiable, and this is really taken from the book Causality. Some cases that are not identifiable are, for example, those where on a path from X to Y we have some confounding directly between X and any of its children.
There's sadly no way of adjusting for that. We have no intermediate factor, and we can't condition on the confounder. That's a scenario we can never recover from. If you see this, you don't need to search any further; there is no way we are going to adjust for that. In a real scenario, what you will probably do is think: is there any way I can get data where I can just control X, to break this confounding arrow that goes into X? Or can I collect some data on some mediator that is on the path, some variable in between? Because otherwise that's a problem.
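One way to see why no adjustment can help is to write down two models that fit the simplest such graph (X → Y with a hidden U affecting both), produce exactly the same observational data, and still disagree about the intervention. The two toy models below are invented for illustration and are not from the lecture.

```python
import random


def model_direct(rng, do_x=None):
    """X causes Y; the hidden U only drives X."""
    u = rng.random() < 0.5
    x = u if do_x is None else do_x
    return x, x  # Y := X


def model_confounded(rng, do_x=None):
    """X has no effect on Y; the hidden U drives both X and Y."""
    u = rng.random() < 0.5
    x = u if do_x is None else do_x
    return x, u  # Y := U


def observational_table(model, n=20_000, seed=0):
    """Counts of observed (x, y) pairs without any intervention."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        xy = model(rng)
        counts[xy] = counts.get(xy, 0) + 1
    return counts


def p_y_do_x1(model, n=20_000, seed=0):
    """Estimate P(Y=1 | do(X=1)) by simulating the intervention."""
    rng = random.Random(seed)
    return sum(model(rng, do_x=True)[1] for _ in range(n)) / n


same_data = observational_table(model_direct) == observational_table(model_confounded)
print("same observational data:", same_data)
print("direct model:     P(Y=1 | do(X=1)) =", p_y_do_x1(model_direct))
print("confounded model: P(Y=1 | do(X=1)) =", p_y_do_x1(model_confounded))
```

Since X and Y always agree in both models, no function of the observed (X, Y) data can tell them apart, yet the interventional answers differ.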
>> If you had a node between X and Z, could you do front-door adjustment then? >> Yeah, exactly, that's the thing. If we have a mediator in between X and Z, then we do the front-door adjustment and we're fine again.
But with these graphs as they are drawn here, we have no chance.
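As a reminder of what the front-door adjustment computes, here is the standard formula. The names are assumptions for illustration: M denotes the mediator, X the treatment and Y the outcome, and the mapping to the variable names on the slide is not taken from the lecture. It holds when M intercepts all directed paths from X to Y, there is no open backdoor path from X to M, and all backdoor paths from M to Y are blocked by X.

```latex
P(y \mid do(x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid x', m)\, P(x')
```

The inner sum is a backdoor adjustment for the effect of M on Y (adjusting for X), and the outer sum chains it with the unconfounded effect of X on M.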
>> Sorry, back to that again. In practice, could that be framed as a feature engineering problem?
>> Oh yeah, definitely. This is also what you sometimes see in machine learning.
If you're doing all these kinds of adjustments, you need the right data. So it's not that we can't identify the effect per se; this is always about which variables we have measured.
Do we have enough data to get an unbiased estimate? If we don't, then we need to think of ways. Usually, if we know the graph in advance, we would go to our clinicians and tell them: I know the graph looks like this, could you think of a variable in between that is a mediator? Or could you record this confounding factor? That's also a possibility: if we know this factor and can just adjust for it, then we're also fine. So this is of course a modeling problem, or a general problem if we are just given the data. But we should be smart in advance and try to collect the data in a way that lets us estimate these effects, because from the graph structure we know in advance whether we can estimate them or not. It's not that we need to wait for the data and then be surprised that it's not sufficient. It really depends on the graph structure, not the data itself. But the graph in turn depends on the variables we record.
>> Maybe unrelated, but if you think about binary classification using deep learning, >> Yes. >> if you interpret it as trying to find a representation of the data that is linearly separable: could you, or has it been researched, find a representation or embedding of the data that gives you convenient causal models?
>> Yeah. It's always a bit tricky to talk about causal models in deep learning, to reason about whether my neural net, my diffusion model or whatever, does backdoor adjustment internally. Probably not. But what you can do, and what has been done before: there are works on causal representation learning, and also on neural causal models, which try to learn the individual arrows between X, Z and Y via neural networks. If you fit these functions via neural nets, you can still apply your adjustment on top of it. So maybe you extract individual features from an image, and you have your label Y that you want to predict or regress. Then you use your neural networks to extract some features, and on top of them you do some sort of adjustment to account for biases.
intervening or even >> Yeah, but you need to start with something. You need to make the assumption, for example, that your neural net truthfully extracts features, that it's not biased. It's what we saw before: for classifying horses, or indicating whether there is a horse in the image, that it really looks at the image content and not at some spurious feature that confounds the image and the label. Right after finishing all of this rather formal material on simple graphs (next time we will do causal discovery), we will start on deep learning and neural methods. Then we will show you criteria under which we can guarantee that our models learn the right thing and learn disentangled representations, and also how we can get them to make unbiased predictions.
Yeah.
So maybe before we take a short break, just this one case: the graph that was also on the slide before.
Here we have a very interesting case. Again we want to identify the effect of do(X) on Y. We can identify the effect of do(X) on Z1, and we can identify the effect of Z1 on Y, but even with the help of do calculus we cannot identify the whole thing. This is not always true, but here the intuition is that for X on Z1 we have an empty adjustment set, while for Z1 on Y we need to adjust for Z2.
And the problem now is that if we want to do this joint derivation, we can't find a formula that simultaneously adjusts and does not adjust for Z1.
There will always be some backdoor path that is open via this variable, because they are both confounded here.
In the joint case, because we don't make any assumptions on the relation between these variables, we don't know whether, when we look at the total effect of X on Y, Z1 might introduce some weird behavior, some nonlinearities or whatever, that we didn't see when we only looked at these individual paths separately. That's the intuition here. But last time we saw these linear additive noise models, and these help us here. With the help of do calculus we can identify this path, and once we have it, maybe as an ATE, we can identify the factor. What we are saying now is: this is not some arbitrary probability distribution, this is a linear equation, so if I increase X, Z1 will increase by this alpha. Once we have identified this, we fix alpha. We do the same for the second part, Z1 to Y, and get another coefficient, let's say beta. Then we can just chain alpha and beta together and know the full causal effect. So that becomes very easy if we can make some assumptions about the structural equations of the underlying process. It also doesn't have to be a linear equation: it can be any functional form whose parameters we are able to regress from our data. Then it's really just looking at all the paths from X into Y and chaining all the edges along the way together.
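Written out, the chaining argument looks as follows. This is a minimal sketch under the linearity assumption just described; the noise symbols are assumptions for illustration and collect everything else that enters each equation (including Z2 and any latent confounders), none of which is changed by do(X).

```latex
\begin{align*}
Z_1 &= \alpha X + U_{Z_1}, \qquad Y = \beta Z_1 + U_Y \\
\frac{\partial}{\partial x}\, \mathbb{E}[Z_1 \mid do(X = x)] &= \alpha, \qquad
\frac{\partial}{\partial z_1}\, \mathbb{E}[Y \mid do(Z_1 = z_1)] = \beta \\
\frac{\partial}{\partial x}\, \mathbb{E}[Y \mid do(X = x)] &= \alpha \beta
\end{align*}
```

With several directed paths from X to Y, the same reasoning gives the sum over paths of the product of the edge coefficients along each path.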
So then maybe a brief break till 11, and then we finish with the Pearl causal hierarchy.
Okay. So let's continue with this last, maybe also brief, part, which is the Pearl causal hierarchy. I think you have seen it before in the first lecture, where it was this seeing, intervening, and then reasoning about counterfactuals. With the tools that we have now, we can approach it in a more formal way and try to really see why we have these individual steps and why they are distinct. Let's start with the simplest insight, which we saw right before the break: oftentimes we want to estimate queries that involve the do operator.
Our goal most of the time is to transform them into queries, or estimands, that no longer contain any do expression. But for some of these we were just not able to transform them into something without the do. What this means is that there is a distinct class of queries which does not live on the observational level, so to say: we can't reduce them to observational-only expressions. This is what is formally meant by having a different rank, or being a different class: some of the interventional queries can't be answered with observations alone. The next thing: if we consider both of these classes, observations and interventions, what they usually give us is some probability about future outcomes. What would happen on average if I prescribe this medicine to a patient? What will happen if I increase the price of a particular product? So they always reason about average effects on some sort of population, and this population is what we define via the noise variables in our structural causal models. There we make assumptions about how our latent factors are distributed and what kind of noise goes into our system, and then for the average treatment effect and so on we average these effects out (not marginalize, but average).
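In symbols, the averaging over the population looks like this. It is a sketch for a binary treatment X, which is an assumption made here for concreteness; Y_{X=x}(U) denotes the value Y takes in the intervened model for a given noise value U.

```latex
\mathrm{ATE}
  \;=\; \mathbb{E}[Y \mid do(X = 1)] - \mathbb{E}[Y \mid do(X = 0)]
  \;=\; \mathbb{E}_{U \sim P(U)}\big[\, Y_{X=1}(U) - Y_{X=0}(U) \,\big]
```

The expectation runs over the prior noise distribution P(U), that is, over the whole population rather than over one individual.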
So that's the thing if we want to make predictions about general effects: what will happen if we change something in the future, what is the average probability of something happening. But some people argue that what makes humans, and also animals, intelligent could be the ability to retrospectively reason about individual outcomes or particular scenarios, that is, to think about counterfactuals. We take a particular scenario and reason about what would have happened if I had done something different, if I had intervened on that particular variable. This means we reason not about some average outcome but about that particular scenario, where all the noises are fixed. We might not know them, but we know, say, that the virus level in my body was at this exact quantity, and if I had taken the medicine, maybe I wouldn't have got sick. Again, there are formal proofs, but they are very intricate, and we will stay on the intuition level here. The difference is that interventional queries reason about average effects, and counterfactuals then form this third level on the ladder of causality, which does an additional step that reasons about the unobserved noise. I will show you some examples which hopefully make this clearer.
So again, this is what we've talked about before. Observational really is: how many people who go to parties catch the flu? This is conditioning. Maybe there is some confounding between going to a party, or being open to going there, and the immune system. Don't ask me. But this is classical conditioning. Then: what if we order people to join a party?
If your professor or boss says everyone needs to show up at the company party, then we have a do operation. It's no longer conditioning according to some random factors; we explicitly enforce that policy. And the counterfactual is: would John have caught the flu if he hadn't gone to the party? This first implies that he did go to the party, so we have some sort of evidence. And we don't reason about some group of people, as described before by our noise distribution, but about a particular individual for whom we want to know the counterfactual outcome.
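The three party questions can be written side by side. The variable names are assumptions for illustration, and the counterfactual subscript notation is one of several conventions in use.

```latex
\begin{align*}
\text{observational:}  \quad & P(\mathit{flu} = 1 \mid \mathit{party} = 1) \\
\text{interventional:} \quad & P(\mathit{flu} = 1 \mid do(\mathit{party} = 1)) \\
\text{counterfactual:} \quad & P(\mathit{flu}_{\mathit{party} = 0} = 1 \mid \mathit{party} = 1)
\end{align*}
```

In the last line the evidence (John went to the party) and the hypothetical (he did not go) refer to the same individual, which is why it cannot be written with a single do expression.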
Yeah.
So, we've said all of this. There are many similar questions: what if I had studied for the exam? What if I had not smoked in the past, or what if I had smoked in the past? It's really always about individuals. And because we had a lot of dry stuff on graphs, I think there's a really nice way to show this in images.
Let's say you do a series of photographs of dogs, of corgis. On the observational level you get the data set from your photographer, and he tells you: here are all the images. You just see what was observed, whichever dogs could be photographed, maybe in a park.
Now you're wondering about the effect of the do operator for these kinds of images. On the interventional level, things play out differently. You go back to your photographer and say: actually, for my machine learning model I want images of dogs which have an apple in front of them. What will happen then? The photographer will go back and record a new set of images; he will sample a new set of data points, if you want to phrase it like that. And every one of these new samples he will record with an apple in front of the corgi.
So this is the interventional level. We don't fix the noise. We do the intervention, but we may record different dogs, a different set of images.
For the counterfactual, we imagine what would have happened to our original data set if we had put an apple in front of the dog.
And this is now different, right?
Because here the pose is different.
It's also a different dog altogether, the background is different, and so on. And here it's not like that: the whole image is the same except for the intervention we did. Okay, you see the ground maybe changed slightly, which is because of our image editing, but maybe putting an apple there had some further effect; maybe it casts a shadow. So we will observe the intervention.
We will think about all the downstream effects that our intervention also had in this model.
But everything that was not affected by this intervention will stay the same.
And here, because of image editing and so on, we can at least imagine what this image would have looked like.
But the problem is that in reality there is no chance to get both images. We can only record either this image or that one. So it's impossible (it's in the word: counter-fact) to have both the observational and the counterfactual image recorded in our data set. Of course you can make a lot of assumptions on regularity or continuity of the space. We could say we might be able to record two images very close to each other, where the dog basically doesn't move. This then assumes that the noise variables don't change too much: this is the image recorded 5 seconds later, or whatever. But it's still not a perfect counterfactual. We still make some assumptions, and some things in the background change: maybe there's a light breeze and the grass looks slightly different. So we can't have them both.
And that's the tricky part, because we still want to reason about this quantity, and people often do. In court, or when you are making an argument and want to attribute responsibility, it's very important that you can argue: if we had followed my plan, we would not be stuck in traffic. So it's still important to be able to make these kinds of inferences, although we will never see or directly be able to compare both data points. >> What if you're collecting data from simulated environments?
>> Oh yeah, sure. That's something where, because we talked about the collapse of these causal levels, exactly: there are scenarios where it is possible, but it's usually some simulated, very highly controlled environment. If you tell practitioners that you can have observational and counterfactual data, they will argue very strongly against that. But for making predictions it's super important that we identify conditions under which we can make these inferences and compare them. And this is actually the reason why we do randomized controlled trials: they break certain links from the noise variables and let us control for the noise to replicate the data, or at least let us reason about the counterfactuals. Yeah. So, interesting.
>> Yeah. What is sometimes done, for example in medical studies (this is then also statistical matching), is some sort of propensity score matching, where you find, for example, patients that are very similar in their features, like their weight or height.
Then you randomize: you treat one of them and do not treat the other. So you try to have two patients that are as similar as possible and then see the divergence of effects from treating one versus not treating the other. Sure, there are a lot of ethical questions involved, but there are ways to get around this, and of course not all hope is lost there.
But having looked at all of these nice pictures and discussed what this means: there is an actual algorithm, a step-by-step procedure, to compute counterfactuals. The task is this: we are given a model, which is your structural causal model, and some distribution over the exogenous variables, which we note separately because we will modify it. We want to compute a counterfactual P(B_A | E). There are other notations as well, depending on the author. This one means that we want to find the probability of some outcome B given that we intervened on A, under some set of observations or evidence E that we know happened.
To trace it back to what we said before: the first thing is that we don't want to reason about some average random scenario. We want to take the evidence, the observations that we were given, to reason about the most likely state of the exogenous variables. So we take the evidence and say: if John went to the party and got sick, we might want to infer some property or likely value of his immune system.
So this first step is inferring all the noise in the actual world.
The next thing we do is apply our intervention to the graph. Before, this was the unintervened graph, and that step was purely statistical inference: we infer U given E.
Now we do the interventions; I think we've seen that before. Then we take our updated model, with the intervention and with the U updated by the evidence, and from that compute the probability of the outcome B. These kinds of pairs, the causal model or graph that we have together with the probability distribution or exact values of U, are sometimes called a causal world. If you have different U's (and how you infer these U's is sometimes up to debate), you can reason about different factual or counterfactual causal worlds.
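The three steps can be summarised compactly. This is a sketch for discrete exogenous variables; the symbol M for the model and the lowercase values a, b, e, u are notation assumed here, following the P(B_A | E) query above.

```latex
\begin{align*}
\text{1. Abduction:}  \quad & P(u) \;\longrightarrow\; P(u \mid e) \\
\text{2. Action:}     \quad & M \;\longrightarrow\; M_{a}
      \quad (\text{replace the equation for } A \text{ by } A = a) \\
\text{3. Prediction:} \quad & P(B_{a} = b \mid e)
      \;=\; \sum_{u} P(u \mid e)\; \mathbb{1}\!\left[\, B_{a}(u) = b \,\right]
\end{align*}
```

Here B_a(u) is the value of B obtained by running the modified model M_a with the exogenous variables set to u.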
So it's this three-step procedure; let's do a simple example. I modified the scenario slightly. This time Tom always messages exactly one of his friends; he just decides on one of them. Either he messages friend A, who then takes care of the plant, or he messages friend B, who takes care of the plant. But it's either A or B, not both together.
Let's assume we saw friend A watering the plant yesterday, and we now want to know what would have happened if we had prevented A from watering. Maybe we went there and got him into a discussion so that he forgot to water the plant, or we just took the watering can away from him.
So this is the scenario and the question we want to answer, and it is a counterfactual question: the event already happened, we know the plant is healthy, and we want to know what would have happened if things had played out differently.
The first step is the abduction step, where we take our evidence (we saw friend A watering the plant) and reason about what this means for our U. I drew it with a solid line, but let's say it is unobserved for some reason. Given A, we know that M was true, and we know that U was true: we know A is true and we trace back through the structural equations.
These just copy the value: A has the same value as M, and M has the same value as U. So we can infer that U was true in this setting.
Then we intervene. We want to know what would have happened if A hadn't watered the plant, so we do an intervention here and cut this edge. I'm not sure if I've mentioned this: sometimes interventions are also marked with some sort of hammer. So we remove this edge and put this hammer here to say that it's not the normal observational state of the graph.
So this is quite simple, just exchanging the equation again. Step three then, unfortunately, gives us the counterfactual outcome. Or fortunately, because it didn't happen.
We take our model with the intervention here, we know our U is true, and we run the whole system again. U is true, therefore M is true. This edge doesn't exist anymore, so A stays false because we intervened on it. B is false because it's the opposite of M: M was true, so B is false. So A is false, B is false, and the plant would not have been healthy.
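The plant example can be run end to end in a few lines. This is a minimal sketch: the function and variable names are hypothetical, and the equation for the plant (healthy if either friend waters it) is an assumption about the slide's model rather than something stated explicitly.

```python
def run_model(u, do_a=None):
    """Evaluate the structural equations for one value of the exogenous U.

    do_a=None keeps A's own equation; do_a=True/False replaces it by that
    constant, i.e. the intervention do(A = do_a).
    """
    m = u                              # M copies U: Tom messages friend A iff U is true
    a = m if do_a is None else do_a    # A waters iff messaged, unless we intervene
    b = not m                          # B waters iff Tom messaged B instead
    healthy = a or b                   # assumed: the plant is healthy if anyone waters it
    return {"U": u, "M": m, "A": a, "B": b, "healthy": healthy}


# Step 1, abduction: keep the values of U that reproduce the evidence A = True.
consistent_u = [u for u in (True, False) if run_model(u)["A"]]

# Steps 2 and 3, action and prediction: rerun with do(A = False) for those U.
counterfactual_worlds = [run_model(u, do_a=False) for u in consistent_u]

for world in counterfactual_worlds:
    print(world)
```

Only U = True is consistent with the evidence, and in the resulting counterfactual world both A and B are false, so the plant is not healthy.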
So this is quite a nice scenario, where we have deterministic equations.
This translates seamlessly to scenarios where we have some noise in the equations. The only difference is that we then don't know with 100% certainty whether U is true or false; instead we have some sort of probability distribution. It could say, for example, 20% true and 80% false. We still do all three steps, abduction, action, prediction, and find the outcome, or the outcome distribution, in that case.
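To see the probabilistic version at work, here is a sketch that extends the plant model with noise and carries out the three steps by exact enumeration. Everything numeric is an assumption made for illustration: the prior on U, the two "forgets to water" noise variables and their probabilities, and the choice of evidence (we only observe that the plant is healthy).

```python
from itertools import product

P_U = 0.5          # assumed prior that Tom messages friend A
P_FORGET_A = 0.2   # assumed chance that friend A forgets to water
P_FORGET_B = 0.5   # assumed chance that friend B forgets to water


def run_model(u, forget_a, forget_b, do_a=None):
    """Return whether the plant is healthy for one setting of the noise."""
    m = u
    a = (m and not forget_a) if do_a is None else do_a
    b = (not m) and not forget_b
    return a or b


def prior(u, forget_a, forget_b):
    """Prior probability of one joint noise setting (independent noises)."""
    p = P_U if u else 1 - P_U
    p *= P_FORGET_A if forget_a else 1 - P_FORGET_A
    p *= P_FORGET_B if forget_b else 1 - P_FORGET_B
    return p


noise_settings = list(product([True, False], repeat=3))

# Step 1, abduction: condition the noise on the evidence "plant is healthy".
weights = {n: prior(*n) for n in noise_settings if run_model(*n)}
total = sum(weights.values())
posterior = {n: w / total for n, w in weights.items()}
p_u_true = sum(p for n, p in posterior.items() if n[0])

# Steps 2 and 3: apply do(A = False) and predict under the posterior noise.
p_healthy_cf = sum(p for n, p in posterior.items() if run_model(*n, do_a=False))

# For comparison, the purely interventional query uses the prior noise.
p_healthy_do = sum(prior(*n) for n in noise_settings if run_model(*n, do_a=False))

print(p_u_true, p_healthy_cf, p_healthy_do)
```

The counterfactual and interventional answers differ here because the evidence changes the belief about the noise, which is exactly the extra abduction step.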
For this last part, a very nice intuition. In the previous part we inferred some U, then did some interventions on the graph, then reran it. There is a representation called twin networks, which duplicates the graph and connects the two copies via the exogenous variables. This is maybe a bit nicer because you can do all these steps within a single model, within a single graph, without doing the steps one after another. You apply your evidence on the left, which is the factual world that you have observed.
You infer the U's here, and from there you go to the right side, the counterfactual world where you applied the intervention, and just propagate your new values down to the quantity that you care about.
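One way to picture a twin network in code is to evaluate a factual copy and a counterfactual copy of the equations on the same exogenous sample, and then condition on the factual side. The sketch below does this by simple rejection sampling; the plant model with "forgetting" noise, all probabilities, and the names are assumptions made for illustration and not the lecture's own example.

```python
import random


def twin_sample(rng):
    """Draw shared noise once, then evaluate both copies of the model."""
    # Shared exogenous variables: these connect the two copies.
    u = rng.random() < 0.5          # assumed: Tom messages friend A
    forget_a = rng.random() < 0.2   # assumed: friend A forgets to water
    forget_b = rng.random() < 0.5   # assumed: friend B forgets to water

    # Left copy: factual world, original equations.
    m = u
    a = m and not forget_a
    b = (not m) and not forget_b
    healthy = a or b

    # Right copy: counterfactual world with do(A = False), same noise.
    m_cf = u
    a_cf = False
    b_cf = (not m_cf) and not forget_b
    healthy_cf = a_cf or b_cf

    return healthy, healthy_cf


rng = random.Random(0)
samples = [twin_sample(rng) for _ in range(50_000)]

# Apply the evidence on the factual side, read off the counterfactual side.
accepted = [h_cf for h, h_cf in samples if h]
estimate = sum(accepted) / len(accepted)
print(estimate)
```

Under these assumed numbers the estimate approaches 0.25 / 0.65 as the number of samples grows, the value obtained by carrying out abduction, action and prediction exactly.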
So this is sometimes a bit nicer. For the last minutes: we talked about the collapse of the interventional and observational levels, which tells us that sometimes, under the rules of do calculus, we can transform these quantities into each other. And because the question came up before: there is also a collapse between the interventional and counterfactual rungs. If you think about it, the interventional case only does steps two and three of the counterfactual algorithm: it applies the intervention and then reasons about the outcome, but it doesn't do the first step, where we update the probabilities of U. This is what makes it different from counterfactuals.
On the other hand, if we have evidence that does not influence our belief in U, or that is irrelevant for the outcome we care about, then these steps are the same. If the evidence doesn't matter for the exogenous variables that in the end influence our Y, then we don't need it, and it's the same as doing interventional inference. There is also something else we can do if we have interventions, here.
If we have enough interventions here so that we cut all the edges coming out of U, then of course the result on D here will also be independent of U.
Then there is also no difference between the interventional case and the counterfactual case, because again we cut all the influence of U on our target variables. So in these cases the interventional and the counterfactual case also collapse. And if you think about simulated environments, that could be exactly this: you say, maybe I do not intervene, but I either control U, or I can control the environment so closely and set variables with such precision that there is no difference between interventions and counterfactuals, because I observe everything and control the whole environment.
Yes, and that's it. Are there any questions?
If not, then thank you and see you next week.