Luca provides a compelling defense of the Jaynesian view, framing probability not as mere frequency but as the inevitable logic of science. It is a sophisticated reminder that all rational inference is ultimately a rigorous calculation of plausibility.
Deep Dive
Probability Theory is an Extension of Logic
So, we're going to talk about probability today, and we're going to present a very overlooked and underdeveloped approach that sees probability theory as an extension of logic.
Famously, one of the first scientists and mathematicians to develop this idea was Laplace, who in 1819 said, "Probability theory is nothing but common sense reduced to calculation."
And we will see today what that means exactly.
So, first we're going to look at the different definitions of probability.
We're going to talk about some classical logic, then some lattice theory because this is how we're going to derive our foundations.
We're going to derive the famous sum and product rules of probability that you all know. We're going to talk about how this leads to Bayesian inference with Bayes' rule, and then about some unique powers of Bayesian inference.
All right. So, nowadays there are two big definitions of probability that contend for the spot of being correct.
These are the frequentist interpretation, which sees probability as the limit of the frequency of an event happening, and the plausibility approach, which is the Bayesian approach, which sees probability simply as a quantification of how plausible an event or a proposition is given our state of knowledge, or our state of ignorance, depending on how you look at it.
So, for example, imagine that we're doing the very simple experiment of tossing a coin.
And imagine this is just a regular coin, it's a fair coin, you know, nothing weird is going on. Why do we say the probability is 50%? The frequentist would say it is because, if you keep flipping the coin, the ratio of the two outcomes will eventually approach one, meaning that the fraction of tosses showing either side approaches one half.
Whereas the Bayesian would say that the probability is one half because we don't have any reason to prefer either of the two sides, given our ignorance of the properties of the coin.
And therefore we must by consistency assign an equal probability weight to both and hence 50%.
Now there are some important problems with the frequentist definition. The most important one is that it cannot assign probabilities to single events. The frequentist framework is the orthodox framework, the one that you were taught in lectures with the t-tests and p-values and confidence intervals. So, for example, the statement "the probability that it is sunny tomorrow" doesn't really have a meaning in this interpretation. The probability that a certain person wins an election, or that a certain person wins a race, doesn't have a meaning because these events are unique.
Also, the validity of this notion of probability intuitively relies on the law of large numbers, which in turn depends on a previous definition of probability, which creates some circularity issues.
Also, this is probably the biggest hint that there's something off.
It relies on some kind of notion of randomness, which is very hard if not impossible to define. In fact, many textbooks say out loud that there is no really good definition for this. It simultaneously requires that the different repetitions be equal to each other in some way that is, well, equally hard to define.
Also, it is, you could say, epistemologically clunky, because it only ever deals with the probability of the data given the hypothesis. This is what you do in what's called maximum likelihood: you just see which values of the parameters of your model maximize the probability of the data given your model.
However, it doesn't give you a way to assign probabilities to a hypothesis.
That is because the hypothesis, or the truth, is a fixed thing. It's a constant, and therefore this methodology does not allow you to give it a probability.
Also, it has the methodological handicap that it ignores any prior information or prior knowledge, whether from a previous estimation or from prior results in a completely different experiment. And it often provides only point estimates, which makes it very hard if not impossible to assess the uncertainty of the result.
In an attempt to circumvent these issues, the methodology has made it so that scientists now commonly reason about what the world would be like in different imaginary universes. So, for example, if you wanted to estimate the mass of a planet, this methodology per se would not allow you to assign a probability to this value, because it's a constant. There is only one true value.
And so it forces you to imagine: what if there were different worlds in which different versions of this planet existed? And so you reason about this. I'm sure you are all familiar with what a p-value is. There's this famous critique by Sir Harold Jeffreys, which said that what the p-value implies is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.
So, just as a reminder, the p-value is the probability, under the null hypothesis, of getting a statistic that is equal to the one that you got from the data or more extreme, so further out in the tails. And while this intuitively seems like a good reason to reject the null hypothesis, it also means that the null hypothesis is being judged on data further out in the tail that it failed to produce, data that never occurred. So it's actually not entirely clear why this should be a good criterion to reject it.
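To make the definition of the p-value concrete, here is a minimal sketch for the coin example. The data (8 heads in 10 tosses), the one-sided tail, and the function name `binomial_tail_p_value` are illustrative assumptions, not something taken from the lecture.

```python
from math import comb

def binomial_tail_p_value(n: int, k: int, p0: float = 0.5) -> float:
    """One-sided p-value: probability of at least k heads in n tosses,
    computed under the null hypothesis that heads has probability p0."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

# Hypothetical data: 8 heads observed in 10 tosses, null hypothesis "the coin is fair".
p_value = binomial_tail_p_value(10, 8)
print(round(p_value, 4))  # 0.0547, i.e. 56/1024
```

Note that the sum runs over 8, 9 and 10 heads, although only 8 heads were observed. The outcomes 9 and 10 are exactly the "observable results that have not occurred" in Jeffreys' critique.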
And it's equally problematic to say how we can interpret the confidence interval at all, because a confidence interval doesn't tell us the probability of the null being true or of the alternative hypothesis being true. It only tells us that if we repeated this experiment infinitely many times, then, say, 95% of the time, or whatever level we choose, the interval that we compute would contain the true value.
So, again, we have this inversion problem: we have the probability of the data given a hypothesis, but not the probability of the hypothesis given the data.
Now, we're going to talk about the alternative, which was famously advocated in the last century by Edwin Thompson Jaynes, legendary physicist and statistician.
And to do that, we're going to do, first of all, a thought experiment, and I really want you to think along with this. Suppose that you are a policeman.
You patrol the streets at night, and suddenly you hear an alarm. There's a jewelry store, and a man with a mask comes out with a bag on his shoulder. You arrest him, and it turns out the bag is full of jewels. Now, I want you to tell me, do you think this is a crime? Did you witness a crime?
What would you say? Your gut instinct.
>> Of course, it's a crime.
>> Is it a crime or is this man honest?
Is anybody talking or am I not hearing you?
Okay.
>> That's all right.
>> Don't worry.
>> Okay, I think we can all agree this looks like a crime, right? And now my second question, and the more important question actually, is: did you come to this conclusion through deduction or through induction?
It is induction, and you can easily see that because you can come up with reasons why this might have happened that are completely innocent. I mean, if you want a particularly ridiculous one, imagine that the man in the mask is actually the store owner, and there was a fire alarm, and he just wanted to save his jewels, and he somehow forgot his keys, so he saw no other way than to break in, and he also had the mask on hand to keep himself warm. You know, you could imagine some ridiculous scenario. This is possible. So the very fact that an alternative is possible tells you that this cannot be deduction, because otherwise the result would have been uniquely determined.
And unfortunately, induction gets a bit of a bad reputation, mainly because of some philosophers who were not scientists, but that's another point. But we're going to show that what science does for the most part is actually just induction, and deduction is mostly the realm of mathematics.
And to understand the difference further, I'm sure you're all aware of the kind of reasoning that Sherlock Holmes does.
Now, if I remember correctly, he usually says this is deduction, but all of these little tricks and connections that he makes are actually inductive.
And so you can tell that even though induction has this reputation of being very weak reasoning, we can be very, very confident about the conclusions that we draw from it.
Now, here are the most fundamental rules of probability.
In general, in the Bayesian methodology, probability is seen as a quantification of the plausibility of a proposition, of a statement, where a value of zero corresponds to impossibility and a value of one corresponds to certainty.
Now, if your mind immediately jumps to the case of sampling a rational number from the real numbers, don't worry, we'll talk about that later, because this seems to contradict this.
But let's start with the basics.
Now, we have the product rule, which allows us to find the probability of the compound statement A and B given some context C. And then we have the sum rule, which allows us to combine the probabilities of two complements.
And you notice that we always have this "given C" at the end, on the right side of the expression. This means that in the Bayesian methodology we don't think of probabilities as somehow existing in a vacuum, or as a property of the experiment or the object that you're studying. They always depend on what knowledge you have at this point, and this is why we always write this at the end.
It is usually omitted when it's clear that it's always going to be the same letter in your calculations, but it is always there implicitly.
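The slide with the two rules is not part of the transcript. In their standard textbook form, using the hat for "and" and an overline for "not" (an assumption about the notation on the slide), they read:

```latex
\begin{align}
P(A \wedge B \mid C) &= P(A \mid B \wedge C)\, P(B \mid C) && \text{(product rule)} \\
P(A \mid C) + P(\overline{A} \mid C) &= 1 && \text{(sum rule)}
\end{align}
```

Every term carries the context C to the right of the bar, which is the "given C" discussed above.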
Now we're going to quickly talk about some classical logic. I'm sure you're all familiar with this, so we're going to be quick here.
We're going to see that probability is an extension of logic in which classical logic is just a special case.
And the most straightforward way to approach classical logic is with truth tables,
and with the operations and, or, and not.
These are an adequate set, meaning that they allow us to express any truth function of any inputs.
There are actually smaller adequate sets, the most famous one being NAND, which is used extensively in computer chip design, and there's also NOR.
But and, or, and not are just very convenient, and so we will use them.
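As a small check of the claim that NAND alone is an adequate set, the sketch below rebuilds not, and, and or from NAND and compares them with Python's built-in operators over the full truth table. The helper names `not_`, `and_` and `or_` are made up for this illustration.

```python
from itertools import product

def nand(a: bool, b: bool) -> bool:
    """NAND: false only when both inputs are true."""
    return not (a and b)

def not_(a: bool) -> bool:
    return nand(a, a)

def and_(a: bool, b: bool) -> bool:
    return nand(nand(a, b), nand(a, b))

def or_(a: bool, b: bool) -> bool:
    return nand(nand(a, a), nand(b, b))

# Walk through every row of the truth table and compare with the built-in operators.
for a, b in product([False, True], repeat=2):
    assert not_(a) == (not a)
    assert and_(a, b) == (a and b)
    assert or_(a, b) == (a or b)
print("not, and, or are all expressible with NAND alone")
```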
Now I also want to draw your attention to this notation. The and is written as a kind of hat, ∧.
And the or is a kind of valley, ∨, while the not is just a line over the proposition.
Here are some famous truth tables and what they might look like. These are actually kind of important for the rest of the presentation, but I'm not going to go into them. If you ever wish to have some explanation for why something happens, I can come back here and show you.
So how can we check if some complex compound statement can be true?
We can do what is called reduction to the disjunctive normal form, meaning we split the statement up into its smallest forms, so to speak, by applying the meet operation, meaning the and operation.
And then we can combine those again with an increasing number of or operations to come back to the original statement. The pieces that cannot be split further with meet operations are the atoms, and if any of those atoms can be true, then the whole statement can also be true.
Now, this doesn't tell us whether it is true or not, but this is how we find out if it could possibly be true. Because if all the atoms have contradictions in them, then the compound statement obviously cannot be true either.
For example, the compound statement A or B can be divided into A individually and B individually, and then further into A and not B, A and B, and not A and B. The fourth combination, not A and not B, is the one under which A or B is false.
And these are then the atoms, because it doesn't make sense to split them further with meet operations.
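The check described above can be sketched by brute force: enumerate every atom (every full assignment of true and false to A and B) and keep those under which the statement holds. The function name `satisfying_atoms` and the lambda encoding of statements are assumptions made for this example.

```python
from itertools import product

def satisfying_atoms(statement, names):
    """Return the atoms (full truth assignments) under which `statement` is true."""
    atoms = []
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if statement(**assignment):
            atoms.append(assignment)
    return atoms

a_or_b = lambda A, B: A or B
contradiction = lambda A, B: A and not A

atoms = satisfying_atoms(a_or_b, ["A", "B"])
print(len(atoms))                                       # 3
print({"A": False, "B": False} in atoms)                # False
print(satisfying_atoms(contradiction, ["A", "B"]))      # []
```

A or B has three atoms, so it can be true. The contradiction has none, so it cannot.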
Now, from this operation we can actually obtain an order by implication. We know that these are the most fundamental building blocks here at the bottom, and the statements above are just their combinations with the join operation. And this means that this element here implies this one above.
And here, in this representation, not A and not B is actually left out, because our top element here is A or B.
And if we assume that this is true, then at least one of A and B must be true.
And so the statement not A and not B is necessarily false. So not A and not B is the complement of the top statement.
Therefore, it doesn't feature in this.
And, exactly, we assume that this top statement is true.
And we write this with this arrow, this skinny T symbol.
And from this we can essentially obtain a lattice that has elements that are ordered by implication.
They also have this optional bottom element, which is just the absurdity, the contradiction. And to understand what's going on here, it's very easy to see an isomorphism to a lattice of sets that are ordered by set inclusion, where you basically relabel each element by the set of statements that can be true. You can see that E F contains E and F, of course, and this is more or less the fact that E implies E or F.
And it is with these lattice structures that we're going to find the fundamental rules of probability. If any of you are familiar with the work of Cox, he derived the rules of probability by also assuming continuity and differentiability of the probability, and the beauty of this approach is that it does not assume these things.
It's much simpler.
So, what is a lattice? I'm assuming most of you have probably seen this before, so I'm going to be rather quick here as well.
A lattice is a kind of partially ordered set, or poset, as they call it.
What is a poset? A poset is a set with a binary ordering relation that has the properties of reflexivity, antisymmetry, and transitivity.
And we can also say that if A is not equal to B and B is above A, then B is strictly above A.
And we can also use this notation to show that B is the immediate superior of A.
To understand what a lattice is, you need to understand two more concepts.
Imagine a subset X of a poset P.
We can talk about an element A in P that contains every element of X, meaning it is superior to all of them in the hierarchy. Then A is called an upper bound of the subset X. The least upper bound is then the thing that we would most intuitively associate with an upper bound: it is the element in P which is an upper bound of X and is contained in every other upper bound of the subset. So it is, as the name suggests, the lowest of all the upper bounds.
And dually, we can also define the greatest lower bound, which is simply the lower bound that contains all other lower bounds. We just invert the containment relation. And a lattice is simply a poset for which the least upper bound and the greatest lower bound exist for all pairs of elements in the set.
The least upper bound of A and B is commonly denoted, again, with this valley notation, and is called the join, whereas the greatest lower bound is this hat, the meet. These symbols mirror those used in Boolean algebra, because when you treat propositions as ordered by implication, the logical or and the logical and act exactly as the join and the meet.
So there's this nice correspondence that also makes the notation much nicer for us to use.
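These definitions can be checked on the set picture of the lattice. The sketch below assumes three hypothetical atoms E, F, G, represents each proposition by the set of atoms it contains, and finds the least upper bound and greatest lower bound by brute-force search, to compare them with set union and intersection.

```python
from itertools import combinations

ATOMS = ("E", "F", "G")  # hypothetical atom labels, chosen only for illustration

def powerset(items):
    """All subsets of `items`, as frozensets: the elements of the lattice."""
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

LATTICE = powerset(ATOMS)

def below(x, y):
    """x implies y, which in the set picture means x is contained in y."""
    return x <= y

def join(x, y):
    return x | y  # logical or

def meet(x, y):
    return x & y  # logical and

def least_upper_bound(x, y):
    uppers = [u for u in LATTICE if below(x, u) and below(y, u)]
    return next(u for u in uppers if all(below(u, v) for v in uppers))

def greatest_lower_bound(x, y):
    lowers = [l for l in LATTICE if below(l, x) and below(l, y)]
    return next(l for l in lowers if all(below(v, l) for v in lowers))

# The search over bounds agrees with union and intersection for every pair.
for x in LATTICE:
    for y in LATTICE:
        assert least_upper_bound(x, y) == join(x, y)
        assert greatest_lower_bound(x, y) == meet(x, y)
print("join = least upper bound, meet = greatest lower bound")
```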
Then we also speak of a distributive lattice if we have a distributivity property of the and over the or.
And there's an even more restrictive class of lattices, called Boolean lattices, in which each element has a complement. A complement of an element is simply another element such that their join is the top element and their meet is the bottom element. That's simply what that means.
However, in this derivation, we're not going to need Boolean lattices.
Distributive lattices are completely sufficient.
This has some practical implications, but that is not important right now. So, now we're going to define the objective of this derivation. We basically want to generalize the zeta function. The zeta function, in its classical form, is just an indicator that is one if an element is below or equal to another, and zero otherwise.
In our context, we're looking to generalize the inverse, so the one that tells us if a proposition is above. However, we want it to be not just a binary indicator, but to also have some continuity, meaning we have some degree of implication. This is what we're looking for. So we're looking for something like this function Z, such that it is one if the element X is above T.
It is zero if the two meet at the bottom of the lattice, meaning they don't imply each other at all.
And we have some value between zero and one otherwise.
And this generalization of the inverse zeta function is then what we're going to call probability. It respects the ordering of the zeta function, but allows for incomplete information.
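As a rough sketch of the target, here is the classical zeta function on sets next to one possible graded version. The ratio form of `degree_of_implication` and the equal atom weights are assumptions chosen only to show the three required behaviours (one, zero, in between). The derivation itself has not fixed any formula at this point.

```python
def zeta(x: frozenset, y: frozenset) -> int:
    """Classical zeta function: 1 if x is below or equal to y, else 0."""
    return 1 if x <= y else 0

def degree_of_implication(x: frozenset, t: frozenset, weight: dict) -> float:
    """Hypothetical graded version: the share of the weight of context t that lies inside x."""
    total = sum(weight[a] for a in t)
    return sum(weight[a] for a in x & t) / total

weight = {"a": 1, "b": 1, "c": 1}  # hypothetical equal weights on three atoms
t = frozenset("ab")

print(zeta(frozenset("a"), t))                              # 1
print(degree_of_implication(frozenset("abc"), t, weight))   # 1.0, x is above t
print(degree_of_implication(frozenset("c"), t, weight))     # 0.0, x and t meet at the bottom
print(degree_of_implication(frozenset("a"), t, weight))     # 0.5, partial implication
```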
And we're going to derive the rules of probability by looking at some symmetries in these lattices. The first one is not really a symmetry, it's more of a convention, and it's simply that elements that are higher up in the hierarchy are evaluated by higher real numbers.
That's all it means. And in general, for the rest of the presentation, capital letters will represent lattice elements, and small letters will represent the real numbers that correspond to their valuations.
Now, the first symmetry is that combination preserves order from the right and from the left. So, if we have two elements, one strictly above the other, then joining each of them with a new element that is disjoint from both gives compound statements that preserve the order, no matter from which side you add the new element. And by extension, this must also hold for the operation that quantifies the join of disjoint elements. So, if you have this proposition D, which is A or C, then the valuation of D must be some combination of the valuation of A and the valuation of C, through this kind of plus operator, which we will see turns out to be the sum. And so here we have the same symmetry reflected in it.
And this basically means that the ordering has to survive a combination with any arbitrary context. Otherwise, it's basically useless for any kind of reasoning.
To put it in set-theoretical language, if X is strictly contained in Y, then if you add another set, disjoint from both, to both sides, this ordering relation, this containment, does not change. And to put it into a more practical example, we know that all dogs are mammals, but not all mammals are dogs.
Therefore, being a dog implies being a mammal, which we could write like this.
Now, if combination didn't preserve order, we could encounter situations such that dog or fish is somehow implied by mammal or fish, which is of course absurd, because these additional possibilities for what the animal really is shouldn't change the order of dog and mammal.
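The dog and mammal example can be replayed with sets, reading "implies" as "is contained in". The animal names are hypothetical members picked only to populate the sets.

```python
# Hypothetical members, chosen only to populate the three propositions.
dog = frozenset({"beagle", "poodle"})
mammal = dog | {"cat", "whale"}
fish = frozenset({"trout", "shark"})  # disjoint from both dog and mammal

# Being a dog implies being a mammal, but not the other way round.
assert dog < mammal

# Joining both sides with the disjoint proposition "fish" keeps the strict order.
assert (dog | fish) < (mammal | fish)

# The absurd reversed implication does not hold.
assert not ((mammal | fish) <= (dog | fish))
print("order survives the combination with fish")
```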
And then there's another symmetry, which is simply that combination is associative.
And this follows simply from the fact that the join is defined as the least upper bound, which just takes in a set of elements, and so it doesn't care about the grouping at all.
It doesn't matter in which order we do it.
And so we're going to use the most important theorem of this whole derivation, which is the associativity theorem.
We have two disjoint elements of a distributive lattice with their respective valuations.
Now, the two symmetries, translated into axioms, imply that there is some order-preserving mapping theta under which this combination operator is addition.
And the beauty is that, since this is the first operator that we are looking at, we can immediately throw away the original valuations that we had for all of these propositions, use theta of the original valuation instead, and just use addition with those.
So we can use the sum for the combination of disjoint elements without any loss of generality.
Now, the proof of this is too long for me to present here. It's also kind of annoying, but you can find it in this paper by Kevin Knuth and John Skilling. It was published in 2012.
The paper is super interesting, and it rests on about a decade of work by those two.
The only thing that I would say, if you ever read it, is to be very careful, because they are a bit rough sometimes, but it's an absolutely gorgeous paper.
And now we can already find one of the familiar rules of probability, or at least get a hint of what it's going to be. We can look at the combination of arbitrary arguments, because before we only saw that we could use addition for disjoint elements. But imagine that we have X and Y, which have some elements in common, so to speak.
If we want their valuations, we could of course just sum their components together. But if you want to express the valuation of X or Y as a function of the valuation of X and the valuation of Y, then we find that we need to subtract the valuation of the element that they have in common. And this leads to the general inclusion-exclusion sum rule that is usually taught with overlapping Venn diagrams and stuff like this. I'm sure you've all seen those.
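A minimal numeric check of the inclusion-exclusion rule, assuming an additive valuation `m` built from hypothetical integer weights on three atoms:

```python
weight = {"a": 2, "b": 3, "c": 5}  # hypothetical atom weights

def m(x: frozenset) -> int:
    """Valuation of an element: the sum of the weights of its disjoint atoms."""
    return sum(weight[atom] for atom in x)

X = frozenset("ab")
Y = frozenset("bc")  # overlaps X in the atom "b"

lhs = m(X | Y)                # valuation of the join X or Y
rhs = m(X) + m(Y) - m(X & Y)  # subtract the part counted twice
print(lhs, rhs)  # 10 10
```

Adding m(X) and m(Y) directly would give 13, because the shared atom "b" is counted twice. Subtracting m(X and Y) = 3 restores the correct value.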
And interestingly, we can also immediately find out that the join and meet operations are commutative, because the real numbers are of course commutative under addition.
So, since the valuations have this property, the algebra of the individual elements must have it too.
Then we're going to introduce a new symmetry. Imagine we have two independent systems, S1 and S2, that we want to combine. We want to consider them together. These two systems would then be two different lattices, and together they form a new space defined by a kind of cross product of their elements.
And the direct product operator is right-distributive over the join operation.
Therefore, the operator that we choose for it, this crossed-out o, must also satisfy this property. And why do we think this is reasonable?
Imagine that we want to analyze coin tosses and die tosses at the same time. Let the propositions that refer to the outcome of the coin toss just be H and T, and then we have D K, with K being one of the six outcomes of the die.
And we treat them as independent. If the equality from before did not hold as shown here, then it would mean that considering the outcomes of the coin toss would somehow be different if you paired them with the outcomes of the die, which would mean that there was some kind of entanglement, some kind of relationship, that directly contradicts our assumption of independence.
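The right-distributivity of the direct product over the join can be checked directly in the set picture, where the direct product is the Cartesian product of the atoms. The proposition `even` and the helper `direct_product` are assumptions introduced for this sketch.

```python
from itertools import product

def direct_product(xs: frozenset, ys: frozenset) -> frozenset:
    """Direct product of two propositions from independent systems, as a set of pairs."""
    return frozenset(product(xs, ys))

H = frozenset({"H"})
T = frozenset({"T"})
even = frozenset({2, 4, 6})  # hypothetical die proposition: "the die shows an even number"

lhs = direct_product(H | T, even)                       # (H or T) x even
rhs = direct_product(H, even) | direct_product(T, even) # (H x even) or (T x even)

assert lhs == rhs
print(len(lhs))  # 6
```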
And this brings us to a very important point. This is more of a parenthesis, but it's very important.
Independence is classically seen, in the orthodox methodology, as somehow a property of the experiment. You have independent trials, independent draws from a distribution.
However, the better way to reason about it is that independence is epistemological, meaning it's just a property of our knowledge.
Formally, independence simply means that learning about one system doesn't tell us anything about the other.
Now, to illustrate why, in my opinion, it's better to say this is epistemological, imagine the classical Bernoulli urn, right? You have red balls and white balls.
You draw one, you record the result, you put it back in, you shake the urn, and whatnot.
This kind of experiment is typically described with IID random variables, right? You always have the same distribution of the same balls, and all the draws are independent, meaning seeing one doesn't tell you anything about the other. However, physically speaking, they very much depend on each other.
I mean, the next result definitely depends on which color was previously drawn, on where this ball was placed back into the urn, on how the urn was shaken, and of course on all the friction coefficients between the walls of the urn and the balls, and between the balls themselves, etc. But unfortunately, all of these influences are so incredibly complicated to track that our attempts to extract any kind of information from them would be almost futile.
And so the draws are definitely not physically independent. However, because we don't know how to do any better, we ignore all of these non-linear dependencies. And in fact, confusing physical dependence with logical dependence leads to all sorts of paradoxes in practice. Famously, in quantum mechanics you have some paradoxes because of this, but in all sorts of other fields as well.
So this distinction is very important to keep in mind.
Then we have a new symmetry.
We see that the direct product is associative,
and therefore so is the operator. Again, if this were not true, then the nature of the elements of a composite system would somehow depend on the order in which the independent contributions were added.
Meaning, if we combined the coin and the die first and then added an analysis of the weather, just as a random example, it would yield a different joint space, a different probability space, than if we combined the die and the weather first and then added the coin, which makes no sense, of course.
And to apply our associativity theorem, we also need symmetry one. We need to show it for this new operator.
The direct product also preserves order.
Meaning that if we have some proposition Y that can be decomposed into X or D, and we apply the direct product with U, we obtain this. By assumption, neither X times U nor D times U is the bottom element here. This is definitely possible.
Then it must mean that Y times U has a higher valuation than X times U, because in order to obtain it, we must have combined X times U with another valuation. And so we can write it like this,
to see that X times U is smaller than, well, this combined statement, which is smaller than the top one.
And so we have symmetry one again, also for this product.
Now, interestingly, we can apply the associativity theorem again, which is very convenient, of course.
So again, we have two propositions that are elements of the lattice, with their valuations, and we know that they must satisfy this. However, while before, with the combination, we could simply throw away the old valuations and use the new ones given to us by the mapping, here we cannot do this, because the plus operation was already chosen for the combination of disjoint elements. So we must find something else.
And to do that, we must find the form of theta tilde.
We set up a functional equation, because theta tilde needs to be consistent with symmetry three and with the fact that the combination operation is addition. So we can set up this kind of equation.
And if we define phi as the inverse mapping of theta tilde, we can write this equation, where the plus sign here is used between the different values of theta tilde, and here the different valuations are just summed together, because those are disjoint elements being combined by the first operation that we saw.
And once again, there is a nice little proof for this, and we find that phi is actually c times e to the ax, which in turn means that we can write this beautiful equation, where 1 over a simplifies, and we are also free to choose c equal to 1, since this is just a constant, with no loss of generality.
And this in turn tells us that this product operator is just multiplication, which is of course the product rule that we all know from probability: the joint probability of two independent random variables is just the product of the individual probabilities. This is essentially what we have derived here.
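The result can be sanity-checked numerically. The sketch below assumes a fair coin and a fair die as hypothetical weights, values a joint proposition by multiplying the weights of its paired atoms, and confirms that this agrees with the product of the separate valuations while staying additive over disjoint elements. It illustrates the conclusion only, not the functional-equation proof.

```python
from fractions import Fraction
from itertools import product

coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}  # hypothetical fair coin
die = {k: Fraction(1, 6) for k in range(1, 7)}     # hypothetical fair die

def m(x, weight):
    """Valuation of a proposition within a single system: sum over its atoms."""
    return sum(weight[atom] for atom in x)

def m_joint(pairs):
    """Valuation of a joint proposition: sum over (coin, die) atom pairs."""
    return sum(coin[c] * die[d] for c, d in pairs)

x, y = {"H"}, {"T"}  # two disjoint coin propositions
u = {2, 4, 6}        # die proposition: an even number

# Product rule: the joint valuation is the product of the individual valuations.
assert m_joint(set(product(x, u))) == m(x, coin) * m(u, die)

# Additivity survives the product: (x or y) times u equals x times u plus y times u.
assert m_joint(set(product(x | y, u))) == m_joint(set(product(x, u))) + m_joint(set(product(y, u)))

print(m_joint(set(product(x, u))))  # 1/4
```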
Now we're going to go back to our idea of the zeta function, because, if you remember, the zeta function actually always took two elements. It was always a valuation of some kind of range of the lattice. So we are essentially trying to build a bivaluation, not a valuation of single elements of the lattice.
In general, we're going to write this bivaluation with X and T, where the order inside matters a lot. The right one is the top one and the left one is the bottom one.
X is commonly called the predicate and T is called the context.
And note that T does not need to be the top element of the lattice. T can be any element. We want this to be a very general kind of operation.
Now, because these bivaluations are essentially just ranges of elements of the lattice, they can be ordered by set inclusion. And consistency requires that smaller ranges have higher valuations, and that valuations of two elements that are further apart are smaller.
The reason is that, given the same context, an element x that is further down the lattice needs to have a lower valuation than an element that x implies, right? For example... what's a good example to illustrate this?
Well, I think this one's kind of obvious. It's the opposite that needs to be explained. Given the same predicate, we also know that a more diluted context gives a lower valuation. So this one needs an example. Imagine that you want to define the probability that a person is in a city, let's say Paris, right?
The probability that the person is in Paris given that you know the person is in France is definitely much bigger than the probability that they are in Paris knowing only that they are somewhere in Europe. The second context is much more diluted.
Being in France implies being in Europe, hence France is strictly below Europe in this ordering by implication. And Europe is just so much more diluted. It allows for so many more possibilities.
And this is essentially what we're reconstructing here.
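The Paris example can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the toy list of cities is made up, every location is treated as equally plausible, and the function name `bivaluation` is hypothetical rather than anything from the talk.

```python
from fractions import Fraction

# Hypothetical toy world: each label is one equally plausible location (an assumption).
europe = {"Paris", "Lyon", "Marseille", "Berlin", "Munich", "Rome", "Milan", "Madrid"}
france = {"Paris", "Lyon", "Marseille"}
paris = {"Paris"}


def bivaluation(x, t):
    """Degree to which context t implies predicate x: the share of t that lies in x."""
    if not t:
        raise ValueError("context must be non-empty")
    return Fraction(len(x & t), len(t))


p_paris_given_france = bivaluation(paris, france)
p_paris_given_europe = bivaluation(paris, europe)

# The more diluted context (Europe) gives the lower valuation.
print(p_paris_given_france, p_paris_given_europe)
```

Counting equally weighted locations is just the simplest choice that respects the ordering; any consistent assignment would have to give the France context the higher value here.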
And this bi-valuation is then what we call probability, where the bi-valuation of the range from X to T is the probability of X given T.
The interesting thing is that, given a context, the valuation of all the elements below it just reduces to the valuation of those single elements, which is why we can apply the results we saw before, which only cared about valuations of single elements.
And this allows us to write the sum rule, which is very nice.
So we already have this. And we can also use the product rule for independently treated systems, like so, where the combined context T is again context one times context two.
And just to illustrate what this looks like in the lattice case, a quick example: let's take these two simple lattices, each with just two atoms, a top element and a bottom element, and say we want to find the valuation of A times X. Then the top element here becomes t1 times t2. We can use the distributivity property to obtain that this is the top element of the new lattice that we're going to get.
And again, note that neither t1 nor t2 needs to be the top element of its respective lattice. These two could just be sub-lattices of some bigger structure. It doesn't matter.
And the combination of them results in this. This is what we're doing when we combine two systems that we treat independently. We create this new bigger structure that has all these cross product points.
Um yeah.
Now, the next thing that we need to do to have a complete reasoning apparatus is to somehow quantify the degree of implication between two elements that are not directly one above the other.
Because if they are directly above each other, you can just combine all the elements with the join operation, with the sum. But if they're not directly above each other, what do you do?
So, for example, imagine a chain where each of these elements is directly above the previous one. Then we somehow need to obtain the valuation, this generalization of the inverse zeta function, over the whole range from x to t.
And we have to find this from all the sub-intervals: x to y, y to z, and then finally z to t. We need to somehow be able to combine those to get the bigger valuation. And this, mind you, is an entirely different operation from combining independent systems, but it turns out that this will also be a product rule.
Now, we have the fifth and last symmetry that we are going to look at.
The chaining of these intervals in the lattice is associative.
Meaning it doesn't really matter in what order we do the chaining operation.
This simply follows from the fact that implication is a property of the intervals as a whole. And so it's absurd to make it somehow depend on our way of chaining them together.
As long as, of course, we don't skip elements or do anything stupid. But given that this is a valid choice of elements, it doesn't matter how we combine them.
And then of course the operator, which is this bull's-eye symbol, must also satisfy this. This is just written down in probability notation.
And in order to use our beautiful associativity theorem once again, we unfortunately need to show that it satisfies symmetry one yet again. I'm sorry.
So we show that chaining of intervals preserves the order from the left and from the right. Again, if you have this chain of elements X, Y, Z, T, then given the same context the probability of X is strictly lower than the probability of Y, because it's further away from the context. It's a smaller subset of the possibilities.
Then if we chain them so that both have a larger context, we know from the fact that these bi-valuations are just ranges that this still preserves the order. And dually for the other case: if we have two propositions that have different contexts, and we chain them together such that they have a new common proposition that is further up the order, then this also preserves the original order.
And now we can apply our beautiful associativity theorem yet again.
I just gave it a slightly different name, theta hat, because it always pisses me off when authors reuse symbols in different contexts, so I thought I'd do the courtesy of not doing it. And once again, we know that for some mapping theta hat, this operation is going to be addition again.
Once again, we have already chosen addition for our first operator, and so we can simply recycle the proof from our previous case. And so this obviously already gives you the hint that it's going to be another product rule.
We set up a very similar consistency check: chaining must be consistent with the operation of joining two disjoint elements, for which we have chosen addition.
We again define the opposite. We find a very similar structure to before, and through the same proof, in fact, we find that V is once again C to the AX.
Meaning, once again, we see that this chaining operator is also multiplication. So the probability of A given C is equal to the probability of A given B times the probability of B given C.
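A quick numerical check of this chaining rule, as a sketch only. It assumes a toy model in which propositions are sets of atoms with made-up positive weights; the names `weight`, `measure` and `p` are hypothetical helpers, not notation from the talk.

```python
from fractions import Fraction

# Hypothetical positive weights on four atoms; any positive numbers would do.
weight = {"a": Fraction(1), "b": Fraction(2), "c": Fraction(3), "d": Fraction(4)}


def measure(s):
    """Additive valuation of a set of atoms (sum rule for disjoint atoms)."""
    return sum((weight[e] for e in s), Fraction(0))


def p(x, t):
    """Bi-valuation of predicate x in context t."""
    return measure(x & t) / measure(t)


# A chain in the implication ordering: A implies B, and B implies C.
A = {"a"}
B = {"a", "b"}
C = {"a", "b", "c", "d"}

lhs = p(A, C)            # valuation over the whole range
rhs = p(A, B) * p(B, C)  # product of the sub-interval valuations
print(lhs, rhs)
```

In this toy model the intermediate measure of B cancels between the two factors, which is why the product over sub-intervals reproduces the valuation of the whole range.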
And there we go. Now we have derived basically all the probability calculus of the Bayesian inference machine.
We just need to find the more general form of the product rule.
We can do that by simply noting that the probability of X and Y given Y is equal to the probability of X given Y if X is below Y, that is, if X implies Y. That's simply because you have to remember that these valuations are just ranges on the lattice, and obviously these two describe the same range because X implies Y already.
So these two are equivalent. And so, if we have the probability of A and B given C, we can expand that by adding another "and C" at the end, because this doesn't change our valuation. Then we can split it up into two parts and use our chaining rule, like this. Here we condition on B and C, and then we go from B and C to just C.
But these two can again be simplified. The right one just becomes the probability of B given C and the first one just becomes the probability of A given B and C, which always refers back to this property here, right?
And very interestingly, we can also write this as the probability of B given A and C times the probability of A given C. So this probability of A and B given C can be written in two different ways, and we can exploit this fact to write down this equivalence, which is just Bayes' rule. So Bayes' rule is just a consequence of this product rule for chaining different valuations.
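Written out in symbols, the steps just described can be sketched as follows. The notation is an assumption, since the slides are not part of the transcript: p(· | ·) stands for the bi-valuation and ∧ for "and".

```latex
\begin{align*}
p(A \wedge B \mid C)
  &= p(A \wedge B \wedge C \mid C)
     && \text{appending ``and $C$'' changes nothing} \\
  &= p(A \wedge B \wedge C \mid B \wedge C)\, p(B \wedge C \mid C)
     && \text{chaining through $B \wedge C$} \\
  &= p(A \mid B \wedge C)\, p(B \mid C)
     && \text{general product rule} \\[4pt]
p(A \wedge B \mid C)
  &= p(B \mid A \wedge C)\, p(A \mid C)
     && \text{same steps with $A$ and $B$ swapped} \\[4pt]
p(A \mid B \wedge C)
  &= \frac{p(B \mid A \wedge C)\, p(A \mid C)}{p(B \mid C)}
     && \text{Bayes' rule, for $p(B \mid C) \neq 0$}
\end{align*}
```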
And I'm sure you all know just how powerful this rule is.
One last thing that we need to show is what happens when we choose as context an element T that is not the top element of its lattice. What happens when we choose something else?
Because I kept saying the whole time, "Well, it doesn't really matter what it is, you can just use whatever." But how does our valuation change? We can evaluate the extreme case. This is obviously the valid form that we explored before. But what if we take the probability of the context given itself? Obviously the product rule must still apply, because it's true for any range, any section of the lattice.
And this means that the probability of the context given itself must be equal to itself squared, so it must be equal to one.
And this means that, very conveniently, our measure is always normalized.
It always sums up to one.
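The normalization argument can be sketched in symbols. The notation p(· | ·) is assumed, and the step that rules out the value zero is an added reasoning step, not something stated in the talk.

```latex
\begin{align*}
p(t \mid t) &= p(t \mid t)\, p(t \mid t)
  && \text{chaining the interval from $t$ to $t$ with itself} \\
\Rightarrow\quad p(t \mid t) &\in \{0, 1\} \\
p(x \mid t) &= p(x \mid t)\, p(t \mid t)
  && \text{so $p(t \mid t) = 0$ would force every $p(x \mid t) = 0$} \\
\Rightarrow\quad p(t \mid t) &= 1
\end{align*}
```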
And here we have Bayes' theorem in its full power, because it allows us to do something that a frequentist methodology just does not allow us to do: we can update the probability of a parameter rigorously.
So, imagine you have some context for your investigation, your experiment, whatever it is, and we call it T.
And you want to learn the parameter theta from data D.
Then by simply applying the product and the sum rules you obtain Bayes' theorem, and you can write it down like this.
You have the probability which is called the posterior, the probability of the parameter given the data and the context, which is equal to a fraction where in the numerator you have the prior probability of the parameter.
Why can't I highlight anymore?
The probability of the parameter given the context, times the likelihood, which is the probability of the data given your parameter.
And on the bottom you have a normalizing constant, which makes it so that this all sums up to one. It is simply obtained by summing the expression in the numerator over all possible values of theta.
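As a minimal sketch of this update, here is a discrete version in Python. Everything specific is an assumption for illustration: the coin-bias parameter, the grid of eleven theta values, the flat prior, the binomial likelihood, and the data of 7 heads in 10 tosses.

```python
from math import comb


def posterior(thetas, prior, heads, tosses):
    """Posterior over a discrete grid: prior times likelihood, divided by their sum."""
    likelihood = [
        comb(tosses, heads) * th**heads * (1 - th) ** (tosses - heads)
        for th in thetas
    ]
    numerators = [pr * lk for pr, lk in zip(prior, likelihood)]
    evidence = sum(numerators)  # normalizing constant: sum over all values of theta
    return [n / evidence for n in numerators]


# Hypothetical setup: coin bias on a grid, flat prior, made-up data.
thetas = [i / 10 for i in range(11)]
prior = [1 / len(thetas)] * len(thetas)
post = posterior(thetas, prior, heads=7, tosses=10)

most_plausible = thetas[post.index(max(post))]
print(most_plausible, sum(post))
```

The denominator never has to be worked out separately: it is just the sum of the numerators, which is why the posterior adds up to one by construction.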
And again, usually for brevity the context T is omitted, because it's common to every term in our expression. However, remember that in this view probability is always a measure of a degree of implication, meaning it's always by necessity a bi-valuation.
There's no such thing as a truly unconditional probability.
One that somehow floats on its own in some kind of Platonic world. That's not a thing.
Then of course we need to say a word on prior probabilities, because that is a major point of contention and is also why the Bayesian approach gets such a bad reputation. Where do you get those prior probabilities? I mean, you haven't done the experiment yet, by definition. So what the hell do you know? And unfortunately, the language has caught on that the prior probability somehow represents the belief of the researcher, and this makes it sound like you can just listen to your gut and choose the numbers yourself. In fact, there is a school of probability which does something like this, the school of de Finetti. But I and many others, famously including Edwin Thompson Jaynes, reject this and find it totally absurd.
And it has to be pointed out that while the word subjective is usually used for something that is not valid, subjective is actually not the same thing as arbitrary.
Probability is in this sense necessarily subjective, because it always depends on what knowledge you have.
So people with very different knowledge might assign very different probabilities to the same proposition, to the same event, right? But what we require of this probability calculus is simply that two different researchers with the same information will assign the exact same prior probability, and this is what is called the objective Bayesian school.
And now we're going to look at some of the amazing things you can do with this reasoning apparatus, this quantified common sense.
If you want to know more about why it's quantified common sense, once again, I recommend the book by Edwin Thompson Jaynes, which is called Probability Theory: The Logic of Science.
Amazing book. Just read the first two chapters and you immediately see what he means. Very easy and quick to read and super interesting.
But now, assume that we have again some context. Oops.
A context with a parameter A that we will for now, just for simplicity, assume to be discrete. This parameter can take the values A1, A2, up to AN.
Since these values are mutually exclusive and exhaustive, we know that the probability of their disjunction, A1 or A2 and so on up to AN, must be one. Because otherwise there are possibilities that we have not considered.
And since this notion is already a consequence of the context, because the context tells us, "Okay, yes, we are looking at this with this parameter",
conditioning on the data does not change it. So the probability of at least one of them being true, given the data and the context, is still one. That doesn't change.
But now, what if the model also has another parameter, omega, which is the one that we are actually interested in, the one we want to learn from the data? Meaning we want to find the probability of omega given the data, and not the probability of omega and A given the data, where the comma here simply means an and operation, right? That would mean you want to know both simultaneously. But in this case, we don't want that. So how do we get rid of A, which is called a nuisance parameter?
The way we do this is we apply the sum and product rules.
We start from the statement that we just saw, the one whose probability is equal to one, and we split it up. We apply the product rule, and basically we have the probability of omega given this whole disjunction of the parameter's values, times the probability of that disjunction. And since the second factor is equal to one, and the disjunction does not give us any additional information on omega, all of this is equal to just the probability of the parameter that we want. So this is the expression that we're looking for.
We have found it. And this is again a consequence of the fact that this massive compound statement is already contained in the context. So it doesn't have any additional influence on our knowledge of omega, because it doesn't add anything that is not already contained in T.
So how do we make use of this? We make use of the distributivity of logical operations. We have seen that this expression gives us the probability that we want.
This is the thing that we saw just above.
And now we simply make use of the distributivity of the and over the or operation, and we obtain this massive compound statement where we have omega and A1, or omega and A2, and so on, or omega and AN.
And since the or operation on mutually exclusive statements is always a sum, we can split this up into different probabilities that we simply sum together.
And this gives us a general recipe for what is called averaging out the nuisance parameter. In the continuous case, we obtain this beautiful simple formula in which we just integrate out the nuisance parameter. We just need to sum the probabilities of the value of omega that we're looking for over all possible values of A.
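In the discrete case the recipe is literally a sum over the nuisance parameter. The sketch below assumes a made-up joint posterior over omega and A; the numbers, the labels "low" and "high", and the function name `marginalize` are purely illustrative.

```python
# Hypothetical joint posterior p(omega, A | D, T); the numbers are invented and sum to one.
joint = {
    ("low", 1): 0.10, ("low", 2): 0.05, ("low", 3): 0.05,
    ("high", 1): 0.20, ("high", 2): 0.40, ("high", 3): 0.20,
}


def marginalize(joint):
    """Sum the joint probability over every value of the nuisance parameter A."""
    marginal = {}
    for (omega, _a), prob in joint.items():
        marginal[omega] = marginal.get(omega, 0.0) + prob
    return marginal


p_omega = marginalize(joint)
print(p_omega)
```

The result is a distribution over omega alone, and it still sums to one because the values of A are exhaustive.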
And this is very, very powerful, because when a problem is very high dimensional, trying to estimate all the parameters simultaneously is often impossible, or intractable with the computation that we have available. This averaging out makes many such problems approachable, so that we can actually get an interesting answer, or any answer at all. But in frequentist practice this is impossible, simply because frequentists don't assign probabilities to parameters, and so all of this reasoning with the probability of omega and A does not make any sense there.
And now another feature, which is probably my favorite feature of the Bayesian probability calculus: it has a built-in Occam's razor. It is a quantification of the notion of Occam's razor.
This idea is usually seen as a qualitative desideratum of scientific theories.
The exact words of Occam can be translated as "things should not be increased in number beyond necessity", which is usually summarized as: all else being equal, the simplest explanation is usually the best one.
Bayes' theorem, or the Bayesian calculus in general, has a mechanism that does exactly this.
Meaning more complex models are penalized if their additional predictive accuracy doesn't make up for their predictive flexibility.
And we can see this as follows. First, we need to look at how we would even compare different models.
And so once again we just remember that in the Bayesian context we can assign probabilities to any proposition, and that includes the proposition that model I, out of a choice of models, is true.
And so we can once again write this down in the Bayes' theorem form.
And again, the term in the denominator is just a constant obtained by summing over all possibilities, over all models.
Model comparison is very analogous to parameter estimation. The posterior probability of a model is proportional to its prior probability times its likelihood. And now, to compare two models, we can compute their odds ratio. We compare them by dividing one by the other. Here we have the original ratio, and the reason why we can write it with just the numerator of model I and the numerator of model J is that the normalizing constant is common to both, because we are studying them within the same context and with the same data. And so we obtain this ratio, which is very interesting. And since usually we don't have any reason to prefer one model over the other, we must by consistency assign equal prior probabilities to them. Meaning the prior probability of model I is equal to the prior probability of model J.
And so this odds ratio, as it's called, becomes the ratio of the global likelihoods of the models.
Like so. And this little expression is usually called the Bayes factor, which is why I denote it as B IJ.
And these likelihoods are calculated from the product rule, as seen before.
If, say, model I has only one parameter, theta one, then its likelihood will be just this, right? You just average out the parameter and obtain the complete likelihood of the model. This is essentially what we have seen before. However, if the alternative model J has three parameters, then its likelihood is this massive expression.
And both prior probabilities, meaning this probability here of the one parameter of the first model and that of the three parameters of the second model, must individually add up to one, because they're probabilities.
It means that in the second case the prior probability, this one right here, is much more "diluted", meaning it is distributed among more parameters and more values, so each value of each parameter is much less likely to begin with.
And this means that if we add a parameter that doesn't drastically improve the likelihood, the severe drop in prior density caused by that new dimension will drag the entire integral down.
Basically, the probability calculus is mathematically penalizing models for exploring useless dimensions.
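The penalty can be seen in a small numerical sketch. It uses a simpler pair of models than the one-parameter versus three-parameter case in the talk, and everything in it is an assumption for illustration: a coin with bias fixed at 0.5 (no free parameter) against a coin whose bias has a flat prior on a grid, with made-up data.

```python
from math import comb


def binom_likelihood(theta, heads, tosses):
    """Probability of the data for one specific value of the coin bias."""
    return comb(tosses, heads) * theta**heads * (1 - theta) ** (tosses - heads)


def evidence_fixed(heads, tosses, theta=0.5):
    """Global likelihood of the simple model: nothing to average over."""
    return binom_likelihood(theta, heads, tosses)


def evidence_free(heads, tosses, grid_size=101):
    """Global likelihood of the flexible model: likelihood averaged over a flat prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = 1 / grid_size  # prior mass is diluted over every grid value
    return sum(prior * binom_likelihood(th, heads, tosses) for th in thetas)


def bayes_factor(heads, tosses):
    """Ratio of global likelihoods, simple model over flexible model."""
    return evidence_fixed(heads, tosses) / evidence_free(heads, tosses)


balanced = bayes_factor(heads=5, tosses=10)  # data the simple model predicts well
lopsided = bayes_factor(heads=9, tosses=10)  # data only the flexible model fits
print(balanced, lopsided)
```

With balanced data the flexible model wastes most of its prior mass on biases that fit badly, so the ratio favors the simple model; with lopsided data the extra flexibility pays for itself and the ratio flips.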
And once again, the frequentist methodology cannot do any of this, because it doesn't assign probabilities to parameters, and it also doesn't use priors, of course. Instead, in order to somehow avoid overfitting their models when they do maximum likelihood estimation, frequentists have to invent all sorts of ad hoc rules that don't follow from the rules of probability and are therefore suboptimal or even wrong.
And some of these ad hoc fixes are very famous. You're all familiar with them.
We have validation data sets, cross-validation, regularization, some kind of information criterion.
And especially on the neural network side, we have early stopping and dropout and all of these other strange and arbitrary rules that try to keep the model from finding the greediest way to fit all the points in the data perfectly.
And so we come to the conclusion of the presentation: the Bayesian probability calculus teaches us that, since the rules for manipulating probability as a degree of implication are unique, any deviation from them is bound to lead to suboptimal, paradoxical, or even plainly incorrect results.
And that is why in the literature there's great focus on not deviating from these rules.
And in fact, it happens so often that someone tries to invent a new trick that doesn't follow from the rules, and it doesn't take long for someone else to find, "Okay, this actually leads to some paradoxes down the road."
And additionally, one needs to be very careful when handling infinite sets. This is what I mentioned at the beginning.
Carelessness in this context can lead to paradoxes that have nothing to do with the theory of probability, but instead follow from unsound mathematical practices. Some of the famous ones are Bertrand's paradox and the Borel-Kolmogorov paradox. You can avoid them by simply using finite sets, or infinite sets that are produced as the well-behaved limit of a sequence of finite sets. So, if you once again have this question, "Okay, you have the real numbers in some range, what is the probability that you pick out a rational number from this space?"
The frequentist or measure-theoretic probability calculus will tell you the probability is zero.
But that is of course absurd, because it is possible to draw a rational number.
And so this is an example of a result that looks paradoxical, and it is actually just a consequence of an ill-posed question and has nothing to do with the probability calculus per se.
Um yes.
And there we go. That's it.
If you have any questions, feel free to ask. I hope it was interesting.
Oh, you couldn't see the slides?