Pritchard provides a masterclass in moving beyond simple genetic associations toward a truly mechanistic understanding of human traits. By integrating GWAS with perturbation data, he offers a rigorous framework for deciphering the complex regulatory networks that actually drive biology.
Deep Dive
CSHL Keynote, Dr. Jonathan Pritchard, Stanford University
I have the absolute pleasure of introducing Jonathan Pritchard for our second keynote. Jonathan is the Bing Professor of Population Studies in the departments of genetics and biology at Stanford.
Jonathan has a range of official honors, including membership in the National Academy of Sciences and the American Academy of Arts and Sciences.
But beyond those titles, Jonathan is simply a leader in human genetics whose work has influenced all of our work. More than once, his thinking has led us in new directions.
From his seminal work on population structure to his recent work in complex trait genetics, he has influenced our thinking. As I was preparing for this introduction, I asked some of our peers to weigh in on what were the most important things that he had done.
The answers that I got highlighted some of those pieces of work, but there's one that I want to quote directly, and I won't tell you who it's from.
What this person said was: "I don't think his scientific strength is only that he's technically brilliant, though obviously he is. It is more that he has a kind of scientific vision. He seems to be able to notice what the rest of the field has not quite yet noticed, especially when everyone is working within assumptions that are starting to limit our progress."
I thought that was a particularly poignant message, so I wanted to highlight it. I also asked Jonathan if there was anything that he wanted me to mention today. And in his very Jonathan way, what he said was, "Please just sort of minimize my accomplishments.
Don't mention my mile time."
And this was my best attempt at that.
So, thank you, Jonathan, for being here today.
>> Well, that was just so kind, and it's such an honor and pleasure to be here. I've been coming to Cold Spring Harbor since, I think, the late 90s, and to this meeting for at least 25 years. So it's a great honor to be here, and especially to be introduced by Alexis. It's so great to see one's trainee going on and doing such incredible things.
Today I want to tell you about work that we're doing to conceptualize a new mode of human genetics, where we put together the traditional kinds of approaches that the field has been developing in association studies over the last 20 years with modern high-throughput perturbation studies. We think that by putting these kinds of tools together we can build much more mechanistic and interpretable models, which we refer to as trait causal models.
So the first question we might ask is: what are genetic associations good for? I would answer that by saying that they play a unique role in human biology by giving us causal links from variants or genes to phenotype. Of course, you might ask whether they are really causal, because what we're fundamentally doing in an association study is testing for a correlation between a genotype and a phenotype, and correlation is not necessarily causation. But our germline genotype is fixed from the point of fertilization, so for purposes of causal inference we can usually think of genetic variation as acting as an instrumental variable, meaning that it sits upstream of all of the environmental influences that we experience during our lives.
This makes genetic association a uniquely powerful tool: it gives us a genome-scale, very high-throughput assay for all of the variants and genes that may influence any given trait. With that kind of information in hand we can potentially learn a lot of things. In particular, there's been an enormous amount of interest and success in using GWAS to learn about biological mechanisms. We might want to identify critical genes that could serve as drug targets. Genetics tells us that a gene may play a causal role in a phenotype in a way that observational data and other kinds of studies really don't.
I won't talk about it today, but of course GWAS also allows us to do genetic prediction, patient stratification, and many other applications.
Just to set the stage, we can think about a Manhattan plot: what kind of information do we get from GWAS? I'll start with something that's a little bit light-hearted, perhaps. This is a GWAS from 23andMe and a large number of other colleagues, using UK Biobank and 23andMe data. Here it's represented as a Manhattan plot: every SNP is a single dot across the genome, and the y-axis measures significance. When they did this analysis in 700,000 participants, they found more than 350 independent regions of the genome that contribute to whether or not you're a morning person. If you're familiar with the great work that's been done on circadian rhythms, which was awarded the Nobel Prize in 2017, you'll see many of the core genes involved in circadian circuitry, including CRY1 and the PER1, PER2, and PER3 paralogs, as well as various other genes that have plausible mechanistic roles in whether you're a morning person or not. Now, I should say that there's only about 20% heritability for this trait. So if you have a hard time getting up tomorrow morning for the nine o'clock session after you've been in the bar after the lobsters, that's environment. You can't blame your genetics for that entirely.
This power of using genetics to link genes to causal effects on phenotypes means that it has already made large contributions to finding translational targets. One of the most inspiring of these, I think, is work that was done quite early on to identify variants that affect levels of fetal hemoglobin. This is an updated Manhattan plot for fetal hemoglobin from Vijay Sankaran's group, and you can see that one of the top hits is in the gene BCL11A. Starting from about 2008, when this was first discovered, BCL11A became the subject of intense study, and a couple of years ago it led to the first FDA-approved CRISPR-based treatment of a major disease.
So this is a direct finding that came out of insights from GWAS.
More broadly, there's been quite a lot of work showing that when genes have been identified in genetic studies, this substantially increases the probability that drug trials targeting those genes will be successful. So these are a bunch of the reasons to think that GWAS is a powerful tool for learning about human biology.
But I think it's also important to say that interpretation and insight can often be very difficult, and in part that's actually because GWAS has in some ways been more successful than we might have anticipated. This is actually not a Manhattan plot; it's an image from a paper by Loïc Yengo and colleagues from a few years ago looking at height. What they're showing is the number of independent hits for height in 100 kb windows tiled across the genome, and you can see that a large fraction of the genome is close to a genome-wide significant hit for height. Loïc and colleagues identified about 12,000 independent hits across the genome that influence height, and about 20% of the genome is close to a hit. This contrasts very sharply with what people imagined at the beginning of the GWAS era. I found this quote from 2006 from Francis Collins, talking about diabetes.
"I expect there are about 12 genes involved, and that all of them will be discovered in the next two years." It's a little unfair of me to pick on Francis Collins; this was the best quote that I could find, but many people believed this kind of thing. It's striking how different the outcome turned out to be. We find vastly more signal than people had expected, but spread across a huge fraction of the genome. So the question is: when signal is spread across 20% of the genome, and a large fraction of genes are close to genome-wide significant hits, how can you interpret it to learn biology?
Before we get to that, one question you might wonder about, and we started wondering about this 10 years ago as it became clear that so many genes matter, is: why is that? The conceptual model that Francis Collins probably had in mind is that there is some set of genes with an intricate role in the biology of type 2 diabetes, and we should find those. So what does it mean if there are actually thousands of genes involved in any given trait?
We wrote a perspective piece in 2017 in which we suggested a model we refer to as the omnigenic model, and this was our way of proposing a way of thinking about this problem. If you think about a relevant cell type for the trait you're looking at, perhaps there is some set of genes that do have relatively direct causal effects on the trait itself. We refer to these as core genes. Our thinking was that these core genes are embedded in gene regulatory networks in the relevant cell types.
For all of the other genes that are upstream of core genes, if you have variants that change their expression levels, those effects could flow through the network and impact the expression of the core genes through trans-regulatory effects. In this way of thinking, we suggested that a large fraction of the genes expressed in the relevant cell types might be plugged into these networks in such a way that essentially any of them might show up in a GWAS for height, or diabetes, or whatever you're studying. We also did some theoretical modeling grounded in things we could measure at that time, and we proposed quite specifically that most of the heritability is due to indirect effects that flow through gene regulatory networks. So most of the heritability is not coming from core genes, according to this model.
Some people were a bit depressed about this, but I think it poses a question: how should we learn biology if this is the case? We can identify these causal links from genes to traits, but it's not immediately obvious what mechanistic pathway gets you from here to here. So what I want to spend the rest of the time talking about is this: if this sort of scenario is true, how should we build models that are mechanistic, quantitative, and predictive for human genetics?
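The claim that most heritability could come from indirect effects can be made concrete with a little arithmetic. The sketch below is only a toy: the gene counts and effect sizes are hypothetical, every gene is assumed to have unit-variance expression, and each peripheral gene is assumed to act on the trait solely by shifting one core gene.

```python
# Toy illustration of the omnigenic argument. All numbers are hypothetical.

def peripheral_variance_share(n_core, core_effect, n_peripheral, trans_effect):
    """Fraction of genetic variance that comes from peripheral genes.

    Each core gene contributes core_effect**2 (unit-variance expression assumed).
    Each peripheral gene shifts one core gene by trans_effect, so its effect
    on the trait is trans_effect * core_effect.
    """
    core_var = n_core * core_effect ** 2
    peripheral_var = n_peripheral * (trans_effect * core_effect) ** 2
    return peripheral_var / (core_var + peripheral_var)


# A few core genes with large effects versus many peripheral genes with tiny ones.
share = peripheral_variance_share(
    n_core=10, core_effect=1.0, n_peripheral=5000, trans_effect=0.1
)
print(f"peripheral share of variance: {share:.2f}")
```

Under these made-up numbers the many small indirect effects outweigh the few direct ones, which is the qualitative point of the model rather than an estimate of anything real.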
How can we take the enormous amount of information that we get from GWAS and extract meaning from it? If you think about 12,000 hits for height, we should be able to learn a lot from this. One challenge at the time we published the 2017 paper was that the field really did not have tools for measuring gene regulatory networks, certainly not at any serious scale. That was a major criticism that people leveled at us: we couldn't really test this. But over the last decade, and particularly in the last few years, there's been enormous progress in large-scale perturbation studies, including techniques like Perturb-seq. So I want to tell you a little bit about Perturb-seq and how we think it can be used to shed light on GWAS.
Here's the basic concept. Let's suppose I've got a bunch of cells in a 96-well plate, and I do a CRISPR knockout of one gene in one of these wells. What's going to happen? When I knock down the gene, over time the mRNA is going to deplete from those cells, then the protein is going to deplete, and then effects are going to flow through the network, which might take a week or so. After that we could do RNA sequencing of those same cells and compare the transcriptome in the CRISPR cells against control cells that got non-targeting control guides. We'll do differential expression and ask what the effects of that knockdown are on the transcriptome.
So that's great; we can do that. It's a little bit annoying, because the way I've described it you have to do it one well at a time. That's low throughput, and you might also worry about batch effects between the different wells.
The idea of Perturb-seq is that we do this in a massively parallel format. Now we have a big pool of cells and a guide library, and the guide library targets a much larger number of genes. Ideally, nowadays, we can target every expressed gene in these cells.
We mix these guides onto the cells in such a way that each cell gets about one guide. Again we wait a week for the effects to take place, and then we do single-cell RNA sequencing. We also need to measure, for each cell, which guide it got. So for each cell we know which gene got knocked down, and we can measure its transcriptome. We can then summarize the data in a very compact form that I'll refer to as the differential expression matrix. Here every row is a gene that we've knocked down, which we'll refer to as a "regulator," and every expressed gene is a column.
These are the downstream genes. The matrix has been processed in two ways.
First, when you do the knockdowns, most genes don't have many effects that you can statistically detect with current technology, so this is filtered down to about 15% of expressed genes.
Second, we do biclustering, which sorts rows with similar effects together and columns with similar effects together. We'll call the downstream clusters programs: a program is a set of genes that tend to respond in the same way to perturbations. If we go back to the conceptual model I showed you a minute ago, you can think of these regulators as knockdowns of genes that are key points upstream in the network. When we knock one down, it changes the expression, up or down, of the genes that sit off to the right-hand side of that regulator, while leaving other parts of the network unchanged. So you can think of a program as a subnetwork of interconnected genes.
So this now gives us a way of thinking about how we could actually start to measure these gene regulatory networks.
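As a minimal sketch of how a regulator-by-gene differential expression matrix could be assembled from a pooled screen, the code below averages expression over the cells that received each guide and subtracts the non-targeting control mean. The gene and regulator names (`REG_A`, `REG_B`, `GENE_X`), the `NTC` label, and the expression values are all hypothetical, and a real analysis would work on normalized counts with proper statistical testing rather than raw mean differences.

```python
from collections import defaultdict


def differential_expression(cells, control="NTC"):
    """Build {regulator: {gene: mean(perturbed) - mean(control)}}.

    cells: list of (guide_target, {gene: expression}) pairs, one per cell.
    control: label of the non-targeting control guide (assumed name).
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for guide, expression in cells:
        counts[guide] += 1
        for gene, value in expression.items():
            sums[guide][gene] += value
    # Mean expression of every gene among the cells carrying each guide.
    means = {
        guide: {gene: total / counts[guide] for gene, total in gene_sums.items()}
        for guide, gene_sums in sums.items()
    }
    baseline = means[control]
    # One row per knocked-down regulator, one column per downstream gene.
    return {
        guide: {gene: guide_means[gene] - baseline[gene] for gene in baseline}
        for guide, guide_means in means.items()
        if guide != control
    }


cells = [
    ("NTC", {"HBA1": 10.0, "GENE_X": 5.0}),
    ("NTC", {"HBA1": 12.0, "GENE_X": 5.0}),
    ("REG_A", {"HBA1": 6.0, "GENE_X": 5.0}),
    ("REG_A", {"HBA1": 8.0, "GENE_X": 6.0}),
    ("REG_B", {"HBA1": 14.0, "GENE_X": 4.0}),
]
de = differential_expression(cells)
for regulator, row in sorted(de.items()):
    print(regulator, row)
```

In this toy data, knocking down `REG_A` lowers HBA1, so `REG_A` would be read as a positive regulator of HBA1, and `REG_B` as a negative one.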
And I suggested to you that this might be what we need to understand GWAS data. I had a postdoc who started in the lab about four years ago, Mineto Ota, joint with my longtime collaborator Alex Marson. When Mineto joined the lab, he was interested in whether we could use the Perturb-seq data, which were just emerging at that time, to interpret genetic associations. The situation has changed now, but at the time there was only one genome-wide Perturb-seq study that had been published, from Jonathan Weissman's lab, in a cell line called K562. K562 is an artificial system: it's derived from a leukemia, and the genome is a little bit messed up. The origin of this cell line is plotted here on the hematopoietic tree. It's thought to originate in progenitor cells on the lineage that gives rise to the erythrocytes, among other cells. This will be important in a moment. So you can think of these K562 cells as being roughly representative of red blood cell progenitors.
The next question we asked was whether we could use the K562 Perturb-seq as a proof of principle for the question we'd been trying to get at for a while: is Perturb-seq useful in any way for understanding genetic associations? Mineto proposed that, because of what we know about the hematopoietic tree, and also a bunch of other analyses with LD score regression that I won't get to, this may be a good cell type for studying processes in red blood cells, and in particular hemoglobin levels. There's a trait that I'll tell you a bunch about going forward, called MCH, which measures the average amount of hemoglobin per red blood cell.
What kind of signal would we look for in the data if Perturb-seq is actually useful for us? One of the proteins that forms part of the hemoglobin complex is encoded by a gene called HBA1. You might imagine that a mutation that changes the expression of HBA1 is almost certainly going to affect hemoglobin levels. HBA1 really feels like a core gene. We can also think about regulators of HBA1.
If we think about a gene that is a positive regulator of HBA1, you might expect that gene to also be a positive regulator of hemoglobin levels in human subjects, and conversely for genes that are negative regulators of HBA1. In principle we get these two kinds of information from the two different types of data we can collect: Perturb-seq tells us about the regulatory links, and the trait effects are things we can measure in large-scale GWAS, as in UK Biobank. That's the kind of signal we might look for if there's any relationship between Perturb-seq and GWAS.
Now, there's one challenge that we hit immediately. The way I described it, we need directional effects of genes on hemoglobin from the genetics, and, although people often don't think about this, standard GWAS does not actually estimate gene effects. Let me take a little diversion to tell you how we think about this. The cartoon I sketched for you concerns this part of the graph: we've got a gene that's associated with a phenotype, and a unit change in expression of this gene has some effect size gamma on the phenotype. If gamma is positive, increasing the gene's expression increases the phenotype.
But when we do GWAS, we'll typically find a SNP somewhere in a non-coding region nearby. Perhaps it sits here in this enhancer.
Maybe it's in a transcription factor binding site. This SNP has some effect on expression of the gene, that is, it's an eQTL, with some unknown effect size beta of the SNP on the gene. The effect of the SNP on the trait is then the product beta times gamma. Now, we can measure beta times gamma when we do GWAS, but we only really care about gamma, and usually we don't know beta. So gamma, the thing we want, is completely confounded with beta, and we're a little bit stuck. You can try to link eQTLs to GWAS hits, and there's a large literature showing that this is harder than one might imagine.
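A tiny numeric sketch may help show why the product alone cannot reveal gamma. The values of beta and gamma below are hypothetical; the only thing taken from the talk is that the SNP's effect on the trait is the product of the two.

```python
def snp_trait_effect(beta, gamma):
    """Effect of a SNP on the trait when it acts through a single gene.

    beta:  effect of the SNP on the gene's expression (the eQTL effect).
    gamma: effect of a unit change in the gene's expression on the trait.
    """
    return beta * gamma


# Three hypothetical scenarios with very different gene effects (gamma).
activating_gene = snp_trait_effect(beta=0.5, gamma=0.2)
repressing_gene = snp_trait_effect(beta=-0.5, gamma=-0.2)
strong_gene_weak_eqtl = snp_trait_effect(beta=0.1, gamma=1.0)

# GWAS observes only the product, which is about 0.1 in all three cases,
# so neither the sign nor the magnitude of gamma can be read off.
print(activating_gene, repressing_gene, strong_gene_weak_eqtl)
```

Without an independent estimate of beta, a gene that raises the trait and a gene that lowers it can produce exactly the same association signal.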
So that's a challenge. As an alternative we started thinking about loss-of-function mutations. These are mutations like indels that cause frameshifts, or premature stop codons, and the great thing about them is that, ideally, they completely knock out one of your copies. We all carry about 30 loss-of-function mutations sprinkled across our genomes, so each of us effectively has only one functioning copy for about 30 genes. What we can do is take each gene, compare the individuals who have two copies of the gene with the individuals who have one copy, and see if there's a phenotypic difference between those two groups.
This picture is from the very first paper I know of that applied this, a very famous paper from Jonathan Cohen and Helen Hobbs. They were looking at loss-of-function mutations in the gene PCSK9 and their effects on LDL cholesterol. What you see is that these loss-of-function mutations reduce LDL levels by maybe one and a half standard deviations. This result led to the PCSK9 inhibitors, which are an important backup drug for patients when statins don't work well.
One really great advantage of loss-of-function mutations is that they are directly interpretable in terms of both a direction of effect and a quantitative magnitude of effect. These analyses are often referred to as burden tests, because we pool together all of the loss-of-function mutations that exist in each gene, assuming they have the same effect.
How does this relate to the GWAS story I told you a moment ago? We like to think about what we refer to as a gene dose-response curve.
You can think about varying gene expression along this axis, conceptually, which maps out some unknown curve where the y-axis is the expected trait value given that gene expression. A loss-of-function mutation is a particular point on this curve. GWAS hits are usually somewhere in here, but we don't know which direction they are in. And when we do a knockout with CRISPR, it's also a large-effect perturbation, perhaps similar in magnitude to a loss-of-function mutation.
So what can we do with this? We can go into a large data set like UK Biobank and do these burden tests on every single gene. If we do this for MCH, we get data that look like this. Here in UK Biobank we're looking at nearly half a million individuals. This is a volcano plot where each dot is a different gene, showing the signal from the burden test on that gene. The x-axis is the effect size of the LoFs in that gene on hemoglobin levels, and the y-axis is significance. All of the core components of the hemoglobin molecule are here, like HBB, HBA2, and HBA1, as well as many other genes that are highly significant.
There's one more thing I want to mention. Because we're looking at loss-of-function mutations, if the loss of function reduces the phenotype then you can think of that gene as having an activating effect. Conversely, if the LoF increases MCH, you can think of the gene as having a repressing effect. That will come back later. It's a bit confusing, so I wanted to spend a moment on it.
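The burden comparison and the sign convention can be sketched in a few lines. This is a simplified stand-in, not the analysis used on UK Biobank: it compares carriers and non-carriers with a plain Welch t statistic, ignores covariates, and uses invented trait values in standard-deviation units. The function names are hypothetical.

```python
import math
import statistics


def burden_test(trait_noncarriers, trait_carriers):
    """Compare LoF carriers (one working copy) with non-carriers (two copies).

    Returns (effect, t): the mean trait difference, carriers minus
    non-carriers, and a Welch t statistic for that difference.
    """
    effect = statistics.mean(trait_carriers) - statistics.mean(trait_noncarriers)
    standard_error = math.sqrt(
        statistics.variance(trait_carriers) / len(trait_carriers)
        + statistics.variance(trait_noncarriers) / len(trait_noncarriers)
    )
    return effect, effect / standard_error


def gene_role(lof_effect):
    """Sign convention from the talk: if losing a copy lowers the trait,
    the intact gene is read as activating; if it raises the trait, repressing."""
    if lof_effect < 0:
        return "activating"
    if lof_effect > 0:
        return "repressing"
    return "no effect"


# Hypothetical standardized trait values for one gene.
noncarriers = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
carriers = [-1.4, -1.6, -1.5, -1.3]

effect, t_statistic = burden_test(noncarriers, carriers)
print(f"LoF effect = {effect:.2f}, t = {t_statistic:.1f}, gene is {gene_role(effect)}")
```

Because the perturbation is a loss of function, the sign of the estimated effect is flipped relative to the gene's role, which is the point that is easy to get confused about.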
Okay. So what do we actually see for HBA1 now that we've set this up? Remember, we drew a picture like this.
Do we actually see this in practice?
We can go and do these burden tests on every gene in the genome.
And in fact we find a very strong signal of this happening.
Here we've sorted all the genes in the genome by their loss-of-function burden effects, with the most repressing genes on the left and the most activating genes on the right, and for each gene the plot shows the Perturb-seq effect on HBA1. Many of these genes have large perturbation effects on HBA1. Not all of them: we wouldn't expect every gene that affects hemoglobin to be a regulator of HBA1, but many are, and genome-wide there's a highly significant correlation, at about 10 to the minus 7. This emboldened us to think that, even though the K562 cell line is quite an artificial system, it is telling us something really important about regulation in cells of the erythroid lineage that we can use to interpret data from middle-aged adults in Britain.
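The comparison described here amounts to correlating two per-gene quantities: the LoF burden effect on the trait and the knockdown effect on HBA1 expression. A minimal sketch with made-up numbers for five hypothetical genes follows; the Pearson correlation is an assumption about the statistic, since the talk only says the correlation is highly significant.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical per-gene values, listed in the same gene order.
lof_effect_on_mch = [-0.8, -0.5, -0.1, 0.2, 0.6]          # from burden tests
knockdown_effect_on_hba1 = [-1.2, -0.6, 0.1, 0.1, 0.9]    # from Perturb-seq

r = pearson(lof_effect_on_mch, knockdown_effect_on_hba1)
print(f"correlation across genes: {r:.2f}")
```

Both inputs are loss-type perturbations, so a gene whose knockdown lowers HBA1 and whose LoF lowers MCH contributes to a positive correlation.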
Okay, so that's a proof of principle, but we would like to extend this and be a bit more ambitious in order to learn new biology. The first problem you run into is that the Perturb-seq data are actually a little bit noisy.
The analysis I showed you worked for HBA1 because the signal is so strong. But in general, we and other groups in this space have found it very helpful to decompose the data using matrix factorization methods, which basically cluster the signals into simpler patterns. Think about a gene expression matrix with cells as rows and genes as columns. We decompose this, using a method called consensus non-negative matrix factorization, into two blocks.
The first block reduces the data to a matrix of cells by programs. Remember, I showed you programs before: these are clusters of genes that tend to be upregulated together as groups. You can think of each cell as having some programs turned on more and some programs turned on less.
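To make the decomposition concrete, here is a minimal sketch of plain non-negative matrix factorization using the standard multiplicative updates. It is not the consensus variant named in the talk, which aggregates many such runs; the tiny expression matrix, the number of programs, and the function names are all hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def frobenius_error(x, w, h):
    """Squared reconstruction error of x by the product w @ h."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, n_programs, n_iter=200, seed=0, eps=1e-9):
    """Factor x (cells x genes) into w (cells x programs) and
    h (programs x genes), keeping every entry non-negative."""
    rng = random.Random(seed)
    n_cells, n_genes = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(n_programs)] for _ in range(n_cells)]
    h = [[rng.random() + 0.1 for _ in range(n_genes)] for _ in range(n_programs)]
    for _ in range(n_iter):
        # Update gene loadings: H <- H * (W^T X) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_genes)]
             for k in range(n_programs)]
        # Update program usage per cell: W <- W * (X H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_programs)]
             for i in range(n_cells)]
    return w, h


# Four hypothetical cells by four genes, built from two gene programs.
x = [[5.0, 5.0, 0.0, 0.0],
     [4.0, 4.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 3.0],
     [0.0, 0.0, 6.0, 6.0]]
w, h = nmf(x, n_programs=2)
print("reconstruction error:", frobenius_error(x, w, h))
```

Here `w` plays the role of the cells-by-programs block, and `h` gives the gene membership of each program.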
When one looks at the programs, what you find is that very often they are highly biologically coherent processes.
This will be key in a moment. Once we've got the programs, we can ask which perturbations affect each program. We get this by taking the perturbation data and asking which perturbations are associated with either increases or decreases in expression of the program. So now we've got regulator-program pairs. Our hypothesis is that these programs are biologically coherent, and that some of them affect the trait, in this case hemoglobin levels.
Now we can think about two different kinds of tests. First, we can look up the effect of each gene in this graph on hemoglobin levels and ask whether the genes in a program have an average effect on the trait: do genes in a particular program tend to have positive effects on the trait, or negative effects? If a program were unrelated to the trait altogether, we would expect the mean effect to be roughly drawn from the null distribution. Second, we can look at the regulators and ask whether there is a correlation between the regulator effects on the program and the regulator effects on the trait. That would suggest that these regulators are significant in part because they act through their role on this program.
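The two tests can be sketched as follows. Both are simplified stand-ins under stated assumptions: the first uses a permutation null built from random gene sets of the same size, and the second uses a Pearson correlation. The talk does not specify either statistic, and the gene names and effect sizes are invented.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                       * sum((b - mean_y) ** 2 for b in y))
    return covariance / spread


def program_mean_test(trait_effects, program_genes, n_perm=2000, seed=0):
    """Test 1: do the genes in a program have a nonzero mean trait effect?

    Returns (observed_mean, p_value), where the p-value compares the
    observed mean with means of random gene sets of the same size.
    """
    rng = random.Random(seed)
    all_genes = list(trait_effects)
    size = len(program_genes)
    observed = sum(trait_effects[g] for g in program_genes) / size
    extreme = 0
    for _ in range(n_perm):
        null_mean = sum(trait_effects[g] for g in rng.sample(all_genes, size)) / size
        if abs(null_mean) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


def regulator_correlation(effect_on_program, effect_on_trait):
    """Test 2: across regulators, correlate the knockdown effect on the
    program with the LoF effect on the trait."""
    shared = sorted(set(effect_on_program) & set(effect_on_trait))
    return pearson([effect_on_program[g] for g in shared],
                   [effect_on_trait[g] for g in shared])


# Hypothetical LoF trait effects: 20 background genes near zero, 4 program genes.
trait_effects = {f"BG{i}": (0.05 if i % 2 else -0.05) for i in range(20)}
program = {"P1": -0.8, "P2": -0.9, "P3": -1.0, "P4": -0.7}
trait_effects.update(program)
observed_mean, p_value = program_mean_test(trait_effects, list(program))

# Hypothetical regulators: knockdown effect on the program and LoF effect on the trait.
on_program = {"R1": -1.0, "R2": -0.5, "R3": 0.5, "R4": 1.0}
on_trait = {"R1": -0.4, "R2": -0.3, "R3": 0.2, "R4": 0.5}
r = regulator_correlation(on_program, on_trait)
print(f"program mean = {observed_mean:.2f}, p = {p_value:.4f}, regulator r = {r:.2f}")
```

A program can be flagged by either test on its own, which is why the results are later laid out on two separate axes.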
Okay. So we set that up, we run it, and we're very happy. We get a bunch of programs that are reasonably significant, and when we annotate them post hoc we find that they're involved in a bunch of processes that we should a priori have expected to find. In this plot each dot is a program, and the colored ones are significant on one axis or the other. The x-axis shows the program signal, that is, whether the program itself is shifted positive or negative, and the y-axis shows whether the regulators show a signal. We see that hemoglobin synthesis is very significant. We see a bunch of programs involved in the cell cycle, and we know that the cell cycle is important for the differentiation of erythrocytes. And then autophagy, which is the process by which the cells get rid of all their organelles, an important part of red blood cell maturation as well. So that's great, and we're very happy with this. And then I'm thinking about it and realize: we've got a problem here. This doesn't actually make sense.
Let me explain the problem. Suppose that a program has a positive effect on the trait. Then you would expect that the regulators that upregulate that program should also have a positive effect on the trait, and the opposite for the repressors. That's what you would like to see. But you could imagine a reversed pattern on the regulators, and we would think of this as a discordant signal between what we're getting on the regulators and what we're getting on the program. And in fact, there are some programs that are completely in the wrong part of this picture.
All right. So we kind of panicked at this point and tried to think about what we had done wrong. After thinking about this for a bit, Mineto figured out that these programs are strongly jointly regulated in many cases. Here each gene is a dot, the x-axis is the Perturb-seq effect on a program that relates to the G2M phase of the cell cycle, and the y-axis is the Perturb-seq effect on S phase. You can see that these are highly negatively correlated, and in retrospect this makes a lot of sense, because S phase and G2M are different parts of the same cycle. They are largely mutually exclusive, so if you're spending more time in one you're spending less time in the other. For autophagy we again get a very strong joint distribution, but it's quite an interesting, different pattern.
We can also go and look at the single-cell data from the Perturb-seq. These are the proportions of S phase, G1, and G2M phase in the control cells, and when we look at the guides that downregulate G2M, you can see that the knockdown decreases the amount of time in S phase and increases the amount of time in G2M. So these patterns are really making a lot of sense to us. This starts to suggest that we can't consider these programs in isolation if we want to understand what's going on.
So we did a bunch of further analysis, and to explain this I'm going to simplify it by grouping the genes that show up as affecting the cell cycle into two categories, regulator A and regulator B, according to their effects on G2/M. Then we did a bunch of multiple regressions, and I'll spare you the details. We drew this graph that shows the effect of the A set of regulators on S phase, autophagy and G2/M, and then how these effects flow through to affect hemoglobin. Okay. And remember, red arrows indicate something's activating and blue indicates it's repressing.
And then, of course, all the genetics relate to the LoF effects.
So now we can think about these circuits almost as quantitative wiring circuits. To think about this, it's helpful to think about the signs of the edges. If the edge is blue I want you to multiply it by minus one, and if it's red I want you to multiply it by one. Okay. So for the A group of regulators, down this first path it's going to be negative times positive times positive, so in aggregate this path has got a negative sign. And then this one is negative times negative times negative, so that's got a negative sign. So all of the paths through this A group have got negative signs. But when we look at the B set, they've got cancelling effects. This one is going to be negative times positive times negative, so it's got a positive effect. The path through autophagy is negative times negative times negative, so that's got a negative effect.
So what you might gather from this is that you should predict that the A set of regulators have an overall negative effect on hemoglobin, and that it should be a much more negative effect than for the B regulators. And in fact that's what we see. Both sets of regulators have negative effects on hemoglobin levels, but the A set has about a threefold larger magnitude than the B set. The reason I spent a lot of time on this is that in human genetics we have very rarely had the ability to actually make quantitative models, and I think this is a really nice example of how we can start to make human genetics much more quantitative and much more predictive.
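The sign bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the sign triples are the ones read out in the talk, but the path labels are hypothetical placeholders, and counting signs ignores edge magnitudes, which in the actual analysis came from multiple regressions not shown here.

```python
from math import prod

# Edge signs: +1 for an activating (red) edge, -1 for a repressing (blue) edge.
# Each tuple is one path from regulator LoF through a program to the trait.
# Path labels are hypothetical; only the sign triples come from the talk.
PATHS = {
    "A": {"path_1": (-1, +1, +1), "path_2": (-1, -1, -1)},
    "B": {"path_1": (-1, +1, -1), "autophagy": (-1, -1, -1)},
}

def path_sign(edge_signs):
    """Sign of a path is the product of the signs of its edges."""
    return prod(edge_signs)

def summarize(paths):
    """Return the per-path signs and their unweighted sum for one regulator group."""
    signs = {name: path_sign(edges) for name, edges in paths.items()}
    return signs, sum(signs.values())

for group, paths in PATHS.items():
    signs, net = summarize(paths)
    print(group, signs, "unweighted net:", net)
```

With equal weights, both A paths push in the negative direction while the two B paths cancel. The observed B effect is still negative but smaller, which is consistent with the edges having unequal magnitudes.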
So we now start to use this logic to build out larger causal pathways. I'm going to refer to this as a trait causal model for hemoglobin that we can infer. Here this graph is representing how we think the genetic effects are flowing through different pathways to affect hemoglobin.
So we've got MCH, the trait, in the middle. We've got the major programs that have effects on MCH, and then the genes are colored according to the direction of their trait effect on hemoglobin. It's just showing genes that are significant, with directions of effect that match the model. One kind of nice thing here is that we can mask a subset of the genes when we build the model and ask if we get the directions of effect right. We're not perfect, I mean the model is obviously a simplification of reality, but we have about 75% accuracy here, which is far better than what we would get under a null model. So I think this is really pointing to the idea that we can start to make human genetics into a much more predictive field than what we have now.
I just want to tell you about one anecdote that we get from this graph. This gene, SUPT5H, is a major anemia gene, and if you look up the gene you'll find that it's a transcriptional elongation factor. It's not really clear a priori why that should matter: transcriptional elongation seems like quite a generic process in gene regulation, so why should it be specifically involved in anemia? We think that this graph actually gives us the answer. It turns out that SUPT5H is a major regulator of all three types of programs that affect hemoglobin. It's a positive regulator of hemoglobin. It's a repressor of these aspects of cell cycle and autophagy. And it just turns out, probably coincidentally, that the directions of effect for SUPT5H all point in the same direction.
And we think that this is the sort of insight that in future one could use to try to pick out drug targets by understanding exactly why a particular gene shows up in genetic associations.
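The masking check mentioned above (hold out some genes, then ask whether the model predicts the direction of their trait effects) comes down to a sign-agreement score. The sketch below shows one way to compute it. The function name and the toy numbers are hypothetical and are not data from the talk, and the null model used in the actual analysis was not specified.

```python
import numpy as np

def sign_accuracy(predicted, observed):
    """Fraction of held-out genes whose predicted effect direction matches the observed one.

    Genes with an exactly zero predicted or observed effect are skipped,
    since they have no direction to compare.
    """
    pred_sign = np.sign(np.asarray(predicted, dtype=float))
    obs_sign = np.sign(np.asarray(observed, dtype=float))
    keep = (pred_sign != 0) & (obs_sign != 0)
    if not keep.any():
        raise ValueError("no genes with a nonzero predicted and observed effect")
    return float(np.mean(pred_sign[keep] == obs_sign[keep]))

# Toy, made-up effects for five masked genes (illustration only).
model_prediction = [-0.8, 0.3, -0.1, 0.5, 0.2]
burden_estimate = [-1.2, 0.4, 0.2, 0.9, 0.1]
print(sign_accuracy(model_prediction, burden_estimate))
```

If directions were assigned by a fair coin flip, the expected score would be 0.5, which is one simple baseline to compare against.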
The next kind of thing that we can do with this is to look at graphs that combine across traits. There are several other traits measuring red blood cells in UK Biobank. RDW measures the variability in red blood cells, and it's an important measure of aging and overall health. IRF measures the fraction of immature reticulocytes. If you look at both the phenotypic correlation and the genetic correlation, you can see that all of these traits are moderately correlated. Hemoglobin and RDW have quite strong negative correlations, and we can interpret that in terms of their co-regulation by these different programs. So, for example, autophagy and cell cycle both have opposite-direction effects on MCH and RDW, but these other two programs, progenitor maintenance and hemoglobin synthesis, have independent effects on one or the other.
So most genes where the burden effects affect both traits are like these three genes, where they have opposite-direction effects. Here these hamburgers tell you the directions of effect, from the human genetics, of each gene on each of the three traits. You can see that SUPT5H has got opposite-direction effects on hemoglobin and RDW, and that's because it's a major regulator of autophagy and cell cycle. But there are a few other genes where the effects on MCH and RDW point in the same direction, and that's because they're independently regulating different programs that have independent effects on the blood cell traits. So we think that we can learn a lot from these trait causal models about exactly what the wiring pathways are for these regulators, and how they flow through programs to affect traits.
So in the last few minutes I want to say a little bit about how we're thinking about building this out for other cell types and other traits. I think it's fair to say that the K562 hemoglobin example is almost a best-case scenario. Hemoglobin is really a one-cell-type kind of trait, so in some ways it's relatively simple. It's much simpler than most disease traits that we might be interested in. But I think we do have an opportunity now to build the tools in human genetics to really push this forward. So I think we need a lot more perturbation data, and the good news is that perturbations are becoming much, much cheaper. The technological difficulties and the cost have reached a point where individual labs can do whole-genome Perturb-seq in their own labs. And I'll talk a little bit about quantitative estimation of gene effects and new approaches for graphical inference.
So, very briefly, we're working with my longtime collaborator Alex Marson, led by Ron Zhu and Emma Dann, who are two wonderful postdocs joint between our groups. We've recently conducted a genome-wide Perturb-seq study in CD4+ T cells. This is the first genome-wide study in a primary cell type. We were able to do this in three different conditions, so resting and then after stimulation, and we were able to do this in 22 million cells. If you've done single-cell sequencing, you know that's a lot of cells. So thank you, CZI, for that. There's a preprint that's got a lot of interesting stories, but I'll just show you one thing, which is that we find that the gene regulatory networks are rewired dramatically across the three time points, even though they're separated by only 48 hours in total. This is a correlation matrix where each row and column is one of the major regulators involved in a regulator group called Mediator-SAGA. Each regulator is repeated three times, once for each of the three conditions. And you can see that the off-diagonal blocks correspond to the correlation of a regulator in one condition versus a different time point.
And so what we see is that the downstream effects from this cluster are shifting across the 48 hour time period.
You can see that the same cluster of genes is active in all three time points, but the downstream targets are changing. So what this shows us, and I think it makes a lot of intuitive sense, is that the cell types and the conditions really, really matter. There's been a lot of interest in the field, I'd say particularly in Silicon Valley, in doing Perturb-seq in cell lines, trying to get convenient systems, and I think this really points to the importance of doing a lot of work on getting biologically accurate models of the system you want to study.
The second thing is that we need to make GWAS much more quantitative. I showed you that the loss-of-function mutations were very powerful for us, but those leave a lot of information on the table. Ideally we might like to estimate these gene dose-response curves. Most of the signal in genetic associations is in these small-effect GWAS hits, and we need to get much better at interpreting those. Recently Tommy Goggv in the lab has been studying variants that have effects on translation that we can measure from an MPRA. These are average curves, where we're averaging across many genes. We identify variants that have got different effects on translation, then we stratify them according to their LoF effects on a trait, and we can see that on average they tend to have the sorts of curve shapes we predicted conceptually. But I think there's an important challenge now to get much more directional information from GWAS.
And then lastly, we've been doing a lot of work on how to build more advanced models for combining the genetic information with the perturbations. Emma and Roma in the lab have been working on something called phenobridge, which is a joint model for combining the data. I won't tell you the details, but basically there's an internal layer that is predicting gene expression from a series of programs, and then the loss function is going to combine a loss on this Perturb-seq layer with a loss function for the effects of genes on traits, again flowing through these programs. Then there's a factor lambda that's basically weighting how much of the model fit comes from each part of this. We find that fitting these models jointly in most cases gets us much better models. And here's one application, to modeling lymphocyte counts.
So just to end, what I have tried to argue to you today is that genetic association studies are essential tools for studying human biology, but they're often very difficult to interpret even though we have an enormous amount of information there, and I think that perturbation studies are really maturing now to the point where they can be incredibly helpful. My goal here is to argue that human genetics should be much more quantitative, much more mechanistic and much more predictive.
If we think about the challenges to implementing this, I would say that one is that we need a lot more perturbations in many more cell types and conditions. And this is really happening now. There's no question it's going to happen. Many groups are starting to do this and the experiments are really becoming affordable. Secondly, we need to solve a bunch of statistical challenges. GWAS is very powerful but it's not quantitative. The LoFs are quantitative but they're not powerful. We need to get better at imputing perturbation effects across conditions. We need better models for inferring gene regulatory networks from these differential expression matrices. That's a really challenging problem. And we really need new techniques for doing causal inference. It's sort of an unusual setting, where we're gluing together data from these very different settings and we would like to build these joint models.
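The kind of joint objective described for phenobridge earlier in the talk (a Perturb-seq reconstruction term plus a trait term, weighted by a factor lambda) can be written down in a few lines. The sketch below is not the lab's implementation. The linear form, the squared-error losses, the `loss = expression + lam * trait` weighting, and every variable name are assumptions made for illustration.

```python
import numpy as np

def joint_loss(expr_effects, program_scores, program_loadings,
               trait_effects, program_trait_effects, lam):
    """Toy joint objective (assumed form, not the published model).

    expr_effects:          (n_perturbed_genes, n_measured_genes) Perturb-seq effects
    program_scores:        (n_perturbed_genes, n_programs) effect of each gene on each program
    program_loadings:      (n_programs, n_measured_genes) expression signature of each program
    trait_effects:         (n_perturbed_genes,) gene effects on the trait (e.g. LoF burden)
    program_trait_effects: (n_programs,) effect of each program on the trait
    lam:                   weight on the trait term relative to the expression term
    """
    # Expression term: how well the programs reconstruct the Perturb-seq effects.
    expr_loss = np.mean((expr_effects - program_scores @ program_loadings) ** 2)
    # Trait term: gene effects on the trait, flowing through the same programs.
    trait_loss = np.mean((trait_effects - program_scores @ program_trait_effects) ** 2)
    return float(expr_loss + lam * trait_loss)

# Tiny made-up example: two perturbed genes, two programs, two measured genes.
W = np.array([[1.0, 0.0], [0.0, 1.0]])
P = np.array([[1.0, 2.0], [3.0, 4.0]])
gamma = np.array([0.5, -0.5])
print(joint_loss(W @ P, W, P, W @ gamma, gamma, lam=2.0))
```

In this sketch, setting `lam` to zero fits the programs to the Perturb-seq data alone, while larger values pull the programs toward ones that also explain the trait effects.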
So I'll just end here, but of course this work represents 10 years or more of work in the lab, spread across many, many people. Several of these people came up during the talk. Alex has been a key partner through all of this. Mineto led a lot of this work.
Ron and Emma were key people, as well as Jeff. And then here's a picture of the lab last year. So thank you very much for your time.
>> Questions?
We have one right here. Do we have >> Oh, yes. Yes. Thank you.
You mentioned that hemoglobin, the blood trait, was kind of an ideal trait for building these statistical tools. Going back to your example of height, which is much more complex, could you speculate for me on what a similar level of explanation to the one you demonstrated for the blood cell trait would look like for height? Are we going to need the same types of models with multiple cell types? Are we going to have to model interactions between cell types? What would a height explanation look like?
>> So my way of thinking on this is that the first target for traits that are mediated through multiple cell types is this: right now all the programs are coming from the same cell type, but we could actually think of labeling these programs and arrows in a cell-type-specific way, so that the trait is an outcome of flow from genes through programs in different cell types, converging on a trait.
And so what we will need for that is Perturb-seq from all the relevant cell types, and then we can think about how to put those together into a graph. We've been playing a little bit with that. There's limited Perturb-seq data, for limited gene sets but many cell types, available already, and so we've started playing with this. So I think it's going to be feasible. There are also things like cell interactions that make this more challenging, but we make baby steps towards where we want to get to.
>> Thank you, Jonathan, great talk. So your title has GWAS in it. It's all about GWAS, but your vignette is actually about rare variants and Perturb-seq, so it's not exactly GWAS as people think of it, right? Of course you mentioned in passing at the end GWAS versus loss of function, and you do have a lot of work on common versus rare variants in GWAS. So could you make a connection from common variants in GWAS to this particular context of Perturb-seq? Or are you saying that we should just reprioritize and focus on the rare variants as low-hanging fruit for now within the new framework?
>> So we have found that rare variants are much easier to work with, but they are woefully underpowered, and for many disease traits we don't have enough sequencing, right? On the other hand, we've got 20 years of collection of large cohorts with SNP data, and most of the signals that we know of are common variant signals. So I think the field has to figure out how to use the common variant signal. To the extent that we've been able to explore this, we see that there are a lot of connections between the common variant signal and the rare variant signal. We had a paper exploring some of this last year. So I feel this has to be a solvable problem for the field, and I guess I would challenge people in the room to think about how to get there. We are working on this a lot, but there is so much signal in common variant GWAS that it would be crazy to walk away from it now. It's just been hard so far in our hands.
>> Thank you.
Um, can you use the uh Sorry, thanks.
Great talk. Can you use the time element? If you sample over time, you see different gene programs responding downstream, so could you use that to try to give an order of response in the network, like importance?
>> Yeah. I think that is a great idea. We collected a pilot data set that's a time series for that, and we had exactly the idea that you're thinking of. I think we didn't sequence quite deeply enough, so it hasn't really worked out, but I completely agree with your suggestion and I'd love to redo that experiment.
>> Thanks.
>> I think I might ask a quick question. As somebody who has worked on network inference for decades, I found it sort of disturbing that you showed that the gene regulatory network itself is changed under perturbation. When we're trying to predict the effect of perturbations, a lot of what we're assuming is that we know, or can model, the gene regulatory network to make forward predictions. So if the network itself is changing, what hope do we have? Or do we just have to measure all the perturbations themselves?
So I think we have to measure all the perturbations.
>> No.
>> I'm a simple man. I think we need to collect the data.
Okay. I'm going to qualify that a little bit, though. The way I think about this is that, with Alex, we work in the hematopoietic system, right? I can't imagine that we're ever going to get full-genome Perturb-seq data in every conceivable context, right? But what I imagine is that in a few years' time the field will have genome-wide immune Perturb-seq in, let's say, 15 or 20 different contexts, and maybe we have a dictionary of different regulatory pathways that exist somewhere. Then maybe if you've got that dictionary, and you've got low-dimensional data like unperturbed RNA-seq, ATAC-seq, H3K27ac or something like this, maybe you can use that to impute. So I'm optimistic about imputing things you've already seen somewhere.
>> Wonderful.
>> Question.
>> Hi Jonathan, great talk. I wonder if I can ask a maybe provocative question, which is the idea that perhaps for many common variants the phenotypic impact may depend on the genomic context. In other words, do we need to start thinking about assessing the effect of loss-of-function variants not just across cell types but across different genomes, testing them in cells derived from different individuals? If you think about the network aspect of this and the combinatorial aspect of this, is this something that we should start thinking about?
>> So there's no question that for large-effect variants you get interactions. But for small-effect variants, human genetics has consistently failed to find interactions, and so my working model is that we can get really far with additive models. I also think that doing science is the art of doing the possible, right? We're doing these experiments with small numbers of donors, and it's hard for me to imagine how we can really model all of the interaction effects at scale. I will say one thing, which is that we actually do see significant donor differences, and I think that's probably more a matter of maybe their recent infection status or other things that we're not measuring, which often will be environmental effects. So when I think about those kinds of variation across individuals, my bias, without a whole lot of data, is towards that sort of effect.
>> So in other words, in terms of thinking about it in an additive way, you are a believer that proof of absence is absence of proof, right?
>> You mean on genetic interactions?
>> Yes.
>> Yeah. I mean, many people, including us, have tried really hard to find evidence for interaction effects among common variants. It's still possible that some small part of the variance is coming from interaction effects, but it's only a small part. And, you know, I always tell them out loud: we do the experiment we can do right now, and it's very rarely perfect, but that's where we start. So that's kind of how I'm approaching this.
>> Okay. Well, let's thank our keynotes again. That was wonderful.