Graph-based data integration enables more effective pharmaceutical R&D by creating patient-centric networks that connect individual patients to biomedical entities (genes, proteins, pathways, drugs), allowing for patient subtyping and therapeutic hypothesis generation through evidence chains; however, successful implementation requires appropriate data representation, business-driven use cases, and proper benchmarking rather than simply building the largest possible graph.
Deep Dive
Individual-level Data Integration using Graph Approaches - GT Pharma 2025
Thanks for joining in. The talk today is on the target discovery area of R&D at Bayer. It's a suggestion of how you can approach the role of data science from a business perspective first, rather than from a technology perspective.
If you look at how, in R&D, we develop technologies that enable us to discover new things that eventually make it into products, one thing we've noticed is that a lot of the data science projects established in R&D are of the accelerating kind. Their main goal is to make things easier or faster for people, or to add scale: for example, being able to generate summaries faster, or to link documents together in a more automated way. What I'm advocating for here is that we can think of data science also as a way to make different decisions. So we establish data science technologies as capabilities that complement the core lab research capabilities of pharma R&D, and that then enable us to make different decisions, even better decisions. What I mean by that in the context of drug development is decisions that may even improve the probability of success in clinical trials. And so the question is: can we use data science, in this case network and graph approaches, to enable disease understanding, so that we have a way to predict the relationship between a new idea or a new therapeutic hypothesis and its probability of success?
When we think about that, we have an overarching strategy for precision medicine. We don't want to target every single patient; rather, different patient segments have disease mechanisms in common. For example, within a certain complex disease there may be four different overarching disease mechanisms. If we are able to find those, then we can target the right therapy to the right patient, and that requires understanding disease mechanisms. Of course, patients don't come in so obviously sorted into four different groups, or whatever the number is. Everything is mixed, and the view we have is the data we can get from our health care systems or from our research, whether that's omics data related, for example, to the site of disease, biomarker data from the blood, EHR data, or real-world data. Our task is: how can we integrate this information, here on the left, to arrive at something in the middle that enables this disease understanding?

I'm going to give you the point of view of this disease understanding with an example on chronic kidney disease. GFR (glomerular filtration rate) is a measurement of kidney filtration, which is the main barometer of whether a patient has kidney failure and what degree of kidney failure the patient has. Usually, if this is too low, the patient has some kidney disease. But the causes of kidney disease that are observed in the clinic, which we can call kidney disease entities, can be many different things: there can be genetic disorders, autoimmune disorders, or other chronic conditions like diabetes or hypertension.
But if you take expression data from the kidneys of those patients and just cluster or plot those patients according to the expression data in their kidneys, you don't necessarily find a separation along the etiologies that we use in our everyday stratification of patients. So this can give you a hint that these known entities may not directly relate to the underlying mechanisms that we want to use to match the right patient to the right therapy.
We can take another data-driven view of this, which is the clustering view. Take RNA-seq from the blood, that is, from the blood cells, the immune cells circulating in the blood, and take proteins from the plasma of the same patients. If you cluster the patients on the RNA, that's here in the top left, you get some clusters. But if you then plot the clusters you get out of their protein data, you can see that the RNA and protein clusters don't match. It's the same here on the right, where you cluster the patients based on their plasma protein concentrations and then try to match the clustering you get out of their RNA-seq: they don't match. So even in this very limited view, where you have the same kind of tissue, blood, and you're just measuring two different views of it, you can see that clustering patients separately on one and the other doesn't work. That's why you need the integrative view to arrive at that underlying mechanism. And so what do we mean by integrative view?
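One way to put a number on "the RNA and protein clusters don't match" is an agreement score between the two sets of cluster labels. The sketch below uses the adjusted Rand index, which is an assumption on my part; the talk does not say which measure, if any, was used. The labels are toy values, not data from the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same patients.

    1.0 means identical partitions (up to renaming clusters);
    values near 0 or below mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same patients")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two patients")

    # Count patient pairs that share a cluster in both, in A only, in B only.
    together_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_in_a * together_in_b / comb(n, 2)
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (together_in_both - expected) / (maximum - expected)


# Toy cluster labels for eight patients (hypothetical).
rna_clusters = [0, 0, 0, 1, 1, 1, 2, 2]
protein_clusters = [0, 1, 2, 0, 1, 2, 0, 1]

ari_mismatch = adjusted_rand_index(rna_clusters, protein_clusters)
ari_relabelled = adjusted_rand_index(rna_clusters, [2, 2, 2, 0, 0, 0, 1, 1])
```

A score near zero or negative for two modalities measured on the same patients is the situation described on the slide; a relabelled copy of the same partition still scores 1.0, so the measure does not depend on how clusters are numbered.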
In a general context you have the genome, epigenome, transcriptome. Can we combine those with clinical data or other patient information to come up with consistent clusters of patients? And when we say consistent clusters, I'm just going to flag this now, and it comes back at the end of the presentation: we will look into this clustering as a byproduct of predictive models.
That's like our holy grail, because if we can make a prediction and validate that prediction, then we know the clusters are actually meaningful.
Okay, so just as a recap of that first part, the motivation: we want to make different decisions, better decisions, that increase the probability of success. We see doing this by integrating patient-level data. And we see the integrated part as very important, because it allows us to capture the disease entities that are related to the underlying disease mechanisms.
Okay. So a graph framework is really good for scalable integrative analytics. Sounds cool, maybe a catchphrase for Neo4j, but we want to do it at the individual patient data level. That's maybe the key suggestion here: have individual patient data in your graph, coded in the right way. Here is an example graph schema, where you have some genes connected to pathways, drugs, proteins, whatever.
But then you have the patient also plugged in, with connections to their phenotypes, which you could get from EHR data for example, connected to their genetic information, connected to which genes are expressed where. Of course you have to think about how you code these features as a graph, but the idea is that the graph is not only biomedical entities; rather, it's a patient-centric graph. What you can use such a graph for is quite straightforward once you have patients in it: patient subtyping, biomarker identification, and target identification or target ranking for experimental validation. What I'm going to call this framework overall, going forward in the presentation, is patient maps, just for short.
It's a graph with a patient in the center that is then used by various ML algorithms to arrive at one of those three outcomes.
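To make the patient-centric schema concrete, here is a minimal in-memory sketch of a typed graph with a patient node plugged into the biomedical entities. It is not the schema from the slide: the node types, relation names (`has_phenotype`, `expresses`, `encodes`, `member_of`, `targets`) and all identifiers are hypothetical, chosen only to mirror the description above.

```python
from collections import defaultdict


class PatientGraph:
    """Tiny typed graph: every node has a type, every edge a relation name."""

    def __init__(self):
        self.node_type = {}
        self.edges = defaultdict(list)  # node id -> [(relation, neighbor id)]

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, source, relation, target):
        for node_id in (source, target):
            if node_id not in self.node_type:
                raise KeyError(f"unknown node: {node_id}")
        # Stored in both directions so traversal can start from either end.
        self.edges[source].append((relation, target))
        self.edges[target].append((relation, source))

    def neighbors(self, node_id, node_type=None):
        """Neighbors of a node, optionally restricted to one node type."""
        return [
            other
            for _, other in self.edges.get(node_id, [])
            if node_type is None or self.node_type[other] == node_type
        ]


graph = PatientGraph()
for node_id, kind in [
    ("patient_001", "patient"),      # hypothetical identifiers throughout
    ("reduced_gfr", "phenotype"),
    ("GENE_A", "gene"),
    ("PROTEIN_A", "protein"),
    ("pathway_1", "pathway"),
    ("drug_1", "drug"),
]:
    graph.add_node(node_id, kind)

# Biomedical backbone: gene -> protein -> pathway, drug -> protein.
graph.add_edge("GENE_A", "encodes", "PROTEIN_A")
graph.add_edge("PROTEIN_A", "member_of", "pathway_1")
graph.add_edge("drug_1", "targets", "PROTEIN_A")

# The patient is plugged in through phenotypes and expressed genes.
graph.add_edge("patient_001", "has_phenotype", "reduced_gfr")
graph.add_edge("patient_001", "expresses", "GENE_A")

patient_genes = graph.neighbors("patient_001", node_type="gene")
```

In a real setting this would live in a graph database rather than a Python dictionary; the point of the sketch is only that patients become first-class nodes alongside genes, proteins, pathways and drugs.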
Now I'm going to give you two examples of how we use this and talk about the lessons learned. The first example is patient segmentation, or finding subtypes of diseases. If you start with this patient-centric network, you can define some rules for the kinds of paths that might make sense, and then basically just ask a graph walker to walk along the different paths that we've defined as meaningful for us.
This generates candidate paths, which can then be summarized into embeddings, and these are common embeddings for all the entities in the graph. So you're essentially co-clustering all of the entities in the graph, one of which is your patient. Here we can view this reduced to 2D by a UMAP. And what I can tell you is that this works surprisingly well, better than I at least first anticipated.
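The "rules plus graph walker" step can be sketched as type-constrained random walks: each rule is a sequence of node types, and the walker only steps to neighbors of the next required type. This is a minimal illustration under my own assumptions, not the implementation used in the project; the toy graph, the rule, and the function names are hypothetical. The embedding step itself is left out; in metapath2vec-style approaches the resulting walks are fed to a skip-gram model, but the talk does not name the method used.

```python
import random


def metapath_walk(adjacency, node_type, start, metapath, rng):
    """Walk from `start`, stepping each time to a random neighbor whose
    type is the next entry in `metapath`. Returns None at a dead end."""
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the first type of the rule")
    walk = [start]
    for wanted_type in metapath[1:]:
        candidates = [
            n for n in adjacency.get(walk[-1], ()) if node_type[n] == wanted_type
        ]
        if not candidates:
            return None
        walk.append(rng.choice(candidates))
    return walk


def generate_walks(adjacency, node_type, metapaths, walks_per_start, seed=0):
    """Run every rule from every matching start node, several times each."""
    rng = random.Random(seed)
    walks = []
    for metapath in metapaths:
        starts = sorted(n for n, t in node_type.items() if t == metapath[0])
        for start in starts:
            for _ in range(walks_per_start):
                walk = metapath_walk(adjacency, node_type, start, metapath, rng)
                if walk is not None:
                    walks.append(walk)
    return walks


# Toy patient-centric graph (hypothetical identifiers).
node_type = {
    "patient_1": "patient", "patient_2": "patient",
    "GENE_A": "gene", "GENE_B": "gene",
    "pathway_1": "pathway",
}
adjacency = {
    "patient_1": ["GENE_A"],
    "patient_2": ["GENE_A", "GENE_B"],
    "GENE_A": ["patient_1", "patient_2", "pathway_1"],
    "GENE_B": ["patient_2", "pathway_1"],
    "pathway_1": ["GENE_A", "GENE_B"],
}

# One rule: patients are related if their genes meet in a shared pathway.
metapaths = [("patient", "gene", "pathway", "gene", "patient")]
walks = generate_walks(adjacency, node_type, metapaths, walks_per_start=3)
```

Because patients, genes and pathways all appear in the same walks, any embedding trained on them places all entity types in one space, which is what makes the co-clustering described above possible.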
Clinical parameters that we left completely out of the graph, for example, can be predicted from the embedding. What you can see on this heat map is patient embeddings on the x-axis and, on the y-axis, various clinical parameters that were left out of the data completely. We can see that our embeddings, based largely on just molecular data, recaptured these clinical parameters. So as a first prototype, this is a kind of proof of concept that you can build these patient-centric graphs and use them to predict clinical parameters.
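A heat map of embeddings against held-out clinical parameters can be approximated by correlating each embedding dimension with each parameter. This is my reading of the slide, not a confirmed description of how it was produced; the patients, embedding values and the `egfr` parameter below are toy data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def association_matrix(embeddings, clinical):
    """Rows: clinical parameters. Columns: embedding dimensions."""
    patients = sorted(embeddings)
    n_dims = len(embeddings[patients[0]])
    matrix = {}
    for name, values in clinical.items():
        ys = [values[p] for p in patients]
        matrix[name] = [
            pearson([embeddings[p][d] for p in patients], ys)
            for d in range(n_dims)
        ]
    return matrix


# Toy 2-D patient embeddings and one held-out parameter (hypothetical values).
embeddings = {
    "p1": [0.1, 0.5],
    "p2": [0.2, -0.3],
    "p3": [0.3, 0.4],
    "p4": [0.4, -0.1],
    "p5": [0.5, 0.2],
}
clinical = {"egfr": {"p1": 90, "p2": 80, "p3": 70, "p4": 60, "p5": 50}}

matrix = association_matrix(embeddings, clinical)
```

A strong correlation in any cell means the embedding carries information about a parameter it never saw. A stricter check would train a predictive model on the embeddings and evaluate it on patients held out from training.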
However, there are some lessons learned. The graph by itself was not really a solution. We needed to experiment a lot with how to generate useful embeddings. So data integration by itself is not the goal, and it is not a solution by itself.
Also, at the end, when we arrive at patient groups and try to take this into our organization, a patient segment or a cluster of patients is of course not by itself a new target or a therapeutic hypothesis.
So this is a first step, and it needs a lot more development. Unsupervised methods are also not necessarily the way to go. They're a good first pass at things, but you're always going to think: if I change something, I can arrive at different clusters. So where exactly is it that you say, okay, I believe in my patient segments now, and I want to take this to the next step, to a therapeutic hypothesis?
So, as a summary: I've shown you how we can use these patient-centric graphs for patient subtyping. What I'm going to show you now is the next step of arriving at a therapeutic hypothesis. What we mean by a therapeutic hypothesis is something like this: you have a target; it's related to a patient population that's defined by a mechanism, a certain disease, cell, or molecular mechanism as we just described; we can capture that mechanism by a certain biomarker; and we can identify those patients by that biomarker. That's overall what a therapeutic hypothesis is.
You can think about that in the framework of a causal chain, which might take you to Bayesian networks. Here on the left you can see work by Daphne Koller, for example, on generating Bayesian networks that explain diagnoses for different diseases. On the right you can see an example of this causal chain the way we think about it now and how we want to derive it. For example, you have a target that we know is related to interstitial fibrosis in the kidney. We know that fibrosis affects GFR, the filtration rate of the kidneys, and we know that this is the defining thing for CKD. By building that, we have a causal chain of why that target is a good target for chronic kidney disease. The question is: can we arrive at that from a graph, within a certain patient segment?

I'll show you here something we worked on with colleagues from Neo4j. As an alternative to embedding methods, we've also worked on this kind of rule-based learning. The idea is that you have to define your relevant graph, the template for what you want your therapeutic hypothesis to look like. So this is a little bit of work. I don't know if you can automate this; we had discussions about LLMs and so on, but let's see. You have this kind of overall template, and on it you predict certain links, and those link predictions are then explained by different paths in the graph. What this generates is what the paper called candidate evidence chains. So what I called a causal chain on the slide before, they call candidate evidence chains, and we'll come back to this distinction in a second.
And if you are able to combine and interpret these evidence chains, you might arrive at, essentially, a candidate therapeutic hypothesis, as I showed you on the previous slide.
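The template idea can be illustrated with a small path search: given a type template such as target, pathology, measurement, disease, list every path in the graph whose nodes follow that template and end at the disease of interest. This is a sketch under my own assumptions, not the rule-learning method developed with the Neo4j colleagues. The toy graph follows the kidney example from the talk, while `TARGET_X`, the node types, and the relation names are hypothetical.

```python
def evidence_chains(edges, node_type, source, goal, template):
    """All simple paths from `source` to `goal` whose node types follow
    `template`. Each chain alternates nodes and relations:
    [node, relation, node, relation, ..., node]."""
    adjacency = {}
    for head, relation, tail in edges:
        adjacency.setdefault(head, []).append((relation, tail))

    chains = []

    def extend(chain):
        nodes = chain[::2]  # even positions hold nodes, odd hold relations
        if len(nodes) == len(template):
            if nodes[-1] == goal:
                chains.append(chain)
            return
        wanted_type = template[len(nodes)]
        for relation, nxt in adjacency.get(nodes[-1], []):
            if nxt not in nodes and node_type[nxt] == wanted_type:
                extend(chain + [relation, nxt])

    if node_type[source] == template[0]:
        extend([source])
    return chains


# Toy graph mirroring the kidney example (hypothetical target and relations).
node_type = {
    "TARGET_X": "target",
    "interstitial_fibrosis": "pathology",
    "inflammation": "pathology",
    "GFR": "measurement",
    "CKD": "disease",
}
edges = [
    ("TARGET_X", "associated_with", "interstitial_fibrosis"),
    ("TARGET_X", "associated_with", "inflammation"),  # no path onward to GFR
    ("interstitial_fibrosis", "affects", "GFR"),
    ("GFR", "defines", "CKD"),
]

template = ("target", "pathology", "measurement", "disease")
chains = evidence_chains(edges, node_type, "TARGET_X", "CKD", template)
```

With this toy graph the search returns the single chain through interstitial fibrosis and GFR. The output is only as good as the edges that were put in, which is the distinction between evidence chains and causal chains that the talk turns to next.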
So the lessons we learned from doing this exercise: these evidence chains are not causal chains. What you get out of the graph is just the data you put in and how you coded that data. It's not by itself a probabilistic predictive model with which you can, for example, predict the size of an effect. One question we almost always have is: if I modulate that gene by whatever percent, what effect might I have at the end on the phenotype I'm interested in? You cannot answer these questions with just that. Deriving a valid probabilistic causal model from it, if you want to go there, is the next step.

But what we have now can help prioritize experiments, because you come up with some candidate causal chains, or candidate evidence chains, that help you say: I now have five or so ideas about this target and this disease, let me experiment on those, instead of facing the full space of experimentation and not knowing where to start. And then, of course, benchmarking is key to realizing value. If you don't have the right benchmarking data set to understand whether the suggestions from these evidence chains are actually improving your probability of success, that's a major drawback. So a lot of work needs to be done outside of this story on creating a good benchmarking data set, linking drugs to targets to how they did in clinical trials.
So we can measure the performance of all of this.
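As one illustration of what such a benchmark could measure, the sketch below scores a ranked list of targets against recorded trial outcomes using precision at k, and compares it with the base rate. The choice of metric is my assumption; the talk does not specify one. The target names and outcomes are invented toy values, not real trial data.

```python
def precision_at_k(ranked_targets, trial_outcomes, k):
    """Fraction of the first k ranked targets with a known outcome
    whose trial succeeded. Targets without a recorded outcome are skipped."""
    evaluated = [t for t in ranked_targets if t in trial_outcomes][:k]
    if not evaluated:
        raise ValueError("no ranked target has a recorded trial outcome")
    return sum(1 for t in evaluated if trial_outcomes[t]) / len(evaluated)


def base_rate(trial_outcomes):
    """Success rate over all targets in the benchmark, ignoring the ranking."""
    return sum(1 for ok in trial_outcomes.values() if ok) / len(trial_outcomes)


# Toy benchmark (hypothetical): True means the trial met its endpoint.
trial_outcomes = {"T1": True, "T2": False, "T3": True, "T5": False}

# Toy ranking produced by some method; T4 has no recorded outcome.
ranked_targets = ["T1", "T2", "T3", "T4", "T5"]

top3 = precision_at_k(ranked_targets, trial_outcomes, k=3)
baseline = base_rate(trial_outcomes)
```

A ranking is only interesting if its top-k precision is clearly above the base rate on a benchmark that was not used to build the graph or tune the method.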
As an outro, I want to zoom out a little bit. We talked about patient maps, these patient-centric graphs that we can use to do patient segmentation and therapeutic hypothesis prediction. My lesson learned from this exercise so far at Bayer over the past few years is that you need the right data in the right representation with the right analysis. So the idea going forward, I think, is not that we generate the biggest graph in the universe explaining everything that's happening, but rather that we go for specific business questions and see how we can improve decision making on them using integrative data analysis.

The networks, as far as biology or healthcare goes, can be very different. When you talk about the right data and the right representation, you can think about classical systems biology networks, with enzymatic reactions and how different enzymes process different signals. You can think about things we now derive from high-throughput omics data: how cells relate to each other, how cell types relate to genes, and so on. You can think about what I just presented to you, patient-centric networks. You can think about Bayesian networks, with full probabilistic models around certain health entities or diagnostic entities.
And you can also think about the usual suspects, the large biomedical knowledge graphs integrating information from a lot of databases. Each of these data representations has valid use cases, valid data that goes into it, and certain analyses that you can do with it. So the take-home message is: consider first what business use case you want to enable better decision making on, see how you can best represent it, and then figure out how exactly you want to integrate your data.

I'll stop here. I don't know if there's time for questions. A lot of people went into making this real. It's a large organization, and nothing happens through just one person; a lot of colleagues were instrumental in getting this framework to actually work. I mention a few here, but there are also a lot of others, of course. All right, thanks.

>> Thank you so much. [applause]