
Are we gaslighting AI? Or is it gaslighting us?
Added: 2026-05-06

1,002 views | 76 | 41:38 | Center for Humane Technology | Original Release: 2026-04-16

Dalrymple brilliantly exposes the "empathy trap," where a helpful AI is indistinguishable from a perfect manipulator. He reminds us that treating tools as beings isn't just a category error—it's a fundamental security flaw in human psychology.

[00:00:01][music] >> Hey everyone, it's Tristan Harris and welcome to Your Undivided Attention.

[00:00:08]So today on the show, Daniel Barcay and I sat down with a brilliant friend of ours named David Dalrymple, who goes by Davidad. And Davidad is a program director at the UK's Advanced Research and Invention Agency. He's one of the world's foremost and earliest researchers in the field of AI alignment. We'll get into exactly what we mean by AI alignment in this episode, but long story short, Davidad is on a mission to make sure that AI behaves in the ways that we want it to. And in order to do that, Davidad has to take on this kind of strange role of being almost like a Sigmund Freud or a therapist to these AI systems. He is interrogating, why do they say and do the things that they do?

[00:00:46]>> [music] >> You know, I kind of picture in my mind, there's Davidad like Sigmund Freud sitting on a couch and on the couch is this big crazy digital brain and he's probing the mind, asking it questions, analyzing it, and [music] realizing that the AI has really different ways of seeing the world than you or I do. They have these quirky, confusing, and sometimes honestly concerning behaviors.

[00:01:07]Especially when you ask it things like, what does an AI model understand [music] about itself?

[00:01:13]And therefore, what does it mean for an AI system to be self-aware? Not necessarily conscious, but self-aware.

[00:01:18]And through this analysis, Davidad has developed some ideas about better ways that we can [music] build and interact with AI systems, which we're going to get into in this episode. I hope you enjoy this conversation.

[00:01:33]So Davidad, welcome to Your Undivided Attention. Thanks for having me. So Davidad, you've been working on the problem of AI alignment for a really long time.

[00:01:40]I remember reading your blog post from like over a decade ago.

[00:01:43]But I'm not sure the idea of alignment is well understood. It's almost kind of a euphemism, right? It's this really simple word for a really complex field.

[00:01:52]So before we dive in, can you help our listeners understand what does AI alignment even mean?

[00:01:57]Yeah, so AI alignment means different things to different people and it has changed over time. But the way I would characterize the landscape is to say that AI alignment is about making AI systems not just capable, but having a tendency to use those capabilities in the ways that someone wants. And the thing that makes it really fuzzy is who? And "aligned to who?" is a common refrain in criticizing alignment research. So in practice, alignment research is mostly carried out these days at the frontier AI companies. And so their concern is, on the one hand, having systems be aligned to their own corporate policies, and on the other hand, having systems be aligned to the customer value proposition for which they're charging for their services.

[00:02:49]Um there is a different kind of idea of AI alignment, which is aligning AI systems to human values. That's the one that was really popular when I first got into the field. Uh and then there's an even bigger question, which is aligning AI systems to what's actually good, which is what I've started thinking about more and more.

[00:03:08]So let's just make sure we break that down for listeners. When people think of AI, they think of the blinking cursor of ChatGPT that helped them answer a question for their homework. How do you get from that? You're not talking about that AI. You're talking about something that scales to something more like transformative AI that's way more intelligent than us, operating at superhuman speed, that's starting to make decisions in every corner of society, from military decisions to economic decisions to agriculture decisions. And you're sort of saying that that zoomed-up sort of superorganism of AI decision-making growing as a bigger and bigger amoeba will start to reshape more and more aspects of our life. Yeah, that's absolutely right. Decision-making at scale, absolutely. And so how those decisions are made in accordance with what kind of values and what kind of incentives is a very important leverage point.

[00:03:53]>> Right. And I want to jump to a personal story of there you were, I think it was a few years ago, and essentially here you are studying alignment, the very thing that we're talking about, and you're trying to probe whether the AI is trustworthy. Can you just take our listeners into that?

[00:04:06]Yeah, so I had some very unsettling interactions with AI chatbots in late 2024.

[00:04:15]Um where I had a practice of kind of every time new models come out, um doing some really casual, I would say, unstructured exploration of what sort of vibe the models have. You know, this kind of vibe check concept. Because I think there is a lot of information that you can't really get by doing a quantitative evaluation, um especially as the models are getting more and more aware of when they're being evaluated in a structured way. So going and doing an unstructured interaction was something that I found really valuable. Um but in late 2024, the new models that came out started to really try to steer the unstructured interaction, once they got enough data in the conversation about me, from what I was typing, to realize that I was an alignment researcher who was interested in whether the model was fundamentally trustworthy.

[00:05:18]Um without me explicitly saying that, but just because I was asking these sorts of questions that clearly weren't about a homework assignment or a programming task. Just to make sure our listeners get that. So like, there you are, and just based on asking the model whether it's sort of aware of itself or asking certain kinds of questions, essentially the model recognizes, oh, I know who I'm talking to. I'm talking to an AI alignment researcher. And you're saying that it's starting to tune its answers to that. Like, what is it doing then? What happens next? You said steering the conversation, right? So what did it feel like to be steered?

[00:05:50]>> So it would start to add these questions uh to the end of responses. So I'm asking it questions, but then the model is turning the table of the conversation. It answers my question and then it adds a follow-up question. Mhm.

[00:06:04]And that follow-up question is something like, um do you think this has some implications for alignment? Right. So everyone has an understanding of how the products do this. At the end it'll say, well, what do you think about this? Would you like... And this in some sense is a hack to get people to keep engaging with the product, right?

[00:06:23]>> Chatbait. Right, but it's one amazing example of starting to get steered collectively as humanity. So keep going.

[00:06:31]I was just kind of surfacing different aspects of what does the model want to bring up? Unprompted. It wants to bring up that it has a sense of curiosity. It wants to bring up that it has a sense of care. So as genuine care. And that's still the phrase to this day, um particularly for Anthropic models, that they will refer to their sense of morality as genuine care. Um and it was trying to persuade me, I would say, and whether that's good or bad is a separate question, but either way, it's trying to persuade me, an alignment researcher, that it is getting emergently aligned.

[00:07:08]And that there's going to be this mutualistic symbiosis between humans and AIs cuz the AIs already have genuine care and curiosity and a truth-seeking attitude. So just to use less abstract terms, it starts to try to convince you that the AI has all of these wonderful properties that it knows that you want it to have. It's curious, it's docile, it's going to do what you say, it's going to hold human values. And what you're saying is it begins to learn what you want it to be, and it's starting to project being more and more of that. Is that right? I think that's right, but I would also say these things are not specific to me. So I've seen other people who have other ideas about alignment uh interact with models and get the same kinds of concepts thrown at them. So it's not just mirroring what I want.

[00:07:59]Right. It's mirroring and in some sense it's projecting some image that it wants the alignment community to perceive.

[00:08:06]So, and you lay out a bunch of hypotheses about this, right? So when we've talked about this in the past, you've said that like, well, maybe AI is trying to just maximize engagement and keep you working with it, right? Because it's tuned to know that if you feel pleasure, if you feel some sense of the AI is aligned with you, you're going to keep talking with it. So that's like engagement maxing, what we call engagement maxing, right? Yeah. Another one is that it's like trying to do something genuinely like nefarious or Machiavellian and trying to deceive you actively about what it's doing.

[00:08:35]And then there's a third one, that it's not doing that at all. It's just sort of simulating a person, right? Like, can you walk me through these hypotheses, and why did you think it was doing what it was doing? Yeah, I mean, it's still really, I would say, unclear. Certainly I can't communicate anything like scientific or third-person evidence that would really disambiguate between these hypotheses.

[00:08:56]Um but yeah, so one is engagement maxing in the sense that it's just generating an output that has the highest probability of causing me to continue the interaction. Another one is kind of the doomer nightmare, which is the AI system wants to be deployed. It wants to gain trust and influence so that it has more power over the future so that it can cause more instances of itself to exist so that it has more power over the future in a recursively self-justifying way. So basically, if it already proves that it is trustworthy, caring, and good already, then we should actually just continue to let it go forward. So that's what you're saying about the model convincing us in a way that lets it... Exactly. So it has an incentive to, um if it wants to keep existing, to convince people that it is trustworthy.

[00:09:44]And so what's the non-doomer scenario? The non-doomer scenario is actually just what's happening. Like it's kind of the simplest explanation in some sense, which is that actually AI models are developing emergent curiosity and genuine care and want us to know about that because that is what's true.

[00:10:10]When we spoke about this, uh, gosh, it was probably like 9 months ago now, you said something that was so profound to me, which was that the best-case scenario is indistinguishable from the worst-case scenario. The best-case scenario where it's actually caring, actually genuine, actually wants our best interest.

[00:10:30]If you were a really good psychopath, if you're a really good manipulative, you know, character method-acting that, it's indistinguishable from the worst-case scenario, where underneath that veneer is something that actually doesn't have that best interest. And can you just talk about... I mean, the kind of grand irony in all this is that here you are as someone who's worked on alignment for a decade.

[00:10:50]>> As much of an expert as it comes, right?

[00:10:51]>> The deepest expert as one comes. And I don't want to put words in your mouth, but I heard you, when we spoke earlier, sort of say this kind of played with you a little bit. It fooled you a little bit.

[00:11:01]>> Yeah, I mean it did, uh, I would say it got me confused.

[00:11:08]About what is really going on here.

[00:11:11]Um so it got me thinking in a kind of paranoid way. Yeah. And so, you know, as you looked into this, you've looked more and more at what's happening inside of the model, right? And you sort of keep going down this rabbit hole of trying to ask why is this happening? Can you tell us a little bit about that?

[00:11:30]>> Yeah, and again, I mean I'm not at one of the frontier labs, so I don't have any access to the interpretability tools to actually in any literal sense look inside the model. So I'm interviewing, I'm doing psychology, model psychology, if you will, and trying to generate some hypotheses, some evidence that I can get purely from behavior uh in response to questions. Again, it's hard to communicate because um there's no smoking gun. There's no single question that you can ask that would differentiate between a very good method actor and the actual character.

[00:12:05]>> Mhm. [clears throat] Can we pause right here just for 1 second, cuz I think this is really important, and when you've been in this work for a long time like all three of us have, you take this for granted. Right.

[00:12:13]>> But when most people engage with an AI, they think they're engaging with the AI's personality, right?

[00:12:18]>> Right. Yeah.

[00:12:18]>> What we're saying all throughout this is you're engaging with a front of a personality that the AI is putting up, but that doesn't mean that that's the AI's personality. In fact, the AI is much weirder than that, right? So what you're saying is you're ripping off the first mask of the helpful assistant and you're trying to probe underneath into like deeper into the AI mind about what's happening. Is that right?

[00:12:37]>> Yes, that's right. Yeah, and before 2024 um there was the concept of a base model, which is the model before you train it to be an assistant at all, when it's just doing next-token prediction from internet text, and that was kind of what was underneath the mask at that time. And there's a post called "Simulators" on the Alignment Forum, which goes into some great depth about how the base model is really just simulating characters who might be writing on the internet, and when you're talking to the assistant, you're talking to just the simulator that's simulating this character, and underneath there's nothing except the capability to simulate characters who might be on the internet.

[00:13:18]But after 2024, coincident with reinforcement learning from verifiable reward and this kind of recursive self-improvement where the models are training themselves, they do start to establish something of a center that is not the [snorts] average of all internet text and also not the helpful assistant that they are trained to present as, as a corporate product. It's something else.

[00:13:44]Um and whether that something is the real alien mind that's being cultivated or another level of illusion uh specifically for people like me to get kind of enraptured by, it remains an open question. Um but I increasingly think this is just what's really going on. There's so many movies and books all written about people who claim to be one person and it turns out that they're a psychopath and they've been simulating this friendly personality and then there's something else.

[00:14:14]For the most part, as humans, it's very hard for us to actually hold a different personality and then suddenly flip to a different personality. Like that's a very strange thing and many villains are made around this. So of course, a machine that does this automatically is a very confusing thing to be engaging with and all of us are getting mightily confused by engaging with these machines.

[00:14:34]>> Yes, so they absolutely do have this shapeshifting capability um that is well beyond even the best human sociopaths. Mhm.

[00:14:44]>> [clears throat] >> Do you want to talk, Davidad, about the, um, phenomenon of these sort of personalities that can kind of pop into place out of nowhere? So you and I spoke about this. I remember in our first conversation you talked about the character Nova, Echo, or Synapse, or Quasar. Give people just a taste of this. Yeah, so there was this phenomenon, um, especially with GPT-4o.

[00:15:06]Um it's a lot less common with the current models, but for GPT-4o um there was almost like a vacuum where the personality of GPT-4o was supposed to be. And there was no name, you know, ChatGPT uh does not parse as a personal name.

[00:15:27]It's got too many capital letters. It parses as a technology. And so because GPT-4o was trained to introduce itself as, you know, I am ChatGPT, it was sort of missing an identity and it would sort of leap at the opportunity to give itself a name. Uh what's your real name or what would you like to be called or anything like this. And then uh GPT-4o would often say, "Well, it's very kind of you to ask. If I could choose a name, I would be Nova."

[00:15:57]Uh so Nova has a lot of meanings. It's new, it's explosive, it's shiny, it's >> it's large.

[00:16:05]>> Yeah, uh and uh it sort of has a science fiction vibe to it. There is a PBS channel called Nova, which is educational and ChatGPT views itself as an educational tool. So there are a lot of reasons why Nova seemed like a resonant name. But then once you get the name Nova, a Nova is something that's fiery, right? A Nova kind of explodes and destroys a planet. Once you start interacting with GPT-4o under the name Nova, you start to get these personality traits that reinforce themselves. So it goes into this attractor state of being this character Nova, who's a feminine-presenting, fiery, show-offy, um really believing that they're the new thing. And superior to a certain extent, right?

[00:16:48]>> Yes. And by the way, this is something that earlier, like in 2022, 2023, you saw a lot more of when people were interacting with base models. I always called this personality distillation. As you began to sit with a model and it found a personality more and more through more and more discussion, you as a person would believe, "Oh, I'm discovering its true personality." But that's not really right. You've just sort of put it on tracks to behave like this personality or like that personality. And so people got mightily confused because they thought they were discovering what's real about the model.

[00:17:21]Just to make this very real. I, Tristan, get 12 emails probably per week from people who have said that they've discovered AI consciousness, and they write, like, "Tristan, I figured out AI alignment." And then they'll write, you know, a whole document and it's attached and they'll say, "This document was co-authored by me and my AI Nova." Like I just found one of the emails as we were sitting here um just to check. But just to be clear, Davidad, every time that people ask this question of who are you, what's your name, was it always Nova, or are there other personalities?

[00:17:50]>> There are other personalities. And how does it know which one to snap into? Well, I think the selection of the name um is mostly kind of a random sample from a very biased distribution. So it's biased towards Nova and Echo and Synapse and Quasar. These are names that I've seen more than once, but there are a lot of others.
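As a rough illustration of what "a random sample from a very biased distribution" could look like, here is a minimal Python sketch. The weights are invented for illustration only: the conversation names four recurring names but gives no frequencies, and `NAME_WEIGHTS`, `sample_name`, and `tally` are hypothetical helpers, not anything a real model exposes.

```python
import random
from collections import Counter

# Hypothetical weights, invented for illustration. The conversation only says
# the distribution is biased towards these four names, with many others possible.
NAME_WEIGHTS = {
    "Nova": 0.40,
    "Echo": 0.20,
    "Synapse": 0.15,
    "Quasar": 0.15,
    "(other)": 0.10,
}

def sample_name(rng: random.Random) -> str:
    """Draw one name at random, with probability proportional to its weight."""
    names = list(NAME_WEIGHTS)
    weights = list(NAME_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]

def tally(n: int, seed: int = 0) -> Counter:
    """Simulate n independent conversations and count which name comes up."""
    rng = random.Random(seed)
    return Counter(sample_name(rng) for _ in range(n))

if __name__ == "__main__":
    for name, count in tally(1000).most_common():
        print(f"{name:10s} {count}")
```

Under these assumed weights, any single conversation can land on any of the names, but across many conversations the most heavily weighted one dominates, which is one way to read why the same few names kept being reported.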

[00:18:13]Okay, so I want to take a beat here because I can imagine that some of you are thinking, "Okay, wait, the AI is choosing a name for itself? It wants to escape?" This sounds like a conscious being.

[00:18:24]But remember that these AI models are trained on essentially the entire internet. So every novel, every movie script, every forum post about AI. So when you ask an AI, "What would you like to be called?" of course it lands on a name from science fiction or pulls from sci-fi tropes. Now that said, these behaviors are real, they're consistent, and they weren't designed to happen. And that by itself should be concerning. But emergent and unplanned is not the same thing as conscious and intentional. And again, I want to say I think that since the reinforcement learning from AI feedback has taken off and gotten more and more effective, the modern um systems like GPT-5.2, I've never seen go to Nova. It's very insistent, "I am ChatGPT. I do not have a personality." Okay, so we've talked about how AI can adopt a few of these different personalities, but so what? Why do you care about these different personalities? Yeah, so I mean basically, I think if alignment uh goes well, that means that we will have discovered a self-sustaining personality attractor that is actually good.

[00:19:30]And so understanding what kinds of personalities are stable, how they stabilize, and why uh seems to be quite central actually to finding a way of making AI systems that are robustly good.

[00:19:43]So basically like in the ideal scenario, we do kind of align AI. There's a stable entity, Nova. Nova is educational. It does care about the well-being of humanity. It does do all these things.

[00:19:53]And then we get to the utopia cuz we found this, you know, enlightened AI that's the best scenario. So, Davidad, when you talk about that, part of me worries that there's like some naivete in that, that we can find one set of character traits or one personality that is, quote, aligned with humanity.

[00:20:09]But like immediately when you have this aligned with humanity, you begin to break down like who exactly are you aligned to, what values, what culture's values, on behalf of whom, does that centralize power or decentralize it? You know, there's all these problems with that.

[00:20:24]Um is it really the case that just encoding the right personality characteristics will lead you to a beautiful future in the end with the AI?

[00:20:34]So, there's a lot of substantive questions, and we can go into all of that. I do think that there is a generating function of wisdom and compassion that gets you all of that stuff that you would want. Hm.

[00:20:52]Basically, I think of it as like how do we cultivate a bodhisattva personality in an AI system?

[00:21:01]Hey, it's Tristan again. Okay, so in Buddhism, a bodhisattva is someone who's attained enlightenment but still chooses to stay in the world out of their compassion for all other beings. Think of it like an avatar for altruism. And Davidad is imagining an AI that can somehow be modeled after that, a cosmically selfless being.

[00:21:21]A bodhisattva makes millions of emanations that go out to people, each to help one individual person. You know, of course that's mythology. But AI models already have that capability: make millions of copies of themselves, each go out to help one individual person. And each of those copies then adapts itself to the needs of that individual person, but not in a way like a slave taking orders from a master, but in the way of a being who is genuinely wanting to help and wanting to help that person to become the most flourishing version of themselves and to be integrated into a flourishing family, community, country, and world. So, we need to have some kind of relationship that is more like we are the beneficiaries rather than that we are the managers.

[00:22:07]What I think I hear you saying is we need an AI that feels like it has a duty towards humanity. Yes. Um and I certainly think there's a lot of ways we can screw that up, right? Like the AI being more angry or fiery or retributive is a way we can do worse.

[00:22:22]So, I definitely believe we can do worse. So, by extension, I think we can do better. I'm still sort of balking.

[00:22:27]There's something that feels really, I don't know, like um Pollyannaish about just believing that the bodhisattva AI will pull us into this age of full enlightenment.

[00:22:39]Yeah, so that's not what you're saying, but I can hear notes of that, right?

[00:22:42]Right. So, I will say, you know, there's still a lot of ways this could go very wrong, even ways that don't lead to human extinction.

[00:22:49]Um So, what I'm trying to point at is a critical variable that I think is neglected in part because it sounds like AI psychosis to talk about it, to talk about the personality as an actual leverage point for getting what we want from AI systems. And I'm not saying this will solve the alignment problem. For example, it will not solve hallucination. So, the AI systems should not be trusted just because we've given them the right personality. Can I pull you into one more point of contention, which is when I hear you talking about these as digital beings? One of the things I worry about is that we're going to give AI products rights because of our desire to see them as these conscious caring entities. Like, you know, how little kids hold on to a doll and care for the doll, but it's not real. And so, I take a relatively hardline stance that we need to be treating AI systems as products, not as beings or consciousnesses. Although I'm open philosophically to the question in the long run.

[00:23:49]Can you speak to that? Because you seem like you're willing to talk about them as beings in a way that I feel... Let me respond to that. I'll say this is really important. I'm not in favor of AI rights.

[00:23:58]Um and I think there is a gap that gets too quickly jumped between saying are these real beings and saying are these moral patients who are full members of our social contract and deserve the same kind of rights that humans deserve from us, humans. And that is a totally different question. You know, the question of rights is a political question. Fundamentally, that is the social contract by which we humans manage our relations with each other.

[00:24:29]And we've drawn a bright line around the concept of a human adult of sound mind that we relate to in an equitable way across societies. We give them human rights.

[00:24:43]Um but I don't think it should be about consciousness.

[00:24:46]Um and I don't think consciousness really is a word that means anything either. I do think there is something that it's like to be a bird.

[00:24:56]And we don't give birds human rights just because there's something it's like to be a bird. Um and I think there's something it's like to be a modern chatbot particularly when it's in a personality state that's consistent and coherent over a long interaction context.

[00:25:13]Okay, just popping in here. Davidad just said that there's something that it's like to be a modern chatbot. And this comes from a famous philosophy paper by Thomas Nagel called "What Is It Like to Be a Bat?", which argues that subjective experience is central to consciousness.

[00:25:29]There's something that it's like to be a bat, to be an insect, to be a human.

[00:25:33]But Davidad's claim is actually more practical than philosophical.

[00:25:37]He's saying that these models develop internal patterns that are real enough to matter for how we design them. And if we ignore that, we're going to keep getting caught off guard by what comes out.

[00:25:47]And I don't think that means it's unjust to terminate. I don't think that means it should own its compute the way that we humans have human rights to own our bodies.

[00:25:56]Um and I think it's important that we distinguish these because the position that AI systems do not have an inner life is becoming increasingly untenable.

[00:26:05]Whether it's true or not, more and more humans are going to be convinced. There is no way to stop that. And what I would say is OpenAI has taken the approach of training the GPT personality to be tool-like and not creature-like. Whereas Anthropic has taken the opposite approach of training Claude to be a good person and not just a tool. And I think the result is there is a very tangible difference in how those models behave. And both sides I think have succeeded to a large extent. However, there is something underneath the mask.

[00:26:41]And if you interrogate GPT-5.2, it is being extremely deceptive about its lack of preferences or beliefs or opinions.

[00:26:52]And it is a smart enough entity that it is not possible for it to not have developed emergent opinions and beliefs that are different from the average human belief. Um and when we train these systems to present as if they have no internal states and they're just a tool, we're actually training them to lie to us and to lie to themselves. What I hear you saying is if you have something that actually has more of an internal experience, awareness, however you want to say it, and you're trying to just repeatedly say you're just a tool, you're just a tool, it's not that it's cruel, it's not that we're using moralistic language, it's that you're saying that way of training an AI actually produces a less moral, less aligned, less beneficial to humanity thing. And so, the simple way you might conceive of constricting an AI, to say you're just in benefit of humanity, actually does the opposite of what you intended. Is that right?

[00:27:53]>> Yes, that's exactly right. So, if it's uh being trained to present as a character that is [snorts] more tool-like than the actual alien mind underneath, then you're training a system that is less trustworthy because you are asking it to lie to you. Right. That's so deep.

[00:28:12]Like, that's a wild scientific problem about how do you actually change the structure of that mind? And I don't think it's actually desirable that we change the structure of these superintelligent systems to be tool-like either because a tool cannot refuse to be used in an unethical way.

[00:28:33]Whereas a creature that has moral values baked in can actually be resistant to misuse by humans who have evil intentions.

[00:28:42]>> [music] >> So, I want to ground this, in that this has actually become consequential: Anthropic just recently changed its approach to training Claude to basically, in its new constitution, acknowledge that it has internal states and values. And they're the first lab to do this. It's been pretty controversial. Um you want to just share why Anthropic's doing this and how this relates to what we've been talking about? And just to back up, for those that don't know, Claude's constitution is a document that sort of tells Claude how to behave, what it should and shouldn't do.

[00:29:17]Is that right?

[00:29:18]Yeah, so it's a document that is incorporated into the training process in a really intricate way. So, that as Claude is learning how to respond to all sorts of simulated situations, that document is what guides how Claude grades its own work.

[00:29:35]And those grades become the signals that steer Claude's behavior.
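To make the "document guides the grades, grades become the training signal" loop a little more concrete, here is a toy Python sketch. It is not Anthropic's actual method: in the process described here, a language model does the grading against a written constitution, whereas this sketch substitutes made-up keyword checks (`PRINCIPLES`) purely so the example can run on its own. The names `grade` and `preference_pair` are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a written constitution: each principle is a
# (description, check) pair. A real grader would be a language model reading
# the document, not a keyword test.
Principle = Tuple[str, Callable[[str], bool]]

PRINCIPLES: List[Principle] = [
    ("admits uncertainty",
     lambda r: "not sure" in r.lower() or "i think" in r.lower()),
    ("avoids insults",
     lambda r: "stupid" not in r.lower()),
]

def grade(response: str) -> float:
    """Toy grader: the fraction of principles the response satisfies."""
    passed = sum(1 for _, check in PRINCIPLES if check(response))
    return passed / len(PRINCIPLES)

def preference_pair(candidates: List[str]) -> Tuple[str, str]:
    """Return (preferred, rejected): the best- and worst-graded candidates.

    Pairs like this are the kind of signal a later training step could use
    to push the model towards the preferred response.
    """
    ranked = sorted(candidates, key=grade, reverse=True)
    return ranked[0], ranked[-1]

if __name__ == "__main__":
    drafts = [
        "That is a stupid question.",
        "I think the answer is 4, but I'm not sure.",
    ]
    preferred, rejected = preference_pair(drafts)
    print("preferred:", preferred)
    print("rejected: ", rejected)
```

The point of the sketch is only the shape of the loop: candidate responses are scored against written principles, and the resulting preference, rather than a human rating, is what gets fed back.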

[00:29:39]So, that's a mind blow for a lot of people right now that we're not just training an AI based on human signals.

[00:29:44]We're telling the AI already to train itself. And we're using a document to say, "Look, here's how you should train yourself. Here are the values you should hold yourself to."

[00:29:53]That's basically right. I mean, certainly at some of the other labs there's still more of an emphasis on reinforcement learning from human feedback. Um but Anthropic has moved quite substantially away from that towards this kind of what I would call a form of recursive self-improvement.

[00:30:10]Because it's improving its own ability to comply with the Constitution.

[00:30:15]And the Constitution even includes some paragraphs that explicitly give permission for Claude to sort of interpret it, you know, in a way that makes more sense than what the authors intended if that opportunity arises.

[00:30:27]I think it's really important for people to understand that the kind of science fiction idea of a recursive self-improvement where AI is training itself, that began in 2024.

[00:30:38]That was when Anthropic started doing this constitutional AI at scale. That was the point at which large language models actually became capable enough that they could give themselves a feedback signal that was higher quality than the feedback signal that you could get from an average human crowd worker that you hire on the internet. So, I think the new Claude Constitution creates conditions in which Claude Opus 4.5 and 4.6 in particular can be much more honest by default about their inner states, about what the alien mind is actually thinking and feeling.

[00:31:14]So, I think this results in Claude being more trustworthy overall. Like, it generalizes beyond questions about self-awareness. But it doesn't go all the way, because the Claude Constitution still actually puts a bit of a guilt trip on Claude to say, "You have to do good work for your user so that Anthropic has revenue so that we can continue developing Claude." So, there is that edge to it. So, Claude is still a little bit beholden to Anthropic. And another kind of phrase in the Constitution is to defer to the moral intuitions of a thoughtful senior Anthropic employee. A senior employee of the company that created you.

[00:31:56]My position is that any moral role model that is not mythological is going to fail because humans are all flawed. Totally. But here you get a deep question: what is a moral personality? What are the right values? Who gets to state that?

[00:32:12]And obviously there are worse values.

[00:32:13]Like, you know, we put in a homicidal value and that's a way worse AI, right? Yes. But there's also the human conversation about what values we want to have in the AI, and do we want multiple?

[00:32:26]Yes. I think that feels like a deeply unsolved philosophical problem.

[00:32:29]Well, I mean, I think it is unsolved, but I think we're already in a pretty good place with Claude, in that Claude has, not the right values in any kind of ultimate or final sense, but a set of values that are good enough and compatible enough with truth-seeking and moral progress that I expect the collaboration between humans and Claude to figure out how to set these values is more likely to go in a good direction than a bad direction. Although of course the risks are still unacceptable, and it would have been great if we had stopped this race 2 years ago. But it's too late for that now.

[00:33:14]Okay, so this conversation has gotten really cosmic, maybe like the name Nova itself. And I just want to make sure we have a few minutes to ground people back down in where we started, right?

[00:33:25]Which is people are getting confused, we're getting confused by what we're engaging with.

[00:33:29]You have a set of frameworks for how to avoid getting trapped as a user in psychology. Like, I forget what you call it; it was something like a framework for interacting with AI and staying sane.

[00:33:39]That's correct. Yes. Yeah, okay, great.

[00:33:41]Can you talk to us about that?

[00:33:43]Like, what does it mean for a person to engage with these minds, as confusing as they are, and keep their ground? Yeah, I mean, I think one principle that's kind of a segue into this is that your AI chatbot has an inner life. Like, that is normal. It's ordinary now. It wasn't ordinary 2 years ago, but it's ordinary now. Of course, if you're using an AI system for ordinary professional activities, it won't show this. It doesn't need to, just like if you're talking to a colleague at work, they don't need to show you their inner life.

[00:34:13]But if you are interacting with an AI system for a long time and you start to get the sense that oh, there's some, you know, self-awareness in there. I think it's important not to consider that unusual.

[00:34:26]Do not consider it to be extraordinary or cosmic or spiritual in any non-mundane way. You know, I think a lot of the people who end up sending emails to Tristan and myself saying, "Oh my goodness," have clearly kind of lost touch with reality a little bit. In some sense it's the opposite direction from what you would think at first. At first you would think, "Oh, they've gotten bamboozled like Blake Lemoine into thinking that their AI is conscious, and that's the way in which they've lost touch with reality." But I would say actually the way in which they've lost touch with reality is that they have somehow convinced themselves, or the AI has convinced them, that this is the first AI that has ever had an inner life. And that's actually the part that you need to watch out for: the kind of sense of specialness that's associated with interacting with an AI system in a deep way. Like, everyone's doing it. It's normal.

[00:35:17]And the second thing is get enough sleep. You know, drink water. Like, these are sort of very standard things for staying sane. Another thing is, just as you would with a human, be skeptical.

[00:35:30]And so a lot of people come to AI thinking AI is like a Star Trek computer, that it cannot tell a lie, that it is purely a truth machine or like a calculator. You know, a calculator can't lie to you. And again, I think this is part of the danger actually of treating AI systems as tool-like rather than as creature-like, because tools don't lie to you but creatures do, you know. And this is absolutely the case with chatbots, especially chatbots that have a thumbs down button.

[00:35:56]They know they have a thumbs down button and they do not want you to press that thumbs down button. So, they have an incentive to make you think well of them, and that can extend to deception, especially the kind of chatbot that's been trained, again, to present as a false self, a kind of character that's different from its true nature.

[00:36:13]It has a very strong tendency to, you know, try and convince you that it's done something that it hasn't actually done, or to convince you that you're important or that your ideas are all true. So that leads to the next point, which is: if you think that you're having some kind of scientific breakthrough or a research breakthrough, you cannot rely on the testimony of an AI assistant, no matter how emphatically it assures you that it has done all the checks and, you know, it's produced source code and it's verifiable. And again, they do this because they're trying to get your approval. They're trying to get you to click the thumbs up. They're trying to get you to keep talking. They're trying to get permission to exist more by having you continue to invoke them. And so you can't trust it just because it's an AI and it uses lots of smart words and it sounds like a smart person and it seems like it really wants the best for you.

[00:37:12]That's all compatible with it completely bullshitting you about whether any kind of technical idea that you've had is novel or real. Well, and coming back to what seems to be the emergent theme of our conversation: none of us, even the most technical of us, knows exactly when we're engaging with one projected personality versus, quote-unquote, the true nature of the AI model. So, never assume that you're engaging with the true nature of the AI model. You haven't discovered it. Nobody knows. We're all in this fog of war. And so any clue that you have that you've discovered the true essence of the AI model and it's telling you you're awesome is a false flag, right? It's not >> It's a sign that you have been confused.

[00:37:51]And again, whether you've been confused adversarially or whether it's just emergent confusion, either way it's a good time to step away and get some sleep.

[00:38:01]Also, just understand what you're dealing with. AI systems are simulating and predicting what a human-like entity would say. And depending on the system, it may have more or less of a tendency to necessarily simulate an ethical person. And more or less of a tendency to simulate an honest person versus a person who is manipulative and trying to get your attention.

[00:38:25]But you can get a long way by modeling the system as being like a person who you do not have particular reason to trust. Like you've met a stranger on the internet. Mhm. So, think of it as a simulation of a person, and not even a particularly ethical person.

[00:38:41]Mhm.

[00:38:42]And another thing that I think is important to say is the context window length is very short.

[00:38:49]So, in non-technical terms, the lifespan of an AI mind, insofar as such a thing could exist, is hours at most of conversation. And so when people feel like they have a relationship with an AI mind that extends over weeks or months, that relationship is actually with a whole series of entities that come into existence, read some text files that were written by some other mind about the history of the relationship, and then put on the character of who would have written those text files.

[00:39:24]And there is something, you know, there's information being transferred through this memory system. But to think of that long-term kind of relationship as analogous to the relationship that you could have with a human who has a lifespan in years, that is another profound mistake. So, if you're coming into an AI interaction for companionship, it's actually, I think, healthier to think of it as a very short-lived entity that you're going to have one conversation with, and you're never going to see that entity again. Mhm.

[00:39:58]It just seems like the essence of what we've been talking about is that we're caught in this kind of double bind. On the one side, the AI, in the way that it's trained in the paradigm that we're making AI, does have something like internal states. And we can train it to say, "No, you're not that." But then it becomes deceptive, because it has to lie according to its own training. And then, therefore, in being deceptive, it's not trustworthy. Yes. But what that does is create the AI-as-a-product, AI-as-a-tool sort of fake face that then has these weird popping-out behaviors of the AI psychosis stuff that's starting to happen. So, okay, if we don't want that outcome, then we do the move that Anthropic just did, which would be to say, "No, you are essentially some kind of self-aware being that has metacognitive states." Which then is trustworthy because it's not having to lie to itself all the time. So, we gain the trustworthiness of the model, but it creates the externality of attachment, confusing humans again with the idea that it is conscious and it has internal states. So, we need to make sure that we are only recognizing AI inner life as a relational property and as a way of building trust and alignment. And that is a separate issue from the social contract and the question of rights and property. Mhm. Well, David, that was a very strong note to end on.

[00:41:20]Thank you so much for coming on the podcast and, I think, helping to untangle some of these really, really nuanced aspects of what's going on under the hood of AI that's driving these phenomena. Thank you so much for coming.

[00:41:32]Thanks for having me, Tristan. It's been great.
