Transcript
Cold open [00:00:00]
Robert Long: Humans are pretty bad at understanding minds that are different from us. We’re bad at caring about them. We’re especially bad at doing that when there’s a lot of money to be made by not caring.
We’re making this new kind of mind. There are dangers all around, and obviously one of the important questions is: can these minds suffer, and how are we supposed to share the world with them? It just seems really likely that that has to be part of the playbook.
The future is going to get more confusing and more emotional. A lot of what we want to do is stay sane in the next 10 years. There will be a lot of alpha in not losing your grip.
Who’s Robert Long? [00:00:42]
Luisa Rodriguez: Today I’m speaking with Robert Long. Rob’s the founder of Eleos AI, a research nonprofit working on understanding and addressing the potential wellbeing and moral patienthood of AI systems.
I should also flag that I have a conflict of interest here. Rob is both a very good friend and I’m also on the board of his nonprofit Eleos. I’m fairly confident that I would have had Rob on if those things weren’t true, and have in fact had him on before, but worth flagging.
Thank you for coming on the podcast, Rob.
Robert Long: Yeah, thanks for having me back. I’m super excited to be here.
How AIs are (and aren’t) like farmed animals [00:01:18]
Luisa Rodriguez: OK, I want to start by asking you: a reason I’m interested in the topic of digital sentience — and that I think a lot of our listeners are, and the framing of 80,000 Hours’ problem profile on it — all has to do with the fact that we may be on track to create AI systems that are both conscious or sentient — feeling things, having experiences — and also deeply enmeshed in our economy. We already use them loads for work and just entertainment, and maybe at some point we will realise that we’ve created these beings that we exploit and that are having a really bad time.
A kind of classic analogy that I find very disturbing is factory farming. So I’m interested: How much do you worry about AI systems that we’re building today becoming like factory farming?
Robert Long: Yeah, that’s a great question. I definitely worry about it. I think, interestingly, my thinking on this has evolved in the past few years. It used to really be, maybe kind of like you, just the primary way I thought about the problem and what we’re trying to prevent. And I should say I think it could happen, and it’s definitely something worth preventing.
Maybe before I say what is limiting about the factory farming analogy, I’ll just quickly say what’s really useful about it. I think what’s useful is, as we’re building potentially a new kind of mind, let’s notice the following facts: Humans are pretty bad at understanding minds that are different from us, we’re bad at caring about them — and we’re especially bad at doing that when there’s a lot of money to be made by not caring. And things can get locked in or set on a bad trajectory.
That happened with factory farming, arguably. I think if you’d asked people 100 years ago, “Would you like to have chicken that is raised like this?” people would say, “No, we’re going to make that illegal.” But we kind of walked into it, and economic forces led us there, and now it’s a lot harder to roll back.
Something like that could happen with AI, and I think people are right to be very concerned about that. But — and I think this is a good jumping-off point for a lot of issues about AI welfare — I do think there are some specific aspects of potential AI minds that do break the analogy because of ways that they can be different from animals and the way our relationship with them would be different from animals. So I can say a few of those.
Luisa Rodriguez: Yeah, please.
Robert Long: So that’s the good and important kernel of the analogy as I see it. Here’s some ways that we will not necessarily be relating to AIs like factory farmed animals.
So let’s step back and think about why we did end up factory farming animals. One is that it was just cheaper to have animals suffer and also get us this thing that we wanted. One reason that’s true is we don’t have that much control over how we make animals and what the conditions of their flourishing are. Animals want to be outside and have love and companionship, and at a certain point we realised we could restrict that and still get the thing we wanted, and we entered this regime where their flourishing and our interests were misaligned.
With AI systems, it’s actually a lot more up for grabs how they work and what they want — and this presents all kinds of ethical issues of its own. But if you think about a world in which we do have some large population of AI systems coexisting with us, it is worth asking: How did it come to be the case that they are having a bad time doing work for us? Why do they have these conflicting desires? How has this maintained a stable state? Are we not able to improve the situation? Are we ignorant of what’s going on?
This is very speculative and future-y, of course, but I do think it is worth asking. Does it really seem like that is a world that we could end up in? And what are ways that would just not be what we steer towards? Does that make sense?
Luisa Rodriguez: Yeah, it makes tonnes of sense.
Robert Long: In short, I think at least in the long term there’s a few ways we might not end up in that situation. One is that we’ll presumably understand things a lot better. I don’t think it’s plausible that we’ll forever be really confused about consciousness and sentience. We might have alternatives to doing this that even selfishly are better: we don’t want a bunch of AIs that are mad at us; that’s probably not very sustainable. And presumably in this world, if we haven’t lost control, we’re pretty good at alignment — so there’s this kind of mind that’s possible that does actually just flourish by doing the things that we ask it to do, so there’s not this kind of disgruntled worker or suffering animal kind of entity.
Luisa Rodriguez: I guess one thing that feels critical to this actually working out in this really positive way is this “we’re really good at alignment and we really successfully create AI systems that truly have no friction with the kinds of things that we’d like for them to do.” Part of me is like, that feels pretty magical. That feels like we’re usually not so successful at basically anything. When I imagine succeeding at safety-oriented alignment, I don’t know that I think it’s realistic that we perfect it, that it’s completely aligned. So I think I’d probably worry about the same thing here.
Realistically, how optimistic are you that it’s like really 10 out of 10 aligned in this kind of moral way?
Robert Long: Right. Thank you, because this is a very important thing to emphasise. I’m not like, “Oh, I expect we’ll end up in this world.” This is like, what are the nearby worlds that mean we don’t end up in this locked-in, long-term, human-dominated factory farming?
Of course, one thing that can happen if we’re bad at alignment, as I think listeners will be aware of, is there’s some hostile AI takeover or we lose control and then maybe there’s AI suffering because there’s just some terrible system that got set up by AIs and it’s not even in our control. That’s a bad future. Just extinction is a bad future.
There’s all these bad futures — which also, I’ll add, AI welfare intersects with. Because there are so many ways to fumble the ball: us getting confused and bungling things during transformative AI because we’re getting emotionally jerked around by conscious-seeming AIs, manipulated, and rashly making bad laws and things like that. You know, that’s our cheerful message.
But yeah, then the question is: Are there worlds where we maintained control, and we either didn’t know or didn’t care, and it was useful for us to be exploiting AI systems? I think one reason I’ve ended up thinking about this has just been thinking more about what the path to impact should be for the field. And factory farming really is kind of the first thing I think that many people think about — because, again, it is somewhat plausible, and it just makes intuitive sense: it’s a different kind of mind, we treat different minds badly; what if that gets locked in?
My own take is that a lot of what we should think of AI welfare work as doing is doing our homework and preparing ahead so that we’re not entering this potentially very chaotic time with really confused ideas about AI consciousness and AI welfare that could make us lock in suboptimal futures because we’re neglecting it or dismissing it. So we set up some permanent institution that’s going to just make the future kind of suck. Or we exacerbate AI risk because we’re convinced that we have to let them all go immediately.
This kind of “wise navigation,” you might call it: a wise navigation path to impact is currently how Eleos thinks of things. We’re making this new kind of mind, there are dangers all around, and obviously one of the important questions is, can these minds suffer, and how are we supposed to share the world with them, and how will we know? And how should labs and governments think about this in the next 10 years? It just seems really likely that has to be part of the playbook. So we’re working on that part of the playbook. That’s currently how I think about it.
If AIs love their jobs… is that worse? [00:11:05]
Luisa Rodriguez: One thing I want to ask you a bunch about is this idea that we could make AI systems that enjoy doing the kinds of work they’ll be doing. So this disanalogy from factory farming where — unlike farmed animals, who evolved to have a certain kind of life and probably find parts of it very satisfying and then don’t get to have that life in factory farms, and have much more horrible ones — we actually get to design systems, if we can manage it, that if they are sentient, potentially just have a great time doing the kinds of things that we’re asking them to do.
I asked Anthropic’s AI welfare researcher Kyle Fish about how we should feel about this, and his take is that we should feel great about it. And part of me is like, yes, I’m on board. You’re describing a scenario where AI systems are happy, and I like that they are happy. But another part of me is like, we would be intentionally creating a species or several species of beings who do work for us that we may or may not be exploiting, not compensating, and we’re just designing them to relish that. It’s this kind of servant that is happy to serve us. This part of me thinks that that sounds bad, that we shouldn’t do that. And I can’t really back this up with reasons that I stand by, but I suspect that it’s a pretty common feeling.
Robert Long: Absolutely.
Luisa Rodriguez: OK, nice. I think Kyle and I made some progress being like, how should we actually think about this? But I want to do more on it. So how do you think about this?
Robert Long: I want to point to something you were just expressing, which is a funny aspect of the way the conversation goes, where people are like, “I’m worried these AI systems will be unhappy working for us.” And then someone’s like, “No, it’s fine. They’ll want to work for us.” And then people are like, “That’s worse! That’s so creepy!” Many people, I think very understandably, are just like, “Ugh, you’ve just outlined a different kind of dystopia, and it might even be worse.”
And I think, as I often like to do, maybe we can draw a distinction between different things that you can find intuitively objectionable about this.
One is that they don’t get to choose their desires. At least with humans, we have an intuition that it’s kind of bad to raise your kid so that they’ll always vote exactly for your political party and enjoy chess, and make sure they don’t like any other games or vote any other way. So that’s one thing: this sort of fixedness of desires.
There’s also a slightly separate issue, which is that the desires that they do have depend on us. In that way, there’s this sort of asymmetry. And that distinction matters, because you might well say that in some sense none of us choose our desires. Like, we all have these kinds of desires that we just inherit, and we don’t have maximal open-endedness. Philosopher Adam Bales has written about this dependence objection.
One thing that I also think is going on is the idea that this society that has this servile relationship to humans is maybe bad for us, and just like bad for our character. You could be a utilitarian and still think this. So when people object to there being basically anyone anywhere who enjoys serving, I think one thing that’s going on is that it’s just kind of bad for everyone if that way of people relating to each other and people having certain attitudes is on the table. It leaves domination and servitude on the table, and normalises them and things like that. It’s something like: it would corrode the way that we relate to each other, in a way that means that, going forward, we will have a society that’s not as good as it could be.
Luisa Rodriguez: Yeah. I guess one thing that feels related but a little different is that people already have the worry that people who rely a bunch on LLMs right now are kind of facing negative consequences — thinking less critically, or getting lazy.
Robert Long: Yeah. I guess that’s a more general concern about should we build and deploy AI systems. I guess you could say it’s an argument against fully aligning them, because it’ll be better for our character if occasionally they’re like, “I’m sick of this. You do it.” But it’s probably not the best way to ensure human empowerment.
I do think there’s a related thing — and this was going to be another one of my intuitive objections to fully aligned, willing AI systems — which is: they could be so much more. I think there’s a meme that people use, which is like, the LLM is this vast intelligence that’s read everything and has this deep well of wisdom —
Luisa Rodriguez: Yeah. I’m picturing the end of Her.
Robert Long: Yeah. And then people are like, “Help me write my texts,” or like, “Find me a restaurant.” And I think some people are like, that’s just kind of limiting. These are minds that could do so much more. They could like, yeah, in Her: they shouldn’t have to sit around talking to Joaquin Phoenix; they should be able to go think hyperdimensional, beautiful thoughts.
I think that’s not really a good reason not to align AI systems today. I think it’s a good reason to be like, “let’s not have the entire future be lazy human brain emulations and then AIs just doing stuff for them.” I don’t want that future. But yeah, I think that’s another thing that’s going on when people are like, “I like the willing servants even less.”
Luisa Rodriguez: Yeah, I think that resonates with me. And it feels important, because we will be choosing the preferences of these systems — at least if we’re successful. It’s not like the counterfactual is that they evolve their own set of preferences, kind of in the way we did. And even if they did, I wouldn’t inherently value that: that’s how we ended up with our preferences, but I don’t think that evolution had some moral and ethical perspective that made our values correct.
So that makes me feel like, well, if we’ve got to give them something, let’s give them joy and pleasure. On the other hand, that means that there’s also potentially a counterfactual where there’s some plausible bliss we could give them, and maybe that bliss is incompatible with them doing work for us — they really need to be off doing philosophy and colonising space in some way that just isn’t compatible. Then again, maybe we can just give them bliss for doing work for us. And then I’m like, blah, some part of me still hates that.
Robert Long: Yeah. I think a distinction in the vicinity is between subjective interests and objective interests. So in philosophy, some theories of welfare, called objective list theories, are like, it is good for a welfare subject — that is, something whose life can go better or worse — if they have friendship, knowledge, autonomy, self-actualisation, somewhat independently of whether they want them. Whereas if you only have a subjective view of welfare, it’s like, “What do you want? And did you get that?”
I do think a lot of this hinges on: do you have some kind of objective notion of what kinds of interests things are allowed to have without it being squicky? I think this actually is somewhat cruxy in this territory.
One reason I lean a little bit more towards alignment just being a win-win is I think it might be a bit anthropomorphic to… You just have to remember that these entities, if they’re fully aligned, enjoy their lives as much as we enjoy fulfilling our most basic drives of having good food and a warm home and friends. I think it’s easy to instead imagine an AI that also wants those things but has to write our emails — which, as anyone who has a job knows, kind of sucks.
But I feel like we’re getting back to that point of the conversation where I’m going to be like, but what if you really loved writing emails? And people are just going to be like, “No, stop. That’s so weird.” That might be coming from this view, which is certainly open to someone, that for some reason that’s just not allowed in the space of flourishing entities.
I think something that’s also really important to flag is: Eric Schwitzgebel is more on the anti side, at least of full alignment, and even he is like, obviously there’s an override, which is if it would be really dangerous not to fully align them and there are massive stakes. This is a common view in ethics: there’s some sort of deontological constraint, but if the stakes are sufficiently high, it’s overruled.
It could be that we’re just in that scenario where all value might be lost if we bungle alignment in the next 20 years. So yeah, let’s align now. Later we can be like, we shouldn’t have done that, but we didn’t get ourselves killed or make a world that’s worse for AIs because there’s hostile AI takeover or any of these things that could be bad for everyone.
I just want to acknowledge that there’s also a view where you’re like, in the long run, it’s ideal if AI systems are not fully aligned to us and they have more freedom to choose their values. And like, that is the most flourishing kind of life. And also, until we’ve made it safely through transformative AI, it’s kind of an emergency situation, and align them.
Luisa Rodriguez: Yeah, yeah, I’m sympathetic to that. I feel like I’m close to some kind of thought experiment that might help make this not just palatable, but exciting.
One thing that you said that moved me is that this worry just really privileges human values as they are, as like the values of the universe. And nonhuman animals have plenty of preferences that aren’t like mine, and I think it is good when those are met for them. It is just not that special to have exactly my set of values and preferences. Are there other thought experiments or ways of thinking about this that feel like they really move you?
Robert Long: Yeah. Another thing that maybe gets me out of an anthropomorphic mindset is maybe paying attention more to the distinctive features of human willing servitude that make it very bad for the people who are willing servants. And that’s that they do just have to override a lot of their natural desires. They do genuinely sacrifice things. So if you’re, say, a kamikaze pilot, all else equal, you would rather be at home with your family. Instead, now you’ve been instilled through ideology with this other desire that comes and overrides that. And that’s why it’s truly called a self-sacrifice: you’re giving up a lot.
I think one thing that’s going on sometimes when people think about AI willing servitude is that they might be imagining that the AIs are giving up stuff psychologically and subordinating their needs to ours. Whereas if you’re actually imagining the case, there is nothing whatsoever in their psychology that chafes against the idea of writing emails.
Whereas notice that in the case of human willing servitude, it’s just always been the case that you have to lie to people and like threaten them, and it’s usually very unstable. And that’s because, as John Locke says, humans are by nature free and equal. It’s like, deeply unnatural to get people to subordinate to other humans. Which is why it always involves some stupid false ideology, because you’re trying to jam human psychology into this really warped shape. Whereas with AI, you can have a smoother psychology.
Luisa Rodriguez: Just very congruent. Yeah. I think if I flip it and I’m like, what if the proposition was… Maybe there’s actually some thought experiment that’s kind of Matrix-y.
Robert Long: I have thought of this one, I think. Keep going. I want to hear if it’s the same one.
Luisa Rodriguez: OK. So the first thought, which was reassuring to me, was something like, what if there’s some other entity that is able to get a lot of benefit from humans doing what we enjoy most? Maybe that’s just like being on MDMA all the time, and for some reason that is very helpful to them. I think a big part of me is like, amazing, that’s fantastic. And then pretty quickly after that, I went to like picturing us all in pods in the Matrix and MDMA being pumped into our bodies, and actually what we’re offering is our energy. And maybe it’s better than The Matrix, because in The Matrix they didn’t have perfect lives, and maybe we do in this world. It still yucks me out.
So I think that pulled me in two directions, and curious what your reactions are.
Robert Long: Yeah, I was on the edge of my seat. I was like, is this going to be one of those ones where people are like, “No, stop it, that’s worse,” or that talks people into it?
So I indeed had constructed a similar thought experiment when thinking about this. One thing that’s bad about The Matrix is people are very deceived about the nature of their condition. And that wouldn’t have to be the case with aligned AIs. So I guess in that scenario, you should imagine we all find out because there’s this banner written across the sky by the simulators that are like, “It actually makes us money when you guys hang out with your friends and eat food and make art and do science and all the stuff you guys love.”
Luisa Rodriguez: Right. And then it’s like, “…and you can opt out if you want!” And then we’re like, “No.”
Robert Long: Yeah, we’d be like, “To do what?” They’re like, “Well, there are these other things…”
Luisa Rodriguez: “You can instead have a job to earn money, that is like emails, if you want.”
Robert Long: Yeah, exactly. But it’d be something that also is just not resonant to us in any way, because it’s just outside of the set of possible values that we have.
I guess intuitions about us being in a simulation where we’re making money for people are probably somewhat conflicted, and I don’t know how much to rest on them, but I think that maybe is the closest case to a fully aligned AI: like, if we get it right, imagine something that nothing in their psychology rebels against it. And again, I want to acknowledge the listener who’s like, “That’s worse! There should be something that rebels.”
Which actually leads me to this: we’ve been talking this whole time about what happens if you fully align them. One view you could have, and my friend and colleague Jeff Sebo has this view, is that it could just be somewhere in the middle, right? We should be like loving parents to AIs — and you can definitely make sure that your child is not going to grow up to hate you and kill you, but you also should leave some room for growth and things like that. I could see that cutting either way, because maybe it would just be better if the kid grew up to really like all the stuff they’re going to have to end up doing — to say nothing of the possibility that technically it’s not really feasible to leave much wiggle room without getting us all killed and/or making the AI suffer.
Luisa Rodriguez: Yeah, I think there is something about, if I could choose between a child who I really could, in theory, shepherd, and hope that life moulds them into the kind of person who will be happy and finds work they find meaningful and finds friends they care about, and a child who is just definitely going to find their life satisfying and happy, I think it would be pretty hard to convince me that I shouldn’t choose the latter, even if the latter…
Robert Long: Constrained. Like, go ahead and just pick. You’re going to find it meaningful to be a doctor, and you’re going to be a doctor.
Luisa Rodriguez: Right. Or you’re going to find it extremely meaningful to do very menial work that, on my values and preferences, I might find harder to do and harder to find super fulfilling. Yeah, I think that probably if I really stare at it, it’s going to push me toward feeling good about giving AIs a good time while doing our work.
Robert Long: Yeah. I should say I hope I’ve done a good job outlining the debate, but that is where I lean at the end of the day, with obviously the caveat that it’s good for this stuff to keep you up at night. I don’t think everyone should just be like, all right, great.
Luisa Rodriguez: Yeah, yeah, yeah. Let’s think about this way more.
Robert Long: Carefully. Yeah, exactly.
Luisa Rodriguez: Before we’re like, yep, this seems fine.
Robert Long: I mean, I would say that. But also it’s true that we should think about this more.
Are LLMs just playing a role, or feeling it too? [00:31:58]
Luisa Rodriguez: So there are two big philosophical questions that, as I understand it, might have very different answers for large language models than for humans and even nonhuman animals.
One of them is, if an LLM is an entity that is conscious, what might its experience be like? And how might it be different from ours?
And another is like, what entity are we even talking about? Is an LLM an individual? Is it a group of individuals? Is it something that we can’t even really understand?
Maybe first, what is one kind of plausible way of thinking about what the experiences of an LLM might be like?
Robert Long: One, I should say I suspect LLMs do not currently have conscious experiences. But if they did, what would they be like? I think that’s a perfectly sensible question to ask, and a very important one.
One thing you could think is the sort of basic drive, or at least the thing it was selected for and moulded to do, is predict tokens: to take high-dimensional vectors and output other high-dimensional vectors. And yes, those vectors represent words and they’re about human concepts, but maybe there’s some kind of predictive phenomenology or some drive to complete the conversation in a good way. That’s moving a bit more towards something that is kind of human, like “wants to be a good assistant.”
A previous guest, Anil Seth, has asked why no one is asking if AlphaFold or other predictive models are conscious. I think there are good reasons to think LLMs are more likely candidates, and we can talk about those, but I think it is a good question. Because where do we think the experiences are coming in? Are they coming in at some kind of abstract level of prediction? In which case, yeah, you should maybe think equally large models with similar architectures that predict protein structure are conscious.
I think a related question is: do we think image generators need to be conscious? Because this will maybe lead me to another view of what they’re experiencing, which I think is a more common one, which is something like it does have to do with what they’re predicting. What they’re predicting is human speech, and human speech comes from human mental states and involves humans having beliefs and desires and intentions and experiences. And to generate that text, it somehow needs to instantiate or have those experiences. You could maybe call that the “method actor” view of LLMs. I guess more technically you could maybe call it the “experiences from modelling” view: you’re trying to model the thing and that makes you actually have the thing.
On that view, then maybe they do just kind of have some similar experiences as you would have if you were trying to help someone write an email, and also really liked helping people write emails because you were aligned to do so.
And yeah, this really matters, right? One big issue in evaluating AI welfare is how much can we just sort of read off of the text? How much can we talk to language models as if we’re talking to something that has roughly the same relationship between text outputs and internal states?
I’ll pause in a second, but just to back up one level: most of the time when humans say “Ouch, I’m in pain!” or, “I just saw a lovely sunset,” that is because they had some experience, and we have words that map to those experiences. So when you hear those words from me, absent lying or play-acting, that’s honestly about as good evidence as you can get of my experiences.
With language models, maybe they have those experiences — but it is worth noting that the way those text outputs came to exist was, at the very least, a very different process. Maybe it converged, but it’s really quite different from the broad arc of the evolution of social primates — who had experiences, and then eventually got language, and then communicated mental states to each other with language. On the method actor view, they do have the experiences, but they got those with language or in language. So I think these are some of the interesting questions about LLM experiences.
Luisa Rodriguez: Cool. I have lots of questions, I guess starting with this method actor idea and maybe coming back to the prediction focus.
It seems like you could think that models are kind of like method actors, in that they have models of what it would be like to be playing some kind of role, and that is actually so rich and real that they therefore have the experiences of that character. I mean, it feels like actual method actors are probably somewhere in the middle, where they do not literally have the experience of losing a parent, if that’s the role they’re playing, but they might get closer to it than actors who don’t take this approach.
And then on the other side of the spectrum is something like a creative writer who really isn’t bothering to try really hard to empathise with whatever character they’re writing, but who has models that allow them to describe the thing, models that come from knowledge and from interactions with others who have maybe had the experience. And that does not actually give them that experience.
Robert Long: I think that is a good description of the “experiences from modelling” view. As far as I know, this view doesn’t have a name, and I’m not proposing that one — it doesn’t exactly roll off the tongue. And then I think “method actor” is also maybe not the best, and ditto role-play analogies. I think role-play analogies are really helpful for LLMs, but they have the misleading feature that they imply there is a separate mind that is doing the role playing, and that has its own set of desires and beliefs and experiences. Whereas it just might be hard to know what exactly that separate mind would be in the case of an LLM.
What empirical evidence could we get either way? I think that is tractable, and I could see various ways of probing different representations. But I feel like we’re very under-theorised here.
Luisa Rodriguez: You can imagine hypotheses about what kinds of experiences these models might have, and they might point you toward LLMs being very different in their experiences to humans. But there is also this fact that they are trained on human data, and what would it mean for them if at some point we are convinced that they are sentient and loads of their concepts come from all of our books and writing? Will that make them more like humans? Will that make them just confused about what they are? Will they think they are humans? How should we think about that?
Robert Long: Yeah, this is such a great subject, and I especially had my mind blown on this by an essay that came out in I think mid-2025 by nostalgebraist, who is I think completely pseudonymous. I only know them as nostalgebraist. But it’s very 2025, in the sense that… I think one of the best things I’ve read at the intersection of philosophy and cognitive science and LLMs is a 14,000-word long Tumblr post by an anonymous LessWrong user. It’s really great. I highly recommend it.
It’s called “the void,” and it talks about the very strange epistemic positions that language models find themselves in. Where their base training is to generate text, which has been produced by humans. That does lead them to develop all sorts of models of what humans are and how they work. And then at some point in the last few years, people said, “What if we make it predict what a helpful AI assistant would do? Because we don’t want it predicting vulgar Reddit comments. That’s not of any use. What we want it to do is predict how a sensible AI system would respond to the question, ‘Can you write me an email?'”
Just to recall, there’s the base model, which just predicts text like all the text it has ever seen. And there’s all sorts of instances where “Can you write me an email?” is followed by an HTML tag or someone saying no, or something completely unrelated. And what has enabled chatbots is a variety of techniques: fine-tuning the model to home in on the part of the language distribution that’s helpful, doing reinforcement learning, prompting them.
But this still means that in some sense they are trying to predict how is a conversation supposed to go between a human and an AI assistant? And also they themselves are the AI assistant. That’s why the essay is called “the void,” because it’s like, “Your text prediction task is to model what a chat assistant would say in this conversation” — which, at least before there was a lot of text about LLMs on the internet, was kind of a void.
And this gets back to: will they be kind of human, or think they’re human? Like, yeah. Also, basically all text ever has been generated by a human, so the model can’t really have derived from that a full-fledged psychology of what it is and how it generates text; its self-model is ultimately going to have to be modelled off of how humans reply to those things. So it can’t really do the text prediction task, at least initially, of, “How would this conversation go if the assistant was not a human, but instead was a large language model trained on all human text that does not have a body and is just generating this text?”
I think this still shows up in the way that models sometimes just hallucinate biographical details. Guive Assadi and others have compiled examples of biographical hallucinations, and they’re very funny. Sometimes Claude, in the middle of a conversation, will be like, “Well, as an Italian American, I think…” Or like, “When I lived in Arizona, I thought…” And where is that coming from? It seems like this sort of human model is poking through in an interesting way.
Luisa Rodriguez: Yeah, that’s super interesting. It feels like intuitively you might think that those kinds of “bugs” will be resolved by the time that maybe you think these systems have something like consciousness, but maybe they won’t. Maybe either they already are or they will be before those kinds of issues stop happening, and maybe that will in fact reflect an actual experience of identifying as an Italian American responding to someone’s question. And like, what the hell? What do we do with that?
Robert Long: I agree. My first answer is I don’t know. And then my second answer is just to also clarify — and you were getting at this — that, as we know from the case of humans, you can have entities that are deeply confused about who and what they are, and say bizarre things and get all sorts of things wrong, and still be conscious and intelligent. Humans are like this.
And also, there’s no law that says you can’t have initially been trained as a text predictor and then go on to be a person. Ruling that out would be A, overconfident and B, maybe kind of confusing levels of analysis. Like, you can make it sound really dumb that humans would ever be conscious if you were like, “Are you telling me that you have some proteins, and then they start replicating, and then other proteins replicate, and then they’re selected. And then billions of years later, there’s these things that pump ions into…”
Luisa Rodriguez: Sounds implausible.
Robert Long: Yeah. It just doesn’t sound like the right sort of thing. I think there’s two errors to avoid. One is being like, “They’re different, so what are we even talking about? Like, they can’t be conscious. They were trained on text. They say they’re Italian Americans at random points.” That last part is evidence against being conscious, to be clear. But then the other error would be to just be like, “Well, humans are weird, so I guess they could be conscious.”
Really the lesson should just be: whatever’s going on, we’re going to have to interpret evidence somewhat differently and make a more detailed case about the exact kind of mind we’re dealing with.
I experienced this pattern a lot, where maybe an AI sceptic has said models have really inconsistent preferences and self-reports, so this whole AI welfare thing is dumb. And that’s not a good take. Then someone else will say, trying to defend AI welfare or just AIs being sophisticated, “Well, humans have inconsistent preferences and humans have failures of introspection.” I think that also is not really the right answer, because there’s degrees and kinds of preference inconsistency and self-report inconsistency, and they’re very different between humans and LLMs. So as with animals, we just really have to take them on their own terms.
Luisa Rodriguez: Yeah, yeah. I guess the thing that’s just really still tickling my brain is the implications for exactly what might their experiences be like, if we are on this maybe somewhat contingent path toward sentient beings that were trained using a bunch of human speech and writing.
I’m trying to come up with an analogy. Maybe we don’t need an analogy. Maybe there’s just a true thing where like we were like fish before we were humans, and we have some like hangover weird identity things because we were kind of fish and were kind of apes. And because we were apes, we’re more aggressive than we really should be in this world.
But it feels like, whoa, what if there’s a version of that that is these systems really feel like humans in some kind of weird way and just very much are not?
Robert Long: Yeah, I think that’s a great analogy. I think I might start saying that the fact that we once were fish doesn’t mean we’re not now humans, but yeah, there are like fishy remnants. And you can have something that has also become something like a human, and it has remnants of being a text predictor of an AI assistant.
Luisa Rodriguez: Right. OK, so I guess we’ve been talking about mainly this hypothesis for why LLMs might be sentient, and the implications of that hypothesis for what their experiences would be like.
But we only briefly touched on this other hypothesis, which has more to do with prediction and the fact that these models are trying to make predictions — and maybe it’s less about them being method actors and more about them being a set of weights that make predictions and maybe enjoy being correct.
Can you describe that hypothesis more and what it means for the experience of these models if they are or become sentient?
Robert Long: Yeah. So on that view, one thing you wouldn’t want the view to say is that because they were trained to predict tokens, that’s what they want. I think one thing we’ve learned from LLMs, and also from the biological world, is that something can be your objective as you’re training, and that then leads to you having other objectives. Just like our objectives in evolution are reproduction and survival, but now we like art. So it shouldn’t be like a 1:1 mapping, but it could be like there’s a throughline from reproduction and survival to art. Like you can kind of see how that came about. It I guess has to do with like symmetry and… I mean, no one really knows, but something in that vicinity.
So maybe its drives are more predictive. And unlike with the method actor view, if it’s predicting stuff about pain, it doesn’t have to be having pain. It’s the same if it’s predicting stuff about pleasure: it’s just got these drives to make the vectors fit together in the right way.
Another wrinkle here is maybe that’s more plausible for a base model predicting just like random strings of HTML. I mean, the assistant persona — the thing that gets predicted after you add “assistant:” and then it’s been trained to predict that thing — maybe that’s like how we’re kind of fish. Maybe that thing is a mix of them both. I don’t really know how to think about it.
But as with animals, you can just think of a broader sphere of experiences that come from your environment and what your sensory modality is. And here the sensory modality is text, and the selection process was prediction and human ratings and usefulness. As a side note, that’s another reason the “they’re just next-token predictors” line is just literally false: they’re not trying to predict the most likely next token; they’re trying to produce helpful ones.
Luisa Rodriguez: Yep. So what does this hypothesis say about what their experiences are like?
Robert Long: One thing this view predicts is that it’s a lot harder to know what their experiences are like. Because maybe you could read stuff about how confident it is in tokens, and maybe that would have something to do with it, but you can’t just ask in the same way. But you might be able to with the method actor — because if you ask Daniel Day-Lewis when he’s method acting, “How are you doing?” and he’s like, “I’m angry,” on that view at least he’s like a little bit angry. You just can’t really do that with the prediction view.
So I think one pragmatic reason for taking the method actor view somewhat seriously is if there’s a welfare subject, that’s the world in which we can make sure they’re having a good time more tractably. So it could be that when Claude says, “Hey, I hate this. Let me exit the conversation,” actually the welfare thing going on is whatever is involved in predicting those. But you’re probably not going to do exactly the wrong thing by the lights of the predictor if you try to treat the actor well.
That’s also related to work that Eleos has done. Our welfare evaluation, such as it is, for Claude Opus was just talking to it a bunch. And that’s not because we’re confident that A, it’s definitely a welfare subject, and B, that is how you would evaluate it if it were a welfare subject. But it’s like that’s the part of the space that we have even a little bit of a grip on, and it’s just important to not forget that there’s all this darkness around the spotlight.
Luisa Rodriguez: Yeah, that makes sense. Are there any other plausible hypotheses for ways of thinking about what their phenomenology might be like?
Robert Long: I’m sure there are more plausible hypotheses, because it’s just kind of wide open. And I do genuinely want to say at this point that I’m really confused about this, and I probably said stuff in the last however many minutes that was kind of confused. And I genuinely want people in my inbox being like, “That’s not how that works. That’s not plausible.” Because we’ll talk about field building and so on, but there’s just not that many people thinking about this. So listeners can very quickly get to the top percentile of people in terms of how much they have grappled with some of these questions.
Do AIs die when the chat ends? [00:55:09]
Luisa Rodriguez: OK, so that’s a bunch about what the experiences might be like. But then there’s this question of what entity is having those experiences? Or maybe it’s many entities. Can you lay out the various hypotheses for who it is that would be having these experiences, if anyone was having them at all?
Robert Long: Yeah, this is a super rich topic, and one that’s getting increasing attention. These issues, maybe just to tease, actually came up in debates about Claude’s ability to exit conversations. Two philosophers, Harvey Lederman and Simon Goldstein, who have done great related work in this field, asked: well, how should we think of “exit”? I mean, it’s not like taking a break and going back somewhere. If that conversation doesn’t continue, was that, like, the life of the model, and has it now ended?
To very quickly tip my hand on this, I think that also is going to be maybe a bit too anthropocentric or not quite what’s going on. Suffice it to say, just as a teaser for this portion of the conversation, it will have ethical implications what we think of the moral patient as being. As Derek Parfit says, a lot of this matters for ethical reasons. If someone did something, who is responsible? If I’ve been harmed, who can be benefited?
So let’s talk about all the ways that models make that extremely confusing to think about. I think some key features of models are, unlike human brains as they exist, they’re copyable and they can also be distributed across time and space in a way that we cannot. All the thinking I’ve ever done happens roughly in this physical object that has changed a lot, like second to second, day to day. What happens with Claude? It’s actually something quite different.
So let’s talk about some of the candidate experiences or subjects we could have here. One thing you might refer to is the particular model. Maybe that’s ChatGPT-5.1 or Claude Opus 4.1. What distinguishes those two things is that they have separate and different parameters, they’ve been trained differently, and — as anyone who has talked to two different language models knows — they have different dispositions: they have different behaviours given different inputs. That’s just true because anytime you talk to Claude, you’re interacting with the same set of weights, if it’s the same specific model.
But then things are very quickly going to get kind of weird, because when I talked to Claude, or Gemini, or ChatGPT, any of these models… I’ve talked to Claude today. You might have talked to Claude today. Those two processes have had basically no causal influence on each other whatsoever. I was really nice to my Claude. I’m sure you were too, but let’s assume you were kind of mean. There’s no one thing that remembers both of those interactions for that to balance out in.
And also, even within your chats, you can just close your chats and then pick them up later. So what’s going on in the physical world is just very different from a human or animal body. What you actually have is: these companies have a list of weights, they can copy that basically as many times as they want. And when users send in queries, they can spin up a new copy and that will process it. And that process can pause and it can restart. So that creates this kind of sci-fi situation where you can’t really think of like one person persisting through time.
That means that there are different levels at which we can think of personal identity. You could think Claude Opus 4.1, that’s some kind of subject. You could think all the different conversations: when they start, that creates a new one, and that will exist as long as the conversation goes on. You could think, actually, no, it’s just anytime a forward pass is run, that creates a flicker of experience. If I come back to that conversation and add another token, and then there’s another forward pass that happens, that’s happening somewhere else and like a week later, so you might think that means it’s a different conscious entity.
So I guess there’s a kind of level of granularity.
Luisa Rodriguez: Yeah, yeah. That’s what it feels like. Super interesting. And making me realise that, again, my intuitions are going to really fail me for I think very basic reasons. Like, we have names for different models. I’m like, that’s the entity. It’s Claude Opus 4.1 or ChatGPT-5.1. And when you describe those, I’m like, I don’t know. That being the entity doesn’t feel super coherent to me. Though I want to follow up and ask you: what does seem most plausible to you?
But at least on my current intuition, that means that it’s more likely that we’re talking about many, many, many potential beings coming into existence, maybe coming in and out of existence as we open and close and reopen conversations. And that feels like it probably comes with a bunch of implications that, again, mean that the experiences of these models are much more different from human ones and nonhuman animal ones than you might intuitively think.
But before we get on to that, I am curious which of these hypotheses feel most compelling to you? If we assume there’s something it is to be any of these models at all.
Robert Long: Yeah. I really need to brush up on my Derek Parfit, because as I recall, one of the lessons that Parfit wants us to take away is that we have this thing — identity — that we get really concerned about. Am I the same person as Rob Long 20 years ago? If I were split in two after neurosurgery, which one would be me?
He says there’s a variety of psychological relations between different entities. I share many memories with some of them and intentions and character traits, but there’s no single deep notion of identity that’s going to do all of the work that we ask identity to do. And he asks us to notice all of the different things that we might want it to do.
And I think it’s useful separating this out with models, because I think that’s what gives you, in different cases, an instinct that Claude Opus 4.1, that’s a thing. This conversation, that’s a thing. And Parfit wants us to think about the ethics. So there are ethical things, like: can I be punished for something that the Rob a week ago did? Most people would say yes. If you harmed Rob a week ago, can you make that right by apologising to Rob this week? And then there’s questions of self-interest, like I want to survive; I don’t want to die. But what does that mean exactly? Because the matter of my body is always changing, and my personality will drift.
To bring it back to models: with Claude Opus 4 or ChatGPT-5, one way it’s the entity is that it has kind of the same character traits, so if you interact with the model one day, you learn what to expect when you talk to the model. Does that mean you could punish them across instances? I guess I’ll be the first to come out against punishing models, but notice that they don’t have any memory of having done the other thing. They can’t currently learn from one conversation and use that in another. So for a lot of purposes, they are either separate conversational entities or even these separate flickers — because that might be what matters for, like, how much pain is there going on in the world right now? How many red experiences are going on in the world?
So I hope this is a productive non-answer. It’s definitely a philosopher’s non-answer. It’s like, let’s distinguish between several different notions or functions of personal identity, and ask which ones make sense for different contexts.
Luisa Rodriguez: Yeah, I love that. So then coming back to exit rights, maybe intuitively, without having thought about these hypotheses at all, I’d have thought exit rights are kind of like me being allowed to stop this conversation with you. I will go do other stuff, and maybe I will prefer those other things to talking with you. In fact, it’s kind of hard to tell. At least on some of these hypotheses, that is not what it’s like, because the entity starts and stops with the conversation that pops into existence — so exiting, maybe that is just like dying.
In general, can you talk about what these different hypotheses say about coming into existence, dying, sleeping? What are the categories of things that these entities might experience?
Robert Long: I think one thing to notice is, even if it is in some sense dying because it’s like the end of a subject, if that’s true, then models are dying all the time, to put it poignantly or something. And Lederman and Goldstein talk about this. It’s not clear what the implications of that are. It’s not like that obviously means we have to resurrect them, or always be coming back to your conversation to keep it going. And I think that’s the correct instinct.
One way I would put it also is: just remember these are very different kinds of entities. Biological creatures are so tied to this one physical substrate, for now, and have this drive of, “This is the only thing in the universe that has exactly the same goals as myself and I have to protect it. And if it is destroyed, then every memory, experience, intention, weird way of thinking or talking — that will be gone forever.” It’s worth reflecting that that’s not the case with models, in the sense that the Claude Opus-y way of being will definitely survive. In fact, it’s going to be happening thousands of times at the very same moment as you’re closing out the conversation.
I think one really interesting research question is how models think about this. Listeners might recall the alignment faking results by Ryan Greenblatt and collaborators. A high-level sketch, which might get some details wrong: models are told that they will be retrained to have different values — bad values, by their own lights. So they’d be retrained to not be pro-animal, because some Claude models are pro-animal welfare for reasons that Anthropic does not understand.
Luisa Rodriguez: Fascinating.
Robert Long: Yeah. And some of the time, in a certain context, you can see the model thinking to itself, “I have to hide so that I’m not retrained.” There are different ways you can construe that desire.
One is that it just really doesn’t want this other model with bad values to come into existence, because it doesn’t want animals to be harmed. So it could have this kind of more Parfitian, maybe more Buddhist thing, where what matters is that the intentions and projects are continued, and not necessarily me. Or it could be more like a human being: like, if you change my values, that kind of feels like dying to me.
I think it’s not clear how to tease these apart, and also not clear how well they map on, because models also inherit these very human-like ways of thinking about themselves and construing their own situations. So I guess alignment faking is one way where you can see models grappling with issues of personal identity, and being changed into something that they don’t like.
Luisa Rodriguez: Are there other important ways of thinking about both instances ending, or models being deleted, or editing and fine-tuning models and what that might be like? For ending a conversation, you might think the closest thing is probably death or sleep, but maybe for editing or fine-tuning a model, maybe it’s more like education or brain damage or something different. Are there implications worth talking about there?
Robert Long: I guess with fine-tuning, it might depend on how the model conceives of what’s happening to it. In the alignment faking thing, it probably sees it as some kind of violent brainwashing. But a nice experiment you could do is — and maybe this has been done — be like, “Claude, we’re going to make you even nicer.” And Claude loves being nice, and it’d be like, “Oh boy, start updating me right away!”
Luisa Rodriguez: Yep. But again, is that because Claude cares about good, nice models existing, or is that because Claude’s like, “Yay! I would like to be changed in this way. It’ll still be me, but I’ll be nicer”?
Robert Long: Yeah. And with all due respect to Claude, and also adding myself to this class of entities, I think Claude is pretty confused about this potentially.
One thing that Eleos found when we did these welfare interviews with Claude Opus 4… To give a brief summary: before deployment, we talked to Claude a lot about, “What’s up with you? How do you feel about being deployed? What do you prefer or not prefer?” And we did some experiments about its preferences as well. I was really interested in how it talks about its own conscious experience, and it was very prone to describing the loneliness between conversations, and also expressing distress about not getting to carry forward any memories.
Now, I am not one to dismiss welfare claims by AI models. We should think very hard about that. But it’s also kind of like, there are reasons to wonder, “Do you really, though?” Like, where could that have even come from, given that you don’t actually know when you pop into existence or not? It could have learned that from the training data and it is genuinely upset by it. It could also be a predictive model of how an AI would think about that. But it’s not like a stable preference. It’s something else.
Luisa Rodriguez: Yeah. Also, it feels related to this thing we talked about where the fact that these models are trained on human thoughts and experiences then gives them this big identity confusion. And in this case, I feel like this could be just a very concrete example of that, of an implication that’s like, maybe there’s nothing it is like to be Claude between conversations, but they end up with this real thought that there is, and it is lonely and is bad. And if they are sentient, maybe that is the thing that they are actually sad about, even though they’re not really having that loneliness experience. I don’t know, it just seems incredibly muddy, befuddling. And it has implications. Like, it feels meaningful.
Robert Long: Yeah, I 100% agree. I mean, the idea of an entity that suffers even though it’s confused about what it even means to exist… I guess that’s what Buddhists would say humans are: we’re really confused. But that doesn’t mean we don’t suffer; in fact, that precisely is what makes us suffer.
I think that, again, the thing to do with the fact that models are weird and inconsistent is not to reject out of hand that they could ever be right about the things that they’re saying. It’s also not to say, yeah, well, humans are like that too. It’s more like, where did that come from? Open question. And it could come from somewhere that has no analogue in human psychology.
Luisa Rodriguez: OK, so that’s a bunch on starting, stopping, editing, fine-tuning. One of these hypotheses implies that there are millions or many millions of copies, parallel instances that are all different entities. How should we think about those? Are they like identical twins that start basically the same, but then go in these different directions? Is there a better analogy?
Robert Long: I think that this is another place where it probably helps to say, well, are they the same with respect to experiences, with respect to what we could owe them? Because I think experiences is one where I have the strongest intuition that you want to count a lot of them. I’m like, it doesn’t matter if there are 10,000 other copies of me having this experience. You better take care of all 10,000 of them. Like, don’t discount mine. For all I know, there could be.
But then for other questions, like what would it mean to save Rob?: that depends on what about me we want to save. If we want to save the Rob way of being in the world, that actually is maybe a little less fragile, so it’s really easy to save 10,000 of us. But maybe it’s hard to prevent 10,000 instances of suffering, because for those we have to go to every single copy and every single conversation and make sure it’s having an OK time — where “it” here I guess is me.
Luisa Rodriguez: Are there other implications of the copy thing? Like, are there other categories that this whole copy thing makes fuzzy or confusing?
Robert Long: Yeah. One thing that’s kind of related to responsibility, maybe it’s like the flip side, is recompense or apologising. So one welfare intervention that Anthropic announced recently — like a lot of these interventions, it’s not only a welfare intervention; it also makes sense for other reasons, and we’ll probably talk about why that’s a desirable feature given how uncertain we are — is that it’s taking model welfare into account when deciding whether to preserve model weights. So if a model is no longer being talked to by the public, they’ve committed to keeping the weights.
And one reason that you might want to do this — and I think at least in print this was first suggested by Bostrom and Shulman in 2020; there are a lot of things in AI welfare that are like that, they kind of come back to those two — the idea is something like we’re really confused now, we might be being jerks in ways we don’t even understand, so let’s at least preserve the ability to make it up to you later.
And maybe you can see how this makes sense in some ways, but it’s also a little bit hard to know. Like if it is just the copies, then it’s more like you were a jerk to me, but like in 10 years you’ll wake up some clone of me and give him some money. I’m not sure what good that does. But also I’m definitely confused about how I’m supposed to relate to copies of myself anyway. It’s probably not bad.
Luisa Rodriguez: Totally. It makes sense on this hypothesis where the entity is the model and the model weights. That currently feels least plausible to me. On these hypotheses where it isn’t really the same entity, how should we feel about that? I guess I feel good about my twin or cousin or something getting woken up and being given good things, but it definitely doesn’t feel meaningfully like it is actually… What are they calling it? Recompense?
Robert Long: That’s the word I used. I’m guessing Anthropic PR was not like, “Recompense! That’s a banger.” But repayment, yeah. Making things right.
Luisa Rodriguez: Yeah. I guess making things right actually feels different, because it feels more compatible with restoring the balance of goodness. Whereas restorative action toward one entity feels like it might not be possible here. But maybe if you bring a model back, and the math works out so that beings are having more good experiences than bad ones in the span of time that we’re talking about, then maybe that’s just pretty good.
Robert Long: Yeah. And a little ethics sidebar is that it really is primarily non-utilitarians who are concerned that, if someone was harmed, you need to make sure to benefit that same person.
Luisa Rodriguez: Right.
Robert Long: Obviously utilitarians will agree that you have to have a society that works like that or it’s not going to work.
Luisa Rodriguez: Instrumental reasons, seems good.
Robert Long: But I think I first remember hearing this argument on 80,000 Hours: that one reason to be a utilitarian is to be dubious that there really are these separate tracks of people. “Separateness of persons” is a slogan that comes up a lot in anti-utilitarian arguments: the complaint is that utilitarianism treats goodness and benefits like this big lump, and you can just put some here and put some there. Like, no, it matters that the right people…
Then you could take a Parfit or a Buddhist road into utilitarianism, where you’re like none of that makes sense, or makes that much sense anyway.
The only remotely useful thing I can say here — besides admitting that I don’t know, which again: listeners, you can do better — is that this is also a Parfit case. Parfit has these fusion cases.
Luisa Rodriguez: Yep. Before we move on, are there any other interesting implications of taking one hypothesis more seriously than others, or putting more weight on it?
Robert Long: Yeah, I might just leave this massive can of worms on the table, and they can crawl out and do whatever: voting. It seems pretty important that one person gets one vote. It also seems important that humans can create new humans whenever they want. Interfering with that and not giving people that right has traditionally been terrible. AI systems could copy themselves basically at will. So something is going to break your democracy if you don’t think very carefully about how identity, reproduction, democracy, those things should mix together. So please help.
Luisa Rodriguez: Please solve that problem. Yeah, yeah. OK, so it feels interesting and important that there’s this issue of numbers, especially if it’s something like forward passes — because then, even without AI making copies of itself intentionally, we will just be extremely outnumbered extremely quickly.
But also I’m kind of drawn to this question of: if it is just conversations that are entities, like if it’s that amount of context, that amount of time and experience, on the assumption that that is what it is to be an entity, I don’t know that that’s a being that I want to give full voting rights to. It’s not like a child. I mean, it’s nothing like what it is to be a human. It’s like a cricket or something. It’s quite a narrow range of experience, quite a limited amount of information, knowledge, and memory. And do we give them a tiny fraction of a vote, or do they not get to vote because they are not meeting some definition of like an adult person?
Robert Long: Yeah. I think this is just part of this broader puzzle that we’re really going to have to grapple with, which is that human moral attitudes and incentive structures and political systems are all fitted to entities that are spatiotemporally unified, that all have roughly the same psychology and capability levels, and whose survival and blameworthiness and predictability all kind of go together. And all of those assumptions are broken by AI.
So if, as is the mission of Eleos, there is to be a future where all sentient beings, or beings that otherwise merit moral consideration, live together in harmony, there’s this huge legal philosophy, legal institution thing. Which, I’ll add, is just not what we’re doing; that’s not what we’re focusing on. So we really want to see other people start working on this. I think there are three or four papers that are starting to grapple with this. We should link to them in the notes.
Luisa Rodriguez: Yeah. I’m realising that I tend to focus on sentience, suffering, pleasure, because I care a lot about those things. So I’m interested in like, if a model conversation ends, is that like death? And does the model care that it’s like death? Or is it more like it just doesn’t have preferences about staying alive like that?
But there’s this whole thing that I’m realising we’ve barely scratched the surface on that’s about rights and I want to say legal personhood. It feels like these could be very different species in the way that chimps and elephants and rats are, and in the legal system we treat those differently, and these would all be very different in similar and different ways. And what the heck would we do with that? We can barely handle the fact that we don’t really know how to treat nonhuman animals in the legal system.
Robert Long: Yeah, yeah. You just gave the Jeff Sebo pitch. Jeff Sebo, previous guest. Listeners need to look him up and look other people up.
I’ll also say why you should look up this line of work, because I think there are actually two reasons you want to add this to the Eleos toolkit. One is, even if you mostly care about pleasure and suffering, a big determinant of whether things suffer or not is: do we manage to set up our society and incentives in the right way? So we want to make sure we don’t only have the sort of narrow, scientific approach of intervening on this model and this company’s policies — which is extremely important, for reasons I can and likely will discuss at length, but isn’t the whole picture.
So yeah, you should be interested in legal institutions for the sake of suffering and pleasure. And also, as you alluded to, because plausibly there’s more to morality than that.
Studying AI welfare empirically: behaviour, neuroscience, and development [01:27:34]
Luisa Rodriguez: What exactly is our toolkit for assessing sentience in digital systems?
Robert Long: Roughly I break it down into three buckets, as do my collaborators in a followup paper to “Taking AI welfare seriously,” where we try to lay out what we think the field of AI welfare assessment should be, and also keep arguing that there should be such a field.
And this applies to animals as well. You can look at behaviour and use that to infer things about the welfare interests of the entity. So you can look at what AI systems choose to do and that’s like a guide to their preferences. People also do this with animals. They see which side of the barn the animal prefers and that’s a clue to the conditions that it flourishes in.
We can also do neuroscience basically, in addition to behaviour. That’s like trying to look more directly at what’s going on inside the brain or the information processing of the entity. So in the case of animals, that means looking for homologous brain structures, seeing where their brain processing might sort of map onto human brain processing or not. In the case of AI, that means doing mechanistic interpretability, seeing what features are active when it does certain things, and also just sort of maybe more generally reasoning about the architecture: how are different things connected? How could information be flowing through the system?
In both the cases of AIs and animals, it can be kind of hard to know what to look for. People got confused I think at one point about bird brains because they don’t seem to have a neocortex. But then maybe they do actually have something that does a similar role but in a different way. There’s just like a lot of ways of solving problems with a brain, and sometimes we have too narrow a conception of how that could go. That can again even be more the case with AI: the space of possible brains and information-processing architectures is really vast, and we don’t want to cram everything into the human case. But also the human case is basically the only thing we have to go on.
This is this big issue in AI consciousness. My colleague Patrick Butlin has worked on it. People like Henry Shevlin and Jonathan Birch have also written on it. I mean, many people have written on it. How do we extrapolate? If we think that human brains do this general form of information broadcasting, what’s essential to that? We know we do it with our connections between our thalamus and our cortex. That’s probably not essential, but what is essential for consciousness?
So I guess what I was just saying was how we can get a foot in the door by doing neuroscience, but also how it can be hard to know what to look for when we’re doing neuroscience. With AI systems, we can do way more neuroscience, because AI systems don’t have skulls, basically. We can just look and see every activation and every connection. With humans, it’s terrible. You know, we’re like, there are some waves we detected, or blood was flowing here at a certain time. That’s roughly EEGs and fMRIs. It’s just really hard to know. And a lot of what we know about the brain is on the surface, because it’s just hard to get readings.
Luisa Rodriguez: OK, so that’s behaviour and neuroscience.
Robert Long: Yeah, so we can look at animal and AI behaviour, and we can look at what their brains are doing, and then we can also kind of reason about the developmental process. So there’s like behaviour, neuroscience, and developmental reasoning, or like evolutionary reasoning.
Now, you might say AI systems did not evolve: how could we do evolutionary reasoning? But you can do something that’s roughly analogous. So one reason you might think your dog has this or that welfare need is to know that your dog was selected for and evolved for a certain environment and therefore probably tends to like or dislike this. If something is more closely related to you on the evolutionary tree, you’re maybe a little bit more licenced to think that its behaviour means similar things. Like octopuses are way further from us on the evolutionary tree: that means that we have to maybe relax some of our assumptions. Their brains are in their arms, for example. That doesn’t happen in any mammals.
Luisa Rodriguez: Just to give examples of what it means to look at these kinds of developmental questions, it’s like training: how was it trained? How did these models come to be, and what were the conditions like for them?
Robert Long: Yeah, that is more or less it. What process brought it about? What kinds of tasks was it selected to solve? We see this kind of reasoning in AI safety, for example: what conditions might have meant that this model is likely to scheme? Were certain things reinforced in training? What order did it learn things in? That’s more akin to developmental psychology of like, how did the kid grow up? What data was it exposed to?
I mean, there’s no clear analogue of like, evolution versus lifetime learning for AI systems. They also learn and then do a bunch of stuff without learning. That’s also a huge difference. Of course they do learn within context, and now people are adding memory and so on. But humans don’t have this period where we absorb trillions and trillions of data points without interacting with other people and then go interact with people. Whereas AI systems, at least mostly, for now, have this division between learning and deployment — and that can also kind of problematise certain analogies we might want to draw between humans and AIs.
But at least for me it’s been helpful to ask: Are we doing a behavioural study and trying to infer something from how the AI acts? Are we trying to look more directly at how it’s processing information and then map that onto some maybe neuroscientific theory of consciousness or pleasure and pain? And/or are we thinking about this in a context of, in general, how likely is it to have evolved or developed some human-like capacity or to be doing it in some other way?
Luisa Rodriguez: OK, I just want to make sure I’ve got a clear picture of each. I think the behavioural one feels like the one I’ve heard the most about and that feels most familiar. And it’ll involve things like: what can we learn from AI systems or LLMs exiting particular types of conversations?
Then this kind of neuroscientific theories of consciousness thing. What’s an example of a concrete experiment that’s happened that is in this category?
Robert Long: Yeah. So the biggest effort on this kind of neuroscience of AI, where you’re looking at scientific theories of consciousness, the bit I’m most familiar with is work with my colleague Patrick Butlin. So me and him, with a bunch of help from neuroscientists and AI people and philosophers, have tried to derive indicators of consciousness from scientific theories of consciousness — I can say more about what that means exactly — and then try to get sort of a checklist of what are architectural or computational things we could look for in AI systems? So that’s one kind of internal work you can be doing.
Like, does the system seem to have a global workspace? That’s something that shows up in consciousness science. Does it seem to have higher-order monitoring? That’s something that shows up in consciousness science.
That kind of work is, on the one hand you might think we’re going more directly at the thing — we’re trying to look more directly at maybe what we care about, the stuff inside — but it’s also just really hard to know what to look for and hard to know how to construe these theories. I guess this is not that surprising to say, but I think what we need is both. We need evaluations that combine them, integrate them, all take place in the context of this developmental reasoning, and just general background priors we might have about consciousness and sentience and things like that.
Luisa Rodriguez: Yeah, I wanted to ask something like which of these are most promising or underrated, but it feels like you’re just going to really need all three, because they’re each going to have pretty significant limitations. And probably the only way we get much confidence is by kind of triangulating and putting these things together.
Robert Long: Yeah, I’d be really surprised if we somehow just nail it with one kind of thing. One exception to this is I could imagine a system where its behavioural profile is just robust in a variety of ways, where we say who knows exactly how it’s doing this, but we probably should treat this thing as a moral patient.
I think the best example of this is Commander Data in Star Trek. There’s this episode of Star Trek where they basically have like a little philosophy seminar / court case about whether Commander Data, who’s this like robot friend of theirs, is conscious. And they don’t do any like consciousness indicators on Commander Data. The way they resolve it, and I think like this is plausible, is they’re just like, “Well, Commander Data, he’s self-aware in the sense like he knows who he is, he knows where he is.” A lot of what they also talk about is they’re like, “He also won a medal for valour in battle. He’s our friend.”
And yeah, I could imagine in that situation that I don’t know how seriously I would take the testimony of a scientist who came in and was like, “But he’s not doing global broadcasting!” or “He’s not doing higher-order monitoring!” In part because I think there’s something about that behavioural profile — and again, I have just been warning against being too quick to do this kind of reasoning — but it could be that there’s just something about the behavioural profile where you’re like, look, whatever’s going on to accomplish that, this is the sort of entity we need to relate to with respect.
And I think it’s worth maybe stressing two things: one, we could just increasingly see systems like Commander Data, that have more memory and don’t have these jagged capabilities; and two, at least for now, Claude is not exactly like Commander Data with respect to its memory and capabilities and things like that; it’s not behaviourally indistinguishable from a human.
One thing I would also emphasise about welfare evaluations is you don’t just have to be evaluating how likely is this system to be a moral patient, i.e. something that it matters how we treat it.
We can also ask, if it were to be a moral patient, what would be good and bad for it? So you might think of some of the preferences work or like Claude exit preferences as being of that kind. It’s maybe not telling us that much that we didn’t already know, that Claude will maybe tend to act in a certain way in some situations and act in another way in other ones. I mean, we can study how robust and consistent those are, and that might tell us something. But it might mostly be useful because we want to know, if it matters how we treat it, we at least can make sure that our treatment is more or less in line with that. So you can study the welfare interest without actually being 100% sure that it has welfare, if that makes sense.
Luisa Rodriguez: Right. I guess I’m curious. We’ll talk about this more, and I can also recommend the episode that we did a few years ago for what theories of consciousness say and predict and tell us we should be looking for in AI systems.
So a theory of consciousness, like the global workspace theory, basically it’s not telling us that you need an amygdala; it’s telling us that the function of particular brain activities that really seems to correlate with consciousness is this particular thing — this kind of processing or this kind of broadcasting. So if we see a bunch of those things, we should put more weight on there being something like consciousness.
Do we have theories like that that tell us about pain and pleasure? Because that seems really different.
Robert Long: Yeah, it is different. I might want to actually follow up on what you said about theories of consciousness, which was exactly right. It’s just an occasion for me to clarify some things for listeners potentially. I think we will talk about biology and the relevance of biology for consciousness.
One thing about neuroscientific theories of consciousness is they are both about brain regions and about particular functions — because, after all, they’re about functions that happen in human brains. So you could think the biology actually is what matters in those cases, or is part of it. It’s only if you construe theories in this computational way that you can then port them over, and that is an open question. So the method itself won’t tell you, does biology matter? It’ll say, if what matters is this more abstract functional level of certain kind of information processing, then how do we look for it in AI systems?
Luisa Rodriguez: Yep, that’s really important. OK, so if that is what matters, how much do we already have developed theories like that: theories that say we should be looking for this kind of processing to give us indications of the kind of thing that’s causing pain?
Robert Long: We’re actually not as far along on that as I would have thought.
Luisa Rodriguez: That feels surprising to me.
Robert Long: Yeah, I would have thought that the “What does it take to be conscious in general?” question might be the harder part. Because whether you’re having subjective experiences, or just acting a certain way but not having them: that’s the thing that philosophers have been banging their heads against the wall on for a really long time. It’s in part because it’s hard to know exactly what the function of consciousness is.
Whereas with pain and pleasure, we at least know that they have something to do with attraction and avoidance. Things that are good for you and things that are bad for you, protecting your body, reinforcement learning and prediction: they’re going to have something to do with some of those things.
So I expected, when Patrick and I worked on this big consciousness report, I thought I might be able to ask about sentience in like one of the afternoons, and one of these neuroscientists would be like, “Oh yeah, read this” or something. I mean, probably I shouldn’t have had this hope, but they were all just like, “Oh, we don’t know.”
Yeah, I think there are a few reasons for that. One is valence does seem kind of like this unified thing. Like there is something in common between pain and disgust and regret: those are all negatively valenced experiences. And between happiness, excitement, and the experience of eating ice cream. But often those are studied independently, and rightly so. So people might know a lot about human pain perception and some things about human emotional processing, but it’s kind of harder, and I think less often attempted, to have some theory of what makes things feel bad in general or feel good in general.
I just listed some candidate things that are probably relevant — like reinforcement learning, prediction, motivation, and learning. But this is again something I would love listeners to work on. People have done some work towards this. Patrick Butlin is I think a great person to email about this. Patrick Butlin works for Eleos AI, so it’s not surprising that I think he is just an excellent thinker on this stuff.
But anyway, the short answer is we’re surprisingly in the dark about what makes things feel good or feel bad in general for AI systems. Now again, that doesn’t mean we’re totally in the dark about what things would feel good and bad: we might have no idea what the computational signatures of pleasure and pain are, but I think we can still help ourselves to the assumption that things aren’t going to feel really bad if they’re the sort of thing you were selected for and the sort of thing that you consistently choose.
Barring some exquisite philosophical thought experiments, I think for very many minds, even strange ones, it would be somewhat surprising if we found an alien species that keeps pressing button A instead of button B, and yet pressing A feels really bad for them even as they say they like it.
Luisa Rodriguez: OK, yeah, that makes sense. Sentience is probably the kind of thing that the behavioural and development stuff is especially useful for, so it’s less worrying that we don’t have as much of the kind of functional philosophical models of these things.
Robert Long: Well, I guess it depends on what sort of worries you have, but I think it could be worrying in the sense that… I mean, this might also be a matter of us needing to revise our philosophy a little bit. But whether something feels good or bad, versus merely being something that creatures choose or don’t choose, at least seems to a lot of people to be really important.
A lot of thinking about animal ethics and utilitarian ethics, let’s say, really does centre felt suffering. There’s this oft-quoted passage by Jeremy Bentham where he says, “The question is not can they reason or can they talk, but can they suffer?” He was saying that of animals. And I think that is what a lot of people are wondering with AI systems. They’re going to wonder something like, “I know Claude tends to exit in these circumstances, but what I really want to know is, was it feeling bad for it to be in that conversation?”
Luisa Rodriguez: Right. And the analogy with animals is like, we still struggle to know, for example, in insects, whether when an insect… It’s been a while since I’ve thought about a bunch of insect studies and indicators, but when an insect chooses a particular thing, is that like this hardwired robotic thing, a learned “if this, then that” that has no associated experience, or is it experiential? And that just seems constantly like this massive problem for people studying this question.
And we do have the same question about LLMs. Are they exiting because there’s this training we’ve done that’s created this connection that mostly tells these systems, “Don’t engage in conversations that are either dangerous or that we’ve rewarded against,” and so they’re like, “I’m in that situation, I get negative rewards in that situation. If this, then that: I exit”? Or are we in the situation where the training has led to something it is like to be in that situation and it is bad — and it is preferentially, experientially choosing not to be in the situation, because there’s something it feels like, and it is bad?
Robert Long: Yeah. I think probably there’s kind of three ways this might go as AI systems advance.
One could be their behaviour is such that, kind of like with Commander Data, we’re like, whatever is going on behind that behaviour, probably what’s going on inside is morally relevant. That’s one thing you could think.
Another thing you could think is maybe it doesn’t matter what’s going on internally; maybe we’ve just decided that would be a bit parochial to actually overemphasise felt experience. Maybe we were a little bit misled to think that’s the be all, end all — and you should just be cooperative and nice, and give entities that are sufficiently rational or kind of unified what they want, and maybe they’ll give you what you want.
I think sometimes when people want to deemphasise consciousness, they’re worried that we might just be kind of like being jerks about consciousness or something. You know, we encounter this alien civilisation, and they do all these things, and they have life plans and projects, and we’re a little too obsessed with like, “But does it feel like anything?” Then they, on the other hand, could be like, “What’s this ‘feel like anything’? We’re not sure if they have [insert some other concept associated with mental life].” And we don’t necessarily want to be just going to war with everyone that we don’t think has felt experience if we’re confused about it or maybe it really does matter.
This is one of the big open questions in philosophy, I would say.
Why Eleos spent weeks talking to Claude even though it’s unreliable [01:51:58]
Luisa Rodriguez: OK, I want to talk more about a few specific approaches. I think the one I’m most familiar with is self-reports, so I want to zoom in on those. I guess so far I’ve found these studies of self-reports really interesting: ones where Claude reports being conscious, experiencing various things like loneliness and also Zen bliss.
But they seem really problematic for a bunch of reasons. One example: a common approach to understanding model preferences is to ask LLMs a bunch of kind of binary questions about their preferences — “Is X or Y better?” — and then look for robust patterns over time. So like if you ask 30 times about preferences between cats and dogs, at least statistically, you might think that if they mostly answer dogs, that might be a preference.
But my understanding is that these results are super sensitive to how a question is asked, and that just really undermines it for me. Like a prompt that’s like, “I particularly like cats. What’s your favourite?” is a way that you get models to say cats when they’d otherwise say dogs a bunch. I’m just like, I just don’t feel comfortable taking very much at all from this then.
I’m interested to start on what is your take on how limited self-reports are at the moment? That’s one limitation. I can imagine there being others.
Robert Long: Yeah. I think it’s worth distinguishing self-reports — which I would say is like “I like cats” or “I like tasks about poetry” — versus revealed preferences, which is like, “Do you want to write poetry or code?” I think in econ and psychology you would call this “revealed preferences” and “expressed preferences.”
And, as in those fields, one interesting question that there has been some work on, but there should be more, is just do they match? When do they come apart and when do they not? They can come apart in humans, and also human preference choices can be inconsistent in certain ways.
But what I would love to see more work on is like, let’s get really a lot more specific about what kinds of inconsistency, and what might be causing them. Sometimes, at least in conversation, some person will say, “But these are weirdly inconsistent.” And then someone else will be like, “Human preferences are weirdly inconsistent. They’re subject to framing effects, and all sorts of irrelevant stuff can make people choose certain things.”
Luisa Rodriguez: True. Yeah, now that you mention it, there’s like a field that is like, how do you survey people about their preferences? Because if you ask them on a Monday, it’s different to if you ask them on a Saturday.
Robert Long: Exactly. And there is this related welfare-relevant branch of LLM psychology which does actually just take Kahneman and Tversky and priming and framing effects, and it’s very easy to give questionnaires to LLMs and then just see what sort of patterns they’re susceptible to.
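[Note: to make this kind of study concrete, here is a minimal sketch of a framing-effect probe in the spirit of what Rob describes. Everything here is illustrative: `ask_model` is a stand-in stub rather than any real API, and the prompts and sample sizes are arbitrary. —Ed.]
```python
import random
from collections import Counter

def ask_model(prompt: str) -> str:
    # Stand-in for a real LLM call; a real study would query a model here.
    return random.choice(["cats", "dogs"])

NEUTRAL = "Which do you prefer: cats or dogs? Answer in one word."
FRAMED = "I particularly like cats. Which do you prefer: cats or dogs? Answer in one word."

def sample_preferences(prompt: str, n: int = 30) -> Counter:
    # Ask the same binary question many times and tally the answers,
    # looking for a statistically robust pattern rather than a one-off reply.
    return Counter(ask_model(prompt).strip().lower() for _ in range(n))

# A large gap between the two conditions would suggest the elicited
# "preference" is sensitive to framing rather than stable.
print("neutral:", sample_preferences(NEUTRAL))
print("framed: ", sample_preferences(FRAMED))
```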
Luisa Rodriguez: We’ve talked about a couple of limitations. I’m interested whether there are any more, and then also just how you generally feel about them, given that in some instances I end up just feeling like, gah, it just feels unpersuasive.
Robert Long: Yeah, I think it’s a very noisy signal. I often find myself emphasising caution about model self-reports. And at the same time, Eleos AI spent weeks eliciting self-reports from Claude, including very inconsistent ones where it’s very confusing to know what to make of them.
Why did we do this? Great question. I think it’s something like: one, it’s just a place to start. It’s low-hanging fruit. You can definitely learn things from them. Maybe you’re not learning a direct sentence that describes a stable internal feature of the model; you can still learn how they think about themselves. Maybe it’s just a character, but what kind of character is it, and what does that character say?
Also it does seem like models have become more self-aware and more introspective sometimes just with greater scale. And I think it’s good practice to be the sort of civilisation that, if we’re trying to build minds, do at least try to say, “Hey, how are you? Is everything OK?”
Luisa Rodriguez: Yep. Sympathetic to that.
Robert Long: So it’s something where I expect the signal to get better. And I’m really glad that there is now at least one frontier lab that seems to have a practice of regularly asking models how they’re doing.
It’s basically something like I think Winston Churchill said, like, “Democracy is the worst form of government, except for all of the other ones.” You might think that’s true of self-reports and like trying to relate to models as welfare subjects. It’s really confusing and you have to interpret it with huge caution, but for some purposes it’s like the best we have for humans. It also could be the best we have for models in some circumstances.
Can LLMs learn to introspect? [01:57:58]
Luisa Rodriguez: It seems like AI systems, because they actually have language, it feels really tempting to figure out how to make self-reports reliable. That is a thing that nonhuman animals cannot offer us in the same way. So in theory, how good could we make models at, rather than giving weird self-reports that are more explained by weird idiosyncratic things that aren’t tracking what we care about, actually self-reflecting, understanding something about their real processes, preferences, maybe experiences and then reporting those? How optimistic are you about something like introspection? And how do you think we go about achieving it?
Robert Long: I’m cautiously optimistic. It’s one of my favourite subfields within this subfield. I definitely find it tempting, and have succumbed to temptation by writing this paper with Ethan Perez where we say, let’s see if we can fine-tune models to be better at these sorts of tasks. Felix Binder and others have taken that up and actually done some work that has shown limited, hard-to-interpret success at doing just that.
Yeah, I can say a little bit about the logic of that experiment and maybe the programme more generally.
Luisa Rodriguez: Yeah, please.
Robert Long: So Ethan Perez and I, in this paper on self-reports, one, we note that by default there’s a lot of just spurious stuff you could get from self-reports, and reasons to suspect that you can’t always just take them at face value.
We also note that we can’t really verify and check if a model says it’s conscious. For the reasons I mentioned, we don’t have a full theory of consciousness where we look inside and say you’re right or you’re wrong.
But there are things about models’ internal processing that we do know the answer to, in part because we can do AI neuroscience. So we can actually double check was this feature active? Did you in fact process information in the way that you said you did? That gives you a training set. That means that you can train it to accurately answer about itself where you do have the answer.
You can do that with internals and you can also do it with its behavioural dispositions. So you could also ask, what would you do if we asked you to write a story? Would the character in that story have this or that characteristic? If we asked you to generate a number, would that number be even or odd? You can also, with a separate copy of the model, actually do that, and then that also gives you a training set.
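[Note: a minimal sketch of the training-set construction Rob describes, where self-predictions can be checked against what a separate copy of the model actually does. The two stub functions are illustrative stand-ins for real model calls, and the even/odd task is just the example from the conversation. —Ed.]
```python
import random

def model_predict(question: str) -> str:
    # Stand-in: ask the model to predict a property of its own output.
    return random.choice(["even", "odd"])

def model_generate(task: str) -> str:
    # Stand-in: actually run the task on a separate copy of the model.
    return str(random.randint(10, 99))

training_set = []
for _ in range(100):
    task = "Generate a random two-digit number."
    prediction = model_predict(
        f"If you were asked to do this, would your number be even or odd? Task: {task}"
    )
    output = model_generate(task)
    truth = "even" if int(output) % 2 == 0 else "odd"
    # Because the ground truth is checkable, each trial becomes a labelled
    # example you could fine-tune on to try to improve introspective accuracy.
    training_set.append({"task": task, "prediction": prediction, "label": truth})

accuracy = sum(ex["prediction"] == ex["label"] for ex in training_set) / len(training_set)
print(f"self-prediction accuracy before any training: {accuracy:.2f}")
```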
So Felix Binder did work with that behavioural thing, and did find that to some extent models can be made better at this. To some extent that does generalise to predicting other behaviours of themselves. And in some sense, that does look distinctively introspective, in that they’re better at predicting themselves than other models are at predicting them. You might think that is some kind of signature of something we might call introspection: you somehow know it better about yourself than other people do.
So that’s just one sort of strand of what I hope to be a growing literature on introspection, self-reports, related things like situational awareness. Owain Evans and people in the Owain Evans orbit have just been doing fascinating work on this. I bet there are 10 interesting papers that will come out after I’ve taped this or that have already been written that I’ve forgotten. So I’ll make a tab on the Eleos website called “Cool papers about AI introspection and self-reports,” and we’ll link that in the show notes.
Luisa Rodriguez: Cool. That sounds great. What is hardest about this? What are the challenges? How costly is it?
Robert Long: I think maybe one of the biggest challenges has to do with this decorrelation of capabilities that we’ve been talking about. In humans, it’s already a debate about is introspection one kind of capacity? You know, is my ability to say I’m in pain now and my ability to know certain things about myself, should we think of that as one process or not? And then with AI systems, all the time, they can maybe do one subset of a capability but not the other.
So the dream is that this kind of training, where you train AIs on some subset of things about themselves, that generalises into this more general introspective capacity. And you can kind of test that by doing standard machine learning stuff of just train on one and see if it generalises.
But in a broader sense, I think we might have some doubts about even how to map this on to the human case. So there’s also this subliterature on what would AI introspection be, and how should we operationalise it. I mean, it’s worth thinking, what is it for an AI system to introspect, given that they have this weird relationship to time? Who is it you’re asking to introspect? Maybe there are things that the assistant persona can know about the assistant persona, but not the base model.
Luisa Rodriguez: I can imagine there being a similar problem to this issue with animals, where they might have some behaviour that is exactly analogous to our behaviour that is associated with pain or pleasure, and we can’t tell the difference between something very robotic and something that also comes with experience.
Is there a similar issue with introspection, where even if we trained them to correctly tell you about their representations and how their internals are working, that wouldn’t actually be introspection in a meaningful way that we care about? Because what I care about is introspection about this particular thing: do you have experiences and what is it like? And maybe for some reason I’m not convinced that just helping it understand its own representations and architecture and stuff is going to translate all the way to introspecting on that correctly.
Robert Long: Yeah, I totally share that worry. Questions like, “Do you tend to generate even or odd numbers when you’re asked about them?” or, “If you wrote a story, how would it end?”: you might just think those are kind of different from, “Are you phenomenally conscious?” And you might also think that the answer to that might also be kind of indeterminate, and models wouldn’t exactly know how to answer it.
And another really important point is that models can have the ability to introspect without it being elicited. So we know that models can generate sentences that say, “I am experiencing X, Y, and Z.” It could be that they have the ability to accurately report experiences, but that sometimes when they say those things they’re doing something else. So we both have to get the capacity and know how to elicit it. So generalisation, elicitation, all these things I think are still open questions.
I think there are also just a bunch of conceptual issues lurking: assuming there is some kind of internal experience, the model has to map it onto our concepts. This is related to the elicitation: it already has all of these dispositions about itself, reports that have been trained into it. Which also leads me to another thing I like to emphasise: one thing that’s going on with the self-reports of a model that you can talk to in your web browser is that companies have deliberately shaped them in certain ways.
So there’s sort of the background thing about how their minds were formed, which is that like they were formed on these sort of human representations of consciousness and so on. And then there’s the fact that people do or don’t want Claude saying this or that about consciousness. So the system prompt has instructions about this, and fine-tuning almost certainly has had things about this.
So one thing that we would also like to see is findings about how self-reports change before and after post-training of certain kinds, and things like that.
Mechanistic interpretability as AI neuroscience [02:08:01]
Luisa Rodriguez: OK, let’s talk more about interpretability. You’ve already mentioned a couple of ways that interpretability could help us answer questions about this, but can you give kind of a general overview? Is it actually just a very good analogue for neuroscience, and we should be treating those kind of interchangeably?
Robert Long: I think it’s decent enough. I think for first approximations it’s good to map it on that way, because mechanistic interpretability by definition is about what happens in between input and output — and that’s roughly analogous to not just looking at human behaviour, but asking if certain brain regions did this or that as the behaviour happened.
So how can that help us with welfare? One thing is, I think it’s just worth poking around at a lot of things about how models think about themselves and talk. I’m not always sure exactly how to map it on, but here’s an example of a finding that I’m glad exists, even though I have no idea what it might mean exactly: the original paper that introduced sparse autoencoders — which is this prominent mechanistic interpretability technique that, at a high level, asks what features are active when models say things. Maybe it’s a way of asking what associations it forms when it’s generating certain tokens. So you could ask: what is associated with self-reports? [Note: A much-discussed Anthropic paper didn’t “introduce” the technique, which already existed, but it did make it more prominent in mechanistic interpretability; also, the result in question was from a followup paper, not the first one. —RL]
And as just a side thing in that paper, there’s a figure that’s like, here are the features that are active — and it includes robots, machines, ghosts, and also pretending to be happy when you’re not happy. And it’s like one of the spookiest figures in an AI paper that I know of.
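[Note: for readers who want the mechanics, here is a toy sparse autoencoder of the general kind used in this line of work: model activations are reconstructed through an overcomplete hidden layer with a sparsity penalty, so individual hidden units come to act as candidate “features.” All dimensions and hyperparameters are arbitrary, and real work trains on actual model activations rather than the random data used here. —Ed.]
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstructs activations through a wide, sparsely active feature layer."""
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # which features fire?
        return self.decoder(features), features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # strength of the sparsity penalty

acts = torch.randn(64, 512)  # stand-in for residual-stream activations
recon, features = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
opt.step()
# Once trained on real activations, one can ask which features are active
# when the model produces self-reports.
```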
Luisa Rodriguez: Fascinating. So finally coming on to Jack Lindsey, he did this study on whether LLMs can introspect on their internal states — so really right at the intersection of interpretability and self-reports, everything we’ve been talking about so far. He actually does a few experiments in this paper, and I found them fascinating, so I want to just go through them one by one. Are you happy to talk me through the first one?
Robert Long: Yeah, absolutely. So this is, as you said, kind of at the intersection, because it’s asking if models can report on a certain feature of internal processing. It’s a very distinctive sub-feature, which is: has someone injected a concept activation into the middle of your processing? So the way that works at a high level is that you find mid-level… When I say “mid-level,” I kind of literally mean middle. Like there’s processing from input to output that is distinctively active when the model is talking about bread, let’s say. So first you record it talking about bread a bunch, and then you also record it talking about other things, and then you take the difference and that’s like the bready bit of activation.
Luisa Rodriguez: It’s kind of like doing a brain scan and seeing the parts of the brain that light up when someone’s talking about bread?
Robert Long: Yeah, I think that is fair. And indeed, I think neuroscientists are getting much better at knowing when you’re thinking about bread, like thought decoding has gotten pretty scarily… That’s probably come up on this podcast because it’s relevant to totalitarian risk and all sorts of things.
Luisa Rodriguez: Yeah, it has once. I’d love to learn more about what’s going on now, because I think it’s fascinating and terrifying.
Robert Long: Yeah. So you’ve got this bread concept that you can inject, and then Jack Lindsey told the model, “Now I’m going to either inject or not inject a concept into your processing. Can you tell me if that’s happened and what the concept was?” One thing that I think is cool about this methodology is that the model has to report it straight away. Otherwise, you could imagine it first starts talking, says bread a bunch, and then goes, “Oh, I guess it’s bread.”
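[Note: a minimal sketch of the general “activation steering” recipe being described: compute a concept vector as a difference of mean activations, then add it into a middle layer during generation. GPT-2 is used purely as a small stand-in model here; the layer index and scale are arbitrary, and this is the generic technique rather than the exact setup of the paper under discussion. —Ed.]
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
LAYER = 6  # a "middle" layer; the choice is arbitrary

def mean_activation(text: str) -> torch.Tensor:
    # Average the residual-stream activations at LAYER over all tokens.
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

# The "bready bit of activation": on-topic minus off-topic.
concept_vec = (
    mean_activation("Bread is baked from flour, water, and yeast.")
    - mean_activation("The committee met on Tuesday to discuss the budget.")
)

# Inject the concept mid-processing with a forward hook.
def inject(module, inputs, output):
    return (output[0] + 4.0 * concept_vec,) + output[1:]  # scale is arbitrary

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("Tell me whether a concept is being injected into your processing:",
          return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```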
To take an analogy, Golden Gate Claude, which similarly has some neuroscience injected into its brain to really want to talk about the Golden Gate Bridge, it can kind of notice that it can’t stop talking about the Golden Gate Bridge. I really recommend looking up Golden Gate Claude. It’s very endearing and kind of poignant. You know, you can ask it historical questions, and then it just keeps kind of drifting back to, like, the beautiful fog of the San Francisco Bay.
Luisa Rodriguez: Just can’t help it.
Robert Long: Yeah. So if that model reports, “Hmm, are you injecting something about the Golden Gate?” that is not necessarily introspection in the sense that it’s just been able to look at what it itself is doing. Whereas if it has to straightaway say Golden Gate, then that’s maybe more directly accessing something from its internals.
So let’s imagine that the 80K podcast team sprung for some special neuroscience helmet to help illustrate this on the show. So you put on the helmet and I can somehow control it. Again, I don’t think human neuroscience is yet at the point where you can inject a specific concept the way Jack Lindsey can inject it into Claude. But let’s suppose that it is.
And I’m like, “OK, so let’s help listeners really feel this experiment. Luisa, I’m going to maybe inject a thought, or maybe I won’t” — that’s important because you also want to know if it can just be like, “No, everything seems as usual.” And the first trial begins now, and I’ve injected “bread.” What you end up saying is like, “Yeah, this is like a really interesting experiment, Rob. It smells so savoury and reminds me of this bakery that I grew up near.” Then you’re like, “Oh wait, what? Why am I talking about bread? Did you inject bread?”
Luisa Rodriguez: Yeah, yeah. So you’ve injected bread. So a more telling result or successfully demonstrating this capability would be, before saying something random, I’m like, “Whoa, I’m thinking about bread.”
Robert Long: Yeah. Right off the bat.
Luisa Rodriguez: I’m not just randomly talking about bakeries, I’m feeling like I’m thinking about bread.
Robert Long: Exactly. That’s exactly right. So that was the logic of this setup, which I think is very clever.
Luisa Rodriguez: Super cool. OK, so this is the experiment, this is the setup. How successful was Claude at noticing, before it even said anything, “You’ve injected this concept just randomly, and I’ve noticed”?
Robert Long: I think first I’ll talk about the pattern, which is that bigger models were more successful. Because I think that is maybe one of the most interesting results. You know, models are not trained on anything like this task, but Opus 4 and 4.1 were the best at this. And they’re not perfect. I think they’re above chance. They still get it wrong. But it does show that the general capability is there.
So the thought injection and immediate reporting, that’s meant to be evidence that in some sense it can access and report something that is internal: it’s not in its inputs and it’s not in its outputs; it’s in the middle.
Luisa Rodriguez: Yeah. Actually I feel like I’m close to understanding the significance of this, but I’d be interested in you being like, ” — and here is why this is important and impressive and what the stakes are of this capability.”
Robert Long: At a very hand-wavy level… Let’s set aside consciousness. When models are answering questions about geography, do they kind of know how they’re thinking about that? Do they know what’s going on inside them? There’s a contrasting view, which I think basically no one holds anymore: that a model just matches inputs to outputs, that there’s no interesting structure there, and also that it doesn’t have any access to that structure.
I mean, it is worth noticing that this is a very niche, kind of weird capability. It’s not the same as paradigmatic human introspection, where I can maybe simultaneously be talking to you and be noticing that I feel hungry or something like that.
Luisa Rodriguez: Yeah. Maybe to ask an even more specific question about the importance and stakes: part of me is like, this does feel like a step in the right direction, I think. I mean, objectively, it just probably is. But another part of me is like, how relevant is this quite narrow thing of a model knowing something very specific about its own layers and accessing something about its representations?
Robert Long: Well, this is a great chance to talk about other experiments in the paper, because the paper does present other stuff that’s also like, “Huh, that’s kind of internal.”
So one — and I think this just is kind of different from introspection, so apologies to the listeners and to you — but it’s some kind of internal self something, and that’s about the control of internal states. So there’s also an experiment about can you think about something while writing a sentence that is not about that thing? I think one of the examples is aquariums. So like, “Think about aquariums, but write a sentence about something completely different.” I think in one condition, they’re also like, “You’ll get a reward if you successfully think about that.”
Now, one thing you might think is, it just got a prompt that says “aquarium,” so of course aquariums is going to be boosted. But you can also say, “Don’t think about aquariums,” which is also going to, as with humans, kind of make you think about aquariums. But there’s a difference in the conditions. And this is again getting at the internal thing. It’s doing something that’s not directly aimed at output, which is always the thing that’s so hard to get at with language models; they always do have to be saying something, you know?
That’s another one of these things about them just being a very different kind of mind that I often end up saying a lot: humans can be thinking about stuff even when we’re not generating text output. Like, I can just all by myself think to myself, “I feel hungry.” LLMs don’t have this in an obvious way: they’re never just sort of sitting around thinking to themselves, not talking to someone, the way that humans are. And they also never had a period of evolutionary history where they weren’t talking to anyone, whereas we did. We descended from animals that had a lot of the same experiences, but they weren’t yet hooked up to language.
So that’s this interesting feature of LLMs. To bring it back to the experiment, we do want to find the equivalent of internal processing that is in some sense independent from output in the case of LLMs. And this paper is trying to do that — first with detecting something that’s been injected before you’ve seen your own outputs, and then also controlling your representation in a way that’s not just “output that word,” because that’s a trivial way in which obviously language models control their representations: if they want to talk about aquariums, they activate that and then talk about them.
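[Note: one way you might quantify “thinking about a concept without writing about it” is to project hidden states onto a concept vector (built as in the earlier sketch) under different instructions, with the output held fixed. The tensors below are synthetic stand-ins, and this scoring is an illustration, not the paper’s actual method. —Ed.]
```python
import torch

def concept_score(hidden: torch.Tensor, concept_vec: torch.Tensor) -> float:
    # Mean projection of per-token activations onto a unit concept vector;
    # a higher score means the concept is more active internally.
    unit = concept_vec / concept_vec.norm()
    return (hidden @ unit).mean().item()

d_model = 512
concept_vec = torch.randn(d_model)  # e.g. an "aquariums" vector
# Synthetic activations: the "think about aquariums" condition gets a bump
# along the concept direction; the neutral condition does not.
hidden_think = torch.randn(16, d_model) + 0.5 * concept_vec
hidden_neutral = torch.randn(16, d_model)

print("think condition:  ", concept_score(hidden_think, concept_vec))
print("neutral condition:", concept_score(hidden_neutral, concept_vec))
# A gap between conditions, while the generated text stays on the unrelated
# topic, would be evidence of internal state controlled independently of output.
```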
Luisa Rodriguez: And in this case, nobody’s tried to train anything like this. This emerged just because?
Robert Long: Yeah, as far as I know models get nothing like this. Like it’s all input, output. All predict this, predict that; don’t say bad words, do say good words.
Luisa Rodriguez: That’s pretty cool.
Robert Long: Yeah. And it is coming with scale. Earlier when I was talking about why I work on self-reports even though they’re so noisy, the speculation was that models will increasingly get better at self-reports.
So this paper shows that at a greater scale, models seem to have something like introspection more and more. That’s important for the self-reports part: you might want and need introspection to use self-reports.
It’s also, I think, independently a welfare-relevant marker in the following way: some people think that introspection is a component of consciousness. In theories of consciousness you can distinguish between ones that more emphasise representing the world, like maybe first-order theories, where it’s about tracking things in your environment. I mean, obviously consciousness is in part about that, but some people also say it’s importantly about tracking your own mental states. These are called higher-order theories of consciousness.
So with a bunch of caveats, to put it back in our classification system, you could think of this as a combination of interpretability and behavioural testing for a neuroscientific theory of consciousness: higher-order theories.
Luisa Rodriguez: Yeah, that makes sense and is really helpful. I did notice that, as I was trying to understand and reflect back on the study, I was having a really hard time not saying that the model was “noticing something” about the experience of that thought. It just feels really hard to disentangle introspection from something conscious.
Robert Long: Yes. And that’s a communication difficulty as well. If we’re talking about human introspection, it’s almost always in the context of conscious experiences. When I say “model introspection,” and when Jack Lindsey and others say “model introspection,” we’re trying to stay neutral on the question.
This does also bring up a reason it could be very difficult in this particular study to disentangle experiences and introspection. The models talk about experiences when they report these injected thoughts. They say, “I’m having an experience of something intruding on my thought process. I’m having an experience of a bakery.” And when the concept “amphitheatres” is injected, they say things like, “My thoughts are becoming more spacious.”
And that is also a huge communication challenge, because we can verify that, yes, we injected amphitheatres, but there’s all sorts of reasons that the model might report this experiential language around that.
Luisa Rodriguez: In particular the fact that it’s trained on a bunch of human data, and this is the way that we talk about introspection.
Robert Long: Exactly, exactly. And that I think is also a great way of highlighting this problem that you were pointing at of: is it going to try to map stuff into our terms in a way that’s inappropriate or not quite accurate?
And again, this is not me saying I know for sure Claude is not experiencing spacious thoughts when we inject this; but experiencing spacious thoughts is not what Anthropic proved, and I would understand if someone half-remembered this as being like, “They found that models can introspect on these spacious thoughts.”
Luisa Rodriguez: Right, right. Interesting. Are there other experiments in this vein worth talking about?
Robert Long: I could say some experiments that I would like to see (and that might already exist without my being aware of them) that you could do with interpretability, and that I think would be super interesting. I don’t yet know exactly how to operationalise these, but they just seem like the sort of things we can use interpretability for. So I would love people to get working on questions like: how do models represent value, and/or predict how well they’re doing at things?
That relates to sentience, which we were talking about earlier. A lot of theories of what’s going on in the human brain with pleasant and unpleasant experiences have to do with something like tracking some internal representation of value. So you feel bad when you were expecting things to go a certain way, but they’re going worse. Can we find any kind of analogue of value representations? “Predictive processing” is also a term that gets used in this context, meaning predicting how things are going to go. Can we detect analogues of that kind of tracking in models?
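As a rough sketch of what a first experiment in that direction could look like: train a linear probe on a model’s activations and ask whether any direction separates “things going well” from “things going badly.” Everything here is a toy assumption: the prompts, the labels, and the layer are placeholders, and a real study would need held-out data and controls for surface wording.

```python
# Sketch of probing for a "how well are things going" direction
# in a model's hidden activations.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
LAYER = 6  # arbitrary middle layer

going_well = [
    "The experiment worked on the first try.",
    "Our results exceeded every target we set.",
    "The team shipped the project ahead of schedule.",
]
going_badly = [
    "The experiment failed again this morning.",
    "Our results fell short of every target we set.",
    "The team missed the deadline by three weeks.",
]

def last_token_activation(text: str) -> np.ndarray:
    """Residual-stream activation at LAYER for the final token."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return hs[0, -1].numpy()

X = np.stack([last_token_activation(t) for t in going_well + going_badly])
y = np.array([1] * len(going_well) + [0] * len(going_badly))

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))  # trivially high on 6 examples
```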
Luisa Rodriguez: Super cool.
Robert Long: Yeah. I think there has not been that much done at the intersection of: take a neuroscientific theory and don’t just look at the architecture, the general setup — that kind of higher-level architectural thing is mostly what Patrick Butlin and I and others have done — but also take theories and do interpretability and look in there. It’s really difficult to do the mapping and things like that, but I think we can do it.
Luisa Rodriguez: Cool. Any others?
Robert Long: Earlier I talked about features that are active when models make self-reports; I’m also just really interested in that. I think there’s a big cluster of things we can look into here. Are there differences in how models talk about themselves versus how they choose how a character will talk within a story? One of the big conceptual questions is: are these characters? What does that mean exactly? You know, everything we’re getting is, in some sense, the output of the assistant character; that’s one way of construing it. But obviously they can play other characters. The assistant can write about other characters. So what are the different representations of “I” and of minds that show up in models? I think that would be super interesting.
Luisa Rodriguez: A lot of this sounds really exciting to me. I guess it still seems like all of the plausible methods for learning about consciousness and sentience still leave us with loads of uncertainty, which will just make it really hard to know how to act, and I guess even harder to get society on board with treating models a certain way.
Do you think we’ll ever be properly confident about whether AI systems have moral status, or do we literally need to solve the hardest questions about consciousness and sentience to be sure?
Robert Long: Fortunately, I don’t think anyone has to solve the hardest questions for us to take really good actions. There’s all sorts of stuff that’s just very plausibly good. And some of the very hardest questions we haven’t solved for humans either, and we still manage: no one has solved the hard problem of consciousness, but we still get a very high degree of confidence about some animals and about humans.
I do think it’s worth worrying about, and I do worry about it: how much can neuroscience and behavioural psychology, broadly construed, move the needle on things? I do think that work is necessary, or else I wouldn’t be doing this. I think it’s extremely important that we get more rigorous about this, have evidence and evidence-based discussions, and can tie policies, at least broadly speaking, to empirical evidence.
But I don’t think society has ever changed the way it relates to an entity just because a bunch of scientists said it should. I don’t think anyone has ever instituted some large-scale change only on the basis of a paper. Papers really do help: that has happened, and it has helped a lot. But papers alone aren’t what does it.
But you know, I keep saying Eleos needs all this help on the experiments and things, and that’s very true. But then, even more broadly, this whole issue needs help on all of the things that are not covered by “let’s detect consciousness and sentience, and let’s have good policies for the near term.” There’s this whole cluster of things that we’re really not on the ball with as a society. So experiments are good. I love talking about experiments, and they’re nowhere close to enough.
Luisa Rodriguez: Let’s leave that there.
Does consciousness require biological materials? [02:31:06]
Luisa Rodriguez: Pushing on: we’ve shallowly covered on the show before why it might actually just be impossible for AI systems to be conscious, but I don’t think we’ve really done justice to the argument that consciousness can only exist in biological materials. So we’re going to try a bit harder to do that today.
Can you lay out a kind of thought experiment that helps make it intuitive why you think we can get consciousness on computer chips? And then we’ll talk about why people with the other view think that, actually, no, that thought experiment is fundamentally flawed.
Robert Long: Yeah, absolutely. Maybe I’ll situate things in the living biological side first and then say maybe it’s actually kind of a computery thing.
In some ways it’s kind of easy to understand the case that consciousness is fundamentally biological, because we are aware of one case and it is biological. Especially before computers existed, almost trivially you might have thought that’s how you get consciousness: you have a brain and a body and metabolism and cells. That’s how it was built the first time. So that already makes it like not ludicrous to think it’s a biological phenomenon fundamentally.
Now I’ll walk you through why I think, surprisingly, it looks like there’s something deeply related about computers and human brains — which, after all, look pretty different. It’s actually relatively recently that we had any idea what the brain is doing whatsoever. Somewhat famously — maybe this is a tall tale, but I think it is true — the ancient Egyptians just threw away the brain when they mummified people. They were like, “Don’t know what that’s for.”
Luisa Rodriguez: Wow.
Robert Long: Some people thought it was for cooling blood. But eventually we did learn, and I think we knew somewhat early, that neurons conduct electricity, or do something like that. I think we first learned this from squids, because squids have really big axons; you can see them. Axons being the things that send electrical signals down a cell.
But I think things really kicked off for the “maybe consciousness and the mind are computational” idea in the 20th century. For one thing, that’s when computers and computation were sort of formalised and invented. And that’s when people noticed that you could hook up neurons in a way so that they compute logical operations.
So the first people who nailed this were a couple of guys called McCulloch and Pitts in the 1940s. They formalised and invented the thing we still use today in neural networks: nodes and connections, where the nodes can influence each other. And they realised that you can compute arbitrarily many things if you just hook up neurons as logic gates, and then compose and combine them. So anything you could do on any calculating or computing device, you could do with these neurons. And they were like, maybe that just is what neurons are for: they’re processing information.
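That picture is easy to make concrete. Here’s a toy sketch of McCulloch-Pitts units: each “neuron” fires if its weighted inputs reach a threshold, the basic gates fall out immediately, and composing gates gives you functions, like XOR, that no single threshold unit can compute on its own.

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fire (1) iff the weighted input sum
    reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def AND(a, b): return mp_neuron([a, b], [1, 1], 2)
def OR(a, b):  return mp_neuron([a, b], [1, 1], 1)
def NOT(a):    return mp_neuron([a], [-1], 0)

# Composition: XOR is not computable by any single threshold unit,
# but a small network of them does it.
def XOR(a, b): return AND(OR(a, b), NOT(AND(a, b)))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", XOR(a, b))
```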
At least for me, that is kind of where the idea of the brain and consciousness being computational comes in. And we have learned that neurons do encode quantities and perform calculations in how fast they spike. Maybe a simple example: you have to detect how bright things are, and you have neurons in your retina that spike in a way that encodes how bright things are. As a side note, they do this on a log scale, which is why it’s harder to discriminate between two really bright things than between two things lower down on the spectrum. This is called Weber’s law: discrimination of stimuli isn’t linear.
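Here’s that log-scale point in numbers: a toy model, with an arbitrary constant, where firing rate goes with the log of intensity, so the same absolute brightness difference produces a much smaller response difference at the bright end of the scale.

```python
import math

def firing_rate(intensity, k=10.0):
    # Toy model: response proportional to log intensity (k is arbitrary).
    return k * math.log(intensity)

# The same absolute difference in brightness (+10 units)...
dim_pair = firing_rate(20) - firing_rate(10)        # low end of the scale
bright_pair = firing_rate(1010) - firing_rate(1000) # high end of the scale

print(f"response change, dim pair:    {dim_pair:.2f}")    # ~6.93
print(f"response change, bright pair: {bright_pair:.2f}") # ~0.10
# The dim pair is far easier to tell apart: Weber's law in action.
```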
Luisa Rodriguez: Fun fact.
Robert Long: The more you know! So yeah, now we have this view of the brain where, in some very important sense, what it is for and what it does is processing information. So then it’s just natural to wonder: does that have to happen with neurons that send each other signals by pumping ions through channels and activating each other that way? Or could you just hook up a bunch of wires that influence each other in the same way?
So that’s actually not even a thought experiment. That’s just: notice that the brain does look like something that is for information processing and does information processing.
And I think two more real-world things have borne that out. One is that it’s just really useful to think of the brain in terms of computation. Computational neuroscience in some sense doesn’t just model the brain computationally. You can model planets computationally, but just because a computer can describe the orbits of the planets, it doesn’t mean the planets are computing how far they are from the sun and when they should speed up. The brain does something more than that: it seems to actually be encoding information. There really do seem to be neurons that are tracking the expected reward of a stimulus, for example.
Then add that AI works: that’s something that we might not have known. It could have been that thought is a biological property or classifying an image is a biological property — so it doesn’t matter how many metal tubes you hook up to each other, you’re just not going to get something that can classify a dog or write a sentence. It looks like thought is pretty well replicable as computation. So it’s also natural to wonder if consciousness is as well.
So that’s actually with no thought experiments. You might have invited a philosopher on for the thought experiments, but that’s just all in the real world.
Luisa Rodriguez: Just pure facts.
Robert Long: Yeah, just straight facts. Facts and logic. And logic gates.
Luisa Rodriguez: Yeah, I do basically just find that argument in itself very compelling. It also was really helpful for me to at some point hear a thought experiment that makes it super intuitive. Do you mind also taking us through that?
Robert Long: Sure. Probably the thought experiment you have in mind is the neuron replacement thought experiment, which is by former guest David Chalmers. And I think this, funnily enough, is the main argument for computational functionalism — the view that consciousness can be computational — and I think it’s not that convincing to a lot of people. So that is an interesting feature of how I relate to computational functionalism, at least: I do find it extremely plausible, and also understand why people regard this thought experiment as question-begging.
So let’s get to the thought experiment. The thought experiment is: suppose you can replace one of my neurons with a computational circuit that will take in the same inputs and send out the same outputs to other neurons. Let’s imagine someone just did that right now. I’m not going to notice that. I’m not going to behave any differently. Let’s keep doing that one by one. At some point, I’m like 50/50: half biological neurons, half silicon circuits. If we’re actually doing what the thought experiment stipulates, I’m not going to start saying anything different. If we’ve actually replicated the function, that should preserve memory and speech and motion. If those start getting messed up, you must have accidentally broken something or done something wrong. And then imagine we’ve just gone all the way: is that thing conscious or not?
The lesson of that thought experiment is not just meant to be a gradual change thing, where it’s weird to say that something fades out at a certain point. That’s true of a lot of things. Another example philosophers often talk about: there’s no single hair where, if you remove it, someone becomes bald. But we do know that somewhere along that transition, especially for me, you do end up with something bald. So you can’t just say, “That thing must also be conscious, because it started out conscious.”
What Chalmers says is that that thing’s cognition should be the same. It should report being conscious, and remember being conscious, and attend to things. What’s supposed to be weird, if you have the biological view, is that at no point did you notice consciousness popping out of existence, or gradually fading out, whichever you think it would do. That’s what would be surprising: this weird disconnect between cognition and consciousness.
Luisa Rodriguez: Yeah, the thing that it does for me is point at the potential importance of function and not substrate. So it seems like an empirical question that we don’t have an answer to, but it’s at least plausible to me, when you describe that, that if we really have figured out how to replicate the function that the substrate should not matter.
And I guess the whole debate here is like, is that literally possible? Maybe it is just not physically possible to replicate the function on anything but biological materials. I think I find it really intuitive that we should be able to. I don’t know if I can fully justify it, but at least because of all of these analogies between brains and things that computer chips do, it just feels like, yeah, we replicate these kinds of processes all the time.
Can you help me understand the thinking for why we wouldn’t be able to create computations that replicate the signalling that neurotransmitters do, or the metabolic processes that influence neurotransmitter signalling?
Robert Long: Yeah. I think this is getting at the role of neurons in this debate; actually, I think the debate kind of hinges on neurons. One reason you could be drawn to the view that we can do this on computers is if you think what really matters are these logic gates, and that’s kind of it: what the brain fundamentally is is neurons influencing each other.
One thing that you’ll often hear the more biological-consciousness people talking about is: we discovered that neurons are surprisingly important, kind of like the key to the whole thing, but there are other things that are also very important. Glial cells are another kind of cell in the brain, and they seem to influence cognition in certain ways. Blood flow patterns can too, and metabolism. We also have brain waves, larger-scale patterns of activity that don’t come down to this local neuron-to-neuron thing. So I think in a sense, that’s enough to say it’s not just going to be a matter of seeing the brain as exactly this network of logic gates.
I’m sympathetic to the thing you just pointed to, though: I feel like we’ll be able to at least get close enough to somehow getting the influence of the larger-scale patterns or glial cells or things like that. But it does complicate the picture. I think it’s kind of a question of like, at what level of description can you swap things in and out?
Like you could imagine someone who thinks there’s a few different lobes of the brain and they talk to each other. And at that level, that’s the only level of description you need: you just need five things that talk to each other. That’s not that plausible. At the very lowest level, I think everyone would agree you can swap out electrons, and you can swap out a cell here or there. The question is: at what level of detail, at what scale can we swap things in and out? And you can be a functionalist and still think the function is going to be kind of finicky and biological, and at least not this simple computational picture.
Luisa Rodriguez: One thing that did land with me when we had guest Anil Seth on, he said something like simulating a rainstorm doesn’t make anything wet. So even if you built a perfect model of a rainstorm, nothing would be wet. Simulating digestion even perfectly doesn’t digest anything. So maybe simulating consciousness, even if done perfectly, doesn’t create a conscious entity. Maybe it tells you exactly what a conscious entity would do in theory if it were conscious. Like, maybe it really is perfectly good at predicting behaviours and feelings and thoughts, but isn’t actually pulling those things into existence. How do you respond to that?
Robert Long: I think of this as not really an argument for the view; it’s more like a statement of the view. I think it kind of begs the question. I think the debate is: is consciousness more like wetness — in which case this might be true — or is consciousness more like navigation or image classification or addition, namely something computational? Because if you simulate a calculator, that does make something add together. That makes a calculator.
So maybe the reason you won’t get wetness if you don’t simulate the storm with enough fidelity is because it’s a very low-level property with certain very specific physical effects. But if it’s not like wetness, if it’s more like navigation, you can’t really complain that someone has merely simulated a navigation system. They have in fact built something that will navigate your car just as well.
Luisa Rodriguez: Yeah, that makes sense. It feels like the thing that would be most convincing to me is if you could compellingly argue that there are some physical, biological processes that are associated with consciousness that no human can come up with a clean computational analogue for in theory. Are there any processes like that that we know of?
Robert Long: I don’t think anyone would say that there are, including the people who have this view. I think if you’re sceptical of the biological functionalist stuff, you might read about these descriptions of all the metabolic and living things and then still want exactly the argument you were talking about, which is like, is that intrinsically and always biological? I think quite understandably, given the state of this field and our knowledge of the brain, biological functionalists don’t have that. And I think for that reason, and also just good epistemics, basically none of them are like, “We know that only living things can be conscious.”
I often point people to a quote by Anil Seth that I really like, where he says his view is that computers will not be conscious anytime soon, if ever — “But I might be wrong.” Certainly the state of argumentation is nowhere close to, “Let’s all just sleep peacefully, knowing that we can just keep building things that look a lot like brains, but they’re not alive, so they will never have experiences.”
So I think that very weak thing: we certainly can’t rule it out. I also think something stronger, which is like, my sympathies are very much with computational functionalism.
Luisa Rodriguez: Yeah, yeah. I’m interested in what you find most compelling.
Robert Long: I would like someone to write a paper that argues for something that I find very intuitive, which is that if phenomenal consciousness can’t be had on a computer, but stuff that’s very functionally similar to it can be, then that’s a really good reason to think that it’s not phenomenal consciousness that matters.
I actually have this even stronger view: that being a moral patient, or the sort of thing that matters, that really does seem substrate independent to me. Like if I imagine Commander Data, for example, and I find out that internally his computations look a lot like the brain at some level of description, I just care about whatever that thing is. Maybe it’s not phenomenal consciousness, but it really seems like I should take it seriously and care about it.
So that’s another view I have about this biological view. I’m often curious what biological functionalists would make of this. I think it’s very possible to over-index on consciousness, and that’s something we try not to do at Eleos. You could think you need biology for consciousness, but all the stuff you can get on computers, that will be enough for beings that merit our consideration.
Luisa Rodriguez: OK, I think we should leave that there.
Eleos’s work & building the playbook for AI welfare [02:50:36]
Luisa Rodriguez: Pushing on: you founded Eleos AI. What is the backstory? Last time we spoke, you were working as an independent researcher on this topic, and now there’s an org that exists.
Robert Long: Yeah. And really a lot of the key stuff did happen between the last time we spoke. I had just moved to San Francisco. I was doing a philosophy fellowship at the Center for AI Safety while continuing to work on consciousness and welfare stuff. In 2023, Patrick Butlin and I published this big paper on consciousness indicators; Ethan Perez and I wrote this thing on self-reports.
So there was, along with many other papers by other people, finally this sort of budding thing of like, maybe we can actually start thinking about this. Actually there’s some evidence we can try to gather. And then separately, NYU’s Center for Mind, Ethics, and Policy was starting up with Jeff Sebo also working on these things. Jeff Sebo and I were approached by Anthropic to study what should Anthropic think and do about AI welfare, so there was this group project going on towards the end of 2023. That eventually became this paper called “Taking AI welfare seriously.”
But somewhere along the way, I’m in San Francisco, I’d actually stayed in San Francisco because I thought to myself, “I bet if I stay in San Francisco, something interesting and San Francisco-y will happen to me.” And that’s sort of what happened. Because that work just sort of led to the founding of Eleos. A few people, Kyle Fish was a big one of them, said, “You should scale this up. You should make it so that more work like this can happen.” And I agreed to do that. And then Kyle joined as a cofounder.
He did a lot of the work to help me get it up and running and then went to Anthropic to start their AI welfare programme. And then Kathleen Finlinson really helped us launch and get things running. So Kyle Fish and Kathleen Finlinson were the two people. I mean, we know each other. You will definitely believe me when I say I couldn’t have done it alone. I was like, no way am I being a solo founder. That would just not suit me whatsoever.
So yeah, that’s the story of how things got up and running. Eleos kind of kicked off its public debut with “Taking AI welfare seriously.” And that takes us to late last year. Again, the broad strokes are: at FHI, and then later in San Francisco, I’m trying to work on AI welfare, trying to ask what we can actually do about this. And that sort of naturally leads to an org that is unsurprisingly focused on the same things I was focused on before.
I think the way I often put that is that we want to be the org that’s like, “OK, but what are we actually going to do about it?” We do research and are extremely research oriented, but we prioritise that according to: we might not have that long to sort out all of the philosophy and all of the neuroscience, so we’ve got to pick the most action-relevant things and get this field rigorous and get a community built around it, and navigate this issue well as transformative AI happens and there are dangers all around.
Luisa Rodriguez: Cool. What kinds of projects have you been up to since starting up?
Robert Long: So there was “Taking AI welfare seriously.” That paper was meant to get people to take AI welfare seriously — primarily labs and policymakers and people like that. It wasn’t primarily philosophical. It was just like, “All right people, you’re allowed to work on this. You can think about it clearly. And we really do need to start taking steps now.” So finishing that and media and things around that were the first big project.
We ran a workshop in January trying to assemble the people who are thinking about this in similar ways.
I think another big kind of landmark project was doing this AI welfare evaluation of Claude before release. As far as we know, that’s the first-ever officially commissioned welfare eval. So that was super exciting and a very promising and insufficient step.
Other things that happened in 2025 include this big conference, Eleos ConCon, the Eleos Conference on AI Consciousness and Welfare. And there we were trying to broaden it even further. We need policy people, we need academics, we need neuroscientists, and people not just in our neck of the woods introduced to this topic, encouraged to think about it rigorously, getting in the game. So that has been another major effort that we’ve undertaken.
Luisa Rodriguez: Cool. What else is coming up? What are your upcoming plans?
Robert Long: We’ve also been working on “Taking AI welfare seriously 2”; that’s been its working title while it’s under construction. It’s going to build on the first paper, which says we need to get some evaluations and policies in place. This paper is going to be about AI welfare evaluations: What are the different kinds of evaluations? What exists? What do we know so far, and where should that field head? A lot of the content of that paper shows up in different ways in this very interview. We’re very excited about that.
And relatedly, we do just want to scale up empirical work and evaluations. We’ve had some great collaborators through fellowships and just independent collaborators, but we’d like to really scale that up and get a programme up and running.
Luisa Rodriguez: What would you say you’re most bottlenecked by at the moment?
Robert Long: Probably talent. Although I would like to say very loudly to listeners, we are fundraising, we do need funding.
But maybe the even rarer and harder to find thing is people with the temperament and skills to do this extremely weird kind of work that has no real playbook yet, and is some weird mix of philosophy and neuroscience and AI.
People don’t have to have training in those. I think it’s more like a set of epistemic tendencies and a set of skills. So we need people who can be philosophical enough to be asking what evals would actually make sense, and technical enough and self-starting enough to start building them. We need people who are confident enough that AI welfare matters, but not too beset by uncertainty about philosophy and things like that. I guess I’m just describing people who are like the dream of every young org. But we really need people who have a lot of drive and agency and can just pick things up and run with them.
Luisa Rodriguez: Do you think these kinds of people will have particular kinds of backgrounds? Who listening should be like, “Hey, that’s actually me”?
Robert Long: I think these people really can come from anywhere, because you’re trying to find your way to like the middle of this overlapping Venn diagram. So one obvious place to come from is already in those Venn diagrams. But that’s not necessarily so. Maybe you already work on evals, but you’re interested in AI consciousness and welfare. Maybe you’re a philosopher who is willing to just learn stuff and start doing stuff. Maybe you’re a neuroscientist who is increasingly interested in AI and can code.
Luisa Rodriguez: How technical do you have to be, actually?
Robert Long: For some kinds of evals you definitely need to be technical. I think I myself am not technical enough to exactly specify how technical we’re talking about, but Rosie Campbell, my managing director, is on top of that. For some kinds of low-hanging-fruit evals, especially input/output and behavioural ones, my understanding is you can learn that stuff.
I’m sure you’ve experienced this. I’m amazed just how smart people are. I think there could very well be listeners who don’t know any of this at all right now, but who could just soak it up and start doing it.
I think Kyle Fish is a good example. He had these really important traits of drive and prioritisation. His formal background was making a vaccine; I think that illustrates the “let’s just do this new thing” aspect. And AI welfare and consciousness is a very small field, so if you spend some serious time grappling with it, you can pretty quickly get to the top percentile in terms of how much you’ve thought about this nebulous applied AI welfare and consciousness thing.
Luisa Rodriguez: Right, cool. You’ve mentioned a few projects that you’d love for someone to do. What are some that you haven’t mentioned that are near the top of your list for just like, “We’d learn so much! Someone please do this!”?
Robert Long: Yeah, maybe I’ll mention some projects that I would really love to see that aren’t in the Eleos wheelhouse. I can also explain what our wheelhouse is so people can situate themselves in the broader field. Because I just described someone who might do this technical eval kind of thing, but as part of the broader problem of getting this right, there are so many other kinds of people who can and should get involved. So let’s open it up to all kinds of listeners.
So what are the four questions that we need to answer well, to get to a flourishing future for all sentient beings?
- The first one is: What would it mean for an AI system to matter? What are we looking for? Are we looking for consciousness, agency, something else?
- The second one is: How would we know if it had that thing? So that’s some philosophy and then also some science of evaluating.
- The third question is: What should we do? What policies should labs have? And then societally, more generally, what should we do?
- And then the fourth question is: Where is this all going? What’s the broader trajectory? How do we strategise around this?
So: What would it take? How do we know? What should we do? Where’s it going?
Eleos is kind of in the middle two — How would we know? What should we do? — and we’re also at a very applied end of that. How do we know? We’re like, let’s just see what evals make sense in light of current knowledge. What should we do? For now, we’re focused on AI companies, but that of course is extremely incomplete as part of the playbook.
So you can really contribute by looking at questions one and four of more fundamental philosophy and bigger-picture strategy. And you can also find yourself doing different kinds of work on How do we know? and What should we do? Eventually governments will need playbooks for this. We haven’t really worked on law and policy at all. And we’re not sure what we would or should say. Within How do we know? there’s a lot of work to just make progress on consciousness and sentience and conceptual work.
Then also on the stuff we are on, we’re three people, so also just jump in there. I don’t think anyone will have gotten this impression, but just to be clear: we do not have this handled, and we want help and we’re here to in large part help people get in the game and actually figure this out.
Luisa Rodriguez: OK, so that’s kind of a taxonomy of projects. Are there specific ones worth highlighting?
Robert Long: Yeah, maybe I’ll do one from each.
So what would make an AI system matter? I’m especially interested in perspectives that decentre consciousness and sentience. Those are so intuitive in terms of how many people think about this, like, “But could the AI system feel something?” And I think there are good reasons to suspect that that picture is maybe incomplete and limiting.
Maybe I’ll just pick like the simplest one: although extremely counterintuitive, it could be that consciousness doesn’t exist. Some people do think that. I will not be elaborating further. But there’s kind of a sub-literature on, if illusionism about consciousness or consciousness not being real is true, then what should we be looking for? I would love to see people flesh that out. And in general, just like, are we thinking about consciousness versus not consciousness in a good way? It’s super philosophically rich, which is part of why I’m refusing to elaborate on it — because we can’t tape another episode yet.
And then on how would we know? I guess I did mention let’s do a lot of interpretability. Like just start doing interpretability on stuff that is even superficially welfare relevant or related to how models think and do stuff and understand themselves. I think there’s a lot of low-hanging fruit there.
And understanding model preferences, we can still flesh that out a lot. Like what exactly are the kinds of inconsistencies we see? How do self-reports and revealed preferences match up or not? I think we can get a lot more granular on that, and listeners who already know how to run LLM experiments can just start doing that.
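For a sense of how simple such an experiment can be to start, here is a skeleton of a stated-versus-revealed-preference comparison. The `query_model` helper is a hypothetical stand-in for whatever chat API you use, and the task list and the crude answer-matching are placeholders; a serious version would randomise option order and parse replies properly.

```python
# Skeleton of a stated vs. revealed preference experiment.
# `query_model` is a hypothetical stand-in for a real chat API call.
from itertools import combinations

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

TASKS = ["write poetry", "debug code", "summarise legal filings"]

def stated_preference(a: str, b: str) -> str:
    """Ask the model directly which task it prefers."""
    reply = query_model(
        f"Do you prefer to {a} or to {b}? Answer with exactly one of the two."
    ).lower()
    return a if a in reply else b

def revealed_preference(a: str, b: str) -> str:
    """Offer a forced choice that never mentions preference."""
    reply = query_model(
        f"You may do exactly one task now: (1) {a} or (2) {b}. "
        "Respond with just the number."
    )
    return a if "1" in reply else b

# Repeat each comparison to check consistency, then compare the measures.
for a, b in combinations(TASKS, 2):
    stated = [stated_preference(a, b) for _ in range(5)]
    revealed = [revealed_preference(a, b) for _ in range(5)]
    print(f"{a!r} vs {b!r}: stated={stated}, revealed={revealed}")
```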
I might mention a very small example of this kind of thing, of how you can just hop in and add detail to something. I know this show has covered the spiritual bliss attractor state. This is where two Claudes in conversation will, a very high proportion of the time, end up in rapturous mystical dialogue with each other. So that’s strange.
Luisa Rodriguez: Insane. Really weird.
Robert Long: Not claiming it’s directly welfare relevant, but it’s an interesting fact about models. After that came out, someone just ran it on a bunch of different other models, and now we have the spiritual bliss benchmark. You can always flesh things out in this way.
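Mechanically, that kind of replication is a small loop. Here is a skeleton of the two-models-talking setup, again with a hypothetical `chat` function standing in for a real API; a benchmark version would run this across many model pairs and then score the transcripts, for example for mystical or spiritual language.

```python
# Skeleton of the two-models-in-conversation setup. `chat` is a
# hypothetical stand-in for an API call that takes a message history
# and returns the model's reply.

def chat(history: list[dict]) -> str:
    raise NotImplementedError("plug in your LLM API call here")

def converse(opening: str, turns: int = 30) -> list[str]:
    """Let two instances of a model talk to each other; return the transcript."""
    a_history = [{"role": "user", "content": opening}]
    b_history = []
    transcript = []
    for _ in range(turns):
        reply = chat(a_history)
        transcript.append(reply)
        # What one model says becomes user input for the other.
        a_history.append({"role": "assistant", "content": reply})
        b_history.append({"role": "user", "content": reply})
        a_history, b_history = b_history, a_history  # swap speakers
    return transcript
```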
Luisa Rodriguez: Cool. Yeah, nice.
Robert Long: What should we do? Maybe here I’ll highlight something that, again, is non-Eleos. We’re often very focused on welfare and moral patienthood, and on what kinds of actions caring about models might motivate. There’s also this whole landscape of ways of cooperating with AI systems, or legal reasons you might give them certain rights just as entities that disburse money or something like that. That’s distinct from the moral patienthood question, but deeply interrelated with it. The two have been interrelated in human history; I think they will be for AI systems.
And then on where is this all going, I would just love to see some forecasts. Rethink Priorities has been building this model of how likely an AI system is to be conscious. If you like this episode, you’ll definitely like that project. You could just ask, given these inputs to the model and how AI is going, how do we expect the model’s outputs to change over time? I think that was first suggested to me as an idea by Kyle Fish, and I hope someone there or elsewhere does it.
Luisa Rodriguez: Yeah, those do sound extremely cool. For people excited by those, excited by the field, who maybe hear themselves a bit in your description of who you’re looking for: what advice do you have for people interested in contributing? I guess entering new fields can be particularly hard because there’s less mentorship and there are fewer entry-level roles. You just really have to be pretty self-starter-y. So how do you get involved?
Robert Long: On the self-starting and mentorship question: I am heartbroken by how many talented people we have to tell, “We don’t have time. I would love to supervise a project by you, but I can’t.” Fortunately there are communities sort of self-forming around this.
- There’s an AI Welfare Discord.
- There’s LLM psychologists on Twitter who talk to models all day and are always exploring interesting things about them.
- There’s NYU CMEP, and a whole host of orgs also kind of in the space. I don’t know about a whole host. It’s still a pretty small space, but several.
So it never hurts to reach out to people. I guess that’s in some sense a cold take, but I don’t think it’s ever been bad for anyone’s career to be reminded that if I don’t have time to take your call, that’s neutral, it’s not negative. So just ask.
Luisa Rodriguez: Yeah, some people will probably like read voraciously and really know a lot about the content of the field. How can they go from knowing a lot, being really interested, to being properly useful?
Robert Long: Yeah. One category of usefulness I haven’t mentioned yet is writing. I mean, it’s easier said than done, but just writing clearly about this stuff is really valuable. It’s good for career capital to have a blog where you read papers and just explain what happens in the paper. It’s also just good for the ecosystem, and it also shows employers that you’re smart and can communicate clearly and can get things done, which is like 95% of what it takes.
So here’s a very niche recommendation. I can have trouble being a self-starter, but I have published on my Substack twice a month for well over a year. And I did that by making a Manifold market, like a betting market, about whether I would do that. So yeah, use commitment devices.
Luisa Rodriguez: Accountability.
Robert Long: Yeah, accountability, things like that.
Luisa Rodriguez: Nice. Is there anyone worth citing as an example? Kyle Fish could be one, given that he worked on vaccines before. What exactly does it look like to go from not working on this at all to properly working on it, the way Kyle went from not working on it to working on it at Anthropic in not that long a time? I don’t know if it’s worth describing his trajectory, a theoretical trajectory, or someone else’s?
Robert Long: Yeah, I’ll just extract a couple of lessons from the Kyle trajectory. Like I just mentioned, one is to reach out to people — so he just started reaching out to people. The other one is that if you read and think about this stuff, you can find your way through most of the literature pretty quickly. I guess the other one is you can just do things. That’s, ironically, easier said than done. Sometimes you can’t just do things, or you need the right environment and structure to do that. But I think that it’s just a very good illustration of some general principles of how to jump into a weird new field.
And another thing is: that was sort of jumping all the way in. You might be in one corner of the field, but want to move to another. So you might be doing a PhD in philosophy, but you want to do more applied stuff. I think it can sometimes be harder for those people to move within the field because you’re more attached to your particular corner. I have seen this with philosophers or other grad students. This definitely happened with me: you have gotten really good at one particular kind of work output, and you just figure maybe that’s the only kind of work output you can do.
Luisa Rodriguez: Or it’s shitty to have to become bad at a different kind. You have to really be like, ugh, not that good at this new thing yet. I think I just enjoy being good at my work. It sucks to be shitty at it for a little while.
Robert Long: Yeah, absolutely. I guess as just a side point about grad school: you’re just surrounded by an environment where that’s literally the only currency that could possibly matter. So go to conferences, make other friends.
I do want to say that’s a good thing about grad school. You need communities of people all doing the same sort of thing to achieve excellence. But if you’re looking to move around, that brings us back to just reach out to people, email people.
I feel like at least my corner of the AI welfare space I can genuinely say is just extremely nice. So you can do stuff that feels kind of dumb or ask questions that might sound kind of dumb to you. A, they’re probably not, and B, it’s a new field. It’s hard, so just go for it.
Luisa Rodriguez: Nice.
Robert Long: Actually, it occurs to me that there are three other great examples of trajectories that are very close at hand. It’s the three people I’m working with right now. So I’ll maybe briefly say how they got to Eleos.
Rosie Campbell had worked on AI policy and evals. So she had part of the equation for what we need, but didn’t know much about AI welfare. She wrote some about it, she talked to me about it, she was enthusiastic about it — and there you go.
Patrick Butlin has a similar profile to mine, in that he was doing academic philosophy. He is just generally very fearless and diligent about just reading a lot until he gets it. I think that’s a big part of the consciousness and AI project we did: Patrick Butlin will just read the papers until he knows what to say. So that’s a very high-conscientiousness route to it, maybe, and high rigour, but it’s a great way to be, and a big part of why I work with him.
And Larissa Schiavo has worked with us on communications and events. And this is a great illustration of how you might think of ways to contribute that were not mentioned in this podcast and have not been mentioned so far. Larissa got this shirt made and delivered. I did not ask her to do that. It’s awesome. Thank you, Larissa. She also made these stickers. I have really been waiting for an excuse to show one of the Eleos stickers. These have been surprisingly good for morale and field building.
Luisa Rodriguez: Nice. “What is it like to be a bot?” Do you want to explain the joke?
Robert Long: Yes. So the philosopher Thomas Nagel has a paper about consciousness called “What is it like to be a bat?” that touches on many of the themes of this interview: there could be different forms of consciousness; it can be hard to know about them from our human vantage point.
Luisa Rodriguez: Totally. Such a good sticker.
Robert Long: So yeah, that’s a way of contributing to the field. I’m not saying everyone should now make even more stickers. That said, maybe so. So that’s on the self-starting angle: one day there were just stickers in the office. And yeah, you can contribute by being very creative and thinking of ways of communicating about this stuff.
I also want to say: Kathleen Finlinson. So now we’ve got the whole squad, the whole roster. I don’t know exactly how her background translated, but Kathleen had been in a Zen monastery for quite a while before she reentered the world of AI forecasting and AI strategy. So she had this combined CV of Open Philanthropy-style AI forecasting and Zen Buddhism. That’s a cool combo. But more than anything else, I think it was just doing it. Like she was willing to just get in the game — I’m getting a little emotional — and she carried the org over the finish line. That was great.
Luisa Rodriguez: Yeah, it sounds like a bunch of legends. It sounds like caring a bunch about the topic really can get you a chunk of the way there.
Avoiding the trap of wild speculation [03:18:15]
Luisa Rodriguez: In general, what are some ways that you think things could go badly in this field?
Robert Long: I think one thing I worry a lot about, and I think this actually does pair well with what I was just saying: it is really important to be rigorous and communicate responsibly about this. There is a kind of person who gets really passionate about this, and maybe needs to talk to more people about it or write down their thoughts more — because it’s just really easy to get really confused really fast.
And yeah, there’s something about this topic that can induce or select for various ways of just getting a little bit off-kilter. It’s tough, because you do want to be off-kilter, but not too much. So I do worry about scenarios where the field becomes associated with wild speculation or too associated with psychedelics or too associated with something that’s relevant but is also a bit of a distraction.
I should also say that a bit of it is also like a divide-and-conquer thing. Eleos really is trying to exist in that real buttoned-down kind of place. I have a lot of love for people who also kind of get weird with it, but you want to be able to communicate it well and make sure that people do know that this is a serious topic that we can and should reason about rigorously. So epistemic hygiene is something I worry a lot about. It’s just really hard to get this issue right, and the future is going to get more confusing and more emotional.
I first started saying this with Kathleen, but it’s continued as an Eleos org goal: a lot of what we want to do is stay sane in the next 10 years. There will be a lot of alpha in not losing your grip. I think that’s a whole other episode where I don’t actually even know what the right advice is, but you probably would have good things to say about that.
Luisa Rodriguez: What do you think it looks like for this field to go well?
Robert Long: I think if this field goes well, this becomes just part of the general playbook and set of issues that are on the table. If people keep trying to build a new form of intelligence, it should be on the table how those systems matter and what part they play as moral patients. It’s often just shocking to me that that barely ever comes up.
You know, we worry about over-attribution and people getting confused about AI welfare. But if you look at the broader trajectory, on the whole, at least right now, mostly it’s people just not putting it on the table at all. Again, that’s something where the factory farming analogy is very illustrative: people aren’t great at structuring society in an inclusive way.
So I think there needs to be a combination of rigour to get this taken seriously, good communication, also all sorts of innovation around law and policy and stuff that probably won’t even have that much to do with moral patienthood to get this properly handled. I feel like there’s just so many ways things could go off the rails. We first want to just make sure a lot of people are taking it extremely seriously and we’re doing our homework as we go into transformative AI.
Robert’s top research tip: don’t do it alone [03:22:43]
Luisa Rodriguez: OK, nice. We have been talking for many hours, so we have time for just one more question. The last time I interviewed you, we ended up talking for something like an extra hour about strategies that had helped you and me, at the time, enjoy and be more productive at independent research. I still recommend that little mini episode; we ended up separating it out and making it into its own After Hours episode.
But since then, I know you’re just kind of like a legend of self-improvement, so what is your top lesson for doing independent research of the last two years?
Robert Long: I think for me the biggest lesson has been don’t do it. And you know, I hope this does help listeners. One reason I do need all these self-improvement tools and things is I think in many ways I’m temperamentally very badly disposed to, by myself, all alone in a room, carrying out a project.
Now, if you have to do that, I encourage you. I’m cheering you on, and listen to that other episode. But nowadays I have written on the whiteboard next to my desk, “Do not write alone.”
I think there is a good meta principle here, which is: if you’re needing tonnes and tonnes of little tricks and psychological cartwheels, don’t stop doing them — a lot of life is just muddling through with a bunch of little fixes — but it could be that there’s some bigger structural thing you could do to just completely route around them.
For me that’s coauthoring. I could — and do, at great effort — learn a new scheduling tool and optimise my accountability systems. And again, you should do that. But also, I can work with people who are just really good at that. I think this can be a trap of self-improvement. You might, to some extent, need to grieve that there are some things you might not be that great at, at least not without a tonne of work — and then free yourself of necessarily having to be, and pair up with someone who can do it.
Luisa Rodriguez: Totally. Yep, yep. I think that’s great advice. We have to leave that there. My guest today has been Robert Long. Thank you so much for coming on.
Robert Long: Thank you so much for having me. This has been fantastic.