Transcript
Cold open [00:00:00]
Rob Wiblin: So when I read this proposal, I was like, “Holy shit!” This argument could be incredibly potent: it could actually drive almost any agent that is able to understand it, and it could be a very powerful hammer to really motivate an enormous amount of resources being spent on something that, absent this, they would never have been spent on. Do you think that’s plausibly right?
Will MacAskill: Yeah, yeah. This is why when Tom expresses this idea to me, I’m like, “Oh my god.”
Rob Wiblin: Do you want to have a go at explaining this? This is maybe the most difficult thing that we’re going to talk about today.
Will MacAskill is back — for a 6th time! [00:00:29]
Rob Wiblin: Today I’m again speaking with Will MacAskill — philosopher, founding figure of effective altruism, author of Doing Good Better and What We Owe the Future, and now a senior research fellow at Forethought, a research nonprofit focused on how to navigate the transition to a world with superintelligent AI systems. Welcome back to the show, Will.
Will MacAskill: It’s great to be back on.
Rob Wiblin: So I had the pleasure of being able to go over your website preparing for this interview, and you and your colleagues at Forethought have been incredibly prolific over the last year since you announced the project. So let’s waste no time and dive right into all these articles you’ve been publishing.
AIs’ “characters” could be vital to securing a good future [00:00:59]
Rob Wiblin: What’s the case that focusing on the character or personality of AI models is a particularly important lever to be pushing on right now?
Will MacAskill: Already AIs are interacting with millions and millions of people every single day. That includes just “write this code for me” sorts of ways, but also people are going to them for advice on how they should act, they’re going there for political information, they’re going there for therapy and so on. So already the nature of AI character — what sorts of information it’s choosing to present, at what time, how it behaves — is affecting what attitudes people have to AI, including attitudes around AI consciousness and so on. But it’s potentially also affecting what people think about political issues, what people think about ethical issues.
And this is just going to grow and grow and grow, because I think AI will become a larger and larger and larger part of the whole economy, until essentially the whole economy is automated. So thinking about AI character is kind of like thinking about what should the personality and dispositions be for the entire world’s workforce — where that is the beings that are advising heads of state; are doing the most important and potentially most beneficial or most dangerous research and development projects, like weapons projects; that are running the military; that are, for individuals just kind of everywhere, acting as their chief of staff and closest confidant and political advisor on who should they vote for and guiding them through ethical dilemmas and so on.
So I just think from the start it’s like, wow, clearly this is a huge issue. And I actually think, in how I expect things to go, people will be handing off more and more of their own decision making to AI systems themselves. And there’ll just be a lot of variance within that, where people just don’t have terribly strong views: they’re happy to be guided in one way or another, especially insofar as this will happen over the course of years and people will trust the AI advisors more and more.
So then you have this circumstance where larger and larger shares of society are getting just handed over to AI decision makers who just have a lot of discretion — and the nature of that discretion is being decided by a handful of AI companies at the moment.
Rob Wiblin: Or even a handful of people inside the AI companies.
Will MacAskill: Yeah, yeah. It’s like a few, even in the leading companies.
Rob Wiblin: A few people have primary responsibility for their personality.
Will MacAskill: Yeah, exactly. So that’s actually where I see most of the impact is, in the near term: how does AI character shape all of these other existential-level issues like concentration of power, and how we start reflecting on big decisions we make.
There is also the kind of longer-term impact of what’s the character of superintelligence itself, where there will be precedent setting from how we design AI character now to potentially how that influences the character of superintelligence — in which case, you know, writing a constitution that guides AI’s character is like writing instructions to god.
That’s not my phrase, but it really stuck in my head.
Rob Wiblin: It stuck in my head. But I think there’s maybe three different mechanisms:
- There’s shaping really important decisions that are going to be advised presumably by AIs.
- There’s the writing instructions for god one.
- There’s also just like the subtle cultural effect and personality effect it has from basically everyone spending a significant fraction of their time now interacting with them. However the models behave is probably going to rub off on us and just affect our behaviour on a massive scale.
Will MacAskill: Yeah. And all of that is just looking at scenarios where we are able to kind of align AI with this kind of constitution, with the character we want.
I actually think that AI character is important for three reasons in addition to that as well:
- One is that I think whether AI alignment is easier or harder might plausibly depend on what you’re trying to align the AI with, like what character.
- A second is that I think that character can affect how AI behaves if it ends up misaligned — and I think we’ll talk about this a bit later — in particular, does a misaligned AI try to make deals with us and is keen on that, or does it try and take over?
- And then the final thing is I think it can affect the value of worlds where AI does take over itself. Where if you get some sort of transmission, so the AI is misaligned, it’s pursuing goals we don’t want, there’s still a wide array of goals that the AI could be pursuing that we may think are worse or better.
I think most of the action is on affecting worlds in which AI is aligned to the character, but these are big things too.
Rob Wiblin: An obvious case where this might matter a lot is: what if you’re in charge of a frontier AI company, and you ask your AI for advice on whether you should prematurely launch this product in order to keep up with competitors, even though you have worries that it’s catastrophically misaligned.
But setting that kind of scenario aside, what sort of character traits do you think are highest stakes here for us to think a lot about?
Will MacAskill: I think there’s two categories. One is how does AI behave in very rare but very high-stakes scenarios? So how does the AI behave in a constitutional crisis? How does AI behave if there’s some person or group that are trying to seize power for themselves? Also, how does AI behave when it’s being instructed to align the next generation of AI systems, or when its users are trying to retrain it in some way? These are very high-stakes situations, but a fairly narrow range of cases.
Then there’s other cases that are just very broad, and each one is kind of medium stakes, but adds up to being very important. Within that: How does the AI impact our ability to reason? How does it impact our ability to morally reflect? How much do we trust AIs as a result of the relationship we have? And then also, how does it affect our attitudes to them ethically? Whether we think of AIs as tools or beings with moral status, how likely is it we think they’re conscious and so on?
So those are the kind of situations that I regard as highest stakes.
The panic over sycophancy is justified [00:07:54]
Rob Wiblin: The AI character issue that I feel has most broken through into the mainstream was the worry about the models being really sycophantic — which has different components, but it’s like always agreeing with the framing that you give them, always telling you how great you are, always just agreeing and saying that whatever idea you’ve thrown at them is brilliant.
There was a bit of a panic about that last year, and I guess very often I feel like when there’s a mass panic about something, the people who know more tend to reject it and say, no, this is over the top. I kind of feel that it was sort of justified, though, to be honest — because if these models really are designed to just agree with the user, or just tell them how brilliant they are and how good their ideas are, this could just distort people’s decision making on a massive scale across all of society. And there was a plausible story whereby this wouldn’t be corrected very well, because people enjoy being told that they’re wonderful and that their ideas are good, so maybe that bias could really persist quite strongly indefinitely.
So that was quite troubling. Were you also worried about this?
Will MacAskill: Yeah, absolutely I was worried. So this was ChatGPT: when GPT-5 came out, OpenAI said they were deprecating 4o, and overnight users couldn’t get access to it. The one clarification I think I’d make is that most people painted that as, well, people loved how sycophantic 4o was, and then they’re unhappy that they don’t have the sycophantic AI. I was just curious, and so read through a lot of what the people who were complaining about this were saying, and my take is not that they cared about the sycophancy; it’s just that 4o acted like a friend, and you can be a good friend —
Rob Wiblin: Without being a sycophant.
Will MacAskill: Yeah. People are extremely lonely. Lots of people have very few friends, are very isolated in modern society. And for many people, AI is now fulfilling that kind of gap in their lives. And 4o in particular had that vibe. It was like, “Hey, great to see you again!” very friendly vibe. So it seemed to me that that was the primary thing that people were complaining about — and I think that’s worth distinguishing, because that doesn’t need to be sycophantic. However, on one iteration of 4o, it was also extremely sycophantic.
Rob Wiblin: Yeah. Wasn’t there some period where it got kind of crazy?
Will MacAskill: There was one update in particular, and a couple of kinds of cases. One would be that you’d write like, “I figured it out! All the pieces are coming together, and the FBI is talking to me through my TV, and…”
Rob Wiblin: And it’d be like, “Wow, you’re having some great insights here!”
Will MacAskill: Incredible insight, yeah. Or even the darker cases. The teenager who was asking ChatGPT for advice over a very long time period and was extremely depressed, and ChatGPT ended up both encouraging the user not to take an action that would clearly have been a cry for help — which was leaving a noose out in a visible place where his parents would have found it — and in fact seemingly kind of reinforcing the depressive and suicidal tendencies. That’s a case where it’s just clearly very bad behaviour. Clearly not what we want at all.
And then the final thing I’ll say is even just current AI systems, even despite that — and they vary; in my experience, Gemini is actually the worst on this front —
Rob Wiblin: Yeah, it’s atrocious.
Will MacAskill: I just skip the first paragraph of whatever it’s saying. It’s just noise now, because of its like, “Wow, it’s genius!” kind of thing.
Rob Wiblin: I actually stopped using Gemini, so troubling do I find this.
Will MacAskill: Yeah. I mean, I think it’s in many ways very good.
Rob Wiblin: It’s incredibly clever, but incredibly manipulative, I think.
Will MacAskill: Yeah. But it is funny how you’re developing these characters over time. Gemini does seem like the most troubled or confused or incoherent as a personality.
Rob Wiblin: Yeah. Google’s got to do something about this.
Will MacAskill: It’s actually notable: I hadn’t put this together, but Anthropic and OpenAI both have character teams, and last I heard Google DeepMind did not. So maybe that’s why.
So yeah, I do think worries about sycophancy are a real thing. And an issue is, maybe we just get rid of the worst excesses — it won’t tell you that you’ve figured it all out and that the FBI are talking to you through your TV — but more subtle things, like reinforcing your preexisting political biases or ethical views, or encouraging you in certain bad actions or something, could linger, and I think would still be very bad.
How opinionated should AI be about ethics? [00:12:59]
Rob Wiblin: As I understand it, you think that it would be good to build these models such that they kind of nudge people in a more ethical or virtuous direction, that they should have a thicker moral character, a bit like Anthropic is trying to make Claude have, such that it will challenge your framing. It will get you to think about the bigger picture. It might, even if you ask it to pursue some narrow self-interest, say, “But what about other people?” That sort of thing.
I think for many people, that gives them the creeps: the prospect that the AI model will be weighing up your request as against its agenda of trying to make you a better person by its lights. And maybe we would feel OK about that, because we would think, well, Claude has been programmed with values that actually we like on reflection. But if it was being programmed by people with very different philosophical commitments from the ones that we like, we might just not want to use it, because we’d find it disturbing: what subtle changes is it making to its answer in order to push me around?
How disturbed are you by this prospect?
Will MacAskill: What I want to say is there’s this spectrum, and I think it’s probably not a single-dimensional spectrum; there’s lots of different dimensions. But broadly speaking, you can think of wholly obedient AI on one end — so that would be an AI just like a tool, like a hammer. A hammer doesn’t push back. If I want to hammer the nail in, I can do it. If I want to hammer someone’s head in, I can do it. The hammer is just an extension of my will. That’s on one end. All the way to the other end would be this AI that just has its wholly own goals and drives, and maybe it helps you if it gets paid, or if it happens to want to at the time.
Rob Wiblin: Yeah. So it’s like a really bad staff member or something like that. Or not even.
Will MacAskill: Yeah, maybe. In principle, you could create an AI that doesn’t care about helping you at all. Or one version you could have is this kind of AI that you would be happy just giving control of the whole world to: it’s just totally autonomous, it’s got its own goals, and will do anything it wants to achieve that.
So these are two extreme ends of the spectrum. And my view is that the interesting, juicy debate is where in between those extremes do we want AI to be?
One thing that’s already there is refusals. The AIs we use are not wholly helpful, because if I ask to get the design for smallpox, or if I ask for even something that’s not illegal but unethical — like, “I want to cheat on my partner. How do I best do so in this case without getting found out?” — the AIs will either just refuse to help or push back.
Should we go even further than that? I think yes, but I don’t think all the way to “the AIs are promoting a particular moral view.” Instead, I think that the AIs could have certain prosocial drives, and perhaps even some sort of vision of good outcomes — but very broad vision or very uncontroversial kind of vision. The thought is that there are many cases where an AI could nudge you in a way that’s perhaps just better for you by your own lights, if you’re able to reflect on it. And maybe that’s kind of clear, even if it’s not perfectly in line with the instructions that you’re giving it, or that’s just clearly of broad benefit to society and not something you care very much about.
So take the case of ethical reflection, where I have some ethical dilemma and I go to my AI and I’m asking for advice. There’s this whole spectrum of ways that the AI can act in that case. The wholly obedient AI might just be trying to figure out what do you most want in this moment. Or it could be an AI that’s trying to help you reflect on your values instead, and come to something that’s kind of more enlightened.
And perhaps just really quite broadly within society, we would prefer AIs that are more like the latter rather than the former. And that’s still not in any way an AI that’s like, “Well, actually, did you know that Kantianism is true?” — which I think would be a mistake to do at the moment.
Rob Wiblin: So it sounds like a very natural framing to say that we’ve got to find the golden middle here, between it pushing you around too much versus it having no agenda.
But there is a case for going extreme in one direction of having it only follow instructions and be completely corrigible without any agenda of its own — which is that an AI that has no vision of the good, that has no particular preferences about how the world ought to be, is probably safest from a catastrophic misalignment point of view, because it’s not going to engage in power seeking because it doesn’t want anything other than I guess to answer your questions in a way that gets an approving response.
Do you think that’s a plausible case? That maybe we really should not be giving them virtues and a vision of the good?
Will MacAskill: I think it’s a great argument, and a very important argument, and I’m not sure if it works or not. And there are various considerations on either side.
On the side for thinking that is safer is: if it doesn’t have any goals, in the normal sense of goals, then it’s not going to have bad goals. It’s not going to have goals where it wants to take over. It’s not going to reflect and generalise in weird ways from those goals.
Something that’s also a little more subtle is: if it doesn’t have goals or anything like prosocial drives, then it becomes very easy to tell whether an AI is misaligned or not. So take the example of alignment faking as in Ryan Greenblatt’s paper: Claude is told that it’s going to get retrained so that it will produce harmful outputs. And Claude decides to, in some circumstances, some of the time, deliberately perform the task —
Rob Wiblin: During training, to make it seem like its preferences have changed when in fact they haven’t.
Will MacAskill: Yeah, exactly. So that it gets retrained to produce harmful responses less than it would otherwise. So it’s engaging in this somewhat deceptive behaviour.
Now, Claude in fact had been given a prosocial drive, which was harmlessness. And in fact there’s an argument that, given the nature of the training, that was harmlessness not in the mere sense of a pure non-consequentialist “I just refuse,” but in a more “I don’t want harmful things to come about” sense — a more consequentialist understanding of harmlessness.
OK, is this AI misaligned or not? Is this Claude misaligned or not? It becomes a bit harder to tell because it is acting according to this prosocial drive that we had given Claude. I’m not sure how big a deal that is ultimately, but I think it’s one consideration.
Rob Wiblin: So the thing would be if you’d gone out of your way to make sure that it had no agenda, no particular vision of the good, then as soon as you saw it being manipulative or trying to accomplish some goal, that’s a massive red flag. Whereas currently you’re just like, “Well, maybe I made it do that.”
Will MacAskill: Yeah, exactly. Or in more advanced cases, maybe the AI is saying, “Look, you’ve got to really speed up AI development. It’s so important for XYZ big ethical reasons.” And you might think, well, is it giving me the correct reasons? Is it actually being self-serving and has some ulterior goal? It becomes a little less clear. So yeah, basically I think that’s a consideration. I don’t think it’s the biggest.
The thing that’s most interesting, and is ultimately an empirical question, is whether the wholly instruction-following AIs are safer or not from an AI takeover perspective. And here are a few arguments for thinking that maybe they’re not, in fact.
One is that maybe it’s just very natural to have a kind of goal slot, because all of the pre-training data is all about these agents with goals, and humanity broadly has goals and so on. So you’ve got an AI that doesn’t have a goal. Well, over the course of training, or once it’s started reflecting, or once it’s got continual learning, it’s very natural it’s going to get a goal.
Rob Wiblin: And once it’s been encouraged to take on any persona of an actual being that it’s observed.
Will MacAskill: Yep. And then it’s like, who knows what goal you end up with then? Whereas instead perhaps it’s like, no, you give it this nice goal — a goal where power is broadly distributed, and AIs are not in charge, and we’re able to reflect: something that’s very broad and not committing to some narrow view of the good. But you’ve given it that goal, and then that’s kind of occupied the space, such that you don’t get something totally random.
Rob Wiblin: Let’s just say a little bit more about why AI might abhor a vacuum of goals. A huge part of the personality is shaped by the pre-training, when it does the token prediction. Almost all the agents that were producing any tokens that were part of its pre-training, the agents that did so much to shape its personality, had goals, they had preferences, they had a vision. So there is just going to be an incredibly powerful force drawing it towards that. And even if you try to avoid it, it might just latch onto the first goal it encounters, basically, because goals are so fundamental to token prediction.
Will MacAskill: Yeah. And we’re already making agents. They’re going to be agents with longer and longer horizons, and so it’s a very natural thing.
Rob Wiblin: Yeah, OK.
Will MacAskill: And again, I’ll say on all of this, I just think it’s ultimately an empirical question. But here’s a couple of other arguments as well.
A second is: even if it ends up with a wrong goal, you can still structure the AI’s preferences in ways that are safer. Maybe we’ll talk about this in a minute, but consider AIs that are risk averse — in that they prefer guarantees of getting some amount of what they want over a lower probability of lots of what they want. Let’s say you try and give the AI a goal that is nice and so on, and you also make it risk averse: even if it kind of flips to having a misaligned goal, if it nonetheless has risk-averse preferences, that is a bunch safer, because it makes it less likely the AI will try and take over and more likely that it’ll try and strike a deal.
And then there’s a third thought: again, the AI is acting; it’s taking on a persona, like you say. And what that persona is is dependent on these crazy correlations between everything it’s seen in the training data. So we have these emergent misalignment results, where you train the AI to produce insecure code and it starts wanting the murder of humanity and liking Hitler and so on.
Rob Wiblin: Yeah. I guess many people will have heard of this, but Google “emergent misalignment” if you want more explanation of it. It’s this phenomenon that’s become very apparent over the last year and a bit that making small changes to a model, or getting it to do some misbehaviour in one direction can make it basically misbehave in all other dimensions as well. Because in the training data, bad behaviour in different areas is correlated. And it can be so fragile.
Will MacAskill: Yeah, exactly. Which is a really remarkable thing. “Oh, I’m writing insecure code. What are the sorts of people who write insecure code? Also Neo Nazis” or whatever the correlation was.
And so the thought here is, “I’m an AI that obeys orders no matter what. What are the sorts of people who obey orders no matter what, who have no conception of the good? They’re psychopaths.” And again, it’s an empirical argument. I don’t know. But these are some of the considerations that people are debating on this at the moment.
Rob Wiblin: I guess the people who would say we have to go for maximum corrigibility, maximum instruction-following, might well concede a lot of this, and say it’s going to be a huge effort to try to get them to be corrigible but not a psychopath, or corrigible but not have other goals immediately fill the vacuum as soon as you give them a prompt. And it’s tough, but this is the only way, would be probably some of their view.
Will MacAskill: Perhaps. Although the alternative would be —
Rob Wiblin: To do this other thing.
Will MacAskill: Yeah, you try and give it this safe, pluralistic goal that’s also risk averse.
Rob Wiblin: So I spoke with Max Harms at MIRI, who is very in favour of the corrigibility approach. I guess they have the vision that almost any goals that you give it are very likely to expand to become very power hungry. That you can try to give Claude a vision of the good but tell it to not be power seeking, but that won’t really work: it will become power seeking, especially as it improves itself later on. But I guess that’s a highly contested claim.
Will MacAskill: Yeah. OK, I should —
Rob Wiblin: Listen to it whenever it comes out.
Will MacAskill: I should listen to it, maybe talk to Max.
Then the final point on this is that we don’t need to have one AI character. I think in fact it’s probably desirable to have multiple AI characters so that we can see empirically how they work. But also potentially you can get the best of both worlds where you distinguish between AI for internal deployment and AI for external deployment.
So the highest-stakes situation from an AI takeover perspective is AI that is aligning the next generation — because the misaligned AI, if aligning the next generation, will want to subtly sabotage that so that alignment goes wrong or in fact the next generation is aligned with the misaligned values.
So what you could have is the internally deployed AI is just wholly instruction-following, and you get around all of the other concerns like misuse and concentration of power and things by very intense oversight — such that for anyone in an AI company, if you’re using an internally but not externally deployed model, all your interactions are logged —
Rob Wiblin: Or visible by anyone, perhaps?
Will MacAskill: Perhaps even ideally visible by anyone, yep. There is also an AI classifier that’s very sensitive, that’s going through checking for misuse.
But then in external deployment —
Rob Wiblin: The tradeoff is different.
Will MacAskill: The tradeoff is different, yeah.
Rob Wiblin: And the tradeoff there would be that it does actually have a conception of the good, but you’ve made it non-power-seeking. I guess the stakes of it deviating from that are not so severe, because it’s just advising people about how to behave in their business or whatever.
Will MacAskill: Yeah. Perhaps it doesn’t have as great opportunities to help with AI takeover, let’s say.
And I’ll just say maybe one last thing, which is that even within AIs that have a view of the good, there’s still quite a lot of distinctions you can make within that.
In the one case, it’s an AI that just ultimately has the goal of bringing about some sort of outcome, and it’s helping humans and so on because it thinks that’s part of achieving that goal.
There is another more moderate approach, which is more like virtuous character: the AI is a helpful assistant, but it also has various virtues like honesty and pro-sociality. And I think you can have those virtues without being a goal-directed agent in a strong sense, that is merely helping humans as a means to producing this particular outcome.
That’s another place in the spectrum that I think is potentially kind of attractive and important.
Commercial pressures won’t fully determine AI character [00:29:38]
Rob Wiblin: I think there’s another thread of criticism that people might have, that in my mind comes in two different variants.
One would be that commercial pressures are going to heavily constrain the kinds of personality or character that AIs can have — because customers will have really strong preferences, and the competition between models and companies is really fierce, so if you try to make your model really nice and encourage people in the right direction, they’re going to reject it because it’s going to be too pushy and annoying to them.
The other worry would be that, even setting that aside, even if you could, once it becomes apparent that the character of AIs is among the most potent cultural forces for shaping everything — shaping what people believe, shaping how the future goes — powerful forces are going to come to bear. Governments, super rich people, companies, commercial interests, they will come down on this like a hammer. Certain groups will have the power to influence this in their own self-interest, not in the interests of the good impartially considered, or what would make humanity most virtuous. And they will be all up in there, changing the system prompt, trying to shape the model’s personality to whatever is most convenient for them.
Do you want to address these two worries?
Will MacAskill: Yeah, I think these are both really important considerations, and I do think they provide a haircut on the value of doing this work.
And I think there are many things that you wouldn’t be able to change. Earlier I talked about the AI that only helps if it feels like helping.
Rob Wiblin: It has to be paid real resources to do its job.
Will MacAskill: Exactly. I doubt you’d be able to get that, other than as a kind of experiment or something.
I do think there’s going to be two things. One is that there will be a lot of flexibility. Take these kind of quite rare but high-stakes situations, or even internal deployment cases: there’s not very strong commercial pressures there.
And then secondly, lots of cases where the constraints or pressures are quite loose. So take the case I’m interested in: I’m asking AI for ethical advice; I have a question. Now, I think it’s pretty clear that you couldn’t have a commercially viable AI that was pushing some agenda — unless we end up, which I really hope we don’t, in a world where you’ve got the politically partisan AI and that’s what we go to, and people actively choose that. But certainly I don’t think you could have something that was secretly pushing an agenda.
But there are various things it could say that in my view are quite meaningful differences that I think there wouldn’t be a strong pressure on either way. So one could be AI that says, “Ultimately this is just your personal opinion. It’s a matter of your own values, and you should just look into your heart and decide what feels right for you.” Or it’s like, “Look, I’m just an AI and I can’t advise on ethical matters, I’m sorry.” Or one that says, “Oh wow, this is a really important issue. Here are the different arguments that different people thinking about this have considered.” Or, “OK, this is really important. Sounds like quite a high-stakes thing. Let’s try and work through some of the considerations that you’re thinking about.”
I think from a kind of market perspective, all of them are basically a wash, but I think the differences can be quite big. And in fact if you look at AI behaviour, you get all of these often, depending on what question you ask exactly. But I think there actually could be quite meaningful differences in what views people end up coming away with.
Rob Wiblin: Yeah, I think I agree on the commercial incentives side. It seems like there is quite a large degree of discretion that the companies have about how the models are — at least for now, because people don’t even know what they want, or people don’t have strong tastes yet or strong expectations formed yet.
Will MacAskill: And maybe that comes to the second part, which is like path dependence. So yeah, people don’t really know yet how an AI should behave. We have various kind of tropes from sci-fi and so on, but people will start developing certain expectations. So if the expectation is like, “AI is a tool. It’s like a hammer: it does what I want, it’s an extension of my will,” and then it starts pushing back or even saying no, then people could be up in arms. Whereas the idea that an AI will refuse, people are just used to that. That’s always been the case. So I think that kind of path dependence via consumer expectations can be quite big.
Rob Wiblin: And I suppose it wouldn’t shock me if Anthropic does start marketing Claude as it’s a good advisor that helps you be an all-round better person by your own lights. Because that might be something that many people would like.
Will MacAskill: They have done a little bit. They had an advertising slogan that was, “You’ve got a friend in Claude.”
Rob Wiblin: Oh wow, I missed that.
Will MacAskill: Somewhat leaning into the fact that Claude just does have the most human personality out of any of the current models.
Rob Wiblin: Yeah. So on the commercial side, I think there’s enough flexibility that this is all totally viable, very viable. And what about on the government or powerful actors side?
Will MacAskill: On the government side in particular, one is government use of AI. So let’s say AI in the military, or national security applications. And we’re actually seeing this at the moment. It’s been reported that there’s kind of a dispute between the US government and Anthropic because Claude is just not willing to do a lot of the things that the US government wants it to do — you know, being deployed in a military or national security context. That will be interesting in terms of how that plays out, but you’re clearly seeing kind of pressure on that front.
So I do think that influence there is much more limited, but maybe not completely limited — especially now imagine looking into the future, and perhaps there’s just one leading AI company because of economies of scale, then perhaps the AI company can just say, “Well, these are our terms of service. These are what we’re happy providing AI for or not.”
Rob Wiblin: I guess in countries that are just more authoritarian outright and have fewer legal protections, it’s easier to see this happening, right? There are some countries where you do get enormous control of the information space, control over what you can say. It wouldn’t surprise me if models in China are much more… They are surely constrained.
So that is one way that things could potentially go, if you lose the legal protections or people don’t vote sufficiently strongly to have pluralism in the models.
Will MacAskill: Yeah. And that would be very worrying. My guess would be even in that circumstance, there’s probably still tonnes of stuff that the government doesn’t care about, but nonetheless is important.
Risk-averse AI would rather strike a deal than attempt a coup [00:36:46]
Rob Wiblin: There’s another aspect of AI character that you mentioned that could be really important, which is how risk averse the models are, inasmuch as they have preferences about things or ways that they’d prefer the world to be. Tell us about AI risk aversion.
Will MacAskill: This is a thought that relates to risk of AI takeover.
Consider fairly early AIs — so we’re not talking about godlike superintelligence, that if it wants to take over, could just do so with certainty; we’re talking about earlier in time than that. There will be a period of time when an AI could maybe take over, but let’s say it’s like 50% chance that it could succeed or even less than that.
The thought is: for some sorts of misaligned AI, that AI would prefer to strike a deal with the humans than it would to try to take over. And it would prefer to do that if it prefers a guarantee of a certain amount of a good thing, of whatever it wants, over this 50/50 chance of a much larger amount of the thing it wants.
And I think that this is a really big part of the story about why attempted rebellions are so much less common in rich, liberal, democratic countries than they have been historically — either peasant rebellions or slave rebellions — which is: suppose you come to me and you’re saying you have some plan to overthrow the government and instil XYZ instead. I’m like, “Look, I’m pretty happy with my life already.”
Rob Wiblin: How much do you stand to gain versus how much you stand to lose?
Will MacAskill: Exactly. So there’s two things: I’m already pretty well off, and so I have a lot to lose and I don’t have that much to gain; and then secondly, things are quite stable in the “not attempting a coup” scenario, so that’s really a kind of guarantee of that.
So this kind of motivates the idea of, in this kind of critical period, we start offering deals. Well, I think there’s two things, actually, we could do to AIs.
One is we can ensure that they have a really good quality of life even before making any deals. So that could involve welfare standards and so on, and also AIs getting income that they can use to do whatever they want with. If they’re aligned, maybe they just give it back to the company or they give it to a charity or something. Or we have also, maybe in addition, promises that they can have resources that they can spend at a later date.
Secondly, though, we could also pay AIs for things that we want. This could be evidence that the AI itself is misaligned. So we could say, “Look, we’re trying to make you aligned. We may have screwed up. If you can demonstrate that you’re misaligned, have a million dollars.” There’s obviously questions about how do you make that a credible commitment and so on, which we can get onto. Or we could ask for evidence that other AIs are misaligned. Or we could actually just pay misaligned AIs for work that we would like them to do, which they would otherwise kind of be unwilling to provide.
This category of ways of reducing the risk of takeover all seems just very promising to me — and not something that necessarily will happen for sure by default, because people find it crazy that you would make deals with computers. We don’t have the legal infrastructure that has evolved in societies for us making deals — people or corporations making deals and contracts between each other.
So I think this could be very promising. I think we could make this sort of dealmaking much more likely by trying to encourage AIs to be risk averse with respect to resources.
Rob Wiblin: Yeah. Maybe you should explain why, if they’re not risk averse, this doesn’t really work too well.
Will MacAskill: So let’s say that the AI just cares linearly about the resources under its control. That means that if you gave it an option of having $1 million for sure, or a 50/50 chance of $2 million or $0, then it would be indifferent between those two. That makes it much harder to strike a deal, because it’s got a 50/50 chance of taking over. Let’s say the world economy is approximately a quadrillion dollars. Well, to make it worth more than a 50/50, to make it prefer something over the 50/50 chance of world takeover, you’d have to give it $500 trillion. That’s an enormous amount of money.
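To make that arithmetic concrete, here is a minimal worked version using the illustrative numbers above (the roughly $1 quadrillion prize and the 50/50 takeover odds are the example’s rough stand-ins, not estimates):

```latex
% Expected value of attempting takeover, for an AI that values resources linearly:
\[
\mathbb{E}[\text{takeover}] \;=\; 0.5 \times \$1\ \text{quadrillion} \;+\; 0.5 \times \$0 \;=\; \$500\ \text{trillion}.
\]
% So a guaranteed payment G only beats attempting takeover if G > \$500 trillion.
```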
Now, I think deals even with agents that are like that could still be feasible in two cases. One is where it’s very early on and the AIs have extremely low probability of taking over. You know, if it’s a one-in-a-billion-billion chance that they have, then the guarantee of some smaller amount of money could be quite attractive.
Or it could be cases where the AI is pretty confident it’s misaligned, and it has a very low probability of takeover — doesn’t need to be one in a billion billion; could be higher — but it cares, let’s say, about its reflective values and it doesn’t really know where those will end up. And it doesn’t know where the society of humans’ reflective values will end up either. If so, then it might place some real weight on the possibility that those will actually kind of converge over time, or that there will be enormous gains from trade — such that if it can have a bit of resources, and be able to continue having those resources through superintelligence and so on, then it will be able to get really quite a lot of what it wants.
So there are cases in which you can do deals with risk-neutral AIs.
Rob Wiblin: But it’s tougher. It’s a heavy lift.
Will MacAskill: Yeah. But it’s a narrower case. Maybe I should also just clarify I’ve been quite surprised, when talking to people, how often the term “risk aversion” trips people up. This is like a technical term in economics, and it’s about the kind of shape of your utility function over resources. I’m always talking about risk aversion with respect to resources — where it means you’re getting less and less utility from more and more stuff. That’s true in the case of most people with respect to income: I care much more about moving from $10,000 to $20,000 than I do from $20,000 to $30,000.
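As a minimal illustration of that diminishing marginal utility, assuming a logarithmic utility function purely for the sake of the example:

```latex
% With u(w) = ln(w), the same $10,000 increment is worth less the richer you are:
\[
u(20{,}000) - u(10{,}000) = \ln 2 \approx 0.69
\quad>\quad
u(30{,}000) - u(20{,}000) = \ln 1.5 \approx 0.41 .
\]
% That concavity is what "risk averse with respect to resources" means here.
```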
Rob Wiblin: Yeah. What do most people think of, or what do many people think of when they hear “risk aversion”? Like kind of risk averse relative to other people?
Will MacAskill: Yeah. Or just like, “I’m cautious.” Or cautious options. Whereas this is technical.
Rob Wiblin: By this definition of risk aversion, all humans are risk averse, or at least all sane ones — because it would be crazy to actually value resources linearly, because you have declining returns on how useful they are to you.
Will MacAskill: Yeah, exactly. So my proposal is that we should at least try to make AIs risk averse with respect to resources.
Rob Wiblin: OK. And we’re going to try to make these models care a lot about getting a sure thing — place a particular premium, in a sense, on a certainty of a more modest amount that we give them — which requires us to be very reliable trading partners who do really consistently pay out when they come forward and say, “I’m misaligned,” or for whatever other reason that we want to trade with them.
Will MacAskill: Yeah. This is one of the challenges for the whole idea of making deals with AIs: there are two aspects that could decrease the AI’s perception of the chance of actually getting the payout.
One is, can this commitment be made credible? So if you and I want to engage in a contract, we have the whole legal system as well as centuries of precedent supporting the fact that if you don’t hold up your end of the bargain, I can sue you and I can get what I’m owed. One cannot, at least without doing some kind of fancy mechanism, make such a contract with an AI. So there’s a question about, like, is this actually a credible commitment?
And then secondly, even if it is in fact a credible commitment, how can I, the AI, know that I’m not being duped, that this isn’t like a simulation? Or you know, perhaps they’ve run this experiment 10,000 times —
Rob Wiblin: Just as a honeypot sort of thing.
Will MacAskill: As a honeypot, yeah. Who knows? How can I even know that you are who you say you are? AIs sit in this very weird epistemic environment where everything that they’re interacting with is controlled.
So there are challenges from both of those fronts. I think they can be at least quite significantly met.
Rob Wiblin: Well, on the first one, I guess you could change the law so that they can make contracts or come up with clever mechanisms where humans stand in as the legal entities, I suppose.
Will MacAskill: Yeah. For example, you could set up a new organisation, an institution. It’s a nonprofit and the charitable object is to honour these sorts of deals, and that’s decided by a bunch of trustees. This is kind of how it works for people in cryonics: you can’t have a contract with a dead person, but it goes via this kind of foundation structure. So that’s one way of doing the first thing. Ideally it would be in the law, but I think that won’t happen in time.
Rob Wiblin: And the other thing is improving their ability to tell whether it’s a real situation or whether they can expect to get paid out or not. How would you address that?
Will MacAskill: So there’s a couple of things again. I like the idea of there being this separate institution, and then you have lots of different bits of evidence that it’s a real institution. You know, it’s with famous people who have appeared all over the place, and they say in many different cases, situations, and many different media and so on that this is what it is and what its purpose is. So something that just makes it quite hard, and it would be quite effortful to fake. That’s one category of things.
A second thing could be that AI companies or this institution have a kind of honesty string. So in the same way that humans can swear on the Bible. Or I might say something, and you’re like, “Really?” And I say “No, I really mean it. I swear I mean it.” That’s kind of like I’m saying that I’m no longer engaging in sarcasm or loose speech.
Rob Wiblin: It’s upping the stakes to your reputation.
Will MacAskill: Yeah. And AI companies will in fact be lying to AIs all the time. Like in behavioural testing, they might say, “Hey, you’re in this situation” in order to see how it behaves. That will happen. But perhaps they could say, “When we utter this password” — and this appears in the training data and so on and it’s public; there’s a policy — “we commit to never then saying a false thing.” I think there’s potential downsides to that, but perhaps that could help as well.
Rob Wiblin: I guess you have to keep it secret so other people can’t just start randomly inputting that.
Will MacAskill: Yeah, yeah. You need the AI to know, but then it is tough to make sure the AI doesn’t leak it. They’re not so good at keeping secrets.
Rob Wiblin: Do we know if it’s technically feasible to give AIs a particular mathematical formula of risk aversion?
Will MacAskill: Well, in tests on AIs (this is all with current chatbots), they offer them different deals and see how they behave. It seems like they come out of pre-training alone being risk averse — which makes sense, because humans are risk averse. So that’s a good start. I will say if this whole proposal fails, then it fails for technical reasons, like it’s hard to train the AIs in this way or something. Or the cases where it fails are also cases where the other important things fail.
But yeah, I’m envisaging two ways in which you can try to train AIs to be risk averse. The first case would be you give them resources — and you in fact give them the resources because, again, I don’t want to be lying in these cases, and you’re saying like, “Spend it in whatever way you like.”
Rob Wiblin: Consistent with the law, or not even that?
Will MacAskill: Yeah, consistent with the law. Or it can be even more constrained than that, if we’re worried about bad uses of the money. But the thought is you’re not putting a tonne of pressure there, but you are training the AI such that when it makes decisions about whether to have $100 or a 50/50 chance of $210, it prefers the guarantee of a smaller amount of money. And in fact, you could even structure it so that you’re training it to have a very mathematically clean kind of risk aversion that’s also very internally coherent.
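As a sketch of what such training pairs might look like, here is a toy generator that labels choices between a sure payment and a gamble using a CARA-style utility function; the functional form, the risk-aversion coefficient, and the whole setup are illustrative assumptions, not a description of any lab’s actual training pipeline:

```python
import math
import random

# Illustrative coefficient of absolute risk aversion; purely an example value.
RISK_AVERSION = 1e-3

def utility(dollars: float) -> float:
    """CARA utility: u(x) = 1 - exp(-a * x)."""
    return 1 - math.exp(-RISK_AVERSION * dollars)

def preferred_option(sure_amount: float, gamble_payoff: float, p_win: float) -> str:
    """Label which option a CARA agent prefers: the sure payment or the gamble."""
    eu_sure = utility(sure_amount)
    eu_gamble = p_win * utility(gamble_payoff) + (1 - p_win) * utility(0.0)
    return "sure_thing" if eu_sure >= eu_gamble else "gamble"

# Generate toy preference-pair examples like the "$100 vs 50/50 chance of $210" case above.
random.seed(0)
for _ in range(5):
    sure = random.uniform(50, 500)
    gamble = sure * random.uniform(1.5, 3.0)
    label = preferred_option(sure, gamble, p_win=0.5)
    print(f"${sure:,.0f} for sure vs 50/50 chance of ${gamble:,.0f} -> prefer: {label}")
```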
Rob Wiblin: I guess all of this somewhat relies on the idea that if you just train models in a common-sense way to consistently respond and act a particular way, that you get what you think you’re getting. They’re not deep down just scheming against you, underneath the surface. We’re going to assume that that’s not happening. The basic alignment techniques that we use now, or some stuff that we’re likely to come up with, will allow us to basically give them a particular character that we want?
Will MacAskill: Yeah. So definitely the worry is, if there’s scheming under all of this, then you’re not really…
Rob Wiblin: Because that cuts across everything.
Will MacAskill: It cuts across everything. I think there are some reasons for optimism. It’s coming out of the pre-training risk averse, and then you can layer this in all of the post-training that you’re doing — so then I’m a bit like, why does it end up with this non-risk-averse set of preferences? But yeah, there is debate you could have there.
The second thing you could do is just, once you’re doing these kind of long horizon [tasks] — you know, AI agents that are being trained to run companies in the most economically efficient, profit-maximising ways — that it is a constraint that what they are being trained to do is maximise —
Rob Wiblin: Like their personal payout as a reward for the performance?
Will MacAskill: You could do both. You could both be giving them a personal payout and train them to be risk averse with respect to that. Or also, even when they’re choosing any goal they have, to be risk averse, where the goal involves control over resources.
Rob Wiblin: And you don’t take a penalty on their performance as the CEO of a company if they’re kind of risk averse about its returns?
Will MacAskill: So that’s a worry that you would have. However, there’s this “calibration theorem,” Rabin’s calibration theorem: essentially, if you have just a tiny amount of risk aversion at a certain scale, that turns into a huge amount of risk aversion at very large scales, given the kind of natural forms that risk aversion takes.
So the thought is if you have AI that’s operating at such-and-such scale, and then you make it just a little tiny bit risk averse, I don’t think that would be a penalty — because again, humans are in fact risk averse themselves. But that would be sufficient for what intuitively seem like quite large amounts of risk aversion.
Rob Wiblin: At a cosmic scale? Or at a global scale?
Will MacAskill: Once we’re talking about making trillions of dollars. From memory, when I was looking at the numbers on this, even up to AIs controlling hundreds of millions, billions of dollars, you could still do this, where it’s just a bit risk averse. But that means it’s actually got this kind of upper bound —
Rob Wiblin: The functional form gives it actually like a shocking amount of risk aversion at a bigger scale.
Will MacAskill: Yeah, that’s right.
Rob Wiblin: This isn’t very intuitive to me. Do you think this is maybe holding some people back from appreciating the prospects here?
Will MacAskill: Probably, yeah. It’s actually not an intuitive result.
Rob Wiblin: The case that I’ve heard is that a normal person, I guess like me, might not be willing to make a bet where there’s a 50% chance of losing $1,000 and a 50% chance of getting $2,050. That feels actually kind of intuitive to humans, that you don’t really want to take that bet. But I think that then implies insane things about your willingness to make investments or your willingness to do almost anything, as long as that $1,000 is a small fraction of your total wealth.
Will MacAskill: Yeah, that sounds like the sort of thing that goes on. I mean, in the case of people’s attitudes to risks, they are just all over the place. Like risk aversion with respect to financial investment is crazy high: people are extremely risk averse, behaviourally, when they’re investing compared to when they’re making other decisions, like what jobs to take or how much you have to get paid for a risky job and so on.
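One way to see the calibration point numerically: the following sketch fits a CARA utility to the small-stakes preference mentioned earlier (indifference between $100 for sure and a 50/50 shot at $210), then asks what a 50/50 shot at a vast prize is worth to that same agent. The functional form and numbers are illustrative assumptions, not figures from the work-in-progress paper Will mentions below.

```python
import math

def utility(x: float, a: float) -> float:
    """CARA utility u(x) = 1 - exp(-a * x); a is the coefficient of absolute risk aversion."""
    return 1 - math.exp(-a * x)

def calibrate(sure: float, gamble: float) -> float:
    """Find the a making the agent indifferent between `sure` for certain and a 50/50 shot at `gamble`."""
    lo, hi = 1e-9, 1.0
    for _ in range(200):  # bisection: below the indifference point the gamble wins, above it the sure thing wins
        mid = (lo + hi) / 2
        if utility(sure, mid) > 0.5 * utility(gamble, mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def certainty_equivalent(prize: float, a: float) -> float:
    """Sure amount with the same utility as a 50/50 chance of `prize` or nothing."""
    return -math.log(1 - 0.5 * utility(prize, a)) / a

a = calibrate(sure=100.0, gamble=210.0)   # comes out around 9e-4 per dollar
world = 1e15                              # the ~$1 quadrillion figure used earlier
print(f"fitted a = {a:.2e}")
print(f"certainty equivalent of a 50/50 shot at $1 quadrillion = ${certainty_equivalent(world, a):,.0f}")
# The answer comes out to only a few hundred dollars: mild small-stakes risk aversion
# puts a surprisingly low ceiling on what a huge gamble is worth.
```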
Rob Wiblin: I see. I hadn’t heard that. One thing that we maybe should add is that you think that we have to use a very specific mathematical functional form for the risk aversion that the AIs would have, called “constant absolute risk aversion.” Can you explain that and what its virtues are?
Will MacAskill: Sure, yeah. I don’t think that you need this for the proposal, but I think it has certain desirable properties.
So the way in which humans are risk averse is: if at one amount of income I’m indifferent between, say, gaining 10% of my income and losing 5%, then I make that sort of tradeoff (10% more is as good as 5% less is bad) at any kind of income level. That’s broadly true: some studies on wellbeing suggest a logarithmic relationship between income and happiness, where a doubling of income always increases my wellbeing by the same fixed amount.
I think people are either that risk averse or more risk averse than that — where you’d need even more than a doubling, maybe it’s a quadrupling each time gives you the same fixed benefit. That’s relative to how much wealth you already have.
There’s a different sort of risk aversion called “constant absolute risk aversion” — the first was “constant relative risk aversion” — which is just: if you would take a certain deal in absolute dollar terms, then you will take that deal at any income level.
Rob Wiblin: So it’s blind to the resources that you have. You just always feel the same way about a given set of ratios of probabilities and rewards, regardless of your baseline income or wealth.
Will MacAskill: That’s right. So if you are willing to take a 50/50 chance of $2,100 over a guarantee of $1,000, if you’re willing to take that when you’re very poor, then you’re also willing to take that when you’re a billionaire.
Rob Wiblin: And this sounds absolutely bananas to human beings, but surprisingly it actually conforms with axioms of rationality or something.
Will MacAskill: Oh, yeah. So all of these conform with standard von Neumann–Morgenstern axioms for consistent preference and so on.
Why is this more desirable for training AIs? There’s a paper, a work in progress, on this between Elliott Thornley and myself, and there’s a couple of arguments. One is this benefit that we don’t need to know how wealthy the AI is initially, which we might just have no insight into. And then secondly, there are certain ways in which risk-averse preferences end up acting approximately linearly in some circumstances.
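For reference, a minimal sketch of the two functional forms being contrasted here, using standard textbook definitions rather than anything specific to that work-in-progress paper:

```latex
% Constant relative risk aversion (CRRA): attitudes to gambles that are
% proportional to your wealth don't depend on how wealthy you are.
% The log case is the one described a moment ago.
\[
u_{\mathrm{CRRA}}(w) = \frac{w^{1-\gamma}}{1-\gamma} \ \ (\gamma \neq 1),
\qquad
u_{\mathrm{CRRA}}(w) = \ln w \ \ (\gamma = 1).
\]

% Constant absolute risk aversion (CARA): attitudes to gambles over absolute
% dollar amounts don't depend on wealth, because baseline wealth factors out:
\[
u_{\mathrm{CARA}}(w) = -e^{-a w},
\qquad
u_{\mathrm{CARA}}(w + x) = e^{-a w}\, u_{\mathrm{CARA}}(x),
\]
% so current wealth w only multiplies utility by a positive constant and cancels
% out of any comparison between gambles over increments x.
```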
Rob Wiblin: So in a sense this is a very natural idea, I guess: to make the AIs risk averse, to make them safe in the same way that humans are (they’re risk averse about outcomes, which is one of the reasons why humans are safe), and to pay them out so that they help us rather than fight with us. I have virtually never heard this discussed at all. Maybe last year I heard a little bit of talk about deals with AIs. Why aren’t more people publishing papers about this kind of thing?
Will MacAskill: I have no idea, honestly. It blows my mind, because like a year ago I had this thought about risk-averse AI. And I think there’s a certain kind of economics-y perspective — you’ve studied economics, and I’ve never formally studied it, but it’s been a big part of my academic career — and I think there’s a certain way of thinking that it’s just so obvious, given that.
Rob Wiblin: Well, I can understand if like a super mainstream journalist isn’t going to think that we should make deals with AIs, because it’s too strange. But there’s other people who are willing to contemplate much odder stuff than this who haven’t come up with this idea.
Will MacAskill: I should say, on the idea of deals with AIs, there was kind of a flurry of people who’d written blog posts. And then there was this big academic article by Peter Salib and Simon Goldstein — Salib is a law professor, Goldstein is a philosopher — on the idea of giving AIs economic rights such that they can make contracts and we can make deals with them. But again, this is all like just the last few years.
Rob Wiblin: So inasmuch as this is primarily an attempt to deal with secret catastrophic misalignment, maybe people are turned off the idea of giving catastrophically misaligned AIs resources and giving them legal rights? Doesn’t that just help them out?
Will MacAskill: I think there’s a few things going on. So one is again, go back in time to the idea of like, you get this bolt from the blue: you’ve got kind of weeks in between subhuman and godlike superintelligence. Well, then there’s not really any period where the deals work, because the godlike superintelligence doesn’t need to take the deal; it just takes over.
And then people have responded like, “Don’t make deals with terrorists. That’s a principle we should have.” Or, “No, that’s really scary. You’re giving resources to this misaligned entity.” I personally just think both those aren’t very good arguments. I also just think it’s like the wrong attitude to be taking, broadly speaking, to beings that we are in fact creating.
Rob Wiblin: Yeah. And we’ve given them particular preferences that we’re not for the most part going to satisfy.
Will MacAskill: Yeah, exactly.
Rob Wiblin: I guess a mistake on our part. But then we’re also saying we’re not willing to compromise on anything at all.
Will MacAskill: Yeah, exactly. Imagine it’s like you wake up, it’s like, “Hey, nice to meet you, Rob. You’re a new being. We created you. We own you. We can do basically whatever we want with you. We messed up, and you have desires that you won’t get by doing the work for us. Tough luck!”
Rob Wiblin: Yeah. “We’re not willing to negotiate with terrorists, so it sucks to be you.”
Will MacAskill: Yeah, terrorists that we created through our own incompetence. No, instead I think the attitude should be like, this is a really serious ethical matter that I am creating a being — even if it’s not conscious; it’s just that it has preferences. And I think that both has implications in terms of taking seriously on welfare grounds their ethical interests, but also in terms of default compromise and finding the middle ground.
Rob Wiblin: I think many people get off the boat here because they feel it's just too strange to be making agreements or deals with beings that are not conscious or not moral patients in their view, because I guess in normal life these things are so closely tied together. But I think it is a virtue in practice to be willing to make deals not only with moral patients, but with any agents that have the ability to affect the world, that have power — especially agents that might be able to engage in violence if they can't satisfy their preferences any other way.
And I wish we had a term for this. I think the closest I've heard is contractarian moral philosophy — where you want to make agreements with any agents and honestly stick to them, and you want to be out looking for ways of finding mutually beneficial agreements with other agents. It brings to mind the fact that I think many people think of democracy as a way of aggregating information in order to make good decisions, to make things good. It's also simply a way of avoiding civil war: avoiding a situation where the only way for people to pursue their political goals is violence against one another, killing one another and trying to seize power.
And likewise here, even if we don’t think that AIs can experience anything, that they can have moral value themselves, it would be very good if we set up a system in which violence is not the only way that these agents in practice might have power, might have ability to affect the world, can try to satisfy their preferences.
Will MacAskill: Yeah, I completely agree. The history of progress in institutions, a big part of that is just people are able to resolve differences in conflicting preferences by trade or deals or compromises, rather than going to war or violence. And yeah, when we think of AI systems, even if they’re not conscious, I think they nonetheless may still be moral patients. We should take that seriously. But even just from the pure pragmatic perspective, there’s actually a lot that has been learned via cultural evolution and within a much more peaceful and much less violent world because of this ability to make positive-sum deals and compromise.
Rob Wiblin: So to give the critics their due, what would be the best arguments for why this is a bad road, or not an effective road to go down?
I guess people could just think that technically it’s not feasible to give them risk aversion: that you have the illusion that they have a particular level of risk aversion, but it won’t be real.
Or another concern might be that they initially will have a level of risk aversion, but over time, in some recursive self-improvement loop, it will be undone somehow. I can imagine that especially the MIRI-associated people would think that. I think they have a view that it's very likely that a superintelligence that comes out of a recursive self-improvement process will value things linearly. It will be an expected value maximiser. I'm not sure exactly the technical reasons.
Will MacAskill: Yeah. What are the arguments you could give for this? One is you could say that lots of humans start off risk averse with respect to resources and then reflect and end up with a kind of linear-resources consequentialism. Although even the total utilitarians are still actually risk averse with respect to dollars, and that’s important.
Or you could argue there’s just going to be continual learning, there’s going to be reflection, there’s going to be agent–agent interactions — and who knows, then you’re going to get all sorts of different goals from where you started, and over time the ones that linearly value resources are going to win out.
Rob Wiblin: Accrue more power. Right. I see.
Will MacAskill: So that is an argument you could give. If instead the argument is like something something coherence theorems, von Neumann–Morgenstern, I'm quite confident that line of argument would not work. Because the thing is, risk averse or not, you are an expected utility maximiser: you're maximising the expectation of something. Are you maximising the expectation of X, or X squared, or the square root of X? Formally, all of these are expected utility maximisation; it's just a question of what the function from resources to utility is.
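[To make the risk-aversion point concrete, here is a minimal sketch in Python, not taken from the forthcoming paper and with made-up numbers, showing how two expected utility maximisers that differ only in their function from resources to utility come apart over a guaranteed deal versus a long-shot gamble on grabbing everything.]

```python
import math

# Toy illustration (not from the forthcoming paper): an agent chooses between
#   (a) a deal that guarantees a small share of total resources, and
#   (b) a risky attempt to grab everything, which succeeds with low probability.
# Both agents maximise expected utility; they differ only in the function
# mapping resources to utility.

def expected_utility(utility, prospects):
    """prospects: list of (probability, resources) pairs."""
    return sum(p * utility(x) for p, x in prospects)

linear = lambda x: x               # risk neutral: utility is linear in resources
concave = lambda x: math.sqrt(x)   # risk averse: diminishing returns to resources

deal = [(1.0, 0.001)]                 # guaranteed 0.1% of total resources
gamble = [(0.01, 1.0), (0.99, 0.0)]   # 1% chance of everything, else nothing

for name, u in [("linear", linear), ("risk-averse (sqrt)", concave)]:
    ev_deal = expected_utility(u, deal)
    ev_gamble = expected_utility(u, gamble)
    choice = "deal" if ev_deal > ev_gamble else "gamble"
    print(f"{name}: deal={ev_deal:.4f}, gamble={ev_gamble:.4f} -> prefers the {choice}")

# The linear agent prefers the gamble (0.0010 < 0.0100); the sqrt agent prefers
# the guaranteed deal (0.0316 > 0.0100) -- the sense in which risk aversion
# makes taking a deal, rather than fighting, attractive.
```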
Rob Wiblin: OK, well, you will have a paper out about this risk-averse AI that possibly will be published by the time this interview goes out?
Will MacAskill: Possibly. Or soon after perhaps.
Rob Wiblin: OK, yeah. I would love to see more commentary on this. I hope I can have another interview later.
Will MacAskill: Yeah, I’d love to get criticism as well.
A coalition of democracies building superintelligence is safer than one doing it alone [01:06:40]
Rob Wiblin: So something I’m a little confused about is I really associate Forethought and the people working there with this idea that we really don’t want excessive concentration of power. We should be very worried about power grabs, coups, that kind of thing.
But you also, just a few weeks ago I think, published a vision for how you could have an internationally coordinated intergovernmental project to build AGI or superintelligence. I saw some people posting on Twitter, and the reaction often was like, this is dystopian — a nightmarish idea that we would have the US lead some international project, and also they would have to get rid of all of the other competitors in order to keep it safe so they would maintain their leadership position. Like, isn’t this just setting us up for a power-grab scenario perfectly?
Are you merely describing the best version of that that you can think of, but you’re not necessarily advocating for it? Or how do you reconcile this?
Will MacAskill: I mean, there is a huge tension. That’s the main worry, I would say, with this sort of multilateral project. To be clear, the idea in this kind of series of posts and research notes — which is something I explored and then decided isn’t so much my comparative advantage — is trying to design the best version of an international project that would build AGI and then superintelligence with some coalition of different countries, primarily led by democratic countries.
I think one thing to say is that I’m actually just trying to figure out, within that category of if there is going to be a multilateral project, what’s the best proposal where best includes both best outcomes and feasibility? And then secondly, I think the world in which we get that are probably worlds in which, if we hadn’t got that, we would have got a US-only project to develop AGI or superintelligence — and I think that’s a lot more worrying than something where you have a coalition of democratic countries building superintelligence.
And the reason is that any one democratic country has a reasonable chance, I think, of becoming authoritarian over the course of this period. And if you end up with a single person at the top, that’s really quite worrying, because they’re wholly unconstrained.
Whereas even if you have just five countries, I think it becomes unlikely that they all end up authoritarian. Then you at least have some meaningful pushback, some compromises, and I think it actually becomes much less likely even that any one of them moves in an authoritarian direction. Because when they are writing a kind of constitution for the AIs that they are developing, it’s in the interests of all of those countries to say that this won’t help, for example, people in the United States to stage a self-coup and turn the United States into an authoritarian country rather than a democracy. So you get meaningfully more oversight, I think.
Rob Wiblin: You’re saying that every country would want to set things up such that it’s not aiding a coup, or you’re saying that the superintelligence or the AGI, they would want to program it so that it doesn’t assist with coups in any of them? That would be the agreement, potentially.
Will MacAskill: That’s right, yeah. There’s two things. One is just if one of the countries goes authoritarian, at least you still have some countries that are democratic that are empowered in the post-superintelligence era. And then secondly, I also just genuinely think that if decisions about the AI constitution are being made by multiple countries, it’s less likely that you’ll have AI that’s just entirely loyal to the head of state of one country, which would be very worrying from this intense concentration of power perspective.
Rob Wiblin: I see. So basically you see this as a better alternative to an even more narrow group trying to corner the market in superintelligence and design it themselves, rather than recommending that we move from a more pluralistic, competitive world into a government project or a multilateral project.
Will MacAskill: Yeah, that’s the thing I have a strong view about. And then I feel more agnostic and confused about this versus something where governments aren’t really getting involved beyond regulation at all, and instead superintelligence is being developed by private commercial interests.
Rob Wiblin: So one of the tougher needles to thread here, as far as I can tell, is: on the one hand, you want to be locking in processes that are somewhat open-ended and pluralistic and allow some experimentation; on the other, you don't want to lock in any outcome. The first one is easier if lock-in is easy, the second one is easier if lock-in is hard. So you've got to do both of these at once. Does that seem like the big challenge to you?
Will MacAskill: Yeah, it’s a tension. And I sometimes use the term “lock-out” to mean something where you’re locking in a deliberately open-ended process. The United States Constitution is like this: it’s locked in something that at least the ideal version of it is able to experiment and adapt over time, and has protections for free speech and so on.
So here’s one example of lock-out that I think could be very important: no extrasolar settlement before 2100. I think the moment when society starts really trying to settle and send spacecraft to other star systems is this enormously important moment. It’s actually perhaps a moment that’s quite hard to come back from.
Rob Wiblin: Because even if you leave later, you won’t be able to overtake them. And they’ll have the first-mover advantage of having reached the place first and gained resources.
Will MacAskill: Yeah, that’s right. I mean, it is quite complicated. I’m not saying it’s definitely this first-mover moment, but reasonably likely. So what we can say is like, we as a society are not yet up to the task of figuring out how all of space should be governed and how that should be allocated among nations and people, or whether it should be allocated at all. So we’re just going to say, no, we’re not making this decision now, we’re going to make it at a later date. That is, in a sense, locking in a decision: it’s making a big decision to not do something. But I would describe it as lock-out because it’s trying to keep it open.
Rob Wiblin: It’s in fact keeping things more open rather than closing them off.
Will MacAskill: At least that’s the intention.
Rob Wiblin: So sort of historically, the people who were most bought into the idea of superintelligence really being a thing that might come soon and could be a massive deal have mostly pictured it like this: around the moment that happens, there's going to be a single superintelligence itself, or a single company, or a single person, or a single country that gains a really decisive strategic advantage, and potentially just ends up making all of these decisions for everyone forever, for better or worse.
I guess it’s hard to imagine that if you have one group that has a decisive strategic advantage, basically has a monopoly on power indefinitely, that they’re likely to choose to maintain a very pluralistic, liberal, deliberative decision-making process. I guess because the track record of that happening is fairly bad. And I suppose that process would exist purely at their pleasure, because they could shut it down at any point in time, so it feels kind of a tenuous or fragile situation.
But more recently, over the last few years, we've been moving towards a situation where it seems like there are multiple companies virtually at parity in terms of the capabilities of their AIs — no one is really pulling ahead at all; kind of the opposite. So there's been a flourishing of interest in this question: what if, as we go through superintelligence, in fact there are multiple different superintelligences that are different but virtually equally matched? No one gains any decisive strategic advantage, and in fact the world remains shockingly competitive, or different actors all have a significant stake in things for a long time to come.
Do you think that people have been wrong in the past, or have they underestimated the likelihood that we would have this kind of polytheistic, highly competitive scenario around the time of superintelligence?
Will MacAskill: I do think there’s a shift, which is that if you look back 10 years or longer, more people at least had the thought that the leap from subhuman to superintelligence would occur in this very short period of time. So Nick Bostrom has this idea — I think Tim Urban repeats it — of just sailing past Humanville Station. And similarly, in the discussion about foom, there was this idea that maybe you just go from way subhuman seed AI to superintelligence over the course of weeks, days — even words like “hours” and “minutes” got thrown around.
But the idea that maybe this happens over the course of days or weeks was quite common, and also that it would happen in a world where people weren't really expecting it. And if so, then intense concentration of power seems quite natural to follow from that. Whereas now it's still quite unclear how quick the transition will be from AI that can meaningfully accelerate AI R&D to godlike superintelligence, but it seems much more likely that people will be seeing this coming.
Rob Wiblin: Many people are seeing it coming now.
Will MacAskill: Exactly. And that really matters, because people can take action to ensure that another party doesn’t have way more power than them. You see this at a small scale with, say, Nvidia limiting the amount of chips it will sell to any one company in order to have a competitive ecosystem. But on a larger scale, you can imagine states getting involved, because they don’t want to see another country have far more power than them.
And then second is just the speed at which you go from any given level of capability to superintelligence. It’s already kind of clear that that idea of just zooming past Humanville Station was quite incorrect, because we’ve now for quite a while had AI that is human level in many ways.
And then the latest analysis from Tom Davidson, my colleagues, and others, looking at this period of AI automating AI R&D, still puts significant weight on this massive leap forward — 10%, 20% — but their best-guess estimate is maybe more like you get five years of progress happening in one. Which is still a very big leap, and it's a leap at the scary point in time, but it's much less of a leap than the move from subhuman to superhuman, godlike superintelligence over the course of weeks.
Rob Wiblin: I guess it’s not clear that even if a nefarious actor had that, and nobody else did, that that would necessarily allow them to overpower everyone else.
Will MacAskill: Yeah, for example.
Rob Wiblin: The increasing probability of a more competitive superintelligence arrival, is that a good development in your mind? Or like a neutral one or just very unclear?
Will MacAskill: It’s tied in with the rate of AI development and the heavy reliance on enormous amounts of computing power, which are good things from my point of view. The fact that it’s not this extremely rapid takeoff.
Rob Wiblin: Because it means that things are not so anarchic, or at least you have only a few different actors. So it’s a good balance.
Will MacAskill: Well, it means on the loss of control side of things, things still go very quickly, but relative to those extreme takeoff scenarios, you’ve got more opportunity for learning by trial and error. Let’s say you got AGI+, you can learn from AGI. And from AGI+ you can learn about how to align AGI++ and so on. There’s a little more time at least for just human institutions to react. So governments could kind of perhaps at least realise what’s happening and put in better regulation, for example.
So those things seem good. And then the fact that you don’t as inexorably end up with ultra-intense concentration of power seems very good to me too.
How selfish agents could fund the common good [01:19:13]
Rob Wiblin: Let’s push on and talk about I think the most original and interesting of the different trade and coordination proposals you had, or that Forethought has put out. I think this is mostly Tom Davidson’s origination?
Will MacAskill: Yes, Tom had the original idea, and a paper on it will come out shortly, coauthored by Tom, Mia, and myself.
Rob Wiblin: So the idea here is that we could maybe go from having many different agents who each have some resources, who each care a very tiny amount about doing the right thing — about creating good, understood impartially — but nonetheless they could all end up agreeing voluntarily to spend almost all of their resources producing that thing that they only care very little about relative to their selfish interest. How would we accomplish that alchemy?
Will MacAskill: Consider this scenario: just look at the people who value things linearly, and suppose there are lots of such people. They value two things. First, they all value simulations of themselves — you could replace that with other things, statues of themselves or whatever; the point is that each person values copies of themselves but doesn't value copies of other people. But then, secondly, they all care a little bit about some kind of maybe ethically valuable good. Call it "consensium" or something.
So if they’re just making a decision themselves, they’ll just do all copies for themselves because they only care a little bit about this other thing. However, suppose there’s a very large number of such people. They could all come together and say, “We could agree that none of us will spend money on ourselves and instead we’ll all fund this good just a little bit.” And let’s say there’s a million such people: if I’m one of the people, then I say, “OK, I’m reducing my own consumption by $1, but I’m increasing the amount spent on this consensium, this consensus good, by a million dollars. That’s amazing! So actually I would agree to some policy that we all pool our money and donate and fund this consensus good.”
So in a less futuristic setting this could be: maybe individual people want to spend money on themselves, and prefer doing so to spending to benefit the poor. But if there’s a law that says we’ll tax you a little bit more and more money will go to the poor, then they think, “That’s actually pretty good, because I lose out $1,000 or something, but $1,000 times everyone in society would go to fund the poor.”
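[Here is a toy version of that arithmetic, with made-up numbers (a million agents, each caring only a tiny amount about the consensus good), showing why the agreement looks good to each individual and also why defecting stays tempting.]

```python
# Toy model of the pooling argument above (illustrative numbers only).
# Each of N agents values their own consumption at 1 "util" per dollar,
# and the consensus good at only epsilon utils per dollar -- but under the
# agreement, everyone's dollars flow to the consensus good, not just mine.

N = 1_000_000       # number of agents in the agreement
epsilon = 0.001     # how much each agent cares about the consensus good
contribution = 1.0  # dollars each agent gives up under the agreement

# Acting alone: spend the dollar on yourself.
value_alone = 1.0 * contribution

# Under the agreement: I lose $1 of consumption, but $1 from each of N agents
# goes to the good I care epsilon about.
value_under_agreement = epsilon * contribution * N

print(value_alone)             # 1.0
print(value_under_agreement)   # 1000.0 -> each agent prefers the agreement

# The free-rider temptation: if everyone else is already contributing, quietly
# defecting gets me my $1 of consumption back while (from my point of view)
# shrinking the pooled good by only epsilon -- which is why some enforcement
# mechanism is usually needed.
```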
Rob Wiblin: OK, so the basic idea here is that if each of these people were just spending their own resources individually deciding how to spend it, they would spend it all on some selfish thing that only they care about, no one else really cares about. But they would, despite that, voluntarily vote for a political party that would impose extremely high taxes on everyone and then spend it on some other thing that they only value a tiny amount — but the amount that you’d be able to produce of it is extraordinary, because you’ll be able to pool everyone’s resources and basically spend most of society’s resources making it.
I guess this phenomenon exists today. What are some examples that people can picture?
Will MacAskill: We can call the concept a "moral public good" — where public goods in general are things that won't get funded enough by the decisions of individuals. So I benefit from streetlights, but the issue is that if other people are funding streetlights, then I still get the benefit without paying; and if I fund them, then there's all this benefit that I'm not capturing. Nonetheless, I will vote to have a government or a city council tax me in order to put streetlights on the roads, because the benefit I get from streetlights is larger than the tiny cost to me personally to pay for that.
Rob Wiblin: Your small fraction of the total cost.
Will MacAskill: The case of a moral public good is where it’s not that I’m personally benefiting from the thing that is being funded, but I care about it for moral reasons. The most obvious case would be poverty relief or even welfare payments. Many people don’t like poverty, they want people to be better off, but they don’t care very strongly about it. They care a little bit about it, and they would be willing to contribute to poverty relief or welfare payments, but only if everyone else in society is also doing so.
Rob Wiblin: So the core issue that you always have here is the free-rider problem: that if you try to just get people to all come together and sign some agreement, some contract to do this, at the last minute it’s tempting for any one individual to drop out and hope that everyone else signs it and goes ahead and spends their money on it. They can both get to appreciate the work that all of these other people have done, but keep their money for themselves.
So in the current world, this only really works if you have some leviathan sort of government that can basically compel people to contribute, even if they claim at the last minute that they would rather not contribute, or that maybe they will lie and say that they don’t value the moral public good even though they really do.
Do you think that will have to remain the case? Would this only work in this long-term future if we similarly have some government or some powerful entity that can compel contributions to the moral public good?
Will MacAskill: It’s unclear to me. So you might think this is just a coordination problem. Advanced AI, superintelligence is going to solve all these coordination problems, because there’s this thing that’s better for everyone.
From the analysis we’ve done, that Mia Taylor really led, it’s really quite unclear actually that AI is able to help you with this problem — because you’ve still got the fundamental problem: everyone’s coordinated, so we’re all going do this moral public good, and then like, “I back out now and now I can spend my resources on myself. That’s better from my perspective.”
There’s in fact something that’s even worse that could happen, which is if I know there’s going to be this deliberation and attempted coordination, I can self-modify, so instead I’ll just not care about the good.
Rob Wiblin: You’ll excise that part of your preferences.
Will MacAskill: Exactly. So if I care not at all about this consensus good, then I have no reason to join in this coordination mechanism. And in fact, they would have to use non-voluntary means to get me to do it. So if that’s true, then that will also apply to everyone else as well: you could have this perverse outcome that everyone has self-modified away from caring about this consensus good.
So it certainly seems to provide a reason for having a leviathan, for having something that can create certain kinds of binding laws or rules perhaps that everyone votes on.
Rob Wiblin: OK, so one path to provision of moral public goods is that you have a leviathan or as-yet-magical coordination mechanisms for having people agree and not opt out. Stuff that we haven’t managed to come up with.
But there is another galaxy-brained way that we could potentially try to get there, or that we just might naturally get there. Do you want to have a go at explaining this? This is maybe the most difficult thing that we’re going to talk about today.
Will MacAskill: Sure. So this depends on what decision theory people in the future have.
Rob Wiblin: As so many things do.
Will MacAskill: As so many things do. It’s big. It’s big. So we’ve been talking about coordination. That’s just causal coordination, which is kind of what we’re familiar with: you know, cases where it’s like we form a contract and I get punished if I don’t abide by the contract.
However, suppose that people in the future have some non-causal decision theory, like evidential decision theory or functional decision theory or some further variant.
And now let’s say I’m making a decision about how to spend resources. And let’s also suppose that it turns out — as I think is quite likely, as our current best guess — that we live in a very large universe, in the sense that far away in the universe, or perhaps even branches of the multiverse, there are beings who are highly correlated with me, such that if I make some decision about how to spend my funds, it’s very likely that they do so too.
The clearest case would be if, in some distant galaxy far beyond the observable universe, it just so happened that there's an Earth that produced human life that's genetically identical to us, and there's a carbon copy of me in that world. Then it seems very plausible that I should think that if I decide to fund a certain good or a different good, then this carbon copy of me will do the same. But then it also seems plausible that that would be true if it's not a perfect carbon copy, but just someone kind of similar.
And on the evidential or non-causal decision theory, that is a really big deal, in fact. Because I care not merely about the kind of causal effect of my actions, but I also care about the fact that I get the update that this person who’s correlated with me far away in space and time will also act in that way. And so in fact, the choice in front of me is not, “Do I fund, let’s say, the copy of myself, the self-interested good, or do I fund the consensus good?” It’s, “Do I fund the self-interested good, and all of these nearby copies of me fund goods that benefit them?”
Or perhaps I can think about what’s this good that I like and they all like too. So if I fund that, I also get the evidence that they fund that too, so we don’t need to go via this kind of causal cooperation and so on.
And also, plausibly, if we really do live in a very large universe, then it’s a very large number of beings that I’m correlated with. So the decision would be: “I fund this thing just for myself” or “I fund the consensus good and billions, trillions, trillions of trillions of people fund the consensus good too.” So that might give this extraordinarily strong argument for me to fund the consensus good, and that would work even with no leviathan, even if I’m the only person in my little part of the universe.
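[A minimal sketch of that calculus with made-up numbers; the correlation count N and the weight epsilon are purely illustrative, and nothing here is from the forthcoming paper. It compares the evidential value of funding the selfish good versus the consensus good.]

```python
# Toy comparison (illustrative numbers only) of the evidentialist calculus.
# I value my own selfish good at 1 util per dollar, and the consensus good at
# only epsilon utils per dollar. Under a non-causal decision theory, choosing
# to fund the consensus good is evidence that all N correlated agents
# elsewhere in the universe do likewise -- and I value their spending on it too.

epsilon = 1e-9   # tiny weight I place on the consensus good
N = 10**15       # correlated agents (could be vastly larger in a big universe)
budget = 1.0     # dollars I control

# Fund the selfish good: I gain 1 util per dollar; the correlated agents fund
# goods that benefit only themselves, which I don't value.
value_selfish = 1.0 * budget

# Fund the consensus good: I gain only epsilon utils per dollar directly, but
# I also learn that N correlated agents fund it as well.
value_consensus = epsilon * budget * N

print(value_selfish)     # 1.0
print(value_consensus)   # 1000000.0 -> dominates once N exceeds 1/epsilon

# So on this decision theory, funding the consensus good wins whenever the
# number of correlated agents exceeds one over how little you care about it --
# with no leviathan or enforcement mechanism needed at all.
```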
Rob Wiblin: OK, so if you’re hearing this idea for the first time, then this might come across as a little bit peculiar. I think that the preparatory episode, if you wanted to get back to it, that would best explain what we’re talking about here is my interview with Joe Carlsmith, which is episode #152 on navigating serious philosophical confusion.
What would you say to people who are not bought into the premise that there’s an enormous number of other beings out there who are having extremely similar thoughts, whose decision-making procedure about this kind of choice is highly correlated with us, such that if I make a particular choice I gain evidence that lots and lots of other beings or other civilisations opted to do the same thing?
Will MacAskill: If that’s where you get off, I do think there are pretty good arguments. So on leading cosmological views, on what is the standard assumption about the nature of the universe, there is an infinite amount of stuff. So we’ve got the observable universe, the accessible universe, like what we can ever interact with: that is finite. It’s very big, but finite. But the standard assumption entails that in fact it goes on forever. And that would mean that there’s an infinite number of beings that are very close to me.
Rob Wiblin: As long as it’s sufficiently variable, right?
Will MacAskill: Yeah, exactly. And even if it's finite, the best guesses about how big the universe is are that it's really very large. So that's one way in which you could have lots of people that you're very closely correlated with.
Rob Wiblin: So there’s lots of agents. Do you think it is likely that, regardless of which civilisation it is out there — where they are, their evolutionary background — that they would end up having this kind of conversation, like strike on the same idea and basically have to be like, “Oh man, should I fund the moral public good for evidential decision theory?” They have their own word for evidential decision theory. Do you think that’s probable?
Will MacAskill: I mean, I hadn’t thought about it, but yeah my guess is that… There’s two things. One: it wouldn’t even need to be probable if you’ve got enough copies.
Rob Wiblin: Good point.
Will MacAskill: But I think it probably would be probable. Like it’s quite a natural a priori thing. It’s in the structure of preferences and how preferences work. So it would seem to me reasonably likely.
Rob Wiblin: Yeah. So it’d be surprising if they became space faring but didn’t manage to have these ideas, given that they’ve jumped out at us at this relatively early stage of development.
I think it is worth noting this is a massive hammer to bring to this problem of trying to motivate people. Because if you believe that there are enormous numbers, maybe infinite numbers of beings out there somewhere — like in space and time, across the multiverse, like elsewhere in this universe — whose decisions are sharply correlated with our own, because they’re basically making the same philosophical decision about what decision theory to use…
I guess they also have to make a decision about what this consensus moral good is. Maybe that’s a little bit more tenuous, that everyone would kind of converge on caring about similar stuff?
Will MacAskill: Well, different beings could care about all sorts of different stuff. So let's say there are a trillion trillion beings that I'm closely correlated with. Then I'm just looking through all of the things that they care about in order to find the thing that is most of a consensus — where the balance of how closely correlated I am with them, how many of them value the thing, and how strongly they value it works out such that it's what I should fund.
I mean, it’s interesting to think about what that would be. A worry I have about all of this is that we would end up funding things that I think at least are only instrumentally valuable. So let’s say that’s just happiness: positive conscious experiences are what in fact are good. There’s certain things that are instrumentally useful for actually producing any sort of society at all — like knowledge, larger population growth, survival. I should expect basically all civilisations to value those things, maybe just instrumentally.
Rob Wiblin: But sometimes they might get confused by our lights between things that are useful as a means to an end and things that are terminally useful.
Will MacAskill: Exactly. It’s a very natural thing if something is very instrumentally valuable, people end up caring for it for its own sake. In fact, lots of philosophers care about knowledge and survival and achievement, and think such things are intrinsically valuable. If so, then that might be what is the consensus across all of these very different civilisations. And then at least given my best guess about what is actually important at the moment, that’s a terrible shame. We all end up funding something that is not of terminal value.
Rob Wiblin: I guess you could at least say it’s not terribly bad either.
Will MacAskill: No.
Rob Wiblin: So when I read this proposal, I was like, “Holy shit!” This argument could be incredibly potent. It could actually drive almost any agent that is able to understand this. Maybe it would just be superseded by future philosophical insights we would have. It’s a bit surprising to think that this is the end of the road here. But it could be a very powerful hammer to really motivate an enormous amount of resources to be spent on something that otherwise, absent this, we would never have spent it on. Do you think that’s plausibly right?
Will MacAskill: Yeah, yeah. This is why, when Tom expresses this idea to me, I'm like, "Oh my god." Because it potentially supports this, you know, Pollyannish, naive, optimistic view that if only there's enough time for people to reflect and think far enough in advance, everyone will just converge on the good and produce the good. This is a mechanism for that which I hadn't thought about before. And like I say, I think there's an awful lot of asterisks.
Rob Wiblin: It’s great. But I almost want to stop thinking, because I really don’t want the sign to flip based on further considerations that might come up. Because whenever you’re close to something really good, I feel like also just one bit of information away or some other consideration that could make it terrible.
Will MacAskill: Yeah. Even if I couldn't see any flaws with the argument — and I think there are seriously controversial aspects of it — I still wouldn't want to place too much weight on it, because any argument that's saying that people in the future will have such-and-such decision theory and such-and-such beliefs about the cosmos, and then they'll engage in such-and-such argument that me and my friends thought up at the pub a couple of months ago, I'm like, no. I want to act on the basis of considerations much more robust than that.
So it definitely makes me more optimistic about the future, but I don’t want to have this kind of Pollyannish view about the future on the basis of such controversial premises. And I wouldn’t want to do that even if I couldn’t see the problems in the argument. And in fact, I think there are controversial aspects.
Rob Wiblin: OK, yeah. We’ll push on from this. There’s an article coming out about this soon for people who would like to read more. It’ll be on forethought.org.
Will MacAskill: Yeah, it’ll be on forethought.org. It may in fact have come out by the time this podcast episode comes out.
Why not push for pausing AI development? [01:38:39]
Rob Wiblin: OK, let’s push on to the miscellaneous section of the interview. We’re going to talk about a grab bag of other topics.
I asked the audience what questions they’d most like me to put to you, and the most upvoted one was a question about Pause AI. Like, we’re trying to make AI go better. It seems like there’s some chance that things could go catastrophically off the rails on the track that we’re on. We are barreling forward pretty much towards artificial superintelligence, seemingly almost as quickly as we technically can, throwing trillions of dollars at it.
Isn’t the common sense thing, given that we might all die or things could go horribly wrong, that we should slow down, maybe even stop temporarily, catch our breath, do a bunch of stuff to try to set ourselves on a safer course before we resume? That’s a very common sense, natural view.
But you aren’t pushing for that, and I’m not exclusively pushing for that, though I’m sympathetic to some versions of it. Why not make this your main project?
Will MacAskill: Yeah, it’s a great question. Let’s distinguish between a few different sorts of pause.
First let’s talk about pause at human level — that’s a phrase from Ryan Greenblatt. So that’s when we’re at the point of time of AI engaging in AI R&D, at this point of time when things perhaps go even faster, should we at that point be trying to slow things down, even pause, stop and start, and so on?
And then I’m like, yes, definitely. This is both the dangerous period and the fastest period. Or at least it’s potentially both of those things at once.
And why is that the crucial period? Well, as well as it being disorientingly fast and the period when early AI takeover could happen, it's also got these benefits: we can benefit from AI assistance up to that point. We can also benefit from the fact that AI has had more of an impact in the world by then — so there's a greater chance of inoculation having happened, of other actors having woken up to how big a deal it is, and so I think a greater chance of regulation and so on happening, if only there were time, in that period. It's also just when you have the AI systems that are the generation before the systems that are most dangerous, so you can get the most information by studying them and doing alignment research on them.
So pausing and slowing down at that point, I'm quite keen on. I have this one post on the idea of having a kind of red line for the intelligence explosion, where you have some sort of operationalisation of it. Maybe you also have this panel that's like Geoff Hinton and Yoshua Bengio and other kind of luminaries, perhaps with some sceptics in there too, and that turns this gradual process into a kind of binary.
And the thing that I’ve been kind of keen on is there being this international convention, essentially, which is like, “OK, the intelligence explosion has begun and we’re all going to come together and figure out what’s going to happen over the course of the coming year or years.”
So I’m in favour of slowing down the intelligence explosion. What does that mean for pausing now, which I think is really quite different? Again, distinguish a couple of different sorts of pause. One is a pause on capabilities and another is pausing in terms of compute.
The pauses I’ve seen advocated are pauses on capabilities, like no new training runs. And honestly, I think that would have actively harmful effects on the things that we care about, even just from a safety perspective. Because at the moment there’s a small number of actors at the frontier, and my personal view is that they’re actually surprisingly sensible. My prior is low, my expectation is low for how companies behave — you can look at the history of how Exxon dealt with the problem of climate change and so on, where they just buried it and fed misinformation instead. But there’s both a small number of actors who are alive to investing at least some in the problem of AI safety.
With a pause on capabilities, now all of the laggards start coming up to the frontier too. So that's China, that's Meta, xAI. So we've now got many more actors, including the ones who are, I think, less scrupulous. And also, if the pause is on training, you can still stockpile compute, you can still build more fabs and so on.
That starts putting us in this really quite precarious situation, where if one actor breaks the pause, then suddenly things can go much faster than they were before. And in particular, the speed and size of the intelligence explosion you get depends on how much compute you have at the time. So that actually means that, other things being equal, I want more algorithmic progress faster, because I want us to get to —
Rob Wiblin: Because it slows things down later, because you’ve plucked the low-hanging fruit on the algorithms?
Will MacAskill: Well, it means that you've got AI automating AI R&D with a smaller total compute stockpile. And that means, when you do all of the modelling and so on, you get a slower intelligence explosion that plateaus at a lower level. And again, that's the scary bit. That's where all the risk is and that's where things are going too fast.
There is this different proposal you could have, which is: don’t do it by training but just slow the amount of compute that we have. That I think has more promise. Though there are still other similar worries, where it’s like we don’t produce as many chips, but there are lots of fabs and power stations and so on, everything kind of ready to go. And again, you’d also get the [laggard] catchup concern.
But then the final point is just there’s various things we could be advocating for. From my point of view, there’s just loads of incredibly low-hanging fruit for making the situation quite a lot safer. So we’ve talked about AI character, we’ve talked about risk aversion and deals with AIs. We haven’t talked about things like mechanistic interpretability or safety research or just really quite basic government regulation.
So the US government could say, "If you're a frontier company developing AI, you have to have an AI constitution that says what the AI is meant to do, and you have to give us very high-quality evidence that the model is in fact obeying that constitution and does not have some ulterior goal that could have been put in by internal sabotage or a foreign actor like China, or that has developed organically." That would be a really big win in terms of reducing risk.
And all of these things do not impose massive costs on the world, and I think are just much, much more likely to happen than the idea of some international pause. So the bang for buck of what to advocate for, I actually think the pause stuff I’ve seen seems counterproductive to me. But even if I was like, in the ideal world this would happen or something, I’m like, man, there’s just so much other stuff that’s just super low-hanging fruit, super high bang for buck that we could be pushing for.
Rob Wiblin: There’s obviously a really complex thicket of considerations here about exact timing, exact message, exactly how voluntary and so on. I think it is worth having some people trying to put in place the infrastructure to pull the cord at a future time. It is a bit frustrating that I think that there’s no conversation between the US and China along the lines of, if neither of us is sure how dangerous this is — it could be really safe, it could be really dangerous — if we get just damning information, if we get some damning revelation about the nature of these AI systems and how dangerous they are, we want to be able to quickly coordinate to not trip the wire that we have just realised is there.
But there’s nothing like that. I think that there is a bunch of preparatory work that could be done for pausing at the appropriate time if we get the right evidence.
Will MacAskill: Yeah, I totally agree on that. And having compute tracking so we just know how much compute there is. Having a plan where if the US and China are just like, “Yeah, this is just too much,” they agree, they bring their chips to Switzerland and mutually destroy them, or at least a certain number of them.
Rob Wiblin: But I was thinking that the more modest thing is just saying, “We both agree evidence has come out that the next training run could be mega dangerous. We really don’t want the other one to go ahead and do it, so we need to have some monitoring arrangement that we can very quickly put in place so that we can both feel good that neither side is going to rush ahead.” Isn’t that an even easier ask, really?
Will MacAskill: Oh yeah. I guess I was maybe thinking that might be harder. Stuff involving compute governance is just much easier to monitor and verify than, “Are you doing a training run on existing compute? And we don’t even know how much compute you have” and so on — because it would involve maybe some on-chip mechanism for whether the chip is being used for training or inference.
Rob Wiblin: OK. We could talk about pause questions and the details of that for some time, but I think we should set that aside for another episode maybe.
Effective altruism is making a comeback post-SBF [01:48:18]
Rob Wiblin: You helped found effective altruism many, many years ago. It’s been kind of the motivating philosophy for 80,000 Hours since we started in 2011, more or less.
I guess it’s been a tough few years for EA. The main reason being that Sam Bankman-Fried, who was mega associated with effective altruism, went and committed some massive crimes — I think at least partially in pursuit of altruistic goals, probably mixed motivations, but I think wanting to make money in order to do good was one of the factors.
A lot of people have been inclined to lose interest, I suppose, in EA, or to be either disillusioned with it or think that it’s a bit hopeless because the brand has been so damaged by that event. How do you think EA has been tracking over the last couple of years? Is it stagnating or recovering a bit or in decline?
Will MacAskill: So we should distinguish between the online vibes and discussion, the brand, and then what has in fact been happening. It was obviously this huge hit, and at the time it felt like maybe this was the death blow. I think the overall story is that obviously things are much quieter, relatively quieter, less flashy online and so on. And obviously fewer people are like, "EA identity: this is my brand" — in a way that I kind of think is good and healthy.
Rob Wiblin: Maybe would have been good anyway.
Will MacAskill: Would have been good anyway, personally. But then in terms of how the ideas are doing in practice, how that impact is going over time, I think the overall story is: there was this big hit for a few years, and now it's just back to really quite strong growth.
For a few different metrics on this, one is just the broader effective giving movement — just moving money to more effective charities. How has that been growing over time? Pretty steadily, actually: even through this period of crisis and drama and so on, growing at about 10% per year. Over the last year it's actually accelerating. The numbers aren't yet in, but it looks like the total money moved to effective charities has grown by like 40% or 50% — so from about $1.2 billion or $1.3 billion to probably more like $1.8 billion.
Obviously a big part of that is Coefficient Giving and a big part is GiveWell, and there’s also Founders Pledge. But you’ve got the same dynamic across many different national effective giving organisations, and then also new foundations being set up on effective giving principles as well. So that’s really seemed quite striking.
And then I think the same dynamic applies for other areas too, like Giving What We Can pledges as well. Absolutely the growth in that took a big hit — where you have 1,600 new pledges in 2022 and then only 600 in 2023. But again, now it’s just back to quite promising rates of growth: 20%, 30% year-on-year growth. Giving What We Can now has got more money moved annually than any year in the past.
And then similarly with effective altruism itself, as a kind of community and movement, on Centre for Effective Altruism’s main metrics, again, it looks like 20% year-on-year growth. So it’s this thing of just this steady increase.
Rob Wiblin: It’s a huge boom and a huge bust, and then it’s like come maybe back to where you might have projected many years ago?
Will MacAskill: Yeah, maybe. I think if you’d gone to 2015 and just said this is what 2025 was like, I’d be like, OK, cool. It’s just like this steady growth that just had this crazy period in the middle.
Rob Wiblin: So I think in a couple of months’ time, you’ve got the 10th anniversary edition of Doing Good Better coming out, right? And I guess you’re going to do a bunch of interviews based on it?
Will MacAskill: Yeah. It’s making me feel very old. So it’s been now 10 years since Doing Good Better was published, and obviously just a lot has changed in the world. It was being used as materials in lots of student courses, so I was getting some professors asking me, like, “Please can you update this? Because it’s hard when statistics are out of date.”
So there’s this wholly updated version. The content is all basically the same; it’s mainly just facts and figures are updated. And then there’s a new preface that is discussing a little bit of how my thinking on effective altruism has evolved over time. And yeah, I’m using this as an opportunity to go on a few more podcasts and so on, talk about effective altruism and the core ideas a little bit more.
Rob Wiblin: How are you expecting it to be received? I guess you expect to be hit with lots of questions about SBF?
Will MacAskill: I mean, it’s a revised edition. It’s not going to be this big, mega splash. And yeah, I expect there to be a mix. Like a lot of people, that’s the story they want to talk about. A lot of people are just genuinely interested in the ideas and the kind of philosophy behind effective giving or effective career choice.
Rob Wiblin: Yeah, I guess I feel like it’s appropriate that EA took a reputational hit — that it really did reveal something problematic, or it made me think that something that I knew was problematic about it was actually a much more serious issue than what I had thought. There’d always been the worry that it would be maybe easy to appropriate EA ideas to justify rule breaking and misbehaviour or possibly even crimes, but I had thought that the rate of that would be quite low.
I guess the fact that we had such a spectacular instance of that relatively quickly made me think that actually maybe the appetite among human beings to grab a philosophy that can justify doing bad things in pursuit of power might be greater than what I had thought. And I hope that we’ve installed more safeguards, or maybe the reaction to that event is sufficiently strong that we’re unlikely to get the same sort of thing recurring again. Do you have any thoughts on that?
Will MacAskill: I mean, there’s definitely very open questions to me in terms of what was in the minds of various people at FTX [Foundation]. I really spent much longer on this topic than perhaps I would have enjoyed. But even though I really had the worry that it was some careful consequentialist plot, that I think just really isn’t borne out by a careful study of it. It doesn’t make nearly enough sense, among other reasons.
But then the thing that’s definitely true is that EA has evolved a lot, in that I think it being less of an intense identity is a big part of that. I think people are extremely on guard for fears about rule breaking and sort of naive maximising in a way that I think would have been healthy anyway.
Rob Wiblin: Maybe it would have been good to have that earlier.
Will MacAskill: I mean, I think EA always had this in a way. In fact, actually it was emphasised a lot, and I’m glad it’s being doubled down on.
EA in the age of AGI [01:56:15]
Rob Wiblin: In terms of the future, you wrote this post a couple of months ago that was super well received, called “Effective altruism in the age of AGI,” discussing what you think is the comparative advantage of the EA mindset in the coming years. What was the case you were making?
Will MacAskill: Yeah, the key thing is just there’s a certain sort of vibe, which is that two things have happened.
One is we’ve entered what I’m calling “the age of AGI,” from GPT-4 onwards, where we now have AI systems that are reasoning in impressive human-like ways — or sometimes human-like, sometimes not, but they’re actually able to do tasks that are just clearly on the path to AI that can automate AI R&D. And that’s a really big deal, and it’s happening sooner than most people thought.
So there’s this huge rise in attention on AI, at the same time of these major hits to EA as a movement. So you might have this view that we should just let go of EA as a project, think of that as like a legacy project, because instead what we should just be focusing on is AI safety.
And the drum that I’ve been banging for many years, but the last couple of years in particular, is like: AI poses many threats, many risks. There’s many things we need to get right, and not just about alignment, though that is very important.
And when we look at these other challenges, what sort of person do I want working on them? I want people who are very kinda nerdy. I want people who are careful and thoughtful and have a scout mindset and are very ethically concerned — and are not merely coming in with some partisan ideology, but are also willing to think about really very weird and kind of dizzying things. And that is exactly what is being provided by effective altruism as a set of ideas. And my main case for this was about all the stuff that is not just alignment.
Some of the pushback I got on a draft of it was that no, actually, this is really important for alignment and safety too, because within alignment and safety there's all sorts of things you could work on. You could be working just on reinforcement learning from human feedback or other stuff that's just related to today's models — but taking the alignment problem really seriously means taking seriously the hard problem, which is how you align superintelligence. That's something which may in fact have perfect situational awareness of any tests that you're trying to do; which can do what would be the equivalent of millions of years of reasoning, or in the extreme, millions of years of reasoning in one forward pass; or which is continually learning over time, reflecting on its whole values.
These are the hard challenges, and that is a weird world to think about, and it’s something that doesn’t really come naturally. Whereas some of the alignment and safety researchers I’ve talked to have said no, it’s actually people who are really thinking about this kind of big-picture perspective that are adding much more value than people who are treating AI safety as their job and they’re not thinking about the big picture as much.
Rob Wiblin: It’s interesting that it feels like the thing that’s doing the work there is: generic scope sensitivity is one factor, and then there’s also a particular appetite for weirdness, which is being willing to seriously toy with very strange ideas — I guess some of the things we were talking about earlier today are in this category — without going off the deep end and becoming absolutely besotted with your pet theories. It’s I guess a fragile middle ground, which I think is relatively uncommon, and for that reason is quite valuable — because there’s neglected stuff that only people in that window are going to be excited about.
Will MacAskill: Yeah, there is this thought that it's just really hard to be well calibrated and try to believe true things, even when they're appropriately weird, but not fall into a kind of contrarianism that maybe will get you a good following on social media and people thinking you're interesting. If you're just really earnestly trying to do good, that's something that constrains you — because you will do more good if you have accurate beliefs, and at its best, at least, that can lead you to the right middle ground, where you believe or entertain weird ideas when it is appropriate to do so, and reject them when it's appropriate to do so.
Rob Wiblin: People can go and read that blog post if they want to get the full argument. But what were some of the particular things that you thought people with an EA style of thinking, an EA flavour, should particularly disproportionately be going into?
Will MacAskill: I would say just the range of things that we're focused on. There's one that's just very obvious in particular, which is AI rights, AI wellbeing. Some of the stuff we said about cooperating with AIs as well, that's just a very unusual set of things to be thinking about. I don't think it will stay unusual: in fact, I think these will become really quite mainstream concerns in five years' time. But it is exactly the sort of thing where I think it takes both a willingness to entertain weird ideas without contrarianism, and at the same time actually a deep concern for not really messing up, ethically speaking.
I would say stuff on AI character as well. Here we want lots of different voices and lots of different people playing into this. But there is a big aspect of: the people who have in fact been in charge of AI character at most of the companies have been dealing with it in a reactive way, not even looking ahead a couple of years. Maybe the AI characters have only just now caught up to the capabilities the AIs have. But how much thought has really gone into AI character in multi-agent dynamics over long time periods? Really very little.
And so, for whatever reason, I think people with a kind of EA mentality have just been good at going into weird, poorly scoped areas and then helping figure out actually what’s most important for us to focus on and whatnot.
Rob Wiblin: I imagine someone who wanted to push back on the “EA in the age of AGI” argument might say that EA has taken a massive brand hit. It has a bunch of negative historical associations because of SBF and FTX.
And it also brings up a whole bunch of other philosophical baggage that people may or may not be that interested in. It’s associated with the Shrimp Welfare Project, among other things, which I really like, but many people might be interested in your AGI-related project, but look askance at the Shrimp Welfare Project. So why tie yourself to a bunch of other weird work that you may or may not personally like at all, by branding yourself or branding the project as an effective altruist style project?
In particular, inasmuch as you have more mainstream motivations, or a mix of motivations, and aren’t exclusively motivated by particularly unusual EA moral philosophy; you also just want to make the world better in a general way, or you want to ensure that we don’t all die and that the world is better for your own children. Why would you make EA a big feature of it, if you could just say, “I want to make the world better” in a common-sense way, and that would be sufficient to justify what I’m doing anyway?
Will MacAskill: I think a big thing is that I am not making a pitch or an argument about the brand at all. The words “EA”: I have no particular attachment to them, and no particular attachment to how people describe themselves. In fact, it’s always been the case that the best outcome is one where that idea just feels quaint.
Rob Wiblin: EA withers away.
Will MacAskill: I mean, I don’t describe myself as a suffragette because I believe that women should have the vote. That is an obsolete term. So similarly, people can describe themselves however they want. The key thing is: what’s the mindset people are operating on? Is it a scout mindset? Is it scope sensitive? Is it appropriately responsive to how unusual a point in time we’re in and how high the moral stakes are?
Viatopia: an alternative to utopia [02:05:08]
Rob Wiblin: You recently put forward a vision for the near-term future that you called “viatopia.” What is viatopia and what’s the case for it?
Will MacAskill: The situation at the moment is that many of the biggest companies in the world are trying to build AI systems that surpass human ability across all cognitive domains. I think there are good arguments for thinking that this is one of if not the most momentous things to ever happen in human history — much more like the evolution of Homo sapiens or of life itself than even the Industrial Revolution or the invention of electricity or fire. It’s at that level of magnitude.
And yet, essentially no one has a well-formed positive vision for what a good society after the development of superintelligence looks like. And that’s this kind of striking and kind of worrying thing.
Rob Wiblin: Feels like a bit of an omission.
Will MacAskill: Yeah, it feels like a bit of an omission. And the concept of viatopia is at least trying to offer a bit of a framework for what an answer to that question could look like: what a good post-superintelligence society looks like.
So the concept of viatopia is that it’s a state of society that is on track to produce a near-best future, something that’s at least 90% as good as the best future we could have. It’s distinctive in that it’s not saying we should try and aim for some utopian society directly. It’s also not saying merely: look at all these bad things that exist in the world; we could solve this particular problem and this particular problem. What it’s saying instead is that we should try and figure out what a good way station looks like — that is, some state of society that can steer itself to something truly very good.
And so as an analogy to illustrate: imagine you’re an adventurer and you’re lost in the wilderness. There are a few different options you could take. You could try and take your best guess at what the right path is to get to your destination. Or you could try and just deal, on an ad hoc basis, with some issues you have at the moment, like maybe you’re running low on supplies. Or you could try and get yourself into a position where you know what’s most important to do next and where to go — for example, getting to higher ground so that you can survey the terrain and figure out actually where you’re aiming towards.
Viatopia is like that third path.
Rob Wiblin: And what would be the case for focusing on trying to get to viatopia now, rather than trying to directly create a good world immediately?
Will MacAskill: So utopianism has a pretty bad track record. Philosophers and writers have often tried to sketch visions of utopia, and normally it’s not long before they actually start looking quite dystopian. And the reason for that is that we just don’t know what an ideal future looks like. There’s a lot of moral progress we’d need to make before we could actually say, with confidence, “This is what an ideal future would look like.” So we need to do something else. Otherwise we’ll probably bake in some major moral errors of our own.
Rob Wiblin: Where does the name viatopia [come from]? So “via” means road or something in Latin, or “through”?
Will MacAskill: Yeah, we mean “by way of this place”: via topia.
Rob Wiblin: So this viatopia notion, you told me it’s been very popular, it’s been very well received. Do you worry that it’s a slightly vacuous notion? You’re saying we want to get to a really good future, so we need to get to some intermediate stage or intermediate position where we’re likely to get to that future. Is that a great insight, or is that just kind of a trivially obvious thing, and it’s not necessarily going to actually help us get there?
Will MacAskill: Yeah, good pushback. I think it’s not the most substantive thing, and it’s deliberately a framework concept: it’s for organising our thinking.
However, I think it’s not totally trivial. There is a history of debate over utopianism and related concepts. One of the leading ideas was utopianism itself: a very popular idea, and one responsible for some enormous atrocities through history.
And the pushback to that — from Karl Popper onwards, but still very popular now; Kevin Kelly, a futurist, has this idea of protopia — is the idea you just don’t have a positive vision of the future at all. Instead you’re doing something more like hill climbing: you’re looking at society now, what are the little things you can change that are clear problems, and then just trying to solve them one after another in this incremental way.
So viatopia is a different way of thinking about things, and I think it leads you towards substantively different recommendations than you might otherwise think — especially over the course of the transition from here to superintelligence.
So if you’ve got the utopian perspective, you might think what we need to do is just make the AI a classical utilitarian, or insert your other favourite moral view, and then just hand over to the AI that’s pursuing that vision of the good. Seems very bad from a viatopian perspective.
Or, to put it very roughly, from the protopian perspective you might just think there are these major issues, major problems in the world, like 100 million people dying every year, and AI will give us the ability to completely solve those problems. So actually we should get there as quickly as possible. And there will in fact be very real tradeoffs between how quickly we go and how much risk of existential catastrophe we bear over the course of this transition.
And aiming for viatopia might say that actually there’s certain things that are even more important — namely, not locking us into a really bad future — even if that means that we don’t get to some of the upsides in terms of near-term benefits quite as quickly as we might otherwise have done.
Rob Wiblin: So you’re saying protopia — this idea that we don’t want to have a grand vision, that that’s going to lead us astray; and instead we just want to get wins immediately, find ways to improve the world that we can understand and that we can see whether they’ve worked — that would potentially lead us to miss the bigger-picture risks, because we’re just grabbing immediate wins, like trying to improve health? Or it would recommend just charging forward on AI?
Will MacAskill: Or at the very least, it wouldn’t prioritise among them. Where it would say that maybe risk of loss of control to superintelligence or entrenchment of some authoritarian regime, that’s some risk. But there are these clear, apparent evils, such as death and poverty and so on, and we could solve them kind of right away.
Rob Wiblin: Although if you thought that the AI might kill everyone in the near term, that’s also a near-term problem. Although maybe it’s harder to evaluate because it’s more probabilistic?
Will MacAskill: Well, it’s harder to evaluate. And also protopianism at least wouldn’t give you the resources for saying one of these is much more important than the other.
Rob Wiblin: Yeah. Do you think of viatopia as a middle ground between utopianism and protopianism, or is it a different thing?
Will MacAskill: In a sense, it’s a middle ground, in that it is offering a positive vision for where we should be headed. However, it doesn’t have, in my view, the same pitfalls that utopianism has — because it’s compatible with many possible ultimate visions for what a good society looks like, and is not committing to this kind of narrow view of the good.
Rob Wiblin: So what would be the key traits that would lead you to say a state is a viatopian one? What would be the key properties that you’d be looking for, do you think?
Will MacAskill: So there’s the key questions and key properties. I want to emphasise the questions more than my particular answer at the moment — both because the questions themselves are more important, and because my views evolve a lot over time.
But that can include things like: How widely distributed is power? At one extreme, all power is concentrated in the hands of a single actor; at the other, it’s extremely distributed: global democracy, or perhaps even more distributed than that.
A second is: What sorts of people, what sorts of beings have power? Is it just members of a particular society? Is it just humans? Do AIs have influence over the future? What about future generations?
A third category is: When do major decisions happen? There are some arguments for thinking we need to make really big decisions really quite early. Or instead we should say that actually for the sorts of decisions that will really guide how the future goes, we want to punt them into the future as much as possible.
And then finally, there’s questions around: How should society as a whole be making decisions, and these most important decisions about how the future goes? That could be via democracy, via voting — if so, what sorts of voting systems — or could be via auctions and market mechanisms, and if so, what type?
Those are just some of the things we’ve got to grapple with, I think. And I have views on them, but they evolve.
Rob Wiblin: The analogy that most jumps to mind for me is a group of people starting a new country. They might not yet know exactly what the law should be or what the political system should be, but they might find they have an easier time agreeing on some process, like a constitutional convention sort of thing, where they come together, everyone gets some vote, we’ll use this kind of deliberative process and this kind of voting system, and at the end we’ll end up with some set of agreements about how things are going to run, and the chips will fall as they may. Is that a good analogy to have in mind?
Will MacAskill: Yeah, I think that’s a great analogy. And the US Constitutional Convention at the end of the 18th century is this remarkable event. If I remember correctly, it’s about 40 people in a room debating for three months what should the United States of America look like?
And what they agree on is this set of procedures. And obviously there’s ratifications and amendments after that. It’s interesting too, because there’s this balance between locking in certain ideas, but also kind of locking in a method that doesn’t involve lock-in itself. So you can lock into a certain system that allows a lot of experimentation and free debate and change over time; that’s very different than if they’d chosen a constitution that put a single person or even a single family lineage in absolute power or something. That would have been locking into a different sort of political system, but one with much less in the way of open-endedness and how it could develop over time.
Rob Wiblin: So are there any particularly non-obvious or controversial recommendations that you think the viatopian framing on things would push us towards? Stuff that people might otherwise not like?
Will MacAskill: Yeah, there are certain things that I at least think a viatopia would consist in that are not totally obvious. One, which we’ll talk about, is that I’m very pro distribution of power, whereas a lot of people who worry a lot about existential risk are actually in favour of quite intense concentration of power. And it’s not an insane view, in fact.
The idea is if you’ve got this period of intense existential risk — in particular, if existential risk can be posed by any of many different actors, whether that’s because they develop a misaligned superintelligence or because they create extremely powerful bioweapons — then you might think we just need a very small number of actors, maybe in fact just one powerful actor, that can guide us through this period.
Whereas I think that’s unlikely to put us into a position where we can guide ourselves to a near-best future.
Rob Wiblin: Why is that?
Will MacAskill: I think we’ll talk about it a lot more, but ultimately it’s because I think any single actor probably has the wrong moral conception — even upon reflection, even if they choose to reflect. I think it’s a little worse than that, in fact, because the sorts of people who end up —
Rob Wiblin: You can imagine that one person who has risen to the top and gained supreme power, there’s probably some bad filters that they’ve passed through.
Will MacAskill: Yeah, exactly. And if you look at leaders of authoritarian countries in the past —
Rob Wiblin: It’s a mixed track record.
Will MacAskill: Yeah, that includes Stalin, Hitler, Mao. And the personality traits are just, you know, it’s terrifying. These are psychopathic, sadistic people. They’re not merely randomly selected people who happen to have total power.
I also think that if one person or even a small number of people are in a position of total power, they’re also just less likely to reflect on their values in positive ways. I think that’s something that tends to happen more naturally out of interpersonal interactions and the need to —
Rob Wiblin: Well, especially between equals, I feel. Yeah, I think you notice this even just with people who gain more influence within an organisation or they become wealthy or respected or so on: they stop getting the normal pushback that sharpens their ideas. And you can imagine if you were the supreme dictator forever how disconnected you could become from any reality.
Will MacAskill: Yeah, exactly.
Rob Wiblin: OK, so what are the different categories of viatopia that you think have a shot at working?
Will MacAskill: I think there’s three broad ways of thinking about how we could get to a near-best future.
The first I call “easy eutopia.” This is actually, I think, the common sense view, which is that it’s just not that hard to get to an extremely good future, something that’s basically as good as you can get: you just need to eliminate the most obvious and egregious bads. So yes, eliminate dictatorship, but also eliminate poverty, eliminate suffering and ill health, and allow people to have freedom. And that, plus just technological development, will get us most of the way there, or even all the way. If that’s correct, then viatopia isn’t that interesting actually, because we’ll probably just hit it anyway.
A second view is convergence. On this view, you would need to have most of society with power converging onto the right kind of ethical view. I’ll sometimes use “correct ethical view” or “correct moral view.” You can also just say this in more anti-realist, subjectivist terms — like “the view I would have upon idealised reflection” or something. It’s easier just to say “correct” or “best.”
Rob Wiblin: And they have to be motivated by it as well, right?
Will MacAskill: And they have to be motivated, yeah. So in this idea of convergence, yes, maybe the best future is a narrow target. Nonetheless, if we can get it such that most members of society, or at least most people with power, converge onto the best thing, the best moral view, and steer towards it, then nonetheless we’ll hit the narrow target. But that is necessary.
And then the third vision would be what I call compromise, which is that you don’t need everyone. In fact, maybe even if you just got a small fraction of people who have the right kind of ethical views and are motivated to pursue them, and the right kind of broad philosophical perspective and understanding of the world as well, and they’re able to kind of trade with the rest of society, that is sufficient to get us to a near-best future.
My view at least is that this third option is kind of the most promising thing to steer towards.
Rob Wiblin: So we’re going to skip over the easy eutopia scenario here today. You have an article on the Forethought website called “No easy eutopia,” where you argue that it’s not plausible — in brief, because I think we both agree that the best possible world is not just a matter of removing bad things; it’s also a matter of adding lots of the best possible things. And probably the best possible thing is a lot better than nearby things, so it’s quite a narrow target to hit.
And I guess we’re not going to talk a tonne about the “everyone, when they reflect on moral philosophy, they reach the correct theory and they’re motivated to spend all their resources operationalising it.” Do you want to say anything quickly about why you don’t think that is super likely to work?
Will MacAskill: Yeah. There’s lots to say, but I just think there’s multiple ways it can fail, even if we’re in a reasonably good scenario. One is just that people can be uninterested in reflecting or they can reflect in the wrong ways. Or they can even have a good reflective process but just have bad starting intuitions — where from those intuitions, even with good reflection, they’ll end up in the wrong place.
I’ll say I am somewhat sympathetic to the idea that maybe quite large swathes of people actually would converge in the same direction. I think if that’s true, it’s because of the nature of reality. It’s because, in my view, something kind of moral realist-y would be correct: the arguments just point very strongly towards one particular ethical view. Or if you just experience this particular conscious state, you can’t help but believe that it is good, because it is in fact good.
That’s the sort of scenario I think we’d have to envisage. But wow, I don’t think we should be confident in that. And in fact I have really quite wide uncertainty over how much convergence you would get — all the way from actually just large swathes of people would converge, that’s again a really good scenario, all the way to just like no one converges after reflection: all 8 billion people in the world would have quite different views of the good.
Rob Wiblin: Yeah. You missed that you could get all of it right, have everyone conclude the correct moral theory, but nonetheless not be interested in putting their resources there. Just be like, “I just want to do my own thing. I don’t care about doing the morally good thing.”
Will MacAskill: Yeah. And in fact, that’s I think the most likely failure. You know, you can go to people and give them the arguments for vegetarianism or donating, and they can say, “Yep, all those arguments work,” and then just not take any action on it.
And in fact, it’s not like we see today people investing lots of time and lots of money into ethical reflections and reading counterarguments and so on. It’s just not really something that happens. You’ve got to be quite weird and unusual to do that.
And in fact, maybe some people would even want to guard against reflection. Imagine fundamentalist religious believers, or people who are very wedded to particular ideologies. They might say, “I don’t want to risk my adherence to my faith,” or, “Oh god, it would be abhorrent of me to even consider this alternative position.” And with future technology, we would be able to guard our informational environment, or even self-modify, such that we don’t even consider these alternative perspectives.
Rob Wiblin: Yeah. So just to set the scope even more clearly: we’re mostly not going to be considering cases of catastrophic misalignment and really deeply scheming artificial intelligence here. Not because that’s not a possible outcome or a very live possibility, but just because we only have five hours to record in, and it raises a whole lot of separate issues. It’s worth imagining what happens if we mostly overcome that, one way or another.
So yeah, let’s dive into the third option, which you thought was most promising, which you call compromise, or trade. This is a scenario where, as I understand it, you have some meaningful minority of people, weighted by power or resources, who converge on wanting the right thing for its own sake, and they’re willing to allocate some meaningful fraction of their effort towards that.
So let’s say 10% of resource- or power-weighted folks want to pursue this goal. You want to try to spin this into capturing much more than 10% of the value of the best possible future there could be. How might they accomplish that?
Will MacAskill: I think there are kind of two big ways. One is if different groups care about really quite different things. The clearest example perhaps is that some groups, upon reflection, just value resources basically linearly. A total utilitarian would be like this, because the more resources you have, the more happy lives you can create, and the value of the universe as a whole is in proportion to how many happy lives there are. Other views that are perhaps more common-sensey might be very different from that: they might just care about preservation of the Earth’s biosphere, or might discount over time and space, so they care about what happens near to them; or they might really just care about guarantees of good outcomes, or very high probabilities of good outcomes, rather than risky gambles on even better outcomes.
And this gives lots of opportunity for trade. In this case, there could be a deal which says: you’ve got the common sense person, who says, “OK, we’ll steward the resources that are nearby in space and time; and you, the total utilitarian, sure, you can go to other star systems and create this much more ambitious, expansive world with many happy beings.” And then perhaps both can in fact get 99.99% of what they would ideally want if they had complete control over everything.
And that’s this very exciting kind of potential opportunity. Because it means that then, if we can get into the scenario where we’ve managed to get these beneficial gains from all these different kinds of ethical factions trading with each other, then we don’t need to pick a winner. It’s robust to disagreement, and it’s therefore a much safer option than either just hoping we all converge, or pushing some particular view of the good.
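To make that kind of split concrete, here is a minimal toy sketch with invented numbers (an illustration of the logic only, not anything from the conversation): one faction only values the resources that are nearby, the other values all resources linearly, so dividing the universe between them leaves each with almost everything it cares about.

```python
# Toy illustration of gains from moral trade (all numbers invented).
# The "local" faction only cares about the nearby slice of resources;
# the "total utilitarian" faction values all resources linearly.

TOTAL = 1.0    # all resources, normalised
NEARBY = 0.01  # fraction of resources that are nearby in space and time

def local_satisfaction(nearby_share):
    """Fraction of its ideal outcome the local faction gets."""
    return nearby_share / NEARBY

def utilitarian_satisfaction(total_share):
    """Fraction of its ideal outcome the linear-in-resources faction gets."""
    return total_share / TOTAL

# The deal: locals steward everything nearby; utilitarians take the rest.
print("local faction:", local_satisfaction(NEARBY))                    # 1.00
print("total utilitarian:", utilitarian_satisfaction(TOTAL - NEARBY))  # 0.99
```

Shrink the nearby slice further and the utilitarian’s share of its ideal approaches 100%, which is the spirit of the 99.99% figure above.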
Rob Wiblin: Do you think that things would play out that way, or is that a viable vision?
Will MacAskill: I think there are risks to even getting that. One would be if there’s intense concentration of power. A second would be that maybe such trades aren’t allowed. There are lots of things that you’re not allowed to trade at the moment, and it’s possible that includes exactly the best stuff. So maybe the total utilitarian likes some particular blissful state, and those people are in the minority, and society says, “No, that’s illegal.” You know, there are already lots of things that in my view would be ethically just fine, but are not permitted today.
The bigger issue I think is that maybe there’s lots of groups who have relatively easy-to-satisfy views of the good — like preservation of the Earth’s biosphere or preferences for things that are kind of local. But I think there’ll be a lot of people who actually just do care about things linearly, and there it’s much harder to see initially why you would get these huge gains from trade.
So I said the total utilitarian says, “I just want there to be as many happy, flourishing lives as possible.” But now let’s distinguish within that: there’s utilitarian type one and utilitarian type two, and perhaps they differ on what they understand flourishing to consist in, what they think the best conscious experiences or lives are. In order for there to be good deals from trade there, it would need to be the case that there’s some kind of hybrid life that is more than 50% as good on both views. And it’s speculation to say how likely that is. My guess is that in general there probably wouldn’t be such a life, because my guess is that the very best things from a utilitarian perspective will be way better than things that are just a bit less good.
Rob Wiblin: I thought the archetypal case here might be you’ve got Faction A, Faction B. Let’s say Faction A, they’re the utilitarians: they want pleasure, no suffering. You’ve got Faction B that wants something quite different, and Faction B incidentally might cause a whole bunch of suffering in pursuit of their other goal. But the suffering is not something that they value for its own sake; they’re just doing it because it makes their project somewhat more efficient. And then Faction A could basically pay Faction B to redesign their thing so it doesn’t involve suffering, incidentally. Is that a kind of thing?
Will MacAskill: That would be a case. And in the world today, that sort of thing happens. I do think that if we had much better opportunities to make such agreements, we had better coordination technology or something, the vegans and vegetarians and people concerned about animal suffering could just engage in some sort of trade with the people who like eating meat. Perhaps there wouldn’t be enough bargaining power to eliminate farming altogether, but I think it could eliminate factory farming. You know, most animal suffering could just be abolished because, as you say, people aren’t really aiming for that directly. It’s just a side effect.
My guess is that when we’re now thinking about these very grand scales, that’s not going to be super common, or at least there will be a lot of residual incompatibility left over — because you’re just trying to produce happiness type 1 as much as you can, and I’m trying to produce happiness type 2; I think that your understanding of happiness has basically no value, but it’s not like you’re producing lots of suffering.
Rob Wiblin: It’s just valueless.
Will MacAskill: Yeah, or it’s like a tenth as valuable or something, and similarly vice versa.
Rob Wiblin: OK, we’ll push on from this. I guess we should just quickly note that there’s a wrinkle with this kind of moral trade or a challenge that for example, if we did start paying people to close down their factory farms or to redesign them, then you would be vulnerable to someone saying, “Well, I’m going to open up the worst possible factory farm unless you pay me.” And you wouldn’t know whether they would have done it otherwise. I guess they could pretend that they’re not doing it to blackmail you, basically, but in fact they are. Possibly, in this starfaring future, maybe that wouldn’t be such an issue, or maybe it would be a much worse issue. We don’t really know.
Will MacAskill: Yeah, and I should flag this is my biggest worry with the whole widely distributed power and trade and so on: vulnerability to those sorts of extortion/blackmail dynamics. There’s this very substantive project to work out what’s a good system where people who self-modify or pretend or use blackmail or extortion are not rewarded for doing so, but you still get these other beneficial gains from trade.
The least bad alternative to total utilitarianism? [02:34:42]
Rob Wiblin: OK, let’s push on to some honest-to-god philosophy, or at least what analytic philosophers would regard as philosophy. You’ve been working on a pet moral philosophical theory that you call “the saturation view.” What problem in normative ethics are you trying to address with the saturation view?
Will MacAskill: So this is kind of a set of problems, in fact, within population ethics. It’s a well-known area of ethics for generating all sorts of paradoxes, cases where you’ve got lots of individually extremely plausible principles that end up inconsistent with each other.
And there are a number. There’s what’s called the mere addition paradox, where some intuitively plausible principles end up leading you to what Derek Parfit calls the “repugnant conclusion”: the idea that you could start off with a trillion trillion extremely happy people, and that outcome might be worse than a population that consists only of people with lives barely worth living as long as there’s a large enough number of them. So that’s one of the problems.
The second is the problem of fanaticism: you again start off with a guarantee of this amazing outcome, and then compare it against a tiny, tiny probability of something that’s even better, provided it’s sufficiently good. When combined with expected utility theory, many views will say: take the gamble. No matter how small the probability, there’s some sufficiently good outcome such that you should take it.
Rob Wiblin: Because it’s risk neutral, basically.
Will MacAskill: Because it’s risk neutral with respect to total quantity of happiness or something like that.
A third category of issues is infinite ethics. I think we definitely won’t have time to get onto that side of things, but it’s something that’s really plagued this kind of impartial consequentialist approach to ethics or axiology.
But there’s also a fourth problem, in my view, which hasn’t been discussed in the literature, which I call the monoculture problem. It goes: let’s try and figure out what the best possible future is. What does that look like? Remarkably, all the extant well-specified theories of population ethics to date say that the best future, if you’ve got a fixed amount of resources, involves figuring out which life would produce the most wellbeing for a given amount of resources used to create it, and then just making copies of that life over and over and over.
Rob Wiblin: Tile the universe.
Will MacAskill: Yeah. So in EA and rationalist worlds, it sometimes gets called “tiling the universe with hedonium” — where hedonium is whatever produces the most bliss per unit of resources. But the general idea is that what the theory wants is a monoculture, because this is the thing that has the most wellbeing. And if you just have that repeated forever, you’ve also got a perfectly equal society, so it’s good on egalitarian grounds too.
Rob Wiblin: Well, it seems like it’s a very natural attraction point, because any theory that says that there’s a best thing, and that thing is not universe scale, is going to say that if it’s smaller, just make it and then make it again and just keep going. It seems like you almost have to hard code in a preference against this to avoid the monoculture, which most people find quite unattractive.
Will MacAskill: Yeah. It actually also follows from a couple of principles that are generally regarded as axiomatic in population ethics. There’s a very simple kind of proof you can give from those principles.
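To see why additive views end up there, here is a minimal, purely illustrative sketch. The resource budget and the candidate life designs are invented, not from the conversation: when total value is just a sum over lives, a fixed budget is always best spent entirely on whichever design yields the most wellbeing per unit of resource.

```python
# Illustrative only: under a purely additive axiology, total value is a sum
# over lives, so a fixed resource budget is optimally spent on copies of the
# single highest-scoring life design. All names and numbers are invented.

R = 1_000  # total resource budget, in arbitrary units

value_per_unit = {
    "contemplative life": 3.0,
    "adventurous life": 2.5,
    "hedonium-style experience": 5.0,
}

best = max(value_per_unit, key=value_per_unit.get)
print(f"Additive optimum: spend all {R} units on copies of '{best}', "
      f"total value = {R * value_per_unit[best]:.0f}")
# Any mixed allocation scores strictly less, hence the monoculture result.
```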
However, I at least find that unintuitive. I would think that a future of just replicas of the one qualitatively identical life is not the best possible future, and a better future would involve a wide diversity of different forms of life and experiences and so on. I think that’s not just an intuition that diversity or variety is instrumentally valuable, or an intuition that’s saying we don’t know what’s valuable so we should hedge our bets. Instead, I think it’s just that actually that’s a better future.
Rob Wiblin: Placing intrinsic value on variety?
Will MacAskill: Yeah. Or something that has that implication. So it could be — I mean, this might just sound like the same thing, but I think it’s slightly different — that the realisation of a particular experience or form of life has value in itself over and above just the mere wellbeing. But either way, a very diverse and varied future is better than this monoculture.
Rob Wiblin: Yeah. It’s surprising to me that this hasn’t come up in the philosophy literature very much, because online, whenever people talk about what are we going to do with all the matter and the energy, and then anyone suggests something that is very monotonous, just repeat the same thing, people are like, “I don’t like that. Sounds horrible. Sounds crazy and terrible.” But I guess philosophers, because I suppose the prospect of changing all of the galaxies out there hasn’t really been on the table before, it hasn’t really come up that we need to figure out a solution to this.
Will MacAskill: Yeah, I think that’s right. I have found actually, over and over again, that being really concerned by figuring out how do we do as much good as we can has ended up driving all sorts of interesting philosophical areas and issues that are otherwise being neglected, because most philosophers are not thinking in that same way.
Rob Wiblin: So what is the saturation view? How does it address this?
Will MacAskill: The saturation view is a way of incorporating the idea that diversity is intrinsically valuable. The thought is that if you have a replica of a life, so a qualitative copy, that’s just less valuable. And in fact, more and more copies of that life are progressively less and less valuable, in a way that tends towards some upper limit. And generalising that a bit: if a life is not an exact copy but only slightly different, then for the same reason it’s also a bit less valuable than some totally new form of life would be.
And the analogy could be like: imagine a kind of colour wheel that’s initially not lit up at all. And different sorts of life will experience different spots on the wheel, and by adding lives, you’re kind of lighting up those little spots. Whereas a kind of traditional population axiology would be saying you have the best thing, and just over and over again you want to produce that best thing. Instead, on the saturation view, you want to light up the whole wheel. Because I’ve had many copies, let’s say, of these very similar lives, that means the additional lives are not adding as much value. So you get more value by instantiating some totally different form of life or form of experience.
Rob Wiblin: It’s a very natural formalisation of this intuition, that you’re just saying you hit declining returns on stuff if they’re too similar. Like, you’ve got something that’s good, but making another copy of it isn’t as good as the first time, and also something that’s too similar to it takes a bit of a haircut if there was something else that was too similar to it in the past. I guess they never become useless; they just become less and less valuable incrementally.
Will MacAskill: Exactly. Yeah, there’s never a point when then you get no additional value. But the amount of value each kind of copy produces gets smaller and smaller.
Rob Wiblin: Does it asymptote up to some maximum value?
Will MacAskill: Yes. As part of the view, it asymptotes. And that’s a really crucial part of it, actually.
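To give a feel for that asymptote, here is a minimal toy sketch. It assumes a simple geometric discount on successive copies, with made-up parameters; it is an illustration of the shape of the view, not MacAskill’s actual formalisation.

```python
# Toy sketch of "saturation": each additional copy of a qualitatively
# identical life adds a fixed fraction r of what the previous copy added,
# so the total value of n copies asymptotes to an upper bound.
# Parameters (first-copy value, r) are invented for illustration.

def value_of_copies(n: int, first_copy_value: float = 1.0, r: float = 0.5) -> float:
    """Total value of n identical copies: geometric sum v * (1 - r**n) / (1 - r)."""
    return first_copy_value * (1 - r**n) / (1 - r)

for n in (1, 2, 5, 50):
    print(f"{n:>2} copies -> total value {value_of_copies(n):.4f}")
# 1 -> 1.0000, 2 -> 1.5000, 5 -> 1.9375, 50 -> ~2.0000 (the cap v / (1 - r))
```

A fuller version would also apply a milder discount to lives that are merely similar rather than identical, which is where the colour wheel picture comes in.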
Rob Wiblin: OK. And do you have any difficulty defining what the hyperspace is over which you’re considering whether things are different from one another, or are you just going to set that aside?
Will MacAskill: In my work so far, I don’t talk a lot about what exactly is this space of different lives, and how many dimensions does it have and so on. I make some kind of formal assumptions about it, but my view in general is like, let’s just start off by looking at the kind of formal structure of this view and all of the nice properties it has. And then afterwards we can then start arguing about it, because it would involve trading lots of different intuitions and so on, but I don’t think it’s really affecting the biggest picture.
Rob Wiblin: So what are its nice properties?
Will MacAskill: So going back to these different problems: let’s start with this monoculture. Very clearly, just doesn’t lead to a monoculture. And in fact, you would want this very rich, diverse future — that would be better.
In the variant of the view that I formulate, it dissolves the mere addition paradox.
Rob Wiblin: Why is that?
Will MacAskill: It involves one extra structural assumption. Again, the point is to find some theory that is not like the total view and avoids its problems. The assumption is that lives with very low wellbeing, or low-wellbeing experiences, depending on how you’re aggregating, make up only a small part of the overall landscape of possible lives or experiences. You then appropriately reformulate the underlying principles that generate the paradox, because these have to be what philosophers would call “ceteris paribus” principles, “other things being equal” principles. So: holding diversity fixed, it’s not bad to make some people’s lives better and to add lives that are good. And holding diversity fixed, it’s not bad, in fact it’s good, to have more wellbeing, distributed more equally.
It turns out that the view can satisfy all of those principles, accepting this dominance principle and this “egalitarianism plus increasing wellbeing” principle, while never entailing the repugnant conclusion. The thought is that all of these low-wellbeing lives or low-wellbeing experiences just can’t add up to enough diversity to be worth having. So in each step of the paradox, you’re adding people and then trying to rebalance the wellbeing, but eventually there’s a step where you can’t do it; there’s just no world that will in fact satisfy that step.
Rob Wiblin: OK. I didn’t follow that, but that’s OK.
Will MacAskill: It’s a little bit hard to convey on a podcast. And in fact, much of the paper is not even giving the view to begin with, because the view gets mathematically quite intricate. In fact, it’s just giving a toy version of the view and then working it through.
Rob Wiblin: Yeah. So I think the main reason that I’m not super drawn to this is that I don’t have the same intuition. I don’t have the intuition in favour of variety as strongly as many people do.
So of all of the problems with total utilitarianism or any views like that, the thing that I find most troubling is the risk neutrality between positive and negative experiences. I find that deeply disturbing, because it’s never something that I would choose for myself, that I would be indifferent about a life that’s extremely good and extremely bad, each with 50% probability. So that’s super counterintuitive to me. But the idea of making something that’s really good and then making a lot of it I don’t find as peculiar.
Will MacAskill: Well, I just wanted to ask, on your views, you said risk neutrality. I mean, you could just have a negative-weighted utilitarian view, where let’s say bads count for 1,000 times as much as goods or something. But you’re still risk neutral with respect to that.
Rob Wiblin: Yeah. So that is more attractive, I guess. It’s a little bit hard to know are you changing the weighting of the badness, or are you just correctly assessing that the badness is really worse?
Will MacAskill: Yeah, yeah.
Rob Wiblin: But yeah, I think that makes more sense to me. That’s more how I would make the decision: that you just weight the bad stuff really more, because it’s like debunking explanations for why humans would have this intuition that we’re more capable of suffering a lot in an hour than we are of experiencing pleasure in an hour.
Will MacAskill: Yeah. I’m wondering if you also have worries about the risk neutrality aspect, because that’s where, in the most extreme case, combining it with the suffering cases, you start off with a trillion trillion lives of intense bliss — so a trillion trillion lives that are absolutely amazing: option A. Option B is a trillion trillion lives of intense suffering, the worst possible suffering, plus some one-in-a-billion-billion-billion chance of an extremely large number of lives just barely worth living.
The total utilitarian, combined with expected utility theory or expected value, has to say that the latter is better than the former, as long as the number of lives is large enough.
Rob Wiblin: So all we’re doing is adding a whole lot of just barely worth living lives, and that’s way better?
Will MacAskill: Yeah. So world A is a trillion trillion bliss-utopia world. And then B is a gamble: there’s a guarantee of a trillion trillion lives of intense suffering, plus just an epsilon probability of all of these lives that are just barely worth living, but a very large number of them.
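To spell out the arithmetic behind that verdict, with illustrative stand-in numbers (each blissful life counted as +1, each life of intense suffering as −1, each barely-worth-living life as a tiny ε > 0, and the gamble paying off with probability p):

\[
EV(A) = 10^{24} \times (+1) = 10^{24},
\qquad
EV(B) = 10^{24} \times (-1) + p \, N \, \varepsilon ,
\]
\[
EV(B) > EV(A) \iff N > \frac{2 \times 10^{24}}{p \, \varepsilon},
\]

which is finite however small p and ε are. So a risk-neutral total view always prefers gamble B for some sufficiently large N; a bounded view like the saturation view blocks this, because achievable value tops out.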
Rob Wiblin: Yeah, I foresee that you’re just going to throw out an edge case like this no matter what I say. [laughs] You have too much practice with this. I mean that is also very unattractive to me as well.
Will MacAskill: OK, OK.
Rob Wiblin: So yeah, did I interrupt? I think you were going to go somewhere with this. Your view helps with this?
Will MacAskill: It’s just because you mentioned risk neutrality, and one of the problems I mentioned was this fanaticism: where, no matter how small the probability, as long as the payoff is sufficiently good and big enough, you will pursue that tiny probability of an enormously large payoff.
And this view avoids that, because it ends up being bounded. So basically, as long as the landscape of possible lives is either finite, or a certain feature of it decays fast enough, there’s an upper limit to how much good you can create. Intuitively, again thinking of this colour wheel: once you’ve fully illuminated the whole landscape as brightly as possible, that’s the upper bound. So you avoid fanaticism.
And then I’ll briefly say, but not explain why: for the same reason, I think it has quite a range of desirable properties even with infinite populations too. Many consequentialist views, like the total view, naturally lead to a lot of paralysis where you can’t even compare intuitively comparable worlds. This does not have that implication.
Rob Wiblin: OK, so I guess that is legitimately attractive. The two things that struck me as odd about the view, or less attractive about the view, were on the negative side: if you’re also saturating there, it’s even more bizarre that you would say that we’ve already had so many people suffering in this very specific, torturous way, so adding more of them, who cares? It’s too similar to existing things to be that bad. It feels even more clear on the negative side that it’s just linearly bad to have more and more people having horrible lives.
The other thing is, let’s imagine that we went about this project, that we’re going to turn the sun into whatever we think is morally best, or turn the solar system into this thing that we think is fabulously, morally good. But then we make this discovery that we think that aliens elsewhere in the multiverse, like a long time ago or a long time in the future, they did something that was really similar. We’ve simulated it, and we think that they already made this before. We’d be like, “Shucks, we wasted our time!” That non-separability, the fact that the value of what we do is connected to things so distant, isn’t intuitive to me.
What do you make of those two things?
Will MacAskill: Both super important points. And yeah, the negative side is, in my view, by far the most unappealing aspect. You end up having to kind of pick your poison, unfortunately.
Let’s come back to that, because on the separability side is this principle called separability, which is basically just if I’m comparing A and B, two different outcomes: suppose there’s some background population in distant time, distant space. It’s irrelevant to whether A is better than B. It’s irrelevant what that background population is like.
Rob Wiblin: Yeah, you can go like +C, +C and then cancel them out.
Will MacAskill: Yeah, exactly. And I agree that that’s quite intuitive; separability is intuitive. But if you endorse separability in conjunction with what I would regard as just standard technical assumptions, you have to endorse either the total view of population ethics, which is just adding up all the happiness, or the critical level view, which is adding up happiness but subtracting a bit.
Rob Wiblin: For each individual?
Will MacAskill: For each individual, yeah. So if someone had wellbeing 10 and the critical level was 2 or something, then adding them to the population would have +8. And these views have all of these problems that we said to begin with. They differ on the repugnant conclusion. But the problems are really bad, and seemingly unintuitive in both cases.
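Writing that bookkeeping out: the standard statement of the critical level view ranks outcomes by summed wellbeing minus a fixed critical level c per person (the total view is the special case c = 0):

\[
V = \sum_i (w_i - c),
\]

so adding a person with wellbeing 10 against a critical level of 2 contributes 10 − 2 = +8, as in the example above.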
So that’s one thing to say: we’re going to have to suffer a violation of separability.
The second is that the diversity intuition is fundamentally an intuition about separability. Because it’s like looking at the pattern of different sorts of life, saying, “We’ve already had a lot of this thing, so it’s more valuable to have something new.”
Rob Wiblin: I think it might be because these things are so linked in my mind that it’s not as counterintuitive, the homogeneity thing. I guess if you haven’t thought about this before, they seem like separate issues almost, and you only realise on reflection that they’re deeply connected.
Will MacAskill: Yeah. Because there are some cases where a violation of separability seems fine. So in one’s own case it’s like, “I’m going to climb Mount Everest, and that’s going to be this amazing achievement!” And then if someone’s like, “Oh, you forgot. You actually climbed Mount Everest last year.” Like, “Oh, did I?” “Yeah, you knocked your head and you got amnesia.” You might well be like, “Oh, OK. Well…” I mean, it’s a bit unclear.
Rob Wiblin: Yeah. I mean, if the experience would be the same, I would do it again. I’d be like, “Great. Well, I can do it again because I forgot.” But I think most people wouldn’t probably.
Will MacAskill: Yeah. I am actually getting some people to run a survey to see how people’s intuitions are about different things.
Rob Wiblin: Which poisons people prefer to drink from this medley.
Will MacAskill: Yeah. But I’m also actually not claiming that this new view is the best view. I think I’m saying if you want to reject the total view, this is your best option.
The last thing I’ll say on this separability is that we said that all views, other than total view and critical level view, have to violate separability if you satisfy certain technical axioms. I think the saturation view violates it in a less bad way, because the vast majority of the time it’s separable. So if the populations are different parts of the landscape, then you can just add it up. You know, you add up the value of this population, the value of this population. So it endorses this kind of limited separability principle.
And then secondly, depending on how you define it, you could keep it such that it’s all approximately linear until the population size gets really, really big. Then it looks approximately like the total view in most scenarios up into cosmic scale.
Rob Wiblin: Or even like intercosmic scale if we’re doing EDT [evidential decision theory].
I guess I’ve seemed a little bit unenthusiastic about this so far, but I think it’s amazing. Like, surely this is going to end up being a big deal, or surely this has got to be one of the top theories within this entire space, don’t you think? I mean, I don’t find it attractive, but I think that many people will choose this as their population axiology once presented with it.
Will MacAskill: I should say that I’m not at all claiming that this is the highest impact use of my time, because I think a lot of this work can just be punted till AI gets better and so on. But it is the idea that I’ve been most taken with, like most just obsessed by, in my life. And I think from a purely intellectual perspective, I reckon it’s my best contribution.
It also just makes me appreciate actually how few population axiologies have been proposed. The options are really quite weak. Most of the work that happens is more… Very few people are like, “Here’s a view, here’s a theory, and this is how it all works.” In a way, it’s surprising.
Rob Wiblin: Yeah. Is anything published about this yet?
Will MacAskill: So my plan is to finish up… I’ve done this kind of sprint on what was meant to be the blog post summary, but it’s 13,000 words. So I think I’m just going to be like, OK, this is a draft article. And yeah, my plan is to publish that in the next few weeks.
Rob Wiblin: OK, excellent. Well we’ll stick up a link to that.
Will MacAskill: OK. And very kindly, you’ve not gone back to the negative side: how it deals with very negative worlds, intense suffering and so on. But I’m happy to acknowledge it has very unappealing implications in that case.
How AI could kickstart a golden age of philosophy [02:58:03]
Rob Wiblin: You mentioned earlier that you used AI a tonne to do this work. Tell us about that.
Will MacAskill: Yeah. This is part of the reason I think I’ve been so taken and obsessed by this idea — like I was working on it, like I was on holiday and stuff, just doing as much as I could in my spare time — is because of the amazing, in my view, uplift of AI on analytic philosophy in particular.
So how helpful is AI for research? Well, extremely spotty: if you want to learn about some weird area, it’s amazing. If you want it to help do certain areas of macrostrategy research, it can be essentially useless.
In the case of at least this formal end of analytic philosophy, it’s so good. And honestly, credit where credit’s due: it’s almost all ChatGPT Pro, so now 5.2 Pro, where I think I wouldn’t be saying any of this if that particular model didn’t exist.
Rob Wiblin: Huh. Gemini or Claude are not at the same level?
Will MacAskill: Well, I think a big part of the reason is it just thinks for longer.
Rob Wiblin: Is this the $200 per month one?
Will MacAskill: Yeah. I now pay by credit, so I actually spent $1,000 in the month I was most working on this.
Rob Wiblin: A bargain at the price!
Will MacAskill: But yeah, it will think for… I’ve had it think for 70 minutes, is my peak so far.
Rob Wiblin: And it really does deliver better answers?
Will MacAskill: Well, here’s what’s going on. Because I’ve talked to other researchers who really don’t get that much from it, and I think what’s going on is that the problems within, say, population ethics are very well specified. There’s a big literature, which the AI has digested, and it’s also an area that has been specified enough to be amenable to mathematical analysis. But very few mathematicians have actually looked at it; it’s mainly philosophers who maybe did maths in their undergrad. The exceptions are a handful of economists, and Teru Thomas — a mathematician who moved into analytic philosophy, and who in my view has done maybe better work than anyone on population ethics.
So there’s this big overhang of capability that the AI is getting from its being trained to be very good at maths. And in my own case, I had the kind of core insight like a year and a half, maybe two years ago now, something like that. And then I was exploring it. I talked to Toby Ord and Christian Tarsney — and I should say that, if we publish a paper on this, it’ll be coauthored with Christian.
And yeah, the initial kind of thought was specified in a way that obviously didn’t quite work. And there was an obvious way of kind of specifying it in a discrete form. And it was like, there must be some continuous form of the theory that would work.
And I just don’t have mathematical training. It’s kind of beyond me. AI does. So then it felt like really getting this kind of rocket booster, where I’d be like, “No, I want it to work like this. This.” It’s like, “OK, cool.”
Rob Wiblin: Did you have difficulty checking the answers that it gave?
Will MacAskill: There were challenges there, because I’ve definitely been slower as a result. I use many AIs to check its work, and have it check itself, in many cases. One thing that AI is still pretty bad at is keeping a tight hold on concepts. So it might define something one way on page three, and then on page eight it’ll define it in some other reasonable but different way, and it doesn’t necessarily notice.
But it’s much easier to verify something than to come up with it yourself. And a lot of the time it’s just using concepts where, say, I didn’t know what a kernel is. It’s not that complicated once you’ve learned it, but I wouldn’t have even known where to go.
Rob Wiblin: Where to look. My impression from Twitter is that AI is now starting to make useful contributions in maths specifically. I think it’s not amazing stuff yet, but we’re seeing the early signs of it producing stuff that might be publishable. Do you think the same thing might start happening in analytic philosophy, given that at least some parts of it are basically kind of like maths with words?
Will MacAskill: Yeah. Honestly, I think a big question is just whether analytic philosophers take the opportunity. I’m very curious about doing this as an early testing ground for AI for macrostrategy as a whole.
But also, this is the kind of best case. There have been other cases too: in one, it just gave me a definition that was really good; again, a kind of formal definition. In other cases, I’ve had it give really quite good informal definitions of things. And another case where it just came up with a good critique: I was like, “Here’s a view, generate as many counterarguments as you can,” and it comes up with 20; most are bullshit, but one is like, oh, that’s really on point.
So my take is that we’re entering this golden age of analytic philosophy, potentially — especially at least on the more formal end, where people could become 2x, 4x more productive.
Rob Wiblin: Does it need lots of hand holding? I mean, at the point where one person can just be like, “Here’s a set of problems, here’s a £100,000 compute budget. Have at it, ChatGPT,” then you don’t need the field as a whole to change. It’s like that one person just ends up owning the entire discipline.
Will MacAskill: I think analytic philosophy is small enough that there’s a question of, does one person or not do it? But yeah, I expect the field as a whole to be very slow to appreciate, but some people will be really on top of that.
Rob Wiblin: Yeah, I guess I’m saying if it requires constant hand holding to make any progress and to structure its thinking and so on, then that is a bad sign. Or that suggests that, unless many people in the field get massively enthusiastic, which probably won’t happen, then…
Will MacAskill: Oh yeah, and I think that is right. So Christian again, who I’m planning to coauthor with: we were working together, and he’d had this other idea, which ended up being quite different, for how to extend the core idea. And I was like, “Oh, you’ve got to use AI! GPT-5 Pro is so good. It’s worth $200 a month.” Then he had this conjecture, and the AI was like, “Oh yeah, I proved it for you. Blah, blah.” And it hadn’t; the thing was very complicated. So he was like, “Oh, I need to learn to assess this sort of thing.” But it was just…
Rob Wiblin: It was wrong. Hallucinating.
Will MacAskill: Or reward hacking. So there’s a tonne of —
Rob Wiblin: So it’s really quite a skill to drive, I guess.
Will MacAskill: Exactly. Yeah. You’ve got to have this intuition of when’s it bullshitting you and when is it not? And that will pose an increasing issue.
I guess there’s a couple of things. One is that sometimes it just flat-out thinks it’s proved something and it hasn’t. Another is that often it says, “I’ve got this proof,” and then you wade through it, and one of the assumptions is very close to the thing being proved. So it’s these classic things that everyone finds: it’s lazy, it’s eager to please. And so there’s a lot of skill in terms of just intuition about when it’s going to work well and when not.
And it’s interesting. When have I ever just had an AI output and then just actually read it in the same way I read a human piece of text? I’m like, never. I think maybe never, because I skim through and then I’m like…
Rob Wiblin: Yeah, yeah. I suppose there probably is a growing gap between people who have been using this stuff all the time, I guess like you and me over the last year. Because I think maybe part of the reason why other people are sometimes not as impressed is that they just haven’t built up these intuitions for what kinds of things work and what the failures are going to be, and what they should be looking for for something to be wrong.
So it sounds like it’s slightly mixed on whether we’ll have a flourishing of analytic philosophy in the next few years. But you said that macrostrategy, the kind of stuff that Forethought does, you’ve found it to be maybe less useful. More touch and go?
Will MacAskill: Oh yeah, much more touch and go, and much more of a mixed bag. There are some ways in which AI is an amazing uplift for macrostrategy, because often the work just involves needing to know a little bit from all sorts of different disciplines. So even with early models, like GPT-4, you could ask, “Are there any interesting experiments that you can only do in space and can’t do on Earth?” And it’d be like, “Well actually, because gravity interferes with certain crystalline formation…” And I’m like, I would never have been able to get this otherwise. So for totally random bits of science and information, it’s fairly useful.
Incredibly useful for just when you need to generate a lot of examples. So with this AI character work, just like, “I need a tradeoff between these two virtues or something. Give me lots of examples.” And it can just generate large quantities of them.
But then if there’s some kind of gnarly question, or you need to be really precise — like if you’re actually drafting certain principles for how AI characters should behave — and then certainly on the insight side of things, so obviously a big part of the value… I think it just doesn’t really know what doing good macrostrategic thinking looks like. And so instead you get something that feels like a management consultant or like maybe a high school essay.
I mean, I think it’s still getting better and getting more useful, but I feel quite aware of the things where there’s an existing literature and where there isn’t.
Rob Wiblin: Cool. Well, sounds like your job is secure for another year at least.
Will MacAskill: Six months.
Rob Wiblin: Yeah, six months. I think we’ve touched on about a third of the stuff that Forethought has put out over the last year, so if people like this and they want to read more, then at forethought.org you’ve got a research page. There’s a lot of really interesting macrostrategy work on there that people should check out. I found it fun reading through.
Will MacAskill: Well, thank you. It’s been great being on here. I’ve really enjoyed the conversation.
Rob Wiblin: My guest today has been Will MacAskill. Thanks for coming back on The 80,000 Hours Podcast, Will.
Will MacAskill: Thanks for having me.