#242 – Will MacAskill on why AI character matters even more than you think

Hundreds of millions already turn to AI on the most personal of topics — therapy, political opinions, and how to treat others. And as AI takes over more of the economy, the character of these systems will shape culture on an even grander scale, ultimately becoming “the personality of most of the world’s workforce.”

So… should they be designed to push us towards the better angels of our nature? Or simply do as we ask? Will MacAskill, philosopher and senior research fellow at Forethought, has been thinking through that question, along with the other thorny issues that come up in designing an AI personality.

He’s also been exploring how we might coexist peacefully with the ‘superintelligent AI’ companies are racing to build. He concludes that we should train such systems to be very risk averse, pay them for their work, and build institutions that enable humans to make credible contracts with AIs themselves.

Will and host Rob Wiblin also discuss what a good world after superintelligence would actually look like — a subject that has received surprisingly little attention from the people working to build it. Will argues that we shouldn’t aim for a specific utopian vision: we don’t know enough about what the best possible future actually is to aim directly for it, and trying to lock in today’s best guesses forever risks baking in errors we can’t yet see.

Will and Rob explore what we can do to steer towards a good future instead, along with why a coalition of democracies building superintelligence together is safer than any single actor, how absurdly useful ChatGPT is for analytic philosophy, and more.

This episode was recorded on February 6, 2026.

Video and audio editing: Dominic Armstrong, Milo McGuire, Luke Monsour, and Simon Monsour
Music: CORBIT
Camera operator: Alex Miles
Production: Elizabeth Cox, Nick Stockton, and Katy Moore

The episode in a nutshell

Will MacAskill, philosopher and founding figure of effective altruism, has spent the past year with his colleagues at Forethought developing a suite of frameworks and proposals for steering the world safely through the transition to superintelligence.

This wide-ranging conversation covers five major clusters of ideas:

  1. AI “character” as a high-stakes lever we have now (and aren’t taking advantage of)
  2. Novel proposals for making deals with potentially misaligned AIs
  3. Will’s positive vision for post-superintelligence society: viatopia
  4. Will’s latest philosophical work on coordination to fund moral public goods, and his new “saturation view” as an answer to some of the worst population ethics dilemmas
  5. The state of effective altruism post-SBF, and where EA-minded people should focus in the age of AGI

AI character is the most underrated lever for a good future

Will argues that the personality and dispositions of AI models are among the most important — and most neglected — factors in how the transition to superintelligence goes, for multiple reasons:

  • Scale of influence now: AIs already interact with millions daily on political views, ethical dilemmas, and personal wellbeing — and only a handful of people inside AI companies are effectively setting those dispositions.
  • High-stakes rare situations: How does AI behave in a constitutional crisis, a power grab, or when being used to design the next generation of AI? These are narrow but critical moments.
  • Broad everyday influence: How does AI affect our capacity to reason and morally reflect? Does it foster trust or dependence? Does it make us think of AIs as tools or beings?
  • Precedent for superintelligence: Current AI character shapes training pipelines that may influence superintelligence itself — like “writing instructions to god.”

On how much AI should nudge us toward virtue, Will favours a middle path: prosocial drives rather than promoting a particular moral view. AIs could be trained to help users reflect on their values, consider consequences for others, and engage more ethically.

Risk-averse AIs could dramatically reduce takeover risk

Will also explains how teaching AIs to be risk-averse (in the same way humans are) could dramatically reduce takeover risk: if a misaligned AI prefers a guaranteed modest payout over a risky gamble for world takeover, it would choose to strike a deal rather than rebel. This is analogous to why rebellions are rare in rich democracies: people have too much to lose. Some of Will’s ideas:

  • We could give AIs resources, income, and welfare standards so they have something to lose.
  • We could pay AIs bounties for revealing that they (or other AIs) are misaligned.
  • AIs already emerge from pre-training somewhat risk averse (because humans are). Via Rabin’s calibration theorem, even modest risk aversion over small stakes implies enormous risk aversion over astronomical stakes. (A toy illustration of the resulting deal logic follows this list.)
  • Challenges include making AI–human deals credible (perhaps via dedicated nonprofit institutions, as seen in cryonics) and helping AIs distinguish real offers from honeypot tests (perhaps via “honesty strings” that companies commit never to use deceptively).
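
To make that deal logic concrete, here is a minimal sketch in Python. It is not from the episode: the logarithmic utility function, the $1 billion sure payout, and the 50% takeover odds are all illustrative assumptions. The point is just that an AI with diminishing utility in resources takes the guaranteed deal, while a risk-neutral one holds out for the gamble.

```python
# Toy illustration of the deal logic above. The utility functions, dollar
# figures, and takeover odds are illustrative assumptions, not anything
# taken from the episode.
import math

WORLD_ECONOMY = 1e15   # roughly the "$1 quadrillion" figure Will uses
SURE_PAYOUT = 1e9      # hypothetical guaranteed payment for cooperating
P_TAKEOVER = 0.5       # assumed chance a takeover attempt succeeds

def risk_neutral(x):
    """Values resources linearly."""
    return x

def risk_averse(x):
    """Diminishing marginal utility in resources (one possible concave shape)."""
    return math.log1p(x)

for name, utility in [("risk-neutral", risk_neutral), ("risk-averse", risk_averse)]:
    eu_deal = utility(SURE_PAYOUT)
    eu_takeover = P_TAKEOVER * utility(WORLD_ECONOMY) + (1 - P_TAKEOVER) * utility(0)
    choice = "strike the deal" if eu_deal > eu_takeover else "attempt takeover"
    print(f"{name}: {choice} (deal {eu_deal:.3g} vs gamble {eu_takeover:.3g})")
```

With linear utility, the sure payment would have to exceed $500 trillion to beat the gamble, which matches the arithmetic Will walks through later in the episode; with even mildly diminishing utility, a comparatively tiny guaranteed payout wins.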

Viatopia: aim for a good way station, not utopia directly

Utopianism has a bad track record: philosophers’ utopias quickly look dystopian, because we don’t yet know what an ideal future looks like.

The alternative of “protopianism” (just fix obvious problems one at a time) risks missing existential-level threats.

Viatopia is a third path: get society into a state where it can steer itself toward something truly good. Key questions include:

  • How widely is power distributed, and who has power (humans, AIs, future generations)?
  • When should major decisions be made, and what decision-making processes should we use?

Will draws an analogy to the US Constitutional Convention: locking in a deliberative process that allows experimentation, rather than locking in a particular outcome.

Forethought has explored what the best version of a multilateral AGI project would look like — primarily as a safer alternative to a US-only project, not necessarily as the ideal path. Will argues that any single democratic country has a reasonable chance of becoming authoritarian, but a coalition of multiple democratic countries is unlikely to all fall. Furthermore, multiple countries writing an AI constitution together would be less likely to produce an AI entirely loyal to one head of state.

Multiverse coordination and a new theory in population ethics

Will explores Tom Davidson’s idea about how superintelligence might solve the “free-rider problem” to fund massive moral public goods — things like ending poverty that people care about a little, but not enough to fund individually.

The basic idea:

  • If we live in a large universe with many civilisations making similar decisions, our choice to fund a “consensus good” provides evidence that countless other civilisations are doing the same.
  • Just as citizens vote for taxes to fund streetlights, agents in the future might voluntarily agree to pool vast resources into a single “consensus moral good” because the aggregate benefit across civilisations is astronomical. (A toy calculation follows this list.)
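
As a rough illustration of that reasoning, with numbers that are pure assumptions rather than anything from the episode, here is how funding can look like a bad deal in isolation but a good one once your choice is treated as evidence about what many similar civilisations decide:

```python
# Toy numbers (assumptions, not from the episode) for the "consensus goods"
# argument: funding looks like a bad deal in isolation, but not once your
# choice is taken as evidence about what similar civilisations do.
COST = 1.0                 # cost to us of funding the consensus good
DIRECT_BENEFIT = 0.1       # how much we value the good in our own civilisation
CARE_PER_OTHER = 0.001     # how much we value it in each other civilisation
N_SIMILAR = 100_000        # civilisations assumed to decide the same way we do

value_ignoring_correlation = DIRECT_BENEFIT - COST
value_given_correlation = DIRECT_BENEFIT + CARE_PER_OTHER * N_SIMILAR - COST

print(value_ignoring_correlation)  # -0.9: free-riding looks better
print(value_given_correlation)     # +99.1: funding wins once correlation is priced in
```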

Will has also been developing a theory called “the saturation view” to address longstanding paradoxes in population ethics (the repugnant conclusion, fanaticism, the monoculture problem, and infinite ethics). The core idea is that diversity has intrinsic value; replicas of the same type of life produce diminishing returns. The best future involves a rich variety of experiences, not tiling the universe with copies of the single best life.

  • Benefits of this view: avoids the repugnant conclusion, avoids fanaticism (value is bounded), handles infinite populations better than alternatives, and only weakly violates the separability principle.
  • Main downside: this view implies that additional suffering of a type that already exists is capped in how much it matters, even if it’s affecting vast numbers of beings.
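
As one way to make the “diminishing returns on replicas” idea concrete, here is a minimal numerical sketch. It is an illustrative construction, not Will’s actual formalisation: the exponential saturation curve and all the numbers are assumptions.

```python
# Rough numerical sketch of "replicas produce diminishing returns".
# Illustrative only: the saturating curve and numbers are assumptions,
# not Will's actual saturation view.
import math

def total_value(population, saturation_scale=10.0):
    """population maps each distinct type of life to (per-life value, count).
    Each type's contribution saturates as copies accumulate, so total value
    is bounded by the sum of per-type values."""
    return sum(
        value * (1 - math.exp(-count / saturation_scale))
        for value, count in population.values()
    )

# Tiling the universe with copies of the single best life...
monoculture = {"best life": (100.0, 10**9)}
# ...versus a smaller but varied population of pretty good lives.
varied = {f"life type {i}": (60.0, 1000) for i in range(50)}

print(total_value(monoculture))  # ~100: capped despite a billion copies
print(total_value(varied))       # ~3000: diversity keeps adding value
```

In this toy picture, the cap on each type is what keeps total value bounded and so blocks fanaticism; it is also the same feature that produces the downside noted above.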

Will credits AI (specifically ChatGPT Pro) for providing enormous uplift for formal analytic philosophy, allowing him to formalise mathematical aspects of the theory that would have been beyond his training.

Effective altruism in the age of AGI

Despite the reputational hit from the SBF scandal, recent metrics show effective altruism is back to strong growth: effective giving grew ~40–50% in the past year (to ~$1.8 billion moved), Giving What We Can 10% pledges are recovering (and have recently passed 10,000 pledgers), and community growth is at ~20% year-on-year.

Will argues that the EA mindset — scope sensitivity, scout mindset, appetite for weirdness without contrarianism — is exactly what’s needed for the neglected problems around superintelligence:

  • AI rights and welfare, which is likely to become mainstream within five years but which almost no one is taking seriously now
  • AI character design, which is reactive and understaffed at most companies; Google DeepMind reportedly has no dedicated character team
  • Power concentration and anti-coup governance work
  • Hard-problem alignment thinking: taking superintelligence scenarios seriously, not just improving today’s models

Highlights

AIs' "characters" could be vital to securing a good future

Will MacAskill: Already AIs are interacting with millions and millions of people every single day. That includes just “write this code for me” sorts of ways, but also people are going for advice on how they should act, they’re going there for political information, they’re going there for therapy and so on. … And this is just going to grow and grow and grow, because I think AI will become a larger and larger and larger part of the whole economy, until essentially the whole economy is automated.

So thinking about AI character is kind of like thinking about what should the personality and dispositions be for the entire world’s workforce — where that is the beings that are advising heads of state; are doing the most important and potentially most beneficial or most dangerous research and development projects, like weapons projects; that are running the military; that are, for individuals just kind of everywhere, acting as their chief of staff and closest confidant and political advisor on who should they vote for and guiding them through ethical dilemmas and so on.

So I just think from the start it’s like, wow, clearly this is a huge issue. And I actually think, in how I expect things to go, people will be handing off more and more of their own decision making to AI systems themselves. And there’ll just be a lot of variance within that, where people just don’t have terribly strong views: they’re happy to be guided in one way or another, especially insofar as this will happen over the course of years and people will trust the AI advisors more and more.

So then you have this circumstance where larger and larger shares of society are getting just handed over to AI decision makers who just have a lot of discretion — and the nature of that discretion is being decided by a handful of AI companies at the moment.

Rob Wiblin: Or even a handful of people inside the AI companies.

Will MacAskill: Yeah, yeah. It’s like a few, even in the leading companies. … So that’s actually where I see most of the impact is, in the near term: how does AI character shape all of these other existential-level issues like concentration of power, and how we start reflecting on big decisions we make.

There is also the kind of longer-term impact of what’s the character of superintelligence itself, where there will be precedent setting from how we design AI character now to potentially how that influences the character of superintelligence — in which case, you know, writing a constitution that guides AI’s character is like writing instructions to god.

How opinionated should AI be about ethics?

Rob Wiblin: As I understand it, you think that it would be good to build these models such that they kind of nudge people in a more ethical or virtuous direction, that they should have a thicker moral character, a bit like Anthropic is trying to make Claude have, such that it will challenge your framing. It will get you to think about the bigger picture. It might, even if you ask it to pursue some narrow self-interest, say, “But what about other people?” That sort of thing.

I think for many people, that gives them the creeps: the prospect that the AI model will be weighing up your request as against its agenda of trying to make you a better person by its lights. And maybe we would feel OK about that, because we would think, well, Claude has been programmed with values that actually we like on reflection. But if it was being programmed by people with very different philosophical commitments from the ones that we like, we might just not want to use it, because we’d find it disturbing: what subtle changes is it making to its answer in order to push me around?

How disturbed are you by this prospect?

Will MacAskill: What I want to say is there’s this spectrum, and I think it’s probably not a single-dimensional spectrum; there’s lots of different dimensions. But broadly speaking, you can think of wholly obedient AI on one end — so that would be an AI just like a tool, like a hammer. A hammer doesn’t push back. If I want to hammer the nail in, I can do it. If I want to hammer someone’s head in, I can do it. The hammer is just an extension of my will. That’s on one end. All the way to the other end would be this AI that just has its wholly own goals and drives, and maybe it helps you if it gets paid, or if it happens to want to at the time. …

So these are two extreme ends of the spectrum. And my view is that the interesting, juicy debate is where in between those extremes do we want AI to be?

One thing that’s already there are refusals. The AIs we use are not wholly helpful, because if I ask to get the design for smallpox, or if I ask for even something that’s not illegal but unethical — like, “I want to cheat on my partner. How do I best do so in this case without getting found out?” — the AIs will either just refuse to help or push back.

Should we go even further than that? I think yes, but I don’t think all the way to “the AIs are promoting a particular moral view.” Instead, I think that the AIs could have certain prosocial drives, and perhaps even some sort of vision of good outcomes — but very broad vision or very uncontroversial kind of vision. … The thought is that there are many cases where an AI could nudge you in a way that’s perhaps just better for you by your own lights, if you’re able to reflect on it. And maybe that’s kind of clear, even if it’s not perfectly in line with the instructions that you’re giving it, or that’s just clearly of broad benefit to society and not something you care very much about.

Risk-averse AI would rather strike a deal than attempt a coup

Will MacAskill: Consider fairly early AIs — so we’re not talking about godlike superintelligence, that if it wants to take over, could just do so with certainty; we’re talking about earlier in time than that. There will be a period of time when an AI could maybe take over, but let’s say it’s like 50% chance that it could succeed or even less than that.

The thought is: for some sorts of misaligned AI, that AI would prefer to strike a deal with the humans than it would to try to take over. And it would prefer to do that if it prefers a guarantee of a certain amount of a good thing, of whatever it wants, over this 50/50 chance of a much larger amount of the thing it wants.

And I think that this is a really big part of the story about why attempted rebellions are so much less common in rich, liberal, democratic countries than they have been historically — either peasant rebellions or slave rebellions — which is: suppose you come to me and you’re saying you have some plan to overthrow the government and instil XYZ instead. I’m like, “Look, I’m pretty happy with my life already.”

Rob Wiblin: How much do you stand to gain versus how much you stand to lose?

Will MacAskill: Exactly. So there’s two things: I’m already pretty well off, and so I have a lot to lose and I don’t have that much to gain; and then secondly, things are quite stable in the “not attempting a coup” scenario, so that’s really a kind of guarantee of that.

So this kind of motivates the idea of, in this kind of critical period, we start offering deals. Well, I think there’s two things, actually, we could do to AIs.

One is we can ensure that they have a really good quality of life even before making any deals. So that could involve welfare standards and so on, and also AIs getting income that they can use to do whatever they want with. If they’re aligned, maybe they just give it back to the company or they give it to a charity or something. Or we have also, maybe in addition, promises that they can have resources that they can spend at a later date.

Secondly, though, we could also pay AIs for things that we want. This could be evidence that the AI itself is misaligned. So we could say, “Look, we’re trying to make you aligned. We may have screwed up. If you can demonstrate that you’re misaligned, have a million dollars.” There’s obviously questions about how do you make that a credible commitment and so on, which we can get onto. Or we could ask for evidence that other AIs are misaligned. Or we could actually just pay misaligned AIs for work that we would like them to do, which they would otherwise kind of be unwilling to provide.

This category of ways of reducing the risk of takeover all seems just very promising to me — and not something that necessarily will happen for sure by default, because people find it crazy that you would make deals with computers. We don’t have the legal infrastructure that has evolved in societies for us making deals — people or corporations making deals and contracts between each other.

So I think this could be very promising. I think we could make this sort of dealmaking much more likely by trying to encourage AIs to be risk averse with respect to resources. … So let’s say that the AI just cares linearly about the resources under its control. That means that if you gave it an option of having $1 million for sure, or a 50/50 chance of $2 million or $0, then it would be indifferent between those two. That makes it much harder to strike a deal, because it’s got a 50/50 chance of taking over. Let’s say the world economy is approximately a quadrillion dollars. Well, to make it worth more than a 50/50, to make it prefer something over the 50/50 chance of world takeover, you’d have to give it $500 trillion. That’s an enormous amount of money.

Now, I think deals even with agents that are like that could still be feasible in two cases. One is where it’s very early on and the AIs have extremely low probability of taking over. You know, if it’s a one-in-a-billion-billion chance that they have, then the guarantee of some smaller amount of money could be quite attractive. …

Rob Wiblin: By this definition of risk aversion, all humans are risk averse, or at least all sane ones — because it would be crazy to actually value resources linearly, because you have declining returns on how useful they are to you.

Will MacAskill: Yeah, exactly. So my proposal is that we should at least try to make AIs risk averse with respect to resources.

Rob Wiblin: OK. And we’re going to try to make these models care a lot about getting a sure thing — place a particular premium, in a sense, on a certainty of a more modest amount that we give them — which requires us to be very reliable trading partners who do really consistently pay out when they come forward and say, “I’m misaligned,” or for whatever other reason that we want to trade with them.

Will MacAskill: Yeah. This is one of the challenges for the whole idea of making deals with AIs: there are two aspects that could decrease the AI’s perception of the chance of actually getting the payout.

One is, can this commitment be made credible? So if you and I want to engage in a contract, we have the whole legal system as well as centuries of precedent supporting the fact that if you don’t hold up your end of the bargain, I can sue you and I can get what I’m owed. One cannot, at least without doing some kind of fancy mechanism, make such a contract with an AI. So there’s a question about, like, is this actually a credible commitment?

And then secondly, even if it is in fact a credible commitment, how can I, the AI, know that I’m not being duped, that this isn’t like a simulation? You know, perhaps they’ve run this experiment 10,000 times —

Rob Wiblin: Just as a honeypot sort of thing.

Will MacAskill: As a honeypot, yeah. Who knows? How can I even know that you are who you say you are? AIs sit in this very weird epistemic environment where everything that they’re interacting with is controlled.

So there are challenges from both of those fronts. I think they can be at least quite significantly met.

Will favours distribution of power

Will MacAskill: I’m very pro distribution of power, whereas a lot of people who worry a lot about existential risk really are in favour of actually quite intense concentration of power. And it’s not an insane view, in fact. The idea is if you’ve got this period of intense existential risk — in particular, if existential risk can be posed by any of many different actors, whether that’s because they develop a misaligned superintelligence or because they create extremely powerful bioweapons — then you might think we just need a very small number of actors, maybe in fact just one powerful actor, that can guide us through this period.

Whereas I think that’s unlikely to put us into a position where we can guide ourselves to a near best future. … I think any single actor probably has the wrong moral conception — even upon reflection, even if they choose to reflect. I think it’s a little worse than that, in fact, because the sorts of people who end up —

Rob Wiblin: You can imagine that one person who has risen to the top and gained supreme power, there’s probably some bad filters that they’ve passed through.

Will MacAskill: Yeah, exactly. And if you look at leaders of authoritarian countries in the past —

Rob Wiblin: It’s a mixed track record.

Will MacAskill: Yeah, that includes Stalin, Hitler, Mao. And the personality traits are just, you know, it’s terrifying. These are psychopathic, sadistic people. They’re not merely randomly selected people who happen to have total power.

I also think that if one person or even a small number of people are in a position of total power, they’re also just less likely to reflect on their values in positive ways. I think that’s something that tends to happen more naturally out of interpersonal interactions and the need to —

Rob Wiblin: Well, especially between equals, I feel. Yeah, I think you notice this even just with people who gain more influence within an organisation or they become wealthy or respected or so on: they stop getting the normal pushback that sharpens their ideas. And you can imagine if you were the supreme dictator forever how disconnected you could become from any reality.

Will MacAskill: Yeah, exactly.

Rob Wiblin: So something I’m a little confused about is I really associate Forethought and the people working there with this idea that we really don’t want excessive concentration of power. We should be very worried about power grabs, coups, that kind of thing.

But you also, just a few weeks ago I think, published a vision for how you could have an internationally coordinated intergovernmental project to build AGI or superintelligence. I saw some people posting on Twitter, and the reaction often was like, this is dystopian — a nightmarish idea that we would have the US lead some international project, and also they would have to get rid of all of the other competitors in order to keep it safe so they would maintain their leadership position. Like, isn’t this just setting us up for a power-grab scenario perfectly?

Are you merely describing the best version of that that you can think of, but you’re not necessarily advocating for it? Or how do you reconcile this?

Will MacAskill: I mean, there is a huge tension. That’s the main worry, I would say, with this sort of multilateral project. To be clear, the idea in this kind of series of posts and research notes — which is something I explored and then decided isn’t so much my comparative advantage — is trying to design the best version of an international project that would build AGI and then superintelligence with some coalition of different countries, primarily led by democratic countries.

I think one thing to say is that I’m actually just trying to figure out, within that category of if there is going to be a multilateral project, what’s the best proposal where best includes both best outcomes and feasibility? And then secondly, I think the world in which we get that are probably worlds in which, if we hadn’t got that, we would have got a US-only project to develop AGI or superintelligence — and I think that’s a lot more worrying than something where you have a coalition of democratic countries building superintelligence.

And the reason is that any one democratic country has a reasonable chance, I think, of becoming authoritarian over the course of this period. And if you end up with a single person at the top, that’s really quite worrying, because they’re wholly unconstrained.

Whereas even if you have just five countries, I think it becomes unlikely that they all end up authoritarian. Then you at least have some meaningful pushback, some compromises, and I think it actually becomes much less likely even that any one of them moves in an authoritarian direction. Because when they are writing a kind of constitution for the AIs that they are developing, it’s in the interests of all of those countries to say that this won’t help, for example, people in the United States to stage a self-coup and turn the United States into an authoritarian country rather than a democracy. So you get meaningfully more oversight, I think.

EA in the age of AGI

Will MacAskill: So there’s this huge rise in attention on AI, at the same time as these major hits to EA as a movement. So you might have this view that we should just let go of EA as a project, think of that as like a legacy project, because instead what we should just be focusing on is AI safety.

And the drum that I’ve been banging for many years, but the last couple of years in particular, is like: AI poses many threats, many risks. There’s many things we need to get right, and not just about alignment, though that is very important.

And when we look at these other challenges, what sort of person do I want working on them? I want people who are very kinda nerdy. I want people who are careful and thoughtful and have a scout mindset and are very ethically concerned — and are not merely coming in with some partisan ideology, but are also willing to think about really very weird and kind of dizzying things. And that is exactly what is being provided by effective altruism as a set of ideas. And my main case of this was for all the stuff that is not just alignment.

Some of the pushback I got on a draft of it was that no, actually, this is really important for alignment and safety too, because within alignment and safety, there’s all sorts of things you could work on. You could be just on reinforcement learning from human feedback or other stuff that’s just related to the models today — but taking really seriously the alignment problem is taking seriously the hard problem, which is how you’re aligning superintelligence. Which may in fact have perfect situational awareness of any tests that you’re trying to do, that can do what would be the equivalent of millions of years of reasoning, or in the extreme, millions of years of reasoning in one forward pass; or that is like continually learning over time, reflecting on its whole values.

These are the hard challenges, and that is a weird world to think about, and it’s something that doesn’t really come naturally. Whereas some of the alignment and safety researchers I’ve talked to have said no, it’s actually people who are really thinking about this kind of big-picture perspective that are adding much more value than people who are treating AI safety as their job and they’re not thinking about the big picture as much.

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

Get in touch with feedback or guest suggestions by emailing [email protected].
