Anonymous advice: If you want to reduce AI risk, should you take roles that advance AI capabilities?

By Benjamin Hilton · Published October 2022

Image generated by DALL-E 2.

Table of Contents

1 Advice from 11 anonymous experts
2 Learn more

We’ve argued that preventing an AI-related catastrophe may be the world’s most pressing problem, and that while progress in AI over the next few decades could have enormous benefits, it could also pose severe, possibly existential risks. As a result, we think that working on some technical AI research — research related to AI safety — may be a particularly high-impact career path.

But there are many ways of approaching this path that involve researching or otherwise advancing AI capabilities — meaning making AI systems better at some specific skills — rather than only doing things that are purely in the domain of safety. In short, this is because:

Capabilities work and some forms of safety work are intertwined.
Many available ways of learning enough about AI to contribute to safety are via capabilities-enhancing roles.

So if you want to help prevent an AI-related catastrophe, should you be open to roles that also advance AI capabilities, or steer clear of them?

We think this is a hard question! Capabilities-enhancing roles could be beneficial or harmful. For any role, there are a range of considerations — and reasonable people disagree on whether, and in what cases, the risks outweigh the benefits.

So we asked the 22 people we thought would be most informed about this issue — and who we knew had a range of views — to write a summary of their takes on the question. We received 11 really interesting responses, and think that these are likely a reasonable representation of the range of views held by the broader set of people.

We hope that these responses will help inform people making difficult decisions about working in roles that might advance AI capabilities. We also used these responses to help write our review of working at leading AI labs.

If you can’t follow some of the below, don’t worry! Check out our problem profile on preventing an AI-related catastrophe for an introduction to the terms, concepts, and arguments referenced here.

The following are written by people whose work we respect and who would like to remain anonymous. These quotes don’t represent the views of 80,000 Hours, and in some cases, individual pieces of advice may explicitly contradict our own. Nonetheless, we think it’s valuable to showcase the range of views on difficult topics where reasonable people might disagree.

We’ve included the responses from these 11 experts in full. We’ve only made minor edits for clarity and ease of reading.

Advice from 11 anonymous experts

Expert 1: Right now I wish AI labs would slow down on the margin, but…

Right now I wish AI labs would slow down on the margin, but I don’t think it’s obvious that capabilities work is net negative and reasonable people can disagree on this point. What I say below would probably change if I were highly confident it’s extremely bad to advance capabilities (especially if I believed that advancing capabilities by a month is much more bad than advancing alignment research by a month is good). With that said:
AI labs are a great place to pick up good skills, especially for people doing technical roles (ML engineer, ML researcher) at those labs. If you’re early career and can get into one and think it’d be more interesting and a better personal fit for you than other roles you’re considering, you should probably go for it — on a pretty wide distribution of views (including those that think capabilities enhancement is pretty net negative), the investment in your human capital probably creates more good than your contribution to capabilities (at a junior level) creates harm.
It’s worth specifically asking about working on the safety teams of capabilities organizations and not assuming you have to choose between a pure capabilities role and no role; this won’t always work but I’d expect a reasonable fraction of the time you can tilt your role toward more safety projects (and this will probably — though not always — make your on-the-job learning a little more relevant/useful to future safety projects you might do).
There are some roles (usually more senior ones) that seem super leveraged for increasing overall capabilities and don’t really teach you skills that are highly transferable to safety-only projects — for example, fundraising for an AI lab or doing comms for an AI lab that involves creating hype around their capabilities results. These seem likely to be bad unless you really back the lab and feel aligned with it on views about safety and values.
You should always assume that you are psychologically affected by the environment you work in. In my experience, people who work at capabilities labs tend to systematically have or develop views that AI alignment is going to be fairly easy and I’d guess that this is in significant part due to motivated reasoning and social conformity effects. I think I’d be most excited by an AI alignment researcher who spends some time at AI labs and some time outside those environments (either at a safety-only company like ARC or Redwood, a safety-focused academic group, or doing independent research). It seems like you get important perspective from both environments, and it’s worth fighting the inertia of staying in a capabilities role that’s comfortable, continually getting promoted, and never quite getting around to working on the most core safety stuff.

Expert 2: There are some portions of AGI safety that are closely tied to capabilities…

There are some portions of AGI safety that are closely tied to capabilities, in particular amplification and variants, and I think these are worth pursuing on net (though without high confidence). That is, a common message expressed these days is that EA folk should only work on purer areas of safety, but I think that’s suboptimal given that a complete solution involves the capability-mixed areas as well.

Expert 3: If humanity gets artificial general intelligence well before it knows how to aim it…

If humanity gets artificial general intelligence well before it knows how to aim it, then humanity is likely to kill itself with it, because humanity can’t coordinate well enough to prevent the most-erroneously-optimistic people from unleashing a non-friendly superintelligence before anyone knows how to build a friendly one. As such, in the current environment — where capabilities are outpacing alignment — the first-order effect of working on AI capabilities is to hasten the destruction of everything (or, well, of the light-cone originating at Earth in the near future). This first-order effect seems to me to completely dominate various positive second-order effects (such as having more insight into the current state of capabilities research, and being able to culturally influence AI capabilities researchers). (There are also negative second-order effects to working on capabilities in spite of the negative first-order effects, like how it undermines the positive cultural effect that would come from all the conscientious people refusing to work on capabilities in the absence of a better alignment story.)
That said, the case isn’t quite so cut-and-dried. Sometimes, research in pursuit of alignment naturally advances capabilities. And many people can’t be persuaded off of capabilities research regardless of the state of alignment research. As such, I’ll add that prosocial capabilities research is possible, so long as it is done strictly in private, among a team of researchers that understands what sort of dangers they’re toying with, and which is well capable of refraining from deploying unsafe systems. (Note that if there are multiple teams that believe they have this property, the forward light-cone gets destroyed by the most erroneously optimistic among them. The team needs to not only be trying to align their AIs, but capable of noticing when the deployed system would not be friendly; the latter is a much more difficult task.) By default, the teams that claim they will be private when it matters will happily publish a trail of breadcrumbs that lead anyone who’s paying attention straight to their capabilities insights (justified, perhaps, by arguments like “if we don’t publish then we won’t be able to hire the best people”), and will change over to privacy only when it’s already too late. But if you find a team that’s heavily focused on alignment, and that’s already refusing to publish, that’s somewhat better, in my estimation.
My own guess is still that the people saying “we can’t do real alignment work until we have more capabilities” are, while not entirely wrong, burning through a scarce and necessary resource. Namely: yes, there is alignment work that will become much easier once we have real AGIs on our hands. But there are also predictable hurdles that will remain even then, that require serial time to solve. If humanity can last 5 years after inventing AGI before someone destroys the universe, and there’s a problem that takes 20 years to solve without an AGI in front of you and 10 years to solve with an AGI in front of you, then we’d better get started on that now, and speeding up capabilities isn’t helping. So even if the team is very private, my guess is that capabilities advancements are burning time that we need. The only place where it’s clearly good, in my book, to advance capabilities, is when those capabilities advances follow necessarily from advancing our knowledge of AI alignment in the most serially-bottlenecked ways, and where those advancements are kept private. But few can tell where the serial bottlenecks are, and so I think a good rule of thumb is: don’t advance capabilities; and if you have to, make sure it’s done in private.
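One way to make the arithmetic in the 5/20/10-year example concrete is the minimal Python sketch below. It is only an illustration, not the expert's own model, and it rests on an assumption the quote does not state: that progress on the problem accumulates linearly, at 1/20 of the problem per year before AGI and 1/10 per year once an AGI is available. The function name `required_head_start` and its parameters are hypothetical.

```python
from fractions import Fraction


def required_head_start(years_without_agi, years_with_agi, years_after_agi):
    """Years of pre-AGI work needed to finish within the post-AGI window.

    Assumption (not from the source): progress is linear and additive,
    so each pre-AGI year solves 1/years_without_agi of the problem and
    each post-AGI year solves 1/years_with_agi of it.
    """
    rate_before = Fraction(1, years_without_agi)
    rate_after = Fraction(1, years_with_agi)
    # Share of the problem that can be solved in the window after AGI arrives.
    solved_after = min(Fraction(1), rate_after * years_after_agi)
    # Whatever is left must already be done before AGI arrives.
    remaining = 1 - solved_after
    return remaining / rate_before


# The numbers from the example: 20 years without AGI, 10 with, 5-year window.
head_start = required_head_start(20, 10, 5)
print(head_start)  # 10
```

Under these assumptions, the five years after AGI cover only half of the problem, so the other half needs ten years of work beforehand, which is the sense in which the serial time cannot be recovered later.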

Expert 4: There are lots of considerations in both directions of comparable magnitudes…

There are lots of considerations in both directions of comparable magnitudes (according to me), so I think you shouldn’t be confident in any particular answer. To name a few: (1) more aligned people gaining relevant skills who may later work directly on reducing x-risk increases the expected quality of x-risk reduction work (good), (2) shorter timelines mean less time to work on alignment and governance (bad), (3) shorter timelines mean fewer actors building AGI during crunch time (good), (4) more aligned people in relevant organizations can help build the political will in those organizations to address safety issues (good), (5) shorter timelines mean less time for aligned people to “climb the ladder” to positions of influence in relevant organizations (bad), (6) shorter timelines mean less time for geopolitics to change (sign unclear).
My main piece of advice is not to underestimate the possibility of value drift. I would not be surprised to hear a story of someone who went to OpenAI or DeepMind to skill up in ML capabilities, built up a friend group of AI researchers, developed an appreciation of the AI systems we can build, and ultimately ended up professing agreement with some or the other reason to think AI risk is overblown, without ever encountering an argument for that conclusion that they would endorse from their starting point. If you are going to work in a capabilities-enhancing role, I want you to ensure that your social life continues to have “AI x-risk worriers” in it, and to continue reading ongoing work on AI x-risk.
If I had to recommend a decision without knowing anything else about you, I’d guess that I’d be (a) in favor of a capabilities-enhancing role for skilling up if you took the precautions above (and against if you don’t), (b) in favor of a capabilities-enhancing role where you will lobby for work on AI x-risk if it is a very senior role (and against if it is junior).

Expert 5: There isn’t at present any plan for not having AGI destroy the world…

There isn’t at present any plan for not having AGI destroy the world. It’s been justly and validly compared to “Don’t Look Up,” but there’s companies pulling in the asteroid in the hope they can turn a profit, except they don’t even really have a plan for not everyone dying. Under those circumstances, I don’t think it’s acceptable — consequentialistically, deontologically, or as a human being — to “burn the capabilities commons” by publishing capabilities advances, opening models, opening source code, calling attention to techniques that you used to make closed capabilities advances, showing off really exciting capabilities that get other people excited and entering the field, or visibly making it look like AI companies are going to be super profitable and fundable and everyone else should start one too.
There’s job openings in AI all over, unfortunately. Take a job with a company that isn’t going to push the edge of capabilities, isn’t going to open-source anything, isn’t going to excite more entrants to the ML field; ideally, get a clear statement from them that their research will always be closed, and make sure you won’t be working with other researchers that are going to be sad about not being able to publish exciting papers for great prestige.
Closed is cooperating. Open is defecting. Make sure your work isn’t contributing to humanity’s destruction of humanity, or don’t work.
Try not to fall for paper-thin excuses about far-flung dreams of alignment relevance either.

Expert 6: I think that the simple take of “capabilities work brings AGI closer which is bad because of AI x-risk” is probably…

I think that the simple take of “capabilities work brings AGI closer which is bad because of AI x-risk” is probably directionally correct on average, but such a vast oversimplification that it’s barely useful as a heuristic.
There are many different ways in which capabilities work can have both positive and negative effects, and these can vary a lot depending both on what the work is and how it is used and disclosed. Here are some questions I would want to consider when trying to judge the net effect of capabilities work:
What is the direct effect on AGI timelines? Capabilities work that directly chips away at plausible bottlenecks for AGI (I’ll call such work “AGI-bottleneck” work) is likely to make AGI arrive sooner. The biggest category I see here is work that improves the efficiency of training large models that have world understanding, whether via architectural improvements, optimizer improvements, improved reduced-precision training, improved hardware, etc. Some work of this kind may have less of a counterfactual impact: for example, the work may be hard to build upon because it is idiosyncratic to today’s hardware, software or models, or it may be very similar to work being done by others anyway.
What is the effect on acceleration? Capabilities work can have an indirect effect on AGI timelines by encouraging others to either (a) invest more in AGI-bottleneck capabilities work, or (b) spend more on training large models, leading to an accelerated spending timeline that eventually results in AGI. At the same time, some capabilities work might encourage others to work on alignment, perhaps depending on how it is presented.
What is the effect on takeoff speeds? Spending more on training large models now could lead to a slower rate of growth of spending around the time of AGI, by reducing the “spending overhang”. This could improve outcomes by giving the world more time with near-AGI models, which would increase attention on AI alignment, make alignment more empirically tractable, and make it easier for institutions to adapt. Of course, spending more on training large models likely involves some AGI-bottleneck capabilities work, and the benefits are limited by the fact that not all alignment research requires the most capable models and that the AI alignment community is growing at least somewhat independently of capability advancements.
What is the effect on misalignment risk? Some capabilities work can make models more useful without increasing misalignment risk. Indeed, aligning large language models makes them more useful (and so can be considered capabilities work), but doesn’t give the base model a non-trivially better understanding of the world, which is generally seen as a key driver of misalignment risk. This kind of work should directly reduce misalignment risk by improving our ability to achieve things (including outcompeting misaligned AI, conducting further alignment research, and implementing other mitigations) before and during the period of highest risk. It’s also worth considering the effects on other risks such as misuse risk, though they are generally considered less existentially severe.
What is the effect on alignment research? Some capabilities work would enable new alignment work to be done, including work on outer alignment schemes that involve AI-assisted evaluation such as debate, and empirical study of inner misalignment (though it’s hotly debated how far in advance of AGI we should expect the latter to be possible). Other capabilities work may enable models to assist or conduct alignment research. In fact, a lot of AGI-bottleneck work may fall into this category. Of course, a lot of current alignment research isn’t especially bottlenecked on model capabilities, including theoretical work and interpretability.
How will the work be used and disclosed? The potential downsides of capabilities work can often be mitigated, perhaps entirely, by using or disclosing the work in a certain way or not at all. However, such mitigations can be brittle, and they can also reduce the alignment upsides.
Overall, I don’t think whether a project can be labeled “capabilities” at a glance tells you much about whether it is good or bad. I do think that publicly disclosed AGI-bottleneck work is probably net harmful, but not obviously so. Since this view is so sensitive to difficult judgment calls, e.g. about the relative value of empirical versus theoretical alignment work, my overall advice would be to be somewhat cautious about such work:
Avoid AGI-bottleneck work that doesn’t have either a clear alignment upside or careful mitigation, even if there is a learning or career upside. Note that I wouldn’t consider most academic ML work AGI-bottleneck work, since it’s not focused on things like improving training efficiency of large models that have world understanding.
For AGI-related work that targets alignment but also impacts AGI bottlenecks, it’s worth discussing the project with people in advance to check that it is worthwhile overall. I’d expect the correct outcome of most such discussions to be to go ahead with the project, simply because the effect of a single project that is not optimizing for something is likely very small compared to that of a large number of projects that are optimizing for that thing. But the stakes are high enough that it is worth going through the object-level considerations.
Work that is only tangentially AGI-related, such as an ML theory project or applying ML to some real-world problem, deserves less scrutiny from an AGI perspective, even if it can be labeled “capabilities”. The effect of such a project is probably dominated by its impact on the real-world problem and on your learning, career, etc.
Students: don’t sweat it. The vast majority of student projects don’t end up mattering very much, so you should probably choose a project you’ll learn the most from (though of course you’re more likely to learn about alignment if the project is alignment-related).

Expert 7: Timelines are short, we are racing to the precipice…

Timelines are short, we are racing to the precipice, and some capabilities-advancing research is worth it if it brings a big payoff in some other way, but you should by default be skeptical.

Expert 8: Overall, I think there is a lot of value for people who are concerned about AI extreme/existential risks to…

Overall, I think there is a lot of value for people who are concerned about AI extreme/existential risks to work in areas that may look like they primarily advance AI capabilities, if there are other good reasons for them to go in that direction. This is because: (1) the distinction between capabilities and safety is so fuzzy as to often not be useful; (2) I anticipate safety-relevant insight to come from areas that today might be coded as capabilities, and so recommend a much more diversified portfolio for where AI risk-concerned individuals skill-up; (3) there are significant other benefits from having AI risk-aware individuals prominent throughout AI organizations and the fields of ML; (4) capabilities work is and will be highly incentivized far in excess of the marginal boost from 80k individuals, in my view.
The distinction between capabilities and safety is one that makes sense in the abstract, and is worth attending to. Labs that try to differentially work on and publish safety work, over capabilities work, should be commended. Philanthropic funders and other big actors should be thoughtful about how their investments might differentially boost safety/alignment, relative to capabilities, or not. That being said, in my view when one takes a sophisticated assessment, in practice it is very hard to clearly draw a distinction between safety and capabilities, and so often this distinction shouldn’t be used to guide action. Interpretability, robustness, alignment of near-term models, and out-of-sample generalization are each important areas which plausibly advance safety as well as capabilities. There are many circumstances where even a pure gain in safety can be endangering, such as if it hides evidence of later alignment risk, or incentivizes actors to deploy models that were otherwise too risky to deploy.
In my judgment, the field of AI/AGI safety has gone through a process of expansion to include ever more approaches which were previously seen as too far away from the most extreme risks. Mechanistic interpretability and aligning existing deep learning models are today regarded by many as valuable long-term safety bets, whereas several years ago they were much more on the periphery of consideration. I expect in the future we will come to believe that expertise in areas that may look today like capabilities (robustness, out-of-sample generalization, security, other forms of interpretability, modularity, human-AI interaction, continuous learning, intrinsic motivations) is a critical component of our AGI safety portfolio. At the least it can be useful to have more “post-doctoral” level work throughout the space of ML, to then bring insights and skills into the most valuable bets in AI safety.
Developing a career in other areas that may code as “capabilities” could lead to individuals being prominent in various fields of machine learning, and to having important and influential roles within AI organizations. I believe there is a lot of value in the community of those concerned about AI risks having a broad understanding of the field of ML and of AI organizations, and a broad ability to shape norms. In my view, much of the benefit of “AI safety researchers” does not come from the work they do, but from their normative and organizational influence within ML science and the organizations where they work. I expect critical safety insights will have to be absorbed and implemented by “non-safety” fields, and so it is valuable to have safety-aware individuals in those fields. Given this view, it makes sense to diversify the specialisms which AI risk-concerned individuals pursue, and to upweight those directions which are especially exciting scientifically or valuable to the AI organizations likely to build powerful AI systems.
Capabilities work is already highly incentivized to the tune of billions of dollars and will become more so in the future, so I don’t think on the margin AI risk-motivated individuals working in these spaces would boost capabilities much. To try to quantify things, there were around 6,000 authors attending NeurIPS in 2021. Increasing that number by 1 represents an increase of 1/6,000. By contrast, I think the above benefits to safety of having an individual learn from other fields, potentially be a leader of a new critical area in AI safety, and otherwise be in a potentially better position to shape norms and organizational decisions, are likely to be much larger. (A relevant belief in my thinking is that I don’t believe shortening timelines today costs us that much safety, relative to getting us in a better position closer to the critical period.) Note that this last argument doesn’t apply to big actors, like significant labs or funders.

Expert 9: At the current speed of progress in AI capabilities compared to our advances on alignment…

At the current speed of progress in AI capabilities compared to our advances on alignment, it’s unlikely that alignment will be solved in time, before the first AGI is deployed. If you believe that alignment is unlikely by default, this is a pretty bad state of affairs.
Given the current situation, any marginal slowdown of capabilities advancements and any marginal acceleration of work on alignment is important if we hope to solve the problem in time.
For this reason, individuals concerned about AI safety should be very careful before deciding to work on capabilities, and should strongly consider working on alignment and AI safety directly whenever possible. This is especially the case as the AI field is small and has an extreme concentration of talent: top ML researchers and engineers single-handedly contribute to large amounts of total progress.
Therefore, it’s particularly important for very talented people to choose wisely what they work on: each talented individual choosing to work on AI safety over capabilities has double the amount of impact, simultaneously buying more time before AGI while also speeding up alignment work.
A crucial thing to consider is not only the organization, but the team and type of work. Some organizations that are often criticized for their work on capabilities in the community have teams that genuinely care about alignment, and work there is probably helpful. Conversely, some organizations that are very vocal about their focus on safety have large teams focusing on accelerating capabilities, privately or publicly, and work in those teams is probably harmful.
The relationship between capabilities and valuable alignment work is not binary, and much of the most promising alignment work also has capabilities implications, but the reverse is rarely true, and only accidentally so.
Some organizations and individuals, including some closely affiliated to EA, take the view that speeding up progress now is fine as the alignment problem is primarily an engineering, empirical problem, and more advanced models would allow us to do better empirical research on how to control AGI.
Another common view is that speeding up progress for “friendly” actors, such as those that claim to care more about safety, have ties to EA, and are located in non-authoritarian countries, is necessary, as we would rather have the most safety-minded actors get to AGI first.
Those acting upon these views are being extremely irresponsible, and individuals looking to work on AI should resist these arguments when they are offered as an excuse to accelerate capabilities.

Expert 10: I think the EA community seems to generally overfocus on the badness of speeding capabilities…

I think the EA community seems to generally overfocus on the badness of speeding capabilities. I do think speeding capabilities is bad (all else equal), but the marginal impact of an engineer or researcher is usually small, and it doesn’t seem hard to outweigh it with benefits, including: empowering an organization to do better safety research, be more influential, etc.; and gaining career capital of all kinds for yourself (understanding of AI, connections, accomplishments, etc.).
However, if you are in this category, I would make an extra effort to:
Be an employee who pays attention to the actions of the company you’re working for, asks that people help you understand the thinking behind them, and speaks up when you’re unhappy or uncomfortable. I think you should spend 95%+ of your work time focused on doing your job well, and criticism is far more powerful coming from a high performer (if you’re not performing well I would focus exclusively on that, and/or leave, rather than spend time/energy debating company strategy and decision making). But I think the remaining 5% can be important — employees are part of the “conscience” of an organization.
Avoid being in a financial or psychological situation where it’s overly hard for you to switch jobs into something more exclusively focused on doing good; constantly ask yourself whether you’d be able to make that switch, and whether you’re making decisions that could make it harder to do so in the future.

Expert 11: In my expectation, the primary human-affectable determinant of existential risk from AI is…

In my expectation, the primary human-affectable determinant of existential risk from AI is the degree to which the first 2-5 major AI labs to develop transformative AI will be able to interact with each other and the public in a good-faith manner, enough to agree on and enforce norms preventing the formation of many smaller AI labs that might do something foolish or rash with their tech (including foolish or rash attempts to reduce x-risk). For me, this leads to the following three points:
Working on AI capabilities for a large lab, in a way that fosters good-faith relationships with that lab and other labs and the public, is probably a net positive in my opinion. Caveat: If you happen to have a stroke of genius insight that enables the development of AGI 6 months sooner than it otherwise would have been developed, then it’s probably a net negative to reveal that insight to your employer, but also you’d have some discretion in deciding whether to reveal it, such that you having an AI-capabilities-oriented job at a major (top 5) lab is probably worth it if you’re able to contribute significantly to good-faith relations between that lab and other labs and the public. I’d count a ‘significant positive contribution’ as something like “without deception, causing Major Lab A to lower by 1% its subjective probability that Major Lab B will defect against Major Lab A if Major Lab B gets transformative AI first.” Let’s call that a “1% good-faith-contribution”. I think a 0.1% good-faith-contribution might be too small to justify working on capabilities, and a 10% good-faith-contribution is more than enough.
If you feel your ability to model social situations is not adequate to determine whether you are making a significant positive contribution to the good-faith-ness of the relationships between major (top 5) AI labs and the public, my suggestion is that you should probably just not try to work on AI capabilities research in any setting, because you will not be well positioned to judge whether the lab where you work is developing the capabilities in a way that increases good faith amongst and around AI labs.
Work on rudimentary AI capabilities for a small lab is probably fine as long as you’re not pushing forward the state of the art, and as long as you’re not participating in large defections against major labs who are trying to prevent the spread of societally harmful tech. For instance, I think you should not attempt to reproduce GPT-4 and release or deploy it in ways that circumvent all the hard work the OpenAI team will have done to ensure their version of the model is being used ethically.

Speak to our team one-on-one

If you’re considering taking a role that might advance AI capabilities or are generally thinking through this question in relation to your own career, our advising team might be able to give you personal advice. (It’s free.) We’re excited about supporting anyone who wants to make reducing existential risks posed by AI a focus of their career. Our team can help you compare your options, make connections with others working on this issue, and possibly even help you find jobs or funding opportunities.

Learn more

Our problem profile on preventing an AI-related catastrophe
Ways people trying to do good accidentally make things worse, and how to avoid them
The 80,000 Hours Podcast on Artificial Intelligence (a collection of 10 key AI episodes from our podcast)
Our career review of working at leading AI labs
Our career review of technical AI safety research
Our career review of non-technical roles in leading AI labs
Our career review of becoming an expert in AI hardware

If you want to learn much more about risks from AI, here are a few general sources (rather than specific articles) that you might want to explore:

The AI Alignment Forum, which is aimed at researchers working in technical AI safety.
AI Impacts, a project that aims to improve society’s understanding of the likely impacts of human-level artificial intelligence.
The Alignment Newsletter, a weekly publication with recent content relevant to AI alignment with thousands of subscribers.
Import AI, a weekly newsletter about artificial intelligence by Jack Clark (cofounder of Anthropic), read by more than 10,000 experts.