After dropping out of his ML PhD at Stanford, Daniel Ziegler needed to decide what to do next. He’d always enjoyed building stuff and wanted to help shape the development of AI, so he thought a research engineering position at an org dedicated to aligning AI with human interests could be his best option.

He decided to apply to OpenAI, spent 6 weeks preparing for the interview, and actually landed the job. His PhD, by contrast, might have taken 6 years. Daniel thinks this highly accelerated career path may be possible for many others.

On today’s episode Daniel is joined by Catherine Olsson, who has also worked at OpenAI, and left her computational neuroscience PhD to become a research engineer at Google Brain. They share this piece of advice for those interested in this career path: just dive in. If you’re trying to get good at something, just start doing that thing, and figure out that way what’s necessary to be able to do it well.

To go with this episode, Catherine has even written a simple step-by-step guide to help others copy her and Daniel’s success.

Daniel thinks the key for him was nailing the job interview.

OpenAI needed him to be able to demonstrate the ability to do the kind of stuff he’d be working on day-to-day. So his approach was to take a list of 50 key deep reinforcement learning papers, read one or two a day, and pick a handful to actually reproduce. He spent a bunch of time coding in Python and TensorFlow, sometimes 12 hours a day, trying to debug and tune things until they were actually working.

Daniel emphasizes that the most important thing was to practice exactly those things that he knew he needed to be able to do. He also received an offer from the Machine Intelligence Research Institute, and so he had the opportunity to decide between two organisations focused on the global problem that most concerns him.

Daniel’s path might seem unusual, but both he and Catherine expect it can be replicated by others. If they’re right, it could greatly increase our ability to quickly get new people into ML roles in which they can make a difference.

Catherine says that her move from OpenAI to an ML research team at Google now allows her to bring a different set of skills to the table. Technical AI safety is a multifaceted area of research, and the many sub-questions in areas such as reward learning, robustness, and interpretability all need to be answered to maximize the probability that AI development goes well for humanity.

Today’s episode combines the expertise of two pioneers and is a key resource for anyone wanting to follow in their footsteps. We cover:

  • What is the field of AI safety? How could your projects contribute?
  • What are OpenAI and Google Brain doing?
  • Why would one decide to work on AI?
  • The pros and cons of ML PhDs
  • Do you learn more on the job, or while doing a PhD?
  • Why did Daniel think OpenAI had the best approach? What did that mean?
  • Controversial issues within ML
  • What are some of the problems that are ready for software engineers?
  • What’s required to be a good ML engineer? Is replicating papers a good way of determining suitability?
  • What fraction of software developers could make similar transitions?
  • How in-demand are research engineers?
  • The development of Dota 2 bots
  • What’s the organisational structure of ML groups? Are there similarities to an academic lab?
  • The fluidity of roles in ML
  • Do research scientists have more influence on the vision of an org?
  • What’s the value of working in orgs not specifically focused on safety?
  • Has learning more made you more or less worried about the future?
  • The value of AI policy work
  • Advice for people considering 23andMe

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app. Or read the transcript below.

The 80,000 Hours Podcast is produced by Keiran Harris.


Catherine Olsson: “AI safety” is not one thing. It’s definitely not one field. If anything, it’s a community of people who like to use the phrase “AI safety” to describe what they’re interested in. But if you look at what different groups or different people are working on, they’re very very different fields of endeavor. So you have groups that are trying to take deep reinforcement learning and introduce a human feedback element so that you can learn human preferences in a deep RL system. That’s one research agenda.

… MIRI has folks working on decision theory. If we understood decision theory better, then we would know better what a good system should be like. Okay, decision theory theorem proving is just categorically a completely different type of work from deep reinforcement learning. You’ve got groups that are working on… so like my group, for example, working on robustness in machine learning systems. How do we know that they’ve learned the thing that we wanted them to learn? Also a completely different field of endeavor.

And it’s very important to keep in mind if you’re looking for a, quote, career in AI safety, is what exactly is it that you think is gonna be important for what trajectory that you think the world is gonna be on? And then what are the particular sub skills that it’s gonna take? Because it’s not a monolith at all. There’s many many different groups taking many many different approaches, and the skills you need are gonna be extraordinarily different depending on the path.

Daniel Ziegler: Normally in the reinforcement learning paradigm, you have some agent acting in some environment, so it might be playing a video game or it might be controlling a robot, it may be a real robot, maybe a simulated robot, and it’s trying to achieve some sort of well-defined goal that’s assumed to be specified as part of the environment. So in a video game, that might be the score. In robotics tasks, it might be something like run as far as you can in 10 seconds, and something that’s a hard-coded function that’s easily specified as part of the environment.
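To make the standard setup Daniel describes concrete, here is a minimal sketch of an agent acting in an environment whose reward is a hard-coded function. The `LineWorld` environment, its `reset`/`step` methods, and the two policies are hypothetical names invented for illustration; they loosely mirror the "run as far as you can" example and are not taken from any OpenAI codebase.

```python
import random


class LineWorld:
    """Hypothetical toy environment: the agent walks along a line for a fixed number of steps."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.reset()

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # action is +1 (step forward) or -1 (step back)
        self.position += action
        self.steps += 1
        # Hard-coded reward, specified as part of the environment:
        # the distance gained on this step.
        reward = float(action)
        done = self.steps >= self.horizon
        return self.position, reward, done


def run_episode(env, policy):
    """Run one episode and return the total reward the policy collected."""
    obs = env.reset()
    total = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
    return total


def forward_policy(obs):
    return 1


def random_policy(obs):
    return random.choice([-1, 1])


if __name__ == "__main__":
    env = LineWorld(horizon=10)
    print("always forward:", run_episode(env, forward_policy))
    print("random walk:   ", run_episode(env, random_policy))
```

The point of the sketch is only that the reward lives inside `step`: someone had to write it down in advance, which is the assumption the next part of the conversation relaxes.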

For a lot of more interesting, real-world applications, that’s not really going to work. It’s too difficult to just write down the reward function that tells you exactly how well you’re doing, because there’s just too many things to take into account. The Safety Team said, okay, let’s relax this assumption and instead of assuming that the reward function is built into the environment, we’ll actually try to learn the reward function based on human feedback.

One of the environments was like a little simulated robotics task where you have this little hopping agent, just like a big leg basically, and we gave humans these examples of what the leg was currently doing. We gave them two examples, one on the left and one on the right, and then the human had to decide which of those was doing a better job, according to whatever the human thought the job should be. So one thing we got the little hopper to do is to do a backflip. It turns out, it’s actually pretty tricky to write down a hard-coded reward function for how to do a backflip, but if you just a few hundred times show a human, is this a better backflip or is this a better backflip, and then have the system learn from that what the human is trying to aim for, that actually works a lot better.

So the idea is, now instead of having to write down a hard-coded reward function, we can just learn that from human oversight. So now what we’re trying to do is take that idea and take some other kinds of bigger mechanisms for learning from human feedback and apply real, natural language to that. So we’re building agents which can speak in natural language themselves and maybe take natural language feedback, and trying to scale those up and move in the direction of solving more real tasks.
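The idea of learning a reward function from "left or right?" comparisons can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Safety Team's implementation: each clip is reduced to a short feature vector, the reward model is linear, the probability that the left clip is preferred is modeled as a sigmoid of the difference in predicted rewards (a Bradley-Terry style model, chosen here as an assumption), and the human is replaced by a `simulated_human` function with a hidden preference. All function names are hypothetical.

```python
import math
import random


def predicted_reward(weights, clip):
    """Linear reward model: a weighted sum of the clip's features."""
    return sum(w * x for w, x in zip(weights, clip))


def simulated_human(clip_a, clip_b):
    """Stand-in for the human rater (assumption): returns 1 if the left clip is preferred, else 0."""
    hidden_preference = [2.0, -1.0]  # unknown to the learner
    left = predicted_reward(hidden_preference, clip_a)
    right = predicted_reward(hidden_preference, clip_b)
    return 1 if left > right else 0


def make_comparisons(n, rng):
    """Sample n pairs of random two-feature clips and label each pair with the simulated human."""
    comparisons = []
    for _ in range(n):
        clip_a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        clip_b = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        comparisons.append((clip_a, clip_b, simulated_human(clip_a, clip_b)))
    return comparisons


def train_reward_model(comparisons, n_features, lr=0.1, epochs=50):
    """Fit reward weights by gradient ascent on the log-likelihood of the observed preferences."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for clip_a, clip_b, label in comparisons:
            diff = predicted_reward(weights, clip_a) - predicted_reward(weights, clip_b)
            diff = max(-30.0, min(30.0, diff))  # clamp to keep exp() well-behaved
            p_left = 1.0 / (1.0 + math.exp(-diff))
            for i in range(n_features):
                weights[i] += lr * (label - p_left) * (clip_a[i] - clip_b[i])
    return weights


if __name__ == "__main__":
    rng = random.Random(0)
    comparisons = make_comparisons(300, rng)  # "a few hundred" comparisons
    weights = train_reward_model(comparisons, n_features=2)
    print("learned reward weights:", weights)
```

In a full system the learned reward would then be handed to a reinforcement learning algorithm in place of the hard-coded one, and the reward model would typically be a neural network over raw observations rather than a linear function of hand-picked features.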

Catherine Olsson: I think the best way to figure out what’s going on is just to dive in. In fact, I’m directly referencing a post by Nate Soares, called Dive In, which I love and recommend, that if you have an extremely concrete plan of how you’re going to contribute that has actionable and trackable steps, you’re going to start getting data from the world about your plan a lot sooner than if you have some unreachable or nebulous plan. I would encourage anyone who’s interested in this sort of thing to look for the smallest step that you can take that brings you just a little closer. If you’re currently a software engineer and you can take a statistics class and maybe do some data science in your current role, by all means do that. Take just one step closer to something in the space of machine learning.

If you can just do software engineering at an organization that does ML, if you take that role, you’ve just got your face in the data in a much more concrete and tangible way. I think, particularly folks who are coming at this topic from an EA angle, maybe you’ve read Superintelligence, whatever your first intro was, those abstractions or motivating examples are quite far removed from the actual work that’s being done and the types of systems that are being deployed today. I think starting to bridge that conceptual gap is one of the best things that you can do for yourself if you’re interested in starting to contribute.

Daniel Ziegler: Yes, and I would say, try just diving in all the way if you can. Like I said, when I was preparing for the OpenAI interviews, I went straight to just implementing a bunch of deep reinforcement learning algorithms as very nearly my first serious project in machine learning, and obviously there were things along the way where I had to shore up on some of the machine learning basics and some probability and statistics and linear algebra and so forth, but by doing it in sort of a depth-first manner, like where I just went right for it and then saw as I went what I needed to do, I was able to be a lot more efficient about it and also just actually practice the thing that I wanted to be doing.

Articles, books, and other media discussed in the show

Spinning Up in Deep RL:

  • A short introduction to RL terminology, kinds of algorithms, and basic theory.
  • An essay about how to grow into an RL research role.
  • A well-documented code repo of short, standalone implementations of: Vanilla Policy Gradient (VPG), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC).
  • And a few exercises to serve as warm-ups.


About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by emailing [email protected].
