The deterministic twin prisoner's dilemma
Joe Carlsmith: The experiment that convinces me most is: Imagine that you are a deterministic AI system and you only care about money for yourself. So you’re selfish. There’s also a copy of you, a perfect copy, and you’ve both been separated very far away — maybe you’re on spaceships flying in opposite directions or something like that. And you’re both going to face the exact same inputs. So you’re deterministic: the only way you’re going to make a different choice is if the computers malfunction or something like that. Otherwise you’re going to see the exact same environment.
In the environment, you have the option of taking $1,000 for yourself: we’ll call that “defecting” — or giving $1 million to the other guy: we’ll call that “cooperating.” The structure is similar to a prisoner’s dilemma. You’re going to make your choice, and then later you’re going to rendezvous.
So what should you do? Well, here’s an argument that I don’t find convincing, but that I think would be the argument offered by someone who thinks you can only control what you can cause. The argument would be something like: your choice doesn’t cause that guy’s choice. He’s far away; maybe he’s lightyears away. You should treat his choice as fixed. And then whatever he chooses, you get more money if you defect. If he defects, then you’ll get nothing by cooperating and $1,000 by defecting. If he sends the money to you, then you’ll get $1.001 million by defecting and $1 million by cooperating. No matter what, it’s better to defect. So you should defect.
But I think that’s wrong. The reason I think it’s wrong is that you are going to make the same choice. You’re deterministic systems, and so whatever you do, he’s going to do it too. In fact, in this particular case — and we can talk about looser versions where the inputs aren’t exactly identical — the connection between you two is so tight that literally, if you want to write something on your whiteboard, he’s going to write that too. If you want him to write on his whiteboard, “Hello, this is a message from your copy,” or something like that, you can just write it on your own whiteboard. When you guys rendezvous, his whiteboard will say the thing that you wrote. You can sit there going, “What do I want?” You really can control what he writes. If you want to draw a particular kitten, if you want to scribble in a certain way, he’s going to do that exact same thing, even though he’s far away and you’re not in causal interaction with him.
To me, I think there’s just a weird form of control you have over what he does that we need to recognise. So I think that’s relevant to your decision, in the sense that if you start reaching for the defect button, you should be like, “OK, what button is he reaching for right now?” As you move your arm, his arm is moving with you. And so as you reach for the defect button, he’s about to defect. You could basically be like, “What button do I want him to press?” and just press it yourself and he’ll press it. So to me, it feels pretty easy to press the “send myself $1 million” button.
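The payoff logic of the two arguments above can be sketched in a few lines of code. This is a minimal illustration, not anything from the original discussion: the `payoff` function and its dollar figures are a hypothetical encoding of the numbers in the text ($1,000 for defecting, $1 million sent by a cooperating twin).

```python
def payoff(my_choice: str, twin_choice: str) -> int:
    """Money I end up with, given my choice and my twin's choice.

    Hypothetical encoding of the figures in the text: defecting takes
    $1,000 for myself; a cooperating twin sends me $1,000,000.
    """
    money = 0
    if my_choice == "defect":
        money += 1_000          # I take the $1,000 for myself
    if twin_choice == "cooperate":
        money += 1_000_000      # the twin sends me $1 million
    return money

# The CDT-style argument: holding the twin's choice fixed,
# defecting always pays strictly more.
for twin in ("defect", "cooperate"):
    assert payoff("defect", twin) > payoff("cooperate", twin)

# But the twins are deterministic copies facing identical inputs, so only
# the diagonal outcomes are reachable: both defect, or both cooperate.
both_defect = payoff("defect", "defect")            # $1,000 each
both_cooperate = payoff("cooperate", "cooperate")   # $1,000,000 each
assert both_cooperate > both_defect
```

The sketch makes the tension explicit: defecting dominates row by row, yet the only outcomes a deterministic pair can actually reach are the two diagonal ones, and the cooperative diagonal pays a thousand times more.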
Joe Carlsmith: The classic thought experiment that people often focus on, though I don’t think it’s the most dispositive, is this case called Newcomb’s problem, where Omega is this kind of superintelligent predictor of your actions. Omega puts you in the situation where you face two boxes: one of them is opaque, one of them is transparent. The transparent box has $1,000, the opaque box has either $1 million or nothing.
Omega puts $1 million in the box if Omega predicts that you will take only the opaque box and leave the $1,000 alone (even though you can see it right there). And Omega puts nothing in the opaque box if Omega predicts that you will take both boxes.
So the same argument arises from causal decision theory (CDT). For CDT, the thought is: you can’t change what’s in the boxes; the boxes are already fixed. Omega already made her prediction. And no matter what, you’ll get more money if you take the $1,000. If there were some dude over there who could see what’s in the boxes, and you were like, “Hey, see what’s in the box — which choice will give me more money?” — you don’t even need to ask, because you know the answer is always to take the extra $1,000.
But I think you should one-box in this case, because I think if you one-box then it will have been the case that Omega predicted that you one-boxed, because Omega is always right about the predictions, and so there will be the million.
I think a way to pump this intuition for me that matters is imagining doing this case over and over with Monopoly money. Each time, I try taking two boxes and I notice the opaque box is empty. I take one box, the opaque box is full. I do this over and over. I try doing intricate mental gymnastics. I do a somersault and take the boxes. I flip a coin and take a box — well, with coin flipping, Omega has to be really good at predicting, so we can talk about that.
If Omega is sufficiently good at predicting your choice, then just like every time, what you eventually will learn is that you effectively have a type of magical power. Like I can just wave my arms over the opaque box and say, “Shazam! I hereby declare that this box shall be filled with $1 million. Thus, as I one-box, it is so.” Or I can be like, “Shazam! I declare that the box shall be empty. Thus, as I two-box, it is so.” I think eventually you just get it in your bones, such that when you finally face the real money, I guess I expect this feeling of like, “I know this one, I’ve seen this before.” I kind of know what’s going to happen at some more visceral expectation level if I one-box or two-box, and I know which one leaves me rich.
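The repeated Monopoly-money trials can be simulated directly. This is a toy sketch, not part of the original discussion: `newcomb_payout` is a hypothetical function assuming, as the text does, that Omega's prediction always matches the actual choice.

```python
def newcomb_payout(choice: str) -> int:
    """Money received in one Newcomb trial with a perfectly accurate Omega.

    Assumption from the text: Omega's prediction always matches the choice.
    """
    prediction = choice  # a perfect predictor never misses
    opaque = 1_000_000 if prediction == "one-box" else 0
    if choice == "one-box":
        return opaque
    return opaque + 1_000  # two-boxing adds the visible $1,000

# Run the trials over and over: the same pattern emerges every single time.
assert all(newcomb_payout("one-box") == 1_000_000 for _ in range(100))
assert all(newcomb_payout("two-box") == 1_000 for _ in range(100))
```

Under the perfect-predictor assumption, the "magical power" is just this: the opaque box's contents are a function of your choice, so one-boxing reliably leaves you with $1 million and two-boxing with $1,000.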
The idea of 'wisdom longtermism'
Joe Carlsmith: In the thesis, I have this distinction between what I call “welfare longtermism” and “wisdom longtermism.”
Welfare longtermism is roughly the idea that our moral focus should be specifically on the welfare of the finite number of future people who might live in our lightcone.
And wisdom longtermism is a broader idea that our moral focus should be reaching a kind of wise and empowered civilisation in general. I think of welfare longtermism as a lower bound on the stakes of the future more broadly — at the very least, the future matters at least as much as the welfare of the future people matters. But to the extent there are other issues that might be game changing or even more important, I think the future will be in a much better position to deal with those than we are, at least if we can make the right future. …
There’s a line in Nick Bostrom’s book Superintelligence about something like, if you’re digging a hole but there’s a bulldozer coming, maybe you should wonder about the value of digging a hole. I also think we’re plausibly on the cusp of pretty radical advances in humanity’s understanding of science and other things, where there might be a lot more leverage and a lot more impact from making sure that the stuff you’re doing matters specifically to how that goes, rather than to just kind of increasing our share of knowledge overall. You want to be focusing on decisions we need to make now that we would have wanted to make differently.
So it looks good to me, the focus on the long-term future. I want to be clear that I think it’s not perfectly safe. I think a thing we just generally need to give up is the hope that we will have a theory that makes sense of everything — such that we know that we’re acting in the safe way, that it’s not going to go wrong, and it’s not going to backfire. I think there can be a way that people look to philosophy as a kind of mode of Archimedean orientation towards the world — that will tell them how to live, and justify their actions, and give a kind of comfort and structure — that I think at some point we need to give up.
On the classic drowning child thought experiment
Joe Carlsmith: I think what that can do is sort of break your conception of yourself as a kind of morally sincere agent — and at a deeper level, it can break your conception of society and your peers, or society as a morally sincere endeavour, in some sense. Things can start to seem kind of sick at their core, and we’re just all looking away from the sense in which we’re horrible people, or something like that.
I actually think part of the attraction of communities like the effective altruism community, for many people, is it sort of offers a vision of a recovery of a certain moral sincerity. You find this community, and actually, these people are maybe trying — more so than you had encountered previously — to really take this stuff seriously, to act rightly by its lights. And I think that can be a powerful idea.
But then this thing comes up, where it’s like, “OK, but how much is enough? Exactly how far do you go with this? What is demanded?” I think people can end up in a mode where their relationship with this is what you said: it’s about not being bad, not sucking — like you thought “maybe I sucked” and now you’re really trying not to suck — you don’t want to be kind of punished or worthy of reproach. It’s a lot about something like guilt. I think that the thought experiment itself is sort of about calling you an asshole. It’s like, “If you didn’t save the child, you’re an asshole.” So everyone’s an asshole.
Rob Wiblin: But look at how you’re living the rest of your life.
Joe Carlsmith: Exactly. I think sometimes you’re an asshole, and we need to be able to notice that. But also, for one thing, it’s actually not clear to me that you’re an asshole for not donating to a charity — that’s not something that we normally think — and I think we should notice that. Also, it doesn’t seem to me like a very healthy or wholehearted basis for engaging with this stuff. I think there are alternatives that are better.
On why bother being good
Rob Wiblin: What are the personal values of yours that motivate you to care to try to help other people, even when it’s kind of a drag, or demoralising, or it feels like you’re not making progress?
Joe Carlsmith: One value that’s important to me, though it’s a little hard to communicate, is something like “looking myself and the world in the eye.” It’s about kind of taking responsibility for what I’m doing; what kind of force I’m going to be in the world in different circumstances; trying to understand myself, understand the world, and understand what in fact I am in relationship to it — and to choose that and endorse that with a sense of agency and ownership.
One way that shows up for me in the context of helping others is trying to take really seriously that my mind is not the world — that the limits of my experience are not the limits of what’s real.
In particular, I wake up and I’m just like Joe every day — every day it’s just Joe stuff; I wake up in the sphere of Joe around me. So Joe stuff is really salient and vivid: there’s this sort of zone — it’s not just my experience, there’s also, like, people and my kitchen — of things that are kind of vivid.
And then there’s a part of the world that my brain is doing a lot less to model — but that doesn’t mean the thing is less real; it’s just my brain is putting in a lot fewer resources to modelling it. So things like other people are just as real as I am. When something happens to me, at least from a certain perspective, that’s not a fundamentally different type of event than when something happens to someone else. So part of living in the real world for me is living in light of that fact, and trying to really stay in connection with just that other people are just as real as I am.
More broadly, when we talk about forms of altruism that are more fully impartial — or trying to ask questions like, “What is really the most good I can do?” — for me, that’s a lot about trying to live in the world as a whole, not artificially limiting which parts of the world I’m treating as real or significant. Because I don’t live in just one part of the world. When I act, I act in a way that affects the whole world, or that can affect the whole world. There’s some sense in which I want to be not imposing some myopia upfront on what is in scope for me. I think those are both core for me in terms of what helping others is about.