ÑóẊîöʼn
7th January 2011, 17:17
Interesting part of a larger article (http://singinst.org/upload/CFAI.html#anthro_reinventing) that helps to clear up a lot of groundless anthropomorphism of Friendly AIs, with particular reference to the concept of retaliation:
2.1: Reinventing retaliation
If you punch a human in the nose, he or she will punch back. If the human doesn't punch back, it's an admirable act of self-restraint, something worthy of note.
Imagine, for a moment, that you walk up and punch an AI in the nose. Does the AI punch back? Perhaps and perhaps not, but punching back will not be instinctive. A sufficiently young AI might stand there and think: "Hm. Someone's fist just bumped into my nose." In a punched human, blood races, adrenaline pumps, the hands form fists, the stance changes, all without conscious attention. For a young AI, focus of attention shifts in response to an unexpected negative event - and that's all.
As the AI thinks about the fist that bumped into vis (http://singinst.org/upload/GISAI/meta/glossary.html#gloss_vis) (13 (http://singinst.org/upload/CFAI.html#foot-13)) nose, it may occur to the AI that this experience may be a repeatable event rather than a one-time event, and since a punch is a negative event, it may be worth thinking about how to prevent future punches, or soften the negativity. An infant AI - one that hasn't learned about social concepts yet - will probably think something like: "Hm. A fist just hit my nose. I'd better not stand here next time."
If the AI is young, but old enough to be explaining human actions by reference to human cognition, ve may know that the fist movement and collision was probably not a coincidence; ve may hypothesize that the collision of human fist with AI's nose was intended, and even go on to hypothesize that the outcome of the punch, the resulting damage to the AI, was a predictable (predictable to the human) result of the punch, and that the purpose of the punch was to damage the AI; that is, the AI models the human's cognition and concludes that the human had a subgoal of {punching the AI}, and that the parent goal was {causing punch-damage to the AI}. Thus, the AI's focus of attention shifts to the human's motives rather than the physical punch itself, and the AI's next goal becomes deducing the human's real motive (that is, the parent goal of {causing punch-damage to the AI}).
There are now three different ways the AI can try to prevent the next punch: by preventing the physical collision itself, by preventing the human from deciding to punch the AI, or by preventing the unknown original motive from recurring. (14 (http://singinst.org/upload/CFAI.html#foot-14)). Note, incidentally, that this improved picture rules out "standing somewhere else" as a solution, since the trajectory of the human's fist would be expected to track the AI's new location.
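The goal chain the AI infers here can be sketched as a toy model (the class and goal names below are invented for illustration, not from the article): each goal links to the parent goal it serves, and each link in the chain is a distinct intervention point.

```python
# Hypothetical toy model of the inferred goal chain. Walking from the
# observed subgoal up to the root motive enumerates the three places
# the AI could intervene to prevent the next punch.
class Goal:
    def __init__(self, desc, parent=None):
        self.desc = desc
        self.parent = parent

    def chain(self):
        """Walk from this subgoal up to the root motive."""
        node, path = self, []
        while node is not None:
            path.append(node.desc)
            node = node.parent
        return path

# The AI's model of the attacker's cognition:
motive = Goal("unknown original motive")
damage = Goal("cause punch-damage to the AI", parent=motive)
punch = Goal("punch the AI", parent=damage)

# Block the collision, block the decision to punch, or block the motive.
print(punch.chain())
```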
Punching back is an incredibly nonobvious solution. On the face of it, throwing the punch simply causes punch-damage to a human, something that, considered as an isolated event, the AI has no cause to regard as desirable. When starting from scratch, "punching back" seems as ridiculous as our mothers always said it was - analogous to reacting to a fire alarm by changing your shirt. For an AI to independently reinvent the idea of punching back would be an act of incredible genius and lateral thinking.
Independent reinvention is unlikely, not least because a seed AI (http://singinst.org/upload/GISAI/meta/glossary.html#gloss_seed_ai) would have extensively surfed archives of the Internet or other information sources, and would have encountered thousands, if not millions, of hints to the effect that humans punch back. But consider, for a moment, what a truly independent AI would need to understand before reinventing the concept of retaliation. Ve would need to begin, as stated, with the realization that the human punching ver did so on purpose and with intent to damage. This, in itself, is not such a large assumption; humans are intelligent beings, so there is often a direct mapping between {the results of our actions} and {our goals}. On the other hand, there's a long gap between an AI saying "Hm, this result may correspond to the human's intentions" and a human saying "Hey, you did that on purpose!"
If an infantile AI thinks "Hm, a fist just hit my nose, I'd better not stand here again", then a merely young AI, more experienced in interacting with humans, may apply standard heuristics about apparently inexplicable human actions and say: "Your fist just hit my nose... is that necessary for some reason? Should I be punching myself in the nose every so often?" One imagines the nearby helpful programmer explaining to the AI that, no, there is no valid reason why being punched in the nose is a good thing, after which the young AI turns around and says to the technophobic attacker: "I deduce that you wanted {outcome: AI has been punched in the nose}. Could you please adjust your goal system so that you no longer value {outcome: AI has been punched in the nose}?"
And how would a young AI go about comprehending the concept of "harm" or "attack" or "hostility"? Let us take, as an example, an AI being trained as a citywide traffic controller. The AI understands that (for whatever reason) traffic congestion is bad, and that people getting places on time is good. (15 (http://singinst.org/upload/CFAI.html#foot-15)). The AI understands that, as a child goal of avoiding traffic congestion, ve needs to be good at modeling traffic congestion. Ve understands that, as a child goal of being good at modeling traffic congestion, ve needs at least 512GB of RAM, and needs to have thoughts about traffic that meet or surpass a certain minimal level of efficiency. Ve knows that the programmers are working to improve the efficiency of the thinking process and the efficacy of the thoughts themselves, which is why the programmers' actions in rewriting the AI are desirable from the AI's perspective.
A technophobic human who hates the traffic AI might walk over and remove 1GB of RAM, this being the closest equivalent to punching a traffic AI in the nose. The traffic AI would see the conflict with {subgoal: have at least 512GB of RAM}, and this conflict obviously interferes with the parent goal of {modeling traffic congestion} or the grandparent goal of {reducing traffic congestion}, but how would an AI go about realizing that the technophobic attacker is "targeting the AI", "hating the AI personally", rather than trying to increase traffic congestion?
From the AI's perspective, descriptions of internal cognitive processes show up in a lot of subgoals, maybe even most of the subgoals. But these internal contents don't necessarily get labeled as "me", with everything else being "not-me". The distinction is a useful one, and even a traffic-control AI will eventually formulate the useful categories of "external-world subgoals" and "internal-cognition subgoals", but the division will not necessarily have special privileges; the internal/external division may not be different in kind from the division between "cognitive subgoals that deal with random-access memory" and "cognitive subgoals that deal with disk space". How is a young AI supposed to guess, in advance of the fact, that so many human concepts and thoughts and built-in emotions revolve around "Person X", rather than "Parietal Lobe X" or "Neuron X"? How is the AI supposed to know that it's inherently more likely that a technophobic attacker intends to "injure the AI", rather than "injure the AI's random-access memory" or "injure the city's traffic-control"?
The concept of "injuring the AI", and an understanding of what a human attacker would tend to categorize as "the AI", is a prerequisite to understanding the concept of "hostility towards the AI". If a human really hates someone, she (16 (http://singinst.org/upload/CFAI.html#foot-16)) will balk the enemy at every turn, interfere with every possible subgoal, just to maximize the enemy's frustration. How would an AI understand this?
Perhaps the AI's experience of playing chess, tic-tac-toe, or other two-sided zero-sum games will enable the AI to understand "opposition" - that everything the opponent desires is therefore undesirable to you, and that everything you desire is therefore undesirable to the opponent; that if your opponent has a subgoal, you should have a subgoal of blocking that subgoal's completion, and that if you have a valid subgoal, your opponent will have a subgoal of blocking your subgoal's completion.
Real life is not zero-sum, but the heuristics and predictive assumptions learned from dealing with zero-sum games may work to locally describe the relation between two social enemies. (Even the bitterest of real-life enemies will have certain goal states in common, e.g., nuclear war is bad; but this fact lies beyond the relevance horizon of most interactions.)
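The "opposition" heuristic from zero-sum play, and the way it breaks down in real life, can be stated in a few lines (utilities below are invented for illustration):

```python
# Minimal sketch of the heuristic learned from zero-sum games:
# whatever the opponent values, you value its negation.
def my_utility_zero_sum(opponent_utility):
    # In a strictly zero-sum game, payoffs sum to zero.
    return -opponent_utility

# An outcome the opponent rates +5 is rated -5 by me, so blocking the
# opponent's subgoal has positive expected value for me.
assert my_utility_zero_sum(5) == -5

# Real life breaks the assumption: some outcomes (e.g. nuclear war)
# are bad for both players, so the heuristic only holds locally.
shared_disaster = {"me": -100, "enemy": -100}
assert my_utility_zero_sum(shared_disaster["enemy"]) != shared_disaster["me"]
```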
The real "Aha!" would be the insight that the attacking human and the AI could be in a relation analogous to players on opposing sides in a game of chess. This is a very powerful and deeply fundamental analogy. As humans, we tend to take this perspective for granted; we were born with it. It is, in fact, a deep part of how we humans define the self. It is part of how we define being a person, this cognitive assumption that you and I and everyone else are all nodes in a social network, players in a hugely multisided non-zero-sum game. For a human, myself is a great, embracing symbol that gathers in the-player-that-is-this-goal-system and the-part-of-reality-that-is-inside-this-mind and the-body-that-sees-and-moves-for-this-viewpoint. For a human, these are all the same thing, part of what is meant by "I".
Even so, the concept of game theory is not sufficient to reinvent "retaliation"; it is simply a prerequisite. Understanding the Axelrod-and-Hamilton "Tit for Tat" strategy (17 (http://singinst.org/upload/CFAI.html#foot-17)) is sufficient to suggest "Hey, maybe I should think about punching the attacker back!" but not sufficient to justify the suggestion, once made. (18 (http://singinst.org/upload/CFAI.html#foot-18)). One question that has gone unasked throughout this entire analysis is "What is the utility of punching back?" A human, in punching back, may or may not pause to consider whether it will bring any benefits; but, even so, we instinctively retaliate as a result of reflexes that were an advantage in the ancestral environment.
The evolutionary benefit of retaliation lies in the probable adjustment to the future behavior of others. People - humans, anyway - are less likely to hit you if they think you're likely to hit back. Hitting back an individual human may or may not work to prevent a recurrence, or to decrease the number of recurrences, but failure to hit back may result in more punches coming your way. Of course, this relies on the attackers having the kind of psychologies that take possible retaliation into account - an AI, to fully grok the reason for retaliation, would have to understand this aspect of human (evolutionary) psychology as well. The Tit for Tat strategy doesn't succeed in isolation; it only succeeds when some of the strategies it interacts with also take the behavior of the opponent/partner into account.
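The point that Tit for Tat only succeeds against strategies that themselves react to the opponent can be demonstrated with a minimal iterated prisoner's dilemma (the payoffs are the standard Axelrod-tournament ones, not figures from the article):

```python
# Standard prisoner's dilemma payoffs: (my score, their score).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    # Unresponsive: ignores the opponent's behavior entirely.
    return "D"

def play(strat_a, strat_b, rounds=10):
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)
        pa, pb = PAYOFF[(a, b)]
        score_a += pa
        score_b += pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

# Against an unresponsive defector, retaliation buys nothing:
print(play(tit_for_tat, always_defect))  # (9, 14): one sucker round, then mutual defection
# Against another responsive strategy, cooperation is sustained:
print(play(tit_for_tat, tit_for_tat))    # (30, 30)
```

Retaliating against `always_defect` changes nothing about its future behavior, which is exactly the dependency on the opponent's psychology described above.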
In reality, for an AI, returning a punch is likely to be far more trouble than it's worth. There's that whole crowd of onlookers who go: "Look! It punched a human being! Kill it! Kill the AI!" Of course, if the AI understands enough evolutionary psychology and game theory to employ Tit for Tat, but not enough human psychology to understand the concept of "double standard", ve may not realize how much trouble punching a human may land ver in. This is one of the reasons why a well-designed Friendly AI would have a precautionary injunction that says something along the lines of "Don't attack a human, even if it looks like a good idea at the time." A Friendly AI would also regard damage to an unconsenting human as intrinsically undesirable under the Friendliness supergoal content. (19 (http://singinst.org/upload/CFAI.html#foot-19)).
So what's the upshot? Primarily, what I want to convey to you is the sense of a blank slate, of starting from scratch, of a vast silent space - for young AIs, at least. This blank slate doesn't mean that AIs, even young AIs, are necessarily slow. For an AI with thirty-two 2 GHz processors, the maximum speed on a reasonably linear chain of thought (one that doesn't require more than thirty-two parallel processing threads) is 2 billion ticks per second, as opposed to the 200 ticks per second of the hundred trillion synapses of the human brain. Perhaps an AI - on being physically punched for the very first time on vis very first encounter with a hostile human - could think through what ve remembers from surfing Usenet archives, and what ve read in newspapers, and books of fiction and nonfiction, and what ve knows about human evolutionary psychology, and conclude that returning the punch is the best course of action, and do it all in time to return the punch as fast as any instinct-driven human - assuming that 32 processors is enough speed to handle the complex motor actions required.
This isn't a literal possibility (20 (http://singinst.org/upload/CFAI.html#foot-20)) unless the AI is transhuman (http://singinst.org/upload/CFAI.html#gloss_transhuman), and a transhuman AI would be mature enough to know all about social networks; ve wouldn't need to start from scratch. The first time some young AI gets punched in the nose, ve's likely to react in one of the ways described earlier, or some other way just as surreal, if ve reacts at all. It'll take some extended conversations with the programmers about evolutionary psychology before the AI understands what's going on. But, the second time the AI gets punched in the nose, it shouldn't take any time at all - by human standards - to run through a chain of logic that's easy to reverify. It's inventing that takes massive computing power and human confirmation; retracing your own footprints is likely to be a fairly serial process that can be consummated in a tenth, a hundredth of a second. If re-spawning a child goal from a parent goal is a serial task, one that doesn't invoke any computationally intensive subprocesses, then the AI can retrace the path from supergoal Friendliness content to the correct course of action (retaliation, rebuking the attacker, whatever) in a human eyeblink.
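The serial-speed comparison in the two paragraphs above works out as simple arithmetic (the 2 GHz and 200 Hz figures are the article's; the 1,000-step chain is a hypothetical length chosen for illustration):

```python
# Serial ticks per second: one 2 GHz processor vs. a ~200 Hz synapse.
ai_ticks_per_second = 2_000_000_000
human_ticks_per_second = 200

speedup = ai_ticks_per_second // human_ticks_per_second
print(speedup)  # 10,000,000: a ten-million-fold serial advantage

# Retracing a hypothetical, previously verified 1,000-step chain of logic:
steps = 1_000
print(steps / ai_ticks_per_second)     # 5e-07 seconds for the AI
print(steps / human_ticks_per_second)  # 5 seconds of human serial thought
```

This is why reverifying an already-invented chain of reasoning can fit inside a human eyeblink, even though inventing it the first time could not.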
It's easy enough to accept that a young AI is a vast silent space. For a human, blood races, adrenaline pumps, eyes narrow, hands form fists. For a human, counterattack is instinctive - the correct level of counterattack, the socially acceptable level of counterattack; a damaging counterpunch rather than a lethal bullet. Our instincts are context-sensitive and hedged 'round by other instincts. Our instincts are complex functional adaptations with multiple moving parts, often hosted on a modular chunk of brainware. Complex functional adaptations don't just materialize spontaneously in source code, just as complex dishes like pizza don't suddenly start growing on palm trees. Thus, a young AI might choose to retaliate, but would certainly not feel the need to retaliate, as a human would. To a young AI, retaliation is not an instinct; retaliation is just another subgoal.
What about a more mature AI, especially one that can rewrite vis own source code? Regardless of whether it would be a good idea, it would certainly be possible for a seed AI to create a reflex for instant retaliation.
There are several clear reasons why humans have evolved a retaliation instinct, rather than a retaliation logic. The primary reason is that a retaliation instinct is easier to evolve. The retaliation instinct evolved long before general intelligence, so evolving a retaliation logic first would not just have been more difficult, but actually impossible. Also, evolution tends to arrive at procedural solutions rather than declarative solutions, because a component of a complex procedural solution can be functional in its own right.
If genes could, somehow, store declarative knowledge, the first piece of knowledge stored would be "Punching back is good," which is simpler than "Punching back is good because it decreases the chance of future punches," which is simpler than "Punching back decreases the chance of future punches by modifying others' behavior", which is simpler than "Punching back modifies others' behavior because, on seeing you punch back, they'll project an increased chance of you punching back if they punch you, which makes them less likely to punch back." All of this is moot, since as far as I know, nobody has ever run across a case of genes storing abstract knowledge. (By this I mean knowledge stored in the same format used for episodic memories or declarative semantic knowledge.)
Abstract knowledge cannot evolve incrementally and therefore it does not evolve at all. This fact, by itself, is enough to completely explain away human use of retaliation instincts rather than retaliation logic, and we must go on to consider independently whether a retaliation instinct or a retaliation logic is more useful. For humans, I think that a retaliation instinct is more useful, or at least it's more of an evolutionary advantage. Even if we had conscious control over our endocrine systems, so we could deliberately choose to pump adrenaline, we would still be shot down by the sheer human-slowness of abstract thought. We are massively parallel systems running at 200Hz. When you're massively parallel you can afford to precompute things, and when you run at 200Hz you must precompute things because everything has to be done in very few serial steps.
When you run at 2 billion ticks per second, the overhead of recreating and rechecking a few previously-thought-out child goals is comparatively trivial next to all the other actions those subgoals entail, including complex, creative, parallel-intensive tasks such as computing motor actions (if ve decides to punch back) or thinking up an effective opening line (if ve decides to rebuke the attacker). For a human, interposing any extra serial step in a process is a huge time burden, but massively parallel operations vanish into the maw of trillions and trillions of synapses. For an AI, an extra serial step performed at the start of a thought is nearly unnoticeable (21 (http://singinst.org/upload/CFAI.html#foot-21)), but trying to imitate human parallelism would bring the AI to vis knees. (A later AI may have access to computing power rivalling human parallelism, but would presumably still have the faster serial speed.)
The conclusion: From an AI's perspective, a retaliation logic is probably considerably more effective than a retaliation instinct. The retaliation logic has a context-sensitivity that the retaliation instinct lacks. Suppose a young AI has some set of ultimate goals - traffic reduction, or hopefully Friendliness supergoals, or whatever. The young AI, by hypothesis, invents (or is taught) a retaliation logic under which "punching back" serves these supergoals. A maturing AI would have the capability to replace the retaliation logic with a retaliation instinct, or with a retaliation imperative having supergoal status. The instinct would be slightly faster - 0.01% faster, or perhaps 1% faster (eliminating a serial step at the start of the algorithm saves very little time, and most of the computational cost is computation-intensive motor logic or persuasive speech production). However, in doing so, the AI would lose a substantial amount of the context sensitivity of the retaliation logic - that is, from the perspective of the current set of supergoals, the supergoals that the AI uses to decide whether or not to implement the optimization.
Changing retaliation to an independent supergoal would affect, not just the AI's speed, but the AI's ultimate decisions. From the perspective of the current set of supergoals, this new set of decisions would be suboptimal. Suppose a young AI has some set of ultimate goals - traffic reduction, Friendliness, whatever. The young AI, by hypothesis, invents (or is taught) a retaliation logic under which "punching back" serves these supergoals. The maturing AI then considers whether changing the logic to an independent supergoal or optimized instinct is a valid tradeoff. The benefit is shaving one millisecond off the time to initiate retaliation. The cost is that the altered AI will execute retaliation in certain contexts where the present AI would not come to that decision, perhaps at great cost to the present AI's supergoals (traffic reduction, Friendliness, etc). Since recreating the retaliation subgoal is a relatively minor computational cost, the AI will almost certainly choose to have retaliation remain strictly dependent on the supergoals.
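The tradeoff weighed above can be sketched as an expected-utility comparison, judged from the current supergoals. All the numbers below are invented for illustration: a fixed (tiny) speed bonus for the instinct, against contexts where unconditional retaliation damages the supergoals.

```python
# (probability, utility of retaliating in this context), as judged by
# the AI's *current* supergoals. Figures are purely illustrative.
contexts = [
    (0.70,  +10),   # retaliation genuinely serves the supergoals
    (0.30, -500),   # retaliation is catastrophic for the supergoals
]
speed_bonus = 0.001  # negligible value of shaving ~1 ms off startup

# Retaliation logic: retaliate only when it serves the supergoals,
# otherwise do nothing (baseline utility 0).
logic_value = sum(p * max(u, 0) for p, u in contexts)

# Retaliation instinct/supergoal: retaliate unconditionally, but faster.
instinct_value = sum(p * u for p, u in contexts) + speed_bonus

print(logic_value, instinct_value)
# Judged by the current supergoals, the context-sensitive logic wins,
# so the AI declines the "optimization".
assert logic_value > instinct_value
```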
Why do I keep making this point, especially when I believe that a Friendly seed AI can and should live out vis entire lifecycle without ever retaliating against a single human being? I'm trying to drive a stake through the heart of a certain conversation I keep having.
Somebody: "But what happens if the AI decides to do [something only a human would want] ?"
Me: "Ve won't want to do [whatever] because the instinct for doing [whatever] is a complex functional adaptation, and complex functional adaptations don't materialize in source code. I mean, it's understandable that humans want to do [whatever] because of [selection pressure], but you can't reason from that to AIs."
Somebody: "But everyone needs to do [whatever] because [personal philosophy], so the AI will decide to do it as well."
Me: "Yes, doing [whatever] is sometimes useful. But even if the AI decides to do [whatever] because it serves [Friendliness supergoal] under [contrived scenario], that's not the same as having an independent desire to do [whatever]."
Somebody: "Yes, that's what I've been saying: The AI will see that [whatever] is useful and decide to start doing it. So now we need to worry about [scenario in which <whatever> is catastrophically unFriendly]."
Me: "But the AI won't have an independent desire to do [whatever]. The AI will only do [whatever] when it serves the supergoals. A Friendly AI would never do [whatever] if it stomps on the Friendliness supergoals."
Somebody: "I don't understand. You've admitted that [whatever] is useful. Obviously, the AI will alter itself so it does [whatever] instinctively."
Me: "The AI doesn't need to give verself an instinct in order to do [whatever]; if doing [whatever] really is useful, then the AI can see that and do [whatever] as a consequence of pre-existing supergoals, and only when [whatever] serves those supergoals."
Somebody: "But an instinct is more efficient, so the AI will alter itself to do [whatever] automatically."
Me: "Only for humans. For an AI, [complex explanation of the cognitive differences between having 32 2-gigahertz processors and 100 trillion 200-hertz synapses], so making [whatever] an independent supergoal would only be infinitesimally more efficient."
Somebody: "Yes, but it is more efficient! So the AI will do it."
Me: "It's not more efficient from the perspective of a Friendly AI if it results in [something catastrophically unFriendly]. To the exact extent that an instinct is context-insensitive, which is what you're worried about, a Friendly AI won't think that making [whatever] context-insensitive, with [horrifying consequences], is worth the infinitesimal improvement in speed."
Retaliation was chosen as a sample target because it's easy to explain, easy to see as anthropomorphic, and a good stand-in for the general case. Though "retaliation" in particular has little or no relevance to Friendly AI - I wouldn't want any Friendly AI to start dabbling in retaliation, whether or not it looked like a good idea at the time - what has been said of "retaliation" is true for the general case.
---
There's a lot more in the link, but I thought I'd highlight the most interesting bit.
2.1: Reinventing retaliation
If you punch a human in the nose, he or she will punch back. If the human doesn't punch back, it's an admirable act of self-restraint, something worthy of note.
Imagine, for a moment, that you walk up and punch an AI in the nose. Does the AI punch back? Perhaps and perhaps not, but punching back will not be instinctive. A sufficiently young AI might stand there and think: "Hm. Someone's fist just bumped into my nose." In a punched human, blood races, adrenaline pumps, the hands form fists, the stance changes, all without conscious attention. For a young AI, focus of attention shifts in response to an unexpected negative event - and that's all.
As the AI thinks about the fist that bumped into vis (http://singinst.org/upload/GISAI/meta/glossary.html#gloss_vis) (13 (http://singinst.org/upload/CFAI.html#foot-13)) nose, it may occur to the AI that this experience may be a repeatable event rather than a one-time event, and since a punch is a negative event, it may be worth thinking about how to prevent future punches, or soften the negativity. An infant AI - one that hasn't learned about social concepts yet - will probably think something like: "Hm. A fist just hit my nose. I'd better not stand here next time."
If the AI is young, but old enough to be explaining human actions by reference to human cognition, ve may know that the fist movement and collision was probably not a coincidence; ve may hypothesize that the collision of human fist with AI's nose was intended, and even go on to hypothesize that the outcome of the punch, the resulting damage to the AI, was a predictable (predictable to the human) result of the punch, and that the purpose of the punch was to damage the AI; that is, the AI models the human's cognition and concludes that the human had a subgoal of {punching the AI}, and that the parent goal was {causing punch-damage to the AI}. Thus, the AI's focus of attention shifts to the human's motives rather than the physical punch itself, and the AI's next goal becomes deducing the human's real motive (that is, the parent goal of {causing punch-damage to the AI}).
There are now three different ways the AI can try to prevent the next punch; by preventing the physical collision itself, by preventing the human from deciding to punch the AI, or by preventing the unknown original motive from recurring. (14 (http://singinst.org/upload/CFAI.html#foot-14)). Note, incidentally, that this improved picture rules out "standing somewhere else" as a solution, since the trajectory of the human's fist would be expected to track the AI's new location.
Punching back is an incredibly nonobvious solution. On the face of it, throwing the punch simply causes punch-damage to a human, something that, considered as an isolated event, the AI has no cause to regard as desirable. When starting from scratch, "punching back" seems as ridiculous as our mothers always said it was - analogous to reacting to a fire alarm by changing your shirt. For an AI to independently reinvent the idea of punching back would be an act of incredible genius and lateral thinking.
Independent reinvention is unlikely, not least because a seed AI (http://singinst.org/upload/GISAI/meta/glossary.html#gloss_seed_ai) would have extensively surfed archives of the Internet or other information sources, and would have encountered thousands, if not millions, of hints to the effect that humans punch back. But consider, for a moment, what a truly independent AI would need to understand before reinventing the concept of retaliation. Ve would need to begin, as stated, with the realization that the human punching ver did so on purpose and with intent to damage. This, in itself, is not such a large assumption; humans are intelligent beings, so there is often a direct mapping between {the results of our actions} and {our goals}. On the other hand, there's a long gap between an AI saying "Hm, this result may correspond to the human's intentions" and a human saying "Hey, you did that on purpose!"
If an infantile AI thinks "Hm, a fist just hit my nose, I'd better not stand here again", then a merely young AI, more experienced in interacting with humans, may apply standard heuristics about apparently inexplicable human actions and say: "Your fist just hit my nose... is that necessary for some reason? Should I be punching myself in the nose every so often?" One imagines the nearby helpful programmer explaining to the AI that, no, there is no valid reason why being punched in the nose is a good thing, after which the young AI turns around and says to the technophobic attacker: "I deduce that you wanted {outcome: AI has been punched in the nose}. Could you please adjust your goal system so that you no longer value {outcome: AI has been punched in the nose}?"
And how would a young AI go about comprehending the concept of "harm" or "attack" or "hostility"? Let us take, as an example, an AI being trained as a citywide traffic controller. The AI understands that (for whatever reason) traffic congestion is bad, and that people getting places on time is good. (15 (http://singinst.org/upload/CFAI.html#foot-15)). The AI understands that, as a child goal of avoiding traffic congestion, ve needs to be good at modeling traffic congestion. Ve understands that, as a child goal of being good at modeling traffic congestion, ve needs at least 512GB of RAM, and needs to have thoughts about traffic that meet or surpass a certain minimal level of efficiency. Ve knows that the programmers are working to improve the efficiency of the thinking process and the efficacy of the thoughts themselves, which is why the programmers' actions in rewriting the AI are desirable from the AI's perspective.
A technophobic human who hates the traffic AI might walk over and remove 1GB of RAM, this being the closest equivalent to punching a traffic AI in the nose. The traffic AI would see the conflict with {subgoal: have at least 512GB of RAM}, and this conflict obviously interferes with the parent goal of {modeling traffic congestion} or the grandparent goal of {reducing traffic congestion}, but how would an AI go about realizing that the technophobic attacker is "targeting the AI", "hating the AI personally", rather than trying to increase traffic congestion?
From the AI's perspective, descriptions of internal cognitive processes show up in a lot of subgoals, maybe even most of the subgoals. But these internal contents don't necessarily get labeled as "me", with everything else being "not-me". The distinction is a useful one, and even a traffic-control AI will eventually formulate the useful categories of "external-world subgoals" and "internal-cognition subgoals", but the division will not necessarily have special privileges; the internal/external division may not be different in kind from the division between "cognitive subgoals that deal with random-access memory" and "cognitive subgoals that deal with disk space". How is a young AI supposed to guess, in advance of the fact, that so many human concepts and thoughts and built-in emotions revolve around "Person X", rather than "Parietal Lobe X" or "Neuron X"? How is the AI supposed to know that it's inherently more likely that a technophobic attacker intends to "injure the AI", rather than "injure the AI's random-access memory" or "injure the city's traffic-control"?
The concept of "injuring the AI", and an understanding of what a human attacker would tend to categorize as "the AI", is a prerequisite to understanding the concept of "hostility towards the AI". If a human really hates someone, she (16 (http://singinst.org/upload/CFAI.html#foot-16)) will balk the enemy at every turn, interfere with every possible subgoal, just to maximize the enemy's frustration. How would an AI understand this?
Perhaps the AI's experience of playing chess, tic-tac-toe, or other two-sided zero-sum games will enable the AI to understand "opposition" - that everything the opponent desires is therefore undesirable to you, and that everything you desire is therefore undesirable to the opponent; that if your opponent has a subgoal, you should have a subgoal of blocking that subgoal's completion, and that if you have a valid subgoal, your opponent will have a subgoal of blocking your subgoal's completion.
Real life is not zero-sum, but the heuristics and predictive assumptions learned from dealing with zero-sum games may work to locally describe the relation between two social enemies. (Even the bitterest of real-life enemies will have certain goal states in common, e.g., nuclear war is bad; but this fact lies beyond the relevance horizon of most interactions.)
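The "opposition" heuristic described above can be made concrete with a tiny sketch (my construction, not the article's; the payoff matrix is an arbitrary example). In a two-player zero-sum game, the opponent's payoff is the negation of yours, so "block whatever the opponent wants" and "pursue whatever you want" turn out to be the same computation:

```python
# Minimal sketch: in a two-player zero-sum game, maximizing your own
# payoff is identical to minimizing the opponent's, which is exactly
# the "everything the opponent desires is undesirable to you" heuristic.

# Row player's payoffs; the column player's payoffs are the negation.
payoffs = [
    [3, -1],
    [0, 2],
]

def best_response_for_row(col_choice):
    # Row player maximizes its own payoff...
    return max(range(2), key=lambda r: payoffs[r][col_choice])

def worst_for_column(col_choice):
    # ...which picks the same move as minimizing the column player's
    # payoff (the column player's payoff is -payoffs[r][c]).
    return min(range(2), key=lambda r: -payoffs[r][col_choice])

for c in range(2):
    assert best_response_for_row(c) == worst_for_column(c)
```

As the article notes, this identity breaks down in non-zero-sum settings, where blocking the opponent can also cost you; the heuristic is only a local approximation of enmity.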
The real "Aha!" would be the insight that the attacking human and the AI could be in a relation analogous to players on opposing sides in a game of chess. This is a very powerful and deeply fundamental analogy. As humans, we tend to take this perspective for granted; we were born with it. It is, in fact, a deep part of how we humans define the self. It is part of how we define being a person, this cognitive assumption that you and I and everyone else are all nodes in a social network, players in a hugely multisided non-zero-sum game. For a human, myself is a great, embracing symbol that gathers in the-player-that-is-this-goal-system and the-part-of-reality-that-is-inside-this-mind and the-body-that-sees-and-moves-for-this-viewpoint. For a human, these are all the same thing, part of what is meant by "I".
Even so, the concept of game theory is not sufficient to reinvent "retaliation"; it is simply a prerequisite. Understanding the Axelrod-and-Hamilton "Tit for Tat" strategy (17 (http://singinst.org/upload/CFAI.html#foot-17)) is sufficient to suggest "Hey, maybe I should think about punching the attacker back!" but not sufficient to justify the suggestion, once made. (18 (http://singinst.org/upload/CFAI.html#foot-18)). One question that has gone unasked throughout this entire analysis is "What is the utility of punching back?" A human, in punching back, may or may not pause to consider whether it will bring any benefits; but, even so, we instinctively retaliate as a result of reflexes that were an advantage in the ancestral environment.
The evolutionary benefit of retaliation lies in the probable adjustment to the future behavior of others. People - humans, anyway - are less likely to hit you if they think you're likely to hit back. Hitting back an individual human may or may not work to prevent a recurrence, or to decrease the number of recurrences, but failure to hit back may result in more punches coming your way. Of course, this relies on the attackers having the kind of psychologies that take possible retaliation into account - an AI, to fully grok the reason for retaliation, would have to understand this aspect of human (evolutionary) psychology as well. The Tit for Tat strategy doesn't succeed in isolation; it only succeeds when some of the strategies it interacts with also take the behavior of the opponent/partner into account.
In reality, for an AI, returning a punch is likely to be far more trouble than it's worth. There's that whole crowd of onlookers who go: "Look! It punched a human being! Kill it! Kill the AI!" Of course, if the AI understands enough evolutionary psychology and game theory to employ Tit for Tat, but not enough human psychology to understand the concept of "double standard", ve may not realize how much trouble punching a human may land ver in. This is one of the reasons why a well-designed Friendly AI would have a precautionary injunction that says something along the lines of "Don't attack a human, even if it looks like a good idea at the time." A Friendly AI would also regard damage to an unconsenting human as intrinsically undesirable under the Friendliness supergoal content. (19 (http://singinst.org/upload/CFAI.html#foot-19)).
So what's the upshot? Primarily, what I want to convey to you is the sense of a blank slate, of starting from scratch, of a vast silent space - for young AIs, at least. This blank slate doesn't mean that AIs, even young AIs, are necessarily slow. For an AI with thirty-two 2-GHz processors, the maximum speed on a reasonably linear chain of thought (one that doesn't require more than thirty-two parallel processing threads) is 2 billion ticks per second, as opposed to the 200 ticks per second of the hundred trillion synapses of the human brain. Perhaps an AI - on being physically punched for the very first time on vis very first encounter with a hostile human - could think through what ve remembers from surfing Usenet archives, and what ve read in newspapers, and books of fiction and nonfiction, and what ve knows about human evolutionary psychology, and conclude that returning the punch is the best course of action, and do it all in time to return the punch as fast as any instinct-driven human - assuming that 32 processors is enough speed to handle the complex motor actions required.

This isn't a literal possibility (20 (http://singinst.org/upload/CFAI.html#foot-20)) unless the AI is transhuman (http://singinst.org/upload/CFAI.html#gloss_transhuman), and a transhuman AI would be mature enough to know all about social networks; ve wouldn't need to start from scratch. The first time some young AI gets punched in the nose, ve's likely to react in one of the ways described earlier, or some other way just as surreal, if ve reacts at all. It'll take some extended conversations with the programmers about evolutionary psychology before the AI understands what's going on. But, the second time the AI gets punched in the nose, it shouldn't take any time at all - by human standards - to run through a chain of logic that's easy to reverify. It's inventing that takes massive computing power and human confirmation; retracing your own footprints is likely to be a fairly serial process that can be consummated in a tenth, a hundredth of a second. If re-spawning a child goal from a parent goal is a serial task, one that doesn't invoke any computationally intensive subprocesses, then the AI can retrace the path from supergoal Friendliness content to the correct course of action (retaliation, rebuking the attacker, whatever) in a human eyeblink.
It's easy enough to accept that a young AI is a vast silent space. For a human, blood races, adrenaline pumps, eyes narrow, hands form fists. For a human, counterattack is instinctive - the correct level of counterattack, the socially acceptable level of counterattack; a damaging counterpunch rather than a lethal bullet. Our instincts are context-sensitive and hedged 'round by other instincts. Our instincts are complex functional adaptations with multiple moving parts, often hosted on a modular chunk of brainware. Complex functional adaptations don't just materialize spontaneously in source code, just as complex dishes like pizza don't suddenly start growing on palm trees. Thus, a young AI might choose to retaliate, but would certainly not feel the need to retaliate, as a human would. To a young AI, retaliation is not an instinct; retaliation is just another subgoal.
What about a more mature AI, especially one that can rewrite vis own source code? Regardless of whether it would be a good idea, it would certainly be possible for a seed AI to create a reflex for instant retaliation.
There are several clear reasons why humans have evolved a retaliation instinct, rather than a retaliation logic. The primary reason is that a retaliation instinct is easier to evolve. The retaliation instinct evolved long before general intelligence, so evolving a retaliation logic first would not just have been more difficult, but actually impossible. Also, evolution tends to arrive at procedural solutions rather than declarative solutions, because a component of a complex procedural solution can be functional in its own right.
If genes could, somehow, store declarative knowledge, the first piece of knowledge stored would be "Punching back is good," which is simpler than "Punching back is good because it decreases the chance of future punches," which is simpler than "Punching back decreases the chance of future punches by modifying others' behavior", which is simpler than "Punching back modifies others' behavior because, on seeing you punch back, they'll project an increased chance of you punching back if they punch you, which makes them less likely to punch back." All of this is moot, since as far as I know, nobody has ever run across a case of genes storing abstract knowledge. (By this I mean knowledge stored in the same format used for episodic memories or declarative semantic knowledge.)
Abstract knowledge cannot evolve incrementally and therefore it does not evolve at all. This fact, by itself, is enough to completely explain away human use of retaliation instincts rather than retaliation logic, and we must go on to consider independently whether a retaliation instinct or a retaliation logic is more useful. For humans, I think that a retaliation instinct is more useful, or at least it's more of an evolutionary advantage. Even if we had conscious control over our endocrine systems, so we could deliberately choose to pump adrenaline, we would still be shot down by the sheer human-slowness of abstract thought. We are massively parallel systems running at 200Hz. When you're massively parallel you can afford to precompute things, and when you run at 200Hz you must precompute things because everything has to be done in very few serial steps.
When you run at 2 billion ticks per second, the overhead of recreating and rechecking a few previously-thought-out child goals is comparatively trivial next to all the other actions those subgoals entail, including complex, creative, parallel-intensive tasks such as computing motor actions (if ve decides to punch back) or thinking up an effective opening line (if ve decides to rebuke the attacker). For a human, interposing any extra serial step in a process is a huge time burden, but massively parallel operations vanish into the maw of trillions and trillions of synapses. For an AI, an extra serial step performed at the start of a thought is nearly unnoticeable (21 (http://singinst.org/upload/CFAI.html#foot-21)), but trying to imitate human parallelism would bring the AI to vis knees. (A later AI may have access to computing power rivalling human parallelism, but would presumably still have the faster serial speed.)
The conclusion: From an AI's perspective, a retaliation logic is probably considerably more effective than a retaliation instinct. The retaliation logic has a context-sensitivity that the retaliation instinct lacks. Suppose a young AI has some set of ultimate goals - traffic reduction, or hopefully Friendliness supergoals, or whatever. The young AI, by hypothesis, invents (or is taught) a retaliation logic under which "punching back" serves these supergoals. A maturing AI would have the capability to replace the retaliation logic with a retaliation instinct, or with a retaliation imperative having supergoal status. The instinct would be slightly faster - 0.01% faster, or perhaps 1% faster (eliminating a serial step at the start of the algorithm saves very little time, and most of the computational cost is computation-intensive motor logic or persuasive speech production). However, in doing so, the AI would lose a substantial amount of the context sensitivity of the retaliation logic - that is, from the perspective of the current set of supergoals, the supergoals that the AI uses to decide whether or not to implement the optimization.
Changing retaliation to an independent supergoal would affect, not just the AI's speed, but the AI's ultimate decisions - and from the perspective of the current set of supergoals, this new set of decisions would be suboptimal. The maturing AI considers whether changing the retaliation logic to an independent supergoal or optimized instinct is a valid tradeoff. The benefit is shaving one millisecond off the time to initiate retaliation. The cost is that the altered AI will execute retaliation in certain contexts where the present AI would not come to that decision, perhaps at great cost to the present AI's supergoals (traffic reduction, Friendliness, etc). Since recreating the retaliation subgoal is a relatively minor computational cost, the AI will almost certainly choose to have retaliation remain strictly dependent on the supergoals.
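The context-sensitivity argument can be shown in a toy contrast (entirely my construction, not the article's; the contexts and the decision criteria are invented for illustration). A retaliation *logic* re-checks the supergoals every time; a retaliation *instinct* fires unconditionally:

```python
# Toy contrast between supergoal-dependent retaliation logic and a
# hardwired, context-insensitive retaliation instinct.

def retaliation_logic(context):
    # Re-derive the decision from the supergoal each time: retaliate
    # only if it deters future attacks AND doesn't wreck the supergoal
    # in some other way (e.g., the "Kill the AI!" crowd of onlookers).
    deters = context['attacker_models_me']
    backfires = context['onlookers_hostile']
    return deters and not backfires

def retaliation_instinct(context):
    return True   # precomputed, context-insensitive: always hit back

calm = {'attacker_models_me': True, 'onlookers_hostile': False}
mob  = {'attacker_models_me': True, 'onlookers_hostile': True}

# The two agree in the easy case...
assert retaliation_logic(calm) == retaliation_instinct(calm)
# ...but the instinct fires even when doing so damages the supergoals.
assert retaliation_logic(mob) != retaliation_instinct(mob)
```

The millisecond the instinct saves buys exactly this divergence: a set of decisions the current supergoals would never endorse, which is why the AI declines the optimization.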
Why do I keep making this point, especially when I believe that a Friendly seed AI can and should live out vis entire lifecycle without ever retaliating against a single human being? I'm trying to drive a stake through the heart of a certain conversation I keep having.
Somebody: "But what happens if the AI decides to do [something only a human would want] ?"
Me: "Ve won't want to do [whatever] because the instinct for doing [whatever] is a complex functional adaptation, and complex functional adaptations don't materialize in source code. I mean, it's understandable that humans want to do [whatever] because of [selection pressure], but you can't reason from that to AIs."
Somebody: "But everyone needs to do [whatever] because [personal philosophy], so the AI will decide to do it as well."
Me: "Yes, doing [whatever] is sometimes useful. But even if the AI decides to do [whatever] because it serves [Friendliness supergoal] under [contrived scenario], that's not the same as having an independent desire to do [whatever]."
Somebody: "Yes, that's what I've been saying: The AI will see that [whatever] is useful and decide to start doing it. So now we need to worry about [scenario in which <whatever> is catastrophically unFriendly]."
Me: "But the AI won't have an independent desire to do [whatever]. The AI will only do [whatever] when it serves the supergoals. A Friendly AI would never do [whatever] if it stomps on the Friendliness supergoals."
Somebody: "I don't understand. You've admitted that [whatever] is useful. Obviously, the AI will alter itself so it does [whatever] instinctively."
Me: "The AI doesn't need to give verself an instinct in order to do [whatever]; if doing [whatever] really is useful, then the AI can see that and do [whatever] as a consequence of pre-existing supergoals, and only when [whatever] serves those supergoals."
Somebody: "But an instinct is more efficient, so the AI will alter itself to do [whatever] automatically."
Me: "Only for humans. For an AI, [complex explanation of the cognitive differences between having 32 2-gigahertz processors and 100 trillion 200-hertz synapses], so making [whatever] an independent supergoal would only be infinitesimally more efficient."
Somebody: "Yes, but it is more efficient! So the AI will do it."
Me: "It's not more efficient from the perspective of a Friendly AI if it results in [something catastrophically unFriendly]. To the exact extent that an instinct is context-insensitive, which is what you're worried about, a Friendly AI won't think that making [whatever] context-insensitive, with [horrifying consequences], is worth the infinitesimal improvement in speed."
Retaliation was chosen as a sample target because it's easy to explain, easy to see as anthropomorphic, and a good stand-in for the general case. Though "retaliation" in particular has little or no relevance to Friendly AI - I wouldn't want any Friendly AI to start dabbling in retaliation, whether or not it looked like a good idea at the time - what has been said of "retaliation" is true for the general case.
---
There's a lot more in the link, but I thought I'd highlight the most interesting bit.