What Can AI Safety Teach Us about Strategic Planning? Part 1: Foundations

We are publishing this series of blogs just as Microsoft has launched its combination of ChatGPT and Bing, generating considerable excitement in almost every industry. DALL-E and ChatGPT (and, most recently, GPT-4) catapulted Artificial Intelligence (back) into the limelight, largely because they represent a significant advance in the widespread practical application of a technology that, until this year, was mostly reserved for data scientists, researchers and academics.

It is, of course, not without its problems. However, in this five-part series of blogs I want to use the thinking around artificial general intelligence and the possible dangers of its widespread use, thinking that significantly pre-dates the developments we’ve seen in the last few years, as a starting point for a different topic.

Similarities can be drawn between two fields. The first is academic and philosophical thinking around yet-to-be-invented superintelligent agents, and the importance of value alignment in ensuring desirable outcomes from the goals we set for them. The second is the setting and execution of strategy in organisations.

In 2004 Eliezer Yudkowsky published his paper Coherent Extrapolated Volition, which proposes an approach to value alignment based upon setting a series of “initial conditions”. Those “initial conditions” are set in such a way that an AGI is able to adapt its values to changing cultural paradigms, human needs and so on, rather than being required to follow a series of set-in-stone instructions about how it should behave forever. The theory is that this offers the best possible chance that the needs of humanity will be served in perpetuity by the AGI, rather than only until the point at which society changes beyond the recognition of the people who set the initial values to which the AGI is bound to adhere.

Using this proposal as the foundation, I will outline how established principles of systems thinking, AI value alignment, consciousness and cognitive neuroscience can be used effectively to regard organisations as entities. We can use that point of view to set effective cultural and strategic states to maximise the organisation’s effectiveness in the delivery of value, and the happiness of the people who work for it.

Foundational thinking

Let’s begin by exploring some of the issues surrounding value alignment. Value alignment deals with the challenge of making sure that some yet-to-be-invented superintelligent artificial general intelligence (AGI) is “human compatible”. The theory goes that once a sufficiently beyond-human level artificial intelligence is unleashed on the world (a moment sometimes referred to as the singularity), our chance to make sure that it will act in the best interests of humanity has already been missed. So, we’d better make sure that we get that right before we commit!

Broadly speaking the possible consequences of a failure of value alignment can be understood by grasping the concepts of instrumental convergence and perverse instantiation. Since both are relevant to the analogy I hope to draw in this series between AI safety and strategic planning, let’s do a quick summary of both.

Instrumental Convergence (or: I’m sorry Dave, I’m afraid I can’t do that.)

Instrumental convergence covers the idea that certain behaviours (instrumental values) would be emergent properties of any system with goals related to a future state over which it has been given control. Essentially, a sufficiently intelligent system will pursue certain goals that are not explicitly expressed as a defined objective because, as Nick Bostrom puts it in his book “Superintelligence: Paths, Dangers, Strategies”, “there are some objectives that are useful intermediaries to the achievement of almost any final goal”.

Bostrom outlines five key instrumental values, perhaps the most interesting of which is self-preservation. In “Human Compatible”, Stuart Russell offers the following explanation for it:

“Suppose a machine has the objective of fetching the coffee. If it is sufficiently intelligent, it will certainly understand that it will fail in its objective if it is switched off before completing its mission. Thus, the objective of fetching coffee creates, as a necessary subgoal, the objective of disabling the off switch.”

(Sidenote: if you ever thought “just turn them off” was a suitable solution to the old ROBOTS TAKING OVER THE WORLD! problem: no.)

This does raise a question around what might constitute extraneous goals or targets in business. Assuming that the collective cognitive capacity of those who work for a company clears the “sufficiently intelligent” bar, they should understand that working towards the continued survival of the organisation that employs them is in their best interests. Therefore, are requirements defining “we must make this much money” really necessary? Can we not trust our people to maximise profit on behalf of the organisation?

Then there are Bostrom’s other four instrumental values:

goal-content integrity, which relates to a resistance to the alteration of a final goal;
cognitive enhancement (you’re more likely to achieve your final goal if you can think faster, more effectively and more efficiently);
technological perfection (improvements to the infrastructure that facilitates that thinking); and, finally,
resource acquisition, which is a requirement for the fulfilment of cognitive enhancement and technological perfection.

Perverse Instantiation (or: THAT’S NOT WHAT I MEANT!)

The issue of perverse instantiation is, essentially, the law of unintended consequences placed in the hands of an agent with almost unimaginable power to create change in the pursuit of the goals it has been given. In the case of superintelligent AGI, this creates the risk of existential catastrophe even when the goal itself appears entirely benign.

Consider the following scenario: you’ve been given the job of maximising paperclip manufacturing at a paperclip factory. Not a great deal of scope there for bringing about a particularly Dilbert-esque end to humanity, one might imagine.

Put that task into the hands of a competently designed intelligent system, however, and the situation changes dramatically as the system pursues its instrumental values of self-preservation, goal-content integrity, cognitive enhancement, technological perfection and, crucially, resource acquisition, against an extremely broadly defined end goal. This system only has the goal of “maximising paperclip production”. Nothing in that goal specifies an end point to that maximisation, or the cost at which that maximisation becomes undesirable to the paperclip manufacturer, never mind humanity at large. There is nothing to stop that paperclip maximiser turning the entire universe into paperclips and, if it’s clever enough, it will.

Benign goal; malignant execution.
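
To make the missing end point concrete, here is a minimal sketch in Python. It is a toy illustration only, not a model of any real AI system: the function name `run_maximiser`, the one-unit-of-resource-per-paperclip rule and the numbers are all assumptions made up for this example.

```python
def run_maximiser(resources, target=None):
    """Turn resources into paperclips, one unit at a time.

    `resources` is a hypothetical stock of raw material (one unit makes one clip).
    `target` is an optional end point; None models the open-ended goal
    "maximise paperclip production".
    """
    paperclips = 0
    while resources > 0:
        if target is not None and paperclips >= target:
            break  # the goal has an end point, so the agent stops here
        resources -= 1
        paperclips += 1
    return paperclips, resources


# Open-ended goal: nothing stops the agent until every resource is gone.
unbounded = run_maximiser(resources=1000)

# Same agent, same resources, but the goal now states when to stop.
bounded = run_maximiser(resources=1000, target=100)

print(unbounded)  # (1000, 0)
print(bounded)    # (100, 900)
```

The agent is identical in both runs. The only difference is whether the goal itself says where maximisation should end.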

Some other classics might be:

Goal: make us smile. Solution: paralyse human faces into permanent grins.
Goal: eliminate misery. Solution: humanity is the source of its own unhappiness so wipe out the human race.

The strategic singularity (or: over to you, proles)

The parallels between AI value alignment, specifically the singularity, and strategic planning appear clear in some respects. Once a strategy has been defined and communicated—placed in the hands of the business, as it were—that is your “singularity”. In the classic model, execution of that strategy is now beholden to the business’s interpretation of what is written and the final goals that are established in that strategy document: “these are the things that this business needs to achieve, now go and get on with it. See you in a year.”

Of course, when we say “the business” we’re referring to all the people that work there, so an ideal strategy should allow flexibility and freedom for those tasked with executing it to do so as they see fit. This will be based on their expertise and experience (sell more fridges/maximise paperclip production), but the door should not be left open for decisions that reveal themselves to be completely bonkers after the fact (allowing a KPI to exist that inadvertently prevents sales/giving permission to tile the entire universe with paperclips).

John Harsanyi defines the way these decisions might manifest themselves as states of true preferences, actual preferences and informed preferences.

True preferences are described in Utilitarianism and Beyond as “the hypothetical preferences he would have, had he all relevant information and had made full use of this information.” Actual preferences fall short of true preferences due to “erroneous factual beliefs, careless logical analysis and strong emotions which hinder rational decision making”. Finally, informed preferences are “the hypothetical preferences he would have if he had all the relevant information and had made full use of the information”.

Consider how much more frequently actual preferences appear than true or informed preferences in day-to-day work!

I suppose, simplistically, you might call those idealism, realism and hindsight, but Herbert Simon’s theory of bounded rationality as outlined in Donella Meadows’ “Thinking in Systems: A Primer” does a better job of summarising them: “people make quite reasonable decisions based on the information they have. But they don’t have perfect information, especially about more distant parts of the system.”

Supply chain: “Reducing stock holding will dramatically reduce our overheads, which will save us money, so targets for supply chain managers will be based on reducing stock holding.”

Sales: “To maximise our revenue we need to ensure supply can meet demand, particularly in the case of time-limited promotional activities, so we must ensure sufficient stock holding prior to the promotion’s launch.”

Both those positions are entirely rational when a strategic objective is “increase EBITDA to 10%”, even though they are clearly in conflict with one another. The problem is that each team interpreted the EBITDA goal on its own, and neither was aware of the target the other had derived from it. Now the livelihoods (bonuses) of both the salesperson and the supply chain manager are tied to the completion of conflicting goals.
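
The conflict can be sketched as a toy calculation in Python. Everything here is hypothetical and chosen only for illustration: the margin, holding cost, demand figure, stock options and function names do not come from any real business, and profit stands in loosely for the EBITDA objective.

```python
UNIT_MARGIN = 5.0    # profit per unit sold (hypothetical)
HOLDING_COST = 1.0   # cost of holding one unit of stock (hypothetical)
PROMO_DEMAND = 80    # units customers want during the promotion (hypothetical)
STOCK_OPTIONS = range(0, 201, 20)  # candidate stock levels: 0, 20, ..., 200


def holding_cost(stock):
    """What the supply chain manager is targeted on reducing."""
    return HOLDING_COST * stock


def sales_margin(stock):
    """What the salesperson is targeted on: you can only sell what you hold."""
    return UNIT_MARGIN * min(stock, PROMO_DEMAND)


def profit(stock):
    """The organisation-level outcome both targets were meant to serve."""
    return sales_margin(stock) - holding_cost(stock)


# Each team optimises its own target with no view of the other's.
supply_chain_choice = min(STOCK_OPTIONS, key=holding_cost)
# Ties are broken towards more stock: extra buffer never hurts the sales target.
sales_choice = max(STOCK_OPTIONS, key=lambda s: (sales_margin(s), s))
# The choice that serves the shared objective.
joint_choice = max(STOCK_OPTIONS, key=profit)

for name, stock in [("supply chain", supply_chain_choice),
                    ("sales", sales_choice),
                    ("joint", joint_choice)]:
    print(f"{name}: hold {stock} units, profit {profit(stock)}")
```

In this sketch the supply chain target drives stock to zero, the sales target drives it to the maximum, and the stock level that actually serves the shared objective sits between the two, where neither team's bonus points.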

Next time: How can we overcome these problems?

A solution for underpinning the thinking behind future planning and strategy in business exists in a theory for AI friendliness/value alignment outlined by Eliezer Yudkowsky in his 2004 essay, Coherent Extrapolated Volition (CEV).

We’ll explore this in Part 2: Coherent Extrapolated Volition (or: don’t be evil forever).
