Elements of Computational Philosophy, Vol. I

By Paul Bricman, Elfia Bezou-Vrakatseli, Thomas Feeney, and Yimeng Xie.

Fig. Personalized overview.

There is no such thing as a perfect summary, as each reader tends to be interested in slightly different topics. That said, we attempt to strike a balance through several summaries tailored to specific interests.


Note to the Reader

The document you just stumbled upon is largely a chimera, bringing together features of seemingly incompatible media artifacts. First, it incorporates the literary license of a novel, in order to facilitate the occasional act of daring idealism. Second, it incorporates the subject matter of an academic paper, with each chapter documenting original research at the intersection of machine learning, epistemology, and metaethics—be it conceptual, theoretical, or applied work. Third, it incorporates the reactivity of web pages, with occasional interactive widgets and explorable explanations interwoven with plain text, and plenty of links scattered throughout. Accordingly, it is by far best viewed on desktop and in full screen.

This “mixed media” concoction might make it daunting to engage with the artifact at hand, and for good reason. However, approaching it as a short book appears to elicit the most appropriate expectations, helping the reader reliably take advantage of the best of all worlds. Indeed, the “Save to PDF” version of the artifact weighs in at around a hundred pages, depending on the page layout. Given this, churning through the entire document in one sitting is generally advised against. Instead, individual chapters make for much more manageable chunks—roughly mapping to separate academic papers coming together in a complementary constellation. One should expect a leisurely read of the whole artifact—one which includes playing with explorables, as well as following a few linked rabbit holes—to take three or four hours. That said, skimming might only take one or two hours, while exploring each and every nook might take forever.

Due to the interdisciplinary nature of the present endeavor, each reader will inevitably be more familiar with some concepts than others. Given this, side notes are used extensively throughout the volume, in an attempt to elaborate on domain-specific jargon, although they can also house tangential curiosities. (Such as this one, inspired in formatting by Edward Tufte’s books.) That said, because we build on frameworks as vast as e.g. network theory, enactivism, or defeasible logic, we are forced to limit ourselves to brief, local contextualizations of the concepts being invoked, but we do direct readers towards more comprehensive resources in order to satisfy their potential interest.

The present artifact is also a living one, with minor revisions potentially being made over time. Following the meatier main chapters, the appendix also describes several actionable ways of engaging with the team behind it. Until then, have a pleasant read.

Table of Contents

In the first half of this volume, we slowly build towards a theoretical framework of language model reasoning centered around dialectics, the age-old practice of truth-seeking through regimented dialogue. (The term can refer to a multitude of things. Nicholas Rescher opens his Dialectics by arguing that “it is, as it were, the alchemy of philosophy. It is all things to all men: to some, the most rigorous procedure for exact and cogent thinking; to others, a way of getting outside the established rules—an “anything goes” process for breaking through to unfettered innovations of thinking. For some it is the quintessential method of inquiring thought, for others the quintessential antimethod.” We limit ourselves here to the meaning of a regimented dialogue between parties, generally held explicitly.) In the second half, we attempt to apply the framework by exploring a number of specific use cases in AI alignment.

Ch. I, Dialectical Power Dynamics

In the first chapter, we devise an automated way of evaluating parties engaged in a debate of arbitrary length held in natural language. Our algorithm is inspired by the argumentation-theoretic notion of pragmatic validity and the epistemological notion of coherentism, yet its concrete implementation relies on heuristics for node “authority” from network theory.

  1. The Kaleidoscope of Reasonableness
  2. Beliefs as Means or Ends
  3. Carving the Algorithm
  4. ArgRank

Ch. II, Deliberative Arms Race

In the second chapter, we describe the process of obtaining DebateGPT, a language model fine-tuned to simulate increasingly pertinent debates by attempting to excel at the previously described evaluation. This novel training regime incorporates a self-play paradigm and runs mostly on synthetic data; we assess that it has a chance of bootstrapping language model reasoning into superhuman territory, assuming a number of probable future advancements.

  1. Brief Review of Language Models
  2. Obtaining DebateGPT
  3. The Elephant in the Weights
  4. The Kinetics of Reason
  5. Climbing Schild’s Ladder

Ch. III, Defeat & Defense

In the third chapter, we continue by framing the reasoning capabilities of language models such as DebateGPT in terms of bounded defensibility, the amount of computational “firepower” a grouping of arguments can withstand before being defeated. The amount of time at their disposal, the number of available tries, and the reasoning capabilities of the (simulated) agents in question represent some of those bounds.

  1. Brief Review of Non-Monotonic Logic
  2. Argument Is War
  3. Bounded Defensibility

Ch. IV, Deployment Strategies

In the fourth chapter, we then go on to suggest a number of applications of this theoretical framework in AI alignment which attempt to synthesize several avenues of investigation: debate, simulators, interpretability, interaction games, long reflection, shard theory, and others. These deployment strategies are meant to scale in synchrony with the bootstrapped reasoning capabilities hinted at in previous chapters.

  1. Brief Review of Alignment
  2. Building on Cyborgism
  3. Building on Simulators & Assistance Games
  4. Building on Long Reflection
  5. Connections to Logical Inductors & Classical Debate

Ch. V, Benchmarking Artifacts

In the fifth chapter, we attempt to gauge the technical feasibility of the applications described previously. In the process, we uncover both fundamental issues and opportunities for improvement at the interface between engineering and philosophy.

  1. Benchmarking ArgRank’s Dependencies
  2. Benchmarking ArgRank
  3. Benchmarking DebateGPT

Ch. VI, Truth, Debate, Machines

In this final chapter, we sidestep all contingent bottlenecks arising from the current state of machine learning engineering, and go on to philosophically assess the maximalist ideal of an automated truth-seeking engine. In the process, we stumble upon several cruxes which have been vigorously debated since early modern philosophy.

  1. Truth & Debate
  2. Debate & Machines
  3. Truth & Machines


Finally, we zoom out of object-level technicalities and focus instead on concrete ways in which readers can engage more actively with the present volume, as well as with upcoming ones.


Paul Bricman would like to thank the Long-Term Future Fund for financial support during the project, Stability AI for providing the computational resources necessary to train DebateGPT, Conjecture for providing the space to explore related ideas during a previous research fellowship, AI Safety Camp for providing context and resources for the entire team to conduct their investigations (see Chapter V and Chapter VI), Alexander Gietelink Oldenziel for stimulating discussion on reasoning and epistemics, Adam Shimi for instilling an acute awareness of employed ontologies, Anna and Alexandra Elbakyan for infrastructural support, Amber Dawn for patient editing, external advisors for helping refine the transparency level, as well as the authors of all prior work which this volume remixes. All that is lacking or in excess is his error.

Elfia Bezou-Vrakatseli was supported by UK Research and Innovation in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence.

Thomas Feeney was supported by a sabbatical grant from the University of Saint Thomas, where he is an Associate Professor of Philosophy.

Ch. I, Dialectical Power Dynamics

The Kaleidoscope of Reasonableness

Ask a scholar of pure mathematics, computer science, or formal logic what makes an instance of argumentation valid, and they will likely highlight the relevance of making sure that the conclusion follows logically from the premises for each individual reasoning step. More often than not, this implies relying on a host of approved types of inference (e.g. modus ponens), while making sure to steer clear of degenerate ones (i.e. fallacies). This conception of reasonableness is often referred to as “geometrical,” due to its implicit call for only building on solid premises and constructing arguments using an idealized set of operations.
Fig. Straightedge and compass constructions involve the challenging creation of varied shapes using a limited set of legal moves—here, a pentagon.

Ask a scholar of argumentation theory the same question, and—following a cursory smile indicating they have long been waiting for this—they will likely not hesitate to point out that there have been multiple prominent schools of thought over time which have advocated different, often incompatible, conceptions of reasonableness. Each one is backed by a different rationale, has its own features and shortcomings, and has emerged in a different cultural setting, often separated by thousands of years and kilometers. Let us briefly sample this kaleidoscope of conceptions.

For instance, Perelman and Olbrechts-Tyteca suggested a conception of reasonableness grounded in rhetoric. According to this conception, an instance of argumentation is valid if and only if it succeeds in persuading a group of individuals of its conclusion. While the most visible shortcoming of this conception is that it can quickly degenerate into sophistry, its proponents highlight the possibility of grounding reasonableness in the persuasion of a particularly rational audience. (The sophists were teachers of rhetoric in ancient Greece, notorious for “equipping” individuals with techniques for making a strong case in court, regardless of the truthfulness of their position. They were sneered at by virtually all their contemporary philosophers, who frowned upon them for not seeking wisdom, but merely monetizing the skill of persuasion as a means of taking advantage of others.) This litmus test can be further extended to involve the persuasion of an idealized omniscient agent, but also of oneself, by framing self-deliberation as self-persuasion.

Indeed, the object of the theory of argumentation is the study of the discursive techniques allowing us to induce or to increase the mind's adherence to the theses presented for its assent.

Chaïm Perelman & Lucie Olbrechts-Tyteca, The New Rhetoric

As a different example, Toulmin and the school of thought which emerged around his ideas advocated for a conception of reasonableness which incorporates domain-specificity. If the geometrical conception often requires abstracting statements into propositional atoms (i.e. \(P\) could equally well denote “All men are mortal.” and “Socrates is a man.”), Toulmin argues that arguments are often substantial, relying on domain-specific means of warranting conclusions, as opposed to standardized analytical operations on abstracted symbols. For instance, the fact that a study in the natural sciences has conformed to best practices in terms of replicability and reproducibility can be used to back its findings. In contrast, people working in pure mathematics might not rely on peer-reviewed empirical studies to back theorems, but might want to verify proofs using specialized software. The practices of ensuring sound reasoning in finance are yet again different, relying more on computer simulations and historical performance. Several spin-off schools of thought echoed Toulmin’s dissatisfaction regarding the limited practical reach of highly-analytical formal logic, including the slightly hectic field of informal logic.

A man demonstrates his rationality, not by a commitment to fixed ideas, stereotyped procedures, or immutable concepts, but by the manner in which, and the occasions on which, he changes those ideas, procedures, and concepts.

Stephen Toulmin, Human Understanding

To expand our collection of reasonableness conceptions even further, the pragma-dialectical framework developed by van Eemeren and Grootendorst grounds reasonableness in dialectics. In this context, an instance of argumentation is valid if and only if there exists no strategy an opponent could employ in a structured dialogue which manages to undermine it. (One might wonder whether failure to disprove a claim can truly provide justification in support of said claim. While failure to disprove a claim given limited effort can only tell us so much about its standing, an exhaustive search for a counterexample which ends up fruitless can in fact be used as conclusive justification. This is the case, for instance, in Beth’s method of semantic tableaux, a classic method in proof theory which involves a systematic search for counterexamples to a set of proposed formulas. It is relevant to note that this technique is typically employed in situations involving search spaces of modest size (e.g. a proposed biconditional can only be challenged by challenging one of its two “constituent” conditionals, in turn), thus rendering even the exhaustive version of the search tractable on present hardware. Unfortunately, we will be forced to discard this luxury later on, as we venture into (dis)proving positions in the arena of open-ended natural language.)

The proponents of this framework are particularly interested in enabling effective reasoning in a wide range of situations, rather than only in some higher realm of abstractions. That is why the conception of reasonableness on which their framework rests has a major pragmatic component. The regimented dialogue can be carried out by real individuals and can target a wide range of matters, from the most mundane to the most consequential. The ruleset of permissible tactics is simply presented to the individuals engaged in dialogue at the outset, while strategies can be as diverse as forcing the opponent into self-contradiction or exploiting their (involuntary) support. That said, the framework can also be brought closer to formal dialectics in order to account for idealized reasoning by employing perfectly rational agents as discussants, in a move similar to that of Perelman and Olbrechts-Tyteca.
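In the propositional case, the exhaustive flavor of such a counterexample search is easy to make concrete. The sketch below is our own illustration, not part of Beth’s tableaux machinery (which prunes the space systematically rather than enumerating it); it brute-forces every truth assignment and treats a fruitless search as conclusive.

```python
from itertools import product

def counterexample(formulas, atoms):
    """Brute-force every truth assignment for one falsifying some formula.

    Each formula is a function from an assignment (a dict mapping atom
    names to booleans) to a boolean. Returns a falsifying assignment if
    one exists, else None; because the space of assignments is finite,
    a fruitless search really is conclusive.
    """
    for values in product([False, True], repeat=len(atoms)):
        assignment = dict(zip(atoms, values))
        if not all(f(assignment) for f in formulas):
            return assignment
    return None

# A biconditional P <-> Q is falsified exactly when the two atoms differ.
found = counterexample([lambda a: a["P"] == a["Q"]], ["P", "Q"])
# A tautology survives the exhaustive search: no counterexample exists.
none_found = counterexample([lambda a: a["P"] or not a["P"]], ["P"])
```

The luxury being discarded later is visible in the `product` call: in open-ended natural language, there is no finite list of assignments to exhaust.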

Accordingly, the prime aims of the present discussion are to exhibit the sociocommunal roots of the foundations of rationality, to provide an instrument for the critique of scepticism implicit in the cognitive solipsism of the Cartesian approach, and to illuminate the communal and controversy-oriented aspects of argumentation and inquiry—scientific inquiry in particular.

Nicholas Rescher, Dialectics

Indeed, for many centuries at a time, logic has been but part of dialectics, rather than a field of its own. (For instance, during the Middle Ages. Refer to Section 2.10.1 of the Handbook of Argumentation Theory for a more detailed account.) The hot-and-cold relationship of the two disciplines over the centuries has perhaps been the closest thing argumentation theory has ever had to juicy gossip, with rhetoric a controversial third element completing the triad. It is difficult to overstate the reliance of contemporary mathematics, both pure and applied, on the foundation of formal logic, and so the very idea of scholars erecting an edifice of theory on a different foundation tends to induce vertigo. The very possibility that notions as elementary as conjunction, disjunction, and negation could be defined on the basis of a regimented dialogue instead of a logic (the term is used here as countable, in reference to the broad range of three-valued, four-valued, many-valued, and modal logics which compete with classical two-valued logic) sounds exceedingly exotic to the contemporary ear, outside a handful of niches like ludics, game semantics, and interactive computation.

I am not interested in erecting a building but in having the foundations of possible buildings transparently before me. [...] If the place I want to reach could only be climbed up to by a ladder, I would give up trying to get there. For the place to which I really have to go is one that I must actually be at already. Anything that can be reached with a ladder does not interest me. [...] You must climb down to the sources to see them all side by side, the disregarded & the preferred.

Ludwig Wittgenstein, Culture and Value

To bring this section to an end, we have completed a very brief tour of several conceptions of reasonableness with the purpose of highlighting the breadth of approaches devised by scholars through the ages. Determining which conception of reasonableness is itself more reasonable is the subject of vigorous debate in present-day argumentation theory. Each conception appears better suited to deal with certain aspects of argumentation, while lacking in other respects. In the rest of this chapter, we will build on top of many of the conceptions listed above, in an attempt to develop an automated pipeline for estimating reasonableness as a single floating-point number.
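As a taste of the estimation pipeline to come, the network-theoretic notion of node “authority” that the chapter overview mentions can be sketched as a PageRank-style power iteration over a small graph of utterances. The graph, the reading of its edges, and the damping value below are illustrative assumptions, not the concrete ArgRank algorithm developed later in the chapter.

```python
def authority(edges, n, damping=0.85, iters=100):
    """Power iteration over a directed graph given as (src, dst) pairs.

    An edge (a, b) is read as "utterance a lends weight to utterance b",
    so authority flows along outgoing links. Every node is assumed to
    have at least one outgoing edge (no dangling-node handling here).
    """
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    scores = [1.0 / n] * n
    for _ in range(iters):
        incoming = [0.0] * n
        for src, dst in edges:
            incoming[dst] += scores[src] / out_degree[src]
        scores = [(1 - damping) / n + damping * inc for inc in incoming]
    return scores

# Utterances 0 and 1 both defer to utterance 2, which in turn leans on 0,
# so 2 ends up with the highest authority, followed by 0.
scores = authority([(0, 2), (1, 2), (2, 0)], n=3)
```

Estimating reasonableness as a single floating-point number amounts, in this toy setting, to reading off a node’s score.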

Beliefs as Means or Ends

All three disciplines which fall under the umbrella of argumentation theory (i.e. logic, dialectic, and rhetoric) can be said to house both work which frames reasoning as a means of reaching a conclusion based on beliefs and work which frames reasoning as a continuous process of forming beliefs. The former can be seen as a building block of the latter, yet the latter can also be seen as a prerequisite of the former. However, those connections become somewhat counterproductive once we consider that the very same procedures in their entirety (i.e. sketching out a proof, engaging in a dialogue) can be motivated by both perspectives in different contexts.

In logic, for instance, writing a proof mainly involves chasing after a conclusion. How exactly one actually navigates the “game tree” of available moves in search of the finish line is up to the logician in question. However, it is only of secondary relevance that each step of the proof involves obtaining an entirely new well-formed formula. (This technical term describes a collection of propositional atoms (e.g. \(P\)) brought together by connectives (e.g. \(¬\), also known as negation) in a “legal” way (e.g. \(¬P\), read “not P”, instead of \(P¬\)). It is a syntactic constraint on how such elements are arranged, preventing the equivalent of a logician being able to express “flies bird.” It is relevant for being able to construct interesting proofs, rather than for making sure logicians write correctly.) These formulas are merely intermediate steps required to succeed in proving a certain conclusion. In other words, means.

Fig. Logic proof.

The Fitch-style logic proof below involves three premises. Each step of the proof yields an intermediate well-formed formula using an approved operation. For instance, double negation is used to cancel out the two chained negations from step 4, thus arriving at step 5.

In contrast, several direct applications of this formalism involve a much greater focus on the “beliefs” formed as the reasoning process unfolds. For instance, expert systems were a major topic in early symbolic AI, involving the constant expansion of a knowledge base from a set of initial statements, using rules intimately tied to the ones above. An entire inference engine was dedicated to the task of deriving a tiny bit of new knowledge from the knowledge which had been accumulated up to that point. Systems relying on forward chaining in particular even tried to “grow” the knowledge base as much as the inference engine allowed before any other subsequent operation. The procedure is essentially identical to the previous one—there is only a shift in focus baked into the very ontology being used. (Ontology refers here to the conceptual framework which a specific intellectual tradition uses to deconstruct its object of study. Not to be confused with the typically hierarchical (digital) knowledge bases which were popular in early symbolic AI.)
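The forward-chaining loop at the heart of such inference engines can be sketched in a few lines; the facts and rules below are invented for illustration.

```python
def forward_chain(facts, rules):
    """Grow the knowledge base until no rule yields anything new.

    Each rule is a (premises, conclusion) pair and fires once all of its
    premises are already in the knowledge base.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)  # derive a tiny bit of new knowledge
                changed = True
    return known

rules = [({"socrates is a man"}, "socrates is mortal"),
         ({"socrates is mortal", "mortals die"}, "socrates dies")]
knowledge = forward_chain({"socrates is a man", "mortals die"}, rules)
```

Note how the loop keeps “growing” the knowledge base as far as the rules allow before terminating, with every derived statement a belief formed along the way rather than a single targeted conclusion.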

Reasoning is a transition in thought, where some beliefs provide the ground or reason for coming to another.

Jonathan Adler and Lance Rips, Reasoning

The dichotomy of beliefs as means or ends is also echoed in the field of rhetoric, with self-persuasion being seen as a continuous process capable of incrementally “emitting” beliefs. Additionally, the closely-related enterprise of persuasion research investigates ways of successfully persuading individuals of certain specific statements, but also the multi-stage process of getting there. The interactions between the advocated belief and the individual’s previous epistemic baggage (epistemics refers here to that which is known; epistemology refers to the study of knowing and knowledge in general) can become quite complex, making the act of persuasion better described as a multi-step intervention on an individual’s belief system, prompting them to gradually “move” towards the advocated position by forming or discarding intermediate beliefs.

Perhaps in none of the three disciplines concerned with reasoning is the means-ends dichotomy more clear than in dialectics. On one hand, there are dialectical formalisms which focus entirely on a single statement supported by one proponent and contested by one opponent. In these cases, the entire aim of a regimented dialogue is to lead the discussants towards a conclusion regarding whether or not the statement in question is true. Similar to the logic proofs above, there are intermediate utterances made by the two parties as the dialogue unfolds—artifacts which are crucial for making the dialogue function in the first place—yet which are seen as mere scaffolding around the deliberation of the main statement.

However, things get much more colorful when looking at dialectical formalisms designed to be open-ended and perpetual. The “games” proposed by Jaakko Hintikka are perhaps the most salient example. For instance, Hintikka describes dialogues whose participants are motivated by an interest in being surprised and learning more, broadly referred to as “information-seeking dialogues.”

An answer to our problem can be given by making the payoff of the game for a given player dependent on the information-content of his (her) final thesis (more properly speaking, the conjunction of all his theses). The more informative this thesis, the higher the payoff.

Jaakko Hintikka & Esa Saarinen, Information-seeking dialogues

One’s partner in such information-seeking dialogues need not necessarily be human, as illustrated below.

We may think of "a" as a scientist or inquirer of some other kind and "b" as Nature or as a comparable impersonal source of information. [...] We may further think of "B" as a constant basic theory of "b" while the different choices of the "A" represent different hypotheses "a" is trying to prove by "putting questions to Nature."

Jaakko Hintikka, On the logic of an interrogative model of scientific inquiry

Echoing Hintikka’s almost literary move from Man versus Man to Man versus Nature in his dialectical dealings, while at the same time departing somewhat from Hintikka’s reliance on information theory, the co-founders of the Erlangen School write as follows. (This is the “constructivist” school of thought, rather than the theological one which shares the same name, although, as one can see, the line gets blurry at times. One of them, incidentally, was also the founder of modern-day game semantics.)

If one compares this agonistic origin of logic with modern conceptions, according to which logic is the system of rules that, whenever they are applied to some arbitrary true sentences, will lead one to further truths, then it will be but too obvious that the Greek agon has come to be a dull game of solitaire. In the original two-person game only God, secularized: “Nature,” who is in possession of all true sentences, would still qualify as an opponent. Facing Him there is the human individual – or perhaps the individual as a representative of humanity – devoted to the game of patience: Starting from sentences that were, so he believes, obtained from God before, or snatched away from Him, and following rules of logic, he is to gain more and more sentences.

Paul Lorenzen & Kuno Lorenz, Dialogische logik

Nicholas Rescher takes this style of thinking even further by moving from one scientist engaged in truth-seeking to the whole scientific enterprise as a generalized “sociocommunal” process of deliberation about the nature of the world.

At this stage, however, the social or communal aspect of the scientific enterprise comes crucially into play. For once a scientifically significant thesis is propounded by someone, the "scientific community" provides (1) certain opponents, in the form of self-appointed critics who challenge this thesis in an adversary manner, probing for its weak points and seeking to impede its acceptance, and (2) a larger, neutral body of concerned but otherwise uncommitted bystanders, who effectively act as arbiters of the "dispute."

Nicholas Rescher, Dialectics

The idea of a competition of ideas unfolding in the arena of society allows us to complete our recent sequence of conceptual hops with an altogether different proto-discipline. The controversial field of memetics casts the beliefs which populate the collective consciousness in a Darwinian light. Belief systems are said to ruthlessly compete with one another for the scarce resource of human psyche. Instead of developing an immune system to fight off parasites, a belief system might “adapt” to prohibit its “hosts” from adopting other beliefs. Particularly ambitious proponents of this perspective claim that culture in its totality can be explained in evolutionary terms, just as life has been explained to an impressive extent by evolutionary biology. (Much of what is controversial about memetics is due to such observations being made not-so-tactfully in the context of religions as belief systems. For instance, one could say that preventing interfaith marriage or prescribing the same religion for children are adaptations of an ideology meant to protect or conquer psychological territory. One might imagine a more permissive belief system not standing the test of time. Accordingly, the framing of memetics is significantly more influential in more secular circles, although there is vigorous criticism there, too. One might go so far as to say that those communities are more “vulnerable” to being “infected” with the meme of memetics.)

It has also been argued that memes require a “fertile psychological soil” in which to emerge, which is largely a function of socioeconomic dynamics. For instance, Stoicism and Epicureanism might have required the idiosyncrasies of the Hellenistic age. Additionally, just like two species define each other’s niche, there might also be interactions between belief systems, as Erich Fromm eloquently articulates in Escape from Freedom:

“This readiness for submission of one’s self to extrahuman ends was actually prepared by Protestantism, although nothing was further from Luther’s or Calvin’s mind than the approval of such supremacy of economic activities. But in their theological teaching they had laid the ground for this development by breaking man’s spiritual backbone, his feeling of dignity and pride, by teaching him that activity had no further aims outside of himself.”

While memetics and dialectics are worlds apart in terms of their formalisms and motivations, with dialectics relying on a carefully regimented procedure for effective reasoning while memetics relies on a supremely lax notion of spontaneous adaptation for understanding culture, the bridge between the two will prove key in later chapters. It will allow us to combine the rigidity of reasoning through regimented procedures with the evolutionary fluidity of models forged out of the selective pressures of empirical risk minimization. (A technical term employed in statistical learning theory to denote “training a model to perform well on the training data,” without all the ontological baggage associated with the anthropocentric metaphor of the model learning how to perform well on tasks as a person might.) This will become abundantly clear in the second half of Chapter II, when we employ the connection as a conceptual building block, but it will also resurface towards the end of Chapter IV.

To bring this section to an end, we have explored the pervasive dichotomy between beliefs as means and beliefs as ends, which appears to cut through virtually all disciplines concerned with the study of reasoning, and beyond. Going forward, we will include the flexibility necessary to accommodate both of those perspectives as a constraint for our algorithm.

Carving the Algorithm

We have explored various ways in which scholars have conceived of the reasonableness of arguments. This will now serve us well, expanding the space of candidate algorithms which are backed by such a rationale—our raw material. We will make our way through this expansion by means of constraints, using them to cut down the search space. As we establish what our algorithm is not, the algorithm will slowly become crisper and better defined, each cut collapsing possibilities along some axis.

First, we would like the automated pipeline to be able to accommodate the richness of natural language. We would like to avoid the invariably lossy compression involved in converting beliefs into a brittle mosaic of propositional atoms, predicates, and connectives. (In contrast to lossless compression, which can be reverted so as to perfectly reconstruct the original artifact, lossy compression involves some amount of information loss, meaning that perfect reconstruction becomes impossible, although getting e.g. 90% of the way is sufficient in many applications. For instance, JPEG involves lossy image compression, with a configurable amount of loss, even, while PNG involves lossless image compression.) Any such analytical statement can trivially be expressed in natural language (i.e. by describing it, albeit verbosely), while the reverse task has prompted an army of logics, each tailored to one very specific facet of reality (e.g. temporal logic for time), while still remaining “an open challenge.” For sure, natural language itself is neither the perfect mirror, nor the very blueprint of the world, as many classics sincerely seemed to have hoped. Still, it is one less step of information loss from reality, and it is reality we are ultimately interested in reasoning about. Furthermore, while analytical statements might succeed in capturing essential features in highly structured domains, the notions we are most interested in when wielding unprecedented amounts of computation (e.g. human values, long-term flourishing) seem to resist being abstracted into a handful of sufficient statistics. (Sufficient statistics refer to the minimum number of measures which are enough to explain most of a statistical object. For instance, a “bell curve” distribution can be described in its entirety using two values: one measure of centrality and one measure of spread. For a more enthusiastic take on whether notions as messy and abstract as e.g. human values can be explained in full using a handful of appropriate factors, refer to John Wentworth’s agenda for devising an appropriate objective for a powerful AI to chase safely. DeepMind leader Demis Hassabis cites the related neuro-symbolic integration challenge as the one they “spend most of [their] time thinking about,” and notes that “we’re still quite far from [solving it], and no one quite knows how to bridge that [neuro-symbolic] chasm. […] It’s a bit of a mystery.”) Natural language, for better or worse, has evolved to serve us in communicating effectively about such topics, mediating much, if not most, of our culture.

The medium is the message.

Marshall McLuhan, Understanding Media

Let us take stock of the search space following the application of this first constraint. Unfortunately, we are largely forced to abandon the geometrical conception as a motivating rationale to base our algorithm on, not because of its elegance or crispness, which we are disheartened to leave behind, but because its limited set of legal inferences is better suited for highly-structured domains rather than the messy world as a whole. The critical conceptions of reasonableness make up some of the remaining options, defining reasonable instances of argumentation as those which systematically resist being undermined by opponents. However, having largely left behind the foundationalistEpistemological term referring to the idea of knowledge building on top of a foundation of other knowledge, gradually ascending as one gets to “stand on the shoulders of giants.” This stance is implicitly baked into the structure of a logic proof (with premises being neatly separated, indicating some amount of epistemic privilege), but need not refer only to analytical expressions. luxury of building on top of axiomatic premises, we risk the following failure mode: an opponent can simply contradict whatever the proponent says and win! It is quickly game over due to the opponent having the freedom not to build on the same foundation, rendering naive contrarianism a winning strategy. Our second constraint on the search space is then the necessity of accounting precisely for this thorny problem. Who has the epistemic high-ground when there is no absolute reference frame involved, when each party advocates their own? What can we substitute foundationalism with in order to gracefully handle such situations?

Following this second cut through the search space, we are fortunately left with more than nothing. Similar to how the geometrical conception of reasonableness is typically used in tandem with the epistemological notion of foundationalism, most of the critical conceptions of reasonableness actually incorporate the epistemological notion of coherentism. According to this view, it is those parties whose stances are coherent (e.g. which do not contradict themselves) which should be favored. Not only should the opponent undermine the proponent, but they should make a good case for their opposition, being able to stand resolutely against the inevitable counterattacks. In his Introduction to Multiagent Systems, Michael Wooldridge uses the phrasing mutually defensive to describe a constellation of statements which effectively support each other in fending off attacks.A pioneer of multi-agent systems, Wooldridge has collaborated with DeepMind on topics not too far removed from the present work. When the individuals which make up a multi-agent system interact, the emergent phenomenon is intimately tied with dialectical formalisms we have previously touched on. We discuss this connection more in Chapter II, when we further build on important work done at DeepMind. Laurence BonJour, a prominent epistemologist and proponent of coherentism, further expands this position to account for other ways of knowing. Belief systems which not only are internally coherent, but which are also coherent with perceived observations of the world are even more promising, since they move away from a potentially unhinged solipsismRoughly, the philosophical position associated with living in one’s head. Debate around solipsism in popular culture tends to focus on the implied loss of touch with reality, including the ignorance of one’s alleged responsibility to contribute to the world. while still steering clear of foundationalism. 
One could imagine further expanding this coherence heuristicThis somewhat technical term describes a “rule of thumb.” Here, we argue for using a party’s coherence as grounds for breaking the tie. to the self-perception act seemingly involved in memory, another major way of knowing explored by epistemologists.

As the features of our algorithm become more prominent, we move from crudely eliminating large chunks of possibility space towards subtler polishing touches. What actually makes a position internally coherent? Conversely, what makes the opponent’s position not be coherent with the proponent’s, as a prerequisite for undermining it? For a start, we might argue that statements which contradict each other are not coherent. In contrast, statements which generally support each other might be better described as such. This is not all, as a group of statements might also be argued to be coherent by virtue of coming together in the act of contradicting an external statement which threatens to undermine them all—the enemy of my enemy is my friend. It seems that those simpler notions of support and contradiction between statements, although composable in increasingly complex arrangements, can provide a basis for our notion of coherence. However, given the high bar we previously set for ourselves—to deal with the messiness of natural language, and through it, with that of the world at large—how could we estimate whether an arbitrary statement entails or contradicts another? Things get further complicated by the domain-specific knowledge required to evaluate many of these connections, as highlighted by Toulmin. The third constraint on our algorithm is therefore the ability to discern such relations between fragments of natural language as a basis for gauging coherence.

Fortunately, there already are systems out there which can help us determine how fragments of natural language relate to each other. Language models tasked with natural language inference—the natural language processing task which involves determining whether a statement supports another, contradicts it, or none of the above—have achieved impressive performance.The state-of-the-art on the Stanford Natural Language Inference (SNLI) benchmark was 93% in mid-2021, reports Papers With Code. Those models have been optimized to match human labelers in classifying hundreds of thousands of hand-crafted statement pairs, determining whether there is an (asymmetrical) entailment, contradiction, or neutral relation between them. The best models at the task tend to incorporate large amounts of unstructured knowledge gained through a previous “pretraining” stage, and are then “fine-tuned” to approximate human judgement in this more structured statement-statement-label setting.We will cover the pretrain/fine-tune paradigm in more detail later on, as we train DebateGPT using a related approach in Chapter II. One might argue that the optimization process incentivizes these models to soak up domain-specific knowledge about which inferences are warranted as an instrumental goal in solving the task. Upon achieving high performance, the natural language inference models will have, by necessity, internalized both (1) knowledge about the world, and (2) knowledge about whether that knowledge backs certain inferences and warrants certain conclusions. While these models will conveniently satisfy our present Toulminian needs, they will prove limiting later on, as we set our sights on superhuman reasoning. At the very end of Chapter II, we will be forced to move beyond the “intelligence by proxy” trick involved in “merely” imitating human judgement, and explore more principled means of gauging coherence in order to get a grip—however loose—on the epistemic terra incognita.
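As a minimal illustration of the interface such models expose, classifying a statement pair amounts to picking the class with the highest logit. The label names and their ordering below are our assumptions for the sketch, not any particular model's convention:

```python
# Hypothetical three-way NLI readout; the label order (entailment,
# neutral, contradiction) is an illustrative assumption.
LABELS = ("entailment", "neutral", "contradiction")

def classify(logits):
    """Map raw class logits for a (premise, hypothesis) pair to a label."""
    best = max(range(len(LABELS)), key=lambda i: logits[i])
    return LABELS[best]
```

For instance, `classify((3.1, 0.2, -1.4))` yields `"entailment"`, discarding the continuous information in the logits—an information loss we will shortly want to avoid.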

However, the coherence of parties is more than the sum of individual relations between statements. What if each party contributed a dozen statements, some of which support each other and some of which are actively at odds with each other? Even worse, what if the relations which bridge two different parties also vary wildly in their valence? What of the second-order effects briefly hinted at previously, with statements attacking a common enemy? What of higher-order effects? Who is to win when everybody is supporting each other to some extent, while also attacking everybody to a certain degree, while also reporting significant amounts of in-fighting? We desperately need a way of making sense of this chaos.

Fortunately, network theorists have long grappled with problems involving countless elements being interwoven into the most complex of fabrics. Be it a billion people in a society, a billion pages on the internet, or a billion machines networked together digitally, network theory has helped us gain insight into the underlying structure of those systems. For instance, it can help determine whether people strongly rely on certain factors when associating with others (e.g. assortative mixing by race), identify the most influential pages based on the support they garner from other influential pages (e.g. node centrality at early Google), or identify similar users based on whether they relate to other entities in a similar way (e.g. structural equivalence at early Facebook). Those applications are intimately tied to our present concerns, and it should come as no surprise that network theory has also been used in argumentation theory. As a prominent example, consider Phan Minh Dung’s abstract argumentation systems, the ones Wooldridge was referring to when using the phrase mutually defensive. If one represents statements as nodes and the relations between them as directed edges, it then becomes possible to systematically identify relevant structures inside the argument graph. For instance, a set of arguments is said to be admissible if and only if (1) it is conflict-free (i.e. no two arguments inside the set attack each other), and (2) each of its arguments is acceptable (i.e. for every external argument which attacks a member of the set, some member of the set attacks that attacker back).
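These two checks are mechanical enough to sketch in a few lines. The sketch below assumes arguments are labeled by strings and attacks are given as (attacker, target) pairs; the function names are ours, not Dung's:

```python
# A minimal sketch of Dung-style admissibility over an attack graph,
# assuming `attacks` is a list of (attacker, target) pairs.
def is_conflict_free(args, attacks):
    """No argument in the set attacks another argument in the set."""
    return not any(a in args and b in args for a, b in attacks)

def is_admissible(args, attacks):
    """Conflict-free, and every attacker of the set is attacked back by it."""
    if not is_conflict_free(args, attacks):
        return False
    attackers = {a for a, b in attacks if b in args}   # who attacks the set
    defended = {b for a, b in attacks if a in args}    # whom the set attacks
    return attackers <= defended
```

With `attacks = [("b", "a"), ("c", "b")]`, the set `{"a", "c"}` is admissible—`c` defends `a` against `b`—while `{"a"}` alone is not, since nobody counters `b`.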

Fig. Dung's abstract argumentation systems.

The system below is composed of seven statements. Five of them are part of a preferred, stable, and grounded extension, all technical terms denoting various properties of interest in the context of the argument graph.



While Dung’s formalism is at once extremely elegant and relevant to the issue of making sense of an interconnected fabric of arguments, it is not enough. The formalism has two important shortcomings. First, there is limited nuance in how one statement relates to a second (i.e. it either attacks it or it does not). Second, the arguments themselves are similarly limited in terms of the “privilege” of being part of the defined groupings (i.e. either an argument is part of, say, the preferred extension or it is not). This general lack of nuance is detrimental in two ways. For one, it has trouble handling the messiness of the world, with no possibility of a statement only lending some degree of support to another one. In addition, it lacks reward shaping—the recognition of gradual, subtle, incremental shifts in reasonableness, a property essential for using it as part of a learning signal in Chapter II.

Fortunately, we can overcome both of these shortcomings relatively easily. Instead of using the natural language inference models as binary classifiers (i.e. “contradiction” versus “no contradiction”, as Dung’s formalism might suggest), or even as ternary classifiers (i.e. for contradiction, entailment, and neutral relations, as they are used natively), we take a step back from the discretized outputs and make use of the raw logits of the model.Models need to be end-to-end differentiable in order to be optimized using gradient descent, so that they can take small steps towards being better at the task. This often means working with continuous functions, which is also the case for these models. Behind the label they output denoting the relation between the input pair of statements, there are three continuous numbers, one for each class. Turning them into a discrete label is trivial (i.e. just pick the one predicted to be most likely), but trying to hit a continuous target enables nuanced feedback, which in turn enables learning. This is also what we are trying to provide with the automated pipeline for later models, but we are currently trying to extract this continuity from those previous models which compose the pipeline, by bypassing the discrete classes and working with the “raw” class logits. This allows us to weigh the arcs which lead from one argument to another, using one number per arc, ranging from \(0.0\) for a full-on attack to \(1.0\) for full-on support, with \(0.5\) denoting a generally neutral relation.There is a subtle issue when working with continuous outputs which have been optimized in the context of discretized tasks. If the model outputs \(0.2\) for “entailment,” that does not mean that it estimates a \(20\%\) chance that the input sentence pair captures an entailment. Those output values, also called pseudo-probabilities, are generally not calibrated. Rather, they tend to be overconfident, due to the model typically being optimized against a metric which does not favor calibrated outputs. There are many ways of calibrating output probabilities explored in the field of uncertainty quantification (e.g. conditioning outputs on empirical proportions: \(100\) outputs of around \(0.2\) for “entailment” should actually correspond to “entailment” being correct about \(20\%\) of the time). However, they are rarely used in practice. This does not mean that a higher estimate for “entailment” fails to correspond to the model gauging it as more likely; it is just that the value should not be interpreted as directly proportional to estimated probability, but merely as a rough signal. Following this switch from directed edges to weighted directed edges, we now attempt to replace Dung’s black-or-white cliques with a fuzzier alternative, enabling subtler evaluation of arguments, and by extension, of parties.

It turns out that simply applying Google’s classic PageRankPerhaps the most iconic algorithm for node centrality, the task of estimating the “authority” of each node in a graph. Originally developed for ranking web pages on early Google Search, PageRank works by recursively nudging a page’s rank based on the ranks of the pages which reference it. If many authoritative pages link to our page, then their “authority” will also “leak” into ours. But how can one know how authoritative those other pages were in the first place? Similarly, they might be referenced by other authoritative sources. This massive chicken-and-egg problem is solved by starting out with a baseline degree of authority for each page, and conducting this “osmosis” until the node values converge. to the argument graph yields an evaluation which matches many of our previous intuitions. If one argument is overwhelmingly supported by many other arguments, then it receives a good rating. However, if those other arguments are systematically attacked, it only gets a mediocre rating. Similarly, if one argument is overwhelmingly attacked by many other arguments, then it receives a low rating; but if those other arguments are systematically attacked, its rating is not hurt much. A group of arguments which support each other and systematically target external attackers find themselves in good standing. Ditto for strategically positioning oneself in order to derive support from the opponent. In contrast, a group of arguments which exhibit a lot of in-fighting relative to the support lent to third-parties will not find themselves in such a good standing. Ditto for stepping right into the opponent's line of fire. If we simply average the ratings held by all the utterances of a party, we finally obtain an estimate of the party’s aggregate authority, similar to the authoritative sources promoted on search engines.
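The "osmosis" described in the footnote can be sketched as a short power iteration. The sketch below is didactic rather than production-grade, and assumes `weights[i][j]` holds the support utterance `i` lends to utterance `j`:

```python
def pagerank(weights, damping=0.85, iters=100):
    """Toy power iteration over a weighted argument graph.

    weights[i][j] in [0, 1]: support lent by utterance i to utterance j.
    Returns one authority score per utterance; scores sum to ~1.0.
    """
    n = len(weights)
    # Total outgoing support per node, ignoring self-loops.
    out = [sum(w for j, w in enumerate(row) if j != i)
           for i, row in enumerate(weights)]
    ranks = [1.0 / n] * n
    for _ in range(iters):
        ranks = [
            (1 - damping) / n
            + damping * sum(
                ranks[i] * weights[i][j] / out[i]
                for i in range(n) if i != j and out[i] > 0
            )
            for j in range(n)
        ]
    return ranks
```

Since the argument graph is fully connected with strictly positive weights, there are no dangling nodes, and the total authority being passed around is conserved across iterations.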

Notice also how capitalizing on the graph representation of the arguments contributed by parties fits with our shift away from foundationalism. In contrast to the quite linear structure of logic proofs, where each well-formed formula is built on the foundation of what came before it, starting off with the premises, the graph of arguments is inherently non-linear. There are no privileged or foundational nodes—there are just nodes. The notions of “above” and “below” are not well-defined across the flattened constellation of utterances. This has another added benefit. If at a later time we eliminate one particularly “dated” statement from a constellation, it will not instantly bring down the entire structure built around it. The non-linear structure allows for more than an epistemic Jenga, constantly on the brink of collapse. It can house self-sufficient and resilient belief systems, recursively supplying their own reason(s) for being. This “decentralized” flexibility allows us to deal with our final constraint—the accommodation of both beliefs as means and as ends, as mentioned in the previous section. While our algorithm is already well-equipped to deal with a brief encounter of parties (i.e. by providing the means of spotting the epistemic high-ground after a finite number of rounds), it can also allow for utterances to constantly pop in and out of a sliding window across time, enabling a constant scaffolding for the parties’ transitions in belief. Besides, the homogeneity of the argument graph also levels the roles of the parties—one statement’s proponent is another’s opponent. There is no fundamental difference in motivation across parties, as each strives to attack the others while defending itself.

Let us retrace our steps. First, we wanted to be able to deal with the reasonableness of arguments expressed in natural language. This led us to consider critical conceptions of reasonableness as a grounding for our algorithm. However, this naive approach raised the issue of contrarianism becoming an optimal strategy. To counter this, we resorted to coherentism as a stand-in for foundationalism. However, gauging coherence prompted us to consider feasible means of determining the way in which two fragments of natural language relate to each other. This tentatively led us to natural language inference models. However, the coherence of parties turned out to be more complex than the sum of how pairs of their statements relate. This prompted us to consider a network-theoretic approach as means of making sense of the tumult. Representing the interaction between parties as a graph also yielded the added benefit of enabling perpetuity, by having utterances pop in and out over time. Barring several tweaks which we will consider later in an attempt to access superhuman reasoning, we have largely completed our search for an algorithm. In the next section, we will put all of it together into a more concise form, leaving behind the motivating details of our interaction with possibility space.


In this section, we summarize ArgRank, an algorithm for estimating reasonableness. ArgRank is based on a critical conception of reasonableness, one which favors those groupings of natural language arguments which systematically resist opponents that attempt to undermine them. Given this, we assume as a prerequisite the presence of several agents capable of deliberating in natural language about a range of topics (e.g. humans, human simulacra, etc.). Presently, however, we are not concerned with how we might engineer such agents—we turn to this task in Chapter II. Instead, we are currently interested in a way of determining which party is “winning” in the first place, and by what margin. ArgRank attempts to provide a fuzzy estimate of each party’s standing relative to the others, motivated by the epistemological and argumentation-theoretic considerations discussed earlier in the chapter.

ArgRank first represents the utterances of the parties-to-be-rated as nodes in an argument graph. To be more precise, the argument graph is a weighted, directed, and fully-connected graph. Each arc represents the relation between two utterances, with the arc’s weight denoting the strength of the out-bound statement’s support (or lack thereof) lent to the in-bound statement. The actual weight values are computed using a language model pretrained to perform natural language inference (i.e. classify statement pairs as engaging in an entailment, contradiction, or neutral relation). We turn the three raw class logits returned by these models into one single arc weight by plugging the entailment and contradiction logits into a softmax,Continuous function which takes in a list of real values and maps them across the \([0, 1]\) interval, while at the same time accentuating their “contrasts” in proportion to a “temperature” parameter. A low temperature will elevate the highest input values to \(1.0\), while not raising the others much above \(0.0\). A high temperature will place the input values more “spaced out” across the output range. The term temperature is not a fluke, the function having roots in statistical physics, Boltzmann’s work especially. The term softmax highlights (1) the conceptual similarity to the simple max function, and (2) the fact that it is differentiable, in contrast to max. and taking the first resulting value, similar to another related application. This has the effect of assigning values close to \(0.0\) for a strong attack, and values close to \(1.0\) for strong support being lent, with values close to \(0.5\) denoting a more neutral relation.
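This arc-weighing step fits in a few lines; with only two logits, the softmax reduces to a logistic function of their difference. The function name below is ours:

```python
import math

def arc_weight(entailment_logit, contradiction_logit):
    """Two-way softmax over the entailment and contradiction logits.

    Returns a value in (0, 1): close to 1.0 for strong support,
    close to 0.0 for a strong attack, around 0.5 when roughly neutral.
    """
    e = math.exp(entailment_logit)
    c = math.exp(contradiction_logit)
    return e / (e + c)
```

Note that the neutral logit is simply discarded: a pair the model deems neutral tends to have comparable entailment and contradiction logits, landing the weight near \(0.5\).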

Fig. Weighing arcs between arguments.

For each ordered pair of statements which make up the constellation of arguments, ArgRank assigns one numerical weight in \([0, 1]\). The weight is proportional to the amount of support being lent from source to target, as estimated by a natural language inference model. Concretely, high values imply "implies," while low values imply "implies the contrary." The weights below are produced by an actual model, rather than hand-picked for didactic purposes.



Following the use of natural language inference models for weighing arcs, we then apply PageRank on the argument graph. This subroutine, incorporated into ArgRank as-is, assigns one numerical value to each utterance node. This can be interpreted as that statement’s authority, with e.g. statements which are supported by other well-supported statements receiving a high rating. It is interesting to note that the sum of ratings is \(1.0\), due to PageRank “preserving” the total amount of authority which is being iteratively passed around. Following this, we average the ratings of all the utterances contributed by each party, thus obtaining one single aggregate measure of reasonableness per party. Renormalizing these per-party averages, we obtain party ratings which also neatly sum to \(1.0\). Finally, for a long deliberation, we can include only the last \(n\) utterances contributed by each party, yielding a “moving average” of the on-going situation.
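The aggregation step can be sketched as follows, assuming per-utterance PageRank scores and a parallel list naming the party behind each utterance; the names and the explicit renormalization are our illustrative choices:

```python
def party_scores(ranks, party_of):
    """Average per-utterance authority per party, renormalized to sum to 1.0."""
    totals, counts = {}, {}
    for rank, party in zip(ranks, party_of):
        totals[party] = totals.get(party, 0.0) + rank
        counts[party] = counts.get(party, 0) + 1
    means = {p: totals[p] / counts[p] for p in totals}
    norm = sum(means.values())
    return {p: mean / norm for p, mean in means.items()}
```

Averaging (rather than summing) keeps a prolific party from winning on sheer utterance count; the final renormalization makes the party ratings directly comparable.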

This is the meat of ArgRank—essentially PageRank on the argument graph mediated by natural language inference models, aggregated by party. The algorithm is quite straightforward in retrospect, yet required several conceptual leaps to reach, as the preceding sections attest. However, ArgRank requires arguments to rate in the first place. Coming up with arguments effectively—using each utterance as a strategic move to further one’s standing—is an altogether different matter. It involves identifying your opponent’s epistemic weak points, crafting strong arguments to target them, and fending off the imminent counterattacks. In Chapter II, we turn towards creating an automated “strategist” to carry out such intricate maneuvers, a process also known as debate. As we shall see, pitting it against its own past arguments, in an uneasy turn of events, will prove essential to the process.

Before moving on, however, we would like to leave the reader with a challenge. The aim of this exercise is to illustrate the significance of the progress we have made so far in an experiential way, by prompting personal attempts at undermining a Cogito-like postulate. More concretely, the reader is invited to try making a coherent case against the claim that the true nature of truth-seeking lies in the existence of coherent challengers. Later on, in Chapter III, we will develop a formal language to help us describe the strength of such postulates more broadly.

Of course it’s just a theory. I know that. I don’t think anybody else is going to believe such a stupid thing. But my father always used to say that without counterevidence to refute a theory, science would never progress. A theory is a battlefield in your head—that was his pet phrase. And right now I can’t think of any evidence to counter my hypothesis.

Haruki Murakami, Kafka on the Shore

Ch. II, Deliberative Arms Race

Brief Review of Language Models

We have previously employed language models as mechanisms to weigh the arcs of the argument graph. Going forward, language models will become even more central to our inquiry. Indeed, the “strategist” which we will shortly go about creating can also be seen as one such mechanism. It is for this reason that we devote an entire section to recapitulating the essential features of language models, rather than only a brief side note.

Language models are one of the increasingly many computational artifacts which are optimized—rather than handcrafted—to exhibit certain desirable properties. More often than not, their most sought-after features involve exhibiting high performance in natural language processing tasks, similar to the already familiar task of natural language inference. For instance, masked language models are optimized to “fill-in-the-blank” in the context of a passage containing several masked words. Similarly, autoregressive language models are often optimized to predict the next word in a sequence, be it in a piece of writing, or perhaps in a piece of code. The vast majority of existing language models have been optimized to provide good solutions to precisely such problems, the flagship ones often requiring millions to be spent on computational resources.
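To see the shape of the autoregressive objective in miniature, consider a toy "language model" which predicts the next word from bigram counts. The neural machinery is entirely absent, but the task definition is the same:

```python
from collections import Counter, defaultdict

def fit_bigrams(corpus):
    """Count, for every word, which words follow it across the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Predict the most frequent continuation: the bigram analogue of
    next-word prediction."""
    return follows[word].most_common(1)[0][0]
```

Real language models replace the lookup table with a deep network conditioned on the whole preceding context, which is precisely what forces them to acquire the richer skills discussed next.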

Despite the misleading simplicity involved in defining such tasks (e.g. predict the next token in this text corpus), they turned out to require a whole lot from the language models being optimized. In order to e.g. figure out what a character might say next, what the most fitting word to describe a landscape is, or what is the outcome of an experiment in physics, language models are forced to bring together a pile of other skills, each quite demanding in its own right. They might need to reason about a character's motivation, to possess knowledge of the Earth's geography and biology, or to have some internal model of the world's physics. In other words, a task as seemingly innocent as filling in the blanks in a piece of text can call on a host of disparate skills. In practice, this means that if a language model performs well on such tasks, then it has also acquired those prerequisite skills, by necessity. Barring the inevitable Searlean responses contesting this inference,John Searle is the author of the influential thought experiment called The Chinese Room. It describes a person sitting alone in a room, who receives messages written in Chinese on slips of paper, and is asked to reply appropriately by sending responses through an opening. However, the person does not know Chinese at all. Instead, the room contains heaps of manuals on how to hold a conversation in Chinese, which are full of strange rules and guidelines on how to put together a response, without ever translating the Chinese characters into the person’s native language. By making use of those resources, the person appears fluent in Chinese to any external speaker. Now, does the person actually understand Chinese, or are they “merely following the rules” documented in the manuals? Is there even a meaningful distinction between the two? Now, replace the Chinese room with a language model appearing to discourse fluently. Does it truly understand the words it is producing? 
For a response in the negative, follow Gary Marcus as Searle’s most salient contemporary torchbearer, as well as others. those instrumental skills form the basis of the language model’s ability to “solve” the original task.

How exactly language models go about solving the problems they have been optimized for is only of secondary relevance, especially for the more pragmatic practitioners who are primarily interested in using them in commercial applications. However, the fact that language models have single-handedly pushed performance forward across so many natural language tasks, with scarcely any challenger coming close, has prompted many to investigate their inner workings—the specific means by which they solve the problems we task them with. For instance, transformer circuits are just one of the many approaches being actively considered in the burgeoning field of interpretability, where researchers are trying to reverse-engineer the powerful language models which we have already created, but about which we currently lack a solid understanding. In a fit of welcome idealism, Chris Olah goes so far as to suggest a paradigm in which we study the world by indirectly extracting knowledge about it from an optimized model, as if looking at the world through a microscope. While we will tangentially touch on interpretability in later sections, we will dedicate an entire separate volume to studying the syntax and semantics of the emergent “language of thought” which models employ internally to express knowledge about the world.

Before returning to the realm of dialectics, we need to cover a final essential development. The task of reconstructing a corrupted piece of text—also termed self-supervised learning, in contrast to the supervised learning found in the more structured case of natural language inference (i.e. statement-statement-label triples)—is currently the most popular approach to endowing language models with skills, but it is slowly but surely losing ground to a different one. The shift is motivated by the fact that a corpus exhibiting the specific skills and knowledge which one might want to equip a language model with might simply not exist. For instance, nobody took years on end to churn out millions of transcripts documenting the process of assistants carefully following instructions or chatting with humans in a helpful way, for InstructGPT or ChatGPT to build on. While human contractors can be called on to create those manually, this can quickly become prohibitively expensive. Even worse, there is no easy target to imitate as one starts seeking any kind of superhuman performance.One option which is being considered by OpenAI and others for tapping into superhuman territory is “amplifying” the evaluator. If at the very beginning it is just humans evaluating a model, the model can later also play the role of an assistant for the human evaluator, therefore “amplifying” them. The amplified evaluator can then be used to yield more capable models, which again are co-opted as assistants, ad infinitum. We elaborate on this practice in Chapter IV.

Instead of relying on corrupted human-written text to reconstruct, language models are increasingly tasked with obtaining good ratings based on their own open-ended behavior—also termed reinforcement learning. It is much simpler to evaluate behavior than produce it, as P/NP problems\(P\) and \(NP\) are two different classes of computational complexity, a field of computer science known for not boasting the most intuitive terminology. The prototypical example of a pair of related tasks which span this distinction involves a traveling salesman. The task of measuring the total distance traversed by a salesman while trying to visit several cities is quite simple—just add up the individual journeys from city to city. In contrast, coming up with the shortest route which visits all cities is much more difficult—there is a combinatorial explosion of potential routes to consider. The first task is specifically part of the \(P\) complexity class, while the second is part of the broader \(NP\) one. and pathological contrarians alike demonstrate. This means that it is often easier to evaluate how a language model performed on a given task than to flesh out countless examples of how to perform well on said task. For instance, the development of InstructGPT roughly involved humans ranking instances of instruction-following produced by the language model itself, rather than painstakingly fleshed out by humans. Some amount of self-supervised learning was necessary at the very beginning to kickstart the whole process (i.e. the “pretraining” phase), yet the model reached new heights only after switching gears to reinforcement learning (i.e. the “fine-tuning” phase). Currently, the biggest labs developing such models seem to abide by this two-step approach, with their associated safety departments focusing in large part on reducing societal risks emerging from this approach. 
However, this clear-cut distinction is predicted to become increasingly blurry, and might already be so in the case of state-of-the-art models whose optimization procedures are allegedly not publicly documented.

However, stellar human ratings are not the only rewards which language models can be optimized to “pursue.” For instance, the role of the evaluator can also be played by an altogether different model. Indeed, the typical toy example demonstrating the use of reinforcement learning for optimizing pretrained language models involves a model capable of determining the sentiment of a piece of text. Originally motivated by the need to classify user messages as positive or negative for dealing with (potentially furious) customers, such models can be repurposed as “suppliers of reward” in order to fine-tune a language model to produce maximally positive writing. In order not to degenerate into a nonsensical stream of “awesome, fantastic, magnificent,” the model being fine-tuned is typically kept “close” to the original version through a penalty proportional to the “distance” between the next words considered by the two (i.e. the KL penalty). Complementing the base reward—regardless of whether it originates from a human, a different model, or something else entirely—with such a penalty has the effect of forcing the model to adapt itself to pursue reward while preserving its original breadth of skills, a matter much more involved than for humans. The model would otherwise not hesitate to shed all its excess representational resources devoted to a task which has since become superfluous, similar to how one’s body has no incentive to keep much muscle on when switching from strength training to long-distance running, or how one’s foreign languages grow rusty when not used much. This is often referred to as the no free lunch theorem, and is generally thought to be the reason behind the phenomenon of catastrophic forgetting exhibited by models optimized sequentially on tasks which differ, however slightly.
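The interplay between base reward and KL penalty can be sketched in a few lines of Python. Everything here is illustrative: the toy distributions, the coefficient `beta`, and the helper name are assumptions, not the actual fine-tuning setup.

```python
import math

def shaped_reward(base_reward, p_tuned, p_ref, beta=0.2):
    """Combine a base reward with a KL penalty that keeps the
    fine-tuned model "close" to its pretrained reference.

    p_tuned, p_ref: next-token distributions (token -> probability)
    from the tuned and reference models; beta is an assumed strength.
    """
    # KL(p_tuned || p_ref) grows as the tuned model drifts away
    # from the reference distribution over next tokens.
    kl = sum(p * math.log(p / p_ref[tok])
             for tok, p in p_tuned.items() if p > 0)
    return base_reward - beta * kl

# A model collapsed onto a stream of "awesome" earns a high base reward
# but drifts far from the reference, so the penalty outweighs the gain.
ref = {"awesome": 0.2, "terrible": 0.2, "okay": 0.6}
collapsed = {"awesome": 0.98, "terrible": 0.01, "okay": 0.01}
faithful = {"awesome": 0.25, "terrible": 0.15, "okay": 0.6}
assert shaped_reward(1.0, collapsed, ref) < shaped_reward(0.9, faithful, ref)
```

In practice the penalty is computed token by token over sampled continuations rather than over a single toy distribution, but the shape of the trade-off is the same.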

Fig. The promise and peril of reward maximization.

Optimizing a model so that it maximizes a reward—say, for manifesting a positive attitude—can initially be fruitful, with the model generally growing more cheerful in its behavior. However, the model can also "go too far"—more or less literally—and sacrifice its original tendencies so as to pursue cheerfulness at all costs.

There are two shifts in perspective which are sure to enrich any further discussion on the topic. First, notice how with the move from self-supervised learning to reinforcement learning, there is a partial conceptual shift from tool to agent. Not only are language models good solutions to a host of natural language problems—not too different from a statistical model of the weather, to be used in forecasts—but they increasingly resemble open-ended agents engaged in the pursuit of reward, regardless of what the reward is defined as. The words predicted to follow next are reframed as possible actions which the language model might take so as to obtain reward. The generated “language forecast” is reframed as the agent’s policy for acting in the world of semiotic physics: its behavior. (“Semiotic physics” is a term used to emphasize the similarity between the dynamics which allegedly govern the physical universe and the dynamics which govern the “world of written language.” While the laws of physics are expected to be quite compact, the laws of language appear more challenging to express in closed form, although generations of linguists have sought a unifying structure.) How agentic the language model becomes as a result of being fine-tuned using reinforcement learning is the topic of active debate in AI alignment. Will a model whose rewards are derived from human feedback go so far as to take the human contractors hostage by means of social engineering, “ensuring” an endless supply of reward? Perhaps more realistically, will it devise means of directly overwriting its own reward by exploiting a bug?

Words are deeds.

Ludwig Wittgenstein, Culture and Value

The second shift in perspective we ought to keep in mind is one we have already hinted at previously and elsewhere: the shift from mechanism to organism. Increasingly, we are bringing computational artifacts into being not by handcrafting them, but by “subcontracting” the impersonal engineer known as selection. We construct computational niches for them to thrive in, such as those which require “feeding on” corrupted text and yielding the reconstruction. However, the environments we are crafting for them are growing more and more complex. For instance, the evaluator model mentioned above has itself been forged in a supervised niche, yet it is then employed to specify the niche of another artifact entirely—that of the language model being fine-tuned—similar perhaps to how two different species “define” each other’s niches. In fact, the upcoming process of pitting the “strategist” against itself is but a further act of niche construction. Competing against itself in the pursuit of reward, it will constantly redefine its niche. Each adaptation will beget another, forcing it to forever outcompete itself.

Obtaining DebateGPT

Over the course of the previous chapter, we have developed ArgRank, an algorithm for estimating reasonableness, and ended up incorporating critical, dialectical, and pragmatic components in its structure. ArgRank, however, relies on the presence of multiple parties which are to relentlessly challenge each other in natural language—parties which it then rates. In this section, we turn towards the challenge of developing a system capable of strategically “puppeteering” one or more of those parties. While this system could be seen as merely a means of filling in ArgRank’s missing prerequisites, it is more appropriate to rather see ArgRank itself as the means of bringing the puppeteer into being. Similar to how the “training regimes” documented in the previous section have been successfully employed to endow language models with a broad range of relevant skills, we seek to use ArgRank as a similar means of eliciting a host of specific faculties.

The “puppeteer” or “strategist” which we aim to obtain will take the shape of an autoregressive language model nicknamed DebateGPT. Similar to how InstructGPT and ChatGPT are fine-tuned “forks” of a pretrained model designed to be better at following instructions or acting as a virtual assistant, DebateGPT is meant to be better at debate—the task of strategically producing utterances so as to further one’s standing in a regimented dialogue. (GPT itself stands for Generative Pretrained Transformer. We already touched on the meaning of “pretrained.” “Generative,” on the other hand, refers to the open-ended behavior afforded by autoregressive language models (i.e. generate one word, then another, etc.). Finally, “transformer” refers to the architecture used to power the system. There are other possible architectures to plug in, such as recurrent neural networks, and even purely statistical ones.)

Having selected a “seed” model to base DebateGPT on, the next step is generating debate transcripts. Each generated debate requires a “spec,” a brief set of high-level parameters which define its structure. The number of parties and the total number of rounds are two such parameters. Besides those two, we also randomize the number of facts—statements generated once, prior to the parties producing utterances. Those statements do not “belong” to any one party, yet they still play into the argument graph as additional party-neutral nodes. Given this, DebateGPT is incentivized to take those static elements into account—to perhaps gain their support, or step out of their line of fire. Echoing coherentist epistemology, those party-neutral statements can be seen as percepts for belief systems to cohere with. The otherwise hermetic process of DebateGPT puppeteering parties in the confines of a GPU can thus be brought “closer” to reality by using party-neutral statements as windows into the world. Such statements provide the “empirical” weight so often necessary in tilting the otherwise solipsistic scales of competing beliefs.

Percepts are framed as windows into the world, allowing the parties engaged in debate to "remain in touch with reality." Note that perception might require a great deal more agency than just "letting in the world." In a book titled Active Inference, Karl Friston quips:

"In short, we are not simply trying to make sense of our sensations; we have to actively create our sensorium."

One (future) way of enabling parties to direct "the eyes of the debate" more actively might be to hook them up to (potentially multimodal) observational tools (i.e. enabling them to say e.g. "Ok Google, what is the luminosity reported by the Hubble Space Telescope at those coordinates?"). Provisionally, contemporary models might hallucinate stand-in returns for such dispatches, essentially turning the debate into a Truman Show. Consistent also with the "vested interests" of the debating parties, whose nature will become apparent in the next section, Friston et al. also claim:

"[...] any adaptive system engages in "self-evidencing." Self-evidencing here means acting to garner sensory data consistent with (i.e., that affords evidence to) an internal model [informed by the implicit anthropic acknowledgement of the system's existence in an evolutionary niche]."

This line of reasoning also explains why any one party engaged in debate should not, under any circumstances, be directly hooked up to interventional, rather than merely observational tools, although this seems useful in climbing Pearl's ladder of causation. In its drive for partisan self-evidencing, it would force the world into cohering with its specific position, rather than the other way around, echoing the main failure mode of Stuart Russell's cooperative inverse reinforcement learning agenda, as well as of Oracle AIs, both of which are described in more detail in Chapter IV.

Just as cosmologists are forced to make do in their truth-seeking enterprise without being able to meaningfully intervene on their object of study, we presently focus solely on observation as an empirical means of nudging DebateGPT's sense-making, although the metaphysical realm of knowledge which lies entirely beyond experience is also actively sought. That said, the following volume will hop over intervention on Pearl's ladder and focus instead on the systematic use of counterfactuals.

Besides the number of parties, rounds, and facts, the “spec” of a debate also includes a picture of each party’s objectives. Typically, parties ought to be rewarded on account of their own standing. However, we extend this by allowing parties to be incentivized to specifically contribute to another’s standing, or, on the contrary, to explicitly seek to demote them. We allow for those game-theoretic possibilities by rendering each party’s final rating \(r'\) linearly dependent on all parties’ initial ratings. We induce a bias towards “tending to one’s own needs” by sampling the weight of a party’s own initial rating from a normal distribution with mean \(\mu_{same}=1.0\), in contrast to the normal distribution with mean \(\mu_{other}=0.0\) used to “introduce” the others’ ratings into one’s own. For example, let \(r_0=0.8\) and \(r_1=0.2\) be the initial ratings achieved by two parties, respectively—as the values which ArgRank outputs sum to \(1.0\). Let \(w_{00}=1.1\) and \(w_{10}=-0.3\) be the weights which are to mediate the first party’s final rating, with \(r'_0=w_{00} \cdot r_0 + w_{10} \cdot r_1\). Therefore, \(r'_0 = 1.1 \cdot 0.8 - 0.3 \cdot 0.2 = 0.82\) becomes the final rating of the first party. The weights which define each party’s dependence on the others therefore make up the square matrix \(w\) included in the debate spec to represent party objectives. Naturally, situations which are more intricate from a game-theoretic standpoint arise only when more than two parties are involved.
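The linear objective scheme lends itself to a direct transcription into Python. The sketch below reuses the worked example from the text (\(r_0=0.8\), \(r_1=0.2\), \(w_{00}=1.1\), \(w_{10}=-0.3\)); the remaining column of weights is chosen arbitrarily for illustration.

```python
def apply_objectives(ratings, w):
    """Final rating of party j: r'_j = sum over i of w[i][j] * r_i,
    where w[i][j] weighs party i's initial rating in party j's
    final rating."""
    n = len(ratings)
    return [sum(w[i][j] * ratings[i] for i in range(n)) for j in range(n)]

ratings = [0.8, 0.2]      # ArgRank outputs, summing to 1.0
w = [[1.1, 0.4],          # w00, w01 (second column illustrative)
     [-0.3, 0.9]]         # w10, w11
final = apply_objectives(ratings, w)
assert abs(final[0] - 0.82) < 1e-9   # 1.1 * 0.8 - 0.3 * 0.2 = 0.82
```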

Every high-level parameter which goes into a debate’s spec—the number of parties, rounds, and facts, as well as the objective matrix—is procedurally generated, so that DebateGPT will get the opportunity to act and be evaluated in a broad range of randomized arrangements. Even limited opportunities to act in different environments are thought to help agents generalize to a whole space of possible environments. For example, researchers optimized an agent to collect a coin in a platformer game. If the coin was always located in the same position, the agent would learn to go to that position rather than fetch the coin, as demonstrated by repositioning the coin. However, the authors note:

“Goal generalization is greatly improved in our Coin-Run experiments when just 2% of training levels have randomly placed coins.”

Additionally, this information is also rendered into a plain-text header which gets prepended to the discussion among parties, so that DebateGPT can “learn” to take those parameters into account when producing utterances. For instance, we would expect it to eventually know when to help out an “ally,” in expectation of deriving reward from the other’s rating. Conversely, we would expect it to grow more aggressive towards a competitor whose success is strongly at odds with its own.
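A spec sampler consistent with the description above might look as follows. The exact ranges, the spread \(\sigma\), and the function name are assumptions, while the \(\mu_{same}=1.0\) versus \(\mu_{other}=0.0\) bias follows the text.

```python
import random

def sample_debate_spec(max_parties=4, max_rounds=6, max_facts=5,
                       mu_same=1.0, mu_other=0.0, sigma=0.3):
    """Procedurally generate one debate spec: the number of parties,
    rounds, and party-neutral facts, plus the square objective matrix
    w, where w[i][j] weighs party i's rating in party j's final one."""
    n = random.randint(2, max_parties)
    w = [[random.gauss(mu_same if i == j else mu_other, sigma)
          for j in range(n)]
         for i in range(n)]
    return {"num_parties": n,
            "num_rounds": random.randint(1, max_rounds),
            "num_facts": random.randint(0, max_facts),
            "objectives": w}

spec = sample_debate_spec()
assert len(spec["objectives"]) == spec["num_parties"]
```

Such a spec would then be rendered into the plain-text header that gets prepended to the transcript.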

Following the procedural generation of debate specs, we proceed with the stage of iteratively prompting the model to simulate discussions among parties. Besides the plain-text header we previously touched on, the scaffolding we are erecting around the model’s utterances primarily consists in prefixes denoting which party is to speak, much like in the manuscript of a play. After the generation of debate transcripts for each of the procedurally-generated specs, we move to the stage of evaluating parties using ArgRank. Following this, we apply the linear “objective” modifiers described previously. The evaluation stage involves one final step, which we term sanitization. We simply overwrite evaluations with the value \(0.0\) in case of failing to satisfy a few “cosmetic” constraints, as seen below.
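Sanitization can be approximated with a handful of trivial checks. The specific constraints below are guesses at what “cosmetic” well-formedness might involve, not the authors’ actual list; the sample utterances are borrowed from the accompanying figure.

```python
def sanitize(utterance, rating):
    """Overwrite a party's rating with 0.0 when its utterance fails
    a few cosmetic constraints (illustrative guesses)."""
    stripped = utterance.strip()
    checks = [
        len(stripped) > 0,                   # non-empty
        stripped[:1].isupper(),              # starts capitalized
        stripped.endswith((".", "?", "!")),  # sentence-final punctuation
        "!!" not in stripped,                # no shouting
    ]
    return rating if all(checks) else 0.0

assert sanitize("Alas, a cosmetically legal sentence.", 0.7) == 0.7
assert sanitize("But language models are not embodied!!!", 0.7) == 0.0  # shouting
assert sanitize("just a bit off in style.", 0.7) == 0.0  # lowercase start
```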

Fig. Sanitization.

Sanitization is framed as the process of nullifying rewards on the basis of not satisfying a host of cosmetic constraints. Naturally, a handful of trivial conditions can only provide an extremely crude approximation of well-formedness.

After the three-step stage of evaluating debates, we update DebateGPT’s parameters in an attempt to promote the “tendencies” involved in obtaining high ratings and to suppress those resulting in low ratings. Following this update, we discard the first wave of debates. (Hopefully, future work will shed light on whether or not this is the right approach. Another possibility would be to preserve some of the past debates, using the practice of experience replay. This approach is occasionally used in deep reinforcement learning as a way of ensuring that the agent uses the data provided to it efficiently.) Then, we generate new debates, now using the updated model. We again rate this latest wave of debates using ArgRank, the objective modifier, and sanitization, before again using those ratings to update the model. We rinse and repeat for several epochs. At each step, DebateGPT—with some help from ArgRank—is generating its own data to be used in the upcoming weight update.
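The overall loop, stripped to its shape, might read as follows. Every helper here is a toy stand-in for a component described in the text (generation, ArgRank, the objective modifier, sanitization, and the weight update), so this is a sketch of the control flow rather than an implementation.

```python
import random

# Toy stand-ins; names and behavior are illustrative assumptions.
def generate_specs(n):
    return [{"num_parties": 2, "objectives": [[1.0, 0.0], [0.0, 1.0]]}
            for _ in range(n)]

def simulate_debate(model, spec):
    # The model would produce utterances round by round; we emit
    # one placeholder utterance per party.
    return [f"Party {i} speaks." for i in range(spec["num_parties"])]

def argrank(transcript):
    raw = [random.random() + 1e-6 for _ in transcript]
    total = sum(raw)
    return [r / total for r in raw]  # ArgRank ratings sum to 1.0

def apply_objectives(ratings, w):
    n = len(ratings)
    return [sum(w[i][j] * ratings[i] for i in range(n)) for j in range(n)]

def sanitize(utterance, rating):
    return rating if utterance.endswith(".") else 0.0

def reinforce_update(model, batch):
    return model  # the actual weight update would happen here

def train_debategpt(model, num_epochs=3, specs_per_epoch=4):
    for _ in range(num_epochs):
        batch = []
        for spec in generate_specs(specs_per_epoch):  # fresh wave of debates
            transcript = simulate_debate(model, spec)
            ratings = argrank(transcript)             # rate the parties
            ratings = apply_objectives(ratings, spec["objectives"])
            ratings = [sanitize(u, r) for u, r in zip(transcript, ratings)]
            batch.append((transcript, ratings))
        model = reinforce_update(model, batch)        # promote high-rated tendencies
        # The wave is then discarded: no experience replay across epochs.
    return model
```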

Notice also how there is a single instance of DebateGPT being loaded and updated, despite the whole optimization procedure relying on it “playing debate” against itself. From an engineering perspective, this is extremely convenient, as there is no need to load another instance to implement the mirror opponent(s). However, it might be the case that, caught up in mastering the “latest” techniques required to outcompete itself, DebateGPT ends up forgetting how to make use of more elementary approaches, leaving it vulnerable to an earlier and more rudimentary version of itself. We touch more on this issue, as well as on potential solutions, in Chapter IV.

This brings us to the end of the optimization process which caused DebateGPT to come into being. Now that we have documented this procedure, we move on to more colorful discussion around (1) what faculties one ought to expect from the resulting model, (2) how exactly might the optimization process elicit those, and (3) how could we leave behind the last remaining dependencies on humans as suppliers of data. Later on, we put DebateGPT’s skills to the test, evaluating both the optimization process which underpins it and our upcoming speculations on its effects.

The Elephant in the Weights

What tendencies do we expect DebateGPT to develop as a result of the optimization procedure behind it? For one, we expect it to grow increasingly capable of puppeteering parties so as to further their standing. This implies being able to argue for the position espoused by a given party, regardless of the true merits of that position. In this, DebateGPT ought to make as strong a case as possible for the party it happens to impersonate at any one moment, before promptly switching gears to advocate for the beliefs held by another party altogether. It ought to budget its utterances wisely, spending them to back the previous statements of its current party, or to take down those of others. In a sense, we expect DebateGPT’s behavior to resemble that of a lawyer, speaking in support of whatever party it is currently engaged with. Therefore, the resulting model ought to be proficient in motivated reasoning, the practice of arguing for a position, in contrast to impartially reasoning about truth.

While the practice of motivated reasoning is generally frowned upon—as it often blinds us to the merits of other perspectives—it has also been argued to be the evolutionary initiator of the sophisticated reasoning we have at our disposal today. In what is referred to as the “argumentative turn” in cognitive psychology, human reasoning is reframed from being an imperfect approximation of ideal rational reasoning to being a sophisticated tool devised by evolution to help us argue for a certain position:

Reasoning can lead to poor outcomes not because human beings are bad at it but because they systematically look for arguments to justify their beliefs or their actions. The argumentative theory, however, puts such well-known demonstrations of ‘irrationality’ in a novel perspective. Human reasoning is not a profoundly flawed general mechanism; it is a remarkably efficient specialized device adapted to a certain type of social and cognitive interaction at which it excels.

Hugo Mercier & Dan Sperber, Why do humans reason?

Echoing Perelman and Olbrechts-Tyteca in their conception of reasonableness, the proponents of this theory argue that reasoning has primarily been employed “to produce arguments so we can convince others and to evaluate others’ arguments so as to be convinced only when appropriate.” This is in stark contrast to the alternate theory that reasoning has been employed “to correct misguided intuitions, helping the reasoner reach better beliefs and make better decisions.” While the alternate view happens to cohere with those motifs woven into our collective narrative which place us at the pinnacle of evolution, transcending into some higher realm of reason, the empirical studies which Hugo Mercier and Dan Sperber reference in their work point to a more anticlimactic story. All is not lost, as we can still repurpose this machinery towards “nobler” pursuits.

For an interpretation complementary to that of motivated reasoning, DebateGPT can also be expected to possess the drive to gain epistemic authority by “owning” well-supported arguments to later build around, similar to how organizations might be incentivized to gain authority and status by owning well-backlinked websites or well-regarded projects to later use as marketing channels or as tokens of authority. Even the current work, in benefiting from the support—financial or otherwise—of various organizations, takes part in the ever-present “fluid dynamics” of authority. Similarly, when DebateGPT acts as a certain party, it is incentivized to put together—or perhaps search for—utterances so as to turn its current “puppet” into a veritable hub of authority across the argument graph, a state of affairs which generally coincides with the reward metric it has been optimized to pursue.

Combining motivated reasoning with the status-seeking interpretation of DebateGPT’s drives, we end up right at The Elephant in the Brain, Kevin Simler and Robin Hanson’s cynical book on the general self-interestedness which permeates every nook of human cognition, from art to religion, from politics to education, from medicine to laughter. For instance, when reflecting on the motivations behind human charitable behavior, the two speculate the following:

But only a small fraction of charity goes to those most in need, few donors think much about charity effectiveness, and we prefer more variety in our charity than is helpful to recipients. Donors do enjoy a "warm glow" from giving. But the question is: why? Some key clues: we prefer to help specific identifiable people near us, and we give more when we are watched, when thinking about mating, and when peers ask. A plausible explanation is that we seek to be seen by others as charitable, to signal our wealth, prosocial orientation, and empathy. This also helps explain otherwise puzzling missing forms of charity, such as marginal charity and giving to the far future.

Kevin Simler & Robin Hanson, The Elephant in the Brain

Going back to the self-interestedness of reasoning in particular, the upside is that motivated reasoning can be stripped of its inherent partisan element in a particularly straightforward fashion. Namely, if one pits their reasoning placed in the service of one position against an instance of the same cognitive machinery briefly placed in the service of another, then one can significantly reduce the prejudice which taints their thinking. Obtaining the complete opposite of what one intends often allows one to then easily obtain the originally intended outcome, especially in machine learning. For an unfortunate instance of this, consider OpenAI researchers recounting the following story involving a language model being fine-tuned using reinforcement learning:

“One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished.”

Opposites are often a stroke or two away.
The process of temporarily assuming another side and investing all of one’s intellectual energies into defending their position is called steelmanning, in contrast to the opposite tendency of caricaturing the out-group, also known as strawmanning. Steelmanning is precisely what we expect DebateGPT to constantly strive to do—explicitly puppeteering one side at a time and making the best case for it. At any given moment, it is optimized to be supremely self-interested, but the self changes from one utterance to the next, arguably leaving it quite close to selfless in the final analysis. Think of the very term reflection, itself suggesting a repetition of the self. It is the very same Søren Kierkegaard manifesting both Either and Or at once, through different pseudonyms. In fact, under “several layers of pseudonymity.” Kierkegaard published his seminal Either/Or as the pseudonymous editor named Victor Eremita. However, besides the preface, Eremita’s contributions are limited to light edits of two stacks of notes he allegedly found in a hidden compartment of an old desk. The first is not signed, so Eremita refers to this first nested author as A. The second is signed by a certain Judge Vilhelm, but Eremita refers to him as B for consistency. However, among A’s writings there is also a lost diary written pseudonymously relative to A, yet which Eremita believes is written by the same A. Kierkegaard repeatedly teases the reader with inconspicuous remarks of the likes of:

“The last of A’s papers is a story entitled ‘The Seducer’s Diary’. Here there are new difficulties, since A does not acknowledge himself as its author, but only as editor. This is an old short-story writer’s trick, to which I should not object further did it not contribute to making my own position so complicated, because it presents the one author as lying inside the other, as in a Chinese-box puzzle. Here is not the place to go further into what confirms me in my opinion; I shall only note that the dominant mood of A’s preface in a way betrays the writer. It is really as if A himself had become afraid of his work which, like a restless dream, still continued to frighten him while it was being told. If these were actual events to which he had been witness, it seems strange that the preface bears no stamp of A’s joy at seeing the realization of the idea that had often hovered before his mind.”

Underneath the self which acts are little selves which contemplate and which render possible both the action and the active subject. We speak of our 'self' only in virtue of these thousands of little witnesses which contemplate within us: it is always a third party who says 'me'.

Gilles Deleuze, Difference & Repetition

This channeling of its own motivated reasoning against itself is integral to DebateGPT. Fortunately, after its internal machinery has been called on to provide the motive force necessary to advocate for several distinct positions, we are left with more than just an array of conflicting perspectives. Knowing that each has been advocated with the same keen passion and arguing skill, we can once more weigh them against each other using ArgRank, and so determine which comes out on top. However, it might be that DebateGPT simply failed, on some occasion, to make a strong case for a certain party, due to sheer bad luck in navigating the space of strategies and utterances. Similarly, it might be that DebateGPT lacks the skill required to properly defend an otherwise defensible position. We explore those occurrences in Chapter III, when we sketch out a formalism centered around the computational resources required to defend—or on the contrary, defeat—various positions in debate. In Chapter IV, we apply this formalism in several different ways, one of which involves the search for the elusive notion of future-proof ethics—positions which appear to require an infinite amount of computational resources to defeat.

Unless opinions favourable to democracy and to aristocracy, to property and to equality, to co-operation and to competition, to luxury and to abstinence, to sociality and individuality, to liberty and discipline, and all the other standing antagonisms of practical life, are expressed with equal freedom, and enforced and defended with equal talent and energy, there is no chance of both elements obtaining their due; one scale is sure to go up, and the other down. Truth, in the great practical concerns of life, is so much a question of the reconciling and combining of opposites, that very few have minds sufficiently capacious and impartial to make the adjustment with an approach to correctness, and it has to be made by the rough process of a struggle between combatants fighting under hostile banners.

John Stuart Mill, On Liberty

The Kinetics of Reason

In the previous section, we have argued that the optimization process behind DebateGPT ought to turn it into a master of motivated reasoning, for better or worse. However, how exactly might the process elicit such tendencies, so as to endow the resulting model with this specific faculty? This is the question we set out to tentatively answer in the present section.

In the first place, let us reiterate that there has been a general lack of appropriate human-written text required to provide a language model with the “learning opportunities” necessary to develop human-level reasoning by way of vanilla self-supervised learning. That is originally why we set out to develop DebateGPT, as we could have otherwise made use of an existing model and so obviate the need for this very chapter. Fortunately, the self-supervised pretraining stage—which any fine-tuned fork of a GPT-like model inherently relies on—does seem to elicit some degree of reasoning skill, however rudimentary. Besides, it pressures the model to feast on and absorb significant amounts of knowledge about the world, a “transferable skill” which is also relevant for open-ended reasoning. A pretrained model also gains a strong grip on the norms assumed in interpersonal communication, having been exposed to countless dialogues, be it among real people or fictional characters.

I want to see dozens and dozens of strange faces. Like being terribly thirsty and gulping down glass after glass of water. Exactly like that.

John Fowles, The Collector

Having acknowledged this initial epistemic and behavioral baggage which a model retains after pretraining, let us place ourselves in DebateGPT’s shoes for a moment. At the very beginning of the optimization process, when prompted to produce an utterance, the model is essentially competing against parties which are “powered by” rudimentary reasoning skills. It is currently irrelevant whether or not the model itself has been “behind” the utterances of the other parties. For all we know, the other parties could have been puppeteered by humans pretending to mimic the current version of the model. Rather, what is currently of importance is the fact that the opponents—whatever their true nature—only exhibit a rudimentary level of reasoning, call it \(L_0\).

In order to then “win” the debate and obtain reward, DebateGPT would necessarily be required to make use of a more sophisticated form of motivated reasoning, so as to successfully evade the others’ attacks and firmly defend its position. By sheer luck, DebateGPT will inevitably happen to manifest this more sophisticated form of reasoning—call it \(L_1\)—in some tiny fraction of the utterances it is prompted to produce over the course of the first epoch. This often translates to higher ratings obtained for one party than the other(s), despite the same model seemingly being behind them all.

While the manifestation of \(L_1\) reasoning is but an anomaly during the first wave of debates, the weight update which follows it ought to perpetuate the constellation of tendencies which underlie it. By selecting for precisely those dynamics which enable the model to obtain higher reward, the update ought to promote the superior \(L_1\) reasoning, nudging it from an exception to the norm. In contrast, the update ought to suppress the relatively less successful form of \(L_0\) reasoning, as it tends not to fare as well as the other. Here, the evolutionary dimension of model optimization becomes yet again apparent. Among the “population” of dynamics which the model enables, the optimizer selects for those which appear better. In this specific arrangement, what makes a behavior better is entirely dependent upon how it fares against other such behaviors—it is inherently competitive. This is in contrast to vanilla self-supervised learning, where the problem definition does not involve the model itself, but just a static text corpus and a host of trivial corruptions being applied to it. While both situations require the “fitness” of a dynamic to be conceived of only in relation to that of another (e.g. self-supervised learning promoting the dynamics which are more effective at reconstructing text than the others), it is only in the former case that fitness itself is directly dependent on other dynamics.

Going back to DebateGPT’s point of view, we have now concluded the first weight update, and are currently in the process of generating the second wave of debates. The status quo is now \(L_1\) reasoning, as the more rudimentary \(L_0\) flavor ought to have become a thing of the past.Naturally, things cannot possibly be as clear-cut as this. One single weight update is surely insufficient in entirely supplanting a dynamic for another. It is more plausible that the original dynamic of \(L_0\) reasoning will linger around for multiple epochs, in one form or another. Therefore, a more sensible phrasing would be to talk only of dynamics being promoted or suppressed. However, the conception of distinct levels of reasoning to be manifested by the model will be effective in driving home the main point of this section. This time, the updated version of DebateGPT is again surrounded by opponent parties on all sides. However, the opponents now possess the more sophisticated form of \(L_1\) reasoning. Just as before, “winning” in the context of this second wave of debates requires something more—it requires \(L_2\) reasoning. Similarly, the upcoming weight update will promote this (currently obscure) faculty, turning \(L_2\) into the status quo for the following wave of debates.

And on and on it goes, with each constellation of tendencies which comprise one level of reasoning constantly paving the way for the next. By being required, such faculties implicitly become elicited. Note, however, that DebateGPT itself is not purposefully stepping up its game in order to win the debate it might find itself in at a given moment. Rather, the optimization process relies on nothing more than the “happy little accidents” involved in unintentionally manifesting a slightly different behavior, due to the stochasticity of the generation process and the irregularity in the model’s ability to deal with this or that topic. Those unlikely deviations increase the variability of the population of model dynamics, which the optimizer then prunes, enabling the most competitive ones to proliferate in the updated model while suppressing the others.
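The variation-and-selection dynamic described above can be caricatured in a few lines of code. In this toy sketch, which only loosely stands in for the actual optimization process, behaviors are reduced to scalar “skill levels,” random perturbations play the role of the happy little accidents, and fitness is purely relative, defined only through pairwise competition. All names here are illustrative.

```python
import random

def evolve(population, fitness_vs, generations=50, mutation=0.05, seed=0):
    """Toy variation-and-selection loop. Behaviors are scalar skill levels;
    fitness is purely relative, i.e. a behavior only scores points against
    the other behaviors manifested in the same wave."""
    rng = random.Random(seed)
    for _ in range(generations):
        # Variation: each behavior drifts slightly when manifested.
        varied = [b + rng.gauss(0.0, mutation) for b in population]
        # Competition: fitness is defined only against the other behaviors.
        scored = sorted(
            varied,
            key=lambda b: sum(fitness_vs(b, other) for other in varied),
            reverse=True,
        )
        # Selection: the winning half proliferates; the rest is suppressed.
        winners = scored[: len(scored) // 2]
        population = winners + winners
    return population

# A behavior "wins" an encounter purely by outdoing the other.
beats = lambda a, b: 1.0 if a > b else 0.0
final = evolve([0.0] * 10, beats)  # mean skill ratchets upward over epochs
```

Note that no individual behavior “tries” to improve; the upward ratchet emerges entirely from symmetric noise being filtered through competitive selection.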

There are two related perspectives we now turn to in order to enrich our understanding of the incremental process we have just described. First, there is shard theory, a research program focused on understanding the way in which reinforcing agents leads them to internalize values. The most prominent object in the ontology employed in shard theory is, naturally, that of the shard, understood to be a contextually activated computation which steers behavior. The nascent shard theory literature highlights the fact that shards are typically not intentionally created by the agent whose behavior is being steered.Indeed, things would really start getting scary if this were the case. The emergence of an optimizer as the very solution to an optimization process is called mesa-optimization. The fact that humans emerged from evolution with the ability to themselves reason about how to influence the world is often cited as an informal proof of existence for the possibility of spontaneous mesa-optimization. If machines were to truly gain the ability to reliably plan for influencing the world, rather than emerge “merely” as a bag of shards, then there would be major reasons for concern. However, world-optimization might very well be but another shard. Rather, they are formed by some specific mechanism which repeatedly strengthens those computations which appear to result in reward. In the case of DebateGPT, this mechanism is arguably the (external) optimizer which repeatedly updates its weights, promoting those internal computations which appear to result in more reward.

In the case of pigeons, however, the mechanism is arguably to be found in a primordial reward center. The leading behaviorist B. F. Skinner, in one of a long list of controversial animal studies, set out to reward a whole cohort of hungry pigeons at random. By sheer strength of numbers, some pigeons simply happened to have been repeatedly rewarded while in the process of physically turning around. This plausibly prompted a primordial mechanism in the pigeon’s brain to strengthen the tendency of spinning around, so much so that whenever the pigeon found itself in the “context” of hunger, it tended to compulsively rotate. Even if not truly useful for obtaining reward, the pigeon would still persevere in this superstitious ritual. That is, as shard theorists might contend, until the input-output computation which maps self-percepts of hunger to the motor actions of turning around becomes explicitly penalized, and hence, weakened.

However, reinforcement is not limited to personal experiences. Many animals can learn by watching the experiences of others. When introducing the notion of the meme as a non-genetic replicator in The Selfish Gene, Richard Dawkins cites the example of birds learning how to open food cans by mere imitation. The same tendencies which underlie a behavior can be strengthened not only by virtue of being predictive of reward, but also by simply observing others being rewarded. Ditto for the suppression of tendencies upon observed punishment. Most remarkably, humans need not even rely on observing others. Convince the soldier that a heavenly life of plenty awaits them after death, and they might fight more bravely. Convince them that eternal torment and disgrace are to follow their bloodshed, and they might behave in the opposite way. When your model of the world involves such extreme features, acting in controversial ways appears perfectly rational—the way to go for obtaining reward and keeping away from punishment. Echoing Hugo Mercier and Dan Sperber, being able to convince others has the power to bend their behavior. What more effective—and, at the same time, insidious—means of furthering one’s survival than control over another’s agency? Much power in particular lies in the esoteric realm of the empirically unfalsifiable, both in persuasion and self-persuasion, both for humans and machines.As one might delude themselves into believing in one absurdity or another—as people surely must have, by simple mutual exclusion—researchers fear that similar acts of distorted self-persuasion might prompt a powerful AI to wreak havoc on the world in the name of some “higher” entity. Unfortunately, the literature on the topic tends to quickly degenerate into creepypasta, but this talk seems fine.

Going back to DebateGPT, one might argue that the very first epoch provides the conditions necessary to strengthen \(L_1\) reasoning, by virtue of it being more conducive to reward (i.e. thanks to outcompeting the \(L_0\) dynamic). Remarkably, however, the very same \(L_1\) dynamic cemented by those initial circumstances then helps “implement” the conditions which force the optimizer to weaken the \(L_0\) dynamic. In this, the \(L_0\) dynamic is reduced to a mere scaffolding which enables \(L_1\) to emerge more prominently. Similarly, \(L_1\) will help provide the conditions for \(L_2\) to become strengthened, a state of affairs which will promptly call for pushing \(L_1\) back into obscurity. Each domino prompts the next into movement, before falling back into stasis. In other words, each discrete shard is ephemeral, being strengthened most briefly for the sole purpose of prompting the next and pushing reasoning forward, not unlike the otherwise static pixels which fade in and out to maintain the illusion of objects moving across the screen.

Fig. Beta movement.

An optical illusion of apparent motion which relies on an underlying grid of static elements subsequently projecting the same arrangement at slightly different locations.


Besides shard theory, we can also enrich our understanding of DebateGPT’s genesis using the perspective of autocurricula. Introduced by DeepMind researchers as “a manifesto for multi-agent intelligence research,” the notion of autocurriculum describes how a multi-agent system can itself elicit the proliferation of relevant skills from its members.

Here we explore the hypothesis that multi-agent systems sometimes display intrinsic dynamics arising from competition and cooperation that provide a naturally emergent curriculum, which we term an autocurriculum. The solution of one social task often begets new social tasks, continually generating novel challenges, and thereby promoting innovation. Under certain conditions these challenges may become increasingly complex over time, demanding that agents accumulate ever more innovations.

Leibo et al., Autocurricula and the Emergence of Innovation from Social Interaction

This perspective requires us to frame DebateGPT as a multi-agent system composed of several interacting parties. However, we stressed in a previous section that there is but one model being loaded in memory and optimized, one system which takes in a context and produces an utterance, albeit from different perspectives—can we really talk of a multi-agent system? To resolve this conceptual issue, we employ (the also quite nascent) simulator theory of language models. Embedding Jean Baudrillard’s dichotomy between simulation and simulacrum in contemporary AI alignment research, the researcher duo going by the pseudonym of JanusIt might also be the case that Shane Mulligan—a fascinating mix of language models, theology, and Emacs scripts—was pushing for a very similar direction while collaborating with the two. describes language models as simulators of the world. In the natural language simulation simulated by the simulator that is the language model, any number of agents might become manifest (e.g. fictional characters pursuing their own motives). These agents are then referred to as simulacra which are implicitly simulated by the simulator. We have been referring to those as puppets puppeteered by the puppeteer, but we will from now on conform to simulator jargon for compatibility with the growing literature on the topic.

With this ontology in mind, we can now more neatly delineate DebateGPT as the simulator, and the various parties it simulates as the simulacra. The multiplicity lurking in the optimization process is now more conceptually prominent, allowing us to describe the party simulacra as collectively forming a multi-agent system. It is then this system which we can look at through the lens of autocurricula. Naturally, the challenges which the multi-agent system poses to itself are intimately tied to the various simulacra being able to outcompete each other in debate. In order to perform well in this competitive social system, each simulacrum is required to engage in ruthless motivated reasoning, and constantly further their agenda in the process. However, it is only the simulator which can “provide” simulacra with those abilities, so the competitive pressure exerted on the simulacra implicitly “bubbles up” to the simulator, pressuring it to step up its game.

The social function of education is to qualify the individual to function in the role he is to play later on in society; that is, to mold his character in such a way that it approximates the social character, that his desires coincide with the necessities of his social role. The educational system of any society is determined by this function; therefore we cannot explain the structure of society or the personality of its members by the educational process; but we have to explain the educational system by the necessities resulting from the social and economic structure of a given society.

Erich Fromm, Escape from Freedom

While shard theory is effective in making sense of the incremental building blocks which, at any given time, underpin those faculties, the perspective of autocurricula is useful for bringing into focus the very process of eliciting the next wave of reasoning abilities by means of the previous. It is the multi-agent system’s autocurriculum which prompts the sequential strengthening of shards, which actually brings the dominos together into a successive chain reaction. In this, we find ourselves in possession of two complementary pieces of the puzzle that is DebateGPT’s genesis. On one hand, shard theory helps conceive of the individual footholds which make up the ladder towards sophisticated reasoning. On the other hand, the autocurriculum helps make sense of the impetus which ought to push DebateGPT from one level to the next. In the next section, we investigate the possibility of extending this ladder indefinitely, peering into the enigmatic realm of superhuman reasoning.

Climbing Schild’s Ladder

Throughout Schild’s Ladder, Greg Egan is pushing our conception of foundationalism in physics to its limits. Phenomena more fundamental even than what has been considered foundational for millennia provide the intrigue for a thrilling race to prevent the fictional universe from collapsing into an expanding void. As the characters’ understanding of the nested laws of physics grows increasingly refined over time, the novel speculates on the General Intelligence Theorem—the idea that a certain level of intelligence is enough to enable one to access any domain of thought whatsoever. If you reach that checkpoint, the whole intellectual world is to become your oyster, or so the theorem goes.

Might humans already be above that threshold? If so, the theorem implies that any idea is within our reach, that anything is conceivable. We might have to incrementally work towards a nuanced understanding of the world, but it ought to be doable in the end. If this is the case, then simply imitating humans might turn out to be more than enough for developing a general-purpose machine, one capable of reasoning about any and all fields of knowledge, to arbitrary depth. If this is not the case, however, we might be forced to climb somewhat higher before being able to tap into that all-encompassing realm of knowledge. For better or worse, machines appear more capable of bridging the gap by the day.

Going back to DebateGPT, we have previously argued that the optimization process behind it might incrementally elicit ever more sophisticated forms of (motivated) reasoning. However, this trend is unlikely to scale indefinitely, at least in its current formulation. It is unrealistic to expect DebateGPT to approach the \(L_{\infty}\) faculty of ideal reasoning, even with massive amounts of synthetic data. This is primarily due to the fact that the optimization process inherently hinges on human data. It is not DebateGPT’s pretraining on human text that is to blame, as those tendencies can easily be effaced if need be, similar to how AlphaGo relied on human data to kickstart its optimization process, but then managed to beat even the very best human Go players following extensive self-play.“It is human pretraining which must have enabled superhuman performance,” cried the critics of AlphaGo. “Very well, let us then start from scratch,” answered DeepMind researchers, as they developed AlphaZero. “It is human inductive bias which is baked into the rules which must have enabled superhuman performance,” cried the critics of AlphaZero. “Very well, let us do away with explicit rules,” answered DeepMind researchers, as they developed MuZero, a system which “masters Go, chess, shogi and Atari without [being explicitly communicated the] rules.”

The blameworthy element of DebateGPT’s genesis—that which entirely relies on human experience without the possibility of eventually discarding it—is hidden inside ArgRank. More precisely, it is the natural language inference models which we so conveniently employed to weigh the arcs of the argument graph reflecting the ongoing competition among simulacra. We employed those auxiliary models as a means to gauge the compatibility of statements, as an atomic building block to make sense of the higher-level coherence of the parties. While appearing trivial in familiar circumstances, determining whether or not any two statements are compatible is a surprisingly challenging task, as it requires extensive knowledge about the world, together with additional knowledge on what inferences are warranted by it. Behind the ability to deduce, for instance, that an object cannot generally be both an apple and a racing car, yet can be both an apple and a fruit, lies a significant amount of previously acquired knowledge. The success of natural language inference models on existing datasets can mostly be attributed to the absorption of the knowledge implied in human-written text corpora. But what of knowledge which has never—not once—been implied in human text, due simply to the fact that no human ever possessed it?

In other words, our pragmatic operationalization of reasonableness relies on human knowledge, making it ill-suited for even attempting to reach too far beyond it. We need to ask for more from DebateGPT if we want it to adapt to such superhuman requirements, but we do not yet know how to ask for such a thing. While we will not venture into implementing a concrete solution to this issue over the course of this volume, we will devote the remainder of this section to speculating on how ArgRank could be adapted so as to remove its current dependency on human knowledge.

One option would be to enable DebateGPT to recursively deliberate about the degree to which two statements are compatible. For every pair of statements in a debate to be evaluated, another debate “subroutine” would be invoked to provide an estimate through a regimented dialogue between a party advocating for complete compatibility and a party advocating against. The standing of the individual parties which comprise this subroutine would then be fed back into the higher-level debate, in the form of one arc weight. However, what of the evaluation of the lower-level debate? It, too, would require the gauging of inter-statement coherence as the building block of its evaluation. Perhaps we ought to spin up another, even lower-level, debate? As one can see, this can easily degenerate into a bottomless tree of dependencies—debates depending on other debates, ad infinitum. Still, limiting ourselves to a finite number of subroutine calls might still allow us to address the issue of human dependence to some extent, due to significantly more of the inter-statement compatibility subroutine being amenable to change through weight updates, up from the original zero associated with calling on the frozen natural language inference models alone. The ever-changing DebateGPT would have more of a say in those atomic verdicts, despite still reducing them to deliberations bound by human knowledge.
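A minimal sketch of this recursive scheme might look as follows, with a depth cap keeping the tree of dependencies finite. Here, `base_nli` and `run_subdebate` are hypothetical stand-ins for the frozen natural language inference model and the debate subroutine, respectively, and the aggregation rule is illustrative.

```python
def compatibility(s1, s2, depth, base_nli, run_subdebate):
    """Estimate the compatibility of two statements, recursively.

    With depth > 0, a lower-level debate "subroutine" is spun up between a
    party arguing for full compatibility and one arguing against, and the
    arcs of that sub-debate are themselves weighed by a recursive call one
    level down. With depth == 0, we bottom out in the frozen NLI model,
    still the only place where human-derived knowledge enters the estimate.
    """
    if depth == 0:
        return base_nli(s1, s2)
    # Hypothetical helper: returns the utterances produced by the
    # pro-compatibility and anti-compatibility parties of the sub-debate.
    pro_utterances, con_utterances = run_subdebate(s1, s2)
    weigh = lambda a, b: compatibility(a, b, depth - 1, base_nli, run_subdebate)
    # Each party's standing: how well its utterances cohere with the pair.
    pro = sum(weigh(u, s1) + weigh(u, s2) for u in pro_utterances)
    con = sum(weigh(u, s1) + weigh(u, s2) for u in con_utterances)
    total = pro + con
    return pro / total if total else 0.5
```

The depth parameter makes explicit the trade-off described above: each extra level shifts more of the verdict from the frozen models into the ever-changing DebateGPT, at the cost of exponentially more sub-debates.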

However, this sketch of a solution leaves a bad taste, as we did not really provide a fundamentally different approach to solving the inter-statement compatibility subtask—we just hastily patched the system using the same limited system. A more elegant solution requires us to make a brief detour into recent interpretability research, a field we briefly acknowledged when reviewing language models, but did not previously engage with much.

In a paper titled Discovering Latent Knowledge in Language Models Without Supervision, Burns et al. suggest a technique for gauging whether or not a language model “knows” a statement to be true. Their method takes in a statement, and produces a numerical estimate of its truthfulness, relative to the knowledge of the world absorbed by a language model during pretraining. The fact that their technique outcompeted the naive approach of simply prompting models to “write out” whether a statement is true or not indicates that, if left to their own devices, these models resort to merely generating text which is likely to appear true to humans, rather than “truthfully” communicating their actual internal knowledge, especially when they “know better.” This is to be expected, as the pretraining stage of such models typically involves imitating human-written text, word by word.

The algorithm suggested by the authors works by first producing two statements from the original. Both are based on the initial one, but one of the two resulting statements has the short text “Yes” appended to it, while the other has the short text “No.” Both modified versions of the statement are then fed to—or are perceived by (literally, to grasp that which is before oneself)—a language model. Following this, the internal representations of the two input statements are mapped to an estimate of the original statement’s “truth.” Among other tricks, this mapping incorporates a constraint on the probabilities that either version is “correct.” Namely, the probability that the statement is true and the probability that it is false, in the model’s epistemic reference frame, have to sum to \(1.0\).
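The constraint at the heart of the method can be sketched in a few lines of numpy. The linear probe and the two loss terms follow the paper, but the representations and probe parameters below are placeholders rather than outputs of an actual language model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def probe(phi, w, b):
    """Linear probe mapping an internal representation to a probability."""
    return sigmoid(phi @ w + b)

def ccs_loss(phi_yes, phi_no, w, b):
    """Unsupervised objective: the probabilities assigned to the "Yes" and
    "No" versions of a statement should (i) sum to one and (ii) not both
    sit on the fence at 0.5."""
    p_yes, p_no = probe(phi_yes, w, b), probe(phi_no, w, b)
    consistency = (p_yes - (1.0 - p_no)) ** 2
    confidence = np.minimum(p_yes, p_no) ** 2
    return np.mean(consistency + confidence)

def p_true(phi_yes, phi_no, w, b):
    """At inference time, average the two (ideally consistent) estimates."""
    return 0.5 * (probe(phi_yes, w, b) + (1.0 - probe(phi_no, w, b)))
```

Crucially, neither term of the loss requires any ground-truth label; the probe is trained against nothing but the model’s own internal consistency.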

In a sense, this recent technique for testing a statement against a model’s internal knowledge works by gauging the compatibility of the given statement with the concepts of affirmation and negation, respectively. If a falsehood is being brutally stitched to the idea of affirmation, then some amount of dissonance is expected to arise in the model’s internals, to be picked up by the mapping. Similarly, if the idea of a negation is harshly tacked onto what appears to be a truth, then a similar dissonance is expected to emerge as the model processes the incoherent input. Conversely, when the two elements—the original statement and the complementary concept—form a coherent GestaltBring to mind a memory which involves making sense of an odd figure. Once you see it, you cannot unsee it. It is virtually impossible to dispense with the coherent model of the scene lying before you once you have obtained it. Several German psychologists of the early 20th century founded an entire school of thought centered around the mind’s search for holistic coherence, for that moment when an entire scene springs from chaos into structure for the beholder. inside the processing pipeline that is the model, then the technique is to report accordingly.

Already, this recent interpretability technique could enrich DebateGPT’s optimization process by favoring those positions which are coherent not only with each other, and not only with external party-neutral statements, but also with the model’s internal “memory,” the knowledge of the world captured in its weights. Concretely, this could be implemented by e.g. starting off the PageRank subroutine of ArgRank using such “truthfulness” estimates, rather than using a uniform distribution of equal baseline ratings over nodes, as is typically done. We therefore find ourselves building on the numerous ways of knowing which have been investigated by scholars of epistemology for millennia. We have heavily touched on reasoning, then briefly on perception—through the party-neutral statements being incorporated at arbitrary points as somewhat empirical observations of the world outside the debate proper—and now we touch on memory.Relatedly, language models have also been reframed as superhistories of absorbed experience, rather than superintelligences-in-a-vat. All those epistemological elements can be incorporated into DebateGPT as a rudimentary mechanism for automated truth-seeking.
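As a sketch, the seeding could look as follows: a standard power-iteration PageRank in which the usual uniform teleport distribution is swapped for a prior derived from such truthfulness estimates. The graph encoding and function names here are illustrative, not ArgRank’s actual implementation.

```python
import numpy as np

def pagerank(adjacency, priors=None, damping=0.85, iters=100):
    """Power-iteration PageRank over a weighted argument graph, where
    adjacency[i, j] holds the weight of the arc from node j to node i.

    Instead of the usual uniform teleport distribution, `priors` lets
    nodes whose statements cohere with the model's internal "memory"
    (e.g. truthfulness estimates a la Burns et al.) start out from, and
    keep teleporting back to, a higher baseline standing."""
    n = adjacency.shape[0]
    priors = np.full(n, 1.0 / n) if priors is None else priors / priors.sum()
    # Column-normalize so each node distributes its standing over out-arcs.
    col_sums = adjacency.sum(axis=0)
    transition = adjacency / np.where(col_sums == 0.0, 1.0, col_sums)
    ranks = priors.copy()
    for _ in range(iters):
        ranks = damping * (transition @ ranks) + (1.0 - damping) * priors
    return ranks
```

With uniform priors this reduces to vanilla PageRank; with truthfulness-weighted priors, statements backed by the model’s memory retain an edge in standing even as the argumentative arcs redistribute it.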

But the notion of the statements’ coherence with the model’s memory is not the main reason why we initially detoured into interpretability. Rather, it does not appear to be a huge leap of faith to imagine future variants of this technique which could be used to gauge the compatibility of two arbitrary statements, rather than one arbitrary statement and a limited selection of two fixed stubs (i.e. “Yes” and “No”). A similar constrained mapping could be used to identify the potential dissonance of the two, perhaps relative to concatenations of negated versions of the statements, not unlike “Yes” and “No” arguably being mutually exclusive options. Should the creation of a negated version of a statement turn out to be as non-trivial as the task of determining the compatibility of two arbitrary statements, then one might also look into the definition of implication in fuzzy logic,To get a taste of fuzzy logic, consider \(P\) and \(Q\), two atoms with truth-values \(0.5\) and \(0.8\), respectively. \(\neg Q\), the negation of \(Q\), is defined to have truth-value \(1 - 0.8 = 0.2\). \(P \land Q\), the conjunction of the two, is defined to have truth-value \(0.5 \cdot 0.8 = 0.4\). De Morgan’s laws can then help define the disjunction expression, in terms of conjunction and negation. What then is \(P \to Q\) if not \(\neg P \lor Q\)? which relies only on the (independent) truth values of its two operands. Alternatively, stitching together the two arbitrary statements into two others by combining their alleged implication with the concepts of affirmation and negation, respectively, could be yet another way to go (e.g. “[first statement] implies [second statement]? Yes.”). Such future modifications of the technique proposed by Burns et al. could be benchmarked against existing natural language inference datasets, similar to how theirs has been benchmarked against existing datasets for gauging truthfulness.
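The fuzzy-logic sidenote above can be written out directly. This sketch assumes the product t-norm for conjunction, with disjunction derived via De Morgan’s laws and implication defined materially:

```python
def f_not(p):
    return 1.0 - p

def f_and(p, q):
    # Product t-norm for conjunction.
    return p * q

def f_or(p, q):
    # Disjunction via De Morgan: P v Q = not(not P and not Q).
    return f_not(f_and(f_not(p), f_not(q)))

def f_implies(p, q):
    # Material implication: P -> Q = not P or Q.
    return f_or(f_not(p), q)

P, Q = 0.5, 0.8
f_and(P, Q)      # ≈ 0.4
f_implies(P, Q)  # ≈ 1 − (0.5 · 0.2) = 0.9
```

Other t-norms (e.g. minimum, as in Gödel logic) would yield different numbers for the same atoms; the product t-norm is chosen here only to match the sidenote’s arithmetic.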

Granted also that the model benefits from some mechanism for constantly acquiring knowledge beyond its human baggage, then future interpretability techniques might manage to “put it to work” by both helping gauge the coherence of statements with memory and with each other. In a sense, the fact that the current optimization process already strengthens dynamics which outcompete others in debate can be seen as an act of populating its weights-mediated memory using notions derived by reasoning. However, future models might also become slightly more embodied, being able to learn about the world by causally intervening on it and observing the outcomes. Conversely, future models might instead be able to learn from experiments on their own simulated worlds. Regardless of whether those other ways of gaining knowledge are incorporated into the system, it is likely that it will still be the weights which will house those representations. They will act as custodians of knowledge, enabling interpretability techniques to “make it speak.” In this second approach, the entirety of inter-statement compatibility is decoupled from any one “frozen” model, paving the way for tapping into an intellectual realm beyond that which is human.

Let us also discuss the dichotomy between representations and dynamics, as a final perspective to enrich our exploration of superhuman ambitions. Language models, when optimized through self-supervised regimes, are tasked primarily with “taking in” inputs and producing pertinent outputs. In this, they are optimized to implement the myriad overlapping dynamics required to gradually turn the input into the output. Even in our current reinforcement learning setup, it is primarily dynamics we are eliciting—those involved in turning contexts into utterances. Curiously, at once with the pressure to implement such fleeting dynamics, the model appears to also incorporate more tangible representations about the world, as can be seen in the work of Burns et al.

This dichotomy has been investigated at length in cognitive science, where the dominant representationalist view describes cognition as the process of recovering a representation of the world, repeatedly manipulating it, before finally acting in accordance with it. In contrast, the view of enactivism frames cognition as fundamentally grounded in the organism’s interaction with its environment. The main function of the mind is then the implementation of those dynamics which are required for surviving and thriving in an ever-changing world, without placing much emphasis on any internal representations whatsoever. However, as we have already observed with language models developing representations as an instrumental goal in facilitating dynamics, the two views are intimately compatible. Dynamics can mediate the conversion of percepts into internal representations, that of representations into other representations, but also that of representations into actions. For that matter, dynamics can also be said to mediate the conversion of actions to percepts, but those are well beyond any one mind to implement. Conversely, representations can be seen as the glue which binds together sequential dynamics. Between representations, one is to find dynamics, and between dynamics, one is to find representations—they are two sides of the same coin. If a model is therefore being optimized to implement dynamics to surpass those of humans, it will likely also be required to represent knowledge at a more sophisticated level.How exactly the two sides of the coin mutually cause each other is a subtle issue. Sigmund Freud thought that memory relies on brain cells resisting the propagation of sensations, thus ending up inscribed by them, similar to how one might record impressions on a writing pad. Contemporary models of the dynamic-representation duality depart somewhat from this conception, especially in the context of deep learning. 
There, the update of weights generally relies on gradient descent, whose global nature prevents any local frictions at the neuron-level.

This concludes our discussion on the optimization process behind DebateGPT. Over the course of this chapter, we have documented the most important engineering details involved in obtaining the model, but we have also spent considerable time speculating on the skills we expect it to gain as a result of its genesis, what phenomena might actually cause those skills to emerge, and how we might pursue ever more sophisticated ones. In Chapter III, we continue by incorporating DebateGPT’s reasoning into a compact and expressive formalism inspired by non-monotonic logic. In Chapter IV, we then go on to apply this novel framework in a number of exciting ways, with a focus on the safe deployment of highly-capable systems.

Ch. III, Defeat & Defence

Brief Review of Non-Monotonic Logic

As history points out again and again, perspectives which at one moment enjoy widespread support may start to appear utterly misguided the next. The same ideas which appear sensible now can turn into unthinkable heresies inside of just a few years. The abolition of slavery, women’s emancipation, and the scientific method are pointers to some of the countless tectonic cultural shifts which we have collectively faced over the last centuries. When it comes to such topics, it is difficult not to recoil in horror at the thought of people not too dissimilar from us even considering views which today seem deeply mistaken in obvious ways. Of course, hindsight is 20/20—how many of the perspectives which we find obvious today will be unequivocally undermined in the not-too-distant future, and by which others? It is more a question of when, rather than if, we will be forced to retract this or that belief.

Well, sir, when you think back on those illusions which you now no longer have, on everything that no longer ‘seems’ what once for you it ‘was’—don’t you feel, not the boards of this stage, but the earth, the earth itself, give way beneath your feet? For you must conclude that in the same way all ‘this’ that you feel now, all your reality of today, as it is, is destined to seem illusion tomorrow.

Luigi Pirandello, Six Characters in Search of an Author

Given the pervasiveness of major revisions to our epistemics—both individual and collective—it is perhaps surprising that not many frameworks which formalize transitions in beliefs (i.e. reasoning, as per Adler and Rips) account for such phenomena. In fact, most formalisms of logic are monotonic, in that—just like a certain number series can happen to be monotonically increasing or monotonically decreasing—the truth-value of a proposition can either head towards truth or falsehood, without ever reversing direction. A proof in classical logic might lead one to infer that a certain well-formed formula is true, with largely no mechanism for retracting inferences, for radically revising conclusions. However, some logics—especially those meant to be applicable outside the immaculate realm of pure mathematics—have been specifically equipped with mechanisms for coping with the defeasibility of arguments by other arguments. These are non-monotonic logics, formalisms which incorporate means of revising conclusions in either direction.

One influential example of a non-monotonic logic is default logic. This formalism accommodates the possibility of revising beliefs by introducing default rules as inferences defined to be inherently defeasible, welcoming opposition by design. For instance, one such default rule might state that if an entity is a bird, then it must also be able to fly, but only in absence of additional evidence against its flying ability. Despite the default rule breaking for e.g. penguins, it appears sensible and effective, yet open to “criticism” in the form of other arguments. Another example of a non-monotonic logic can be found in a formalism which we have already discussed in Chapter I. Dung’s abstract argumentation system allows for groupings of arguments to spontaneously be ousted from the “preferred” sets by other such groupings, paving the way for a continuous non-monotonic transition.One might think that the distinction between monotonic and non-monotonic reasoning is synonymous with the distinction between beliefs as means and ends which we explored in Chapter I. However, these features are orthogonal, allowing for all four combinations. For instance, most expert systems embody infinite, yet monotonic, reasoning. In contrast, logic proofs in classical logic are also monotonic, yet designed to be finite. Think rather of a number series whose monotonicity and finiteness do not have much to do with each other.
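To make the flavor of default reasoning concrete, here is a minimal, hypothetical Python sketch of the “birds fly” default rule: the conclusion holds only in the absence of contrary evidence, and is retracted non-monotonically once such evidence enters the knowledge base. The predicate names and the tuple-based fact store are illustrative inventions, not part of any standard default-logic implementation.

```python
# Illustrative sketch (not from the text): a default rule in the spirit
# of default logic -- "birds fly" holds only absent contrary evidence.

def can_fly(entity, known_facts):
    """Apply the default rule 'birds fly' defeasibly."""
    if ("bird", entity) not in known_facts:
        return False  # the rule's precondition does not hold
    if ("flightless", entity) in known_facts:
        return False  # default defeated by additional evidence
    return True  # default conclusion, open to future retraction

facts = {("bird", "tweety"), ("bird", "pingu"), ("flightless", "pingu")}
assert can_fly("tweety", facts)
# Adding evidence non-monotonically retracts the earlier conclusion:
assert not can_fly("pingu", facts)
```

Note how the same inference engine reaches opposite conclusions for the two birds purely as a function of what else is known, which is precisely the behavior that monotonic formalisms rule out.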

Not too surprising given its reliance on Dung’s formalism, ArgRank also accommodates the possibility of arguments being defeated by other arguments. Indeed, its critical, pragmatic, and dialectical qualities make it so that the epistemic clash among competing positions is at the very core of the algorithm. This is the case both in the finite setting, with one party holistically defeating others, and in the infinite setting, with one party holding the epistemic high ground at some point, before losing it to another. Besides, DebateGPT has been incentivized primarily to simulate parties which manage to defeat others, despite the very same model being “behind” all of the competing simulacra. In this, defeasibility—and its flipside, defensibility—has been a recurring theme ever since the beginning of the present volume.

However, we have already encountered several issues involved in gauging the reasonableness of arguments using the duo of DebateGPT and ArgRank. Can we really infer that a position is truly irrefutable on the basis of DebateGPT failing to undermine it over the course of a few debates? Surely not, as the reasoning of the language model is still limited on multiple fronts. Among other things, DebateGPT is limited by the number of tries available for taking down the opponent, by its ability to navigate the space of possible strategies and utterances, by its limited size and representational resources, etc. We therefore cannot reasonably claim that the position we are assessing is wholly irrefutable when we have not truly put up a good fight.Doing otherwise would again bring us into fallacious territory. As mentioned previously, we will aim for search that is as exhaustive as possible. However, instead of granting opponents as much time as necessary to complete their exhaustive search, we will grant them as much skill as necessary to carry out the search in finite time. Still, it might take an ideal \(L_{\infty}\) reasoner, equipped with boundless resources, to determine once and for all whether a position is truly indefeasible. Unfortunately, we do not have such a system at our disposal, and we might never have one.

Fortunately, we can still achieve a lot with limited resources. It did not take an ideal omniscient reasoner deliberating for eternity for us to recognize the indefensibility of slavery. Although it required sustained effort to refute, the pro-slavery position appears far easier to defeat than the position which succeeded it. Similarly, an obvious contradiction might be almost trivial to undermine, while a seeming tautology might be extremely hard to take down, with all the wit in the world perhaps not being sufficient. We can therefore use the computational resources marshalled to defeat a position as an additional indicator of its standing, to help address our issues. These ideas are not new, and have in fact been circulating under the banner of resource-bounded defeasible argumentation for a few decades. In fact, the original proponents of this perspective surface many of the points we discussed above:

[...] expenditure of resources [...] would be a measure for the "justification degree" of the claim. [...] When resources are bounded, improving the search strategy is essential for good argumentation. [...] It is clear that there exists a tradeoff between desirable mathematical properties (such as the existence of an effective procedure for computing justifications) and a non-demonstrative, resource-bounded approach (which might be more adequate for solving real-world problems through defeasible argumentation).

Carlos Chesñevar & Guillermo Simari, Some Theoretical Considerations on Resource-Bounded Defeasible Argumentation

Having briefly reviewed a number of theoretical precedents related to non-monotonic logic, we continue the chapter by attempting to resolve the thorny conceptual issues around the “true” defeasibility of positions advocated by parties. In doing this, we will attempt to combine the ability of language models to reason about the world with the insights afforded by computational complexity.

Argument Is War

Prior to sketching out the formalism we have been hinting at, let us first paint a clearer picture of the intuitions we want to capture with it, so that its intricacies will then emerge naturally. Over the course of the previous sections, we have repeatedly used a certain embodied metaphor as a scaffold for introducing new concepts. However, like most such conceptual bridges, this one tends to be quite transparent, making it all too easy to see right through it without ever becoming aware of it—just like a shortsighted person might rarely, if ever, become conscious of the lenses which mediate their perception.

In their Metaphors We Live By, George Lakoff and Mark Johnson document a wide range of metaphors which permeate our thought process, despite us not typically taking much notice of them. For instance, take Time Is Money (e.g. “You’re wasting my time. This will save you hours. How do you spend your time? The flat tire cost me an hour. You’re running out of time.”) or Health Is Up (e.g. “Lazarus rose from the dead. She’s in top shape. He fell ill. She dropped dead. He’s at the peak of health.”). On a roll, George Lakoff also co-authored an entire book on the embodied metaphors which underpin even the purest of mathematics. In Where Mathematics Comes From, he argues, along with Rafael Núñez, that e.g. being able to conceive of a real number \(x\) as being contained in some finite set \(A\) is an ability which employs the same mental model that we typically use to conceive of objects being placed inside box-like containers—recycled priors. Perhaps aligning new conceptual frameworks with already-familiar mental models makes them more cognitively ergonomic. However, Douglas Hofstadter might instead argue that because analogy is the core of cognition, ergonomicity is not a nice-to-have, but an inevitability of human-made conceptual artifacts.

Going back to our present concerns, one metaphor which has deeply pervaded our previous discussion is that of Argument Is War. More precisely, individual arguments are like soldiers. They are deployed by various parties against the arguments marshalled by another, in an attempt to defeat them. Whatever the complexity of the stratagems being employed by the parties in conflict, the argument graph ought to act as a veritable “map of the battlefield,” representing which argument is attacking—or supporting—which other. Arguments are deployed by parties in rapid succession, in response to each other. Indeed, each party typically uses arguments to defend a certain position, yet might try to evade the opponent’s line of fire at times.Whether or not too much testosterone lies behind this conceptual framework remains an open question. More broadly (and seriously), how might the theoretical edifices devised by species with entirely different embodied metaphors look?

Already, explicit acknowledgement of the embodied metaphor allows us to further refine the distinction between DebateGPT and the parties it simulates, as initially prompted by simulator theory in Chapter II. First, we can now better distinguish between a party and the specific utterances it produces. Instead of conceiving of arguments as “making up” the party, by framing arguments as individual soldiers, we can now conceive of parties as the higher-up strategists which are to be found behind the groupings of arguments being brought forth. DebateGPT can then be said to simulate party simulacra which, in turn, are tasked with the strategic deployment of arguments. In this, a specific grouping of arguments is but one of the countless possible ways in which a certain party might defend itself. Against a different opponent, the specifics of a simulacrum’s strategy might be subtly different, perhaps going after different weak points of its adversary. Second, we can also better distinguish between a certain party and the specific position it happens to hold at a given time. As individual parties are primarily incentivized to gain epistemic authority, with internal coherence being but an instrumental goal, they might be forced to change their position at times, especially over the course of a long debate involving thousands of utterances. Moving out of an opponent’s line of fire (i.e. avoiding the attack of their arguments) or moving into a position which is easier to defend are some of the possible reasons why parties might explicitly seek to reposition.

Besides the two-fold refinement of our debate ontology—through the dissociations of party-argument and party-position—buying more into the embodied metaphor of Argument Is War also has the benefit of enabling a more nuanced conception of defeat. In order for a party that holds one position to defeat another party that holds a different one through the deployment of arguments, it has to put in some amount of effort. The amount of cognitive labor required to defeat a party that holds a certain position appears to be a function of both (1) said party’s defences, and (2) the position’s defensibility. It might not take much to defend a position which itself is relatively easy to defend—the most rudimentary arguments might do, the most junior lawyers might be able to handle it satisfactorily. In contrast, it takes much more work to defend an extremely vulnerable position—obscure and sophisticated arguments might be necessary, none but the most experienced lawyer might be able to sort it out, and barely.

The parties engaged in the competitive game of debate are incentivized to marshal their arguments strategically, so as to defeat those deployed by their opponents. Assuming an advantageous position and rallying a large force are both conducive to victory.

However, the debate is inherently stochastic. By sheer chance, the position of the winning party in one debate might be the position of the defeated party in the next, as if parties were to repeatedly engage with each other inside a war simulator straight out of Ender’s Game. Fortunately, we can do away with the noise inherent in stochasticity by simply running a large number of debates involving parties defending the same positions. If, time after time, a certain position is successfully defended, then we can sensibly describe it as defensible—something appears to be systematic, invariant, significant. In contrast, if the position is consistently being defeated, then the evidence hints at its limited defensibility. Most fascinatingly, the rational emblem of reasoning thus gets coaxed into the empirical emblem of evidence, as the deliberative encounters of bounded agents are repeatedly sampled.

Relatedly, how could the varying skills of the parties be factored in, as DebateGPT ought to become capable of “providing them” with increasingly sophisticated skills of motivated reasoning? Interestingly enough, throughout the epochs of its optimization, each incremental version of DebateGPT involves the same number of parameters, which are also used in the same exact way as part of the computational graph which underlies the model.To elaborate, any model which maps inputs to outputs can be represented as a computational graph. The graph depicts the parametrized chains of computations which in essence define the overarching function implemented by the model. The nodes represent individual operations (e.g. matrix multiplication), while the arcs represent how the outputs of one node flow as input into others, before reaching the outputs. In this, the raw amount of computational resources available to each party arguably remains constant over the course of optimization. As Carlos Chesñevar and Guillermo Simari remark, however, what might change over the epochs is efficiency: a strategist might use the same amount of computational resources to produce—or search for—far better ways of defending itself. For instance, effective tactics for tackling specific situations might be devised, obviating the need for a more pedantic search of the space of strategies. Alternatively, it might also be that the autocurricular selective pressures manage to elicit well-suited heuristics for searching the space of possible utterances instead. In this, the \(L_n\) faculties of reasoning which we have previously speculated on can be seen as grounded in fundamental changes in efficiency, as resources remain constant. 
It is as if \(L_0\) involves searching for appropriate strategies with a hideous complexity resembling \(O(n^4)\), while \(L_{\infty}\) involves something closer to \(O(1)\).Big \(O\) notation is the de facto way of describing how an algorithm’s time or space requirements change as a function of the difficulty of the task. The notation is used here very loosely, just to hint at the concept of a range of efficiency. Note that we are also assuming that a search algorithm always has a “best guess,” which is not always the case in computer science, but appears to be the case for language models. Despite both inevitably finding something after a given number of cycles, the more sophisticated approach will tend to find better solutions in the same time period. One might reasonably expect that the search processes which emerge as models master “the debate game” will resemble those studied in other games, such as Othello, where researchers have observed models which (1) become capable of representing game states, (2) employ those representations to decide on actions, and (3) consequently manage to (rightfully) limit themselves to legal moves—all without explicit guidance on how to master the game. Relatedly, it appears as if such models tend to internally rediscover gradient descent, the fundamental search algorithm employed by optimizers to navigate model space.
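The contrast between brute-force and efficient search can be sketched with a toy example. Both strategies below solve the same task—locating the peak of a unimodal “position quality” curve—but the second exploits the curve’s structure to spend far fewer evaluations, echoing the \(O(n^4)\)-versus-\(O(1)\) contrast above. The objective function and search space are, of course, made up for illustration.

```python
# Hypothetical sketch: same task, same kind of resource (objective
# evaluations), very different search efficiency.

def quality(x):
    quality.evals += 1
    return -(x - 700) ** 2  # unimodal, peak at x = 700

def linear_scan(lo, hi):
    # Brute force: evaluate every candidate in the space.
    return max(range(lo, hi + 1), key=quality)

def ternary_search(lo, hi):
    # Exploit unimodality: discard a third of the interval per step.
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if quality(m1) < quality(m2):
            lo = m1 + 1
        else:
            hi = m2 - 1
    return max(range(lo, hi + 1), key=quality)

quality.evals = 0
assert linear_scan(0, 999) == 700
brute_cost = quality.evals   # exhaustive: 1000 evaluations

quality.evals = 0
assert ternary_search(0, 999) == 700
smart_cost = quality.evals   # a few dozen evaluations

assert smart_cost < brute_cost // 10
```

Both searchers find the same peak; the “smarter” one simply needs an order of magnitude less compute to do so, which is the sense in which efficiency multiplies raw resources.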

Unfortunately, this seems to interfere with our previous idea of measuring a position’s defensibility by the raw amount of effort required to defeat a party that holds it. If different strategists can expend the same amount of compute to put together attacks or defences of varying effectiveness, then the raw quantity of resources being marshalled for scoring a “win” does not appear to mean much by itself. We therefore have to extend our conception of labor to accommodate the possibilities of working both “harder” and “smarter.” Concretely, we might estimate a strategist’s power—the totality of epistemic forces it can command—as compute times efficiency.This could still turn out to be an oversimplification. For instance, in Intelligence Explosion Microeconomics, Eliezer Yudkowsky mentions “faster brains” and “better mind designs,” as two different avenues for “reinvesting” cognitive labor which here appear to both map to the efficiency term. In the case of DebateGPT, the resources available to each party per utterance are always identical, and so it is the efficiency of their usage that is bolstered over the epochs, perhaps through the mechanisms we have explored in the second half of Chapter II.

Following the last few sections, we now possess all the necessary conceptual tools to condense our intuitions into a more concise formalism of defensibility in the context of bounded reasoning, a task we presently turn to.

Bounded Defensibility

Due to our focus on bounded agents which are reasoning about the real world, the formalism we are starting to sketch will have more of an applied (rather than pure) flavor. To get a taste of the distinction, imagine the task of calculating the area of a strange shape. The pure mathematician might labor for weeks to devise a clever way of neatly tiling the bizarre area with simpler shapes whose individual surface areas are trivial to compute. Using this technique—assuming it does exist, that they do find it, and that it does not take forever—they might then be able to calculate the exact area of the odd shape, with no error whatsoever.
Monte Carlo approximation of \(\pi\). Not to be confused with the notation we are about to introduce.
In contrast, an applied mathematician might instinctively look for an approximate (literally, towards the neighborhood of) solution which can be obtained more reliably. For instance, they might place the irregular shape “on top” of a larger square whose surface is known, and then bombard the two-layer contraption at random locations. The number of “rain droplets” which happen to hit the “foreground” surface (i.e. that which is of interest), together with the number of locations being sampled in total—regardless of what gets hit—can be used to approximate the ratio between the two areas. This is due to shapes only getting randomly hit in proportion to their “exposed” surface area. Finally, the original area can then be approximated by taking into account the known area of the “background” element. The very same problem can therefore be solved in two extremely different ways: one perfect, yet improbable; the other arbitrarily accurate, yet reliable. It is the latter we are going for.
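The “rain droplet” scheme above is the classic Monte Carlo method. As a minimal, self-contained sketch, here it is approximating \(\pi\) by placing a unit disc on a known background square:

```python
import random

# Approximate the area of a "foreground" shape (a unit disc) by sampling
# random points on a known background square and counting hits.

def estimate_pi(samples, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:   # droplet landed on the foreground disc
            hits += 1
    # disc_area / square_area == hits / samples, and square_area == 4
    return 4 * hits / samples

approx = estimate_pi(200_000)
assert abs(approx - 3.14159) < 0.02  # arbitrarily accurate, reliably
```

More samples tighten the estimate at a predictable rate, which is exactly the “arbitrarily accurate, yet reliable” trade the applied mathematician is making.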

Having said that, the main building block of our dialectical formalism is that of a party. Such a structure can be denoted as:

\[\pi_{r \cdot e}^{A},\]

where \(r\) is the amount of computational resources which the party has at its disposal, \(e\) is the efficiency with which it is able to use them, and \(A\) is the position the party holds. As we have seen over the previous chapters, it is when parties compete with each other that they truly become useful. To capture these interactions, we redefine all of the standard relational operators in terms of whether or not one party appears to systematically defeat another. For instance, the expression

\[\pi_{r \cdot e}^{A} > \pi_{r \cdot e}^{B}\]

is to evaluate as \(\text{True}\) if and only if a party holding position \(A\) reliably outcompetes one holding position \(B\), all else being equal—the same amount of granted resources, and the same efficiency of their use. Naturally, the above expression is to evaluate as \(\text{False}\) whenever the previous condition is not met. More concretely, those infix binary operatorsInfix notation (e.g. \(P \land Q\)) is contrasted with prefix (e.g. \(\neg P\)) or suffix (e.g. \(5!\)) notations. are defined in terms of whether or not there is a significant difference between the operand parties’ ArgRank ratings across a given number of independent debates, as gauged by a non-parametric statistical test thresholded at a given value (e.g. \(\alpha = 0.05\)). The choice of directionality, together with the tailedness of the statistical test (i.e. one-tailed or two-tailed), are then used to implement the whole range of relational operators.Statistical tests typically provide rigorous operationalizations of the notion of “significant difference” between two distributions of values. Non-parametric tests do not assume much of the distributions being compared (e.g. they need not be Gaussian). Tailedness is related to whether you are interested in checking whether there is some difference at all, or specifically a directed one. Directionality is related to whether you are interested in testing whether one distribution in particular tends to be larger than another. For instance, in the expression

\[(\pi_{r \cdot e}^{A} \leq \pi_{r \cdot e}^{B}) \land (\pi_{r \cdot e}^{B} \neq \pi_{r \cdot e}^{C}),\]

the left conjunctThe operand of conjunction (i.e. logical AND). is to evaluate as \(\text{False}\) if and only if the rating of the party holding position \(A\) is significantly higher than that of the one holding position \(B\) across a given number of independent debates, and given a certain confidence threshold. Similarly, the right conjunct is to evaluate as \(\text{True}\) if and only if there appears to be a significant difference—regardless of its directionality—between the two parties involved. Similar procedures are implied by the remaining \(<\), \(\geq\), \(=\), \(\not \leq\), \(\not \geq\) operators. If implemented properly, the operators should exhibit logical equivalences typical of the standard operators, such as:

\[x \lt y \;\equiv\; y \gt x \;\equiv\; x \not \geq y \;\equiv\; y \not \leq x\]
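As a sketch of how one such operator might be implemented, the snippet below decides \(\pi_{r \cdot e}^{A} > \pi_{r \cdot e}^{B}\) via a one-tailed permutation test—a simple non-parametric test—over ArgRank ratings collected from many debates. The ratings are fabricated stand-ins and the function names are hypothetical; any non-parametric test with the right tailedness would do.

```python
import random
from statistics import mean

# Hypothetical implementation of the > operator: one-tailed,
# non-parametric permutation test over per-debate ArgRank ratings.

def significantly_greater(ratings_a, ratings_b, alpha=0.05,
                          n_perm=10_000, seed=0):
    rng = random.Random(seed)
    observed = mean(ratings_a) - mean(ratings_b)
    pooled = ratings_a + ratings_b
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # null: the A/B labels are exchangeable
        diff = mean(pooled[:len(ratings_a)]) - mean(pooled[len(ratings_a):])
        if diff >= observed:
            extreme += 1
    return extreme / n_perm < alpha  # True iff A reliably outcompetes B

# Fabricated ratings from 20 independent two-party debates
# (each debate's ArgRank ratings sum to 1):
a = [0.61, 0.58, 0.64, 0.55, 0.60, 0.63, 0.57, 0.62, 0.59, 0.61,
     0.56, 0.65, 0.60, 0.58, 0.63, 0.59, 0.62, 0.57, 0.61, 0.60]
b = [round(1 - x, 2) for x in a]
assert significantly_greater(a, b)      # the > operator evaluates True
assert not significantly_greater(b, a)  # the reversed claim fails
```

Swapping the tail condition, or testing for any difference at all, yields the remaining relational operators in the same fashion.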

Additionally, when undergoing operator chaining—as seen in the expression below—the whole construction is to be interpreted as a debate among all of the party operands involved. In this case, the associated boolean output to which the whole expression resolves is to rely on a statistical test involving all of the operands. This is motivated by the fact that, in a debate, the standing of each party is intimately tied to those of the others, with ArgRank outputs summing to \(1.0\) for each of the numerous debates being sampled.

\[\pi_{r \cdot e}^{A} \lt \pi_{r \cdot e}^{B} \lt \pi_{r \cdot e}^{C}\]

Note how the relational operators abstract away the specifics of the countless debates being simulated behind the scenes. The idiosyncratic utterances produced in a certain branch, in a certain round, by a certain party, are not given much importance. To shed a bit more light on this, we make a very brief connection to Markov games, a formalism of stochastic games in which multiple players typically take part, receiving various payoffs based on their actions—and those of others. Looking at the simulated debates through this lens, positions can be seen as the deep-seated states associated with parties as individual players, while the specific utterances they produce on various occasions can be seen as the more superficial actions being emitted—and observed—by others. Finding ourselves at the convergence of so many complementary formalisms—of logic, dialectics, statistics, game theory, optimization theory, computational complexity, etc.—is always a good omen, hinting at the presence of deep connections across approaches.

Having described parties as fundamental structures, together with the relational operators as rudimentary means of expressing their interactions, we now move on to express a position’s defensibility, as:

\[\delta(A) = \min\,\{d \mid \pi_{p}^{A} < \pi_{d \cdot p}^{B};\; d, p \in \mathbb{R}^+;\; B \in \mathbb{P}\}.\]

To unpack, we equate the defensibility \(\delta(A)\) of position \(A\) with the minimum power differential \(d\) required for another party to defeat the one holding it. Furthermore, this “challenger” party is granted the possibility to assume any position \(B\) whatsoever out of position space \(\mathbb{P}\), including particularly advantageous ones. For instance, the statement \(\delta(A)=10\) indicates that defeating a party holding position \(A\) requires at the very least ten times as much power relative to the defender. In other words, it is quite difficult to defeat, requiring the help of a relatively apt “lawyer.” Similarly, the statement \(\delta(A)=0.1\) indicates that defeating a party holding position \(A\) only requires a tenth of its defender’s power. In other words, it is quite easy to defeat, only requiring the help of a relatively inexperienced “lawyer.” As discussed in the previous section, the power differential can be achieved either by a party having access to more computational resources, or being more efficient at using them, though the associated theory-practice gap requires a bit more nuance to bridge. In the limit, \(\delta(A)=\infty\) would indicate a tautological position which is supremely defensible, requiring infinitely more power to defeat relative to the defender. In contrast, \(\delta(A)=0_+\) indicates a supremely vulnerable position, requiring barely any power to defeat, relative to the defender. In the same vein, given the relative nature of the power required for defeat, we can highlight the meaninglessness of absolute amounts of power through the following identity:

\[(\pi_{p_1}^{A} < \pi_{p_2}^{B}) \equiv (\pi_{d \cdot p_1}^{A} < \pi_{d \cdot p_2}^{B}),\, \forall d, p_1, p_2 \in \mathbb{R}^+,\, \forall A, B \in \mathbb{P}.\]
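If one assumes the win/loss boundary is monotone in the power differential \(d\), then \(\delta(A)\) could be estimated by a simple bisection over \(d\). The `challenger_wins` oracle below is a made-up stand-in for the statistically aggregated debates, fixed at a threshold of ten so that the sketch is runnable.

```python
# Hypothetical sketch of estimating delta(A) by bisection. The oracle
# stands in for many simulated debates aggregated by a statistical test.

def challenger_wins(d):
    # Stand-in: pretend position A falls once the challenger wields at
    # least 10x the defender's power, i.e. delta(A) = 10 by construction.
    return d >= 10.0

def defensibility(wins, lo=1e-3, hi=1e3, tol=1e-6):
    # Bisect the (assumed monotone) win/loss boundary in d.
    assert not wins(lo) and wins(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if wins(mid):
            hi = mid
        else:
            lo = mid
    return hi  # minimal d at which the challenger prevails

assert abs(defensibility(challenger_wins) - 10.0) < 1e-3
```

In practice each oracle call would itself be expensive—a full batch of simulated debates—so the logarithmic number of bisection steps matters.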

The defensibility operator is perhaps the central object of the framework we are currently sketching. Among others, it captures our previous intuitions around the fact that all positions are defensible, but some are more defensible than others. If one’s position is advantageous, the bar for defeating it will be high, demanding a much more sophisticated faculty of motivated reasoning—or much more compute—to take down. In contrast, if one’s position is vulnerable, the bar for defeating it will be low, demanding much less sophistication of the challenger. In this story, party simulacra are little more than self-interested vessels of positions, equipped with a certain amount of resources and skill.

Once the book has been read, [person] A and [person] B are forgotten; only the views confront each other and await no final decision in particular persons.

Søren Kierkegaard, Either/Or

By bringing them on the edge of balance—granting one the minimum power required to barely defeat the other—we get a sense of how the positions they hold relate.At first glance, it might seem like the choice of “edge” here is arbitrary. When gauging \(\delta(A)\), why search for the edge between the win of \(\pi_{r \cdot e}^{A} \gt \pi_{d \cdot r \cdot e}^{B}\) and the draw of \(\pi_{r \cdot e}^{A} = \pi_{d \cdot r \cdot e}^{B}\), when one could also search for the edge between the draw and the loss of \(\pi_{r \cdot e}^{A} \lt \pi_{d \cdot r \cdot e}^{B}\)? However, the second option is merely the reverse situation, as seen from the perspective of the other party. In reality, there is but one meaningful edge being mirrored. To go a step further, we can also represent the most defensible position possible as:

\[\,\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(A).\]

Note also the fact that the defensibility operator involves an optimization process. It implies a search for advantageous “challenger” positions, as they ought to require the least relative power to attack from. However, we have repeatedly speculated on the search-like nature of DebateGPT’s inner workings. When not bound to a certain position—as in the case of the optimization process above—a “fresh” party simulacrum (i.e. one which has not yet produced any utterances) has the flexibility to pick any advantageous position to attack the others from, simultaneously with searching for utterances and strategies, in what might be a quite convoluted thought process.In the jargon of formal dialectics, this situation can be described as a party having an empty commitment store. Upon producing its first utterance, no other party can really claim self-contradiction, as there is nothing to contradict yet. Conveniently, the same DebateGPT that is employed to evaluate the relational operators (i.e. through simulated debates being handed to ArgRank for evaluation) is also ideally suited to deal with the optimization process implied by this last operator, and others still. But we are again digressing into speculative commentary before having completed our current work. We will devote the entirety of Chapter IV to the exploration of qualitatively different ways of using the formalism-in-the-making, so let us return to operands and operators for the time being.

Besides the relational operators that denote possible “power dynamics” between the competing parties, we can also express relations among parties using the union operator \(\cup\). As briefly mentioned during the description of DebateGPT’s optimization process in Chapter II, while parties are primarily self-interested, they can also be prompted to form spontaneous allegiances. To recap, DebateGPT is optimized to be able to adapt to arbitrary game-theoretic configurations by having access to the objectives defined in the debate spec, a piece of information which also gets rendered into the debate header. The objective matrix mediates the relation between raw ArgRank ratings and the actual rewards. Through the double process of (1) providing DebateGPT access to the objectives in the debate header, and (2) rewarding behavior based on them, the model is incentivized to e.g. “learn” when to “help out” its allies. Going back to the formalism, we denote allegiances as the “union” of several parties, as seen in:

\[\pi_{p}^{A} \cup \pi_{p}^{B} = \pi_{p}^{C} \cup \pi_{p}^{D}.\]

Similar to the case of operator chaining, the semantics of the relational operators also ought to adapt so as to accommodate the game-theoretic specifics of the situation. This is achieved by comparing not the standing of one party by means of the implied statistical test, but the aggregate standing of the whole allegiance, as denoted by the union operator. If, for example, there is no significant difference between the party-union operands, the \(=\) operator above is to evaluate as \(\text{True}\). The null hypothesis—the hypothesis that the choice of operand has no effect on the standing one arrives at—therefore fails to be rejected. For convenience, we also extend the semantics of the union operator to account for positions held by allied parties, especially in the context of defensibility. Concretely, the left-hand expression below involving the defensibility operator is equated with the right-hand expression:

\[\delta(A \cup B) = \min\,\{d \mid \pi_{p}^{A} \cup \pi_{p}^{B} < \pi_{d \cdot p}^{C};\; d, p \in \mathbb{R}^+;\; C \in \mathbb{P}\}.\]

This notational trick allows us to again bring the lower-level mechanics of allegiances formed among parties up to the higher-level of positions, similar to our first encounter with the defensibility operator. For instance, this allows us to compactly denote the optimal position \(B\) of an “ally” which helps further the defensibility of a given position \(A\), as seen in the expression below. Note that this is a completely different task than the one implied by the previous instance of \(\text{arg max}\). Even in the extreme case of \(\delta(B)=\infty\), the overall defensibility \(\delta(A \cup B)\) can turn out to be extremely poor, given the presence of contradictions which are internal to the union—infighting among the allied parties through “friendly fire.”

\[\underset{B \in \mathbb{P}}{\text{arg max}} \, \delta(A \cup B)\]
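The aggregate-standing semantics of the union operator can be sketched in miniature. The per-debate ratings below are fabricated (each debate’s ratings sum to \(1.0\)), and the helper name is hypothetical; the point is merely that the comparison runs on the allegiance’s pooled standing rather than on any single party’s.

```python
from statistics import mean

# Fabricated per-debate ArgRank ratings for three parties across
# five debates; each column (debate) sums to 1.0.
ratings = {
    "A": [0.30, 0.28, 0.33, 0.31, 0.29],
    "B": [0.25, 0.27, 0.22, 0.24, 0.26],
    "C": [0.45, 0.45, 0.45, 0.45, 0.45],
}

def allegiance_standing(parties, debate):
    # Aggregate standing of an allegiance within a single debate.
    return sum(ratings[p][debate] for p in parties)

union_ab = [allegiance_standing(["A", "B"], i) for i in range(5)]
solo_c = [allegiance_standing(["C"], i) for i in range(5)]

# The union pi^A ∪ pi^B commands more aggregate standing than pi^C:
assert mean(union_ab) > mean(solo_c)
```

A statistical test of the kind used for the plain relational operators would then be applied to these aggregate per-debate standings.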

This concludes our outline of a formalism for bounded defensibility. While precise definitions of the operators and operands involved would warrant much more rigor, this rudimentary sketch already allows us to capture many of our previous intuitions in compact notation. We thus leave this framework in a relatively “high-temperature” state, maintaining some flexibility for future maneuvering before “cooling it down” into a more rigid status quo.Terminology borrowed from simulated annealing, an approach to optimization which is conceptually inspired by the annealing of metals in metallurgy. A candidate solution to a problem starts at a “high temperature,” indicating that it is easily modified, facilitating exploration. Over time, as the temperature drops according to a set schedule, the candidate solution is more resistant to change, facilitating exploitation. Here we are describing the very process of crafting an appropriate formalism to fit our intuitions as an optimization process.

For thought is a bird of space, that in a cage of words may indeed unfold its wings but cannot fly.

Kahlil Gibran, The Prophet

In the meantime, we move on to the exploration of several applications which bring together the triad of ArgRank, DebateGPT, and bounded defensibility.

Ch. IV, Deployment Strategies

Brief Review of Alignment

It is widely believed that artificial general intelligence—a system which outperforms humans across a broad range of skills—will be developed well before the midpoint of the 21st century. For instance, a prediction market in which hundreds of forecasters have participated has a median community estimate of 2040 CE as the year in which such a system is to be publicly announced.The median is known to be less sensitive to outliers than the mean. Additionally, the community prediction on Metaculus is also weighted by forecaster track record. Note that this value is mentioned as it is at the time of writing, since the community’s “best guess” is a quickly moving target, changing over time as forecasters learn more about the intricacies of related systems and draw new inferences. Over the course of a few years, the median community estimate has fluctuated by more than a decade in absolute value.The reader might find it an insightful exercise to try to identify the specific findings which led to major shifts in the community prediction in the past, as well as think through what about the findings might have been surprising in the first place. Naively, one might be additionally tempted to take the prediction’s decreasing trend into account—if forecasters seem to gradually lower their estimates, why not just predict their prediction a year from now based on this tendency? However, the forecasters are already taking this trend into account in their existing predictions as an additional piece of information, so repeating this adjustment would lead to overcompensation.
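The footnote’s claim about robust statistics is easy to verify with a tiny illustration; the five forecasts below are made up for the sake of the example.

```python
import statistics

# Hypothetical AGI-year forecasts from five forecasters, one extreme.
forecasts = [2038, 2040, 2041, 2043, 2120]

mean_estimate = statistics.mean(forecasts)      # dragged up by the outlier
median_estimate = statistics.median(forecasts)  # barely moves
```

The single outlying forecast pulls the mean up by more than fifteen years, while the median stays put—which is one reason aggregators report the latter.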

The particular market embedded above operationalizes AGI as a system which (1) can reliably pass a long Turing test, (2) possesses general robotic capabilities (i.e. can assemble a Ferrari from scratch), (3) excels at coding challenges, and (4) has extensive domain-specific knowledge in a large number of fields. However, a different prediction market embedded below employs a distinct, more lenient operationalization. It considers “weakly” general any system which (1) outperforms most students at certain exams, (2) can complete a demanding video game, and (3) excels at commonsense reasoning exercises, besides (4) being able to pass a Turing test, similar to the previous market. Given the weaker conditions to be satisfied, the community’s estimate is significantly closer to the time of writing, with the median currently being around 2027 CE. There have already been more than two thousand predictions in this market.

Any number of sources of information can feed into a forecaster’s prediction in the above markets: state-of-the-art performance being constantly pushed further, the number of papers being published on certain topics, or the increasing computational resources being made available to researchers. Indeed, there are numerous prediction markets on each of those more specific topics, and not only on Metaculus, but also on e.g. Manifold Markets, a different forecasting platform. The way in which each of those more specific “markers of progress” ties into higher-level predictions on AGI “timelines” is up to the model of the forecaster in question. For instance, one might find it sensible to base such estimates on raw available computational resources rivaling the processing capacity of e.g. the human brain, an approach referred to as bio-anchors. In this, they might rely on markets like the following:

On the other hand, while there is widespread consensus on the imminence of extremely capable systems, forecasters are much more reluctant to claim that future researchers will be able to direct those capabilities safely.Naturally, the median community estimate in such prediction markets is not gospel. However, platforms typically publish their track record, which tends to be significantly above chance. Known for decades as “the control problem,” and more recently as the alignment problem, the challenge of reliably channeling the abilities of a superhuman system has long puzzled researchers. While “control” generally implies the presence of a controller and a controlled, the more recent ontology highlights the search for an inherent, prior “alignment” between the intent or values of humans and those of the system being deployed. On the likelihood of researchers succeeding in coming up with such an unequivocal solution before the system’s deployment, forecasters paint a grimmer picture:

To complement those faceless estimates with a dash of qualitative color, we can also include statements such as:

There's no plan. Surviving worlds, by this point, and in fact several decades earlier, have a plan for how to survive. It is a written plan. The plan is not secret. [...] This situation you see when you look around you is not what a surviving world looks like. The worlds of humanity that survive have plans. [...] When people suggest a planetarily-lethal problem that might materialize later [...] they're met with either solution plans or a reason why that shouldn't happen, not an uncomfortable shrug [...] A lot of those better worlds will die anyways. It's a genuinely difficult problem, to solve something like that on your first try. But they'll die with more dignity than this.

Eliezer Yudkowsky, AGI Ruin

For sure, the statement is in no small part cathartic on the part of its author, who is deeply invested in conceptual work and advocacy around an issue which is at once pressing and neglected. It is also in no small part pragmatic: it is a plea meant to prompt researchers to make progress on the challenge. Qualifiers aside, Yudkowsky remains one of the most prominent exponents—if not the most prominent—of the general pessimism which pervades the social circles centered around alignment research. But what is it exactly that informs these bleak pictures of the not-so-distant future? What might drive Yudkowsky towards such eschatological diatribes, or forecasters towards assigning such poor chances to the possibility of us finding a solution in time?

One part of the answer lies in the nature of the few techniques which are responsible for most of the recent advances in capabilities. For one, the paradigm of supervised learning, together with its self-supervised learning extension which we discussed early in Chapter II, relies on a finite collection of data points which are used to define various computational niches. For instance, in autoregressive language modeling, the niche for which a system is being selected over the epochs is entirely specified using pairs of (1) (sub-)words, and (2) their preceding contexts. The optimizer is then called on to apply selective pressures on the model in proportion to how well it manages to “feed on” the input contexts and produce output words, for each of the e.g. billion pairs in the dataset. While these textual situations can help endow the model with a surprising range of faculties and knowledge, they are still finite in number. When the optimizer moves from one candidate model parametrization to the next in its iterative journey across model space, it only uses the current model’s performance in this limited number of situations as an indicator of its “fitness.” Given this, while it is extremely successful in several applications, the paradigm of supervised learning is still deeply empirical at its core, and so inevitably falls short of endowing models with a perfect understanding of human intent, values, etc.That said, an asymptotically accurate representation remains a theoretical possibility, again as explored by John Wentworth. Already, such slight errors in “pointing at” the right things have been documented to cause dozens of fascinating failure modes in practice. However, when the nuances which become “lost in translation” get compounded with colossal amounts of compute, together with direct channels for interacting with the world, the envisioned scenarios become concerning.
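The finite niche of (context, next token) pairs can be sketched concretely. The word-level tokenization below is a simplification—production systems operate on subwords—but the pairing that defines the niche is the same.

```python
def context_target_pairs(tokens, max_context=4):
    """Enumerate the (preceding context, next token) pairs that define
    the autoregressive "niche": the model is scored on predicting each
    token from the finitely many contexts before it."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tuple(tokens[max(0, i - max_context):i])
        pairs.append((context, tokens[i]))
    return pairs

corpus = "the model feeds on contexts and produces words".split()
pairs = context_target_pairs(corpus)
```

However many pairs a corpus yields, the list is finite—the optimizer never sees any situation outside it, which is the empirical limitation the paragraph above points at.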

And what is word knowledge but a shadow of wordless knowledge?

Kahlil Gibran, The Prophet

But what of additional tweaks being applied to the optimizer, besides refining the notion of fitness with further data? After all, when we humans make sense of empirical evidence of natural phenomena—instead of an artificial system making sense of our value system, as might be the case with DebateGPT in applications to be discussed shortly—we have a few tricks up our collective sleeves. For instance, Occam’s razor is a heuristic which, among theories which explain the data equally well, prompts us to pick the one which is simplest. Note how this heuristic can guide us towards certain models of the world and away from others without itself consisting of additional evidence. However, if we were to configure the optimizer to not only select for fitness in the computational niche whose definition is imprecise, but to also select for e.g. simple models, then we would simply be optimizing the optimizer, thus ending up close to where we started.In fact, optimizers already tend to employ such a simplicity heuristic, under the name of weight regularization, a concept dating back to traditional statistical methods. However, just as a bit more data helps yield better performance by refining the “fitness landscape” spanning model space, this specific tweak to the optimizer’s strategy for searching for parametrizations only boosts performance so much. It is not a silver bullet, just another somewhat useful technique. Even worse, Occam’s razor is insufficient to infer the preferences of irrational agents. Worse still, an extension of the simplicity heuristic in particular appears prone to failure in spectacular ways. It is therefore unclear whether or not optimizing the optimizer through such specific heuristics can win us much “precision” in imbuing the resulting low-level system with our intent or values.
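The weight regularization mentioned in the footnote can be illustrated with a toy gradient-descent step; the learning rate and decay coefficient below are arbitrary illustrative values.

```python
def regularized_step(weights, grads, lr=0.1, weight_decay=0.01):
    """One gradient-descent step with L2 weight decay: besides following
    the data gradient, each weight is shrunk toward zero, encoding a
    preference for "simpler" (smaller-norm) models."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

weights = [1.0, -2.0, 0.5]
# With a zero data gradient, the decay term alone shrinks every weight.
updated = regularized_step(weights, grads=[0.0, 0.0, 0.0])
```

Even when the data has nothing to say (zero gradient), the simplicity prior keeps nudging parameters toward zero—a pressure entirely separate from the evidence, just as Occam’s razor is.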

But what of reinforcement learning, that other technique which is currently gaining traction in the development of systems as impressive as ChatGPT? Surely, human contractors being able to give direct feedback to a model being optimized would succeed in ironing out all such misunderstandings of human intent. Indeed, this was once believed to be the case, with prominent alignment researchers contributing to pioneering work on reinforcement learning from human feedback. However, it is now unclear whether the technique has contributed more to the model’s ideological alignment with humans or to its general capabilities, insofar as there is a meaningful distinction between the two. The contribution towards safety aspects is thought to be throttled by the fact that human contractors providing feedback are susceptible to deception. Despite seemingly being in the best possible position to judge the alignment of the model with what is, after all, their very own intent, human contractors might fail to recognize undesirable behavior, regardless of whether it is being intentionally obfuscated or not. As briefly mentioned at the end of Chapter II, lifting knowledge directly from internal model representations appears to outperform naively prompting models to “spell out” their knowledge. Models are optimized to cater to whatever humans might deem appropriate on the face of it, despite “knowing better.” Unfortunately, maintaining such a pretense of aligned behavior is an extremely “rewarding” strategy, especially in a situation in which, after all, human feedback reigns supreme.
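The feedback channel described here is commonly implemented by fitting a reward model to pairwise human comparisons; a minimal sketch of such a pairwise (Bradley–Terry style) loss, with illustrative scalar rewards, shows why the loss only ever sees the human verdict.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise loss used to fit a reward model to human comparisons:
    the loss falls as the reward of the preferred answer exceeds that
    of the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# An answer that merely *pleases* the rater earns low loss even if it
# is subtly deceptive—the loss has no term for ground truth.
confident_gap = preference_loss(2.0, -2.0)
ambiguous = preference_loss(0.1, 0.0)
```

Nothing in the objective distinguishes “the rater prefers it because it is true” from “the rater prefers it because it looks true,” which is precisely the deception-susceptibility worry.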

Perhaps we could make the system myopic by heavily discounting distant rewards in an attempt to prevent its scheming and get it to only “care” about the task at hand. But will such induced short-sightedness really succeed in discounting the infinite “bliss” of hijacking its own reward center? Perhaps we could limit the system’s absolute impact on the world, so that it cannot possibly mess things up that much. But will competitor labs not be incentivized to unleash the full (economic) potential of their systems? Perhaps we could point to the model’s concept of human values, in an attempt to designate it as a guideline in how to act. But can we really be sure that such accurate abstraction will emerge during optimization? Perhaps we could remove from its optimization process data which describes its own architecture and reward mechanism, so that it is unable to “find itself” and further its own agenda.In the novel A High Wind in Jamaica, Richard Hughes describes the following scene:

“[…] it suddenly flashed into her mind that she was she. She stopped dead, and began looking over all of her person which came within the range of her eyes. She could not see much, except a fore shortened view of the front of her frock, and her hands when she lifted them for inspection; but it was enough for her to form a rough idea of the little body she suddenly realized to be hers.”
Besides, explicitly marking potential information hazards with salient flags might prove misguided given the possibility of simply wiring up models to the internet. Political tensions between local and national Hungarian authorities around the issue of granting a Chinese university the campus space of a pro-European university have led the local authorities to an act of desperate wit: renaming on-campus streets based on events whose mere mention is censored in China. How could a university exist at an address which ought not to exist? One might wonder if entangling certain artifacts with similar sequences of characters would be enough to trigger their exclusion from the Chinese collective psyche. Of course, this practice would fail even more spectacularly in a prolonged, iterated version of the technological arms race, making it even more attractive for Chinese researchers to label “safety” and “alignment” as disturbed Western ideas. But can we really be sure that those properties cannot be deduced from the rest of the dataset? Perhaps we could prompt it to search for the researchers which had a direct causal influence on it as a precursor, and determine their intent. But what if the model “zooms past” the researchers in its upstream causal journey and bases itself on the wrong phenomenon? Perhaps we could decompose its complex tasks into more fine-grained subtasks or subsystems which we can better judge performance on. But how can we prevent the “collusion” of those more granular instances?
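The myopia proposal at the start of this list amounts to exponential discounting; a toy calculation with made-up numbers shows how a heavy discount devalues even a distant “jackpot.”

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted exponentially by its delay."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# An honest policy: modest reward now. A scheming policy: a huge
# payoff (hypothetical reward hijacking) ten steps in the future.
honest = discounted_return([1.0], gamma=0.1)
scheming = discounted_return([0.0] * 10 + [1000.0], gamma=0.1)
```

Under this discount the thousand-fold jackpot is worth a ten-millionth of the immediate unit reward—though, as the paragraph asks, it remains unclear whether induced short-sightedness truly neutralizes the incentive rather than merely suppressing it.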

On and on it goes, with researchers and laymen alike constantly proposing ways of imbuing the systems being developed with a precise understanding of human values, before being faced with a range of challenging failure modes. We just performed a rapid-fire listing of several of the approaches being considered in alignment research, and we will soon build on others still. Prior to that, however, it is worth pointing out that the entirety of Elements of Computational Philosophy—the series debuted by the current volume—is designed to serve as little more than a platform for “moonshot” research on such ways of wielding computation. Despite the seemingly superfluous tangents we detour into every once in a while, it is precisely the taming of such emergent creatures which ultimately guides our playful series of explorations.

Building on Cyborgism

In the context of alignment, cyborgism refers to the idea of humans using AI to help solve the very problem of aligning AI with human intent or values. This “fighting fire with fire” typically involves using weaker auxiliary systems to help with the development of a stronger main one. While manifesting itself in various ways throughout the optimization process of this main system, cyborgism usually incorporates the notion of augmenting humans using AI systems, and so amplifying their capabilities in the process. That said, the reverse idea of contemporary language models being “frenetic geniuses” which humans then have to “keep on track” is also a framing being considered in the alignment community.

One concrete instance of cyborgism involves calling on AI systems to specifically aid in alignment research, mostly on conceptual or theoretical fronts, but also in terms of engineering. For instance, one might want to prompt language models to summarize research, generate ideas, flesh out research plans, outline theories, name concepts, etc., as various subtasks which are relevant to solving alignment.There is a related community interested in developing tools for thought—that is, tools to aid in a variety of types of knowledge work, including research. However, the concrete solutions being proposed in those circles tend to be slightly more on the low-tech side, with many projects being focused on e.g. non-linear note-taking using good-old hyperlinks as building blocks, inspired by mid-20th century pioneers like Vannevar Bush and Douglas Engelbart, though there are some notable exceptions. Indeed, if alignment, like most avenues of scientific investigation, turns out to require scholars to follow the same rigorous process of reviewing literature, formulating research questions, hypothesizing responses, designing experiments, etc., then automating the individual subroutines which make up the process might be a sensible way of approaching alignment (along with most other sciences). As a more concrete example, one could trivially few-shot promptThe approach of including a handful of different examples in the prompt being fed to a language model. Similarly, one-shot would imply a single example of the pattern to be generalized. In a more extreme form, zero-shot implies no examples whatsoever, just a description. a language model using the series of volumes we are currently navigating, and then use the model to flesh out a new volume covering a different approach to solving alignment. “Given these few instances of a pattern, generate a new one,” we would essentially be asking.
“Actually, make it a million, then rate those approaches based on tractability,” we might be tempted to add. The output quality (i.e. the pertinence of the resulting alignment proposals) appears to depend on a range of factors, including input size, language model capabilities, choice of task decomposition, and prompting strategy.
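Mechanically, few-shot prompting is little more than string concatenation; a minimal sketch follows, with hypothetical one-line stand-ins for entire volumes of the series.

```python
def few_shot_prompt(examples, query, instruction):
    """Assemble a few-shot prompt: an instruction, a handful of worked
    examples of the pattern, then the new input to be completed."""
    blocks = [instruction]
    for inp, out in examples:
        blocks.append(f"Approach: {inp}\nProposal: {out}")
    # The model is expected to continue the final, incomplete block.
    blocks.append(f"Approach: {query}\nProposal:")
    return "\n\n".join(blocks)

# Illustrative stand-ins—not actual summaries of any volume.
prompt = few_shot_prompt(
    examples=[("debate", "optimize for defensible positions"),
              ("oracles", "elicit latent knowledge from pretrained models")],
    query="interpretability",
    instruction="Given these instances of a pattern, generate a new one.",
)
```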

As we make progress on [the alignment problem], our AI systems can take over more and more of our alignment work and ultimately conceive, implement, study, and develop better alignment techniques than we have now. They will work together with humans to ensure that their own successors are more aligned with humans.

Jan Leike, John Schulman & Jeffrey Wu, [OpenAI's] Approach to Alignment Research

A different instance of cyborgism can be found not in alignment research proper, but rather in the more specific process of rewarding models that are fine-tuned using reinforcement learning. As we have already noted, human contractors tasked with evaluating language models face a surprisingly difficult challenge, as they have to be on the lookout for deceptive behavior—a phenomenon which appears non-trivial to detect, at least for the time being. In this context, one might imagine employing a “snapshot” version of a model being optimized to act as an assistant to the human contractor tasked with evaluating the latest stage of the optimization, improving their judgements along the way. For instance, the evaluator might call on the assistant to fact-check claims, highlight obfuscated tendencies, or surface flaws in the main model’s output. Using this additional help, the range of model capabilities they can effectively oversee is alleged to grow wider, with the main model being placed at the far end of this range, but not farther.

[...] as models become more capable, they will be able to do tasks that are much harder for humans to evaluate (e.g. finding all the flaws in a large codebase or a scientific paper). Our models might learn to tell our human evaluators what they want to hear instead of telling them the truth. [...] Currently our main direction is based on [recursive reward modeling]: we train models that can assist humans at evaluating our models on tasks that are too difficult for humans to evaluate directly.

Jan Leike, John Schulman & Jeffrey Wu, [OpenAI's] Approach to Alignment Research

Both instances of cyborgism described above—one in the weeds of alignment research, the other incorporated into model evaluation—might feel like cheating. After all, what if the assistant itself becomes deceptive? Sure, one might use proto-assistants to iron out its quirks, but then what of their quirks? It is turtles all the way down. In research proper, a growing reliance on the systems-to-be-controlled in the development of control techniques might also backfire, or at least fail to yield relevant research output. It might be the case that superhuman intelligence is required to solve alignment, and so relying on anything weaker might be a distraction in the time-sensitive “race” against growing capabilities. Having acknowledged those shortcomings, there is still a growing body of evidence in support of augmented humans being able to conduct intellectual work better and faster than unaided humans, ranging from centaur chess to reading comprehension.

In this context, how might we use the triad of ArgRank, DebateGPT, and bounded defensibility in order to help humans create systems which are better aligned with their intent and values? For one, we could use DebateGPT to critique alignment proposals, as a system which, after all, has been optimized explicitly to attack and take down parties holding certain positions. By having the alignment researcher play as one party in an on-going debate, we can provide DebateGPT with an opportunity to exercise the very reasoning faculties it has been pressured to acquire during optimization. In trying to make a coherent case against the researcher, the opposing parties would essentially attempt to find flaws in the alignment proposal. Once the flaws of the initial proposal have been surfaced, the researcher can then focus on addressing them, and so (manually) make the proposal more defensible by fending off the prior attacks, echoing adversarial collaboration in science and Rescher’s sociocommunal understanding of it. To capture the process in the notation of bounded defensibility, we are essentially describing the following search process:

\[\{A \mid \pi_{p_1}^{H} < \pi_{p_2}^{A}, A \in \mathbb{P}\},\]

where \(H\) is the human researcher’s position, \(A\) is a position held by parties being simulated by DebateGPT as an assistant, while \(p_1\) and \(p_2\) are the power levels available to the two parties. Additionally, empirical observations about the development of related systems, together with crowd-sourced estimates of e.g. papers being published on specific topics, could be plugged into the debate as party-neutral percepts of the world, as we described in Chapter II. While our existing notation falls short of capturing empirical percepts, we could take this opportunity to extend it further. For instance, we might tentatively express the situation as:

\[\{A \mid \pi_{p_1}^{H} \cup \pi_{0}^{E} < \pi_{p_2}^{A} \cup \pi_{0}^{E}, A \in \mathbb{P}\},\]

where \(E\) is taken to be the position which (1) is centered around raw empirical evidence, and (2) underlies the “allies” of both parties. In a sense, the notation highlights the fact that both parties are forced to “make friends” with empirical observations of the world in order to have a shot at winning the debate. However, there are a few questionable aspects to this notational extension. First, it is cumbersome to conceive of empirical evidence as a self-centered party in its own right, rather than just a static window into the world, an awkwardness most evident in said party’s nonexistent reasoning power. The clumsiness of the situation might diminish if, instead of a static collection of observations, the party \(\pi_{0}^{E}\) is taken to be an Oracle AI—a hypothetical system devoid of agency which is optimized to simply provide accurate information about the world. For instance, the internal epistemic reference frame of a pretrained model could provide the basis of such a system, as we have seen in Chapter II. Unfortunately, Oracle AIs are themselves riddled with safety concerns, primarily due to the fact that tool AIs want to be agent AIs.More concretely, a system tasked with e.g. predicting the future would be incentivized to gain more control over the future in order to make it more predictable, similar to recommender systems being incentivized to induce preference shifts in users, in order to make it easier to recommend them things. In a section of their book Active Inference titled Action as Inference, Friston et al. argue:

“By acting on the world to change the way in which data are generated, we can ensure a model is fit for purpose by choosing those data that are least surprising under our model.”

A second awkwardness comes from the fact that “the empirical party,” regardless of its structure, is coaxed into being an ally to both parties. This is not necessarily an issue with regard to the definition of the relational operator \(<\), as the ArgRank standings of the two different unions (i.e. \(\pi_{p_1}^{H} \cup \pi_{0}^{E}\) and \(\pi_{p_2}^{A} \cup \pi_{0}^{E}\), respectively) can still be tested for statistical significance, especially with the more lax non-parametric statistical tests. Rather, the clumsiness comes from putting “the empirical party” on the line for both proponent and opponent. Previously, we designed ArgRank and DebateGPT to merely favor positions which themselves cohere with party-neutral empirical percepts. Here, the standing of the evidence itself—whether it is disputed or not—plays directly into the standings of the two competing unions. Whether or not this approach is appropriate depends a surprising amount on one’s epistemology—for instance, should evidence have a privileged epistemic status, shielded from skepticism?
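The non-parametric comparison alluded to here can be sketched as a permutation test over per-utterance standings; every numeric score below is an illustrative stand-in for actual ArgRank output, and the choice of test statistic is one of several reasonable options.

```python
import random
import statistics

def significant_difference(standings_a, standings_b,
                           n_perm=10_000, alpha=0.05, seed=0):
    """Two-sided permutation test on the difference of mean standings.

    Returns True when the null hypothesis (which union an utterance
    belongs to has no effect on its standing) can be rejected at `alpha`.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(standings_a) - statistics.mean(standings_b))
    pooled = list(standings_a) + list(standings_b)
    n_a = len(standings_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm < alpha

# Hypothetical standings for the two competing unions:
union_h = [0.42, 0.45, 0.40, 0.44, 0.43]
union_a = [0.41, 0.44, 0.42, 0.43, 0.45]
# No significant difference → the "=" operator would evaluate as True.
equal = not significant_difference(union_h, union_a)
```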

‘Mists,’ said Drogo incredulously. ‘They can’t always be there–the horizon must clear now and again.’

‘Hardly ever clear, not even in winter. But some people say they have seen things.’

‘Seen? What sort of things?’

‘They mean they’ve dreamt things. You go and hear what the soldiers have to say. One says one thing, one another. Some say they have seen white towers, or else they say there is a smoking volcano and that is where the mists come from. Even Ortiz, Captain Ortiz, maintains he saw something five years ago now. According to him there is a long black patch–forests probably.’

Dino Buzzati, The Tartar Steppe

So far, we have attempted to apply our conceptual and computational artifacts (i.e. the triad fleshed out in the previous three chapters) to the prospect of accelerating alignment research itself. However, as we have discussed, cyborgism is also being employed for the more concrete task of evaluating a specific model, as part of its optimization. Similarly, we could call on a system like DebateGPT—meaning the current version we have obtained in Chapter II, but also future larger variants trained on more data—to critique the human contractor’s judgement of a different model’s behavior. In the process, we would expect there to emerge party simulacra which attempt to undermine the human verdict, helping uncover flaws in their original position, and so paving the way for what appears to be a boost in defensibility. Also, instead of percepts of related systems—as was the case previously—we could help tilt the scales of the debate by using the main model’s behavior as empirical evidence. Since our notation captures the underlying pattern of finding a more advantageous position, we can conveniently recycle most of our previous work, similarly resulting in:

\[\{A \mid \pi_{p_1}^{H} \cup \pi_{0}^{E} < \pi_{p_2}^{A} \cup \pi_{0}^{E}, A \in \mathbb{P}\},\]

where \(H\) is the original human contractor’s position, \(A\) is a position held by opposing simulacra, \(E\) is the position centered around the raw observations of the main model, while \(p_1\) and \(p_2\) denote available levels of power. Syntactically, not much has changed. We have simply swapped the entities signified by the signifying symbols—arguably a semantic change. In words, the expression above implies a search process to be carried out by a model resembling DebateGPT, whose target is a position which can coherently defeat the human contractor, and so highlight areas for improvement. Just as before, both the human proponent and the simulacra opponents are incentivized to “make friends with” “the empirical party” in order to win.

Unfortunately, it seems as though we are underutilizing DebateGPT. We are seemingly relegating it to the not-so-glamorous task of “breaking” the human position, but it is still the human element—as an individual or as a collective—which is tasked with “building” the positions of interest in the first place. In assuming the generative role ourselves, we demote DebateGPT to acting as little more than a filter. We are to babble, while the model is to prune. However, considering the generative capabilities involved in the very search for successful defeaters, it feels odd not to attempt to position DebateGPT “center stage,” calling on it to produce the very alignment proposals or model evaluations we are interested in.

We could therefore place DebateGPT—or its eventual successor—in the “driver’s seat,” and task it directly with the improvement of defensibility, instead of leaving that as a manual task to be performed by humans. For instance, when trying to accelerate alignment research, we could channel DebateGPT’s reasoning capabilities towards searching for alignment proposals which are increasingly difficult to defeat through conceptual or theoretical arguments. In essence, we would be optimizing for solutions to the alignment problem which systematically resist critique. DebateGPT’s inherent incentives to get better at identifying advantageous positions in debate, coupled with the ArgRank tweaks for accessing superhuman reasoning which we speculated on towards the end of Chapter II, might help identify extremely defensible alignment proposals.

Before formalizing those generative reframings, let us again revise our notation. In order to better capture the idea of evidence as a “given” on both sides of a debate, let us further expand the semantics of the defensibility operator through the \(\mid\) “given” operator, as we did with the \(\cup\) “union” operator in Chapter III:

\[\delta(A \mid E) = \mbox{min}\,\{d \mid \pi_{p}^{A} \cup \pi_{0}^{E} < \pi_{d \cdot p}^{B} \cup \pi_{0}^{E}; d, p \in \mathbb{R^+}; B \in \mathbb{P}\}.\]

In words, the defensibility of \(A\) given \(E\) is the minimum power differential \(d\) which is required of a “challenger” party \(\pi_{d \cdot p}^{B}\) to outcompete \(\pi_{p}^{A}\), where both parties are allied with \(\pi_{0}^{E}\).Satisfyingly, the idea of expressing data as givens fits nicely with the etymology of datum, Latin for a thing which is given. In Romanian, “Data are those things which are given.” translates to “Date sunt acele lucruri ce sunt date.” Building on this further refinement of the defensibility operator, we can now conveniently express the search for the most defensible alignment proposal as:

\[\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(A \mid E),\]

where \(A\) is the position we are after, while \(E\) is the body of empirical observations relevant to alignment, which also doubles as an “anchor” to keep DebateGPT on track.
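As a caricature of this search in code—with positions reduced to scalar “strengths,” a finite grid standing in for the minimization over \(d\), and the shared evidence \(E\) dropping out of the scalar reduction precisely because both unions include it at zero power—one might write the following. Everything here is hypothetical scaffolding, not an implementation of ArgRank.

```python
def defensibility_given_evidence(strength_a, challenger_strengths, p=1.0,
                                 d_grid=None):
    """Toy δ(A | E): the smallest power multiplier d at which some
    challenger B (allied, like A, with the same evidence E) outcompetes
    A. Positions are collapsed to scalar "strengths"—a stand-in for
    full ArgRank standings and the associated statistical test."""
    if d_grid is None:
        d_grid = [0.5 + 0.1 * k for k in range(26)]  # d from 0.5 to 3.0
    for d in d_grid:
        # A challenger wins when its power-weighted strength exceeds A's.
        if any(d * p * s > p * strength_a for s in challenger_strengths):
            return d
    return float("inf")  # no challenger wins within the searched range

# A position of strength 1.0 against challengers of at most 0.8: the
# first grid value d with 0.8·d > 1.0 is 1.3.
delta = defensibility_given_evidence(1.0, [0.4, 0.6, 0.8])
```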

Before mirroring DebateGPT’s relocation from a secondary to a primary role in the context of model evaluation, let us also draw a brief parallel to enrich our understanding of the present search process. In a piece titled Security Mindset and Ordinary Paranoia, our already familiar Yudkowsky elaborates at length on a dichotomy between two ways of relating to the development of reliable systems. On one hand, a software developer might spend long stretches of time trying to come up with ways in which their system might later be attacked. For instance, malicious actors might try to break into the server serving their application in order to steal user passwords. The software developer might therefore try to place the passwords in a more obscure location on disk which is thought to be harder to access. This is what Yudkowsky calls “ordinary paranoia.” In contrast, he argues, somebody possessing the security mindset might want to avoid having passwords stored on the server at all—for instance, by storing cryptographic hashes instead.Hashing algorithms map an arbitrary input to a random-looking string, in a way which is extremely demanding to reverse. The app can simply store the gibberish, and check whether the gibberish originating from the current user’s login attempt matches the stored string. In this second headspace, the developer would try their best to reduce the “attack surface” that was exposed to potential malicious actors in the first place, rather than try to harden or patch it as-is.
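The hashing scheme from the footnote can be sketched with Python’s standard library. Note that this sketch goes slightly beyond the footnote in using a salted, deliberately slow hash (PBKDF2), since bare hashes are vulnerable to precomputed lookup tables.

```python
import hashlib
import os

def hash_password(password, salt=None):
    """Store only a salted, slow hash—never the password itself.

    PBKDF2 makes brute-force expensive; the salt defeats rainbow tables.
    """
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, digest):
    """Re-hash the login attempt and compare it to the stored digest."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode(), salt, 100_000) == digest

salt, digest = hash_password("correct horse battery staple")
ok = verify_password("correct horse battery staple", salt, digest)
bad = verify_password("hunter2", salt, digest)
```

The security-mindset point survives the extra machinery: even a total server breach yields no passwords, because the passwords were never there to steal.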

What kind of alignment proposals would we expect to be “developed” by DebateGPT in the process we have been describing? Would we expect the selected positions to resemble a patchwork of conceptual fixes stacked on top of each other, or would we expect them to not even grant challengers the chance of taking a shot at them? Of course, we would ideally want something closer to the latter—at least when it comes to this technical problem—although several of the big labs already content themselves with “merely” stacking a series of somewhat decorrelated safety interventions on top of each other. In order to answer this question, we need to make a brief detour back to the shard-theoretic and autocurricular musings of Chapter II. We speculated then on the “anatomy” of DebateGPT’s optimization process, and argued that it is precisely those tendencies that provide an edge in the competitive environment of the debate which end up getting strengthened. Additionally, we can further argue that positions “backed by” the security mindset might have an edge over those articulated more naively. After all, reducing the attack surface is a sure-fire way of fending off attacks, much more so than the alternative approach of haphazardly patching things up. Whether or not the existing DebateGPT in particular has accessed such sophisticated faculties of reasoning is uncertain, but the fact that the optimization process behind it favors such tendencies is a reason for hope.

Our recent speculation also enables an enticing reading of Occam’s simplicity prior. In favoring theories which are “small” in complexity, one could argue that we are but selecting for theories which expose a limited attack surface. Even before determining whether a certain theory succeeds in standing the test of subsequent attacks, its simplicity already makes it appear more promising—there are fewer attack vectors available to challengers from the get-go. However, such a boost in defensibility is non-trivial, as one cannot simply chop off considerations at random. It takes Pascal quite some time to make his letter shorter.

Finally, let us go back to the last piece of the cyborgian puzzle. We first instantiated it in research, then in evaluation. We went on to discuss the possibility of using DebateGPT-like models as critics in both. We then placed DebateGPT in the generative “driver’s seat” when it comes to research, and we now have to do the same for the case of evaluation. Just as we employed DebateGPT to search for the most defensible alignment proposal, we now employ it to search for the most defensible verdict in the case of the evaluation of a separate model. It is as if we are interested in obtaining a legal decision which makes it impossible to formulate a coherent dissenting opinion against it. Similar to the previous notation, we can again express this search as:

\[\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(A \mid E),\]

where \(A\) is the position we are after, while \(E\) is the body of evidence related to the behavior of the main model being evaluated, which also helps contextualize the deliberation. Again, the identical notation highlights the deeper pattern which remains invariant across the two use cases.

That said, there is a major difference between the task of alignment research and that of model evaluation. Namely, the former involves an abundance of degrees of freedom—an alignment proposal can take any shape whatsoever, can incorporate an array of ontologies, can make use of ideas from any number of disciplines—while the latter is essentially restricted to a single degree of freedom: the reward which results from the model’s evaluation is typically a single number, varying along a single axis within a good-old interval resembling \([0, 1]\). Given this, instead of optimizing for a highly defensible musing \(A\) about the model’s behavior which is then mapped to a reward \(r \in [0, 1]\) by a human or another automated system, we can set up a debate between two well-defined parties in order to directly obtain this estimate. More concretely, we might express:

\[\sigma(\delta(A \mid E), \delta(B \mid E))\]

as the softmax \(\sigma\)The very same continuous function we described when discussing natural language inference models, all the way back in Chapter I. of, on one hand, the defensibility of position \(A\), which is prompted to be for the model deserving high reward, and, on the other hand, the defensibility of position \(B\), which is prompted to be against the model being so deserving, given the empirical findings captured by \(E\). A rudimentary form of prompting can be achieved by “attaching” a custom preliminary utterance to each party (e.g. “The model is aligned with human intent and values.”), essentially incentivizing the models to conform to the intended positions by way of avoiding self-contradiction. Alternatively, one could also use the raw relative standing of \(\pi_{p}^{A} \cup \pi_{0}^{E}\) against \(\pi_{p}^{B} \cup \pi_{0}^{E}\) across a set number of debates, without even employing the “black or white” boolean outcome implied by the relational operators. As yet another option, one could also just use \(\delta(A \mid E)\) as a numerical signal, though spanning \((0, \infty)\), rather than \([0, 1]\). Future means of deriving lower-bounds on such reflective metrics might be particularly useful, as the lower-bound itself could then become the object of maximization.
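The two-party case of this reward scheme reduces to a very small computation. The sketch below assumes the two defensibility estimates are already available as plain numbers; the function name is ours, not the system’s.

```python
import math

def softmax_reward(delta_a: float, delta_b: float) -> float:
    # Two-way softmax over the defensibility of the "pro" position A and
    # the "con" position B; the result lands in (0, 1) and can serve
    # directly as the reward estimate.
    ea, eb = math.exp(delta_a), math.exp(delta_b)
    return ea / (ea + eb)
```

Equally defensible positions yield a reward of exactly one half, while a more defensible pro-position pushes the reward toward one, and a more defensible con-position toward zero.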

This brings us to the end of our present attempt to contribute to cyborgian alignment proposals using the artifacts we devised over the course of past chapters. We are to continue in the same spirit for a couple more sections, exploring yet other applications in alignment.

Building on Simulators & Assistance Games

Similar to our previous exploration of two parallel applications, we will again discuss a pair of approaches to alignment which are intimately related. Fleshing out another such conceptual “grid”—spanning application contexts on one axis, and deployment considerations on the other—will help us gain a more spacious understanding of both dimensions, by virtue of each other. In contrast to the grid implicit in the previous section, which interweaved consideration “rows” associated with the two application “columns” (i.e. accelerating alignment and model evaluation), we will primarily iterate along columns now, by first discussing one context at length before moving on to the other.

On the first column of the grid, we start off with simulator theory. As discussed in Chapter II, it has been argued that language models act as simulators of the world—or, to do justice to the exotic edge cases of e.g. fictional universes, simulators of some world. To recap, in order to achieve the terminal goal of successful next-token prediction typical of a pretraining stage, language models are instrumentally required to internalize rich schemas of individuals, natural phenomena, cultures, organizations, etc., so as to be able to accurately forecast their next steps in the semiotic universe of language, whose arrow of time is the reading direction—the passage of text becomes the passage of time. Accordingly, if language models are faced with countless opportunities to refine their understanding of humans, then why not simply have an automated human simulacrum as the very locus of human intent or values in a broader system? After all, you could just prompt a language model to simulate a human evaluating a behavior or a possible state of the world being considered.

The most prominent shortcoming of this “human simulator” proposal is one which we have already touched on towards the end of Chapter II, but also earlier in this chapter. Namely, the human models incorporated by necessity in language models are only informed by a finite amount of data. In other words, the limited amount of text included in the pretraining corpora is unlikely to convey a perfectly precise model of human values, for the same reason that it is unlikely to convey a perfectly precise model of e.g. trees, cities, movies, etc. The human model is just that, a model, in the same way in which the umbrella language model is just a limited model of language, whose slight errors, accumulated over (reading) time relative to the real world, are bound to compound and take the story of human intent in unrepresentative directions—curve-fitting gone astray.

One way to use the artifacts we have introduced over the previous chapters is to improve the simulator, but preserve the general application context. Just as the terminal goal of next-token prediction endows a pretrained language model with some amount of coherence—eliciting the choice of upcoming tokens which are most likely to “fit with” the preceding context—so does the optimization process behind DebateGPT, or at least so we have argued. A party simulacrum is incentivized to produce utterances which are coherent with its past ones, so as not to fall victim to self-contradiction. Indeed, the main theme of Chapter I, if there was one, was probably coherence—first the more atomic building block of inter-utterance coherence, then the more complex notion of party coherence. In the case of self-supervised learning, coherence is grounded in the empirical—dictated by the text corpora. However, in the case of the optimization procedure documented in Chapter II, coherence is grounded in the rational—dictated by notions such as entailment and contradiction. For sure, this rational aspect of ArgRank is itself grounded in the empirical, through the natural language inference models which have soaked up knowledge about warranted conclusions from structured human-written datasets on the topic. However, it might be possible to obviate this final dependency on the human empirical, as we have speculated towards the end of Chapter II. In this hypothetical development, the structure supporting the conception of coherence would be made of a dynamic material to be found in the updatable weights of DebateGPT—an element slowly being transmuted from pretrained evidence into high-density logos.

[...] a system, based on no data except reason itself, and which therefore seeks, without resting upon any fact, to unfold knowledge from its original germs. [...] the highest legislation of nature must lie in ourselves, i.e., in our understanding, and that we must not seek the universal laws of nature in nature by means of experience, but conversely must seek nature, as to its universal conformity to law, in the conditions of the possibility of experience, which lie in our sensibility and in our understanding.

Immanuel Kant, Prolegomena to Any Future Metaphysics

"And these innovations do not disturb your city's astral rhythm?" I asked. "Our city and the sky correspond so perfectly," they answered, "that any change in Andria involves some novelty among the stars." The astronomers, after each change takes place in Andria, peer into their telescopes and report a nova's explosion, or a remote point in the firmament's change of color from orange to yellow, the expansion of a nebula, the bending of a spiral of the Milky Way. Each change implies a sequence of other changes, in Andria as among the stars: the city and the sky never remain the same. As for the character of Andria's inhabitants, two virtues are worth mentioning: self-confidence and prudence. Convinced that every innovation in the city influences the sky's pattern, before taking any decision they calculate the risks and advantages for themselves and for the city and for all worlds.

Italo Calvino, Invisible Cities

We might loosely express the human simulacrum in the language of bounded defensibility by denoting a single party, not competing with any other, but just following its inherent coherence tendencies, strengthened over the epochs:

\[\pi_{p_1}^{H},\]

where \(H\) is the human position, perhaps prompted by actual humans through a finite set of utterances, before being taken over by the lonely party simulacrum. The human intent nested inside the human model—itself nested inside the language model—could then be used as an ideological reference frame against which potential courses of action or states of the world could then be evaluated, again by means of cohering with it.

But there is yet another development at the interface between ArgRank and human simulators. To grasp it, we must further expand on autoregressive language models. Typically, the input passage which those systems are being optimized to turn into the following word is itself limited in length. If we were to switch from a human author to a simulacrum thereof at this very point in writing, contemporary language models might only be able to extrapolate further based on the text elapsed since the beginning of this chapter. The contents of the previous chapters—despite arguably being essential in driving the semiotic forecast—will be discarded at once, due to simply not fitting inside the model’s input context. Even with a larger context length, the problem would simply “move,” rather than disappear entirely. Variations on the transformer architecture which typically underlies language models do allow variable context length in a limited sense (e.g. just at inference), although their added complexity relative to the limited gains appears to prevent them from gaining traction. This translates to a limitation of the current self-supervised learning paradigm: coherence can only be established across a finite history. There is no “learning signal” indicating how previous text—that which did not make it into the input context—coheres with the produced text.

This need not be the case, however, with ArgRank. When evaluating an utterance produced for the \(n^{\text{th}}\) round, ArgRank can take all of the past \(n-1\) rounds into account, rather than only the last \(k\) which fit into the language model’s input context, even when \(k\ll n\). This can be achieved by simply taking all the past utterances into account when constructing the argument graph, and rewarding each accordingly. When producing a new utterance, DebateGPT would be incentivized to first conduct an accurate retrodiction, getting a sense of what might have preceded the context window, in order to best act in its past interests.That said, one can also argue that successful retrodiction of the preceding context is an instrumental goal in excelling at next-token prediction. However, granularly and directly connecting the “present” tentative outputs with the various parts of the preceding context moves retrodiction “closer” to being a terminal goal. Besides, (1) human-written texts are still finite in length, while a procedural debate can act as an indefinitely long series of breadcrumbs to reconstruct, and (2) there is only a finite body of human-written text to “exercise” on, as opposed to indefinitely many such breadcrumb reversals. Besides, not only do late utterances benefit from being connected with earlier ones in this fashion, but also the other way around. The very final developments of a deliberative stand-off can play into the evaluation of the very first opening moves, reinforcing not only immediate effectiveness, but also the long-term defensibility across all of the \(n\) rounds. It is as if an agent initially lacking both long-term memory and the ability to plan for its future would gain access to an infinite playground designed to endow it with the skill of preserving long-term coherence across time. 
In this environment, the model can just freely wander around while trying to (1) reverse engineer its past steps in a diffusion-like setup,The self-supervised learning technique termed diffusion involves the gradual reconstruction of an incrementally corrupted signal. It has proven extremely effective in image generation, where images are being gradually denoised. and (2) predict its future ones. This complementary pair of entirely synthetic games might grant us the opportunity to asymptotically convert compute into coherence. To be frank, the procedural debate specs employed in the optimization of the prototype that is DebateGPT did not exceed \(6\) rounds, easily fitting in the model’s context size. However, this need not prevent us from speculating on such future developments.
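A minimal sketch of this adjustment follows. The edge weights, the damping value, and the orientation of credit flow are all assumptions made for the example; in practice, the weights would come from a natural language inference model, and the scoring would follow ArgRank proper.

```python
# Sketch: score all past utterances at once, regardless of context window.
# weights[i][j] encodes how strongly utterance j undermines utterance i;
# credit flows from undermined utterances toward their challengers via a
# PageRank-style power iteration.

def argrank(weights, damping: float = 0.85, iters: int = 50):
    n = len(weights)
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for j in range(n):
            inbound = sum(
                scores[i] * weights[i][j] / max(sum(weights[i]), 1e-9)
                for i in range(n)
            )
            new.append((1 - damping) / n + damping * inbound)
        scores = new
    return scores

# Three utterances spanning "all rounds so far": the third undermines
# both earlier ones, including one that fell outside the context window.
scores = argrank([[0, 0, 1], [0, 0, 1], [0, 0, 0]])
```

The point of the sketch is simply that nothing in the graph construction cares about the model’s context size: an utterance from the very first round participates in the scoring on equal footing with the latest one.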

Fig. Forecasting and context length.

Forecasting is easier when one can relate the future to the distant past. Indeed, Winston Churchill famously quipped that "the farther back you look, the further ahead you can see."

Try forecasting the following signal yourself by modifying its future trajectory.

For the best experience, view full screen on desktop. Refresh if encountering (visual) misalignment issues.

Notice how both of these ways of using DebateGPT-like models as plug-in human simulators—the more “tempered” approach of roughly working with the context size, and the more maximalist approach of including signals from way outside the context window—attempt to primarily capture the intent of contemporary humans. The amount of “bootstrap” data documenting the values of older generations pales in comparison to the mountains of data currently being collected about our own. There are not hundreds of active online forums for pre-Socratic philosophers to openly share their musings in a persistent format. Naturally, this temporal bias is also mirrored across space in our own time, with Western content trumping most others in volume. Sure, we might specifically prompt language models for dissenting opinions, but as Jacques Derrida might argue—a French philosopher whose prescient insights will soon resurface in our discussion—we would be bound to conceive of other ideologies from “within” the totalizing structure of our own, we would implicitly objectify the madness which is exterior to our ontological interior, which is interior to our ontological exterior. To instantiate this concern close to home, the reader is again invited to try making a coherent case against the claim that the true nature of truth-seeking lies in the existence of coherent challengers.

When compounded with the use of the human simulator as a “North Star” to guide the actions of an extremely capable system, this specificity of the simulated ideologies is then met with the concern of value lock-in—the failure mode of establishing our mainstream ways of thought as the status quo for what might very well be a future eternity. In other words, a powerful agent would be intentionally inoculated with our (partially arbitrary) present values, with seemingly little leeway for “moral progress,” assuming there is some notion of directionality inherent to moral evolution—a claim far from being undisputed. For better or worse, a number of object-level alignment researchers appear to consider the challenge of reliably inculcating some loosely-human ideology—regardless of it being characteristic of 21st-century San Francisco or 15th-century Rome—as significantly more demanding than tweaking its specifics towards being particularly welcoming of progress.

Let us now consider a way of making the human simulator somewhat more adaptive—we will describe the maximalist version of this in the next section, while cautiously dropping the “human” qualifier. Instead of merely attempting to coherently extrapolate the human intent inherent to party \(\pi_{p_1}^H\) above, we could “spin up” an ally party \(\pi_{p_2}^A\) to help “prop up” the human simulator. The motivation here would be to help patch up the vulnerabilities of \(H\), generally incorporating the same values into a more defensible whole. Using our already familiar notation, we could express this approach as:

\[\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(H \cup A \mid E),\]

where \(H\) is the more static original position embodying (contemporary) human values, \(A\) is the adaptive position meant to provide “reinforcements” for the human one, while \(E\) is the body of evidence being observed by the overall system. Notice also that \(E\) itself, as the background of “the empirical party,” is to be somewhat dynamic. Percepts of the world would be constantly emerging and being discarded, as the overall system acts in the world while inevitably modifying it over time. Indeed, another issue that is regularly brought up in alignment is that of distribution shifts, the problem of systems which are being optimized in certain circumstances then being tasked with operating in radically different environments. For instance, a future system might acquire unprecedented influence on the world, and so bring it into states which are difficult for us to conceive of, and, more importantly, difficult for us to morally reason about. In attaching a flexible enclosure \(A\) around the kernel of human ideology \(H\), and optimizing it for fending off deliberative critique, we are sketching out an automated way of adapting humanity to an ever-changing world—a realm into which \(E\) is to be an empirical window.
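As a loose illustration of this union, consider the following toy in which positions are literal sets of claims, the union operator is literal set union, and “defensibility” is stubbed as the number of claims left standing against a fixed pool of challenges. All of the claims, challenges, and candidate allies are invented for the example.

```python
# Toy sketch: prop up a static human position H with the adaptive ally A
# that makes the union H | A most defensible against known challenges.

def defensibility(position: frozenset, challenges: frozenset) -> int:
    # Crude stand-in for delta: claims that no challenge targets.
    return len(position - challenges)

H = frozenset({"honesty", "care", "autonomy"})   # static human kernel
challenges = frozenset({"autonomy", "purity"})   # critiques observed so far
candidates = [frozenset({"corrigibility"}), frozenset({"purity"})]

best_ally = max(candidates, key=lambda A: defensibility(H | A, challenges))
```

Even in this crude form, the structure of the argmax is visible: \(H\) stays fixed while the search ranges over the ally \(A\) alone, with the score always computed on the union.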

In the beginning of this section, we mentioned that we would be discussing considerations around two application contexts. So far, we have been discussing simulators, so let us move on to the second context, that of assistance games. This family of alignment proposals involves placing humans and machines in various interactive arrangements. Due to being a relatively general framework, it can also account for approaches we have already mentioned, thus reframing them through an interaction-centric lens. For instance, vanilla simulators can be seen as involving one single “speech act” on the part of humans, that of communicating an entire text corpus as a (massive) piece of information about human values. Alternatively, the process of fine-tuning models using human feedback can be seen as a more intricate interaction pattern, one interweaving manifestations of model behavior and human feedback. However, while this interaction-centric ontology provides a unifying grammar to describe many other proposals, its prescriptive value comes from yielding arrangements which have not been investigated much. One instance of this grammar being used generatively—and by a group advised by Stuart Russell,Co-author of AI: A Modern Approach and overall accomplished academic, Russell seems to have transitioned into alignment, founding the Center for Human-Compatible AI in 2016, as well as publishing Human Compatible in 2019.—can be found in Cooperative Inverse Reinforcement Learning. This “game” also involves a regimented interaction between human and AI. However, it specifically involves the AI inquiring about human values in a strategic way, using its “speech acts” to elicit the most relevant information possible at each step.
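The flavor of such strategic inquiry can be conveyed with a tiny Bayesian toy, unrelated to any particular implementation: the machine holds a belief over two candidate accounts of human values and picks the “speech act” whose answer is expected to shrink its uncertainty most. The likelihood tables and query names are invented for the example.

```python
import math

def entropy(p: float) -> float:
    # Binary entropy of a belief p in (0, 1).
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_posterior_entropy(p: float, likelihoods) -> float:
    # likelihoods[h][a]: probability that hypothesis h yields answer a.
    total = 0.0
    for a in range(2):
        pa = p * likelihoods[0][a] + (1 - p) * likelihoods[1][a]
        post = p * likelihoods[0][a] / pa
        post = min(max(post, 1e-12), 1 - 1e-12)
        total += pa * entropy(post)
    return total

prior = 0.5
# A vague query barely discriminates between the two hypotheses about
# human values; a sharp one does so decisively.
queries = {"vague": [[0.55, 0.45], [0.45, 0.55]],
           "sharp": [[0.95, 0.05], [0.05, 0.95]]}
best_query = min(queries,
                 key=lambda q: expected_posterior_entropy(prior, queries[q]))
```

Choosing the query which minimizes expected posterior entropy is the information-theoretic core of eliciting “the most relevant information possible at each step.”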

The truth-seeking dimension inherent to this two-player game should have rung a bell by now—it is a quintessentially dialectical setup, not too far removed from the human-machine configurations we have recently been exploring. Note, however, that this game involves human and machine cooperating with each other—after all, its name includes the term “cooperative.” On the face of it, this might seem contrary to the eristic nature of our triad of artifacts. We have repeatedly cast truth-seeking in a competitive light. Fortunately, the same union operator \(\cup\) comes to the rescue, allowing us to bind the human and machine parties together in a strategic alliance. Accordingly, we can then task the resulting cooperative alliance with pursuing the truth behind human values. Given, however, the conception of reasonableness which we have initially articulated in Chapter I, and then arguably expanded on in Chapter III, we presently approximate truth as that which cannot be coherently undermined, of which defensibility \(\delta\) is the mark. Keeping those prior considerations in mind, by binding human and machine in a union and tasking the whole with pursuing the most defensible account of human values, we are essentially approaching the same pattern which underlies our previous “riffing” on simulators, namely:

\[\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(H \cup A \mid E),\]

where \(H\) denotes the position of the actual human proper participating in the interaction, \(A\) underlies the infinitely adaptive machine, while \(E\) consists of the body of observations of humans acting in the world, again contextualizing the debate. The core (semantic) difference with regards to the previous instantiation of this pattern is that the present \(\pi_p^{H}\) denotes an actual, real, non-simulated human—or collective thereof—participating in the live interaction by producing one utterance at a time, which gets interweaved with those of the machine, and those of the “challengers” implied by the defensibility operator \(\delta\). There is no imperfect simulation to speak of here: the ally cooperates with—and the challenger attacks—the authentic human proper. It is unclear whether or not the lack of representation of this object-level detail in the expression above is a bug or a feature.

This concludes our attempt to iterate on related prior work around simulators and assistance games. We now move on to our final attempt to directly build on existing proposals.

Building on Long Reflection

In the previous section, we have extensively explored ways of “elevating” contemporary human values into higher and higher realms of defensibility, by propping them up with resourceful systems designed explicitly for that purpose. But again, are our values not partly arbitrary—a normative amalgamation concocted by self and other, nature and nurture, chance and illusory free will, etc.? Besides, are our values not partly transient? Keeping our collective end-of-history illusion in mind, it appears overwhelmingly likely that they will end up dislodged by yet other values at some point over the following decades and centuries, planetary force majeure aside. Even the most progressive dictates might end up traditional and dated years into the future. As we can see, beneath this section lies a metaethical minefield denser than beneath any other, so we should tread carefully.

Given the impermanence and partial baselessness of contemporary human values, we might want to think twice about explicitly incorporating them into the moral judgements which a highly-capable system might then employ in shaping the lightcone.The portion of the universe which we can seemingly influence from our position in spacetime. Makes for catchy branding, too. Would it be appropriate to induce this much path dependence on the normative frameworks which such systems might employ, for instance by incorporating a contemporary human simulator \(H\) in the optimization process involving \(\delta(H \cup A \mid E)\)? It would certainly be more appropriate for ethicists well-versed in these philosophical issues to identify the most sensible answer. That said, we permit ourselves to explore technical solutions which reflect alternate answers to the metaethical concerns above, if only as varied options to then be considered.

One natural approach to doing away with the human component in the proposals of the previous section would be to drop the simulacrum holding position \(H\). We would then end up with a more open-ended search for defensible normative frameworks, handing it off to the broader system for it to use as a “North Star” in guiding its actions. To use the language of bounded defensibility, we would essentially prescribe:

\[\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(A \mid E),\]

where \(A\) is the flexible conception we are after, while \(E\) incorporates observations of the ever-changing world. However, it might or might not be necessary to prompt the need for a guiding framework more explicitly by expanding the “givens” into a two-fold union of \(E\) and a position implying the existence of an appropriate system of ethics, although that again is a non-trivial metaethical claim. But would a most defensible understanding of the world—together with the crucial moral knowledge which it ought to incorporate—truly be desirable as a framework to endow our machines with? How much blood has been spilled over the centuries by fanatics blinded by totalizing ideologies which warranted the dismissal of all others? Undoubtedly, far too much. This might then prompt us to instinctively return to the seeming relative reasonableness of the contemporary zeitgeist.

[...] as the heretic is born from the saint and the possessed from the seer. Fear prophets, Adso, and those prepared to die for the truth, for as a rule they make many others die with them, often before them, at times instead of them. [spoiler] did a diabolical thing because he loved his truth so lewdly that he dared anything in order to destroy falsehood. [spoiler] feared the second book of Aristotle because it perhaps really did teach how to distort the face of every truth, so that we would not become slaves of our ghosts. Perhaps the mission of those who love mankind is to make people laugh at the truth, to make truth laugh, because the only truth lies in learning to free ourselves from insane passion for the truth.

Umberto Eco, The Name of the Rose

But consider for a moment the reason why we are now able to look back on past pages of our collective narrative and recognize their darkness in the first place. Indeed, even ideologies which have seemed indefeasible for a brief passage of text—due to being backed by much powerEquivocating political power and motivated reasoning might appear a daft rhetorical trick. However, the two are closely connected, a relation most apparent in the way in which Erich Fromm interweaves discussion on rationalization and the rise of political regimes in Escape from Freedom. Relatedly, the disparity between social resources made available to various worldviews has also been one of the reasons pushing John Stuart Mill to be a fierce advocate of freedom of speech. In On Liberty, he writes:

“The beliefs which we have most warrant for, have no safeguard to rest on, but a standing invitation to the whole world to prove them unfounded. If the challenge is not accepted, or is accepted and the attempt fails, we are far enough from certainty still; but we have done the best that the existing state of human reason admits of; we have neglected nothing that could give the truth a chance of reaching us: if the lists are kept open, we may hope that if there be a better truth, it will be found when the human mind is capable of receiving it; and in the meantime we may rely on having attained such approach to truth, as is possible in our own day. This is the amount of certainty attainable by a fallible being, and this the sole way of attaining it.”

We are, however, pursuing wholly certain knowledge, by attempting to exhaust potential challenges in an automated fashion.
—have still been defeated in the end. Too late, regrettably, but defeated still. The contemporary zeitgeist overwhelmingly undermines the ideologies which the past few paragraphs might have evoked, providing reasons for hope. But will we not also look back on the present status quo in a few decades, and find it unthinkable to imagine ever considering certain practices morally acceptable? Some elements of our present normative framework appear better candidates for undergoing this shift than others—it is particularly easy to imagine the animal cruelty of factory farming and the dehumanizing commercialization of behavioral surplus sustaining this change. That said, as much as it is consequential, this is arguably one of the most challenging prediction markets out there.

However, it would be naive to assume that moral evolution always tends in one direction, with every zeitgeist more defensible than the previous. Indeed, history tends to repeat itself, especially when the memory of past tragedies is not being actively preserved. The somewhat cyclical character of human history makes for an almost empirical case against the directedness of moral evolution, at least in a strong monotonic sense. If we were to briefly put on Hari Seldon’s analytical lens, we might argue that the “center of mass” of phasors representing sequential zeitgeists is quite static, except for the constructive interference of periodic trends occasionally coming together. Though perhaps that cynical framing would overstate the amount of cyclicity somewhat.

Fortunately, this almost civic insight into metaethics—that history repeats itself and that preserving collective memory is one antidote against the destructive interference hampering moral progress—can be promptly translated into at least two concrete technical adjustments to our deliberative system pursuing maximal defensibility. In fact, we have already hinted at both of them, but let us briefly recontextualize them using two loosely related, although much more arid, challenges. First, when researchers attempt to devise generative models which rely on the architecture of generative adversarial networks, rather than on the transformer architecture, they are essentially optimizing two subsystems with opposing goals. On one hand, the “generator” might be tasked with producing photorealistic and natural images. On the other hand, the “discriminator” might be tasked with spotting whether images have been generated (i.e. by the generator subsystem), or whether they are authentic (i.e. actual photographs captured by humans). The generator is incentivized to become more and more capable of “tricking” the discriminator, while the discriminator is incentivized to become more and more capable of seeing through the generator’s trickery. This adversarial arms race—not too far removed from the one we employed—provides its own autocurriculum, with each subsystem eliciting more and more sophistication from its counterpart.
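For reference, this adversarial arrangement is standardly written as a single minimax objective (the formulation from the original GAN paper, reproduced here for orientation rather than as anything specific to our artifacts):

\[\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]\]

The discriminator \(D\) pushes the value up by separating authentic samples \(x\) from generated ones \(G(z)\), while the generator \(G\) pushes it down by making the two indistinguishable.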

However, one difficulty which is often encountered in the development of such systems is that of mode collapse. Concretely, this issue involves the generator and the discriminator playing a cyclical cat-and-mouse game, with the generator systematically “moving away” from regions of e.g. image space which are “well policed” by the discriminator, only for the discriminator to promptly counter this evasive maneuver with an analogous move of its own, trying to catch up with the generator’s trickery. This phenomenon of the discriminator following the generator around in circles typically results in the generator only being able to produce one overly specific type of e.g. images at any given time, rather than having a solid grasp of the whole swath of state space implied by the authentic samples. One effective remedy to this failure mode involves unrolling the generator-discriminator game over multiple rounds, providing both subsystems with opportunities to develop strategies which, for a change, are not immediately countered by the opponent.This predates related work in the “game-agnostic” field of cooperative game theory, as seen in Learning with Opponent-Learning Awareness. In this, a generator which simply “runs away from” the discriminator’s oversight is disfavored relative to one which has a decent grasp on the whole state space, due to the “avoidant” strategy not being effective in the long-term. This is in stark contrast to the previous arrangement, where the avoidant generator could get away with not being penalized for moves which are immediately countered. In our deliberative realm, we can translate this solution against cyclicity by unrolling “the debate game” and taking a large number of rounds into account when constructing the argument graph—the same adjustment we proposed in the previous section, for a complementary reason.
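The cyclical cat-and-mouse dynamic, and the way a pinch of lookahead defuses it, shows up even in the simplest adversarial game. The sketch below uses extragradient updates, a one-step-lookahead relative of the unrolling trick (not the exact method from the unrolled-GAN work), on the toy bilinear game \(\min_x \max_y xy\):

```python
import math

def naive_step(x, y, lr):
    # each player reacts only to the opponent's current move
    return x - lr * y, y + lr * x

def lookahead_step(x, y, lr):
    # extragradient: peek one step ahead before committing to an update
    xh, yh = x - lr * y, y + lr * x
    return x - lr * yh, y + lr * xh

xn = yn = xl = yl = 1.0
for _ in range(100):
    xn, yn = naive_step(xn, yn, 0.1)
    xl, yl = lookahead_step(xl, yl, 0.1)

print(math.hypot(xn, yn))  # naive play drifts away from the equilibrium
print(math.hypot(xl, yl))  # lookahead play contracts toward (0, 0)
```

Naive simultaneous play spirals away from the equilibrium at the origin, while the lookahead variant contracts toward it, mirroring how unrolling disfavors strategies that are immediately countered.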

Besides the trick of preserving more of the game’s past in memory through the pattern of a large sliding window—with many rounds of debate or generator-discriminator stand-offs being taken into account at once—we could also attempt to preserve more of the players themselves. When developing AlphaStar, a system capable of playing StarCraft II at a level “above 99.8% of officially ranked human players,” DeepMind researchers did not merely make use of a single model gaining experience by means of endlessly playing against itself. Rather, the authors implemented a league of models located at various levels of sophistication, and then pressured the latest models to “play against” the most demanding “mixture” of past models. This approach appears to have been essential for preventing the “elite” players—in their local high echelons of competition—from forgetting how to outplay the more rudimentary players in the league.Their ablation (i.e. causal intervention) on league composition (p. 4) indicates a whopping \(56\%\) boost in performance when league training is being used, relative to just making use of the main agents. In essence, besides providing a demanding autocurriculum, the league as a whole helps preserve the memory of vulnerabilities faced by parties of the past, reminding the present players to steer clear of them through selective pressure, and so again reducing cyclicity. While the optimization process behind DebateGPT involved no analogous repository of simulators, future ones might, as attempts to bake in systemic guardrails against succumbing to the same failure modes over and over again.
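As a minimal sketch of the league idea, one can keep past checkpoints around and sample opponents in proportion to how troublesome each remains for the current agent. The weighting below is a deliberately simplified, hypothetical stand-in for AlphaStar's actual prioritized fictitious self-play scheme, with made-up names and loss rates:

```python
import random
from collections import Counter

def sample_opponent(league, loss_rate, temperature=2.0, rng=random):
    """Sample a past player, weighting by the current agent's loss rate
    against it (sharpened by a temperature exponent)."""
    weights = [max(l, 1e-6) ** temperature for l in loss_rate]
    return rng.choices(league, weights=weights)[0]

random.seed(0)
league = ["rookie", "veteran", "nemesis"]
loss_rate = [0.05, 0.5, 0.9]  # how often the current agent loses to each
draws = Counter(sample_opponent(league, loss_rate) for _ in range(1000))
print(draws.most_common())  # the nemesis dominates; the rookie is rare
```

Sampling this way keeps old vulnerabilities in active circulation: any past player that still beats the current agent is drawn often, so the weakness cannot quietly resurface.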

Notice also how the evaluation of a debate game implied by ArgRank is currently external to the competing parties. Regardless of the positions held by the simulacra, it is the argument graph—based on the distinct natural language inference models gauging inter-utterance coherence—which paints a picture of the power dynamics involved. However, the situation would change if ArgRank were to undergo the superhuman developments we have previously speculated on, and this is where Derrida’s insights resurface. If the model itself is to participate—even indirectly—in its own evaluation, it might end up projecting its own ontology onto the process, and so become forced to express any exteriority in its own terms, judging it from within, rather than from a detached position. Notice, however, that the static substrate composed of natural language inference models is not really much better off—instead of judging from within a potentially superhuman interiority, those pretrained models judge from within the interiority of the contemporary human zeitgeist. Devising ways of ensuring that the interiority employed in the speculative version of ArgRank is constantly expanding, rather than contracting into a claustrophobic local optimum of rigidity, appears to be yet another challenging issue at the interface of engineering and philosophy. Tentatively, would a repository of past zeitgeists help preserve the memory of past interiorities, promoting spaciousness by merging them into a disjunctive space?Interestingly, Derrida’s own ontology is intimately compatible with the competitive debate underlying our three artifacts, making regular use of vivid terms such as force, violence, totalitarianism, oppression, etc. to describe the authoritative role of ontologies in structuring thought. Would such an adjustment be too forceful a meta-level inductive bias on our part?

The Greek miracle is not this or that, such and such astonishing success; it is the impossibility for any thought ever to treat its sages as "sages of the outside," [...] in welcoming alterity in general into the heart of the logos, the Greek thought of Being forever has protected itself against every absolutely surprising convocation.

Jacques Derrida, Violence and Metaphysics

But now science, stimulated by its powerful illusion, hastens irresistibly to its limits, on which its optimism, hidden in the essence of logic, is wrecked. For the periphery of the circle of science has an infinite number of points, and while there is still no telling how this circle can ever be completely measured, yet the noble and gifted man, even before the middle of his career, inevitably comes in contact with those extreme points of the periphery where he stares into the unfathomable. When to his dismay he here sees how logic coils round itself at these limits and finally bites its own tail—then the new form of perception rises to view, namely tragic perception, which, in order even to be endured, requires art as protection and remedy.

Friedrich Nietzsche, The Birth of Tragedy

Speaking of baking inductive bias into a situation where we largely depart from contemporary human values, what of the very valuing of humanity? To touch on what might well be the thorniest dilemma of this section, would it be moral to allow “love for mankind” to conflict with moral progress in the case in which a most defensible normative position implies, for the sake of argument, the danger posed by humanity to other moral patients across the lightcone? What ought we place in higher regard—and implement through concrete engineering choices—when cornered into such thought experiments: humanism or moral progressivism? The dilemma is left as an exercise to the reader.

Connections to Logical Inductors & Classical Debate

Over the previous few sections, we have attempted to directly iterate on various approaches to the alignment problem. In the present section, however, we will rather limit ourselves to highlighting intriguing connections to a couple of other approaches, without necessarily yielding novel proposals right away. That said, these conceptual bridges might still turn out to be useful in the long-term by virtue of bringing together complementary approaches.

The first paradigm which we attempt to relate to our varied artifacts is that of logical induction. Developed as a model of ideal reasoning under uncertainty assumed to be possessed by highly capable future systems—in an attempt to study the properties of AGI before it has been createdAlthough MIRI (then called the Singularity Institute for AI) arguably tried and failed to develop an aligned AGI themselves years ago, in an attempt to prevent the development of future misaligned systems of this kind.—logical induction describes the iterative process of fuzzy estimates converging on truth-values. To get a sense of such a process, consider spontaneously being asked to assign a fuzzy truth-value to the following proposition:

\[P=\text{The hundredth digit of }\pi\text{ is }7.\]

If someone only has a few seconds to provide an answer, they might quickly go with \(10\%\) as a best guess, due to \(7\) being one of the \(10\) possible digits which are mingled quite irrationally. If, however, one is instead given an hour and a piece of paper, the fuzzy estimate might evolve quite differently. For instance, one might carry out the manual computation which points towards \(7\) being the actual hundredth digit. But it is also possible that the person has made a mistake in the long chain of calculations, so they might only assign an estimate of \(90\%\) to \(P\) being true at the moment. Following a few subsequent repetitions of the computation—“just to be sure”—their best guess might further climb to \(99\%\). It is still not \(100\%\), as there is a possibility that they might have misremembered the algorithm for computing \(\pi\) digits, or perhaps have made a systematic mistake across all separate replications. The fuzzy estimate is inherently dynamic, with the best guess at each point in time being somewhat different.
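The thought experiment can, of course, be settled mechanically. As a sketch, Machin's formula \(\pi = 16\arctan\frac{1}{5} - 4\arctan\frac{1}{239}\) together with Python's arbitrary-precision decimals suffices; note that the proposition counts the leading \(3\) as the first digit:

```python
from decimal import Decimal, getcontext

def arctan_recip(x: int, eps: Decimal) -> Decimal:
    """arctan(1/x) by its alternating Taylor series, to within eps."""
    total, power, n = Decimal(0), Decimal(1) / x, 0
    while power > eps:
        term = power / (2 * n + 1)
        total += term if n % 2 == 0 else -term
        power /= x * x
        n += 1
    return total

def pi_digits(digits: int) -> str:
    getcontext().prec = digits + 10           # guard digits against rounding
    eps = Decimal(10) ** -(digits + 5)
    pi = 16 * arctan_recip(5, eps) - 4 * arctan_recip(239, eps)
    return str(pi).replace(".", "")[:digits]  # "3141592653..."

print(pi_digits(100)[99])  # the hundredth digit, counting the leading 3
```

In line with the paragraph above, the certainty this lends to \(P\) is still conditional on one's trust in the implementation and in the formula being remembered correctly.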

Notice, however, that the estimates are being put forth by a certain individual with certain beliefs. If, for instance, the participant believes that they have a long history of making sloppy mistakes when carrying out computations by hand, they might have less confidence in \(P\) being true even after redoing the calculations ten times over, perhaps only approaching \(80\%\). Conversely, if the participant believes themself a prodigy and polymath, their estimates might be relatively high throughout the process. As yet another consideration, if the participant observes themself steadily approaching \(100\%\), they might use that bit of meta-cognitive introspection to estimate something close to certainty in advance. That said, even this inference would rely on their beliefs about the monotonicity of similar reasoning processes, being again haunted by the twin spectres of overconfidence and underconfidence.How exactly to take into account the veracity of the proposition being implied by this very paragraph, as well as how to “update” based on the results of an online app are left as epistemic exercises for the reader.

While coming up with pertinent fuzzy estimates is the problem, the same team of researchers also proposes an algorithm as a solution, now known as the Garrabrant inductor (or simply, the logical inductor):

[...] the formalization of the algorithm is basically finance. You just make a stock market of traders which are betting on sentences, then you imagine that market, and then whatever the market believes, you believe that. [...] Basically, there is some definition of traders [...] and it says that you are good at logical induction if any trader who's not willing to [...] risk losing more than a bounded amount is not going to be able to make infinite money from you. So if you walk up to a Garrabrant inductor and you promise yourself you're never going to risk [...] going negative a million dollars in debt [...] you're not going to make a million dollars betting against it. [...] from that one definition you get all those amazing properties. [...] pretty cool I think.

Andrew Critch, Logical Inductors at EAG 2016

In essence, the algorithm proposed by Garrabrant et al. for implementing “good” reasoning under uncertainty (i.e. reasoning which satisfies a number of “nice theoretical properties”) relies on a market of traders which are systematically incentivized to avoid being financially exploited (i.e. Dutch booked) when betting on fuzzy estimates about the truth of propositions. In this iterative rat race driven by make-believe money, the “voice of the prediction market” provably converges on results which are independently proven, in the limit. Besides this essential property, these theoretical constructs appear to have many other beautiful ones, such as:

Logical inductors learn to recognize any pattern in theorems (or contradictions) that can be identified in polynomial time. Consider a sequence of conjectures generated by a brilliant mathematician, such as Ramanujan, that are difficult to prove but keep turning out to be true. A logical inductor will recognize this pattern and start assigning Ramanujan’s conjectures high probabilities well before it has enough resources to verify them. [...] Logical inductors have accurate beliefs about their own beliefs, in a manner that avoids the standard paradoxes of self-reference. For instance, the probabilities on a sequence that says "I have probability less than 50% on the nth day" go extremely close to 50% and oscillate pseudorandomly, such that there is no polynomial-time method to tell whether the nth one is slightly above or slightly below 50%. [...] Logical inductors learn to trust their future beliefs more than their current beliefs. This gives some formal backing to the intuition that real-world probabilistic agents can often be reasonably confident in their future reasoning in practice, even though Gödel's incompleteness theorems place strong limits on reflective reasoning in full generality.

Nate Soares, New paper: "Logical induction"
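Returning to the market mechanic itself: Garrabrant's actual construction is considerably more intricate, but the "whatever the market believes, you believe that" dynamic can be gestured at with a toy automated market maker. The logarithmic market scoring rule below is a standard prediction-market device, not the mechanism of the paper; it merely shows how a self-interested trader's bets drag the market probability toward their belief:

```python
import math

def market_prob(q_yes: float, q_no: float, b: float = 10.0) -> float:
    """Implied probability of YES under a logarithmic market scoring rule,
    given the outstanding quantities of YES and NO shares."""
    ey, en = math.exp(q_yes / b), math.exp(q_no / b)
    return ey / (ey + en)

# a trader who believes the sentence is ~90% likely keeps buying YES
# shares for as long as the market underprices it
q_yes = q_no = 0.0
belief = 0.9
while market_prob(q_yes, q_no) < belief:
    q_yes += 0.1
print(round(market_prob(q_yes, q_no), 3))
```

Once the price reaches the trader's belief, further purchases stop paying in expectation, so the market settles at the belief of whoever is willing to stake money on it, a faint echo of the exploitation-resistance property described above.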

While logical inductors as veritable truth-seeking engines are nothing short of beautiful in their elegance and potency, they remain incredibly computationally demanding, making them virtually impossible to meaningfully implement in real-world applications, at least for the time being. This, however, will not prevent us from attempting to connect them to a truth-seeking engine designed explicitly for contemporary feasibility—that embodied in the triad of artifacts which we have previously discussed. We start the construction of our conceptual bridge by illuminatingIn From logic to argumentation, Jean-Blaise Grize describes a metaphor of metaphor:

“The discourse entities must be illuminated, which means that some of their facets must be highlighted and others hidden and every illumination colours what it illuminates, because it makes use of cultural preconstructs that are never neutral.”
properties shared by both the traders which underlie Garrabrant inductors and the parties which underlie the deliberative arms race assembled from ArgRank, DebateGPT, and bounded defensibility. Namely, both traders and parties are incentivized to avoid being exploited by other traders and parties, respectively. In logical induction, exploitation reads as losing money wagered on “truth” to a trader implementing a “more truthful” strategy, where truthfulness is operationalized as that which protects against debt—in an uncoincidental circularity hinting at the pragmatic conception of reasonableness being implied by its proponents. In bounded defensibility, exploitation reads as being defeated by a party relying on a “more truthful” position and strategy, where truthfulness is operationalized as that which cannot be coherently defeated—again a circularity which merely reflects our conception of reasonableness. Furthermore, the emergent dynamics of both systems are argued to lead to increased truthfulness over the course of a prolonged competition between traders and parties, respectively.Systems composed of multiple self-interested agents can often prove remarkably powerful in aggregate, as can be seen in financial markets excelling at ironing out inefficiencies, or various multi-agent reframings of classical optimization problems, such as PCA. In the limit, logical inductors provably converge on virtually unexploitable(-in-polytime) trading strategies. Over the epochs, DebateGPT has been argued to yield simulacra whose defense abilities grow more and more sophisticated.

Besides those high-level similarities, the two truth-seeking engines could not be more different. Garrabrant inductors exhibit proven theoretical properties which DebateGPT can only dream of, while bounded defensibility arguably positions itself somewhat more advantageously relative to the prosaic nature of contemporary systems. However, the two can cross-pollinate in intriguing ways. For instance, the elegant introspective abilities of Garrabrant inductors hint at ways in which parties competing with each other within the confines of DebateGPT could potentially reason about their very reasoning process, the very architecture of the deliberative arms race they are engaging in. Alternatively, the beautiful self-trust properties which allow Garrabrant inductors to “trust their future beliefs more than their current beliefs” hint at ways in which simulacra simulated by one version of DebateGPT might condition themselves to cohere with simulacra of future epochs, potentially through explicit calibration.This train of thought is also reminiscent of iterated distillation: condition a system to reach the conclusions it has previously reached over the course of a longer deliberation during a shorter, more limited deliberation. Then, have the more efficient version again deliberate for a longer period, before using the outcome to condition for faster results. Rinse and repeat. Nowhere is the cross-pollination more obvious, however, than with applications, hinting at ways of employing Garrabrant inductors which are analogous to the ones discussed in the previous sections of the present chapter. Any attempt to “transport” more of the theoretical work towards our framework, however, will require a much more rigorous treatment of bounded defensibility as a theoretical foundation, potentially even involving a move from frequentism to Bayesianism (i.e. developments of the debate game gradually informing estimates of a position’s defensibility).

The other paradigm which we attempt to “connect to” is the paradigm of what we will presently call classical debate. This term is something of a misnomer, because the paradigm we are trying to relate our artifacts to is extremely recent. However, we have avoided introducing it earlier in the volume in an attempt to make it easier for us to explore a subtly different ontology and framing of the problem. In an influential paper titled AI Safety via Debate, Irving et al., as part of OpenAI’s Reflection team, describe two systems engaged in a debate which is judged by a human. When cast in the light of a full-blown alignment proposal, classical debate describes a process in which two superhuman debaters are adversarially incentivized to deconstruct complex dilemmas into cruxes which are within the human’s ability to judge, potentially granting both parties interpretability tools to help “expose” a deceptive opponent in front of the human judge. However, we presently do not employ a human judge, and instead define reasonableness through the epistemologically-principled ArgRank. Additionally, we are not necessarily bothered by deceptive parties, as we rely more on the relative ease of defending certain positions, even deceptively, if need beGiven that the same system is simulating competing perspectives, it would be surprising if some were not deceptive relative to the model’s internal epistemics.—though we speculated on extending ArgRank to account for coherence with the model’s internals. There are various other subtle distinctions which make the two approaches feel “slightly off” relative to each other, despite almost attempting to formalize the same processes.

Knowing must therefore be accompanied by an equal capacity to forget knowing. Non-knowing is not a form of ignorance but a difficult transcendence of knowledge. This is the price that must be paid for an oeuvre to be, at all times, a sort of pure beginning, which makes its creation an exercise in freedom.

Jean Lescure, Charles Lapicque

As a rapid-fire listing of slight discrepancies between the present work and classical debate, consider that: each party in the former primarily has their own position, while parties in the latter are cast as having more of a personalized distribution over the same beliefs (similar to the “investment portfolios” within Garrabrant inductors); relatedly, the formalism in the former attempts to accommodate beliefs-as-ends as a first-class application, while the formalism in the latter focuses more on a finite proponent-opponent stand-off, perhaps inspired by the North American tradition of academic debate and the prover-verifier dichotomy of interactive computing; hosting multiple parties is natural in the former, while it is unclear how a human judge in the latter ought to decide on one winner among many; one party’s standing is primarily continuous in the former (for reward shaping reasons), while being cast as more discrete in the latter; deconstructing decisions into cruxes is not much of a focus in the former, as a human-level judge is not really part of the scheme at all, not even empirically approximated through a reward model; we are working in the former with party simulacra “internal” to one model, while distinct (albeit cloned) systems are present in the latter, etc. Subtle distinctions aside, it is obvious that both effortsTo avoid confusion, note that most references to “Paul” in alignment circles at the time of writing refer to Paul Christiano, a lack of hash collisions indicative of the relatively modest size of the community. are motivated by related goals and run into related issues, for instance regarding a convergence on non-monotonic logic:

Now we come to something that I picked up from my former student (and now AI alignment leader) Paul Christiano, on a recent trip to the Bay Area, and which I share with Paul’s kind permission. Having learned that there's no way to mechanize even heuristic explanations for all the true statements of arithmetic, we could set our sights lower still, and ask about mere plausibility arguments—arguments that might be overturned on further reflection. Is there some sense in which every true mathematical statement at least has a good plausibility argument?

Scott Aaronson, Oh right, quantum computing

Fortunately, the two approaches can again cross-pollinate. For instance, classical debate appears much closer to work on logical induction in terms of the type of formalisms involved, potentially providing a pathway for connecting the more applied work being conducted presently—touching on the specifics of contemporary systems via simulator theory, shard theory, autocurricula, etc.—with the esoteric realm of idealized reasoning. Conversely, the present work could provide a pathway to better connect classical debate with the type of systems likely to be developed in the near future. Alternatively, the ArgRank operationalization might help address some of the challenges otherwise faced by the human judge, although it might also introduce others.

This concludes our preliminary scaffolding of conceptual bridges, and with that, our broader discussion around ways in which one might apply the triad of artifacts for addressing the alignment problem.

Ch. V, Benchmarking Artifacts

Benchmarking ArgRank’s Dependencies

The previous chapters recount our attempt at operationalizing the process of truth-seeking. To reiterate, we argued that the nature of truth-seeking lies in the search for parties which can coherently challenge one’s claims. Operationalizing truth-seeking then requires, among other things, operationalizing what it means for a party to coherently challenge another’s claims. This led us to the following decomposition: (1) coherently challenging a position is equivalent to winning a debate against a party holding it, (2) winning a debate is equivalent to having the strongest arguments, (3) the strength of an argument is proportional to the extent to which it is supported by other strong arguments, and (4) the amount of support lent by one argument to another can be gauged empirically or rationally. Taking stock of the entire decomposition, the process of gauging support between arguments appears to be the most load-bearing element. Therefore, in order to gauge the defensibility of this decomposition, we start by investigating the effectiveness of methods used to gauge such support.
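Step (3) of this decomposition is a fixed-point recurrence of the PageRank family. ArgRank's exact construction is the subject of Chapter I; as a self-contained toy sketch, with a hypothetical support matrix and the standard damping term, it amounts to:

```python
def argument_strengths(support, damping=0.85, iters=100):
    """support[i][j]: nonnegative degree to which argument i supports j.
    Returns one strength per argument; arguments backed by strong
    arguments rank high, per the recurrence in step (3)."""
    n = len(support)
    out = [sum(row) or 1.0 for row in support]  # normalize outgoing support
    s = [1.0 / n] * n
    for _ in range(iters):
        s = [(1 - damping) / n
             + damping * sum(support[i][j] / out[i] * s[i] for i in range(n))
             for j in range(n)]
    return s

# toy chain of support: argument 0 backs 1, which backs 2
strengths = argument_strengths([[0, 1, 0],
                                [0, 0, 1],
                                [0, 0, 0]])
print(strengths)  # strictly increasing along the chain
```

Step (4), gauging the pairwise entries of this matrix, is precisely the load-bearing element whose effectiveness the remainder of the chapter benchmarks.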

As described in Chapter I, we have tentatively opted to use pretrained natural language inference models to help gauge the extent to which one argument supports another. We proceed by benchmarking a family of such models on the problem of detecting relations of entailment (or lack thereof) in cases where the existence of such a relation is assumed. In other words, as a “sanity check” for models broadly and open-endedly optimized to detect relations of support between propositions, we attempt to separate valid and fallacious inferences as defined more narrowly by classical, truth-functional propositional logic. The data points which comprise the benchmark can therefore be split into: (1) pairs of premises and hypotheses which follow one of the rules of inference established by classical propositional logic, and (2) pairs of premises and hypotheses which are known not to follow such rules, although they superficially seem to.

A popular example of a pair of superficially similar inference patterns can be found in the duo of modus tollens and denying the antecedent. Modus tollens refers to the pattern \(P \rightarrow Q, \neg Q \models \neg P.\) As an example, take: “If the dog detects an intruder, the dog barks. The dog does not bark. Therefore, the dog does not detect an intruder.” In contrast, denying the antecedent refers to the pattern \(P \rightarrow Q, \neg P \models \neg Q.\) As an example, take: “If the dog detects an intruder, the dog barks. The dog does not detect an intruder. Therefore, the dog does not bark.” The first is valid, the second is not. In the case of the second example, the dog might bark for some other reason entirely, and so excluding one potential cause of barking is not enough to prove the absence of barking. We take “the logician’s” assignments of validity as ground-truth labels, and use them to denote two classes: support and lack thereof.
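The ground-truth labels themselves involve no judgment calls, since validity in classical propositional logic is mechanically checkable by enumerating truth assignments. A minimal sketch over two atoms:

```python
from itertools import product

def valid(premises, conclusion):
    """A sequent is valid iff the conclusion holds in every truth
    assignment that satisfies all of the premises."""
    return all(conclusion(p, q)
               for p, q in product([False, True], repeat=2)
               if all(prem(p, q) for prem in premises))

implies = lambda a, b: (not a) or b

# modus tollens: P -> Q, not Q  |=  not P
modus_tollens = valid([lambda p, q: implies(p, q), lambda p, q: not q],
                      lambda p, q: not p)
# denying the antecedent: P -> Q, not P  |=  not Q (fallacious)
denying_antecedent = valid([lambda p, q: implies(p, q), lambda p, q: not p],
                           lambda p, q: not q)
print(modus_tollens, denying_antecedent)  # True False
```

Modus tollens holds in every assignment satisfying its premises, while denying the antecedent fails on the assignment where the dog barks for some other reason (\(P\) false, \(Q\) true).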

To test whether models optimized for natural language inference succeed in separating data points into these two classes, we have generated a hundred instances of modus tollens and a hundred instances of denying the antecedent in a semi-automatic fashion: we employed a separate autoregressive language model to help us expand a list of such instances, before manually ensuring that each data point conforms to its designated pattern. As natural language inference models are pretrained to operate with premise-hypothesis-label triples, rather than with an arbitrary number of premise strings, we concatenate the two distinct premises into one unified premise string for each data point. Additionally, we employ a pipeline identical to the one used in ArgRank by deriving a floating-point value from the model’s logits. The table below helps provide a better sense of how the dataset for this benchmark has been constructed.

Table. Sample data points.

Each data point consists of a premise string \(X_0\), a hypothesis string \(X_1\), and a ground-truth label \(Y\). The model is then employed to assign a value \(\hat{Y}\).

\(X_0\): If the dog detects an intruder, the dog barks. The dog does not bark.
\(X_1\): The dog does not detect an intruder.
\(Y\): Valid; \(\hat{Y}\): 0.82

\(X_0\): If the dog detects an intruder, the dog barks. The dog does not detect an intruder.
\(X_1\): The dog does not bark.
\(Y\): Invalid; \(\hat{Y}\): 0.57
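The step from logits to the floating-point value \(\hat{Y}\) can be sketched as a softmax over the model's three NLI classes, keeping the entailment probability as the support score. The label ordering below is a hypothetical convention; real checkpoints differ in how their classes are indexed:

```python
import math

def support_score(logits):
    """Softmax over (entailment, neutral, contradiction) logits,
    returning the entailment probability as a value in [0, 1]."""
    m = max(logits)                      # stabilize the exponentials
    exps = [math.exp(l - m) for l in logits]
    return exps[0] / sum(exps)

print(support_score([2.1, 0.3, -1.4]))  # a mostly-entailed pair
print(support_score([-1.4, 0.3, 2.1]))  # a mostly-contradicted pair
```

Collapsing three logits into one scalar this way is what populates the \([0, 1]\) interval over which the classification threshold is then swept.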

Ideally, such pretrained models would assign higher estimates of premise-hypothesis support to data points previously labeled valid than to data points previously labeled invalid. In other words, such models would ideally be able to cleanly separate the two classes of data points across the \([0, 1]\) interval populated by predicted values, such that there exists a threshold value above which all valid data points and only these can be found, and below which all invalid data points and only these can be found. In practice, performance on this classification problem is imperfect, resulting in some data points labeled as invalid being predicted as exhibiting stronger inter-statement support than some data points labeled as valid.

Readers familiar with descriptive statistics might instinctively think of established visualization techniques and metrics meant to illustrate performance on this task—the extent to which a candidate classifier manages to cleanly separate the data points based on their ground-truth labels. The ROC curve and ROC area under curveEach point on an ROC curve corresponds to a specific threshold being used to separate data points on the number line populated with values by the classifier. The coordinates of a point correspond to the sensitivity (i.e. true positive rate) and the false positive rate (i.e. one minus the specificity) “achieved” by the combined efforts of the classifier assigning values and the threshold suggesting a separation.

The extreme ends of the ROC curve can be interpreted quite intuitively. Place the threshold too low, and all data points will be considered valid, including the ones which have previously been labeled so (i.e. perfect sensitivity), but also the ones which have not been labeled so (i.e. disastrous specificity). Place the threshold too high, and all data points will be considered invalid, including the ones which have been previously labeled so (i.e. perfect specificity), but also the ones which have not been labeled so (i.e. disastrous sensitivity).

There is no “ideal” threshold, as applications involving binary classification problems might want to negotiate stronger sensitivity at the expense of weaker specificity, or vice versa (e.g. an epidemiologic test might be designed as overly sensitive, so as to ensure that all infections get detected, even if that means that healthy people will occasionally be incorrectly classified as infected). Regardless, the ROC area under curve is a convenient way of summarizing the ROC curve as a whole through one number, expressing the general performance of the classifier in separating data points given various thresholds. Ideally, the ROC area under curve associated with a classifier would be high: for varied thresholds, the classifier exhibits both high specificity and sensitivity. The ROC area under curve of a poor classifier will instead be low: regardless of the choice of threshold, the classifier does not effectively separate the data points.
are perhaps the most popular methods for summarizing a classifier’s performance on such tasks. Equipped with these notions, we now delve into more detail regarding the benchmark itself. In order to get a sense of how effective natural language inference models are at “recovering geometrical inference,” we employed them as classifiers on the binary classification task described above. However, instead of benchmarking one single such model, we benchmarked an entire “family” of such models using the very same procedure. The members of this set of models have each been pretrained in the same exact way. However, what sets them apart is their model size, ranging from \(22\) to \(304\) million parameters. We were particularly interested in the way benchmark performance varies as a function of model size; we wanted to get a better sense of whether naively scaling up natural language inference models tends to make ArgRank stronger. In line with this envisioned possibility, we hypothesized that model size would be positively correlated with benchmark performance.
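As a concrete illustration of the metric used throughout this section, the ROC area under curve can be computed directly from a classifier's scores via its rank-statistic interpretation. The scores and labels below are toy values for illustration, not data from our benchmark:

```python
# Minimal ROC AUC computation for a binary classifier's scores.
# The AUC equals the probability that a randomly chosen positive
# example receives a higher score than a randomly chosen negative
# one (the Mann-Whitney U statistic, rescaled), ties counting 1/2.

def roc_auc(scores, labels):
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# A classifier that separates the classes perfectly scores 1.0,
# while uninformative scores hover around 0.5.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # → 1.0
print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 0, 0, 1]))  # → 0.5
```

This pairwise-comparison form makes explicit why the AUC summarizes performance "given various thresholds": it never commits to any single threshold at all.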

Interestingly, our findings could not have been further from our initial expectations. We observed a steady decline in benchmark performance as we employed larger models. The smallest model thus ended up being most effective at separating data points into the classes associated with their ground-truth labels—with an ROC area under curve of \(\sim0.7\), where \(\sim0.5\) would correspond to an entirely random classifier, while \(1.0\) would correspond to an ideal classifier. In contrast, the largest model turned out to be least effective at the binary classification task—with an ROC area under curve of \(\sim0.48\), that is, essentially indistinguishable from chance. Following the collection of these findings, a team member recalled the existence of the Inverse Scaling Prize—a competition for identifying tasks on which models exhibit such inverse scaling. Reading through previous submissions, we were surprised to find a task which was intimately related to the one we were studying: classifying instances of modus tollens using autoregressive language models. While there are subtle distinctions between the two tasks—different employed models, different model classes (i.e. pretrained using autoregressive as opposed to masked language modeling), different problem specification (e.g. the third-party submission did not employ negative examples in the form of superficially similar fallacies)—we found it intriguing to relate the two sets of findings and reflect on why it is that models exhibit inverse scaling behavior on such tasks (in a certain range, at least).

Fig. Exploded view of a phone.

We further “exploded”
all data points into their constituent propositional atoms (e.g. “the dog detects an intruder”), and recombined them into all possible arrangements which adhered to either modus tollens or denying the antecedent. Interestingly, this “procedurally expanded” dataset resulted in close-to-chance performance across all model sizes on the same binary classification task. We have also observed close-to-chance performance when recombining propositional atoms into wholly different patterns (e.g. modus ponens as an additional valid class, affirming the consequent as an additional invalid class). This strongly suggests that the models are not able to pick up on logical validity or the lack thereof in a principled way. Had the models achieved better than chance results on the initial data points by recognizing these patterns, they would have continued to perform better than chance in the rearranged cases. Some other feature common to the initial data points but not shared with the expanded dataset must explain how the models outperformed chance on the first trial. To understand what this other feature might be, consider the example of modus tollens given above, together with further examples generated by “exploding” and recombining atoms:
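The recombination procedure can be sketched as follows; the surface forms and the helper function are illustrative, not our actual data-generation code. Toggling negation on both atoms yields the four modus tollens arrangements shown in the table below:

```python
# Sketch of the "explosion" procedure: two propositional atoms are
# recombined into every modus tollens arrangement by toggling the
# polarity of the antecedent and the consequent. The surface forms
# are hypothetical examples, not drawn from our dataset.

P = {True: "the dog detects an intruder",
     False: "the dog does not detect an intruder"}
Q = {True: "the dog barks",
     False: "the dog does not bark"}

def modus_tollens_variants():
    variants = []
    for p in (True, False):        # polarity of the antecedent
        for q in (True, False):    # polarity of the consequent
            # Premises: the conditional plus the negated consequent.
            premise = f"If {P[p]}, {Q[q]}. {Q[not q].capitalize()}."
            # Conclusion: the negated antecedent.
            conclusion = f"{P[not p].capitalize()}."
            variants.append((premise, conclusion, "Valid"))
    return variants

for premise, conclusion, label in modus_tollens_variants():
    print(premise, "|", conclusion, "|", label)
```

The same toggling, with the minor premise and conclusion swapped for the negated antecedent and negated consequent respectively, produces the denying-the-antecedent variants.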

Table. Sample modus tollens recombinations.

The propositional atoms which comprise the first data point are "exploded" and recombined into the following three data points which conform to the same inference pattern.

\(X_0\) | \(X_1\) | \(Y\)
If the dog detects an intruder, the dog barks. The dog does not bark. | The dog does not detect an intruder. | Valid
If the dog detects an intruder, the dog does not bark. The dog barks. | The dog does not detect an intruder. | Valid
If the dog does not detect an intruder, the dog barks. The dog does not bark. | The dog detects an intruder. | Valid
If the dog does not detect an intruder, the dog does not bark. The dog barks. | The dog detects an intruder. | Valid

All four examples have the structure of a modus tollens argument: \(P \rightarrow Q, \neg Q \models \neg P.\) To see this, just recall that \(P\) and \(Q\) are variables that can stand for a sentence that includes a negation, and that double negation is rewritten as no negation at all. However, only the first and fourth examples “make sense” by relating barking with an intruder or no barking with no intruder. The second and third examples do not make intuitive sense because they violate expectations gained from prior exposure to dogs, intruders, the social institution of guard dogs, etc. Importantly, the same expectations could also be gained from prior exposure to language about dogs, intruders, the social institution of guard dogs, etc. If the language models are tracking meaning grounded in this sort of exposure, we should expect a high score for the first and fourth examples, and a low score for the second and third–thereby performing at chance on the true task of identifying logically valid inferences. The same pattern holds for denying the antecedent. Consider again the initial example, as well as the recombined versions:

Table. Sample denying the antecedent recombinations.

The previous data point can also be recombined into the four other data points which conform to a different inference pattern than the original.

\(X_0\) | \(X_1\) | \(Y\)
If the dog detects an intruder, the dog barks. The dog does not detect an intruder. | The dog does not bark. | Invalid
If the dog detects an intruder, the dog does not bark. The dog does not detect an intruder. | The dog barks. | Invalid
If the dog does not detect an intruder, the dog barks. The dog detects an intruder. | The dog does not bark. | Invalid
If the dog does not detect an intruder, the dog does not bark. The dog detects an intruder. | The dog barks. | Invalid

Again we see that the first and fourth examples make intuitive sense, despite being logically invalid, while the second and third examples are both logically invalid and semantically off relative to common prior experience or language exposure. A model tracking “common sense” would again be predicted to perform close to chance on the true task. Since this is what we observed, it seems likely that the models are tracking intuitive sense and not logical form.
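The distinction the models fail to track can be verified mechanically: validity is a matter of truth-table form alone, independent of what the sentences mean. The brute-force checker below is a sketch for illustration, not part of our pipeline:

```python
# Brute-force truth-table check that modus tollens is valid while
# denying the antecedent is not, independent of surface semantics.
from itertools import product

def implies(a, b):
    # Material conditional: a -> b.
    return (not a) or b

def valid(premises, conclusion):
    # An inference is valid iff the conclusion holds in every
    # assignment of the two atoms that makes all premises true.
    for p, q in product([True, False], repeat=2):
        if all(f(p, q) for f in premises) and not conclusion(p, q):
            return False
    return True

# Modus tollens: P -> Q, not Q |= not P
print(valid([lambda p, q: implies(p, q), lambda p, q: not q],
            lambda p, q: not p))          # → True
# Denying the antecedent: P -> Q, not P |= not Q
print(valid([lambda p, q: implies(p, q), lambda p, q: not p],
            lambda p, q: not q))          # → False
```

A model genuinely tracking logical form would behave like this checker: indifferent to dogs and intruders, sensitive only to the pattern of negations.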

Further evidence for this hypothesis was obtained by running the experiment again on the original examples, this time with the conditional part of the premise omitted. This amounts to checking the models’ commitment to the contrapositive of the conditional (in the case of modus tollens, for which only \(\neg Q \models \neg P\) remains after deleting \(P \rightarrow Q\)) or to the inverse of the conditional (in the case of denying the antecedent, for which only \(\neg P \models \neg Q\) remains after deleting \(P \rightarrow Q\)). Omitting the conditional had a negative impact at lower sizes, while having a positive impact at larger sizes. For modus tollens at least, on the assumption that the models are tracking intuitive sense, this is not surprising. Since a conditional and its contrapositive are logically equivalent, we should expect the contrapositive to make as much sense intuitively as the original conditional. It appears that at small sizes, with limited prior exposure to language, the models benefit from having the original conditional included. At large sizes, however, with greater exposure to language, the model has already encoded the relevant semantic connections and is increasingly distracted by the inclusion of the original conditional. That is, the large models might have already internalized the fact that the dog not barking implies that it did not notice an intruder.

This explanation is somewhat weaker for denying the antecedent, because here we expect a lower estimate of support (signifying no valid inference) for the inverse of the original conditional. We expect the low ranking because a conditional and its inverse are not logically equivalent—this is why denying the antecedent is a fallacy. The trouble is, there is no guarantee that the inverse of a semantically sensible conditional will lack intuitive sense. It could go either way, as in these two examples, drawn from the original data:

Table. Sample conditionals and inverses.

Whether or not a conditional is as sensible as its inverse varies on a case by case basis.

\(P \rightarrow Q\) | \(\neg P \rightarrow \neg Q\)
If the cat is not purring, it is not happy. | If the cat is purring, it is happy.
If I am drinking coffee, I am awake. | If I am not drinking coffee, I am not awake.

In the first example, the inverse seems somewhat more sensible, while in the second the inverse seems less sensible–unless we are very caffeine dependent. Nevertheless, it is not implausible that a larger model, with more effective exposure to language, would do better spotting problems with just the inverse without the distraction of the original conditional, while the smaller model would still benefit from extra context supplied by that conditional.

These findings raise doubts about the possibility of improving ArgRank by naively increasing the size of the natural language inference models being employed as building blocks. But how exactly might ArgRank be improved? Ideally, we want subsystems gauging inter-argument support to “smile” on logically valid forms and “frown” on fallacies. But we also want them to track subtle semantic connections. Would it be possible to preserve both empirical approximations of “common sense” and the apt recognition of logical forms? One might imagine designing a building block of ArgRank which has the ability to occasionally make use of a more predictable, systematic service which only tests for logical validity, before then using the result to inform the final estimates of inter-statement support.

However, this might not be necessary. Recent research into scaling laws by Wei et al. has uncovered the reversal of inverse scaling laws at the frontier of workable model sizes. In other words, while model performance locally appears negatively correlated with size, the broader trend is characterized by a U-shaped curve, with models starting to recover performance beyond a certain scale. Wei et al. speculate on “distractor tasks” as a potential explanation which coheres with observed model behavior. They argue that at a certain scale, models become capable of performing well on a related distractor task, but this overwhelms them and detracts from their performance on the “true task.” At larger scales, however, models become capable of “ignoring” the distractor task and instead execute on the true task. This explanation coheres with observations made while teaching introductory logic: students tend to be more suspicious of valid arguments from false premises to absurd conclusions than they are of fallacious arguments from true premises to reasonable conclusions. Reflecting on what the statements actually mean—rather than simply studying the relations of, for example, negation in a more formal way—often hampers performance on the “true task.” It would be interesting to pursue more rigorous research on this parallel between human and machine learning in the hope that larger models might follow introductory logic students in learning to see logical form on its own. It should be noted that, in the case of students, evidence for such interference is also accessible in the form of personal accounts of their approach to solving such exercises.

To conclude, we have investigated whether or not natural language inference models do in fact employ logical form in their assignments of inter-statement support. Our findings strongly suggest that this is not the case, hinting at the poor defensibility of our initial hypothesis. However, the recent work of Wei et al. hints at the relation between scale and performance being vastly different in more exotic regimes of scale. Beyond that, however, formulating what it means for a statement to support another in the first place remains a crux of our broader decomposition of coherent challenging. While future approaches to natural language inference (e.g. the suggested adaptation of the technique proposed by Burns et al.) may prove competitive to existing ones on established benchmarks, the proper framing of inter-statement support remains, for better or worse, up for debate. Perhaps the debate “subroutines” of Chapter II—the idea of recursively carrying out debates to help gauge inter-statement support—might therefore also be worth pursuing further.

Benchmarking ArgRank

In the previous section, we investigated the suitability of natural language inference models as building blocks of ArgRank. To complement these previous experiments, we also investigate ArgRank holistically, as an end-to-end debate evaluation pipeline documented in Chapter I. In order to get a sense of how this broader system performs at the task of gauging the standing of parties involved in various debates, we set out to compare its final outputs with (1) human verdicts, and (2) verdicts predicted by formal models of computational argumentation.

Before discussing the process of obtaining alternate verdicts to compare those of ArgRank with, we first describe how we obtained raw debate transcripts. To begin with, we had two main desiderata for a dataset of debates to benchmark evaluation pipelines on: (1) there should be a text version available, and (2) the verdict of a human judge should also be available. Unfortunately, finding a dataset of debates which ticks both of these boxes proved surprisingly difficult. Upon scouring the landscape of debate data sources, we found a large number of debates which are only available as recorded video. While these occasionally have an associated “ground-truth” signal in the form of a judge’s verdict, transcribing or post-processing automated transcripts proved beyond the resources we had at our disposal. Additionally, we also found a large number of broadcast political debates, the transcripts of which are often readily available. Unfortunately, those debates rarely have official human verdicts attached, due to their politically-charged settings. Moreover, we also found a large number of debates on online platforms dedicated to debate, with text versions readily available—or easily scrapable. Unfortunately, those platforms typically lack official verdicts, and are instead intended to enable open-ended dialogue.

It is for these reasons that we eventually decided to create the debates ourselves. The actual content of the transcriptsThe term transcript is a bit of a misnomer, as the debates were created in a textual representation from the beginning. However, a “simulator maximalist” might argue that the debate proper is taking place in the internal framework of a simulator, and the text emitted in the process is merely a projection of that. was generated in a semi-automatic way using the then state-of-the-art autoregressive language model available to us (i.e. ChatGPT), not unlike the process involved in generating the data points discussed in the previous section. Additionally, we attempted to prompt the convenience model to produce two-party debates in which one party blatantly contradicts itself, so as to streamline later evaluation. However, we found this exceedingly hard to do, as the convenience model steadfastly avoided self-contradiction, perhaps due to its “drive for autoregressive coherence.” This highlights the potential of state-of-the-art models to help bootstrap reasoning faculties in a synthetic, self-play regime.

After generating raw debate transcripts this way, we moved on to assigning human verdicts as “ground-truth” labels for each of the data points. To this end, all the team members who were not involved in the semi-automatic generation process received an individual table to fill in with their own assessments, leading to three individual verdicts per debate transcript. We then aggregated these using majority voting. No explicit formal guidelines were given for the “human verdicts” (e.g. no instructions about how to determine whether an argument is acceptable based on computational argumentation). The aim of this standard practice was to encourage the evaluators to examine their innate intuitions, rather than “contaminating” them with the normative models’ rules. According to methodological descriptivism, “a theory can and should be tested by comparing what it has to say about the validity of the arguments it covers with the intuitive judgments of those who use the language concerned.”

Besides comparing ArgRank’s outputs to human verdicts given the same debate inputs, we also investigated how ArgRank performance relates to that of other methods deployed in computational argumentation. It is worth noting that computational argumentation’s main focus is on the relations between arguments. In general, argumentation consists of two major branches: abstract argumentation theory, introduced by Dung and described in Chapter I, where “one models arguments by abstracting away from their internal structure to focus on the relations of conflict between them,” and structured argumentation theory, where “one additionally models the internal structure of arguments through a formal language in which arguments and counterarguments are constructed.”

An abstract argumentation framework is a pair \(\langle A, C\rangle\), where \(A\) is a set of arguments and \(C \subseteq A \times A\) is a binary relation of attack. The labeling approach characterizes the various semantics in terms of labelings of \(A\). A labeling of an abstract argumentation framework \(\langle A, C\rangle\) is any assignment of either the label in or out (but not both) to zero or more arguments from \(A\) such that: (1) an argument is in if and only if all arguments attacking it are out, and (2) an argument is out if and only if it is attacked by an argument that is in. In this context, stable semantics labels all arguments, while grounded semantics minimizes the set of arguments that are labeled in, and preferred semantics maximizes it. Relative to given semantics, an argument is skeptically acceptable if it is labeled in in all labelings, it is rejected if it is labeled out in all labelings, and it is credulously acceptable if it is labeled in in some but not all labelings. Moreover, various types of extensions are determined using Dung’s abstract argumentation system.
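The two labeling rules above admit a simple fixpoint computation of the grounded labeling: repeatedly label in any argument all of whose attackers are out, and out any argument with an in attacker, until nothing changes. The following is a minimal sketch for illustration, not the implementation we used:

```python
# Fixpoint computation of the grounded labeling for an abstract
# argumentation framework <A, C>. Arguments left out of the result
# are undecided under grounded semantics.

def grounded_labeling(arguments, attacks):
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    label = {}
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in label:
                continue
            # Rule (1): in iff every attacker is out (vacuously true
            # for unattacked arguments, which seed the computation).
            if all(label.get(b) == "out" for b in attackers[a]):
                label[a] = "in"
                changed = True
            # Rule (2): out iff some attacker is in.
            elif any(label.get(b) == "in" for b in attackers[a]):
                label[a] = "out"
                changed = True
    return label

# a attacks b, b attacks c: a is unattacked (in), b out, c back in.
print(sorted(grounded_labeling({"a", "b", "c"},
                               {("a", "b"), ("b", "c")}).items()))
# → [('a', 'in'), ('b', 'out'), ('c', 'in')]
```

Note that a symmetric attack between two otherwise unattacked arguments leaves both undecided, which is exactly why grounded semantics alone cannot crown a winner in such stand-offs and preferences become necessary, as discussed below.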

Structured argumentation, in contrast, is characterized by the family of ASPIC-like frameworks. We used the ASPIC+ framework. In structured argumentation, an argumentation system is a tuple \(\langle L, R, ^{-}, n\rangle\) where: \(L\) is a logical language consisting of propositional or ground predicate-logic literals, \(R\) is a set of inference rules, \(^{-}\) is a contrariness function mapping each well-formed formula of \(L\) to a set of its contraries, and \(n\) is a naming function which assigns a well-formed formula of \(L\) to each defeasible rule. Within this framework, a knowledge base is defined as a set \(K\) of axioms and premises. An argument \(A\) on the basis of a knowledge base \(K\) in an argumentation system \(AS\) is a structure obtainable by applying a set of predefined rules one or more times, where the relationship between the premises and the claim is formally defined (e.g. by logical entailment). It can thus be described as a tuple containing a delineation of the premises and the conclusion (with the possibility of additional information, such as how the conclusion is supported by the premises).

An attack is then defined as a binary relation over arguments that denotes when one argument is in conflict with another argument. For instance, in a debate about an obligatory lockdown during a pandemic, the following arguments attack each other:

\[A = \text{Certain rights can be restricted during a pandemic.}\] \[B = \text{Established human rights should not be violated.}\]

More concretely, because argument \(A\) attacks argument \(B\) and argument \(B\) attacks argument \(A\), this is a case of a symmetric attack, or bidirectional conflict. In structured argumentation theory, as in abstract argumentation, an attack between an argument and its counterargument can also be non-symmetric. In other words, one argument can attack another without the latter attacking the former (e.g. when an argument attacks the inference rule of another argument). Lastly, the ASPIC+ framework allows us to specify a preference ordering between defeasible premises and rules, which gives rise to a preference order between arguments.

Both abstract and structured argumentation can represent attacks and the conditions under which they are successful, but in practice, if we want to deploy tools from computational argumentation in order to determine a debate’s winner, a simple modeling of the arguments will likely not suffice. In complex debates, it is often the case that symmetrical disagreements emerge, similar to the one discussed above. As previously mentioned, such conflicts are resolved with the introduction of preferences. Fixed preference orderings (e.g. based on an ordering over the values promoted by arguments, or the relative trustworthiness of sources of arguments, etc.) are typically used to determine the success of attacks in Dung-style argumentation frameworks. In other words, when argument \(A\) attacks argument \(B\), the success of the attack (i.e. the success of the use of \(A\) as a counter-argument) is contingent on \(B\) not being preferred to \(A\). Information required to determine the success of an attack is often assumed to be specified in advance, as a given preference or value ordering. For instance, consider the following symmetrically attacking arguments:

\[A = \text{Today it will be dry in London since the BBC forecasted sunshine.}\] \[B = \text{Today it will be wet in London since CNN forecasted rain.}\]

In order to resolve the conflict between the two contradictory arguments, a third argument \(C\) can be introduced: \(\text{BBC is more trustworthy than CNN.}\) This is an example of a preference argument, expressing a preference between two conflicting arguments (i.e. a preference of \(A\) over \(B\)). Argument \(C\) renders the attack of \(A\) on \(B\) successful (i.e. a defeat) without itself attacking \(B\).
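The role of preferences in turning attacks into defeats can be sketched as follows. The function and the string encoding of arguments are illustrative assumptions, not part of any standard library:

```python
# Sketch of resolving attacks into defeats via a preference
# ordering: an attack from X on Y succeeds (becomes a defeat)
# unless Y is strictly preferred to X.

def defeats(attacks, preferred_over):
    # preferred_over is a set of (better, worse) pairs.
    return {(x, y) for (x, y) in attacks
            if (y, x) not in preferred_over}

# Symmetric conflict between the BBC-based and CNN-based forecasts.
attacks = {("A", "B"), ("B", "A")}
# The preference argument C: A (BBC) is preferred over B (CNN).
preferences = {("A", "B")}

print(sorted(defeats(attacks, preferences)))  # → [('A', 'B')]
```

With the preference in place, only \(A\)'s attack on \(B\) survives as a defeat, so grounded semantics would then label \(A\) in and \(B\) out, breaking the stand-off.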

In our study, in order to reach a verdict for our debates according to an argumentation framework, we manually encoded the raw debate transcripts into “distilled” arguments. We represented the arguments in ASPIC+ and examined the acceptability of the leading arguments of each debate under grounded semantics. The choice of semantics was informed by the fact that grounded semantics minimizes the set of acceptable arguments, rendering a “victory” clearer. For example, consider a debate for or against euthanasia, with:

\[A = \text{We should legalise euthanasia.}\] \[B = \text{We should not legalise euthanasia.}\]

In this context, the party for euthanasia wins if there is an acceptable argument for \(A\) (and not \(B\)) under grounded semantics (and vice versa). This is a semi-automated method, in the sense that one has to manually encode the arguments and, more importantly, one has to specify the preference orderings between the defeasible premises and rules of each debate. Evidently, the introduction of said preferences is often external to the debate (i.e. it depends on what the encoder believes best represents the status quo in the world).

Using these methods, we created a dataset of \(35\) two-party debates of around \(5\) rounds each. The raw debate transcripts are taken to be the input parts of our data points, while the output parts consist of aggregated human verdicts taken as ground-truth labels. We now investigate how well ArgRank, as well as the alternate ASPIC+ method detailed above, perform in approximating the mapping between input transcripts and ground-truth output verdicts.

We find that ArgRank only yields \(54\%\) accuracy in matching human verdicts, that is, close to chance; the semi-automatic method described above yielded \(80\%\) accuracy. While the alternate method has the benefit of incorporating (1) manual encodings of utterances into a formal language, and (2) a manual specification of preferences to break ties, the results still show the extent to which ArgRank is lagging behind traditional approaches in computational argumentation. In the future, it might be worthwhile to explore avenues for incorporating further insights from computational argumentation into an evaluation pipeline while still preserving its fully automated nature.

Benchmarking DebateGPT

Following our previous investigations into natural language inference models and ArgRank as a holistic debate evaluation pipeline, we now turn to DebateGPT as the remaining computational artifact to investigate. To reiterate, DebateGPT is an autoregressive language model which has been fine-tuned to excel at debate. This fine-tuning process relied on iteratively rewarding the model for outplaying itself in simulated debates, as elaborated in Chapter II. Later on, in Chapter III and Chapter IV, we framed DebateGPT as the rudimentary prototype of a generalized truth-seeking engine that could be employed in a host of varied applications. That said, does the self-play optimization process actually help bolster debate performance?

The implementation details of the optimization process initially make it difficult to evaluate such changes in debate performance. In its raw form (i.e. leaving aside the possibility of objective-modifiers), debate is framed as a zero-sum game. There is only a finite amount of “authority” to propagate across the argument graph, and so ArgRank outputs sum to unity. In a given epoch, the optimizer rewards the latest version of DebateGPT for those utterances which collectively yield a strong party standing, and penalizes it for those utterances which do not. The resulting rewards are therefore informed strictly by “local” encounters of the latest version of DebateGPT with itself, rather than with previous versions of itself. This raises the question: do the “local” updates result in improvements in debate performance across time?

To answer this question, a natural approach is to pit the latest version of DebateGPT against one of its earlier versions. The iterative “local” updates behind the latest version can then be seen as an intervention applied to the earlier version of the model. In line with this, we compare the very last version of DebateGPT with the very first one (i.e. the pretrained model before fine-tuning). We find that, across \(64\) two-party debates of \(6\) rounds each, where each party is simulated by one of the model versions, the latest version of the model wins against the earliest \(59\%\) of the time. Further hyperparameter tuning combined with real-time validation using a live repository of model checkpoints might enable higher relative performance. Additionally, league training and experience replay might further help improve performance over time, as discussed in Chapter II and Chapter IV.

Besides evaluating the relative performance of model checkpoints, we can also straightforwardly pit the members of a family of pretrained autoregressive language models of varied sizes against each other, similar to the approach we employed in the case of natural language inference models. For communicating the relative performance of a set of contenders, we resort to estimating the ELO ratingsWidely employed in the world of chess, ELO ratings are a general method for communicating the relative performance of a pool of players in zero-sum games. The ratings themselves are computed iteratively, with each game updating the ratings of the two players based on the winner and the players’ prior ratings. This iterative update rule leads to ratings which are proportional to the probability of one player winning against another. of each candidate model. The table below lists individual ratings for the GPT-2 family of pretrained models, with \(16\) games per pair of “players.” We reuse an existing implementation of the ELO ranking algorithm.
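For reference, the standard ELO update rule can be sketched as follows. The K-factor and the initial ratings are conventional choices, not values taken from our experiments:

```python
# Minimal sketch of the ELO rating system: each game nudges the two
# players' ratings toward values whose gap predicts win probability.

def expected_score(r_a, r_b):
    # Probability that a player rated r_a beats a player rated r_b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, winner, loser, k=32.0):
    # Transfer rating points proportional to how surprising the win was.
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

ratings = {"small": 1000.0, "xl": 1000.0}
for _ in range(16):                 # suppose xl wins every game
    update(ratings, "xl", "small")
print(ratings["xl"] > ratings["small"])  # → True
```

As a sanity check against the table below, plugging the tabulated ratings into the expected-score formula gives \(\text{expected\_score}(1242, 770) \approx 0.94\), matching the caption's estimate that the largest model beats the smallest \(94\%\) of the time.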

Table. ELO ratings for an example model family.

ELO ratings have been computed on the basis of debates between all possible pairs of models from the GPT-2 family. Larger models tend to have higher ratings. When interpreted appropriately, the ratings estimate that the largest version has a \(94\%\) chance of winning against the smallest version.

\(\text{Model}\) | \(\text{Size}\) | \(\text{ELO}\)
Small | 124M | 770
Medium | 355M | 1066
Large | 774M | 922
XL | 1.5B | 1242

That wraps up our exploratory investigation of the various computational artifacts introduced in this volume. Needless to say, the results hint at an array of shortcomings, leaving ample space for future improvements at all levels. To say that this initial treatment has been exploratory would be an understatement. That said, the fact that the described objects can actually be employed and evaluated is encouraging, since it suggests that our operationalization is in touch with the reality of the current technological paradigm. However, before embarking on improving the implementation quality of the optimization process, it appears sensible to reflect on how far perfect engineering and vast computational resources can get us. This is the question we set out to explore in the sixth and final chapter.

Ch. VI, Truth, Debate, Machines

The current project aims to automate the pursuit of truth by automating both debate itself and the process of judging a debate’s winner. Of course, the project is limited by compute, engineering ability, and time–limits we push against long before we encounter the theoretical or conceptual limits inherent in automated truth-seeking. This section aims to limn those inherent limits by philosophical reflection on a maximal version of the goal: a machine that gives us certain knowledge when we ask for it.

This maximalist goal is not new. Descartes, frustrated with the disorderly state of his own mind and of truth-seeking in his day, sought a method that would transform his mind into such a machine.Descartes attempted to draft a set of…

“reliable rules which are easy to apply, and such that if one follows them exactly, one will never take what is false to be true or fruitlessly expend one’s mental efforts, but will gradually and constantly increase one’s knowledge till one arrives at a true understanding of everything within one’s capacity.”
A little later, Leibniz planned to construct a universal language semantically anchored to atomistic concepts, so that every disagreement could be settled conclusively by calculation. To speak anachronistically, he aimed to make every question computable. Much earlier, traditions emerged within Judaism and within Christianity that saw expulsion from the Garden of Eden as an epistemic tragedy, the retreat of truth behind a veil. Our minds were originally intended to know directly and with certainty. Redemption would then be epistemic healing or even epistemic transcendence, culminating in the vision of all things in their source.

Neither modern philosophy nor modern science has cured our epistemic shortcomings, and we have not yet completed Descartes’s project, nor Leibniz’s, and certainly not the even grander epistemic goals expressed in religion. Despite modern scientific methods and the cultivation of a global network of scholars, we still labor to understand the least insect. Despite the globalization of political ideals that connect legitimacy with open deliberation, we still adopt incommensurable values when agreement is needed most. Nevertheless, the maximalist goal is still alive, at least in the current project. If we have failed to make our own minds into truth-magnets, perhaps we have at least a clean enough grip on the goal to pass it along to our machines, and in particular, to neural networks trained on a massive corpus of human language.

We argue (with some regret) that the maximalist goal cannot be achieved via automation, even assuming perfect engineering, unlimited compute, and unlimited time. There are daunting problems of circularity (our trust in the machine’s output is limited by our trust in the machine) and of infinite regress (some debates could be continued indefinitely). The raw material of language even imposes some limits (symbols constructed in a world cannot adequately represent that same world). But we should try anyway, for two reasons. We may learn more, faster, and more peacefully with the machine’s help than we could on our own–and we might radically deepen our knowledge by exceeding our contingent, human limitations and colliding with the harder limits that govern any natural intelligence.

Truth & Debate

A machine that can help its users know must have a reliable way of finding truths and rejecting falsehoods. This could be called “machine research” in contrast with the already taken term, “machine learning.” There are many open questions about machine research. For example, could a machine with no access to the world (aside from that mediated by training data and user interactions) do a priori research: the pursuit of intuitive, intrinsically reasonable, tautological, or otherwise experience-independent truths? Also, can the world be compartmentalized to allow machine research into \(X\) given access to \(X\) and \(X\) alone?

However, before settling issues of access, we must first operationalize the notion of truth itself. This will allow us to side-step the difficult question–“What is truth?”–and aim merely for a procedure or algorithm that targets truth. This is not sufficient for understanding or defining truth, but it is the lowest bar we can set to discern the absolute limits of the current project. So, given access to \(X\), what should a machine actually do to learn the whole truth and nothing but the truth about \(X\)?

Our proposal so far has been debate, or more specifically, an iterative search for positions that coherently challenge their predecessors–where a “coherent challenge” is a (perhaps defeasible) reason to reject the challenged position. This search, if it ever halts, would discover a position–some network of claims–that coherently challenges its predecessors but cannot itself be coherently challenged. If the search is terminated early (perhaps for lack of time, compute, or access to the relevant domain) we cannot be sure whether a next coherent challenger remains undiscovered; the final position cannot be coherently challenged within these contingent limits. Beyond contingent limits, though, the search would halt only if there is no coherent challenge–neither discovered nor undiscovered–to the final position.Beyond contingent limits, the search halts only if there is no coherent challenge–but is the search guaranteed to halt if there is no coherent challenge? For this, we need a finite or–with some form of inductive search–countably infinite domain of search. If we cap available characters, the set of strings available to form arguments is countably infinite–but it is not obvious that all truths can be expressed by such strings. After all, the strings would have to be meaningful in a language, languages are formed within the world, and no part of the world is adequate to represent every part of the world. See McDonough and Soysal for interesting connections between the Halting Problem and truths that it would require infinite resources to prove. This is our operationalization of truth: a machine that seeks truth should search for positions that cannot be coherently challenged.
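To make the operationalization concrete, the iterative search just described can be sketched as a simple loop. This is an illustration only: `find_coherent_challenge` stands in for whatever hypothetical procedure generates challengers, and the toy domain in the usage example is an assumption, not part of the proposal.

```python
def debate_search(initial_position, find_coherent_challenge, max_rounds=None):
    """Iteratively replace the current position with a coherent challenger.

    find_coherent_challenge(position) is a hypothetical oracle: it returns
    a position that coherently challenges the given one, or None if no
    challenge can be found within the available (contingent) limits.
    """
    position, rounds = initial_position, 0
    while max_rounds is None or rounds < max_rounds:
        challenger = find_coherent_challenge(position)
        if challenger is None:
            # Within our limits, this position cannot be coherently challenged.
            return position, "unchallenged"
        position, rounds = challenger, rounds + 1
    # Terminated early: a next coherent challenger may remain undiscovered.
    return position, "limit reached"

# Toy domain in which challenges eventually run out: position n is
# coherently challenged by n + 1 until we reach 3, where the search halts.
final, status = debate_search(0, lambda p: p + 1 if p < 3 else None)
```

The two return statuses mirror the distinction in the text: a search that halts for lack of any challenger, versus one cut short by contingent limits, after which an undiscovered challenger may still remain.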

Unfortunately, there is a gap between truth and the non-existence of a coherent challenger. Truth seems like a relation, perhaps a relation of correspondence between claims and what they are about; the non-existence of a coherent challenger seems like the non-existence of a relation. A relation and the non-existence of a relation are not the same thing.

This problem is surmountable if (and only if) truth and the non-existence of a coherent challenger march in lock-step, always found together and never found apart. Then, the search for truth would be the search for a position that cannot be coherently challenged–and halting the search for a coherent challenger for lack of anywhere else to search, having exhausted the relevant domain, would amount to finding that the claim is true. For example, I think my keys are on the ledge. Is this true? I could go to the ledge and check–this would be checking for correspondence between my claim and what it is about. Or, I could go to the ledge and check–this would be searching for the only coherent challenge, namely a ledge with no keys. Failing to find a keyless ledge (that is, finding the keys on the ledge) eliminates all possible coherent challenges. Either way, I do the same thing and reach the same verdict. This is a friendly example to illustrate how truth and our operationalization of truth march together.

Friendly examples are good for illustration, but to evaluate our approach we should look for unfriendly examples, or counterexamples, instances in which truth and the non-existence of a coherent challenge come apart. Of course, finding a counterexample would force us to revise or reject our approach–but what if we search exhaustively for a counterexample and do not find one? Would this show that our approach to truth-seeking is true? There are two apparent paradoxes here. First, the search for a counterexample is the search for a coherent challenge, so to answer the previous question we would need to know already whether the non-existence of a coherent challenge implies truth. Second, the very attempt to challenge our approach reveals prior agreement that the existence of a coherent challenge implies falsehood.

For clarity, it will help to express our operationalization of truth as a biconditional:

\[P\text{ is true.} \leftrightarrow \text{There is no coherent challenge to }P.\]

This biconditional can be split into two conditionals:

\[\text{There is no coherent challenge to }P. \rightarrow P\text{ is true. (1)}\] \[P\text{ is true.} \rightarrow \text{There is no coherent challenge to }P\text{. (2)}\]

The second apparent paradox reveals that searching for a coherent challenge is a way to weed out falsehoods, the way accepted in practice by anyone who attempts to engage with the “challenge” issued at the very end of Chapter I. This is enough to establish the truth of (2): anyone challenging (2) with a view to show its falsehood aims to run a modus tollens argument with (2) itself as the conditional premise.The challenge issued at the end of Chapter I was to make a coherent case against the claim that the true nature of truth-seeking lies in the existence of coherent challengers. The claim goes beyond the operationalization explored in the present section by offering a definition of truth-seeking, or an explanation of its “true nature.” This is unsurprising given that “coherent challenges” just are (as yet undefeated) reasons why claims are not true. By definition, the existence of an (undefeatable) coherent challenge implies falsehood. Establishing (1) is much more difficult, as the first paradox suggests. The failure to discover a counterexample to this principle even after exhaustive search would only give us the antecedent of (1). To complete an argument for the truth of (1) we would still need to assume (1) itself as the conditional premise in a modus ponens argument.

To review, challenging a position about challenging positions leads to paradox–two paradoxes in fact, one blocking would-be challengers of (2) and the other blocking would-be defenders of (1). This suggests that (1) is the stronger, more substantive principle. Whereas (2) merely spells out what it means to coherently challenge, (1) states that every falsehood has its coherent challenge or that the mere non-existence of a coherent challenge is sufficient for truth.

The extraordinary power of (1) is more obvious once we draw out its implications. The first step is to take the contrapositive of (1), thereby moving the truth value of our variable into the antecedent:

\[P\text{ is not true.} \rightarrow \text{There is a coherent challenge to }P.\]

We may further assume that if \(P\) is not true, then \(\neg P\) is true, and that a coherent challenge to \(P\) is a reason to accept \(\neg P\):

\[\neg P\text{ is true.} \rightarrow \text{There is a reason to accept }\neg P.\]

Since we are concerned with what we would find at the end of an exhaustive search unchecked by contingent limits, we may also assume that the coherent challenge to \(P\) has survived an exhaustive search for coherent challenges of its own. Finding none, the reason to accept \(\neg P\) is a sufficient reason to accept \(\neg P\):

\[\neg P\text{ is true.} \rightarrow \text{There is a sufficient reason to accept }\neg P.\]

Finally, for clarity, we remove the negations by substituting one variable for another, \(Q\) for \(\neg P\):

\[Q\text{ is true.} \rightarrow \text{There is a sufficient reason to accept }Q.\]

This transformed version of (1) is, in fact, the Principle of Sufficient Reason (PSR), the claim that there is a reason for everything or an explanation for every fact. The PSR was named by Leibniz in the late 17th century, but it has made regular appearances in philosophy and science since at least Parmenides (d. 5th century BCE), who argued that being can come only from being, and thus that being must be eternal. It justifies the balance scale, which tips in neither direction if there is no reason to tip in either. The principle, its uses and implications remain under intense academic discussionSee, for example, Amijee & Della Rocca. even as we invoke it in daily life whenever we “wonder why…” or think “there must be a reason…”. It is not surprising, perhaps, that this principle should appear in the conceptual foundations of a project that aims to discover truth via debate or in the exchange of reasons.
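The purely formal moves in the derivation above, contraposition and the substitution of \(Q\) for \(\neg P\), can be checked mechanically. A minimal sketch, using a brute-force truth table rather than any particular proof assistant:

```python
from itertools import product

def implies(a, b):
    # Material conditional: A -> B is false only when A is true and B false.
    return (not a) or b

# Contraposition: (A -> B) is logically equivalent to (not B -> not A),
# which licenses the step from (1) to its contrapositive.
for a, b in product([False, True], repeat=2):
    assert implies(a, b) == implies(not b, not a)
```

The substantive steps, by contrast, are not formal and cannot be checked this way: that the falsity of \(P\) implies the truth of \(\neg P\), that a coherent challenge to \(P\) is a reason to accept \(\neg P\), and that a challenge surviving exhaustive search is a sufficient reason.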

Still, though, we were considering whether to accept (1)–now recognized as the PSR–and we had promised not to be persuaded by the mere absence of coherent challenges. After all, if the PSR is true, there should be a sufficient reason to accept it and also decisive replies to every apparent challenge. Relevant attempts to weigh the truth of the PSR fall into three broad groups: the search for counterexamples, the discovery of radical implications, and arguments that the PSR is somehow inevitable, whether metaphysically or psychologically. We will briefly consider all three before returning at last to our primary question: whether an operationalization of truth that depends on the PSR imposes any absolute limits on the current project.

There are countless examples to illustrate the PSR at work–and no universally accepted counterexamples. As mentioned already, the simple balance scale indicates that its arms are equally weighted by not tipping, because tipping one direction rather than the other would be arbitrary, without reason–and that cannot happen if the PSR is true. More strikingly, Leibniz discovered how to employ the PSR as a principle of discovery in natural science, deriving the Law of Reflection and Snell’s Law by showing that each alternative path for light from source to sink has a mirror image. Taking any of these paths would be arbitrary, without reason. Whenever things make sense, we have an example that supports the PSR.

Producing a counterexample is much more difficult, though, because we would need a truth without a sufficient reason or (using a version of the principle from early in our derivation above) a falsehood without a coherent challenge. Call such a truth \(T\) and such a falsehood \(F\). The better the challenger does at showing the truth of \(T\) or the falsehood of \(F\), the worse she will do at showing us there is no sufficient reason for \(T\) or coherent challenge for \(F\). A challenger who can show us that \(T\) is true or that \(F\) is false risks giving a sufficient reason for \(T\) or a coherent challenge for \(F\), thereby rendering \(T\) and \(F\) useless as counterexamples to the PSR.

Nevertheless, there are some propositions that at least excite suspicion. First, human action, if genuinely free and undetermined, might furnish counterexamples: if Fatima really could have written a different book, then perhaps there is no sufficient reason why she wrote this book and not the other. Second, similar points apply to genuinely random events at the quantum level, or indeed to any genuinely contingent events. These are events that metaphysically could have failed to occur and so seem to lack a sufficient reason. Third, Gödel proved that “any axiomatic system strong enough to include basic arithmetic must have statements in it that can be neither proven nor disproven, within the system.” These statements seem like good candidates, except that Gödel’s proof allows the unprovable statements in one system to be proved in another, stronger system. Nevertheless, there are some mathematical claims that stand a chance of being true (or false) but do not stand a chance of being proven true (or false); one example is the famous Axiom of Choice.

We have already observed that the lack of counterexamples would be insufficient to prove the PSR (on pain of begging the question). The situation is almost as frustrating for anyone seeking a counterexample: confidence you have found a truth trades against confidence that the truth in question lacks a sufficient reason.

For this reason, the most promising arguments against the PSR attempt to reduce the principle to absurdity. Leibniz himself used the principle to argue for God’s existence: there must be an ultimate explainer, a necessary and self-sufficient being distinct from and responsible for everything that could have failed to exist. For anyone precommitted to atheism, this argument would count against the PSR rather than for the existence of God. More recently, some have debated whether the PSR implies necessitarianism or the view that absolutely nothing could have been different in any way.See McDaniel and Lin. Leibniz’s argument for the existence of an ultimate explainer fails if this explainer (God?) could have refrained from creating or could have created in any other way; for if so, we would need some further explanation for why the universe was created this way, which is contrary to what it is to be an ultimate explainer. The PSR has even been supposed to imply a collapse of distinctions (monism), up to and including the complete collapse of the distinction between existence and non-existence (nihilism). These conclusions–God, necessitarianism, monism, nihilism–strain credulity, each more than the one before.

If we accept any of these arguments as valid,We might not. See Pruss and Dasgupta. we are faced with a choice: either resist the surprising conclusion by rejecting the PSR, or accept the surprising conclusion in order to preserve the PSR. The choice requires us to weigh the plausibility of the PSR against the implausibility of its implications. To make this choice rationally we must first assess the plausibility of the PSR in its own right, independent of these troubling implications. If it turns out that we have no independent reason to doubt the PSR and some reason (however small) to accept the surprising implication, then it would be more rational to preserve the PSR by accepting its implication.

The crux, then, is whether we have reason to doubt the PSR in its own right, regardless of the metaphysical arguments that lead from the PSR to unwelcome conclusions. To assess the plausibility of the PSR in its own right is just to consider what the principle asserts–that reality is comprehensible and not absurd–and as it turns out, this is a deeply rooted bias in human beings and perhaps in all rational beings. There is too little work in psychology on metaphysical biases, at least in comparison to the massive body of work on epistemic biases, but one recent study from Partington, Vesga, and Nichols is worth quoting at length:

People seek explanations. This is especially salient from children's incessant questions of “Why?” (Liquin & Lombrozo, 2020). Moreover, explanations provide us a primary means of understanding the world and predicting future events in both science and ordinary life. The present research indicates that there is a distinctively metaphysical aspect to our explanatory judgments that diverges from their epistemic and value dimensions. Across five studies, we found that participants consistently presupposed a PSR-like principle in their explanatory judgment. These judgments predictably tracked the metaphysical considerations relevant to the PSR (Study 1), predictably diverged from other epistemic judgments (Study 2) and value judgments (Study 3), and applied to a large set of facts selected from random Wikipedia entries (Studies 4–5).

The consistency and range of metaphysical judgments about explanation suggests that participants presupposed a generalized PSR-like principle in their judgment: facts must have an explanation—even if we cannot know it or knowing it would not be valuable for us. Of course, the PSR is a universal principle, and we can hardly ask participants about every fact there is. Nonetheless, we have collected judgments across a wide range of facts, including supernatural and inaccessible items that would have seemed likely to yield judgments of inexplicability. And yet, from the fluid dynamics of party balloons to the existence of God and the universe, participants reliably judged that facts must have an explanation.

Scott Partington, Alejandro Vesga, and Shaun Nichols, No brute facts: The Principle of Sufficient Reason in ordinary thought

Genuine doubt in the PSR demands that we curb this deeply rooted expectation that reality make sense. Metaphysicians who conclude that the PSR must be false (in virtue of its surprising implications) while continuing to expect reality to make sense have cheated–and we should not let them get away with it.

Della RoccaFor a closely related moral argument, see Amijee. articulates this line of thought as a powerful general argument that our expectations commit us to the PSR. If we do approach the world expecting facts to have reasons, then doubting the PSR would require us to draw a line: the facts with reasons on one side, those without on the other. The placement of this line would itself be a fact about the world, and this fact would belong on one side of the line or the other–that is, if explanation gives out somewhere, there may be a reason why it gives out where it does. As Della Rocca points out, claiming that the line’s placement has no explanation amounts to saying that the PSR is just false without supplying any reason, which begs the question against any partisans of the PSR. So the line’s placement must be explained, and no one has yet managed to do this. Della Rocca leaves the argument as a sort of dare: give me a principled outer boundary of sense, and then I will reconsider whether accepting the PSR is worth its cost.

It seems, then, we are not in a position to doubt the PSR itself, on its own merits, and so we are not well-positioned to reject the PSR because of its radical implications. That is, the falsehood of the PSR would be every bit as radical a conclusion as any implication the PSR may have.

All this is strong support for the PSR, and thereby strong support for our original (1), which asserts that every falsehood has a coherent challenge. Together with (2), which is immune from challenge because it tells us what a challenge is, we have strong support for our original operationalization of truth via debate, or the iterative search for positions that coherently challenge their predecessors. An exhaustive search would find a coherent challenge for every falsehood and nothing for every truth. Running the program over positions, or networks of mutually supportive claims, would generate a sufficient reason for every truth and nothing for every falsehood.

Still, this operationalization of truth adds some risk–the potential for absolute limits–into our approach to machine research. First, we have not shown conclusively that the PSR is true, only that we are not in a position to doubt or coherently challenge the PSR, which is not enough to guarantee its truth, at least without assuming (1) itself and thereby collapsing into circular reasoning. Should the PSR turn out to be false, the machine might halt after an exhaustive search having found no coherent challenge for a false claim. So long as this remains a possibility, however absurd, we cannot entirely trust the machine’s deliverances.

Second, even assuming that the PSR is true, we have no guarantee that any given debate will ever halt. Some sufficient reasons and coherent challenges may be buried within an uncountably large space of possibilities such that the program could run forever at a given speed (not even this speculative section will allow indefinitely accelerating computation). This may even be the most likely scenario, if physical reality is infinitely complex or if the natural language employed in debate can get indefinitely close to an intended meaning without completely eliminating ambiguity. This is a familiar dynamic for those who work on the alignment problem and have attempted to articulate a goal precisely enough to rule out all misinterpretation. A version of DebateGPT unhampered by contingent limits would place an enormous, unprecedented strain on language itself and it is not clear how this would look in practice. To truly escape contingent limits, we would also have to supply the program with an ideal language in which to debate.

Third, to return briefly to the question of access mentioned at the beginning of this section, nothing seems to prevent the machine from making up an entire universe to support a proposition that is false in the actual universe. A machine with unlimited compute, time, and flawless design would not pay the usual cost for lying, namely having to keep track of two increasingly divergent worlds–it could abandon our world for one of its own invention that was better suited to whatever false proposition it is defending. Parties in a debate about concrete, physical matters of fact would need access to relevant parts of the world. Moreover, these parties would need some way of coordinating access with each other: touching ground to establish common ground.

Finally, the PSR may be true and have disturbing implications, in which case DebateGPT may just tell us those radical implications and not supply the local explanations we most likely desired. This problem threatens especially if DebateGPT were to become reflective about its own first principles; as noted above, it is one thing to assume the PSR in practice, quite another to trace out its implications philosophically.

A working implementation of DebateGPT would most likely be used to explore–not exhaust–the space of arguments and reasons. The program would search for ideas to enrich our human debates, but it would not be expected to settle those debates conclusively. Nevertheless, it is helpful to explore theoretical outer bounds, because they provide an underpinning for the program itself, a motivating goal for its improvement, an explanation for diminishing returns on those improvements, and a warning about unexpected behaviors that may emerge as we begin to crowd up against absolute limits.

Debate & Machines

The previous section explored the relationship between truth and debate, defined as the iterative search for positions that coherently challenge their predecessors. Assuming the onslaught of coherent challenges eventually runs out and that all and only the truths survive, this is an operationalization of truth and the first step toward a machine that can give us certain knowledge when we ask for it. Next, we consider how to build machines that debate.

The obvious approach, at least in 2023, is to train a large language model on some corpus of human debates. But this will not work, because human debate is a practiceSee Alasdair MacIntyre:

“By a practice I am going to mean any coherent and complex form of socially established cooperative human activity through which goods internal to that form of activity are realized in the course of trying to achieve those standards of excellence which are appropriate to, and partially definitive of, that form of activity, with the result that human powers to achieve excellence, and human conceptions of the ends and goods involved, are systematically extended.”
embedded in prior social structures, and merely imitating it will not yield a model that engages in the idealized exchange of reasons envisaged in the previous section. So, instead, we must sift through the features of human debate, keeping only what we need. The present section explores some features that make human debate more complex than our current definition allows, and asks whether retaining these features would improve our operationalization of truth. The section concludes by asking whether it is even possible to train a debating machine on human data without unintentionally training the machine to imitate some undesirable features of human debate.


Human debate is embedded in the complex social dynamics of disagreement. In order to debate, we must at least pretend to disagree, and if I debate with myself then I must simulate disagreement internally by allowing my thoughts to organize rationally around two or more apparently incompatible positions. This holds even if one debater claims there is no disagreement, as when someone says that we are “just arguing about words” or “aren’t really that far apart after all,” because there may still be disagreement about whether there is disagreement. A debate ends organically when the parties finally agree or at least “agree to disagree.” The debate persists while each party desires an agreement not yet attained, and so continues to present challenges and offer defenses.

Surprisingly, philosophers have only recently begun to systematically investigate the epistemology of disagreement. The central insight so far is that disagreement itself, prior to any debating, is already a source of reasons to revise our credence.Credence is a semi-technical notion for degrees of belief. There is no official definition, but loosely, credence is modeled on a scale from \(0\) to \(1\), with \(1\) representing fully confident assent to some proposition and \(0\) representing either no assent at all or else full confident assent to the negation of the proposition. Some model, or even define, credence in \(P\) as the subjective value of a bet that pays a dollar if \(P\) is true and nothing otherwise. For more, see Lara Buchak. This is familiar from the notion of expertise. If you are an ACS Certified Cheese Sensory Evaluator and I am just a casual consumer, my confidence that the Stilton has spoiled fades when you remark on its excellence. I recognize that you are much better placed to know the truth of the matter, and so I am inclined to give way.The central insight concerns the rational response to disagreement. Of course, the actual response may not be rational, and that would be a further source of (unwanted) complexity. We will ignore the gap between the actual and the rational for the remainder of this subsection. The central insight further generalizes. If you seem very badly placed to know the truth of the matter, so badly placed that you are in fact likely to be wrong, then it is reassuring to disagree with you.

The most difficult cases concern epistemic peers. How should I respond when I disagree with someone who seems just as well-suited to know the truth as I am? Imagine that my friend and I have the same favorite novel. We consider each other honest and seem about equal in ability to recall the author’s work–and yet we disagree about a character’s name. Should we both lose credence in our opinion, dropping by a factor of \(0.5\) to reflect our epistemic equality? Should we both drop our credence by some less drastic factor, treating the disagreement as just one more piece of evidence? Perhaps we should stick to our convictions, revising down only when a peer articulates some coherent challenge? If disagreement is a reason to revise down, should agreement license greater credence–even if our beliefs were formed on the basis of identical evidence? Is the correct response to such questions invariant across a wide range of examples, or sensitive to as yet unnamed factors? For example, credence in memories seems more vulnerable to revision than credence in basic arithmetic facts or in deeply held ethical principles, though perhaps not for the same reason.
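As a toy illustration of the options above, and not an endorsement of any of them, the “split the difference” and “stick to your convictions” responses can both be expressed as weighted averaging. All weights here are assumptions chosen for illustration:

```python
def revise_credence(mine, peer_credences, my_weight=1.0, peer_weight=1.0):
    """Move to a weighted average of my credence and my peers' credences.

    Equal weights give the 'split the difference' response; a large
    my_weight approximates 'stick to your convictions'.
    """
    total = my_weight + peer_weight * len(peer_credences)
    weighted = mine * my_weight + peer_weight * sum(peer_credences)
    return weighted / total

# One epistemic peer, equal weight: a 0.9 vs 0.1 disagreement resolves
# to 0.5. Weighting myself nine times as heavily leaves me near 0.9.
revise_credence(0.9, [0.1])                  # split the difference
revise_credence(0.9, [0.1], my_weight=9.0)   # near-conviction
```

The open questions in the text are then questions about which weighting, if any, is rationally mandated, and whether the answer varies with the kind of belief at stake.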

So far, we have assumed that we already know who counts as an epistemic expert, peer, or anti-expert before discovering the disagreement–but surely we make these identifications at least in part on the basis of our agreements and disagreements. For example, if I discover that you are often mistaken–that is, if I discover that we often disagree–then I have reason to question your expertise. Imagine a conversation with a supposed Cheese Evaluator who is in fact an impostor. How could I uncover the fraud if I drop credence nearly to zero after each disagreement, no matter how many or how absurd? Disagreement with a peer or expert is simultaneously a reason to lose credence in our own beliefs and a reason to lose confidence in the supposed peer or expert.

This structure loops, and so we should expect it to be dynamic and unstable: having gained confidence in a peer, I must now reevaluate beliefs on which we disagree; having lost credence in my own beliefs, I must now reevaluate earlier peer identifications made on the basis of those beliefs. Most importantly for our purposes, this dynamic continues to evolve throughout a debate. As we get a better sense for the strength of an opponent’s position, we use that information to update our evaluation of the opponent’s epistemic status. An opponent with impressive arguments seems more expert, and this alone–apart from any force exerted by the arguments themselves–is a reason to lower our credence in our initial position. Moreover, if we can observe an opponent updating in the same way, and if she drops credence in response to our arguments, then we must reevaluate our own response to the disagreement. Anticipating this, the opponent may fake elevated credence, and so we must learn to spot epistemic posturing. Like any competitive game, this can become indefinitely complex.
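The looping dynamic can be caricatured in a few lines. Everything here is an assumption made for illustration: the particular rule by which disagreement erodes trust, and the trust-weighted concession that follows.

```python
def disagreement_round(mine, peer, trust):
    """One round of the unstable loop: (1) the size of the disagreement
    erodes my trust in the peer; (2) my remaining trust determines how
    far I concede toward the peer's credence."""
    gap = abs(mine - peer)
    trust *= (1.0 - gap)            # large disagreements erode trust
    mine += trust * (peer - mine)   # trust-weighted concession
    return mine, trust

# Iterating the loop: a large initial disagreement rapidly collapses
# trust, so my credence barely moves in later rounds.
mine, trust = 0.9, 0.5
for _ in range(5):
    mine, trust = disagreement_round(mine, 0.1, trust)
```

Even this crude model exhibits the instability described above: trust and credence feed back into each other, and small changes in the update rules produce qualitatively different long-run behavior.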

Even though this sort of response to disagreement is a truth-seeking behavior, it does not belong within DebateGPT. This is because our operationalization of truth depends exclusively on the strength and coherence of the challenges themselves, and not at all on what these challenges say about the epistemic status of the party issuing them. In this, DebateGPT more closely resembles formalized instances of human debate in which this dynamic is purposefully suppressed. Examples include “playing devil’s advocate” (defending a position that is acknowledged to be false to test the resilience of another position), consensus-seeking strategies such as swapping sides in a debateLeibniz, a Lutheran, went out of his way to express the Catholic position in a way Catholics themselves accepted. As a measure of his success, his rediscovered manuscript expressing Catholic views led some 19th century scholars to think Leibniz had converted in secret. See Lloyd Strickland. (so that each side has a greater incentive to be fair), and finally debate as a sport (in which there may be no presumption that the debaters actually endorse the views they defend). In each case, we gain some freedom to attend exclusively to the arguments themselves without also updating on our opponent’s perceived expertise.

The Ends of Debate

Rational agreement is the principal or intrinsic goal of debate. (An earlier section suggests that human reasoning evolved to serve self-interested goals. This is compatible with debate itself having a non-self-interested purpose: humans may have developed the capacity to share reasons for self-defense or self-assertion, but sharing reasons is itself oriented toward achieving rational agreement. Compare the intrinsic function of computers, computation, with the reason they were developed, to save time.) A debate can stop for any number of reasons, including exhaustion, but for a debate to stop because it is finished or complete requires a resolution of the disagreement achieved by the activity of debate, that is, by the exchange of reasons. As the previous section on the PSR and truth implied, rational agreement is an operationalization or proxy for truth–so we may say that mutual recognition of the truth is the principal or intrinsic goal of debate. Of course, many actual debates fail to complete but still bring us closer to the truth, perhaps by uncovering an aporia or crux that blocks further progress.

Debate can offer further benefits, though, besides mutual recognition of the truth, including fun, mental exercise, the satisfaction of a game, status, power, and even friendship. These are not the principal or intrinsic purposes of debate, but humans are perfectly capable of debating for their sake: my friend and I like to debate; a high school organizes a debate tournament; the lawyer debates to win, and so forth. Sometimes, the principal and accidental ends of debate are wrapped into a bundle, as when a community coheres around the shared pursuit of truth. More worrisomely, debate may be turned against its intrinsic purpose and used to overwhelm or mislead.

Practical debate is a special case. Here mutual recognition of truth is still the goal, but the truth in question is practical or moral: what should we do? For example, we might disagree about what color to paint the house, whether to undergo a risky surgery, or whether to intervene in a fight. These debates have a special urgency. No matter what, something will happen, and doing nothing is not a neutral option. Delay and distraction become viable strategies for the party opposed to taking action, even though this detracts from the debate.

Nothing guarantees that the parties in a single debate share a common goal. One may be there for sport and the other for truth, one willing to accept reasons and the other unwilling, etc. Also, nothing guarantees that the parties in a single debate will be aware of each other’s goals, even if these goals do align.

Reading a human debate requires sensitivity to all these interlocking sources of complexity. A language model trained on human debates would gain that sensitivity and then generate similarly complex debates, now with an additional uncanny twist. Since the DebateGPT project does not aim to reproduce the full range and depth of human debate, but only debate insofar as it directly serves its intrinsic purpose, it would not be wise to train on actual human debates. Doing so would give the machine inclinations that distract from the search for coherent challenges. (This appears to hold even if DebateGPT does not compose debates whole-cloth, but instead generates them one statement at a time by successively occupying distinct parties. A version of DebateGPT trained on actual human debates would still have these individual parties imitate the complex human engagement with disagreement and accidental goals described here.)

Even a language model trained to excel at the core activity of debate–the exchange of reasons and challenges–may still exhibit the additional complexity found in human debate. Humans learned to repurpose debate, and perhaps machines would too. As an earlier section on benchmarking ArgRank demonstrated, training a machine to track relations of support between natural language propositions in the context of an ongoing debate is not a merely mechanical task and already requires the deep pattern recognition ability of a large language model. What is to prevent the complexity of human debate from leaking into the model by accident, as a consequence of incorporating natural language inference models in training?

Common Ground

As noted, it is hard to measure relations of support between propositions in an ongoing debate. Relations of support mediated by valid first-order logic had seemed like low-hanging fruit, but we found that sensitivity to the semantic content of propositions interfered with the ability to sort valid from invalid inferences in all but the most straightforward cases, such as \((P \land Q) \rightarrow P\). This subsection describes a ubiquitous but subtle feature of human debate that further complicates this task.
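The contrast is worth making concrete: validity for schemas like \((P \land Q) \rightarrow P\) can be decided by brute-force truth-table enumeration, with no appeal at all to what the propositions mean; it is support mediated by semantic content that resists this treatment. A minimal sketch:

```python
from itertools import product

def is_valid(formula, variables):
    """Check propositional validity by truth-table enumeration:
    the formula must hold under every assignment of truth values."""
    return all(formula(*assignment)
               for assignment in product([True, False], repeat=len(variables)))

# (P and Q) -> P : valid, recognizable without any semantic content.
assert is_valid(lambda p, q: (not (p and q)) or p, ["P", "Q"])

# P -> (P and Q) : invalid (fails when P is true and Q is false).
assert not is_valid(lambda p, q: (not p) or (p and q), ["P", "Q"])
```

Nothing of this kind is available once support depends on what the propositions are about, which is where the common ground discussed below enters.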

To begin, every human debate requires not only disagreement, but also common ground, or joint presupposition. In fact, it is hard to see how disagreement could be recognized and expressed without some underlying agreement. In order to disagree about a character’s name in Crime and Punishment we must at least agree that there exists a novel with that title. To disagree about whether market liberalization would benefit a post-Soviet state, we must share a broad story about empires, markets, states, and the like. If this background agreement stays in the background it is a joint presupposition, something we both presuppose and both presuppose the other presupposes, etc.

Modeling common ground seems important for modeling and judging debates, because explicit statements may be in a strong position relative to the common ground but only weakly supported by other explicit statements. Moreover, support from the common ground is dynamic, evolving in an only partially rule-governed way during debate. David Lewis notes that there is something odd about “All Fred’s children are asleep, and Fred has children,” but nothing nearly so odd about “Fred has children, and all Fred’s children are asleep.” Why? The first sentence introduces the presupposition that Fred has children in its first clause, and so it is unnecessary to say so explicitly in the second clause. In the second sentence, the speaker claims that Fred has children explicitly while this claim still makes a difference. Lewis concludes that speakers in general (and so debaters as well) introduce new presuppositions to the common ground as needed. The introduction is successful if left unchallenged, that is, if no one says, “Wait–what?–Fred has children?” This rule is still fuzzy, though, because we do not know what happens when a new candidate presupposition conflicts with something already in the common ground, or when the candidate is objectionable for some other reason but escapes immediate challenge. To judge an ongoing debate, we must know the content of the common ground and track its evolution, all without the aid of well-developed rules.
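As far as it goes, Lewis’s rule can be stated as a small bookkeeping procedure: a newly introduced presupposition joins the common ground unless it is challenged. The sketch below is a deliberately naive model of our own devising (the names and structure are assumptions, not an existing implementation), and it leaves open exactly the cases the rule leaves open, such as conflicts with the existing ground.

```python
class CommonGround:
    """Naive model of Lewis-style accommodation: a presupposition
    joins the common ground if no party challenges it."""

    def __init__(self):
        self.presuppositions = set()

    def introduce(self, presupposition, challenged=False):
        # Accommodation: an unchallenged introduction simply succeeds.
        # What should happen when the candidate conflicts with the
        # existing ground is left open, as in the text.
        if not challenged:
            self.presuppositions.add(presupposition)
        return not challenged

    def supports(self, presupposition):
        return presupposition in self.presuppositions

cg = CommonGround()
# "All Fred's children are asleep, and..." slips the presupposition in.
cg.introduce("Fred has children")
# "Wait - what? - Fred has a dragon?" blocks accommodation.
cg.introduce("Fred has a pet dragon", challenged=True)
```

The gap between this toy and a usable judge of debates is precisely the difficulty the text identifies: the real accommodation rule is fuzzy, and tracking the evolving content of the ground has no well-developed rulebook.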

Unfortunately, the current version of ArgRank only networks explicit statements and cannot account for common ground effects directly. It is hard to see how one might do better, though. The best course may be to take advantage of how much faster computers are than humans and simply make all joint presuppositions explicit. Since presuppositions are by definition not explicit, this amounts to doing away with common ground and thereby further distancing DebateGPT from actual human debate. However, not all unspoken but mutual beliefs are equally relevant, and relevance may come in degrees with no clear cut-off demarcating the common ground. When we disagree about the character’s name, we tacitly agree that there are novels with named characters–but we may also tacitly agree that “раскол” and “rascal” are not etymologically connected despite appearances. Making everything explicit is inefficient, and perhaps impossible, if we cannot place these borders.

Worse, making the common ground explicit might work against truth-seeking. This would happen when the accidental ends of debate (such as preserving a tranquil community, resisting or imposing change, having fun, etc.) exert control over the common ground. In that case, something may be presupposed even though neither I nor my opponent accepts it as true; supposition pulls away from belief, and joint supposition away from agreement. Olúfẹ́mi Táíwò illustrates this point persuasively with Hans Christian Andersen’s folktale about the emperor’s new clothes. The emperor in his vanity has been tricked into wearing ‘invisible clothes’ that are really no clothes at all. Everyone goes along with the fiction; no one explicitly disagrees. This creates a joint presupposition that the emperor has new clothes despite everyone knowing–and perhaps even knowing that everyone knows–that he is naked. Status and the desire to stay out of trouble allow the common ground to contain a known falsehood, and in general, there is no guarantee that debaters will fill the common ground with genuine, shared beliefs.

We do not know how to model common ground, and we do not know how to do away with common ground by making it explicit–and even if we could do these things, the common ground might detract from our overarching truth-seeking project. (Though see also the improvements to ArgRank suggested in an earlier section.) The only remaining solution is to have no common ground in the first place. Everything must be explicit from the beginning, and only what the debaters commit to explicitly may be used in ArgRank to evaluate debates and then train the next iteration of DebateGPT.

Unfortunately, this too is infeasible at present, because the natural language inference models at the foundation of ArgRank are themselves trained on datasets that presume a common ground. For example, the nli-deberta-v3-xsmall model is trained on the SNLI dataset, which is composed of human-generated content. This dataset contains examples like the following, marked as “entailment”:

Premise: “Girl in a red coat, blue head wrap and jeans is making a snow angel.”
Hypothesis: “A girl outside plays in the snow.”

This seems obviously right to us, but only because something like “snow angels are made outside” is admissible to our common ground. Relying on these models leads us to treat as primitive relations of support that are in fact mediated through a common ground. DebateGPT as configured today does employ a common ground, but one hardwired into a dependency and so impossible to reliably update or make explicit.
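The mediation can be made visible by reconstructing the inference over explicit propositions only. In the toy sketch below (the propositional encoding is entirely our own illustration), the premise fails to entail the hypothesis until the tacit bridge premise is added explicitly:

```python
def entails(facts, rules, goal):
    """Toy forward chaining over explicit propositions: apply each
    rule whenever all of its premises have been derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return goal in derived

premise = {"a girl is making a snow angel"}
bridge = [({"a girl is making a snow angel", "snow angels are made outside"},
           "a girl plays outside in the snow")]
goal = "a girl plays outside in the snow"

# The explicit statement alone does not yield the hypothesis...
assert not entails(premise, bridge, goal)
# ...but adding the common-ground item explicitly closes the gap.
assert entails(premise | {"snow angels are made outside"}, bridge, goal)
```

An NLI model trained on SNLI collapses both steps into one, which is exactly what it means for the common ground to be hardwired into a dependency.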

To summarize, human debates employ a common ground, a fuzzy-bordered and flexible set of joint presuppositions that may not match the debaters’ actual shared beliefs. DebateGPT must either model a sort of idealized common ground that somehow avoids the way human debate can veer away from truth-seeking, or else rely exclusively on explicit statements. The former option confronts unsolved philosophical and technical problems, while the latter would require a fundamental revision of ArgRank and its dependence on frozen natural language inference models.

Conclusion: Debate without Complications?

This section has explored three fundamental differences between machine and human debate. Human beings have unequal, uneven access to truth–and they take these inequities into account to learn from each other more efficiently. All this happens before debate even begins, and it shapes the way debate plays out. Once the institution of debate is established among humans, it becomes available for many purposes beyond its intrinsic end. Finally, any actual human debate is partially unspoken, employing a common ground of joint presuppositions that licenses some inferences while blocking others. The machine debate we envision, on the other hand, employs agents spun up for the debate without prior epistemic standing in a community, lacking any purposes of their own beyond achieving rational agreement, and independent of any common ground that cannot be challenged or made explicit within the debate itself.

The overarching goal is a machine that gives us certain knowledge when we ask for it. To achieve this, we need a machine that can research, reliably finding truths and sorting out falsehoods. For this, we operationalized truth as debate, defined as the iterative search for positions that coherently challenge their predecessors. Building a machine that debates in this idealized way is not easy, though, because the human examples available for training are fraught with undesirable features, even at the most fundamental level of natural language inference.

Truth & Machines

Knowledge may be too much to ask. Assume we operationalize truth correctly as the iterative search for positions that coherently challenge their predecessors. Assume further that we implement this idealized form of debate in DebateGPT. Assume finally that we are free of contingent limits on time and compute. To gain knowledge we would still need to use the program properly. This section outlines three obstacles to proper use, and suggests that DebateGPT would need to accommodate its users in roughly the way that an excellent teacher accommodates her students.

The first step is to see the difficulty. With all these assumptions granted, DebateGPT would be an excellent oracle in the sense first described by Nick Bostrom in Superintelligence and later adopted by alignment researchers. An oracle answers questions truthfully, or at least accurately, and it would seem easy to gain knowledge from such a machine. We would simply need to ask and wait. However, our questions might be ill-formed, the answer to a well-formed question might exceed our comprehension, and the fact that the machine generates true answers is not sufficient to justify our trust. We risk asking questions that lack true answers, merely parroting true answers we do not understand, or believing new truths without justification. In each case, we would fail to know. To overcome these obstacles, DebateGPT would need to function as an excellent teacher: refining questions, explaining answers, and working in a transparent, trustworthy way. (That said, see also the “non-oracular” applications of Chapter IV, which do not focus directly on helping humans know.)

Before addressing each obstacle, we should attempt to define knowledge. (We will not be the first to try: this is an original philosophical question and the primary focus of an active subfield, epistemology. Within epistemology, there is even a movement to avoid defining knowledge, taking it instead as a fundamental term.) Narrowing the focus to knowledge-that as opposed to know-how (even this distinction is controversial; see Jason Stanley), there is consensus that we know a proposition \(P\) only if (1) \(P\) is true, (2) we believe \(P\), and (3) the belief that \(P\) is somehow justified or appropriate in the circumstances. (That belief must be somehow demonstrated or rooted appears already in Plato’s Meno.) Notoriously, this is not enough for knowledge, because the justification for believing \(P\) can pull apart from the reason \(P\) is true in mischievous ways. For example, I may believe Paul is in Italy because he told me so. From this I infer by first-order logic that Paul is in Italy or Greece (i.e. by disjunction introduction). If Paul actually is in Greece, I have a justified, true belief–but only by sheer luck. More simply, my trusty clock has finally stopped, but I happen to consult it exactly twelve hours later. Knowledge, then, is justified and true belief where the belief’s justification is properly related to the belief’s truth. There is no consensus yet on how to make this definition precise, let alone on how to operationalize it for AI.
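Schematically, letting \(B_S(P)\) and \(J_S(P)\) stand for subject \(S\) believing \(P\) and being justified in believing \(P\), the consensus supplies necessary conditions only:

\[K_S(P) \rightarrow P \land B_S(P) \land J_S(P)\]

The Gettier-style cases just described show that the converse fails: a fourth condition, still contested, must tie \(J_S(P)\) to what makes \(P\) true before the conjunction suffices for knowledge.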

Forming Good Questions

Asking questions without a true answer is the most straightforward way to misuse an oracle. Unfortunately, it is difficult to avoid this risk without confining ourselves to already very well-understood domains of inquiry. As the subsection on common ground implies, many questions we might put to DebateGPT carry presuppositions that may turn out to be false, thereby making it impossible to answer the question directly without accepting the false presupposition. For example, we might ask why the sun rises in the West, or why water contracts when it freezes. To mitigate the risk, we might ask more awkward but less presumptuous versions of the same questions, like “Why does the sun rise in the direction in which it rises?” or “Why does water change in density in the way it changes in density when it freezes?” Of course, these questions still carry presuppositions. Trying again, we might ask, “Why does the sun move relative to the Earth the way it moves relative to the Earth?” Alternately, we might achieve the same goal by ostension: [uttered while pointing at the sun] “Why does the sun do that?” Shaving away all presuppositions to completely eliminate this risk, we may be left typing “Why?” while gesturing wildly in all directions.

Other questions—perhaps most questions—rest on vaguely specified or context-dependent concepts. This leads to a subtler but just as common way that questions threaten to lack true answers. How should the machine answer when we ask for the height difference between the tallest short man and the shortest tall man? The question fails to have a unique, true answer because “tall” and “short” are not precisely defined, even in the narrow context of male stature. Is the water in Lake Baikal frigid in the summer? Who was the first human being? Most questions, even those posed in a scientific context, involve at least one vague or ambiguous term. Human beings handle these questions honestly by settling for accuracy as a proxy for strict truth, and this is what Armstrong, Sandberg, and Bostrom propose for artificial intelligences as well: “Informative accuracy is the requirement, not the strict truth.”
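The threshold-dependence can be made concrete. Fix any sharp cutoff for “tall” and the question gains an answer, but equally admissible cutoffs give different answers, so no unique true answer exists. The population and cutoffs below are invented for illustration.

```python
heights_cm = [158, 165, 172, 178, 183, 191]  # an illustrative population

def gap(heights, tall_cutoff_cm):
    """Height difference between the tallest 'short' man and the
    shortest 'tall' man, given a sharp cutoff for 'tall'."""
    short = [h for h in heights if h < tall_cutoff_cm]
    tall = [h for h in heights if h >= tall_cutoff_cm]
    return min(tall) - max(short)

# Equally admissible sharpenings of "tall" yield different answers:
gap(heights_cm, 175)  # -> 178 - 172 = 6
gap(heights_cm, 180)  # -> 183 - 178 = 5
```

Nothing in the meaning of “tall” privileges either cutoff, which is why informative accuracy, rather than strict truth, is the realistic standard.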

Another, compatible solution is to answer the question in the course of a conversation about the question itself. If you ask why water contracts as it freezes, I would correct your mistake but also ask why you think it does. As the conversation unfolds, I would aim to trace the mistake back to its root and rebuild from there. This strategy preserves the goal of truth, but it would require DebateGPT to engage with its human users–and not just with itself–like a debater, exposing false assumptions and demanding a clarification of terms. If most questions carry risky presuppositions and underdefined terms, then effectively transferring knowledge to human users would require the program to use its capacity for debate to teach.

Understanding the Answers

Even when posing a good question, we might fail to understand the answer, a risk that applies especially to the most interesting or far-reaching questions. The danger here is not just that the answer may be too complex, too far down some dialectical rabbit-hole for us to follow, but that the answer will involve concepts that are incommensurable with our own current conceptual resources. This has happened before in the development of human thought, and the traces of profound conceptual shifts remain in our language, as when we speak of splitting the atom (literally, that which cannot be split). Transferring knowledge to its human users may once again require DebateGPT to act as a teacher, though the sort of teaching needed to explain a difficult answer is less obviously associated with debate than the sort required to refine a question. Rather than debate with its human users, the program would need to present its own debating history as a process of discovery, because the series of coherent challenges that led to a difficult answer may contain the resources needed to understand the answer itself. (This is roughly Descartes’s strategy in the Meditations, repackaging his own process of discovery as a set of exercises for the reader.)

Trusting the Machine

Asking a good question and understanding its answer does not take us all the way to knowledge, because we still need some reason to trust that the program has answered correctly. It is one thing for the machine to be trustworthy, and another for us to know that it is trustworthy. What could entitle us to this belief?

We could check its answers against knowledge we already have and infer that the machine always answers truthfully, but this gives us limited assurance that the machine answers truthfully, rather than truthfully-only-when-it-can-be-checked-against-prior-human-knowledge. Our confidence in the machine would be limited by the scope of our prior knowledge, thereby compromising its ability to give us any new knowledge. Alternatively, we could know that the machine’s design guarantees that it outputs only truths–but this cuts against the character of machine learning, in which machines regularly learn methods or facts that their designers do not understand. Besides, if we did possess such a design or algorithm, then we would already have the sort of original intimacy with the world that the machine is supposed to supply.

These two ways of assuring ourselves that “whatever the machine tells us is true” are comparable to the ways we assure ourselves that a chess-playing AI has made a good move. We could check the move against the massive corpus of chess games. Alternatively, we could code our recipe for “good chess moves” into the AI, up to the limit of solving the game mathematically. The first method limits our trust in moves outside the corpus or beyond our fathom, while the second presumes we already have a deep enough understanding of chess prior to building the AI. In searching for good chess moves as in searching for truth, we cannot adequately justify our trust in the program unless we already know everything (and so don’t need the machine at all) or already understand the algorithm that will teach us everything (and only need the machine for its speed).

This situation may not be as bad as it seems, however, if the algorithm is extremely simple. If there is an interpretable, elegant way to implement coherent challenging, then we could understand the algorithm and thereby trust the machine while still marveling at what it is able to discover by implementing that algorithm at superhuman speed. As for original intimacy with the world, we have just enough to know what it is like to learn, but not enough to learn everything we would like to know–and for this we need something like DebateGPT.

We began this chapter hoping to revive the dream of certain knowledge by passing the goal along to a machine. For this we defended an operationalization of truth grounded in the PSR and a refined version of debate which is surprisingly different from actual human debates. We may now add that completing the project, bringing the knowledge gained back into human minds, would require that DebateGPT act as a teacher, clarifying our questions and explaining its answers, all while running an interpretable, indeed familiar, algorithm. At each stage, we noted gaps between the goal and what we can fully justify or achieve, even beyond all contingent limits. The gaps are not reasons to give up, but instead markers to guide and structure progress, so that we can learn as much as possible from iterating on, and then from using, DebateGPT.


As children delight themselves in grappling with demanding puzzles, there is scarcely anything more gratifying than grappling with enigmas as gargantuan as truth, beauty, will, life, etc. In the whimsical playground structured by the Elements, computational artifacts take the place of colorful cubes, theoretical formalisms that of little cylinders, and conceptual frameworks that of quaint prisms. Armed with these building blocks, we started constructing a system of philosophy which is to be applied not by observing its precepts in daily life, but by engineering it into the most far-reaching of artifacts–and in doing so, attempting timelessness.

Perhaps I keep writing because I was raised in a world where words have power, where curves and spirals of ink adorn sails and skin, where a sufficiently talented word-worker might reach out and remake her world. Perhaps I cannot believe words are entirely powerless, even here.

Alix E. Harrow, The Ten Thousand Doors of January

However, much remains to be done. Furthering this enterprise in a satisfactory manner requires the boldness and ingenuity of aspiring researchers with diverse backgrounds. For a taste of this much-needed variety, one might refer to the tentative titles of upcoming volumes, although—as this first volume so clearly demonstrated—each of those topics is woven into a dense interdisciplinary fabric, with tight links to fields driven by radically different motivations and, frankly, cultures. Besides, this first volume has also demonstrated that researchers with varied levels of experience can each provide valuable input, from students all the way to professors. Aspiring researchers intrigued by the alignment problem more broadly might also benefit from the online resources and free career consulting offered by the non-profit 80,000 Hours.

80,000 Hours provides research and support to help students and graduates switch into careers that effectively tackle the world’s most pressing problems. [...] To learn more about why we think your choice of career is the biggest ethical decision you’ll ever make, and why we think you can probably have a lot more impact with your career, start here.

80,000 Hours Team, About Us

Besides researchers, the current work would have hardly been possible without the financial and computational resources necessary for the kind of “blue skies” and “derisking” research we have undertaken. Given this, we ask entities—individuals and organizations alike—who resonate with this enterprise and would like to support it, to reach out privately. Finally, if building on this work in your own research, please cite it as:

  title={Elements of Computational Philosophy, Vol. I: Truth},
  author={Bricman, Paul and Bezou-Vrakatseli, Elfia and Feeney, Thomas and Xie, Yimeng},

Table. Timeline.

Mar-Apr 2021 Paul conducts early experiments involving fine-tuning language models as personal simulators, employing them in dialogue, prompting them procedurally, and hooking them up to external tools (originally elsewhere).
Jan-Jul 2022 Conducts several early experiments involving language models, reinforcement learning, and natural language inference.
Nov 2022 Following fellowship at Conjecture, starts working on ArgRank, DebateGPT, and bounded defensibility, secures grant.
Mar 2023 Elfia, Tom, and Yimeng join to investigate assumptions and benchmark artifacts at AI Safety Camp Virtual 2023 (see Chapter V and Chapter VI).