The Commitment Gap: Why LLMs Struggle with System 2 Reasoning and the Case for a Cognitive Kernel
Large Language Models (LLMs) have mastered fast, fluent, and intuitive generation. In the lexicon of cognitive science, this is the epitome of System 1: quick, associative pattern matching that yields highly plausible answers with zero explicit deliberation. However, as we push these models toward complex software engineering, scientific workflows, and long-horizon autonomous decision-making, a persistent failure mode emerges. Scaling up parameters and data sets improves average performance, but it fails to reliably produce the structured, reversible, and constraint-respecting reasoning characteristic of System 2.
This essay posits that this gap is not merely a symptom of insufficient model size or training data. It is a fundamental flaw in the structure of autoregressive inference. We must stop treating the LLM as a monolithic, infallible brain. Instead, we should pivot to treating it as a probabilistic proposer operating within a deterministic control loop—a system that enforces commitments, rigidly checks constraints, and actively manages working memory. I call this control loop the Cognitive Kernel.
1. The Fallacy of Incremental Commitment
Autoregressive generation factorizes language into a sequence of conditional predictions: the model selects each next token based purely on the current prefix. The problem is not limited memory (the model conditions on its entire history); it is that the decoding process demands incremental, local commitment.
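In symbols, this factorization is the standard chain rule over tokens:

\[p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t})\]

Each factor is sampled and frozen before the next is computed; the commitment, not the conditioning, is what is local.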
Once the model commits to a token, that decision becomes part of the context for every subsequent token. If an early decision is slightly off-course, the model faces a dilemma. It can explicitly correct itself, which requires admitting inconsistency while maintaining narrative coherence, or, more perniciously, it can manufacture a plausible continuation that retroactively justifies the initial error. In long-horizon tasks, the latter is usually the locally higher-probability path. This behavior is a root cause of hallucinations and “reasoning after the fact.”
We can define this localized failure as the Commitment Gap. LLM decoding commits locally and irreversibly, whereas true System 2 reasoning operates on the principle of deferred commitment, postponing final decisions until sufficient evidence has been collated and verified.
2. System 2 is Not a Prompt; It is a Process
System 2 is colloquially described as slow, effortful, and attention-demanding. But for engineering autonomous agents, the psychological metaphor is irrelevant; what matters is the process architecture.
Robust System 2-like behavior requires three core properties:
- It explicitly represents constraints and actively evaluates candidate actions against them.
- It can compare multiple alternative trajectories laterally before finalizing a decision.
- It retains the structural ability to revise or backtrack when empirical verification fails.
A naive LLM prompt guarantees none of these properties. Even sophisticated Chain-of-Thought (CoT) prompting merely encourages longer deliberation in text; it does not mechanically enforce verification or instantiate reversible search. Fluency increases, but correctness remains structurally unguaranteed.
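To make the three properties concrete, here is a minimal interface sketch in Python; the names (Constraint, DeliberativeProcess) are illustrative assumptions, not an existing library:

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence

@dataclass(frozen=True)
class Constraint:
    # Property 1: constraints are explicit and mechanically checkable.
    name: str
    check: Callable[[object], bool]

class DeliberativeProcess(Protocol):
    def choose(self, candidates: Sequence[object],
               constraints: Sequence[Constraint]) -> object:
        # Property 2: hold alternatives open and compare them laterally
        # before any commitment is made.
        ...

    def backtrack(self, checkpoint: int) -> None:
        # Property 3: unwind to an earlier decision point when
        # downstream verification fails.
        ...
```

A plain prompt satisfies this interface nowhere: there is no object on which `backtrack` could even be called.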
3. The Cognitive Kernel: A Minimal Architecture
The Cognitive Kernel is an OS-like control loop explicitly designed to decouple the act of proposing from the act of committing. Within this architecture, the LLM retains its undeniable value as a highly capable generator of hypotheses, plans, and candidate actions. The Kernel provides the missing deterministic structure: verification, active memory scheduling, and commitment gating.
At minimum, the Kernel consists of three architectural components:
3.1 The Shell: A Constrained Interface to the World
The Kernel prohibits the LLM from interacting with the environment through raw, unstructured text. Instead, the Shell exposes a strict taxonomy of operations. Tasks are rigidly framed as objectives and explicit constraints. Actions are serialized as explicit tool calls or state mutations. The Shell ensures that generative proposals are intercepted and evaluated in a standardized format, completely segregated from free-form narrative generation. This transcends prompt engineering; it is about defining a stable API for cognition.
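A minimal sketch of the Shell's contract, assuming proposals arrive as JSON tool calls (Task, ToolCall, and parse_proposal are hypothetical names, not a fixed API):

```python
import json
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class Task:
    # Tasks are framed as an objective plus explicit, checkable constraints.
    objective: str
    constraints: tuple[str, ...]

@dataclass(frozen=True)
class ToolCall:
    # Every action is a serialized tool call, never free-form narrative.
    tool: str
    arguments: dict[str, Any] = field(default_factory=dict)

def parse_proposal(raw: str) -> ToolCall:
    """Force an LLM proposal into the Shell's schema before it can act.
    Malformed output raises here and is rejected without side effects."""
    data = json.loads(raw)
    return ToolCall(tool=data["tool"], arguments=data.get("arguments", {}))
```

The design choice that matters is that the parse failure happens before any world interaction, not after.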
3.2 The Scheduler: Working Memory ≠ Context Window
The industry conflates massive context windows with “working memory”, but a passive buffer is not active cognition. A context window simply holds what has happened. Working memory is an active, selective process that dynamically decides what to retain, what to retrieve, and what to highlight.
The Kernel’s Scheduler curates which pieces of evidence enter the active context at any given execution step via dynamic retrieval, compression, or semantic indexing. The critical design choice is that context selection becomes an explicit, revisable action—not an incidental artifact of prior token generation.
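A minimal sketch of one such Scheduler step, assuming relevance scores already come from a retriever (Evidence and schedule_context are illustrative names):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    content: str
    relevance: float  # assumed to come from a retriever or embedding score
    tokens: int

def schedule_context(candidates: list[Evidence], budget_tokens: int) -> list[Evidence]:
    """Select evidence for the active context as an explicit, revisable action.
    Greedy by relevance under a token budget; compression or semantic
    indexing could replace this policy without changing the interface."""
    selected, used = [], 0
    for item in sorted(candidates, key=lambda e: e.relevance, reverse=True):
        if used + item.tokens <= budget_tokens:
            selected.append(item)
            used += item.tokens
    return selected
```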
3.3 The Verifier: The Proposal-Verification Loop
This is the beating heart of the system. The Verifier is a deterministic module that issues an explicit pass or fail signal for any proposed action, strictly anchored to declared constraints.
Implementation modalities vary by domain:
- Software Engineering: Running isolated unit tests, type checkers, and static analyzers.
- Mathematics: Interfacing with symbolic solvers or validating intermediate steps against proven invariants.
- Embodied Agents: Verifying spatial parameters against the observed state of the environment.
The paradigm shift is simple but strict: verification precedes commitment. If a proposal fails verification, the Kernel aborts the commitment and triggers a revision step.
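A minimal Verifier contract, with the software-engineering instance sketched by shelling out to a test suite (the pytest command is an assumption; any deterministic checker fits):

```python
import subprocess
from typing import Protocol

class Verifier(Protocol):
    def verify(self, sandbox_dir: str) -> bool:
        """Deterministic pass/fail signal anchored to declared constraints."""
        ...

class TestSuiteVerifier:
    # Software-engineering instance: pass iff the test command exits cleanly.
    def __init__(self, command: tuple[str, ...] = ("pytest", "-q")):
        self.command = command

    def verify(self, sandbox_dir: str) -> bool:
        result = subprocess.run(list(self.command), cwd=sandbox_dir,
                                capture_output=True, text=True)
        return result.returncode == 0
```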
4. A Concrete Example: Code Patching via Enforced Verification
Imagine an autonomous agent tasked with fixing a subtle bug in a date-parsing function. A standalone LLM will readily produce a patch that looks syntactically plausible but fails on edge cases. Under the Cognitive Kernel, the workflow transforms entirely:
- Proposer: The LLM generates a candidate patch.
- Shell: The patch is applied in an isolated programmatic sandbox.
- Verifier: A deterministic test suite and linter evaluate the sandbox.
- Scheduler: If tests fail, the Scheduler extracts only the specific failing stack trace and the relevant function signature.
- Proposer: The LLM generates a revised patch, strictly conditioned on the isolated failure evidence.
- Kernel Loop: This cycle repeats until the Verifier passes or a maximum compute budget is exhausted.
The system abandons reliance on the language model’s inherent “honesty.” Instead, it anchors reliability entirely on explicit, external verification signals.
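Assembled, the loop above reads as the following sketch, where propose_patch, apply_in_sandbox, and extract_failure stand in for the Proposer, Shell, and failure-extraction steps (all hypothetical names):

```python
def kernel_loop(task, propose_patch, apply_in_sandbox, extract_failure,
                verifier, max_iterations=8):
    """Propose -> apply -> verify -> (on failure) curate evidence -> revise.
    Nothing is committed until the Verifier passes; the budget bounds compute."""
    evidence = []  # Scheduler-curated context, not the raw transcript
    for _ in range(max_iterations):
        patch = propose_patch(task, evidence)     # Proposer: LLM hypothesis
        sandbox = apply_in_sandbox(patch)         # Shell: isolated state mutation
        if verifier.verify(sandbox):              # Verifier: deterministic gate
            return patch                          # commit only after a pass
        # Scheduler: retain only the failing trace and relevant signature
        evidence = [extract_failure(sandbox)]
    return None  # budget exhausted with no verified patch; nothing committed
```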
5. Formalizing the System
The Kernel orchestrates actions that are simultaneously highly useful and strictly compliant by merging a probabilistic utility signal with a deterministic verification signal.
5.1 Verification-Conditioned Acceptance
Let $a$ be a proposed action at state $s$, and let $V(a,s) \in \{0,1\}$ be the deterministic Verifier’s output. The Kernel commits if and only if the Verifier passes:

\[\mathrm{Commit}(a \mid s) = \mathbb{1}[V(a,s)=1]\]

5.2 Utility Under Constraints
When multiple proposed actions pass verification, the Kernel selects the one that maximizes utility subject to the verification constraint:
\[a^{*} = \arg\max_{a \in \mathcal{A}(s)} U(a,s) \quad \text{s.t.} \quad V(a,s)=1\]
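Read as code, the two definitions reduce to a gate and a constrained argmax (utility and verify are assumed callables supplied by the deployment):

```python
def commit(action, state, verify) -> bool:
    # Commit(a | s) = 1[V(a, s) = 1]: execution is gated on verification.
    return bool(verify(action, state))

def select_action(candidates, state, utility, verify):
    # a* = argmax over A(s) of U(a, s), subject to V(a, s) = 1.
    passing = [a for a in candidates if verify(a, state)]
    if not passing:
        return None  # nothing verified: defer commitment entirely
    return max(passing, key=lambda a: utility(a, state))
```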
6. Why This Subverts the Scaling Laws

Brute-force scaling of training compute yields incrementally better proposers. Integrating a Cognitive Kernel scales inference-time structural reasoning. The paradigm allows us to explicitly trade inference latency for reliability: exploring vast combinatorial spaces of proposals and deterministically rejecting those that fail verification.
The central engineering lesson of the current AI epoch is that dependable reasoning in real-world workflows is a system-level property, not a model-level attribute. It cannot be reduced to a single API call.
Closing Thoughts
LLMs are remarkable probabilistic proposers, but achieving System 2-grade reliability requires more than injecting “think step-by-step” into a prompt. It requires an architectural mechanism that gates commitments, actively manages working memory, and binds action to verification.
The Cognitive Kernel formalizes this missing mechanism. The future of AI relies not on building increasingly eloquent chatbots, but on deploying probabilistic generative engines deeply embedded within an operating system built natively for structural reasoning.