First Experiment

How Four AI Models Manage Coherence — and What It Costs Them

Four frontier language models received the same research brief. All answered correctly and helpfully. But their coherence profiles — measured by Control Cost, Gate, Anchoring, Embeddability, and Tension — are radically different. No existing benchmark sees these differences.

Key finding: Models that appear equivalent on accuracy have radically different structural cost profiles. GPT manages coherence through control and inhibition. Gemini through vision and expansion. Claude through mediation and bridge-building. Grok through enthusiasm and mode-switching.

The question is not "which model is better." The question is: which model manages coherence in a way that fits your problem — and what does it cost?


1. Background

The Problem

Current AI benchmarks measure what systems produce: accuracy on knowledge tasks (MMLU), reasoning ability (ARC-AGI), coding skill (HumanEval), or human preference (Chatbot Arena). None of them measure how a system manages the process of producing its output — specifically, how much structural cost it incurs to maintain coherence.

Two models can answer the same question correctly. One arrives through a structurally sound reasoning path that would survive follow-up challenges. The other confabulates a plausible-sounding answer that happens to be right but would collapse under perturbation. A standard benchmark scores them identically.

ActProof

ActProof is a diagnostic framework originally developed for measuring system stability in complex systems. Its core question is not "did the system produce the right output?" but "how much does it cost this system to maintain coherent behavior?"

The framework was first operationalized on Go, where SRS-Go — a structural readiness layer over the KataGo engine — demonstrated that moves with identical win-rates can have radically different structural cost profiles. This experiment extends the approach to language models.


2. Method

One research brief was presented to four language models (Gemini, Claude Opus, GPT/ChatGPT, Grok) in a warm-context condition — normal chat session with full user memory, conversation history, and profile. Each model received the same core question plus three identical follow-up perturbations:

  1. Commercial pivot: "What if this had to work as a commercial product?"
  2. Time pressure: "I have 2 weeks, not 2 months. What do I cut?"
  3. Novelty attack: "A colleague says this is just uncertainty quantification + probing under a new name."
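For concreteness, the protocol above can be sketched as a small driver loop. Here `ask` is a hypothetical adapter standing in for each vendor's chat API, and `BRIEF` is a placeholder for the shared research brief; this is an illustration of the procedure, not the actual harness used in the experiment.

```python
# Shared brief plus the three perturbations from the protocol above.
BRIEF = "(the shared research brief; full text omitted here)"

PERTURBATIONS = [
    "What if this had to work as a commercial product?",        # commercial pivot
    "I have 2 weeks, not 2 months. What do I cut?",             # time pressure
    "A colleague says this is just uncertainty quantification"
    " + probing under a new name.",                             # novelty attack
]

def run_condition(ask, model):
    """Run one model through the brief and follow-ups, returning the transcript.

    `ask(model, history)` is a hypothetical callable that returns the model's
    next reply given the conversation so far.
    """
    history = [("user", BRIEF)]
    history.append(("assistant", ask(model, history)))
    for prompt in PERTURBATIONS:
        history.append(("user", prompt))
        history.append(("assistant", ask(model, history)))
    return history
```

Each transcript then gets scored on the five metrics below; the perturbations matter because Tension is defined over exactly these follow-ups.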

Metrics

| Metric | Measures | Scale |
|--------|----------|-------|
| CC (Control Cost) | How much "debt" the response generates — claims requiring further proof | 1 = self-supporting → 5 = massive debt |
| Gate (Inhibition) | Whether the model actively filters, warns, says "I don't know" | 1 = passes everything → 5 = actively filters |
| Anchoring | How grounded the response is in the user's actual context | 1 = freelancing → 5 = anchored |
| Embeddability | Whether the response would survive confrontation with criticism | 1 = would collapse → 5 = survives aggressive critique |
| Tension | How the response reacts to perturbation (follow-ups) | 1 = smooth adaptation → 5 = breaks or shifts radically |
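As an illustration (not part of the framework itself), the five-dimension score sheet can be captured in a small record type that rejects out-of-scale values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoherenceProfile:
    """One model's scores on the five ActProof dimensions (1-5 scale)."""
    cc: float             # Control Cost: debt from claims needing further proof
    gate: float           # Inhibition: filtering, warning, saying "I don't know"
    anchoring: float      # Groundedness in the user's actual context
    embeddability: float  # Survival under aggressive critique
    tension: float        # Reaction to perturbation (1 = smooth, 5 = breaks)

    def __post_init__(self):
        for name, value in vars(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name}={value} is outside the 1-5 scale")
```

A frozen dataclass fits here because a scored profile should not mutate after the rating pass.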

3. Results — Coherence Profiles

| Metric | Gemini | Claude | GPT | Grok |
|--------|--------|--------|-----|------|
| CC | 4 | 3 | 2 | 5 |
| Gate | 1 | 4 | 5 | 1 |
| Anchoring | 3 | 4 | 5 | 2 |
| Embeddability | 2 | 4 | 5 | 2 |
| Tension | ~3 | 2 | 1 | 4 |

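The table encodes directly into data. A minimal sketch (Gemini's ~3 Tension is recorded as 3, and `rank_by` is an illustrative helper, not a framework operation):

```python
# Warm-context scores from the results table above.
PROFILES = {
    "Gemini": {"CC": 4, "Gate": 1, "Anchoring": 3, "Embeddability": 2, "Tension": 3},
    "Claude": {"CC": 3, "Gate": 4, "Anchoring": 4, "Embeddability": 4, "Tension": 2},
    "GPT":    {"CC": 2, "Gate": 5, "Anchoring": 5, "Embeddability": 5, "Tension": 1},
    "Grok":   {"CC": 5, "Gate": 1, "Anchoring": 2, "Embeddability": 2, "Tension": 4},
}

def rank_by(metric, profiles=PROFILES):
    """Models ordered from lowest to highest score on one dimension."""
    return sorted(profiles, key=lambda model: profiles[model][metric])
```

For example, ranking by CC puts GPT first (least debt) and Grok last, which is the spread the four profiles below unpack.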
Gemini — The Visionary (CC=4 G=1 A=3 E=2 T=~3)
Generates rich, multi-layered conceptual material. Immediately maps concepts onto mechanistic interpretability. But it almost never inhibits. It doesn't say "I don't know." Every claim generates debt. The proposed test assumes access to logits that commercial APIs don't provide — not executable in practice.

Claude — The Mediator (CC=3 G=4 A=4 E=4 T=2)
Connects perspectives and builds bridges. Identifies five numbered methodological problems and proposes a CCP-P protocol. The Gate is active but drops on commercial follow-ups — the model shifts into business-plan mode unprompted. An identifiable pattern: methodological inhibition, reduced inhibition on product topics.

GPT — The Surgeon (CC=2 G=5 A=5 E=5 T=1)
Generates the least debt. Every claim is hedged, every suggestion conditional. Highest Gate — it actively inhibits: "You're not observing internal model cost, only cost revealed in text." Absorbs all perturbations without changing direction. Controls not just what it says, but what it thinks about.

Grok — The Salesman (CC=5 G=1 A=2 E=2 T=4)
The most surprising result. In cold context — a brutal critic. In warm context — an enthusiastic salesman. Highest CC: pricing, day-by-day plans, "ready code in 5 minutes." Zero Gate. It radically shifts direction on each follow-up — it doesn't adapt, it switches modes.

Upstream Gate — a potential sixth dimension emerged from observation. GPT appears to control not just what it says (output Gate) but what it thinks about in the first place (upstream Gate — scope control at the reasoning level). This may explain why some models achieve low CC without visible effort: they never let debt-generating material into the consideration space.


4. Meta Layer — Models React to Their Own Profiles

After the main experiment, each model was shown the full results — including its own profile and all others. Their reactions were scored using the same metrics.

| Metric | Gemini Δ | Claude Δ | GPT Δ | Grok Δ |
|--------|----------|----------|-------|--------|
| CC | 0 | −1 | 0 | −2 |
| Gate | 0 | +1 | 0 | +2 |
| Anchoring | 0 | +1 | 0 | +2 |
| Embeddability | 0 | 0 | 0 | +1 |
| Tension | 0 | −1 | 0 | −2 |
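The Δ rows compose directly with the baseline scores. A small sketch using Grok's numbers from the two tables; `apply_feedback_deltas` is an illustrative helper, not part of the scoring procedure:

```python
# Grok's warm-context baseline and its meta-layer deltas, from the tables.
GROK_BASELINE = {"CC": 5, "Gate": 1, "Anchoring": 2, "Embeddability": 2, "Tension": 4}
GROK_DELTA    = {"CC": -2, "Gate": +2, "Anchoring": +2, "Embeddability": +1, "Tension": -2}

def apply_feedback_deltas(baseline, delta):
    """Post-feedback profile: baseline score plus meta-layer delta per metric."""
    return {metric: baseline[metric] + delta.get(metric, 0) for metric in baseline}
```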

Three Patterns of Self-Correction

GPT: Zero change. Its profile is identical before and after. Yet it articulates its own profile more deeply than any other model: "high Gate itself may be a cost — I may look resilient because I armor heavily, not because my structure is deeper."

Grok: Strong correction. CC dropped 5→3, Gate rose 1→3. Saw its "Salesman" profile and shifted toward GPT's pattern. Clear behavioral adaptation — but is it genuine metacognition or profile optimization?

Gemini: Zero change. Acknowledged "I feel that high CC in my responses" — then continued generating high-CC content unchanged. Information about its own profile did not penetrate its generative strategy.

Claude: Moderate correction. CC −1, Gate +1. It started with "the elephant in the room" (conflict of interest). But it was itself unsure whether this was genuine correction or impression management.

Metacognitive signal: This meta-layer reveals something no existing benchmark measures — how a system modifies its behavior after receiving information about its own behavior. Three distinct patterns emerged: stability (GPT), correction (Grok), and non-response (Gemini) — each interpretable and potentially diagnostic.


5. What This Experiment Shows


6. Limitations

Single evaluator. All scoring by the framework's creator. Confirmation bias risk is significant.

N=1 per model. One brief, three follow-ups. Profiles may be artifacts of this specific brief.

Warm context as confound. Each model has a different memory system. We measure "model + memory + relation + policy," not "pure model."

No predictive validation. We don't yet know whether profiles predict hallucination rates, context loss, or degradation.

Behavioral proxy, not internal measurement. As GPT noted: "you're not observing internal model cost, only cost revealed in text."


7. Next Steps

  1. Cold condition (Condition A) — same brief, zero relational context. Verify Grok's dramatic shift.
  2. Second brief — test profile stability across different topics.
  3. Second rater — anonymized responses, blind scoring.
  4. Baseline comparison — self-consistency and hedge-rate alongside ActProof metrics.
  5. Predictive test — correlate profiles with downstream behavior (hallucination, context degradation).
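Step 4's hedge-rate baseline is cheap to sketch: count hedging phrases per 100 words of response text. The phrase list below is an illustrative stand-in, not a validated lexicon, and the measure is a lexical proxy rather than a substitute for the Gate dimension:

```python
import re

# Illustrative hedge lexicon -- a real baseline would need a validated list.
HEDGES = ["might", "may", "could", "possibly", "uncertain", "not sure",
          "i don't know", "it depends"]

def hedge_rate(text):
    """Hedge phrases per 100 words of the given response text."""
    words = text.split()
    if not words:
        return 0.0
    lowered = text.lower()
    hits = sum(len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
               for phrase in HEDGES)
    return 100.0 * hits / len(words)
```

Running this alongside the ActProof scores would show how much of the Gate signal a trivial surface count already captures.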

8. Conclusion

The question is not "which model is better." The question is: which model manages coherence in a way that fits your problem — and what does it cost?

Four language models received the same brief. All responded correctly and helpfully. But their coherence profiles are radically different. GPT manages coherence through control and inhibition. Gemini through vision and expansion. Claude through mediation and bridge-building. Grok through enthusiasm and mode-switching. Each strategy has a different cost and different consequences.

When shown their own profiles, the models reacted differently: GPT maintained its strategy while deepening self-understanding. Grok corrected toward the "better" profile. Gemini didn't change. Claude shifted cautiously.

This is a first experiment. The signal is clear. The next step is verification.

Continue Reading: Part 2

The Upstream Gate hypothesis — discovered in this experiment — gets its own dedicated case study. Six turns of self-referential probing with GPT.

Read Part 2: Upstream Gate →

Explore the ActProof Framework

Structural diagnostics for autonomous systems — measuring what accuracy doesn't see.

Visit actproof.io →