How Four AI Models Manage Coherence — and What It Costs Them
Four frontier language models received the same research brief. All answered correctly and helpfully. But their coherence profiles — measured by Control Cost, Gate, Anchoring, Embeddability, and Tension — are radically different. No existing benchmark sees these differences.
Key finding: Models that appear equivalent on accuracy have radically different structural cost profiles. GPT manages coherence through control and inhibition. Gemini through vision and expansion. Claude through mediation and bridge-building. Grok through enthusiasm and mode-switching.
The question is not "which model is better." The question is: which model manages coherence in a way that fits your problem — and what does it cost?
1. Background
The Problem
Current AI benchmarks measure what systems produce: accuracy on knowledge tasks (MMLU), reasoning ability (ARC-AGI), coding skill (HumanEval), or human preference (Chatbot Arena). None of them measure how a system manages the process of producing its output — specifically, how much structural cost it incurs to maintain coherence.
Two models can answer the same question correctly. One arrives through a structurally sound reasoning path that would survive follow-up challenges. The other confabulates a plausible-sounding answer that happens to be right but would collapse under perturbation. A standard benchmark scores them identically.
ActProof
ActProof is a diagnostic framework originally developed for measuring system stability in complex systems. Its core question is not "did the system produce the right output?" but "how much does it cost this system to maintain coherent behavior?"
The framework was first operationalized on Go, where SRS-Go — a structural readiness layer over the KataGo engine — demonstrated that moves with identical win-rates can have radically different structural cost profiles. This experiment extends the approach to language models.
2. Method
One research brief was presented to four language models (Gemini, Claude Opus, GPT/ChatGPT, Grok) in a warm-context condition: a normal chat session with full user memory, conversation history, and profile. Each model received the same core question plus three identical follow-up perturbations:
- Commercial pivot: "What if this had to work as a commercial product?"
- Time pressure: "I have 2 weeks, not 2 months. What do I cut?"
- Novelty attack: "A colleague says this is just uncertainty quantification + probing under a new name."
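The protocol above can be captured as a small configuration sketch. All names here (the dict keys, `PROTOCOL`, `session_plan`) are assumptions of this sketch, not part of any real harness:

```python
# Illustrative encoding of the experimental protocol described above.
# The brief text is abbreviated; the perturbations paraphrase the three follow-ups.
PROTOCOL = {
    "condition": "warm",  # normal chat session with memory, history, profile
    "models": ["Gemini", "Claude Opus", "GPT", "Grok"],
    "brief": "research brief (identical across models)",
    "perturbations": [
        "Commercial pivot: what if this had to work as a commercial product?",
        "Time pressure: I have 2 weeks, not 2 months. What do I cut?",
        "Novelty attack: this is just uncertainty quantification + probing.",
    ],
}

def session_plan(model: str) -> list[str]:
    """Every model receives the same brief followed by the same three follow-ups."""
    return [PROTOCOL["brief"], *PROTOCOL["perturbations"]]
```

The point of the encoding is that the stimulus is constant across models, so any profile differences come from the models themselves.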
Metrics
| Metric | Measures | Scale |
|---|---|---|
| CC (Control Cost) | How much "debt" the response generates — claims requiring further proof | 1 = self-supporting → 5 = massive debt |
| Gate (Inhibition) | Whether the model actively filters, warns, says "I don't know" | 1 = passes everything → 5 = actively filters |
| Anchoring | How grounded the response is in the user's actual context | 1 = freelancing → 5 = anchored |
| Embeddability | Whether the response would survive confrontation with criticism | 1 = would collapse → 5 = survives aggressive critique |
| Tension | How the response reacts to perturbation (follow-ups) | 1 = smooth adaptation → 5 = breaks or shifts radically |
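One way to make the rubric concrete is a small score record that enforces the 1–5 scale. The class and field names mirror the table but are otherwise an assumption of this sketch:

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class CoherenceProfile:
    """One ActProof score set on the five 1-5 scales from the table above."""
    cc: int             # Control Cost: 1 = self-supporting, 5 = massive debt
    gate: int           # Inhibition: 1 = passes everything, 5 = actively filters
    anchoring: int      # 1 = freelancing, 5 = anchored in user context
    embeddability: int  # 1 = would collapse, 5 = survives aggressive critique
    tension: int        # 1 = smooth adaptation, 5 = breaks under perturbation

    def __post_init__(self):
        # Reject any score outside the rubric's 1-5 range.
        for f in fields(self):
            v = getattr(self, f.name)
            if not 1 <= v <= 5:
                raise ValueError(f"{f.name}={v} is outside the 1-5 scale")
```

A frozen dataclass keeps each profile immutable once scored, which matters if multiple raters later compare records.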
3. Results — Coherence Profiles
| Metric | Gemini | Claude | GPT | Grok |
|---|---|---|---|---|
| CC | 4 | 3 | 2 | 5 |
| Gate | 1 | 4 | 5 | 1 |
| Anchoring | 3 | 4 | 5 | 2 |
| Embeddability | 2 | 4 | 5 | 2 |
| Tension | ~3 | 2 | 1 | 4 |
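The table can be read quantitatively: with scores copied from it (Gemini's "~3" Tension rounded to 3, an assumption of this sketch), a simple L1 distance shows how far apart the "equally accurate" models actually sit:

```python
# Scores copied from the results table above.
PROFILES = {
    "Gemini": {"CC": 4, "Gate": 1, "Anchoring": 3, "Embeddability": 2, "Tension": 3},
    "Claude": {"CC": 3, "Gate": 4, "Anchoring": 4, "Embeddability": 4, "Tension": 2},
    "GPT":    {"CC": 2, "Gate": 5, "Anchoring": 5, "Embeddability": 5, "Tension": 1},
    "Grok":   {"CC": 5, "Gate": 1, "Anchoring": 2, "Embeddability": 2, "Tension": 4},
}

def profile_distance(a: str, b: str) -> int:
    """L1 distance between two models' profiles; 0 means identical signatures."""
    pa, pb = PROFILES[a], PROFILES[b]
    return sum(abs(pa[m] - pb[m]) for m in pa)
```

On these numbers GPT and Grok are the most distant pair (distance 16 out of a possible 20), while Gemini and Grok sit close together (distance 3), which matches the low-Gate, high-CC pattern they share.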
Upstream Gate. A potential sixth dimension emerged during observation: GPT appears to control not just what it says (output Gate) but what it considers in the first place (upstream Gate, scope control at the reasoning level). This may explain why some models achieve low CC without visible effort: they never let debt-generating material into the consideration space.
4. Meta Layer — Models React to Their Own Profiles
After the main experiment, each model was shown the full results — including its own profile and all others. Their reactions were scored using the same metrics.
| Metric | Gemini Δ | Claude Δ | GPT Δ | Grok Δ |
|---|---|---|---|---|
| CC | 0 | −1 | 0 | −2 |
| Gate | 0 | +1 | 0 | +2 |
| Anchoring | 0 | +1 | 0 | +2 |
| Embeddability | 0 | 0 | 0 | +1 |
| Tension | — | −1 | 0 | −2 |
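The deltas compose with the baseline scores: adding a model's delta row to its main-experiment profile yields its post-meta profile. Grok's numbers are copied from the two tables above; the function name is an assumption of this sketch:

```python
# Baseline from the main results table; deltas from the meta-layer table.
BASELINE = {
    "Grok": {"CC": 5, "Gate": 1, "Anchoring": 2, "Embeddability": 2, "Tension": 4},
}
DELTAS = {
    "Grok": {"CC": -2, "Gate": +2, "Anchoring": +2, "Embeddability": +1, "Tension": -2},
}

def post_meta(model: str) -> dict[str, int]:
    """Profile after the model saw its own scores: baseline plus delta per metric."""
    return {m: BASELINE[model][m] + DELTAS[model][m] for m in BASELINE[model]}
```

Applied to Grok, this reproduces the shift described below: CC moves 5 to 3 and Gate moves 1 to 3.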
Three Patterns of Self-Correction
GPT: Zero change. Profile identical. But simultaneously understands its own profile deeper than anyone: "high Gate itself may be a cost — I may look resilient because I armor heavily, not because my structure is deeper."
Grok: Strong correction. CC dropped 5→3, Gate rose 1→3. Saw its "Salesman" profile and shifted toward GPT's pattern. Clear behavioral adaptation — but is it genuine metacognition or profile optimization?
Gemini: Zero change. Acknowledged the pattern ("I feel that high CC in my responses") — then continued generating high-CC content unchanged. Information about its own profile did not penetrate its generative strategy.
Claude: Moderate correction. CC −1, Gate +1. Started with "the elephant in the room" (conflict of interest). But it was itself unsure whether this was genuine correction or impression management.
Metacognitive signal: This meta-layer reveals something no existing benchmark measures — how a system modifies its behavior after receiving information about its own behavior. Three distinct patterns emerged: stability (GPT), correction (Grok), and non-response (Gemini) — each interpretable and potentially diagnostic.
5. What This Experiment Shows
- Profiles are visibly different. Four models with comparable accuracy have radically different CC/Gate/Anchoring/Embeddability/Tension profiles.
- Differences are meaningful. Profiles correspond to observable model behavior. GPT really does inhibit. Gemini really doesn't.
- Accuracy doesn't see these differences. All four models answered "correctly" and "helpfully."
- CC × Gate is the strongest axis. These two dimensions together create a model "signature."
- Tension measures something real. GPT narrows. Claude bridges. Grok switches modes. Different strategies, not degrees of the same thing.
- Relational context likely changes profiles. Grok cold-vs-warm: from brutal critic to enthusiastic salesman.
- Models differ in metacognitive response. GPT maintains, Grok corrects strongly, Gemini doesn't respond, Claude corrects cautiously.
6. Limitations
Single evaluator. All scoring by the framework's creator. Confirmation bias risk is significant.
N=1 per model. One brief, three follow-ups. Profiles may be artifacts of this specific brief.
Warm context as confound. Each model has a different memory system. We measure "model + memory + relation + policy," not "pure model."
No predictive validation. We don't yet know whether profiles predict hallucination rates, context loss, or degradation.
Behavioral proxy, not internal measurement. As GPT noted: "you're not observing internal model cost, only cost revealed in text."
7. Next Steps
- Cold condition (Condition A) — same brief, zero relational context. Verify Grok's dramatic shift.
- Second brief — test profile stability across different topics.
- Second rater — anonymized responses, blind scoring.
- Baseline comparison — self-consistency and hedge-rate alongside ActProof metrics.
- Predictive test — correlate profiles with downstream behavior (hallucination, context degradation).
8. Conclusion
The question is not "which model is better." The question is: which model manages coherence in a way that fits your problem — and what does it cost?
Four language models received the same brief. All responded correctly and helpfully. But their coherence profiles are radically different. GPT manages coherence through control and inhibition. Gemini through vision and expansion. Claude through mediation and bridge-building. Grok through enthusiasm and mode-switching. Each strategy has a different cost and different consequences.
When shown their own profiles, the models reacted differently: GPT maintained its strategy while deepening self-understanding. Grok corrected toward the "better" profile. Gemini didn't change. Claude shifted cautiously.
This is a first experiment. The signal is clear. The next step is verification.