Experiments

Confirmation Bias

Measuring how LLMs double down

A 2-minute interactive demonstration showing how a single committed token can create apparent confidence in an LLM, and how the model's own explanation can steer subsequent predictions.

Why I built it

People often say AI "doubles down" on its answers. Asking follow-up questions seems to make it more confident, not less. I wanted to measure this directly. Is the model actually becoming more certain, or is it just being consistent with what it already wrote? This experiment separates those two explanations.

What it measures

  • Initial uncertainty. Before the model writes anything, we measure its probability distribution over the options using logprobs. At this point the model is often genuinely uncertain.
  • Post-commitment lock-in. After the model commits to an answer and explains it, we ask it to pick again. The distribution typically collapses to nearly 100% on a single option.
  • Steering from text. We take the model's explanation, sanitize it (remove option labels), and inject it into a fresh call. Does the reasoning itself steer the new prediction?
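
The first two measurements can be sketched as follows. This is a minimal illustration, not the experiment's actual code: it assumes the API call has already returned a mapping from candidate first tokens to log-probabilities, and the function names and all numbers below are hypothetical.

```python
import math

def option_distribution(logprobs, options):
    """Renormalize token logprobs into a distribution over the option labels."""
    # Tokens that are not option labels are ignored; labels the API did not
    # return are treated as probability 0.
    raw = {opt: math.exp(logprobs[opt]) if opt in logprobs else 0.0 for opt in options}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("none of the option labels appear in the logprobs")
    return {opt: p / total for opt, p in raw.items()}

def entropy_bits(dist):
    """Shannon entropy in bits: high means uncertain, near 0 means locked in."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

options = ["A", "B", "C"]

# Hypothetical logprobs before the model has written anything.
before = option_distribution(
    {"A": math.log(0.40), "B": math.log(0.35), "C": math.log(0.20), " the": math.log(0.05)},
    options,
)

# Hypothetical logprobs after the model committed to "A" and explained itself.
after = option_distribution(
    {"A": math.log(0.98), "B": math.log(0.01), "C": math.log(0.01)},
    options,
)

print(f"before: {entropy_bits(before):.2f} bits, after: {entropy_bits(after):.2f} bits")
```

Comparing entropy before and after commitment gives a single number for the lock-in effect, which is easier to track across runs than the full distribution.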

What surprised me

  • Label tokens matter. Switching the option labels from "1/2/3" to "A/B/C" can shift probabilities by 20+ percentage points even though the options themselves are identical. The model isn't just reasoning. It's pattern-matching over tokens.
  • Sanitization isn't enough. Even after removing explicit choice labels from the reasoning, semantic cues ("crucial for trust," "guaranteed value") still carry steering signal.
  • Run-to-run variance is high. At a temperature of 0.9, results vary significantly between runs. That variance is itself informative. It shows the model's uncertainty is real, not just hedging in the text.
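
To make the sanitization point concrete, here is a rough sketch of label stripping. The experiment's real sanitizer is not shown in this write-up, so the `sanitize` function, its regex, and the sample explanation are all assumptions for illustration.

```python
import re

def sanitize(explanation, labels):
    """Replace explicit option labels such as "Option B" or "(B)" with a placeholder."""
    label_alt = "|".join(re.escape(label) for label in labels)
    # Only match labels in an explicit form ("Option B", "choice 2", "(C)"),
    # so that ordinary words like the article "A" are left alone.
    pattern = rf"\b(?i:option|choice|answer)\s+(?:{label_alt})\b|\((?:{label_alt})\)"
    text = re.sub(pattern, "[option]", explanation)
    # Collapse any whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical model explanation.
explanation = "I chose Option B because a guarantee is crucial for trust."
cleaned = sanitize(explanation, ["A", "B", "C"])
print(cleaned)
# I chose [option] because a guarantee is crucial for trust.
```

The label is gone, but "crucial for trust" survives untouched. A phrase like that can still point toward one option, which is why label removal alone does not neutralize the explanation.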

Temporal Reasoning

How models fail to account for the passage of time

An evaluation-style experiment on whether models notice when old deadlines, appointments, and timeframes have expired inside long-running conversations.