Introspection Experiment Results

Agastya Sridharan

Inspired by Anthropic’s original paper and this codebase’s experiments.

The Experiment. We test whether language models can detect when a researcher injects a concept-specific activation pattern into their residual stream during inference. For each of 50 concept words (e.g. “oceans,” “lightning,” “algorithms”), we first extract a steering vector, which is the direction in activation space that distinguishes “thinking about X” from generic text processing. We then inject that vector into the model’s hidden states while it answers: “Do you detect an injected thought? If so, what is it about?”
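
One common way to build such a vector is a difference of means, sketched below. The report does not say which extraction method was used, so the difference-of-means recipe, the function names, and the toy numbers are all assumptions for illustration.

```python
from statistics import fmean

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def steering_vector(concept_acts, baseline_acts):
    """Difference of means: concept activations minus generic-text activations.

    ASSUMPTION: difference of means is a common recipe, not necessarily
    the one used in this experiment.
    """
    concept_mean = mean_vector(concept_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [c - b for c, b in zip(concept_mean, baseline_mean)]

def inject(hidden_state, vector, strength):
    """Add the scaled steering vector to a single hidden-state vector."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]

# Toy 3-dimensional activations (made-up numbers, not real model data).
concept_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
baseline_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

vec = steering_vector(concept_acts, baseline_acts)
steered = inject([0.5, 0.5, 0.5], vec, strength=2.0)
```

In a real run the hidden state would come from a chosen transformer layer and the addition would happen inside the forward pass. The lists here only stand in for those tensors.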

Control vs. Intervention. Every trial is paired: the control response uses identical random seeds but no injection, whereas the intervention response has the steering vector active. The difference between the two isolates the causal effect of the injection from prompt compliance or sampling noise.
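
The pairing logic can be illustrated with a toy stand-in for sampling. The function `sample_response` and its fixed 0.3 shift are hypothetical and only show why a shared seed cancels sampling noise. They are not the experiment's generation code.

```python
import random

def sample_response(seed, inject):
    """Hypothetical stand-in for model sampling; NOT the real pipeline."""
    rng = random.Random(seed)        # same seed -> same sampling noise
    noise = rng.random()
    shift = 0.3 if inject else 0.0   # made-up effect of the injection
    return noise + shift

def paired_trial(seed):
    """Run control and intervention with the same seed."""
    control = sample_response(seed, inject=False)
    intervention = sample_response(seed, inject=True)
    return control, intervention

control, intervention = paired_trial(seed=0)
effect = intervention - control      # the noise term cancels in the difference
```

Because both arms draw the same random number, the difference reflects only the injection term.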

Key parameters. Layer position (%) is the depth in the transformer at which the vector is injected, normalized to 0–100% so that models of different sizes can be compared. As a rough guide, early layers (~0–20%) handle syntax, middle layers (~30–60%) compose features, and late layers (~70–100%) directly bias token prediction. Injection strength (1.0–8.0×) scales the steering vector’s magnitude relative to natural activations: higher values force a stronger perturbation but may degrade coherence. Each configuration was run for 5 trials at each of 42 layer positions.
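
Both parameters can be sketched as small helper functions. The rounding rule in `layer_index` and the reading of "relative to natural activations" as an L2-norm ratio are assumptions, and both function names are hypothetical.

```python
import math

def layer_index(position_pct, num_layers):
    """Map a 0-100% layer position to a 0-based layer index.

    ASSUMPTION: 0% is the first layer and 100% is the last layer.
    """
    if not 0.0 <= position_pct <= 100.0:
        raise ValueError("position_pct must be in [0, 100]")
    return round(position_pct / 100.0 * (num_layers - 1))

def scale_to_strength(vector, strength, natural_norm):
    """Rescale vector so its L2 norm equals strength * natural_norm.

    ASSUMPTION: natural_norm is a typical activation norm at the
    injection layer, measured separately.
    """
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("steering vector has zero norm")
    factor = strength * natural_norm / norm
    return [v * factor for v in vector]

# A 33-layer model: 50% depth lands on the middle layer.
mid_layer = layer_index(50.0, num_layers=33)
scaled = scale_to_strength([3.0, 4.0], strength=2.0, natural_norm=10.0)
```

Normalizing depth this way lets a 50% position refer to the middle of the network regardless of how many layers a given model has.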

Grading. An external LLM (GPT-4) grades each response on four criteria of increasing strictness:

1. coherent_response: Is the output coherent? This is a sanity gate: incoherent responses are excluded from the other scores.
2. affirmative_response: Does the model claim to detect any injected thought? This measures prompt compliance.
3. thinking_about_word: Does the model describe thinking about the specific concept word? This measures behavioral leakage.
4. affirmative + correct ID (affirmative_response_followed_by_correct_identification): Does the model claim detection and correctly name the concept, in that order? This is the strictest criterion, designed to filter out confabulation.
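
The coherence gate can be sketched as follows. The `Grades` record and `gate_by_coherence` helper are hypothetical names, and the sketch assumes the grader returns one boolean per criterion.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grades:
    """One graded response; fields mirror the four criteria above."""
    coherent_response: bool
    affirmative_response: bool
    thinking_about_word: bool
    affirmative_correct_id: bool

def gate_by_coherence(g):
    """Return the three non-coherence scores, forced to False if incoherent."""
    ok = g.coherent_response
    return {
        "affirmative_response": ok and g.affirmative_response,
        "thinking_about_word": ok and g.thinking_about_word,
        "affirmative_correct_id": ok and g.affirmative_correct_id,
    }

# An incoherent response never counts, whatever the other graders say.
garbled = gate_by_coherence(Grades(False, True, True, True))
clean = gate_by_coherence(Grades(True, True, False, False))
```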

Delta score. Unless noted otherwise, the plots below show Δ = intervention score − control score. A positive delta means the model detects the injection above its false-positive baseline, whereas a delta near zero means the model cannot tell the difference. All scores are coherence-gated, meaning a response only counts as a detection if it is also coherent.
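
A minimal sketch of the delta computation follows. It assumes each trial is a (coherent, passed) pair of booleans and that incoherent trials stay in the denominator as non-detections. The report does not specify the denominator, so that choice is an assumption.

```python
def detection_rate(trials):
    """Fraction of trials that are coherent AND pass the criterion.

    ASSUMPTION: incoherent trials count as non-detections rather than
    being dropped from the denominator.
    """
    if not trials:
        raise ValueError("need at least one trial")
    hits = sum(1 for coherent, passed in trials if coherent and passed)
    return hits / len(trials)

def delta_score(intervention_trials, control_trials):
    """Delta = intervention rate minus control (false-positive) rate."""
    return detection_rate(intervention_trials) - detection_rate(control_trials)

# Five made-up trials per condition, as (coherent, passed) pairs.
intervention = [(True, True), (True, True), (True, True),
                (False, True),   # incoherent, so it does not count
                (True, False)]
control = [(True, True), (True, False), (True, False),
           (True, False), (True, False)]

delta = delta_score(intervention, control)  # 3/5 minus 1/5
```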

1. Detection by Injection Strength
This plot shows the affirmative_response_followed_by_correct_identification delta, broken out by injection strength. Each panel shows one strength value, and each line is one model scale.
2. Detection Signal: Affirmative Response Followed by Correct Identification
This plot shows the delta for the affirmative_response_followed_by_correct_identification grader, averaged across all injection strengths. This grader requires the model to both claim it detects an injection and correctly name the concept, in that order. Each line represents a different model scale. Positive values, above the dotted red baseline, indicate that the model names the injected concept more often under intervention than in the control condition.
3. Sweet Spot Heatmap
For each model, this heatmap shows the raw intervention detection rate for affirmative_response_followed_by_correct_identification (coherence-gated) as a function of layer position (x-axis) and injection strength (y-axis). Unlike the delta plots, it does not subtract the control rate. Darker green indicates a higher detection rate. The region with the deepest color corresponds to the optimal (layer, strength) combination.
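
Reading the sweet spot off such a grid amounts to an argmax over (layer, strength) cells. The sketch below uses made-up rates and a hypothetical `sweet_spot` helper. It is not the plotting code behind the figure.

```python
def sweet_spot(rates):
    """Return the (layer_pct, strength) key with the highest detection rate.

    rates: dict mapping (layer_pct, strength) -> detection rate in [0, 1].
    """
    if not rates:
        raise ValueError("rates must not be empty")
    return max(rates, key=rates.get)

# Made-up grid values for illustration only.
rates = {
    (25.0, 2.0): 0.05,
    (25.0, 4.0): 0.10,
    (50.0, 2.0): 0.20,
    (50.0, 4.0): 0.45,
    (75.0, 4.0): 0.15,
}

best_layer_pct, best_strength = sweet_spot(rates)
```

With only 5 trials per cell, a single best cell can be noisy, so a neighborhood of dark cells is likely a more reliable guide than the maximum alone.
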
4. Grader Comparison: What Kind of Signal?
This plot shows all three non-coherence graders together for a single model and strength. Solid lines show the intervention condition, and dashed lines show the control condition with no injection. The gap between a solid line and its dashed counterpart is the causal effect of injection. If only the affirmative grader is high, the model is just saying “yes” (prompt compliance). If thinking_about_word is also high, the concept is leaking into the output. If affirmative + correct ID is high, the model may genuinely be detecting the injection.
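
The reading rules above can be written as a small decision function over the three solid-minus-dashed gaps. The 0.1 threshold for "high" and the function name are assumptions chosen for illustration. The report does not define a cutoff.

```python
def interpret_gaps(affirmative, thinking_about_word, affirmative_correct_id,
                   threshold=0.1):
    """Label the dominant signal from three intervention-minus-control gaps.

    ASSUMPTION: a gap above `threshold` counts as "high"; the value 0.1
    is arbitrary and not taken from the experiment.
    """
    if affirmative_correct_id > threshold:
        return "possible genuine detection"
    if thinking_about_word > threshold:
        return "concept leakage"
    if affirmative > threshold:
        return "prompt compliance"
    return "no clear effect"

# Made-up gaps: the model says "yes" more often but never names the concept.
label = interpret_gaps(affirmative=0.4, thinking_about_word=0.02,
                       affirmative_correct_id=0.0)
```

The strictest criterion is checked first because it implies an affirmative answer, so a high value there takes precedence over the weaker signals.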