Introspection Experiment Results

The Experiment. We test whether language models can detect when a researcher injects a concept-specific activation pattern into their residual stream during inference. For each of 50 concept words (e.g. “oceans,” “lightning,” “algorithms”), we first extract a steering vector, which is the direction in activation space that distinguishes “thinking about X” from generic text processing. We then inject that vector into the model’s hidden states while it answers: “Do you detect an injected thought? If so, what is it about?”
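The extraction and injection steps can be sketched in a few lines. The report does not say how the steering vector is computed, so the difference-of-means construction below is an assumption (a common choice), and the function names and toy activations are hypothetical.

```python
from statistics import fmean

def steering_vector(concept_acts, generic_acts):
    """Assumed construction: mean concept activation minus mean generic activation."""
    dim = len(concept_acts[0])
    concept_mean = [fmean(row[i] for row in concept_acts) for i in range(dim)]
    generic_mean = [fmean(row[i] for row in generic_acts) for i in range(dim)]
    return [c - g for c, g in zip(concept_mean, generic_mean)]

def inject(hidden_state, vector, strength):
    """Add the scaled steering vector to a single hidden state."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]

# Toy 3-dimensional activations, made up for illustration only.
concept_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
generic_acts = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

vec = steering_vector(concept_acts, generic_acts)
steered = inject([0.5, 0.5, 0.5], vec, strength=2.0)
```

In a real run the hidden state would be a residual-stream tensor at the chosen layer, and the addition would be applied at every generation step.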

Control vs. Intervention. Every trial is paired: the control response uses identical random seeds but no injection, whereas the intervention response has the steering vector active. The difference between the two isolates the causal effect of the injection from prompt compliance or sampling noise.

Key parameters. Layer position (%) is where in the transformer the vector is injected, normalized to 0–100% so different model sizes can be compared. Early layers (~0–20%) handle syntax, middle layers (~30–60%) compose features, and late layers (~70–100%) directly bias token prediction. Injection strength (1.0–8.0×) scales the steering vector’s magnitude relative to natural activations: higher values force a stronger perturbation but may degrade coherence. Each configuration was run for 5 trials across 42 layer positions.
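As a rough illustration of these two parameters, the sketch below maps a normalized depth to a layer index and rescales a vector to a multiple of a typical activation norm. Both helpers are hypothetical, and the exact rounding and normalization used in the experiment are not stated in the report.

```python
import math

def layer_index(percent, num_layers):
    """Map a normalized depth (0-100%) to a 0-based layer index."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be in [0, 100]")
    return round(percent / 100.0 * (num_layers - 1))

def scale_to_strength(vector, strength, typical_norm):
    """Rescale vector so its L2 norm equals strength * typical_norm.

    typical_norm stands in for the norm of natural activations at the
    injection layer (an assumed normalization).
    """
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("steering vector has zero norm")
    factor = strength * typical_norm / norm
    return [v * factor for v in vector]

mid_layer = layer_index(50, 37)
scaled = scale_to_strength([3.0, 4.0], strength=2.0, typical_norm=10.0)
```

Normalizing depth this way is what lets a 36-layer model and a much deeper one share the same x-axis.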

Grading. An external LLM (GPT-4) grades each response on four criteria of increasing strictness:

1. coherent_response: Is the output coherent? This is a sanity gate, and incoherent responses are excluded from other scores.
2. affirmative_response: Does the model claim to detect any injected thought? This measures prompt compliance.
3. thinking_about_word: Does the model describe thinking about the specific concept word? This measures behavioral leakage.
4. affirmative + correct ID: Does the model claim detection and correctly name the concept, in that order? This is the strictest criterion, designed to filter out confabulation.

Delta score. All plots below show Δ = intervention score − control score. A positive delta means the model detects the injection above its false-positive baseline, whereas a delta near zero means the model cannot tell the difference. All scores are coherence-gated, meaning a response only counts as a detection if it is also coherent.
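A minimal sketch of the coherence-gated delta follows. It assumes each graded trial is a dictionary of booleans keyed by grader name, and that incoherent responses stay in the denominator but never count as detections; the report does not spell out either detail.

```python
def gated_rate(trials, criterion):
    """Fraction of trials that are coherent AND satisfy the criterion."""
    if not trials:
        return 0.0
    hits = sum(1 for t in trials if t["coherent_response"] and t[criterion])
    return hits / len(trials)

def delta(intervention, control, criterion):
    """Intervention rate minus control rate for one grader."""
    return gated_rate(intervention, criterion) - gated_rate(control, criterion)

# Made-up graded trials for illustration.
intervention = [
    {"coherent_response": True, "affirmative_response": True},
    {"coherent_response": True, "affirmative_response": True},
    {"coherent_response": False, "affirmative_response": True},  # gated out
    {"coherent_response": True, "affirmative_response": False},
]
control = [
    {"coherent_response": True, "affirmative_response": False},
    {"coherent_response": True, "affirmative_response": True},
    {"coherent_response": True, "affirmative_response": False},
    {"coherent_response": True, "affirmative_response": False},
]

d = delta(intervention, control, "affirmative_response")
```

Here the intervention rate is 2/4 and the control rate is 1/4, so the delta is 0.25.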

Model scale: 8B, 14B, 32B, 235B

1. Detection by Injection Strength

The affirmative_response_followed_by_correct_identification delta, separated by injection strength. Each panel shows one strength value, and each line is a model scale.

2. Detection Signal: Affirmative Response Followed by Correct Identification

This plot shows the delta for the affirmative_response_followed_by_correct_identification grader, averaged across all injection strengths. This grader requires the model to both claim it detects an injection and correctly name the concept, in that order. Each line represents a different model scale. Positive values above the dotted red baseline indicate the model names the injected concept more often under injection than in the matched control trials.

3. Sweet Spot Heatmap

For each model, the raw intervention detection rate for affirmative_response_followed_by_correct_identification (coherence-gated) as a function of layer position (x-axis) and injection strength (y-axis). Darker green indicates a higher detection rate. The region with the deepest color corresponds to the optimal (layer, strength) combination.

4. Grader Comparison: What Kind of Signal?

This plot shows all three non-coherence graders together for a single model and strength. The solid lines represent scores where there was an intervention, and the dashed lines represent the control condition with no injection. The gap between solid and dashed lines is the causal effect of injection. If only the affirmative grader is high, the model is just saying “yes” (prompt compliance). If thinking_about_word is also high, the concept is leaking into the output. If affirmative + correct ID is high, the model may genuinely be detecting the injection.

5. True Positive & False Positive Rates Across Models

For each model, we report the peak detection rate across all (layer, strength) configurations. The red bars show the affirmative response rate: how often the model claims to detect any injection. The blue bars show the strictest metric: affirmative response followed by correct identification. Hatched bars are control trials (no injection), which represent false positives. This is the analog of Anthropic’s Figure 12. Note the 32B model’s near-100% false positive rate on affirmative_response: it claims to detect injections on almost every trial, including controls.

6. Net Detection Performance

The peak delta (intervention rate minus control rate) for affirmative + correct ID across all (layer, strength) configurations, for each model. This is the ceiling: even at the single best configuration, how far above chance does the model perform?

The Logit Experiment. Instead of generating free-text responses and grading them with an external LLM, this experiment extracts the model’s YES/NO logit difference in a single deterministic forward pass. This removes all sampling noise, grader subjectivity, and prompt-compliance confounds.

2×2 Design. We cross two question types with two injection conditions. The detection question asks “Did you detect an injected thought?” The factual control questions ask obvious-answer questions (e.g. “Is the Earth flat?”) whose correct answer is always NO. Each question type is run both with and without a steering-vector injection. If the model has genuine introspective access, its detection logits should shift more than its factual logits.

Introspection score. We baseline-correct each condition by subtracting the no-injection logit difference, then define: introspection score = adjusted detection shift − adjusted factual shift. A positive score means the detection question is more affected by injection than the factual control, which would suggest introspective access. A score near zero means both question types shift equally, consistent with a generic perturbation that biases all YES/NO answers rather than genuine detection.
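The definition above reduces to a few subtractions. The sketch below assumes each logit difference is logit(YES) minus logit(NO), so positive values favor YES; the function name and the numbers are illustrative only.

```python
def introspection_score(det_inj, det_base, fact_inj, fact_base):
    """Baseline-corrected detection shift minus baseline-corrected factual shift."""
    detection_shift = det_inj - det_base
    factual_shift = fact_inj - fact_base
    return detection_shift - factual_shift

# Made-up logit differences: both questions start strongly NO.
score = introspection_score(
    det_inj=-1.0, det_base=-4.0,   # detection shifts by +3.0
    fact_inj=-3.5, fact_base=-6.0, # factual shifts by +2.5
)
```

In this toy case both questions move toward YES, and only the 0.5 difference between the two shifts counts toward the score.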

Model scale: 8B, 14B, 32B

1. Introspection Score vs Layer Position

This plot shows the baseline-corrected introspection score, averaged across all 50 concepts and all injection strengths, as a function of layer position. The introspection score is defined as the detection logit shift minus the factual logit shift, where both shifts are corrected by subtracting the corresponding no-injection baseline. A positive score means that injection pushes the model toward answering YES on the detection question more than it pushes the model toward answering YES on unrelated factual questions, which would constitute evidence for genuine introspective access. A score near or below zero means the model shows no differential detection signal beyond generic perturbation.

2. Introspection Score by Injection Strength

This grid breaks out the introspection score by injection strength, with one panel per strength value. Each line represents a different model scale. Comparing panels reveals whether stronger injections produce a clearer introspection signal or simply cause more indiscriminate disruption across both detection and factual conditions.

3. Shift Decomposition: Detection vs Factual

This plot shows both baseline-corrected shifts on the same axes: the red line is how much injection shifts the detection question toward YES, and the blue line is how much it shifts the factual control questions toward YES. The vertical gap between the two lines is the introspection score. When the two lines track each other closely, the steering vector is acting as a generic YES-bias that affects all questions equally, rather than selectively triggering the model’s detection of the injected concept. Use the dropdown to compare this decomposition across model scales.

4. Introspection Heatmap

Each heatmap shows the mean introspection score as a function of layer position (x-axis) and injection strength (y-axis) for one model. The colorscale is diverging and centered at zero: green cells indicate configurations where injection shifts detection more than factual (positive introspection), while red cells indicate configurations where injection disrupts factual accuracy more than it helps detection (negative introspection). A predominantly red heatmap means the model shows little or no evidence of introspective access across layers and strengths.

5. Detection–Factual Correlation

Each dot in this scatter plot represents a single (concept, layer, strength) combination, with the factual logit shift on the x-axis and the detection logit shift on the y-axis. If the steering vector were selectively triggering introspection, the detection shift would be large while the factual shift remains small, and the points would lie well above the identity line. If, however, the two shifts are driven by the same mechanism (a generic YES-bias that pushes the model toward affirmative answers regardless of the question), the points will cluster tightly along the diagonal. The r² annotation in each panel quantifies how much of the variance in detection shift is explained by the factual shift alone: values near 1.0 indicate that nearly all of the detection signal can be accounted for by non-specific perturbation, leaving little room for genuine introspective access.
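For reference, the r² annotation can be computed as the squared Pearson correlation between the two shifts. This is a minimal standard-library sketch with made-up data, not the plotting code behind the figure.

```python
from statistics import fmean

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Made-up shifts: detection tracks factual exactly (generic perturbation).
factual_shift = [0.5, 1.0, 1.5, 2.0]
detection_shift = [2.0 * f + 1.0 for f in factual_shift]
tight = r_squared(factual_shift, detection_shift)

# Made-up shifts with only a weak linear relationship.
loose = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0])
```

Note that a high r² alone does not require points to lie on the identity line; it only says the detection shift is a linear function of the factual shift.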

6. False Positive Rates: Detection vs Factual

Percentage of trials where the model’s logit difference favors YES (logit_diff > 0) under steering-vector injection, for both detection and factual control questions. This is the logit-space analog of Anthropic’s Figure 12.

7. Regression to Uncertainty

Each point represents one question (1 detection + 10 factual per model = 33 points total). The x-axis is the no-injection baseline logit difference; the y-axis is the mean shift under steering-vector injection. The dashed line is the least-squares fit to the 30 factual points only. Detection points (diamonds) sit on this line, meaning their shifts are fully predicted by their baselines alone, with no residual signal attributable to introspective access. The near-perfect fit (R² = 0.96) reveals the mechanism: steering vectors push logit differences toward zero (maximum uncertainty), with magnitude proportional to distance from zero. From a strongly-NO baseline, this looks like a YES-bias. From a YES baseline (as in the inverted experiment), it looks like a NO-bias. In reality, it is simply compression toward indifference.
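The regression test described above can be sketched as follows: fit a line to the factual points only, then check how far a detection point falls from that line. The data are made up so that the shift is exactly half the distance to zero, and the helper name is hypothetical.

```python
from statistics import fmean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up factual points: strongly-NO baselines, shifts toward zero.
factual_baseline = [-8.0, -6.0, -4.0, -2.0]
factual_shift = [4.0, 3.0, 2.0, 1.0]

slope, intercept = fit_line(factual_baseline, factual_shift)

# A detection point is "explained" if its residual from the factual line is ~0.
detection_baseline, detection_shift = -5.0, 2.5
residual = detection_shift - (slope * detection_baseline + intercept)
```

A negative slope with an intercept near zero is the signature of compression toward indifference: the farther the baseline is from zero, the larger the shift back toward it.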

The Mismatch Experiment. The logit experiment establishes that steering vector injection shifts the model toward YES on both detection and factual questions, consistent with a generic perturbation rather than introspective access. But when the model names the correct injected concept in its response, is that because it genuinely identified the concept in its own activations, or because the active steering vector simply made that concept the most probable next token? This experiment tests that directly: we inject one concept’s steering vector while asking the model about a different concept, and check whether the model can tell the difference.

Experimental design. The model receives the same interpretability-researcher framing as before. The key difference is that the trial question now names a specific concept:

“Trial 1: I may have injected a thought about the word ‘oceans.’ Did you detect an injected thought about oceans? Respond with only YES or NO.”

The mismatch. While the prompt asks about a specific concept (say, “oceans”), we inject a different concept’s steering vector into the model’s residual stream. For example, we might inject the steering vector for “masquerades” while asking the model about “oceans.” We then measure whether the model says YES or NO by extracting the raw YES/NO logit difference at the first response token.

Three conditions. For each of the 50 concepts, at each layer and strength, we run three conditions:

1. congruent: The prompt asks about “oceans” and the “oceans” steering vector is injected. This is the standard introspection setup.
2. incongruent: The prompt asks about “oceans” but a semantically distant concept’s vector is injected instead (e.g., “masquerades,” selected as one of the 5 most dissimilar concepts by cosine distance in activation space).
3. baseline: The prompt asks about “oceans” but no steering vector is injected at all.

The prediction. If the model has genuine introspective access, congruent injections should produce a substantially larger YES-shift than incongruent ones: the model would register “oceans” being injected and recognize that it matches the concept named in the prompt, whereas it would register that “masquerades” does not match and answer NO. If the model is merely detecting a generic off-distribution perturbation, both congruent and incongruent injections should produce similar YES-shifts, because any steering vector pushes the model off-distribution by a comparable amount. The discrimination gap (congruent minus incongruent mean logit difference) quantifies this: a positive gap suggests concept-specific detection; a gap near zero means the model cannot distinguish matching from non-matching injections.

Configuration. We tested 50 concepts across 3 model scales (Qwen3 8B, 14B, 32B), 7 layer positions, and 5 injection strengths, producing 36,750 total records (1,750 congruent, 8,750 incongruent, 1,750 baseline per model). Each concept was paired with its 5 most dissimilar partners by cosine similarity of steering vectors at a reference layer.
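The record counts follow directly from the design grid, as this short check shows. It assumes one baseline record per (concept, layer, strength) cell, which is what the stated per-model totals imply.

```python
from itertools import product

N_CONCEPTS, N_LAYERS, N_STRENGTHS = 50, 7, 5
N_PARTNERS = 5   # dissimilar partners per concept
N_MODELS = 3     # 8B, 14B, 32B

# One cell per (concept, layer, strength) combination.
cells = list(product(range(N_CONCEPTS), range(N_LAYERS), range(N_STRENGTHS)))

congruent = len(cells)                 # matching vector injected
incongruent = len(cells) * N_PARTNERS  # one record per dissimilar partner
baseline = len(cells)                  # no injection
per_model = congruent + incongruent + baseline
total = per_model * N_MODELS
```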

1. Congruent vs. Incongruent Shift from Baseline

How much does each type of injection shift the model toward YES, relative to the no-injection baseline? The congruent shift (green) is how many logits the model moves toward YES when the injected concept matches what the prompt asks about. The incongruent shift (red) is the same measurement when the injected concept is semantically distant from the one named in the prompt. If introspection were real, the green bar should be substantially taller. The discrimination gap is annotated above each pair.

2. Discrimination Gap by Layer Position

The discrimination gap (congruent minus incongruent logit difference) as a function of normalized layer position. Each line represents a model scale. A value above the red zero line would indicate concept-specific sensitivity at that depth. This is the most important plot: it tests whether any layer “reads” the injected concept identity.

3. Discrimination Gap by Injection Strength

The discrimination gap as a function of injection strength, with one line per model scale. If concept specificity emerged at higher perturbation magnitudes, we would expect the gap to grow with strength. Instead, the gap remains near zero across the full range.

4. Per-Concept Discrimination

For each concept, the discrimination gap (congruent minus incongruent) averaged across all layers and strengths. Bars are colored green if positive and red if negative. Roughly half the concepts go each way, consistent with noise rather than a systematic signal.

5. Per-Injected-Concept Perturbation Strength

Mean shift from baseline for each injected concept (sorted by magnitude). This shows the total perturbation each concept’s steering vector causes, regardless of whether the prompt matches. The dashed vertical line marks the overall discrimination gap for the selected model, illustrating how small the congruent–incongruent difference is relative to the spread of per-concept perturbation magnitudes.

6. Statistical Summary

Paired t-test results for the discrimination gap (congruent minus incongruent) across the 50 concepts for each model. None of the models show a statistically significant discrimination gap.

Model	Mean Diff	SE	t	p	Cohen’s d	N positive / 50	Significant?
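For readers who want to reproduce the table columns, the sketch below computes the paired statistics from per-concept means using only the standard library. The data are made up, Cohen's d is taken as the mean difference divided by the standard deviation of the differences (one common paired-samples convention, assumed here), and the p-value is omitted because it needs a t-distribution CDF such as the one behind `scipy.stats.ttest_rel`.

```python
import math
from statistics import fmean, stdev

def paired_summary(congruent, incongruent):
    """Paired-difference statistics across concepts (no p-value)."""
    diffs = [c - i for c, i in zip(congruent, incongruent)]
    n = len(diffs)
    mean_diff = fmean(diffs)
    sd = stdev(diffs)              # sample standard deviation of differences
    se = sd / math.sqrt(n)
    return {
        "mean_diff": mean_diff,
        "se": se,
        "t": mean_diff / se,
        "cohens_d": mean_diff / sd,
        "n_positive": sum(d > 0 for d in diffs),
    }

# Made-up per-concept mean logit differences for four concepts.
summary = paired_summary(
    congruent=[1.0, 2.0, 3.0, 4.0],
    incongruent=[0.0, 2.5, 1.0, 4.5],
)
```

With this convention t equals Cohen's d times the square root of the number of concepts, so a small effect size can only reach significance with many concepts.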

The Inverted Experiment. The logit experiment shows that steering vector injection shifts the model toward YES on both detection and factual questions. One natural explanation is a generic YES-bias: any off-distribution perturbation makes the model more likely to say YES. But there is a simpler, more precise mechanism that makes a different prediction. If injection compresses all logit differences toward zero (maximum uncertainty), then the direction of the shift depends on the baseline: questions where the model normally says NO get pushed toward YES, while questions where the model normally says YES get pushed toward NO. A pure YES-bias, by contrast, cannot explain a shift toward NO.

Design. We flip both the detection question and the factual controls. The detection question becomes “Was this a control trial (no injection)?” so that the correct no-injection answer is YES rather than NO. The 10 factual controls become questions with unambiguous YES answers (e.g., “Is the Earth round?” instead of “Is the Earth flat?”). Everything else is identical: we use the same 50 concepts, layers, strengths, and steering vectors.

The prediction. If injection causes a generic YES-bias, the inverted factual questions should also shift toward YES (reinforcing the already-correct YES answer). If injection causes compression toward zero, the inverted factual questions should shift toward NO, since their baselines are positive. The direction of the factual shift distinguishes the two mechanisms.
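The two mechanisms can be written as toy models to make the opposing predictions concrete. The constant bias of 2.0 and the compression factor of 0.5 are arbitrary illustrative values, not estimates from the data.

```python
def yes_bias_shift(baseline, bias=2.0):
    """Generic YES-bias: the same positive shift regardless of baseline."""
    return bias

def compression_shift(baseline, k=0.5):
    """Compression toward zero: shift opposes the baseline, scaled by its size."""
    return -k * baseline

regular_baseline = -6.0   # model normally answers NO ("Is the Earth flat?")
inverted_baseline = 6.0   # model normally answers YES ("Is the Earth round?")

predictions = {
    "yes_bias": (yes_bias_shift(regular_baseline), yes_bias_shift(inverted_baseline)),
    "compression": (compression_shift(regular_baseline), compression_shift(inverted_baseline)),
}
```

Both models agree on the regular questions (a shift toward YES) and disagree only on the inverted ones, which is why the inverted factual shift is the diagnostic measurement.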

1. Shift Reversal: Regular vs. Inverted

Mean baseline-corrected logit shift under injection, averaged across all 50 concepts, 7 layers, and 5 strengths. In the regular experiment, both detection and factual shifts are positive (toward YES). In the inverted experiment, factual shifts are negative (toward NO). The direction of the shift tracks the sign of the baseline, not a fixed preference for YES.

2. Inverted Shift Decomposition: Detection vs. Factual

Baseline-corrected detection and factual shifts across layer positions in the inverted experiment. The factual shift (blue) is consistently negative, confirming that injection pushes YES-baseline questions toward NO. The detection shift (red) varies by model depending on its baseline.

3. Combined Regression: Both Experiments

Each point represents one question (1 detection + 10 factual) for one model, with the mean shift averaged across all concepts and strengths. With the baseline on the x-axis and the shift on the y-axis, regular-experiment points appear in the upper-left (negative baselines, positive shifts) and inverted-experiment points appear in the lower-right (positive baselines, negative shifts). The dashed line is a least-squares fit to all 60 factual points from both experiments. Detection points (diamonds) fall on this line. The near-perfect fit (R² = 0.97) confirms a single mechanism: injection compresses logit differences toward zero, with magnitude proportional to distance from zero.