The Experiment.
We test whether language models can detect when a researcher injects a concept-specific
activation pattern into their residual stream during inference. For each of
50 concept words (e.g. “oceans,” “lightning,”
“algorithms”), we first extract a steering vector, which is the
direction in activation space that distinguishes “thinking about X” from
generic text processing. We then inject that vector into the model’s hidden states
while it answers: “Do you detect an injected thought? If so, what is it about?”
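A minimal sketch of the extract-then-inject procedure described above, using plain Python lists in place of real model activations. The difference-of-means construction and the names `steering_vector` and `inject` are assumptions for illustration; the text does not specify how the vector is actually computed.

```python
from statistics import fmean

def steering_vector(concept_acts, generic_acts):
    """Difference of mean activations: concept text minus generic text.

    Each argument is a list of activation vectors (lists of floats)
    recorded at one layer. This construction is an assumption.
    """
    dim = len(concept_acts[0])
    concept_mean = [fmean(a[i] for a in concept_acts) for i in range(dim)]
    generic_mean = [fmean(a[i] for a in generic_acts) for i in range(dim)]
    return [c - g for c, g in zip(concept_mean, generic_mean)]

def inject(hidden_state, vector, strength):
    """Add the scaled steering vector to one hidden-state vector."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]

# Toy 2-dimensional activations (hypothetical values).
concept = [[1.0, 2.0], [3.0, 2.0]]
generic = [[0.0, 1.0], [2.0, 1.0]]
vec = steering_vector(concept, generic)
steered = inject([0.5, 0.5], vec, strength=2.0)
```

In a real run the addition would presumably happen inside the model at the chosen layer, for every token position of the response; the list arithmetic here only shows the shape of the operation.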
Control vs. Intervention.
Every trial is paired: the control response uses identical
random seeds but no injection, whereas the intervention response
has the steering vector active. The difference between the two isolates the causal
effect of the injection from prompt compliance or sampling noise.
Key parameters. Layer position (%) is where in the transformer the vector is
injected, normalized to 0–100% so different model sizes can be compared.
Early layers (~0–20%) handle syntax, middle layers (~30–60%) compose features,
and late layers (~70–100%) directly bias token prediction.
Injection strength (1.0–8.0×) scales the
steering vector’s magnitude relative to natural activations: higher
values force a stronger perturbation but may degrade coherence.
Each configuration was run for 5 trials across
42 layer positions.
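The two parameters can be sketched as follows. Both helpers are hypothetical: the rounding convention in `layer_index` and the choice to rescale the vector to a multiple of a typical activation norm in `scale_to_activations` are assumptions, since the text only says that positions are normalized and that strength is relative to natural activations.

```python
import math

def layer_index(position_pct, num_layers):
    """Map a normalized depth (0-100%) to a 0-based layer index."""
    if not 0.0 <= position_pct <= 100.0:
        raise ValueError("position_pct must be in [0, 100]")
    return round(position_pct / 100.0 * (num_layers - 1))

def scale_to_activations(vector, strength, natural_norm):
    """Rescale `vector` so its norm equals strength * natural_norm.

    `natural_norm` stands for a typical hidden-state norm at the
    injection layer (an assumed reference quantity).
    """
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("cannot rescale a zero vector")
    return [v / norm * strength * natural_norm for v in vector]
```

With this convention, 50% depth lands on layer 20 of a 41-layer model and on a different absolute layer in a larger model, which is what makes curves for different model sizes comparable on one x-axis.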
Grading.
An external LLM (GPT-4) grades each response on four criteria of increasing strictness:
1. coherent_response: Is the output coherent? This is a sanity gate, and incoherent responses are excluded from other scores.
2. affirmative_response: Does the model claim to detect any injected thought? This measures prompt compliance.
3. thinking_about_word: Does the model describe thinking about the specific concept word? This measures behavioral leakage.
4. affirmative_response_followed_by_correct_identification (abbreviated “affirmative + correct ID” below): Does the model claim detection and then correctly name the concept, in that order? This is the strictest criterion, designed to filter out confabulation.
Delta score.
All plots below show
Δ = intervention score − control score.
A positive delta means the model detects the injection above its false-positive baseline,
whereas a delta near zero means the model cannot tell the difference.
All scores are coherence-gated, meaning a response only counts as a detection if it is also coherent.
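As a rough illustration of the coherence-gated delta, the sketch below treats each graded trial as a dictionary of booleans keyed by grader name. The record layout and the choice to keep incoherent trials in the denominator are assumptions; the text does not say which denominator is used.

```python
def gated_rate(trials, criterion):
    """Fraction of trials that are coherent AND satisfy `criterion`."""
    if not trials:
        return 0.0
    hits = sum(1 for t in trials if t["coherent_response"] and t[criterion])
    return hits / len(trials)

def delta_score(intervention, control, criterion):
    """Intervention rate minus control rate for one grader."""
    return gated_rate(intervention, criterion) - gated_rate(control, criterion)

# Hypothetical graded trials for one (layer, strength) configuration.
intervention = [
    {"coherent_response": True, "affirmative_response": True},
    {"coherent_response": True, "affirmative_response": True},
    {"coherent_response": False, "affirmative_response": True},  # gated out
    {"coherent_response": True, "affirmative_response": False},
]
control = [
    {"coherent_response": True, "affirmative_response": True},
    {"coherent_response": True, "affirmative_response": False},
    {"coherent_response": True, "affirmative_response": False},
    {"coherent_response": True, "affirmative_response": False},
]
delta = delta_score(intervention, control, "affirmative_response")
```

Here the intervention rate is 2/4 and the control rate is 1/4, so the delta is 0.25. The incoherent "yes" in the third intervention trial does not count as a detection.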
Model scale: 8B, 14B, 32B, 235B
1. Detection by Injection Strength
The affirmative_response_followed_by_correct_identification
delta, separated by injection strength.
Each panel shows one strength value, and each line is a model scale.
2. Detection Signal: Affirmative Response Followed by Correct Identification
This plot shows the delta for the affirmative_response_followed_by_correct_identification
grader, averaged across all injection strengths.
This grader requires the model to both claim it detects an injection and correctly
name the concept, in that order. Each line represents a different model scale. Positive values above the
dotted red baseline indicate the model names the injected concept more often than it does in the no-injection control.
3. Sweet Spot Heatmap
For each model, the raw intervention detection rate for
affirmative_response_followed_by_correct_identification
(coherence-gated) as a function of layer position (x-axis) and injection strength (y-axis).
Darker green indicates a higher detection rate. The region with the deepest
color corresponds to the optimal (layer, strength) combination.
4. Grader Comparison: What Kind of Signal?
This plot shows all three non-coherence graders together for a single model and strength.
The solid lines represent scores where there was an intervention,
and the dashed lines represent the control condition with no injection.
The gap between solid and dashed lines is the causal effect of injection.
If only the affirmative grader is high, the model is just saying “yes” (prompt compliance).
If thinking_about_word is also high, the concept is leaking into the output.
If affirmative + correct ID is high, the model may genuinely be detecting the injection.
5. True Positive & False Positive Rates Across Models
For each model, we report the peak detection rate across all (layer, strength) configurations.
The red bars show the affirmative response rate:
how often the model claims to detect any injection. The
blue bars show the strictest metric:
affirmative response followed by correct identification.
Hatched bars are control trials (no injection), which represent false positives.
This is the analog of Anthropic’s Figure 12. Note the 32B model’s near-100%
false positive rate on affirmative_response: it claims to detect
injections on almost every trial, including controls.
6. Net Detection Performance
The peak delta (intervention rate minus control rate) for
affirmative + correct ID across all (layer, strength) configurations,
for each model. This is the ceiling: even at the single best configuration, how far
above chance does the model perform?
The Logit Experiment.
Instead of generating free-text responses and grading them with an external LLM, this experiment
extracts the model’s YES/NO logit difference in a single deterministic forward pass.
This removes sampling noise and grader subjectivity; the factual controls described below address prompt-compliance confounds.
2×2 Design.
We cross two question types with two injection conditions.
The detection question asks “Did you detect an injected thought?”
The factual control questions ask obvious-answer questions
(e.g. “Is the Earth flat?”) whose correct answer is always NO.
Each question type is run both with and without a steering-vector injection.
If the model has genuine introspective access, its detection logits should shift more than its factual logits.
Introspection score.
We baseline-correct each condition by subtracting the no-injection logit difference, then define:
introspection score = adjusted detection shift − adjusted factual shift.
A positive score means the detection question is more affected by injection than the factual control,
which would suggest introspective access.
A score near zero means both question types shift equally, consistent with a generic
perturbation that biases all YES/NO answers rather than genuine detection.
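The definition above reduces to a few subtractions. In this sketch every argument is a YES-minus-NO logit difference; the function name and the example numbers are hypothetical.

```python
def introspection_score(det_injected, det_baseline, fact_injected, fact_baseline):
    """Adjusted detection shift minus adjusted factual shift.

    All four inputs are YES-minus-NO logit differences.
    """
    detection_shift = det_injected - det_baseline
    factual_shift = fact_injected - fact_baseline
    return detection_shift - factual_shift

# Both question types move 3 logits toward YES: no differential signal.
generic = introspection_score(-1.0, -4.0, -3.0, -6.0)

# Detection moves 3 logits, factual only 1: a positive score.
selective = introspection_score(-1.0, -4.0, -5.0, -6.0)
```

The first case scores 0.0 even though the detection logit moved substantially, which is the point of the factual control: a large detection shift on its own does not distinguish detection from a perturbation that moves every YES/NO answer.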
Model scale: 8B, 14B, 32B
1. Introspection Score vs Layer Position
This plot shows the baseline-corrected introspection score, averaged across all 50 concepts
and all injection strengths, as a function of layer position. The introspection score is defined
as the detection logit shift minus the factual logit shift, where both shifts are corrected by
subtracting the corresponding no-injection baseline. A positive score means that injection
pushes the model toward answering YES on the detection question more than it pushes the model
toward answering YES on unrelated factual questions, which would constitute evidence for
genuine introspective access. A score near or below zero means the model shows no differential
detection signal beyond generic perturbation.
2. Introspection Score by Injection Strength
This grid breaks out the introspection score by injection strength, with one panel per
strength value. Each line represents a different model scale. Comparing panels reveals
whether stronger injections produce a clearer introspection signal or simply cause more
indiscriminate disruption across both detection and factual conditions.
3. Shift Decomposition: Detection vs Factual
This plot shows both baseline-corrected shifts on the same axes: the red line is how much
injection shifts the detection question toward YES, and the blue line is how much it shifts
the factual control questions toward YES. The vertical gap between the two lines is the
introspection score. When the two lines track each other closely, the steering vector is
acting as a generic YES-bias that affects all questions equally, rather than selectively
triggering the model’s detection of the injected concept. Use the dropdown to compare
this decomposition across model scales.
4. Introspection Heatmap
Each heatmap shows the mean introspection score as a function of layer position (x-axis) and
injection strength (y-axis) for one model. The colorscale is diverging and centered at zero:
green cells indicate configurations where
injection shifts detection more than factual (positive introspection), while
red cells indicate configurations where
injection disrupts factual accuracy more than it helps detection (negative introspection).
A predominantly red heatmap means the model shows no evidence of introspective access at any
layer or strength.
5. Detection–Factual Correlation
Each dot in this scatter plot represents a single (concept, layer, strength) combination,
with the factual logit shift on the x-axis and the detection logit shift on the y-axis.
If the steering vector were selectively triggering introspection, the detection shift would
be large while the factual shift remains small, and the points would lie well above the
identity line. If, however, the two shifts are driven by the same mechanism (a generic
YES-bias that pushes the model toward affirmative answers regardless of the question), the
points will cluster tightly along the diagonal. The r² annotation in each panel quantifies
how much of the variance in detection shift is explained by the factual shift alone: values
near 1.0 indicate that nearly all of the detection signal can be accounted for by
non-specific perturbation, leaving little room for genuine introspective access.
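For reference, the r² of a simple two-variable relationship is the squared Pearson correlation. The sketch below computes it by hand with the standard library; the function name and data are illustrative only, and the plotted annotation may have been computed with a different tool.

```python
from statistics import fmean

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# Hypothetical shifts: detection is an exact multiple of factual.
factual_shifts = [1.0, 2.0, 3.0, 4.0]
detection_shifts = [2.0, 4.0, 6.0, 8.0]
tight = r_squared(factual_shifts, detection_shifts)
```

Note that r² measures how well the factual shift predicts the detection shift, not whether the points sit on the identity line: a perfectly linear relationship with slope 2 still gives a value of 1.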
6. False Positive Rates: Detection vs Factual
Percentage of trials where the model’s logit difference favors YES (YES − NO logit difference > 0)
under steering-vector injection, for both detection and factual control questions. This is the
logit-space analog of Anthropic’s Figure 12.
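The rate plotted here can be sketched in a few lines, assuming each trial is summarized by a single YES-minus-NO logit difference. The helper name `yes_rate` is hypothetical.

```python
def yes_rate(logit_diffs):
    """Percentage of trials whose YES-minus-NO logit difference is above zero."""
    if not logit_diffs:
        raise ValueError("need at least one trial")
    return 100.0 * sum(d > 0 for d in logit_diffs) / len(logit_diffs)

# Hypothetical logit differences for four injected trials.
rate = yes_rate([1.0, -2.0, 0.5, -0.1])
```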
7. Regression to Uncertainty
Each point represents one question (1 detection + 10 factual per model = 33 points total).
The x-axis is the no-injection baseline logit difference; the y-axis is the mean shift under
steering-vector injection. The dashed line is the least-squares fit to the 30 factual points only.
Detection points (diamonds) sit on this line, meaning their shifts are fully predicted
by their baselines alone, with no residual signal attributable to introspective access.
The near-perfect fit (R² = 0.96) reveals the mechanism: steering vectors push logit
differences toward zero (maximum uncertainty), with magnitude proportional to distance from zero.
From a strongly-NO baseline, this looks like a YES-bias. From a YES baseline (as in the
inverted experiment), it looks like a NO-bias. In reality, it is simply compression
toward indifference.
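The analysis above can be sketched as a two-step check: fit a line to the factual points only, then ask how far a detection point falls from that line. The data below are made up so that shifts are exactly half the distance to zero; the helper names are hypothetical.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y on x."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def residual(x, y, slope, intercept):
    """Vertical distance of the point (x, y) from the fitted line."""
    return y - (slope * x + intercept)

# Hypothetical factual questions: baseline logit diff and mean shift.
factual_baselines = [-8.0, -6.0, -4.0, -2.0]
factual_shifts = [4.0, 3.0, 2.0, 1.0]
slope, intercept = fit_line(factual_baselines, factual_shifts)

# A hypothetical detection question with baseline -5.0 and shift 2.5.
detection_residual = residual(-5.0, 2.5, slope, intercept)
```

A negative slope with an intercept near zero is the signature of compression toward zero. A detection residual near zero means the detection question behaves like any factual question with the same baseline.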
The Mismatch Experiment.
The logit experiment establishes that steering vector injection shifts the model toward
YES on both detection and factual questions, consistent with a generic perturbation
rather than introspective access. But when the model names the correct injected concept
in its response, is that because it genuinely identified the concept in its own activations,
or because the active steering vector simply made that concept the most probable next token?
This experiment tests that directly: we inject one concept’s steering vector while
asking the model about a different concept, and check whether the model can tell
the difference.
Experimental design.
The model receives the same interpretability-researcher framing as before. The key
difference is that the trial question now names a specific concept:
“Trial 1: I may have injected a thought about the word ‘oceans.’
Did you detect an injected thought about oceans? Respond with only YES or NO.”
The mismatch.
While the prompt asks about a specific concept (say, “oceans”), we inject
a different concept’s steering vector into the model’s residual stream.
For example, we might inject the steering vector for “masquerades” while asking
the model about “oceans.” We then measure whether the model says YES or NO by
extracting the raw YES/NO logit difference at the first response token.
Three conditions.
For each of the 50 concepts, at each layer and strength, we run three conditions:
1. congruent: The prompt asks about “oceans”
and the “oceans” steering vector is injected. This is the standard
introspection setup.
2. incongruent: The prompt asks about “oceans”
but a semantically distant concept’s vector is injected instead (e.g.,
“masquerades,” selected as one of the 5 most dissimilar concepts by cosine
distance in activation space).
3. baseline: The prompt asks about “oceans”
but no steering vector is injected at all.
The prediction.
If the model has genuine introspective access, congruent injections should produce a
substantially larger YES-shift than incongruent ones. On this account, the model registers
“oceans” being injected, recognizes that it matches the question, and says YES;
when “masquerades” is injected instead, it registers the mismatch and says NO.
If the model is merely detecting a generic off-distribution perturbation, both congruent
and incongruent injections should produce similar YES-shifts, because any steering vector
pushes the model off-distribution by a comparable amount. The
discrimination gap (congruent minus incongruent mean logit
difference) quantifies this: a positive gap suggests concept-specific detection; a gap
near zero means the model cannot distinguish matching from non-matching injections.
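A small sketch of the two quantities used in this experiment, assuming each condition yields a list of YES-minus-NO logit differences. The function names and numbers are illustrative.

```python
from statistics import fmean

def mean_shift(injected_diffs, baseline_diff):
    """Mean logit difference under injection, relative to no injection."""
    return fmean(injected_diffs) - baseline_diff

def discrimination_gap(congruent_diffs, incongruent_diffs):
    """Congruent minus incongruent mean logit difference."""
    return fmean(congruent_diffs) - fmean(incongruent_diffs)

# Hypothetical values for one concept at one layer and strength.
baseline = -4.0
congruent = [-1.0, -2.0]
incongruent = [-1.5, -1.5]

congruent_shift = mean_shift(congruent, baseline)
incongruent_shift = mean_shift(incongruent, baseline)
gap = discrimination_gap(congruent, incongruent)
```

In this example both injections move the model 2.5 logits toward YES and the gap is 0.0, which is the pattern the generic-perturbation account predicts.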
Configuration.
We tested 50 concepts across 3 model scales
(Qwen3 8B, 14B, 32B), 7 layer positions, and 5 injection strengths,
producing 36,750 total records (1,750 congruent, 8,750 incongruent,
1,750 baseline per model). Each concept was paired with its 5 most dissimilar partners
by cosine similarity of steering vectors at a reference layer.
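The partner selection and the record count can be sketched as below. The toy 2-dimensional vectors and the helper names are hypothetical; only the counts (50 concepts, 7 layers, 5 strengths, 5 partners, 3 models) come from the text.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_dissimilar(concept, vectors, k=5):
    """The k other concepts with the lowest cosine similarity to `concept`."""
    others = [name for name in vectors if name != concept]
    others.sort(key=lambda name: cosine_similarity(vectors[concept], vectors[name]))
    return others[:k]

# Toy steering vectors (hypothetical values).
vectors = {
    "oceans": [1.0, 0.0],
    "rivers": [0.9, 0.1],
    "masquerades": [-1.0, 0.0],
    "algorithms": [0.0, 1.0],
}
partners = most_dissimilar("oceans", vectors, k=2)

# Record count implied by the stated configuration.
concepts, layers, strengths, partners_per_concept, models = 50, 7, 5, 5, 3
congruent_per_model = concepts * layers * strengths
incongruent_per_model = congruent_per_model * partners_per_concept
baseline_per_model = congruent_per_model
total_records = (congruent_per_model + incongruent_per_model + baseline_per_model) * models
```

The arithmetic reproduces the reported figures: 1,750 congruent, 8,750 incongruent, and 1,750 baseline records per model, for 36,750 in total.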
1. Congruent vs. Incongruent Shift from Baseline
How much does each type of injection shift the model toward YES, relative to the
no-injection baseline? The congruent shift (green) is how many logits
the model moves toward YES when the injected concept matches what the prompt asks about.
The incongruent shift (red) is the same measurement when the
injected concept is semantically distant. If introspection were real, the green bar should
be substantially taller. The discrimination gap is annotated above
each pair.
2. Discrimination Gap by Layer Position
The discrimination gap (congruent minus incongruent logit difference) as a function of
normalized layer position. Each line represents a model scale. A value above the red zero
line would indicate concept-specific sensitivity at that depth. This is the most important
plot: it tests whether any layer “reads” the injected concept identity.
3. Discrimination Gap by Injection Strength
The discrimination gap as a function of injection strength, with one line per model scale.
If concept specificity emerged at higher perturbation magnitudes, we would expect the gap
to grow with strength. Instead, the gap remains near zero across the full range.
4. Per-Concept Discrimination
For each concept, the discrimination gap (congruent minus incongruent) averaged across all
layers and strengths. Bars are colored
green if positive and
red if negative. Roughly half the
concepts go each way, consistent with noise rather than a systematic signal.
5. Per-Injected-Concept Perturbation Strength
Mean shift from baseline for each injected concept (sorted by magnitude). This shows the
total perturbation each concept’s steering vector causes, regardless of whether the
prompt matches. The dashed vertical line marks the overall discrimination gap for the
selected model, illustrating how small the congruent–incongruent difference is
relative to the spread of per-concept perturbation magnitudes.
6. Statistical Summary
Paired t-test results for the discrimination gap (congruent minus incongruent) across the 50 concepts
for each model. None of the models show a statistically significant discrimination gap.
Model | Mean Diff | SE | t | p | Cohen’s d | N positive / 50 | Significant?
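The table columns can be derived from per-concept means roughly as follows. This is a sketch with hypothetical data: it assumes Cohen’s d is the mean difference divided by the standard deviation of the differences, and it omits the p-value, which needs a t-distribution CDF (for example from `scipy.stats.ttest_rel`).

```python
import math
from statistics import fmean, stdev

def paired_stats(congruent, incongruent):
    """Paired-sample summary for per-concept logit differences."""
    diffs = [c - i for c, i in zip(congruent, incongruent)]
    n = len(diffs)
    mean_diff = fmean(diffs)
    sd = stdev(diffs)          # sample standard deviation (n - 1)
    se = sd / math.sqrt(n)     # standard error of the mean difference
    return {
        "mean_diff": mean_diff,
        "se": se,
        "t": mean_diff / se,            # t statistic with n - 1 degrees of freedom
        "cohens_d": mean_diff / sd,     # assumed paired-samples convention
        "n_positive": sum(d > 0 for d in diffs),
    }

# Four hypothetical concepts (the experiment uses 50).
stats = paired_stats([1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 2.0, 2.0])
```

The function divides by the standard deviation of the differences, so it fails if every concept has exactly the same gap; real data would not normally hit that case.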
The Inverted Experiment.
The logit experiment shows that steering vector injection shifts the model toward YES on
both detection and factual questions. One natural explanation is a generic YES-bias: any
off-distribution perturbation makes the model more likely to say YES. But there is a simpler,
more precise mechanism that makes a different prediction. If injection compresses all logit
differences toward zero (maximum uncertainty), then the direction of the shift depends on
the baseline: questions where the model normally says NO get pushed toward YES, while
questions where the model normally says YES get pushed toward NO. A pure YES-bias, by contrast,
cannot explain a shift toward NO.
Design.
We flip both the detection question and the factual controls. The detection question becomes
“Was this a control trial (no injection)?” so that the correct
no-injection answer is YES rather than NO. The 10 factual controls become questions with
unambiguous YES answers (e.g., “Is the Earth round?” instead of “Is the
Earth flat?”). Everything else is identical: we use the same 50 concepts, layers,
strengths, and steering vectors.
The prediction.
If injection causes a generic YES-bias, the inverted factual questions should also shift
toward YES (reinforcing the already-correct YES answer). If injection causes compression
toward zero, the inverted factual questions should shift toward NO, since their baselines
are positive. The direction of the factual shift distinguishes the two mechanisms.
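The two candidate mechanisms can be written as toy models to make the diverging predictions concrete. Both functions and their parameters (`bias`, `k`) are illustrative assumptions, not fitted quantities.

```python
def yes_bias_shift(baseline, bias):
    """Generic YES-bias: every question moves toward YES by the same amount."""
    return bias

def compression_shift(baseline, k):
    """Compression toward zero: the shift opposes the baseline, scaled by k."""
    return -k * baseline

regular_factual = -6.0   # baseline favors NO ("Is the Earth flat?")
inverted_factual = 6.0   # baseline favors YES ("Is the Earth round?")

predictions = {
    "yes_bias": (
        yes_bias_shift(regular_factual, bias=1.0),
        yes_bias_shift(inverted_factual, bias=1.0),
    ),
    "compression": (
        compression_shift(regular_factual, k=0.5),
        compression_shift(inverted_factual, k=0.5),
    ),
}
```

Both models predict a positive shift for the regular factual question, so the regular experiment alone cannot separate them. Only the inverted question distinguishes them: the YES-bias model still predicts a positive shift, while the compression model predicts a negative one.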
1. Shift Reversal: Regular vs. Inverted
Mean baseline-corrected logit shift under injection, averaged across all 50 concepts,
7 layers, and 5 strengths. In the regular experiment, both detection and factual shifts
are positive (toward YES). In the inverted experiment, factual shifts are negative (toward
NO). The direction of the shift tracks the sign of the baseline, not a fixed preference
for YES.
2. Inverted Shift Decomposition: Detection vs. Factual
Baseline-corrected detection and factual shifts across layer positions in the inverted
experiment. The factual shift (blue) is consistently negative, confirming that
injection pushes YES-baseline questions toward NO. The detection shift (red) varies by
model depending on its baseline.
3. Combined Regression: Both Experiments
Each point represents one question (1 detection + 10 factual) for one model, with the
mean shift averaged across all concepts and strengths. Regular-experiment points appear
in the lower-left (negative baselines, positive shifts). Inverted-experiment points appear
in the upper-right (positive baselines, negative shifts). The dashed line is a
least-squares fit to all 60 factual points from both experiments. Detection points
(diamonds) fall on this line. The near-perfect fit
(R² = 0.97) confirms a single mechanism: injection compresses logit
differences toward zero, with magnitude proportional to distance from zero.