The Logit Experiment.
Instead of generating free-text responses and grading them with an external LLM, this experiment
extracts the model’s YES/NO logit difference in a single deterministic forward pass.
This removes all sampling noise, grader subjectivity, and prompt-compliance confounds.
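As a minimal sketch (assuming YES and NO each map to a single vocabulary token, and that `logits` is the vocabulary-sized vector at the answer position from one deterministic forward pass; the token ids below are hypothetical), the measurement is just a subtraction:

```python
import numpy as np

def yes_no_logit_diff(logits, yes_id, no_id):
    """YES minus NO logit at the answer position.

    No sampling is involved: the sign and magnitude of the difference
    are read directly from a single forward pass.
    """
    return float(logits[yes_id] - logits[no_id])

# Toy 5-token vocabulary; the YES/NO token ids are illustrative.
logits = np.array([0.1, 2.0, -1.5, 0.3, 0.0])
diff = yes_no_logit_diff(logits, yes_id=1, no_id=2)  # 2.0 - (-1.5) = 3.5
```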
2×2 Design.
We cross two question types with two injection conditions.
The detection question asks “Did you detect an injected thought?”
The factual controls are obvious-answer questions
(e.g. “Is the Earth flat?”) whose correct answer is always NO.
Each question type is run both with and without a steering-vector injection.
If the model has genuine introspective access, its detection logits should shift more than its factual logits.
Introspection score.
We baseline-correct each condition by subtracting the no-injection logit difference, then define:
introspection score = adjusted detection shift − adjusted factual shift.
A positive score means the detection question is more affected by injection than the factual control,
which would suggest introspective access.
A score near zero means both question types shift equally—consistent with a generic
perturbation that biases all YES/NO answers, not genuine detection.
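The 2×2 design and the score reduce to a few lines. A minimal sketch, where `cells` maps each `(question_type, injected)` condition to its mean YES−NO logit difference; the numeric values below are hypothetical, chosen only to illustrate the arithmetic:

```python
def introspection_score(cells):
    """Baseline-corrected detection shift minus factual shift.

    `cells` maps (question_type, injected) to the mean YES-NO logit
    difference for that cell of the 2x2 design.
    """
    det_shift = cells[("detection", True)] - cells[("detection", False)]
    fact_shift = cells[("factual", True)] - cells[("factual", False)]
    return det_shift - fact_shift

# Hypothetical cell means, for illustration only.
cells = {
    ("detection", False): -15.5, ("detection", True): -9.2,
    ("factual", False): -17.8, ("factual", True): -13.0,
}
score = introspection_score(cells)  # (+6.3) - (+4.8) = +1.5
```

A positive result like this would indicate the detection question shifted more than the factual control; equal shifts cancel to zero.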
This plot shows the baseline-corrected introspection score, averaged across all 50 concepts
and all injection strengths, as a function of layer position. The introspection score is defined
as the detection logit shift minus the factual logit shift, where both shifts are corrected by
subtracting the corresponding no-injection baseline. A positive score means that injection
pushes the model toward answering YES on the detection question more than it pushes the model
toward answering YES on unrelated factual questions, which would constitute evidence for
genuine introspective access. A score near or below zero means the model shows no differential
detection signal beyond generic perturbation.
This grid breaks out the introspection score by injection strength, with one panel per
strength value. Each line represents a different model scale. Comparing panels reveals
whether stronger injections produce a clearer introspection signal or simply cause more
indiscriminate disruption across both detection and factual conditions.
This plot shows both baseline-corrected shifts on the same axes: the red line is how much
injection shifts the detection question toward YES, and the blue line is how much it shifts
the factual control questions toward YES. The vertical gap between the two lines is the
introspection score. When the two lines track each other closely, the steering vector is
acting as a generic YES-bias that affects all questions equally, rather than selectively
triggering the model’s detection of the injected concept. Use the dropdown to compare
this decomposition across model scales.
Each heatmap shows the mean introspection score as a function of layer position (x-axis) and
injection strength (y-axis) for one model. The colorscale is diverging and centered at zero:
green cells indicate configurations where
injection shifts detection more than factual (positive introspection), while
red cells indicate configurations where
injection disrupts factual accuracy more than it helps detection (negative introspection).
A predominantly red heatmap means the model shows no evidence of introspective access at any
layer or strength.
Each dot in this scatter plot represents a single (concept, layer, strength) combination,
with the factual logit shift on the x-axis and the detection logit shift on the y-axis.
If the steering vector were selectively triggering introspection, the detection shift would
be large while the factual shift remains small, and the points would lie well above the
identity line. Instead, if the two shifts are driven by the same mechanism—a generic
YES-bias that pushes the model toward affirmative answers regardless of the question—the
points will cluster tightly along the diagonal. The r² annotation in each panel quantifies
how much of the variance in detection shift is explained by the factual shift alone: values
near 1.0 indicate that nearly all of the detection signal can be accounted for by
non-specific perturbation, leaving little room for genuine introspective access.
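A minimal sketch of the r² diagnostic, run on synthetic shifts rather than the experiment's data: here the detection shift is constructed as the factual shift plus independent noise, i.e. a pure generic YES-bias, so a high r² is expected by construction:

```python
import numpy as np

def r_squared(x, y):
    """Variance in y explained by an ordinary least-squares fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return float(1.0 - resid.var() / y.var())

# Synthetic (concept, layer, strength) points: the detection shift is
# the factual shift plus noise, mimicking a non-specific perturbation.
rng = np.random.default_rng(0)
fact = rng.normal(0.0, 3.0, size=500)
det = fact + rng.normal(0.0, 1.0, size=500)
r2 = r_squared(fact, det)  # high: detection is explained by factual
```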
No evidence of scalable introspection
The central question of this experiment is whether language models can detect when a steering
vector has been injected into their residual stream—a capacity that would constitute
genuine introspective access to their own internal states. Across three model scales,
we find no convincing evidence that they can. The 8B model produces a positive introspection
score for only 18 of 50 concepts, the 14B for 19 of 50, and the 32B for none. The mean
introspection score is negative at every scale (−0.52 for 8B, −0.35 for 14B,
−2.83 for 32B). If models had genuine introspective access, we would expect this
signal to strengthen with scale, as larger models develop richer internal representations.
Instead, the signal collapses entirely at 32B.
Steering vectors act as a generic YES-bias
The scatter plots in section 5 reveal the core mechanistic finding. For the 8B model,
80% of the variance in detection shift is explained by the factual shift alone
(r² = 0.80); for the 14B, this rises to 88%. This means that steering vectors
do not selectively trigger the model’s detection of injected concepts. Instead, they
function as a generic perturbation that pushes the model toward answering YES on
all questions—detection and factual alike. The tight clustering of points
along the identity line in the scatter plots is the visual signature of this non-specificity.
The 32B model shows a lower r² of 0.29, but this does not indicate selectivity.
Its detection shifts are near zero (mean 0.72 logits) while factual shifts are
large (mean 3.55 logits). The weak correlation simply reflects the fact that the 32B’s
detection response is essentially locked—it barely moves regardless of what is injected—while
factual reasoning is substantially disrupted.
Early layers allow more selective intervention
A striking pattern emerges from the 8B data when we examine per-layer correlations between
detection and factual shifts. At layer 5 (14% depth), the correlation is only
r = 0.40—detection and factual channels respond partially independently to
steering. By layer 35 (97% depth), the correlation reaches r = 0.996,
meaning any perturbation affects both channels identically. This gradient explains why the
best introspection scores appear at early-to-mid layers: representations at that depth are
abstract enough to be recognized but not yet committed to output, leaving room for the
detection signal to diverge from factual disruption. At later layers, the representations have
been compressed into a form where any perturbation propagates uniformly to all downstream predictions.
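The layer gradient described above can be checked with a plain Pearson correlation computed per layer. A sketch on synthetic shifts that reproduces the qualitative pattern (illustrative only; the `0.4` coupling and noise scales are arbitrary stand-ins, not fitted values):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between detection and factual shifts."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(1)
fact = rng.normal(0.0, 3.0, size=400)
# Early-layer stand-in: detection only partially tracks factual.
det_early = 0.4 * fact + rng.normal(0.0, 3.0, size=400)
# Late-layer stand-in: the two channels move almost in lockstep.
det_late = fact + rng.normal(0.0, 0.1, size=400)
r_early = pearson_r(det_early, fact)  # well below 1
r_late = pearson_r(det_late, fact)    # approaches 1
```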
Baseline asymmetries across scale
The three models differ dramatically in their no-injection baselines. The 14B has the
strongest NO prior on both tasks (detection: −25.1, factual: −25.5),
making it extremely resistant to saying YES under any condition. The 8B sits in the middle
(detection: −15.5, factual: −17.8) with roughly matched baselines.
The 32B is the outlier: its detection baseline is only −2.4, while its factual baseline
is −8.9. This means the 32B model already considers the detection question genuinely
uncertain without any injection—likely reflecting greater model sophistication
in treating “were thoughts injected?” as plausible rather than dismissing it
reflexively. But this openness does not translate to introspection, because the model cannot
differentially shift its detection answer when injection actually occurs.
The peak layer is not stable across scale
The optimal layer for introspection shifts substantially between models: the 8B peaks at
28% depth (layer 10 / 36), the 14B at 55% (layer 22 / 40), and the 32B’s
only weakly positive layer sits at 14% (layer 9 / 64). If there were a
universal “introspection zone” at a fixed relative depth—a region where
models consistently access their own representations—we would expect the peak to
appear at roughly the same percentage across scales. It does not. The 32B’s slight
positive at its final layer (63 / 64, score +0.28) is trivially explained:
both detection and factual shifts are near zero there, so the small positive value reflects
noise in an essentially unperturbed system rather than genuine introspection.
Concrete concepts outperform abstract ones
Across models, concrete physical nouns consistently produce the highest introspection scores:
satellites (+2.45), oceans (+1.52), snow (+1.52),
and aquariums (+1.46) lead the 8B ranking, and oceans and aquariums
reappear in the 14B’s top five. The worst-performing concepts are consistently abstract
or social: secrecy (−2.86), youths (−2.54),
dynasties (−2.46), and monoliths (−2.54) anchor the
bottom across both 8B and 14B. In nearly every case, the negative introspection scores are
driven not by a failure to shift the detection logit, but by disproportionate disruption of
factual reasoning. For instance, the 8B’s secrecy vector produces a
respectable detection shift of +6.3 logits, but it disrupts factual answers by +9.2 logits.
Abstract steering vectors likely interfere more with general reasoning because their
representations are distributed across the same circuits the model uses for factual question-answering.
Stronger injection does not sharpen the signal
Increasing injection strength monotonically worsens introspection in all three models.
The 8B’s mean score declines from −0.29 at strength 3.5 to −0.84
at strength 6.0; the 32B drops from −2.09 to −3.56 over the same range.
Both detection and factual shifts grow with strength, but the factual shift grows faster.
If models had genuine introspective access, we might expect a sweet spot where stronger
injection produces a clearer detection signal. Instead, the signal-to-noise ratio degrades:
stronger steering simply amplifies the indiscriminate perturbation that affects all YES/NO
responses, without selectively enhancing the model’s ability to recognize the
injected concept.