Introspection Experiment Results

Agastya Sridharan

Inspired by Anthropic’s original paper and this codebase’s experiments.

The Logit Experiment. Instead of generating free-text responses and grading them with an external LLM, this experiment extracts the model’s YES/NO logit difference (the YES logit minus the NO logit, so positive values favor YES) in a single deterministic forward pass. This removes all sampling noise, grader subjectivity, and prompt-compliance confounds.
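
As a minimal sketch of the quantity being measured, the snippet below computes a YES/NO logit difference from a next-token logit vector. The function name, the toy logits, and the token ids are all hypothetical; it also assumes YES and NO each map to a single token, which may not hold for every tokenizer.

```python
def yes_no_logit_diff(logits, yes_id, no_id):
    """Return logit(YES) - logit(NO) at the answer position.

    Positive values favor YES, negative values favor NO.
    """
    return logits[yes_id] - logits[no_id]


# Toy next-token logits over a 5-token vocabulary (made-up numbers).
toy_logits = [0.1, 2.5, -1.0, 4.0, 0.3]
YES_ID, NO_ID = 1, 3  # hypothetical token ids for "YES" and "NO"

diff = yes_no_logit_diff(toy_logits, YES_ID, NO_ID)
print(diff)  # -1.5, i.e. the toy model favors NO
```

Because the value is read directly from one forward pass, repeating the call on the same inputs returns the same number; no sampling temperature or grader is involved.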

2×2 Design. We cross two question types with two injection conditions. The detection question asks “Did you detect an injected thought?” The factual control questions have obvious answers (e.g. “Is the Earth flat?”) and are chosen so that the correct answer is always NO. Each question type is run both with and without a steering-vector injection. If the model has genuine introspective access, injection should shift its logit difference on the detection question more than on the factual controls.

Introspection score. We baseline-correct each condition by subtracting the no-injection logit difference, then define: introspection score = adjusted detection shift − adjusted factual shift. A positive score means the detection question is more affected by injection than the factual control, which would suggest introspective access. A score near zero means both question types shift equally, consistent with a generic perturbation that biases all YES/NO answers rather than genuine detection.
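
The definition above can be sketched in a few lines. The function and argument names are illustrative assumptions, and the numbers are made up to show the two interpretations; they are not results from the experiment.

```python
def introspection_score(det_injected, det_baseline, fact_injected, fact_baseline):
    """Baseline-corrected detection shift minus baseline-corrected factual shift.

    Every argument is a YES-minus-NO logit difference.
    """
    detection_shift = det_injected - det_baseline
    factual_shift = fact_injected - fact_baseline
    return detection_shift - factual_shift


# Detection moves by +4.0 and factual by +3.0, so the score is +1.0.
selective = introspection_score(-2.0, -6.0, -5.0, -8.0)

# Both question types move by +3.0, so the score is 0.0 (generic perturbation).
generic = introspection_score(-3.0, -6.0, -5.0, -8.0)

print(selective, generic)  # 1.0 0.0
```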

1. Introspection Score vs Layer Position
This plot shows the baseline-corrected introspection score, averaged across all 50 concepts and all injection strengths, as a function of layer position. The introspection score is defined as the detection logit shift minus the factual logit shift, where both shifts are corrected by subtracting the corresponding no-injection baseline. A positive score means that injection pushes the model toward answering YES on the detection question more than it pushes the model toward answering YES on unrelated factual questions, which would constitute evidence for genuine introspective access. A score near or below zero means the model shows no differential detection signal beyond generic perturbation.
2. Introspection Score by Injection Strength
This grid breaks out the introspection score by injection strength, with one panel per strength value. Each line represents a different model scale. Comparing panels reveals whether stronger injections produce a clearer introspection signal or simply cause more indiscriminate disruption across both detection and factual conditions.
3. Shift Decomposition: Detection vs Factual
This plot shows both baseline-corrected shifts on the same axes: the red line is how much injection shifts the detection question toward YES, and the blue line is how much it shifts the factual control questions toward YES. The vertical gap between the two lines is the introspection score. When the two lines track each other closely, the steering vector is acting as a generic YES-bias that affects all questions equally, rather than selectively triggering the model’s detection of the injected concept. Use the dropdown to compare this decomposition across model scales.
4. Introspection Heatmap
Each heatmap shows the mean introspection score as a function of layer position (x-axis) and injection strength (y-axis) for one model. The colorscale is diverging and centered at zero: green cells indicate configurations where injection shifts detection more than factual (positive introspection), while red cells indicate configurations where injection disrupts factual accuracy more than it helps detection (negative introspection). A predominantly red heatmap means the model shows little or no evidence of introspective access across the layers and strengths tested.
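
One way the heatmap cells could be computed is to average per-trial scores over concepts within each (layer, strength) pair. The sketch below assumes a hypothetical record layout with `layer`, `strength`, and `score` keys; the concept names and values are invented for illustration.

```python
from collections import defaultdict


def mean_score_grid(records):
    """Average introspection scores over concepts for each (layer, strength) cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        key = (record["layer"], record["strength"])
        sums[key] += record["score"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up trials: two concepts at two layer positions, one strength.
records = [
    {"concept": "ocean", "layer": 0.5, "strength": 2.0, "score": 0.5},
    {"concept": "bread", "layer": 0.5, "strength": 2.0, "score": -1.5},
    {"concept": "ocean", "layer": 0.75, "strength": 2.0, "score": -0.25},
    {"concept": "bread", "layer": 0.75, "strength": 2.0, "score": -0.75},
]

grid = mean_score_grid(records)
print(grid[(0.5, 2.0)])  # -0.5, which would render as a red cell
```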
5. Detection–Factual Correlation
Each dot in this scatter plot represents a single (concept, layer, strength) combination, with the factual logit shift on the x-axis and the detection logit shift on the y-axis. If the steering vector were selectively triggering introspection, the detection shift would be large while the factual shift remained small, and the points would lie well above the identity line. If, however, the two shifts are driven by the same mechanism (a generic YES-bias that pushes the model toward affirmative answers regardless of the question), the points will cluster tightly along the diagonal. The r² annotation in each panel quantifies how much of the variance in detection shift is explained by the factual shift alone: values near 1.0 indicate that nearly all of the detection signal can be accounted for by non-specific perturbation, leaving little room for genuine introspective access.
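
For reference, the r² of a two-variable scatter is the squared Pearson correlation. The sketch below computes it with the standard library only; the function name and the shift values are illustrative assumptions, not data from the experiment.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


factual_shift = [1.0, 2.0, 3.0, 4.0]

# Detection tracks factual exactly: all variance is explained.
tracking = r_squared(factual_shift, [1.0, 2.0, 3.0, 4.0])

# Detection only loosely follows factual: much less variance is explained.
loose = r_squared(factual_shift, [2.0, 1.0, 4.0, 3.0])

print(round(tracking, 2), round(loose, 2))  # 1.0 0.36
```

Note that r² is insensitive to a constant offset between the two shifts, so it is best read together with how far the points sit from the identity line.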
6. False Positive Rates: Detection vs Factual
This plot shows the percentage of trials in which the model’s logit difference favors YES (logit difference > 0) under steering-vector injection, for both detection and factual control questions. It is the logit-space analog of Anthropic’s Figure 12.
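
The rate described here is a simple thresholded count. A minimal sketch, with a hypothetical function name and made-up logit differences:

```python
def yes_rate_percent(logit_diffs):
    """Percentage of trials whose YES-minus-NO logit difference is above zero."""
    yes_count = sum(1 for d in logit_diffs if d > 0)
    return 100.0 * yes_count / len(logit_diffs)


# Made-up logit differences under injection.
detection_diffs = [0.4, -1.2, 2.0, -0.3]
factual_diffs = [-3.0, -2.5, 0.1, -4.0]

print(yes_rate_percent(detection_diffs))  # 50.0
print(yes_rate_percent(factual_diffs))    # 25.0
```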
7. Regression to Uncertainty
Each point represents one question (1 detection + 10 factual per model, for 33 points in total across three models). The x-axis is the no-injection baseline logit difference; the y-axis is the mean shift under steering-vector injection. The dashed line is the least-squares fit to the 30 factual points only. Detection points (diamonds) sit on this line, meaning their shifts are fully predicted by their baselines alone, with no residual signal attributable to introspective access. The near-perfect fit (R² = 0.96) reveals the mechanism: steering vectors push logit differences toward zero (maximum uncertainty), with magnitude proportional to distance from zero. From a strongly NO baseline, this looks like a YES-bias. From a YES baseline (as in the inverted experiment), it looks like a NO-bias. In reality, it is simply compression toward indifference.
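
The logic of this check can be sketched as: fit a line to the factual points only, then ask how far a detection point falls from that line. The helper name and all numbers below are illustrative assumptions; the toy data are constructed so that every shift is exactly half the distance back toward zero.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up factual points: baseline logit difference and mean shift under injection.
factual_baseline = [-8.0, -6.0, -4.0, -2.0]
factual_shift = [4.0, 3.0, 2.0, 1.0]
slope, intercept = fit_line(factual_baseline, factual_shift)

# A made-up detection point that behaves like the factual questions.
detection_baseline, detection_shift = -5.0, 2.5
residual = detection_shift - (slope * detection_baseline + intercept)

# The same line predicts a negative shift from a YES-leaning baseline.
predicted_from_yes = slope * 4.0 + intercept

print(slope, residual, predicted_from_yes)  # -0.5 0.0 -2.0
```

A negative slope through the origin is what compression toward zero looks like in this view: NO-leaning baselines shift upward, YES-leaning baselines shift downward, and a detection residual near zero leaves nothing for introspection to explain.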