Anthropic's Natural Language Autoencoder (NLA) proposes an unsupervised mechanistic-interpretability technique: an activation verbalizer (AV) writes a natural-language description of a residual-stream activation, and an activation reconstructor (AR) tries to recover the activation from that text. The language bottleneck forces the description to carry the information that matters; the AR's reconstruction loss provides the training signal for the AV via RL.
The technique is new, the reported results are striking, and we want hands-on understanding before deciding whether it applies to Pleiades / Pleiades 2. This ticket reproduces the NLA at toy scale on Othello-GPT (Li et al., 2022), an 8-layer GPT-2 (d_model=512) trained to predict legal Othello moves. Othello-GPT is the right toy because Neel Nanda's follow-up shows a linear probe recovers the model's emergent “mine/their/empty” board representation at >99% accuracy from layer ~4 onwards. That probe gives us ground truth to grade AV verbalizations against, which is the whole reason for picking this benchmark.
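To make the “ground truth” role of the probe concrete, here is a minimal sketch of decoding a board state from a residual activation with a linear probe. The `[d_model, 64, 3]` probe shape is illustrative only; the actual checkpoint from likenneth/othello_world carries extra mode/row/column dimensions that need reshaping first.

```python
import torch

def decode_board(resid: torch.Tensor, probe: torch.Tensor) -> torch.Tensor:
    """Decode per-square board state from residual-stream activations.

    resid: [batch, d_model] residuals at the probed layer.
    probe: [d_model, 64, 3] linear map to per-square logits over
           {empty, their, mine} (illustrative shape, see lead-in).
    Returns: [batch, 64] argmax class per square.
    """
    logits = torch.einsum("bd,dsc->bsc", resid, probe)  # [batch, 64, 3]
    return logits.argmax(dim=-1)
```

The judge in FR10 compares AV claims about squares against exactly this kind of probe-derived decode.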
Single primary user: alfred-pm (researcher). Sub-users of the resulting artifact are interp-curious teammates reading the Notion writeup.
Run the joint AV/AR training loop end-to-end on Othello-GPT residuals at layer 6 and produce an FVE table on held-out games. Inspect typical verbalizations.
Compare reconstruction quality across three diagnostic ARs (prompt-only, verbalization-only, prompt+verbalization). Conclude how much “real” information the verbalization carries beyond what the game prefix already determines.
Sample ~50 positions, generate verbalizations, run an LLM-judge prompt over (verbalization, probe-derived ground truth) and report a hit-rate plus a handful of representative examples.
Read the FVE table + confabulation rate + cost estimate, write the go/no-go on Pleiades. This is the only "user-facing" output that actually matters.
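The headline metrics above reduce to two small functions; a sketch of both, matching the FVE definition in FR8 and the diagnostic gap in FR9:

```python
import torch

def fve(pred: torch.Tensor, target: torch.Tensor) -> float:
    """FVE = 1 - MSE / Var(h_l), computed on the held-out split (FR8)."""
    mse = (pred - target).pow(2).mean()
    var = target.var(unbiased=False)
    return (1.0 - mse / var).item()

def headline_gap(fve_prompt_verb: float, fve_prompt_only: float) -> float:
    """FR9: marginal information the verbalization carries beyond the prefix."""
    return fve_prompt_verb - fve_prompt_only
```

A perfect reconstruction gives FVE = 1.0; a predictor no better than the mean gives ~0. The go/no-go read hinges on the gap, not the absolute FVE, because Othello-GPT activations are deterministic in the game prefix.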
| # | Priority | Requirement |
|---|---|---|
| FR1 | MUST | Load Othello-GPT via HookedTransformer.from_pretrained("othello-gpt") using Neel's synthetic checkpoint, and verify it can reproduce next-legal-move predictions on a held-out game prefix. |
| FR2 | MUST | Reproduce the linear-probe sanity check by loading the pre-trained main_linear_probe.pth from likenneth/othello_world and confirming per-square accuracy > 95% at layer 6 using the “mine/their/empty” reframe on a held-out set. If this fails, debug here before any NLA work. |
| FR3 | MUST | Build the activation dataset: ~30k synthetic Othello games × positions in [5, 54] → cached residuals at layer 6, unit-L2-normalized. Held-out 10% split for evaluation. |
| FR4 | MUST | Generate SFT warm-start data: ~5k (residual, summary) pairs where summaries are drafted via Claude Code CLI in batched --print mode (~20 positions per JSON-array prompt, ~250 invocations total, --model pinned for reproducibility). Each summary is conditioned on the move sequence + a description of what the model is computing at layer 6 (similar to the paper's “leading-the-witness” prompt). |
| FR5 | MUST | Train AV (Qwen2.5-1.5B): SFT warm-start on (residual → summary), then GRPO RL loop with G=8 rollouts/activation (with capacity to sweep G ∈ {8, 16, 32} given H200 headroom), T=1, KL coefficient β toward the frozen post-SFT reference, output capped at ~150 tokens with length penalty. |
| FR6 | MUST | Train AR (text-only): SFT warm-start on (summary → residual) under MSE, then joint training with AV where each batch performs one AR MSE step and one AV GRPO step. AR is small enough (e.g., 4-layer transformer encoder + linear head) that it cannot trivially memorize the deterministic prompt→activation mapping. |
| FR7 | MUST | Implement activation injection for AV via a learned affine adapter ℝ^512 → ℝ^d_qwen applied to a single token position inside a fixed instruction prompt. No hand-tuned α (paper-recommended improvement). |
| FR8 | MUST | Report FVE = 1 − MSE / Var(h_l) on the held-out split at multiple training-step checkpoints. Headline number is the FVE of the verbalization-only AR (the primary, paper-faithful setup). |
| FR9 | MUST | Report two diagnostic baselines alongside FR8: (a) prompt-only AR (move sequence → activation, no verbalization), (b) prompt+verbalization AR. Headline gap is (prompt+verb FVE) − (prompt-only FVE). |
| FR10 | MUST | Confabulation audit: pick ~50 board positions sampled across early/mid/late game phases, generate AV verbalizations, run a structured Claude Opus 4.6 LLM-judge prompt that scores each verbalization for (i) factual correctness of claimed squares vs. probe ground truth, (ii) coverage of present features. Report aggregate hit-rate + at least 8 worked examples. |
| FR11 | SHOULD | Steganography sanity check (from the paper): apply meaning-preserving transformations to AV outputs (paraphrase, bullet shuffle) and measure FVE drop. A small drop is healthy; a large drop suggests AV has begun encoding non-natural-language signal. |
| FR12 | SHOULD | Track writing-quality decay via LLM-judge score over training (paper reports steady decline since there's no explicit reward for legibility). Surface this in the writeup as one of the surprises / tradeoffs. |
| FR13 | MUST | Notion writeup capturing: setup, hparams table, FVE trajectory plots, confabulation hit-rate, 10 worked verbalization examples (5 good, 5 bad), and a one-sentence go/no-go recommendation on Pleiades with the reasoning. |
| FR14 | COULD | One small intervention experiment: edit a verbalization to claim a square is “mine” when the probe says it's “empty”, run AR on the edited text, check whether the predicted residual shifts in the direction the probe associates with the claimed feature. Pure curiosity, drops if time is tight. |
| Decision | Choice | Why |
|---|---|---|
| Target layer | Layer 6 of 8 (~2/3 depth) | Matches paper's "middle-to-late" prescription; also where Neel's mine/their/empty probe is most reliable (>99%). |
| Othello-GPT source | TransformerLens, synthetic-games checkpoint | One-line load, CI-tested, no need to retrain the base model. |
| AV backbone | Qwen2.5-1.5B | Same family as the paper's smallest open-model NLA (Qwen2.5-7B), 14× smaller in absolute terms but in the sweet spot where SFT warm-start “just works” without nursing. Fits trivially on a single H200 (141 GB) and leaves room for parallel ARs + the KL sweep across the rest of the 8-GPU node. |
| AV input format | Fixed instruction prompt with one <concept>A</concept> token whose embedding is replaced by the (affine-projected) Othello-GPT residual | Matches paper's injection scheme; the affine adapter replaces the paper's hand-tuned α (which the paper itself recommends). |
| RL algorithm | GRPO, group size G=8, T=1, length cap ~150 toks, length penalty on cap-hit | Simplest viable choice consistent with the paper. Ticket explicitly out-of-scope on PPO; GRPO is value-head-free and works at small scale. |
| KL anchor | Frozen post-SFT AV (not the original Qwen base) | Paper anchors to the SFT-initialized AV so KL preserves the learned format as well as base fluency. |
| AR architecture | Small text encoder (4–6 layer transformer, <50M params) + linear head → ℝ^512; final-token pooled | Cannot share architecture with M (Othello-GPT speaks moves, not English). Capacity deliberately small so prompt-only baseline doesn't saturate. |
| AR loss target | MSE on unit-L2-normalized h_l | Matches paper; normalization stabilizes scale across positions. |
| Joint training | Per batch: 1 AR MSE step + 1 AV GRPO step, gradients do not couple | Matches paper. Treats AR as a fixed reward model from AV's perspective and AV outputs as fixed inputs from AR's perspective. |
| SFT data source | Claude Code CLI in batched --print mode, model pinned (--model claude-opus-4-7 or sonnet); ~20 positions per call, ~250 invocations, ~5k pairs total | Avoids direct API spend by riding the included Claude Code subscription. Same model under the hood, so output quality is identical. Trade-off: less reproducible for re-runs than direct API; mitigated by pinning --model and persisting raw JSON outputs. |
| Compute split | Local Mac for FR1–FR3 + AR-baseline shakedown; 8×H200 Nebius node for AV training, parallel diagnostic ARs, and the KL sweep | Per user direction. H200 headroom (8×141 GB = 1.1 TB HBM) lets all four ARs (mean/prompt-only/verb-only/prompt+verb) train data-parallel rather than sequentially; KL coefficient sweep becomes a real task instead of a 500-step probe. |
| Confabulation grader | Claude Code CLI (model-pinned), structured prompt | Same rationale as SFT data source. 50 positions easily fits within CLI call budget; judge weight comes from the Opus-class model behind the CLI. |
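The injection scheme in the table (learned affine adapter replacing the paper's α scalar, norm-matched at init per the risk mitigation below) can be sketched as follows. The class name, d_out=1536 (Qwen2.5-1.5B's hidden size), and the dummy-batch calibration are assumptions of this sketch, not the paper's implementation:

```python
import torch
import torch.nn as nn

class InjectionAdapter(nn.Module):
    """Learned affine map R^512 -> R^d_qwen for activation injection (FR7)."""

    def __init__(self, d_in: int = 512, d_out: int = 1536, target_norm: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)
        with torch.no_grad():
            # Calibrate: scale weights so unit-norm inputs map to ~target_norm
            # outputs (set target_norm to Qwen's median embedding norm).
            dummy = torch.randn(1024, d_in)
            dummy = dummy / dummy.norm(dim=-1, keepdim=True)
            cur = self.proj(dummy).norm(dim=-1).median()
            self.proj.weight.mul_(target_norm / cur)
            self.proj.bias.zero_()

    def forward(self, resid: torch.Tensor, embeds: torch.Tensor, pos: int):
        """Splice the projected activation into token position `pos` of the
        fixed instruction prompt's embedding sequence."""
        out = embeds.clone()
        out[:, pos, :] = self.proj(resid)
        return out
```

The norm-matched init is what lets us drop the hand-tuned α: the adapter starts in the right scale regime and learns any further rescaling end-to-end.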
The NLA paper's smallest open-model AV is Qwen2.5-7B. Our AV/M ratio at 1.5B (vs. Othello-GPT's ~25M) is already ~60× — far more generous in relative terms than any paper run. In absolute terms 1.5B is still 4.7× smaller than the paper's smallest, so we stay defensibly “toy”. The 0.5B variant works but typically requires more SFT-prompt nursing to keep outputs on-format; 1.5B reaches stable, on-format generations with a thinner SFT corpus, which matters because we are paying for SFT data with Claude Code CLI calls. 7B would be paper-faithful but slows GRPO throughput by ~5× relative to 1.5B without obviously buying us better signal at toy scale. Same Qwen2.5 family throughout makes lineage easy to cite in the Notion writeup.
| Risk | Why it matters | Mitigation |
|---|---|---|
| AV collapses without SFT warm-start | Paper says runs degenerate into garbled text without it — this is the single biggest "sole-reason-for-this-step" gotcha called out. | FR4 is a MUST; we never skip warm-start. Track AV writing-quality (FR12) at every checkpoint. |
| Prompt-only AR saturates → verbalization has no marginal info to add | Othello-GPT residuals are deterministic in the prompt. If the diagnostic AR is too big, headline gap collapses to zero. | AR deliberately small (<50M params, 4–6 layers). Measure prompt-only baseline early; shrink AR until baseline FVE is well below the joint setup. |
| KL collapse (AV either drifts off fluency or fails to specialize) | Paper notes β is the one knob actually tuned per model. | Sweep β ∈ {0.01, 0.05, 0.1, 0.5} on a 500-step probe run before committing to a long run. Track explanation legibility every N steps. |
| Steganography | Long-term failure mode the paper flags; turns natural-language descriptions into model-private codes. | FR11 paraphrase-invariance check serves as the canary. At toy scale this is unlikely but worth measuring once. |
| Confabulation | Paper reports verifiable-false-claim rate is high and flat through training. AV may "sound right" while being wrong about specific squares. | FR10's LLM judge is the headline audit. Report hit-rate honestly; failing here doesn't invalidate the spike — it's a finding. |
| Compute overrun on Nebius | RL is unpredictable. Paper's Gemma-3-27B reference is ~1.5 days on 16 H100s; toy scale on 8×H200 should be much faster (hours, not days), but headroom invites scope creep (longer sweeps, more seeds, etc.). | Hard cap: one full node-day on the 8×H200. If FVE is still flat against the SFT baseline at that point, stop, report partial results, and write up what went wrong. This is a spike, not a guaranteed-success ticket. |
| Activation injection scaling (no α, learned adapter instead) | Paper warns the α scalar is finicky. Our learned adapter is a deviation from the paper's published runs. | Initialize adapter so output norm matches Qwen's median embedding norm at the inject position; track activation-token influence on output logits to confirm the injection is doing work. |
| Off-by-one / index-space bugs in Othello-GPT plumbing | Two index conventions (string 0–63 vs. int 0–60) are a known day-killer; turn-parity in the probe is the second one. | Use the helpers from mech_interp_othello_utils.py (to_int, to_string, stoi_indices); add a smoke test that round-trips a known game and a known board state. |
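The last risk row is cheap to guard against. Below is a self-contained sketch of the two index spaces (an illustrative reimplementation, not the actual `to_int`/`to_string` helpers from mech_interp_othello_utils.py): board squares run 0–63, while the move vocabulary skips the four pre-filled centre squares, and the smoke test round-trips every mapping.

```python
# Illustrative index-space sketch; the real project should use the helpers
# from mech_interp_othello_utils.py rather than this reimplementation.
CENTER = {27, 28, 35, 36}  # the four pre-filled centre squares on an 8x8 board
STRING_TO_INT = {s: i for i, s in enumerate(sorted(set(range(64)) - CENTER))}
INT_TO_STRING = {i: s for s, i in STRING_TO_INT.items()}

def smoke_test_roundtrip() -> None:
    """Round-trip every square through both index spaces (risk mitigation)."""
    for s in STRING_TO_INT:
        assert INT_TO_STRING[STRING_TO_INT[s]] == s
    assert len(STRING_TO_INT) == 60  # 64 squares minus 4 centre squares
```

Running this (plus a known-game, known-board-state check against the real helpers) at the top of the pipeline turns the “day-killer” into a one-second assertion failure.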