TECH-3502 — FRD: Toy NLA on Othello-GPT

Ticket: TECH-3502 · Type: Learning spike (5 pts) · Labels: interpretabilitySpike · Author: alfred-pm · Date: 2026-05-11

TL;DR

1. Overview

Anthropic's Natural Language Autoencoder (NLA) proposes an unsupervised mechanistic-interpretability technique: an activation verbalizer (AV) writes a natural-language description of a residual-stream activation, and an activation reconstructor (AR) tries to recover the activation from that text. The language bottleneck forces the description to carry the information that matters; the AR's reconstruction loss provides the training signal for the AV via RL.

The technique is new, the reported results are striking, and we want hands-on understanding before deciding whether it applies to Pleiades / Pleiades 2. This ticket reproduces NLA at toy scale on Othello-GPT (Li et al., 2022), an 8-layer GPT-2-style model (d_model = 512) trained to predict legal Othello moves. Othello-GPT is the right toy because Neel Nanda's follow-up shows that a linear probe recovers the model's emergent “mine/their/empty” board representation at >99% accuracy from layer ~4 onwards. That probe gives us ground truth to grade AV verbalizations against, which is the whole reason for picking this benchmark.

2. Goals

3. Use Cases

Single primary user: alfred-pm (researcher). Sub-users of the resulting artifact are interp-curious teammates reading the Notion writeup.

UC1 — Reproduce the recipe

Run the joint AV/AR training loop end-to-end on Othello-GPT residuals at layer 6 and produce an FVE table on held-out games. Inspect typical verbalizations.

UC2 — Stress-test the bottleneck

Compare reconstruction quality across three diagnostic ARs (prompt-only, verbalization-only, prompt+verbalization). Conclude how much “real” information the verbalization carries beyond what the game prefix already determines.

UC3 — Audit for confabulation

Sample ~50 positions, generate verbalizations, run an LLM-judge prompt over (verbalization, probe-derived ground truth) and report a hit-rate plus a handful of representative examples.

UC4 — Decide

Read the FVE table + confabulation rate + cost estimate, write the go/no-go on Pleiades. This is the only "user-facing" output that actually matters.

4. Architecture (high level)

Othello-GPT (M): 8 layers, d_model = 512; prompt = move sequence; target activation h_l ∈ ℝ^512 taken from the residual stream at layer 6.
AV: Qwen2.5-1.5B trained with GRPO; the activation is injected via an affine adapter at a single inject token; verbalization z capped at ~150 tokens.
AR: text encoder + linear head → ℝ^512, producing ĥ_l; trained jointly with MSE.
Reward: r = −log ‖h_l − ĥ_l‖₂ → GRPO step on AV; AR step on MSE.
KL(AV_φ ‖ AV_init) penalty preserves fluency.

Probe oracle (not pictured): a frozen 3-class logistic probe on the same h_l, giving per-square {mine, theirs, empty} ground truth; used only at eval time for the confabulation audit, never in the training loss.
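
The probe oracle's role can be sketched in a few lines. This is a toy stand-in, not the real checkpoint: the weights below are random placeholders, whereas the actual probe is loaded from main_linear_probe.pth, and shapes/conventions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D, SQUARES, CLASSES = 512, 64, 3  # residual dim, board squares, {mine, theirs, empty}

# Placeholder frozen probe weights; the real probe comes from main_linear_probe.pth.
W = rng.normal(size=(SQUARES, CLASSES, D)) / np.sqrt(D)

def probe_labels(h):
    """Per-square ground truth for one residual h: argmax of each square's 3-class logits."""
    logits = np.einsum("scd,d->sc", W, h)  # (64, 3)
    return logits.argmax(axis=-1)          # (64,) values in {0: mine, 1: theirs, 2: empty}

labels = probe_labels(rng.normal(size=D))
```

At eval time these 64 labels are the ground truth the LLM judge compares each verbalization against; the probe never appears in the training loss.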

5. Functional Requirements

FR1 · MUST — Load Othello-GPT via HookedTransformer.from_pretrained("othello-gpt") using Neel's synthetic checkpoint, and verify it reproduces next-legal-move predictions on a held-out game prefix.
FR2 · MUST — Reproduce the linear-probe sanity check: load the pre-trained main_linear_probe.pth from likenneth/othello_world and confirm per-square accuracy > 95% at layer 6 using the “mine/their/empty” reframe on a held-out set. If this fails, debug here before any NLA work.
FR3 · MUST — Build the activation dataset: ~30k synthetic Othello games × positions in [5, 54] → cached layer-6 residuals, unit-L2-normalized, with a held-out 10% split for evaluation.
FR4 · MUST — Generate SFT warm-start data: ~5k (residual, summary) pairs, with summaries drafted via Claude Code CLI in batched --print mode (~20 positions per JSON-array prompt, ~250 invocations total, --model pinned for reproducibility). Each summary is conditioned on the move sequence plus a description of what the model is computing at layer 6 (similar to the paper's “leading-the-witness” prompt).
FR5 · MUST — Train AV (Qwen2.5-1.5B): SFT warm-start on (residual → summary), then a GRPO RL loop with G=8 rollouts per activation (with capacity to sweep G ∈ {8, 16, 32} given H200 headroom), T=1, KL coefficient β toward the frozen base, and output capped at ~150 tokens with a length penalty.
FR6 · MUST — Train AR (text-only): SFT warm-start on (summary → residual) under MSE, then joint training with AV where each batch performs one AR MSE step and one AV GRPO step. AR is small enough (e.g., 4-layer transformer encoder + linear head) that it cannot trivially memorize the deterministic prompt → activation mapping.
FR7 · MUST — Implement activation injection for AV via a learned affine adapter ℝ^512 → ℝ^d_qwen applied at a single token position inside a fixed instruction prompt. No hand-tuned α (a paper-recommended improvement).
FR8 · MUST — Report FVE = 1 − MSE / Var(h_l) on the held-out split at multiple training-step checkpoints. The headline number is the FVE of the verbalization-only AR (the primary, paper-faithful setup).
FR9 · MUST — Report two diagnostic baselines alongside FR8: (a) prompt-only AR (move sequence → activation, no verbalization); (b) prompt+verbalization AR. The headline gap is (prompt+verb FVE) − (prompt-only FVE).
FR10 · MUST — Confabulation audit: pick ~50 board positions sampled across early/mid/late game phases, generate AV verbalizations, and run a structured Claude Opus 4.6 LLM-judge prompt that scores each verbalization for (i) factual correctness of claimed squares vs. probe ground truth and (ii) coverage of present features. Report the aggregate hit-rate plus at least 8 worked examples.
FR11 · SHOULD — Steganography sanity check (from the paper): apply meaning-preserving transformations to AV outputs (paraphrase, bullet shuffle) and measure the FVE drop. A small drop is healthy; a large drop suggests AV has begun encoding non-natural-language signal.
FR12 · SHOULD — Track writing-quality decay via LLM-judge score over training (the paper reports a steady decline, since there is no explicit reward for legibility). Surface this in the writeup as one of the surprises/tradeoffs.
FR13 · MUST — Notion writeup capturing: setup, hparams table, FVE trajectory plots, confabulation hit-rate, 10 worked verbalization examples (5 good, 5 bad), and a one-sentence go/no-go recommendation on Pleiades with the reasoning.
FR14 · COULD — One small intervention experiment: edit a verbalization to claim a square is “mine” when the probe says it's “empty”, run AR on the edited text, and check whether the predicted residual shifts in the direction the probe associates with the claimed feature. Pure curiosity; drops if time is tight.
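
The FR8 headline metric is mechanical enough to pin down exactly. A minimal sketch on a toy activation matrix (random data standing in for held-out layer-6 residuals):

```python
import numpy as np

def fve(h, h_hat):
    """Fraction of variance explained: 1 - MSE / Var(h), elementwise over a held-out split."""
    mse = np.mean((h - h_hat) ** 2)
    return 1.0 - mse / np.var(h)

rng = np.random.default_rng(0)
h = rng.normal(size=(1000, 512))                       # toy stand-in for held-out residuals
mean_pred = np.broadcast_to(h.mean(axis=0), h.shape)   # the trivial "mean AR" baseline
```

By construction, perfect reconstruction gives FVE = 1 and the mean predictor gives FVE ≈ 0, which is why the mean baseline belongs in the FR9 comparison table.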

6. Key Design Decisions

Target layer · Layer 6 of 8 (~2/3 depth) · Why: matches the paper's “middle-to-late” prescription; also where Neel's mine/their/empty probe is most reliable (>99%).
Othello-GPT source · TransformerLens, synthetic-games checkpoint · Why: one-line load, CI-tested, no need to retrain the base model.
AV backbone · Qwen2.5-1.5B · Why: same family as the paper's smallest open-model NLA (Qwen2.5-7B), ~4.7× smaller in absolute terms but in the sweet spot where SFT warm-start “just works” without nursing. Fits trivially on a single H200 (141 GB) and leaves room for parallel ARs plus the KL sweep across the rest of the 8-GPU node.
AV input format · Fixed instruction prompt with one <concept>A</concept> token whose embedding is replaced by the (affine-projected) Othello-GPT residual · Why: matches the paper's injection scheme; the affine adapter replaces the paper's hand-tuned α (a change the paper itself recommends).
RL algorithm · GRPO, group size G=8, T=1, length cap ~150 tokens, length penalty on cap-hit · Why: simplest viable choice consistent with the paper. PPO is explicitly out of scope for this ticket; GRPO is value-head-free and works at small scale.
KL anchor · Frozen post-SFT AV (not the original Qwen base) · Why: the paper anchors to the SFT-initialized AV so the KL penalty preserves the learned format as well as base fluency.
AR architecture · Small text encoder (4–6 layer transformer, <50M params) + linear head → ℝ^512; final-token pooled · Why: cannot share architecture with M (Othello-GPT speaks moves, not English). Capacity is deliberately small so the prompt-only baseline doesn't saturate.
AR loss target · MSE on unit-L2-normalized h_l · Why: matches the paper; normalization stabilizes scale across positions.
Joint training · Per batch: 1 AR MSE step + 1 AV GRPO step; gradients do not couple · Why: matches the paper. Treats AR as a fixed reward model from AV's perspective and AV outputs as fixed inputs from AR's perspective.
SFT data source · Claude Code CLI in batched --print mode, model pinned (--model claude-opus-4-7 or sonnet); ~20 positions per call, ~250 invocations, ~5k pairs total · Why: avoids direct API spend by riding the included Claude Code subscription; same model under the hood, so output quality is identical. Trade-off: less reproducible for re-runs than the direct API; mitigated by pinning --model and persisting raw JSON outputs.
Compute split · Local Mac for FR1–FR3 + AR-baseline shakedown; 8×H200 Nebius node for AV training, parallel diagnostic ARs, and the KL sweep · Why: per user direction. H200 headroom (8 × 141 GB ≈ 1.1 TB HBM) lets all four ARs (mean / prompt-only / verb-only / prompt+verb) train data-parallel rather than sequentially; the KL-coefficient sweep becomes a real task instead of a 500-step probe.
Confabulation grader · Claude Code CLI (model-pinned), structured prompt · Why: same rationale as the SFT data source. 50 positions fit easily within the CLI call budget; judge weight comes from the Opus-class model behind the CLI.
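
The AV input-format decision (affine adapter replacing a <concept> token embedding) is easy to misread, so here is a minimal numpy sketch. Weights are random placeholders, and the 1536 hidden width is an assumption about Qwen2.5-1.5B, not something this FRD specifies:

```python
import numpy as np

rng = np.random.default_rng(0)
D_M, D_QWEN = 512, 1536  # Othello-GPT residual width; assumed Qwen2.5-1.5B hidden width

# Learned affine adapter (random placeholder weights here; trained in the real run).
W = rng.normal(size=(D_QWEN, D_M)) * 0.02
b = np.zeros(D_QWEN)

def inject(prompt_embeds, h, concept_pos):
    """Return a copy of the prompt embeddings with the <concept> slot overwritten
    by the affine-projected Othello-GPT residual W @ h + b."""
    out = prompt_embeds.copy()
    out[concept_pos] = W @ h + b
    return out

embeds = rng.normal(size=(32, D_QWEN))  # a 32-token instruction prompt (placeholder)
injected = inject(embeds, rng.normal(size=D_M), concept_pos=7)
```

Only the single concept slot changes; every other prompt embedding passes through untouched, which is what makes the learned adapter a drop-in replacement for the paper's hand-tuned α.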
Why Qwen2.5-1.5B specifically (not 0.5B, not 7B)

The NLA paper's smallest open-model AV is Qwen2.5-7B. Our AV/M ratio at 1.5B (vs. Othello-GPT's ~25M) is already ~60× — far more generous in relative terms than any paper run. In absolute terms 1.5B is still 4.7× smaller than the paper's smallest, so we stay defensibly “toy”. The 0.5B variant works but typically requires more SFT-prompt nursing to keep outputs on-format; 1.5B reaches stable, on-format generations with a thinner SFT corpus, which matters because we are paying for SFT data with Claude Code CLI calls. 7B would be paper-faithful but slows GRPO throughput by ~5× relative to 1.5B without obviously buying us better signal at toy scale. Same Qwen2.5 family throughout makes lineage easy to cite in the Notion writeup.
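
The GRPO choice above reduces to a small amount of arithmetic per group. A toy sketch of the reward and group-relative advantages (random vectors standing in for residuals; the 1e-8 epsilons are numerical-safety assumptions, not spec):

```python
import numpy as np

def reward(h, h_hat):
    """Per-rollout NLA reward: -log of the L2 reconstruction error."""
    return -np.log(np.linalg.norm(h - h_hat) + 1e-8)

def grpo_advantages(rewards):
    """Group-relative advantages: standardize the G rewards within one group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

rng = np.random.default_rng(0)
h = rng.normal(size=512)
# G = 4 rollouts for one activation, with increasingly bad reconstructions.
group = [reward(h, h + s * rng.normal(size=512)) for s in (0.1, 0.2, 0.4, 0.8)]
adv = grpo_advantages(group)
```

No value head is needed: the group itself is the baseline, which is why GRPO stays simple at this scale.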

7. Non-Goals (explicit)

8. Technical Considerations & Risks

AV collapses without SFT warm-start · Why it matters: the paper says runs degenerate into garbled text without it; this is the single biggest “sole-reason-for-this-step” gotcha called out. · Mitigation: FR4 is a MUST; we never skip warm-start. Track AV writing quality (FR12) at every checkpoint.
Prompt-only AR saturates, leaving the verbalization no marginal info to add · Why it matters: Othello-GPT residuals are deterministic in the prompt; if the diagnostic AR is too big, the headline gap collapses to zero. · Mitigation: AR is deliberately small (<50M params, 4–6 layers). Measure the prompt-only baseline early; shrink AR until its FVE is well below the joint setup.
KL collapse (AV either drifts off fluency or fails to specialize) · Why it matters: the paper notes β is the one knob actually tuned per model. · Mitigation: sweep β ∈ {0.01, 0.05, 0.1, 0.5} on a 500-step probe run before committing to a long run. Track explanation legibility every N steps.
Steganography · Why it matters: a long-term failure mode the paper flags; it turns natural-language descriptions into model-private codes. · Mitigation: the FR11 paraphrase-invariance check serves as the canary. At toy scale this is unlikely but worth measuring once.
Confabulation · Why it matters: the paper reports the verifiable-false-claim rate is high and flat through training; AV may “sound right” while being wrong about specific squares. · Mitigation: FR10's LLM judge is the headline audit. Report the hit-rate honestly; failing here doesn't invalidate the spike, it's a finding.
Compute overrun on Nebius · Why it matters: RL is unpredictable. The paper's Gemma-3-27B reference is ~1.5 days on 16 H100s; toy scale on 8×H200 should be much faster (hours, not days), but headroom invites scope creep (longer sweeps, more seeds, etc.). · Mitigation: hard cap of one full node-day on the 8×H200. If FVE is still flat against the SFT baseline at that point, stop, report partial results, and write up what went wrong. This is a spike, not a guaranteed-success ticket.
Activation injection scaling (no α, learned adapter instead) · Why it matters: the paper warns the α scalar is finicky, and our learned adapter is a deviation from the paper's published runs. · Mitigation: initialize the adapter so its output norm matches Qwen's median embedding norm at the inject position; track the activation token's influence on output logits to confirm the injection is doing work.
Off-by-one / index-space bugs in Othello-GPT plumbing · Why it matters: the two index conventions (string 0–63 vs. int 0–60) are a known day-killer; turn-parity in the probe is the second one. · Mitigation: use the helpers from mech_interp_othello_utils.py (to_int, to_string, stoi_indices); add a smoke test that round-trips a known game and a known board state.
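
The index-space risk is concrete enough to illustrate. This is a self-contained sketch of the convention only, not the actual mech_interp_othello_utils.py helpers (the real vocab also includes a special token, which accounts for the 0–60 range and which this sketch omits): "string" space indexes all 64 board squares, while "int" space skips the four pre-filled center squares.

```python
CENTER = {27, 28, 35, 36}                                  # pre-filled center squares
STRING_OF_INT = [s for s in range(64) if s not in CENTER]  # the 60 playable squares
INT_OF_STRING = {s: i for i, s in enumerate(STRING_OF_INT)}

def to_string(i):
    """Vocab ("int") index -> board ("string") index in 0-63."""
    return STRING_OF_INT[i]

def to_int(s):
    """Board ("string") index in 0-63 -> vocab ("int") index over playable squares."""
    return INT_OF_STRING[s]

# Smoke test in the spirit of the mitigation: round-trip every playable square.
assert all(to_int(to_string(i)) == i for i in range(60))
```

A round-trip assertion like the last line, plus one known game and one known board state, is cheap insurance against the day-killer class of bug.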

9. Open Questions (deferred to Plan)

10. Acceptance Criteria (from the ticket)

  1. FR1+FR2 — Othello-GPT loaded and probe accuracy >95% at layer 6 reproduced.
  2. FR5+FR6 — AV+AR trained on layer-6 residuals.
  3. FR8+FR9 — FVE reported with mean / prompt-only / verb-only / prompt+verb baselines.
  4. FR10 — Confabulation hit-rate vs probe-derived features, ≈50 positions, LLM-judge scored.
  5. FR13 — Notion writeup with go/no-go.

Approval: see Linear ticket for sign-off. After FRD approval, the implementation plan (PLAN.md) will be authored in a separate document and will include the phase breakdown, execution DAG, and complexity tiers.