TECH-3502 — FRD: Toy NLA on Othello-GPT

Ticket: TECH-3502 · Type: Learning spike (5 pts) · Labels: interpretabilitySpike · Author: alfred-pm · Date: 2026-05-11

TL;DR

1. Overview

Anthropic's Natural Language Autoencoder (NLA) proposes an unsupervised mechanistic-interpretability technique: an activation verbalizer (AV) writes a natural-language description of a residual-stream activation, and an activation reconstructor (AR) tries to recover the activation from that text. The language bottleneck forces the description to carry the information that matters; the AR's reconstruction loss provides the training signal for the AV via RL.

The technique is new, the reported results are striking, and we want hands-on understanding before deciding whether it applies to Pleiades / Pleiades 2. This ticket reproduces NLA at toy scale on Othello-GPT (Li et al., 2022), an 8-layer GPT-2-style model (d_model = 512) trained to predict legal Othello moves. Othello-GPT is the right toy because Neel Nanda's follow-up shows that a linear probe recovers the model's emergent “mine/their/empty” board representation at >99% accuracy from layer ~4 onwards. That probe gives us ground truth to grade AV verbalizations against, which is the whole reason for picking this benchmark.

2. Goals

3. Use Cases

Single primary user: alfred-pm (researcher). Sub-users of the resulting artifact are interp-curious teammates reading the Notion writeup.

UC1 — Reproduce the recipe

Run the joint AV/AR training loop end-to-end on Othello-GPT residuals at layer 6 and produce an FVE table on held-out games. Inspect typical verbalizations.

UC2 — Stress-test the bottleneck

Compare reconstruction quality across three diagnostic ARs (prompt-only, verbalization-only, prompt+verbalization). Conclude how much “real” information the verbalization carries beyond what the game prefix already determines.

UC3 — Audit for confabulation

Sample ~50 positions, generate verbalizations, run an LLM-judge prompt over (verbalization, probe-derived ground truth) and report a hit-rate plus a handful of representative examples.

UC4 — Decide

Read the FVE table + confabulation rate + cost estimate, write the go/no-go on Pleiades. This is the only "user-facing" output that actually matters.

4. Architecture (high level)

Othello-GPT (M): 8 layers, d_model = 512; prompt = move sequence; target activation h_l ∈ ℝ^512 taken from the residual stream at layer 6.
AV: Qwen2.5-1.5B trained with GRPO; the activation is injected via an affine adapter at a single inject token; verbalization z capped at ~150 tokens.
AR: text encoder + linear head → ℝ^512, producing ĥ_l; trained jointly with MSE.
Reward: r = −log ‖h_l − ĥ_l‖₂ → GRPO step on AV; AR step on MSE.
KL(AV_φ ‖ AV_init) penalty preserves fluency.

Probe oracle (not pictured): a frozen 3-class logistic probe on the same h_l, giving per-square {mine, theirs, empty} ground truth; used only at eval time for the confabulation audit, never in the training loss.
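
The probe oracle's role can be sketched in a few lines. This is a toy stand-in, not the real checkpoint: the weights below are random placeholders, whereas the actual probe is loaded from main_linear_probe.pth, and shapes/conventions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D, SQUARES, CLASSES = 512, 64, 3  # residual dim, board squares, {mine, theirs, empty}

# Placeholder frozen probe weights; the real probe comes from main_linear_probe.pth.
W = rng.normal(size=(SQUARES, CLASSES, D)) / np.sqrt(D)

def probe_labels(h):
    """Per-square ground truth for one residual h: argmax of each square's 3-class logits."""
    logits = np.einsum("scd,d->sc", W, h)  # (64, 3)
    return logits.argmax(axis=-1)          # (64,) values in {0: mine, 1: theirs, 2: empty}

labels = probe_labels(rng.normal(size=D))
```

At eval time these 64 labels are the ground truth the LLM judge compares each verbalization against; the probe never appears in the training loss.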

5. Functional Requirements

FR1 · MUST — Load Othello-GPT via HookedTransformer.from_pretrained("othello-gpt") using Neel's synthetic checkpoint, and verify it reproduces next-legal-move predictions on a held-out game prefix.
FR2 · MUST — Reproduce the linear-probe sanity check: load the pre-trained main_linear_probe.pth from likenneth/othello_world and confirm per-square accuracy > 95% at layer 6 using the “mine/their/empty” reframe on a held-out set. If this fails, debug here before any NLA work.
FR3 · MUST — Build the activation dataset: ~30k synthetic Othello games × positions in [5, 54] → cached layer-6 residuals, unit-L2-normalized, with a held-out 10% split for evaluation.
FR4 · MUST — Generate SFT warm-start data: ~5k (residual, summary) pairs, with summaries drafted via Claude Code CLI in batched --print mode (~20 positions per JSON-array prompt, ~250 invocations total, --model pinned for reproducibility). Each summary is conditioned on the move sequence plus a description of what the model is computing at layer 6 (similar to the paper's “leading-the-witness” prompt).
FR5 · MUST — Train AV (Qwen2.5-1.5B): SFT warm-start on (residual → summary), then a GRPO RL loop with G=8 rollouts per activation (with capacity to sweep G ∈ {8, 16, 32} given H200 headroom), T=1, KL coefficient β toward the frozen base, and output capped at ~150 tokens with a length penalty.
FR6 · MUST — Train AR (text-only): SFT warm-start on (summary → residual) under MSE, then joint training with AV where each batch performs one AR MSE step and one AV GRPO step. AR is small enough (e.g., 4-layer transformer encoder + linear head) that it cannot trivially memorize the deterministic prompt → activation mapping.
FR7 · MUST — Implement activation injection for AV via a learned affine adapter ℝ^512 → ℝ^d_qwen applied at a single token position inside a fixed instruction prompt. No hand-tuned α (a paper-recommended improvement).
FR8 · MUST — Report FVE = 1 − MSE / Var(h_l) on the held-out split at multiple training-step checkpoints. The headline number is the FVE of the verbalization-only AR (the primary, paper-faithful setup).
FR9 · MUST — Report two diagnostic baselines alongside FR8: (a) prompt-only AR (move sequence → activation, no verbalization); (b) prompt+verbalization AR. The headline gap is (prompt+verb FVE) − (prompt-only FVE).
FR10 · MUST — Confabulation audit: pick ~50 board positions sampled across early/mid/late game phases, generate AV verbalizations, and run a structured Claude Opus 4.6 LLM-judge prompt that scores each verbalization for (i) factual correctness of claimed squares vs. probe ground truth and (ii) coverage of present features. Report the aggregate hit-rate plus at least 8 worked examples.
FR11 · SHOULD — Steganography sanity check (from the paper): apply meaning-preserving transformations to AV outputs (paraphrase, bullet shuffle) and measure the FVE drop. A small drop is healthy; a large drop suggests AV has begun encoding non-natural-language signal.
FR12 · SHOULD — Track writing-quality decay via LLM-judge score over training (the paper reports a steady decline, since there is no explicit reward for legibility). Surface this in the writeup as one of the surprises/tradeoffs.
FR13 · MUST — Notion writeup capturing: setup, hparams table, FVE trajectory plots, confabulation hit-rate, 10 worked verbalization examples (5 good, 5 bad), and a one-sentence go/no-go recommendation on Pleiades with the reasoning.
FR14 · COULD — One small intervention experiment: edit a verbalization to claim a square is “mine” when the probe says it's “empty”, run AR on the edited text, and check whether the predicted residual shifts in the direction the probe associates with the claimed feature. Pure curiosity; drops if time is tight.
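
The FR8 headline metric is mechanical enough to pin down exactly. A minimal sketch on a toy activation matrix (random data standing in for held-out layer-6 residuals):

```python
import numpy as np

def fve(h, h_hat):
    """Fraction of variance explained: 1 - MSE / Var(h), elementwise over a held-out split."""
    mse = np.mean((h - h_hat) ** 2)
    return 1.0 - mse / np.var(h)

rng = np.random.default_rng(0)
h = rng.normal(size=(1000, 512))                       # toy stand-in for held-out residuals
mean_pred = np.broadcast_to(h.mean(axis=0), h.shape)   # the trivial "mean AR" baseline
```

By construction, perfect reconstruction gives FVE = 1 and the mean predictor gives FVE ≈ 0, which is why the mean baseline belongs in the FR9 comparison table.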

6. Key Design Decisions

Target layer · Layer 6 of 8 (~2/3 depth) · Why: matches the paper's “middle-to-late” prescription; also where Neel's mine/their/empty probe is most reliable (>99%).
Othello-GPT source · TransformerLens, synthetic-games checkpoint · Why: one-line load, CI-tested, no need to retrain the base model.
AV backbone · Qwen2.5-1.5B · Why: same family as the paper's smallest open-model NLA (Qwen2.5-7B), ~4.7× smaller in absolute terms but in the sweet spot where SFT warm-start “just works” without nursing. Fits trivially on a single H200 (141 GB) and leaves room for parallel ARs plus the KL sweep across the rest of the 8-GPU node.
AV input format · Fixed instruction prompt with one <concept>A</concept> token whose embedding is replaced by the (affine-projected) Othello-GPT residual · Why: matches the paper's injection scheme; the affine adapter replaces the paper's hand-tuned α (a change the paper itself recommends).
RL algorithm · GRPO, group size G=8, T=1, length cap ~150 tokens, length penalty on cap-hit · Why: simplest viable choice consistent with the paper. PPO is explicitly out of scope for this ticket; GRPO is value-head-free and works at small scale.
KL anchor · Frozen post-SFT AV (not the original Qwen base) · Why: the paper anchors to the SFT-initialized AV so the KL penalty preserves the learned format as well as base fluency.
AR architecture · Small text encoder (4–6 layer transformer, <50M params) + linear head → ℝ^512; final-token pooled · Why: cannot share architecture with M (Othello-GPT speaks moves, not English). Capacity is deliberately small so the prompt-only baseline doesn't saturate.
AR loss target · MSE on unit-L2-normalized h_l · Why: matches the paper; normalization stabilizes scale across positions.
Joint training · Per batch: 1 AR MSE step + 1 AV GRPO step; gradients do not couple · Why: matches the paper. Treats AR as a fixed reward model from AV's perspective and AV outputs as fixed inputs from AR's perspective.
SFT data source · Claude Code CLI in batched --print mode, model pinned (--model claude-opus-4-7 or sonnet); ~20 positions per call, ~250 invocations, ~5k pairs total · Why: avoids direct API spend by riding the included Claude Code subscription; same model under the hood, so output quality is identical. Trade-off: less reproducible for re-runs than the direct API; mitigated by pinning --model and persisting raw JSON outputs.
Compute split · Local Mac for FR1–FR3 + AR-baseline shakedown; 8×H200 Nebius node for AV training, parallel diagnostic ARs, and the KL sweep · Why: per user direction. H200 headroom (8 × 141 GB ≈ 1.1 TB HBM) lets all four ARs (mean / prompt-only / verb-only / prompt+verb) train data-parallel rather than sequentially; the KL-coefficient sweep becomes a real task instead of a 500-step probe.
Confabulation grader · Claude Code CLI (model-pinned), structured prompt · Why: same rationale as the SFT data source. 50 positions fit easily within the CLI call budget; judge weight comes from the Opus-class model behind the CLI.
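
The AV input-format decision (affine adapter replacing a <concept> token embedding) is easy to misread, so here is a minimal numpy sketch. Weights are random placeholders, and the 1536 hidden width is an assumption about Qwen2.5-1.5B, not something this FRD specifies:

```python
import numpy as np

rng = np.random.default_rng(0)
D_M, D_QWEN = 512, 1536  # Othello-GPT residual width; assumed Qwen2.5-1.5B hidden width

# Learned affine adapter (random placeholder weights here; trained in the real run).
W = rng.normal(size=(D_QWEN, D_M)) * 0.02
b = np.zeros(D_QWEN)

def inject(prompt_embeds, h, concept_pos):
    """Return a copy of the prompt embeddings with the <concept> slot overwritten
    by the affine-projected Othello-GPT residual W @ h + b."""
    out = prompt_embeds.copy()
    out[concept_pos] = W @ h + b
    return out

embeds = rng.normal(size=(32, D_QWEN))  # a 32-token instruction prompt (placeholder)
injected = inject(embeds, rng.normal(size=D_M), concept_pos=7)
```

Only the single concept slot changes; every other prompt embedding passes through untouched, which is what makes the learned adapter a drop-in replacement for the paper's hand-tuned α.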
Why Qwen2.5-1.5B specifically (not 0.5B, not 7B)

The NLA paper's smallest open-model AV is Qwen2.5-7B. Our AV/M ratio at 1.5B (vs. Othello-GPT's ~25M) is already ~60× — far more generous in relative terms than any paper run. In absolute terms 1.5B is still 4.7× smaller than the paper's smallest, so we stay defensibly “toy”. The 0.5B variant works but typically requires more SFT-prompt nursing to keep outputs on-format; 1.5B reaches stable, on-format generations with a thinner SFT corpus, which matters because we are paying for SFT data with Claude Code CLI calls. 7B would be paper-faithful but slows GRPO throughput by ~5× relative to 1.5B without obviously buying us better signal at toy scale. Same Qwen2.5 family throughout makes lineage easy to cite in the Notion writeup.
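
The GRPO choice above reduces to a small amount of arithmetic per group. A toy sketch of the reward and group-relative advantages (random vectors standing in for residuals; the 1e-8 epsilons are numerical-safety assumptions, not spec):

```python
import numpy as np

def reward(h, h_hat):
    """Per-rollout NLA reward: -log of the L2 reconstruction error."""
    return -np.log(np.linalg.norm(h - h_hat) + 1e-8)

def grpo_advantages(rewards):
    """Group-relative advantages: standardize the G rewards within one group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

rng = np.random.default_rng(0)
h = rng.normal(size=512)
# G = 4 rollouts for one activation, with increasingly bad reconstructions.
group = [reward(h, h + s * rng.normal(size=512)) for s in (0.1, 0.2, 0.4, 0.8)]
adv = grpo_advantages(group)
```

No value head is needed: the group itself is the baseline, which is why GRPO stays simple at this scale.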

7. Non-Goals (explicit)

8. Technical Considerations & Risks

AV collapses without SFT warm-start · Why it matters: the paper says runs degenerate into garbled text without it; this is the single biggest “sole-reason-for-this-step” gotcha called out. · Mitigation: FR4 is a MUST; we never skip warm-start. Track AV writing quality (FR12) at every checkpoint.
Prompt-only AR saturates, leaving the verbalization no marginal info to add · Why it matters: Othello-GPT residuals are deterministic in the prompt; if the diagnostic AR is too big, the headline gap collapses to zero. · Mitigation: AR is deliberately small (<50M params, 4–6 layers). Measure the prompt-only baseline early; shrink AR until its FVE is well below the joint setup.
KL collapse (AV either drifts off fluency or fails to specialize) · Why it matters: the paper notes β is the one knob actually tuned per model. · Mitigation: sweep β ∈ {0.01, 0.05, 0.1, 0.5} on a 500-step probe run before committing to a long run. Track explanation legibility every N steps.
Steganography · Why it matters: a long-term failure mode the paper flags; it turns natural-language descriptions into model-private codes. · Mitigation: the FR11 paraphrase-invariance check serves as the canary. At toy scale this is unlikely but worth measuring once.
Confabulation · Why it matters: the paper reports the verifiable-false-claim rate is high and flat through training; AV may “sound right” while being wrong about specific squares. · Mitigation: FR10's LLM judge is the headline audit. Report the hit-rate honestly; failing here doesn't invalidate the spike, it's a finding.
Compute overrun on Nebius · Why it matters: RL is unpredictable. The paper's Gemma-3-27B reference is ~1.5 days on 16 H100s; toy scale on 8×H200 should be much faster (hours, not days), but headroom invites scope creep (longer sweeps, more seeds, etc.). · Mitigation: hard cap of one full node-day on the 8×H200. If FVE is still flat against the SFT baseline at that point, stop, report partial results, and write up what went wrong. This is a spike, not a guaranteed-success ticket.
Activation injection scaling (no α, learned adapter instead) · Why it matters: the paper warns the α scalar is finicky, and our learned adapter is a deviation from the paper's published runs. · Mitigation: initialize the adapter so its output norm matches Qwen's median embedding norm at the inject position; track the activation token's influence on output logits to confirm the injection is doing work.
Off-by-one / index-space bugs in Othello-GPT plumbing · Why it matters: the two index conventions (string 0–63 vs. int 0–60) are a known day-killer; turn-parity in the probe is the second one. · Mitigation: use the helpers from mech_interp_othello_utils.py (to_int, to_string, stoi_indices); add a smoke test that round-trips a known game and a known board state.
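
The index-space risk is concrete enough to illustrate. This is a self-contained sketch of the convention only, not the actual mech_interp_othello_utils.py helpers (the real vocab also includes a special token, which accounts for the 0–60 range and which this sketch omits): "string" space indexes all 64 board squares, while "int" space skips the four pre-filled center squares.

```python
CENTER = {27, 28, 35, 36}                                  # pre-filled center squares
STRING_OF_INT = [s for s in range(64) if s not in CENTER]  # the 60 playable squares
INT_OF_STRING = {s: i for i, s in enumerate(STRING_OF_INT)}

def to_string(i):
    """Vocab ("int") index -> board ("string") index in 0-63."""
    return STRING_OF_INT[i]

def to_int(s):
    """Board ("string") index in 0-63 -> vocab ("int") index over playable squares."""
    return INT_OF_STRING[s]

# Smoke test in the spirit of the mitigation: round-trip every playable square.
assert all(to_int(to_string(i)) == i for i in range(60))
```

A round-trip assertion like the last line, plus one known game and one known board state, is cheap insurance against the day-killer class of bug.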

9. Open Questions (deferred to Plan)

10. Acceptance Criteria (from the ticket)

  1. FR1+FR2 — Othello-GPT loaded and probe accuracy >95% at layer 6 reproduced.
  2. FR5+FR6 — AV+AR trained on layer-6 residuals.
  3. FR8+FR9 — FVE reported with mean / prompt-only / verb-only / prompt+verb baselines.
  4. FR10 — Confabulation hit-rate vs probe-derived features, ≈50 positions, LLM-judge scored.
  5. FR13 — Notion writeup with go/no-go.

Approval: see Linear ticket for sign-off. After FRD approval, the implementation plan (PLAN.md) will be authored in a separate document and will include the phase breakdown, execution DAG, and complexity tiers.