Total expected cost: ~10–12 H200-node-hours for the headline run + KL sweep; hard cap at one full node-day.
Testing strategy: smoke tests at every phase boundary + FVE table & LLM-judge confabulation hit-rate as the headline acceptance gate.
Top execution risks (covered by mitigations in each phase): AV-collapse-without-warm-start, KL coefficient instability, prompt-only AR baseline saturating to ceiling (making the verbalization-marginal-info gap collapse).
1. Overview
This plan operationalizes the FRD. The end state is a runnable, reproducible NLA training loop on Othello-GPT residuals at layer 6, an FVE table comparing four diagnostic reconstructor configurations, a confabulation audit on ~50 positions scored by an LLM judge, and a Notion writeup containing a one-sentence go / no-go on applying NLAs to Pleiades.
2. FRD Requirement → Phase Mapping

| FR | Requirement (summary) | Phase |
| --- | --- | --- |
| FR1 | Load Othello-GPT via TransformerLens, verify next-legal-move sanity | |
3. Current State
.gitignore — already configured to ignore data/, checkpoints/, *.pt, *.safetensors, etc.
No source code, no tests, no scaffolding beyond the Python project shell. Everything below is new.
4. Desired End State
src/nla/ package with submodules for data, models, training, eval.
scripts/0[1-7]_*.py orchestration scripts, one per major step, runnable independently.
tests/ with smoke tests gating each phase boundary.
results/fve_table.json with FVE numbers for {mean, prompt-only AR, verb-only AR, prompt+verb AR} × {SFT-only, RL-steps {1k, 2k, 3k, final}}.
results/confab_judge.jsonl with per-position judge scores + ~10 worked examples.
results/steg_sanity.json with FVE delta under paraphrase + bullet-shuffle of AV outputs.
A Notion page (linked from the writeup file) containing setup, hparams, FVE plots, surprises, and the go/no-go.
All training logs in W&B for posterity.
5. What We're Not Doing
Inherits from the FRD's Non-Goals section. Summary:
No PPO. GRPO only.
No multi-layer NLA. Layer 6 only.
No production-quality utilities or shared libraries.
No application to Pleiades / Pleiades 2 within this ticket (that's the artifact's recommendation, not the spike's work).
No retraining of Othello-GPT — use Neel's released synthetic-games checkpoint.
No matching frontier-scale FVE numbers — trajectory + behavioral properties matter, not absolute level.
No agent-based auditor reproductions (paper's auditor-agent / Claude Code scaffold).
6. Implementation Approach
The work is mostly serial because each phase depends on artifacts produced by the previous one: model load → activation cache → SFT data → SFT-warmed AV/AR → joint RL → eval → writeup. The single useful parallelization is in Wave 5: once SFT data exists, AV SFT (P5) and the four ARs' SFT (P6) train on disjoint GPUs of the 8×H200 node simultaneously.
The single critical phase is P7 (joint GRPO+MSE RL). Everything before P7 is "plumbing the experiment"; everything after P7 is "report what happened." If P7 fails (KL collapse, AV degeneration, runaway divergence), the spike still has a story to tell — that becomes the surprise / lesson and feeds the go/no-go.
7. Execution DAG
Phases at the same vertical level can run in parallel. Dashed arrows show optional dependencies (e.g., P8 reads checkpoints during P7 for intermediate FVE plots).
8. Phases
P1 (trivial) — Dependencies + repo bootstrap
Goal: Make the repo runnable. Add deps, package skeleton, smoke test that everything imports.
P2 — Model load + probe verification (FR1)
Goal: Confirm the entire downstream story is plausible by reproducing the published probe-accuracy numbers before we spend compute on activations.
Changes Required:
src/nla/data/othello_gpt.py — load_model() -> HookedTransformer; loads "othello-gpt" via TransformerLens and swaps in the state dict from Neel's NeelNanda/Othello-GPT-Transformer-Lens/synthetic_model.pth. Returns the model plus tokenizer helpers (re-exports to_int, to_string, stoi_indices from mech_interp_othello_utils.py).
src/nla/data/board.py — vendored OthelloBoardState + seq_to_state_stack() from likenneth/othello_world (single file copy; smaller dep surface than installing the whole repo).
src/nla/eval/probe.py — load_probe(), apply_probe(resid, probe) -> (3-class logits per square), mine_their_empty_accuracy(...). Loads pretrained main_linear_probe.pth from the same HF repo.
scripts/01_load_and_verify.py — runs the legal-move sanity check on a known game + probe-accuracy sanity check at layer 6 on 1k held-out positions.
tests/test_othello_load.py, tests/test_probe.py — small fixtures, deterministic.
Success Criteria:
Automated: probe per-square accuracy ≥ 95% at layer 6 across 1k held-out positions; legal-move-prediction top-1 ≥ 99% (well known for Othello-GPT).
Manual: spot-check a known game (e.g., one from board_seqs_int_small.npy) — replay the move sequence, run the model, eyeball that "mine/their/empty" prediction matches the true board.
Gotchas to watch (from the research brief): two index spaces (string 0–63 vs. int 0–60), turn-parity in the probe (use tl_probing_v1.py's mode-split idiom), off-by-one on context length (n_ctx=59).
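The probe-accuracy gate in scripts/01_load_and_verify.py reduces to an argmax-and-compare over 3-class logits per square. A minimal sketch on stand-in random data rather than real probe outputs (shapes and the 95% threshold follow the criteria above; in the real script `logits` and `labels` come from apply_probe and the vendored board labeller):

```python
import numpy as np

def per_square_probe_accuracy(logits: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """logits: (N, 64, 3) probe outputs per position/square; labels: (N, 64) in {0,1,2}."""
    preds = logits.argmax(axis=-1)          # (N, 64) predicted mine/their/empty class
    return (preds == labels).mean(axis=0)   # (64,) accuracy per square

# Hypothetical stand-in data; constructed so the argmax mostly recovers the label.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=(1000, 64))
logits = np.eye(3)[labels] + 0.1 * rng.normal(size=(1000, 64, 3))

acc = per_square_probe_accuracy(logits, labels)
assert acc.shape == (64,)
if acc.min() < 0.95:  # the automated gate from the success criteria
    raise SystemExit(f"probe accuracy gate failed: min per-square acc {acc.min():.3f}")
```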
P3 (standard) — Activation dataset @ layer 6 (FR3)
Goal: A clean, reusable cache of layer-6 residuals on ~30k synthetic Othello games. Single source of truth for all downstream training.
Changes Required:
src/nla/data/activations.py — build_activation_cache(games, layer, positions_window=(5, 54), batch_size=64) -> Path. Returns path to memmapped tensor on disk.
scripts/02_cache_activations.py — loads ~30k games from board_seqs_int.npy (full set; not the _small 6 MB version), caches residuals at layer 6, unit-L2-normalizes, splits 90/10 train/eval by game (no leakage at the position level).
tests/test_activations.py — verifies tensor shape (N, 512), every row has unit L2 norm (to within 1e-5), eval split has no game overlap with train.
Manual: distribution of pre-normalization norms is visualized (one notebook cell) and looks sensible (no outlier blowups suggesting bad games).
Notes: ~30k games × ~50 positions × 512 d_model in fp16 = ~1.5 GB on disk per layer. Negligible.
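The two invariants the tests check — unit row norms and a leak-free game-level split — can be sketched as follows (numpy, with a small hypothetical game count; the real cache is memmapped and covers ~30k games):

```python
import numpy as np

def unit_l2_normalize(resid: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Row-wise unit-L2 normalization of (N, d_model) residuals."""
    norms = np.linalg.norm(resid, axis=-1, keepdims=True)
    return resid / np.maximum(norms, eps)

def split_by_game(game_ids: np.ndarray, eval_frac: float = 0.1, seed: int = 0):
    """90/10 train/eval split at the GAME level so no position leaks across splits."""
    games = np.unique(game_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(games)
    n_eval = max(1, int(len(games) * eval_frac))
    eval_mask = np.isin(game_ids, games[:n_eval])
    return ~eval_mask, eval_mask  # boolean masks over positions

# Hypothetical small example: 100 games x 50 positions x d_model=512.
game_ids = np.repeat(np.arange(100), 50)
resid = np.random.default_rng(1).normal(size=(5000, 512)).astype(np.float32)
resid = unit_l2_normalize(resid)
train_mask, eval_mask = split_by_game(game_ids)
assert not set(game_ids[train_mask]) & set(game_ids[eval_mask])  # no game overlap
```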
P4 (complex) — SFT warm-start data via Claude Code CLI (FR4)
Goal: ~5k (residual, summary) pairs where each summary is a 2–3-snippet description of "what Othello-GPT is computing at layer 6 for this position." This is the single most important risk-reducer for AV training — the paper notes runs collapse without it.
Changes Required:
src/nla/data/board_describe.py — given a move sequence + board state, deterministically extract "ground truth features" the model plausibly represents at layer 6: mine/their/empty per square, recently-played squares, simple tactical primitives (edges, corners controlled). This is what we feed Claude as the "leading-the-witness" context.
prompts/sft_summary.md — the SFT prompt template. Verbatim spec (paraphrased): "You are explaining what an 8-layer GPT-2 trained on Othello move prediction is computing in its residual stream at layer 6 for the given position. Write 2–3 short snippets (each ≤ 30 tokens) describing the features the model is likely tracking. Be specific about squares (e.g., 'd3 is mine, e4 is theirs'). Do not narrate the game history."
src/nla/data/sft_summaries.py — batch_describe(positions: list, batch_size=20, model="claude-opus-4-7") -> list[Summary]. Builds a JSON-array prompt over 20 positions, invokes claude --print --model <pinned> as a subprocess, parses the JSON response. Retry on parse failure (max 3); pace at ~5 calls/min to stay under Max-plan limits.
scripts/03_generate_sft_data.py — sample 5k random positions from the train split (P3), build the prompts, batch-invoke CLI, write data/sft/summaries.jsonl.
tests/test_sft_smoke.py — generate 5 summaries end-to-end (skipped in CI if CLAUDE_CLI_AVAILABLE unset); verify JSON shape.
Success Criteria:
Automated: 5k rows produced; every row parses as valid JSON with the expected fields; no truncated outputs.
Manual: sample 20 rows and grade by eye — do summaries mention probe-derived features at least directionally correctly? If < 70%, the prompt needs work before SFT.
Risks: CLI rate limits, model-version drift between calls (mitigated by pinning --model), batched-JSON parse failures (mitigated by retry + small batch size).
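The retry-on-parse-failure loop in batch_describe might look like this sketch. The runner is injected so the logic can be exercised without the CLI — in the real module it wraps `claude --print --model <pinned>` via subprocess — and the `summary` field name is an assumption:

```python
import json
from typing import Callable

def batch_describe(prompts_json: str,
                   invoke: Callable[[str], str],
                   max_retries: int = 3) -> list[dict]:
    """Invoke the (injected) CLI runner on a JSON-array prompt; retry on parse failure."""
    last_err: Exception | None = None
    for _ in range(max_retries):
        raw = invoke(prompts_json)
        try:
            out = json.loads(raw)
            if isinstance(out, list) and all("summary" in row for row in out):
                return out
            last_err = ValueError("unexpected JSON shape")
        except json.JSONDecodeError as e:
            last_err = e
    raise RuntimeError(f"batch failed after {max_retries} retries: {last_err}")

# Stub runner that fails once, then returns valid JSON (simulates a flaky parse).
calls = {"n": 0}
def fake_invoke(_prompt: str) -> str:
    calls["n"] += 1
    return "not json" if calls["n"] == 1 else '[{"summary": "d3 is mine, e4 is theirs"}]'

rows = batch_describe("[]", fake_invoke)
assert rows == [{"summary": "d3 is mine, e4 is theirs"}]
```

Checkpointing summaries.jsonl after each successful batch (per the risk table below) would wrap this call site, not this function.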
P5 (complex) — AV with activation injection (SFT warm-start) (FR5 SFT half, FR7)
Goal: A Qwen2.5-1.5B that, after SFT, produces a paper-style 2–3-snippet description when handed an Othello-GPT layer-6 residual.
Changes Required:
src/nla/models/adapter.py — ActivationAdapter(nn.Module): learned affine map ℝ^512 → ℝ^1536 (Qwen's d_model). Initialized so the output norm at init matches Qwen's median embedding-token norm at the inject position.
src/nla/models/av.py — AV wraps Qwen2.5-1.5B; loads base via HF; defines the fixed instruction prompt (paper's structure with one <concept>A</concept> token); patches the embedding of that single token at the forward pass with adapter(h_l).
src/nla/training/sft.py — generic SFT loop usable by both AV and AR. Streams from JSONL, computes cross-entropy (AV) or MSE (AR). LR schedule: 100-step warmup to 1e-5, then cosine decay to 0 over the remaining steps.
scripts/04_sft_av.py — assigns AV to 4 of 8 GPUs (data-parallel), 256 batch size, ~10 epochs over 5k pairs. Saves to checkpoints/av_sft.pt.
tests/test_adapter.py — adapter input/output shapes; av.generate() end-to-end smoke with random residual.
Success Criteria:
Automated: SFT train loss decreases monotonically (within noise); held-out cross-entropy improves over base Qwen.
Manual: generate 10 verbalizations on held-out residuals; outputs should be coherent English mentioning at least one specific square. If outputs are gibberish or empty, debug the adapter init / inject site before declaring done.
Risks: Adapter initialization scale — if too large, AV output collapses to garbled; if too small, AV ignores activation. Verify via attention/logit attribution that the inject token influences output.
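One way to realize the norm-matched init described above: for i.i.d. Gaussian weights, E‖Wx‖² = s²·d_out when ‖x‖ = 1, so choosing s = target_norm/√d_out matches norms in expectation. A numpy sketch (the target norm of 10.0 is a made-up stand-in for the measured Qwen median embedding norm):

```python
import numpy as np

def init_adapter(d_in: int, d_out: int, target_norm: float, seed: int = 0):
    """Affine map whose output norm for a unit-norm input ~= target_norm at init.

    With W_ij ~ N(0, s^2) and ||x|| = 1, E||Wx||^2 = s^2 * d_out,
    so s = target_norm / sqrt(d_out).
    """
    rng = np.random.default_rng(seed)
    scale = target_norm / np.sqrt(d_out)
    W = rng.normal(0.0, scale, size=(d_out, d_in))
    b = np.zeros(d_out)
    return W, b

# d_in=512 (Othello-GPT resid), d_out=1536 (Qwen2.5-1.5B d_model per the plan).
W, b = init_adapter(512, 1536, target_norm=10.0)
x = np.random.default_rng(1).normal(size=512)
x /= np.linalg.norm(x)
out_norm = float(np.linalg.norm(W @ x + b))
assert 9.0 < out_norm < 11.0  # concentrates tightly around the target at this width
```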
P6 — Diagnostic reconstructor suite (FR6)
Goal: Four reconstructor models, each trained to map its respective inputs to the layer-6 residual under MSE. The four are: (a) mean — trivial baseline, just predicts the train-set mean; (b) prompt-only — small encoder reading the Othello move sequence; (c) verb-only (the paper-faithful primary AR) — reads the AV's text; (d) prompt+verb — reads both. Headline gap = (d) FVE − (b) FVE.
Changes Required:
src/nla/models/ar.py — AR: 4-layer transformer encoder (d=512, h=8, ~25M params) reading the AV verbalization wrapped in the paper's "Summary of the following text: <text>...</text> <summary>" prompt. Final-token pooled → linear head → ℝ^512.
src/nla/models/ar_baselines.py — PromptOnlyAR (reads Othello move tokens via a fresh small GPT-2-style encoder), PromptPlusVerbAR (concatenates move sequence + verbalization with a separator), MeanAR (predicts dataset mean, no parameters).
scripts/05_sft_ar.py — trains all 4 ARs data-parallel across remaining 4 GPUs (P5 uses 4). SFT data = (P4 summaries, P3 activations). 256 batch size, MSE on unit-L2-normalized targets.
tests/test_ar.py — shape tests; MeanAR sanity (predicting mean gives FVE = 0 by definition).
Success Criteria:
Automated: all 4 ARs train without NaN; eval FVE: MeanAR ≈ 0, others > 0. Prompt-only AR FVE < 0.7 (otherwise it has saturated the deterministic prompt→activation mapping and the AR is too big — shrink to 2 layers).
Manual: inspect the eval-loss curves of all 4; verb-only and prompt+verb should be in similar ballparks post-SFT (~0.3 FVE per paper).
Risks: If prompt-only AR saturates to ceiling, our headline gap collapses. Mitigation: start small, shrink AR depth (4 → 2 layers) if needed.
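FVE as used throughout the plan is 1 − MSE(pred) / MSE(mean baseline), so the mean predictor scores exactly 0 — the sanity check named in tests/test_ar.py. A minimal sketch on synthetic targets:

```python
import numpy as np

def fve(pred: np.ndarray, target: np.ndarray) -> float:
    """Fraction of variance explained: 1 - MSE(pred) / MSE(mean baseline)."""
    mse = ((pred - target) ** 2).mean()
    baseline = ((target.mean(axis=0) - target) ** 2).mean()
    return float(1.0 - mse / baseline)

rng = np.random.default_rng(0)
target = rng.normal(size=(1000, 512))

# MeanAR: predicting the mean gives FVE = 0 by definition.
mean_pred = np.broadcast_to(target.mean(axis=0), target.shape)
assert abs(fve(mean_pred, target)) < 1e-9

# A partial reconstructor lands strictly between 0 and 1.
good_pred = target + 0.5 * rng.normal(size=target.shape)
assert 0.5 < fve(good_pred, target) < 0.9
```

(In the real eval the baseline mean comes from the train split, so eval-set MeanAR FVE is ≈ 0 rather than exactly 0.)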
P7 — Joint GRPO + MSE RL loop
Goal: Drive the verb-only AR's FVE up by RL'ing the AV with reward = −log MSE under the AR. This is the headline experiment; project risk concentrates here.
Changes Required:
src/nla/training/grpo.py — single GRPO step: sample G rollouts, compute group-normalized advantages from rewards, compute policy gradient with KL penalty toward frozen post-SFT AV (NOT the original Qwen base — paper anchors to SFT-AV).
src/nla/training/joint_loop.py — per-batch: (1) sample G=8 verbalizations from AV at T=1 with length cap 150 toks; (2) compute reward via verb-only AR (frozen for this step); (3) GRPO step on AV; (4) take the most-recently-sampled verbalizations and do one MSE step on the verb-only AR; AV and AR updates do NOT couple gradients. Repeat.
scripts/06_joint_rl.py — orchestrates the full RL run; checkpoints every 200 steps; logs FVE + KL + reward to W&B; runs the KL sweep first.
src/nla/training/kl_sweep.py — small grid (β ∈ {0.01, 0.05, 0.1, 0.5}), 500-step probe runs, pick the β that keeps KL bounded while reward trends up. Then run the chosen β for the long run.
configs/grpo.yaml — populated with picked β, lr=1e-5, batch=128, G=8, steps=3000.
tests/test_grpo_step.py — one GRPO step on synthetic data; verifies advantage normalization, KL computation, gradient shape.
Success Criteria:
Automated: No NaN/Inf in losses; KL stays bounded (within 5× its post-warmup level); FVE checkpoint at step 3000 strictly above SFT-only FVE checkpoint.
Manual: inspect verbalizations every 500 steps. They should remain in English, mention specific squares, and become slightly more informative over time. Stop condition: if FVE has not moved above SFT baseline by step 1500, abort, document, write up failure path.
Risks: KL collapse (mitigate via sweep), AV degeneration to gibberish (caught by writing-quality check every 500 steps), runaway compute (1 node-day hard cap), AR not improving fast enough to provide useful reward gradient (mitigated by alternating with extra AR-only steps when needed).
Reasoning note (Ultrathink applies): The thing that distinguishes this phase is that two things have to work simultaneously — the AV has to find better verbalizations and the AR has to keep up. A failure here can come from either side or from their interaction. Diagnostics (per-step reward, per-step KL, per-step AR loss, sample verbalizations) need to be wired up before the long run, not after a failed long run.
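The two numerical pieces of the GRPO step — the −log MSE reward and the group-normalized advantage — are small enough to sketch; B=4 prompts and G=8 rollouts here match the config's G. The real step in src/nla/training/grpo.py additionally applies the KL penalty toward the frozen SFT-AV:

```python
import numpy as np

def reward_from_mse(mse_per_rollout: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Reward for each AV rollout: -log MSE of the verb-only AR's reconstruction."""
    return -np.log(mse_per_rollout + eps)

def group_normalized_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO advantages: z-score rewards within each group of G rollouts.

    rewards: (B, G) -- B prompts (residuals), G sampled verbalizations each.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

rng = np.random.default_rng(0)
mse = rng.uniform(0.1, 1.0, size=(4, 8))  # B=4 groups, G=8 rollouts each
adv = group_normalized_advantages(reward_from_mse(mse))
assert adv.shape == (4, 8)
assert np.allclose(adv.mean(axis=1), 0.0, atol=1e-8)  # zero mean within each group
```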
P8 — Evaluation: FVE table, confabulation audit, steganography sanity
src/nla/eval/steganography.py — apply (i) paraphrase via CLI, (ii) bullet-shuffle, (iii) translation to French via CLI. Recompute FVE on each. Small drops are healthy; large drops flag steganography.
src/nla/eval/confab_judge.py — pick 50 positions (stratified across early/mid/late game), generate AV verbalizations, run the LLM-judge prompt: "Here is the move sequence, board state, probe-derived features, and a description; rate the description's factual accuracy 0–3 and list any specifically-wrong claims." Aggregate hit-rate.
Automated: all three result files exist with expected schemas; FVE table has all 4 ARs × all checkpoints; confab_judge has 50 rows.
Manual: read 10 worked examples (5 high-judge, 5 low-judge). Are the high-judge ones genuinely accurate vs. probe? Are the low-judge ones genuine confabulations, not just terse but correct?
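The bullet-shuffle transform is order-only: same snippets, permuted. A sketch (seeded for reproducibility; the snippet text here is hypothetical):

```python
import random

def bullet_shuffle(verbalization: str, seed: int = 0) -> str:
    """Shuffle the order of snippet lines; content is preserved, only order changes.

    If information were hidden in snippet ORDER, FVE computed on the shuffled
    text would drop sharply -- that is the steganography flag.
    """
    lines = [ln for ln in verbalization.splitlines() if ln.strip()]
    rng = random.Random(seed)
    rng.shuffle(lines)
    return "\n".join(lines)

original = "- d3 is mine, e4 is theirs\n- corner a1 is controlled\n- edge row 1 contested"
shuffled = bullet_shuffle(original, seed=1)
assert sorted(shuffled.splitlines()) == sorted(original.splitlines())  # same content
```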
P9 (standard) — Notion writeup + go/no-go (FR13)
Goal: The decision artifact. Everything else is plumbing for this.
Changes Required:
writeup/notion_draft.md — markdown source synced to Notion (one-shot, manually pasted; we are not building a Notion API client for a spike).
Sections: What we built · Setup & hparams table · FVE table (annotated) · Confabulation hit-rate with 10 examples · Steganography sanity check result · Writing-quality decay · What surprised me · Go/no-go on Pleiades.
The go/no-go is exactly one sentence followed by the three bullets of reasoning that drove it.
Success Criteria:
Automated: all required sections present in markdown (lint with a tiny pre-commit check).
Manual: Notion page published; recommendation is unambiguous; a teammate reading it can replicate the FVE table from the artifact links.
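The "tiny pre-commit check" on required sections can be as small as this sketch; section names are taken from the list above, and the substring-matching rule is a judgment call:

```python
REQUIRED_SECTIONS = [
    "What we built", "Setup & hparams", "FVE table", "Confabulation",
    "Steganography", "Writing-quality decay", "What surprised me", "Go/no-go",
]

def missing_sections(markdown: str) -> list[str]:
    """Return required section names that do not appear in the writeup draft."""
    lowered = markdown.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]

# Hypothetical partial draft: three sections present, the rest missing.
draft = "# What we built\n# Setup & hparams\n# FVE table\n"
missing = missing_sections(draft)
assert "Go/no-go" in missing and "What we built" not in missing
# In the hook: exit nonzero if missing_sections(open(path).read()) is non-empty.
```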
P10 (optional) — Intervention experiment
Goal: Push one verbalization to claim a square is "mine" when the probe says "empty", run the AR, and check whether the predicted residual moves in the direction the probe associates with the claimed feature. Decide at P8 whether to do this, based on remaining time and the confabulation results.
scripts/08_intervention.py — runs 10 paired (original, edited) verbalizations; reports cosine similarity of AR-predicted residuals to the probe direction of the claimed feature.
Success Criteria:
Manual: the experiment runs and produces a number. No threshold — we report what we see.
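One concrete reading of the reported number — an interpretation, since the script description says "cosine similarity of AR-predicted residuals to the probe direction" — is the cosine between the change in the AR prediction (edited minus original) and the probe direction of the claimed feature. A sketch with hypothetical vectors:

```python
import numpy as np

def intervention_score(pred_orig: np.ndarray, pred_edit: np.ndarray,
                       probe_dir: np.ndarray) -> float:
    """Cosine between the prediction DELTA (edited - original) and the probe
    direction of the claimed feature (e.g. 'mine' for the edited square)."""
    delta = pred_edit - pred_orig
    denom = np.linalg.norm(delta) * np.linalg.norm(probe_dir)
    return float(delta @ probe_dir / denom)

# Hypothetical: editing the claim moved the prediction partly along the probe dir.
rng = np.random.default_rng(0)
probe_dir = rng.normal(size=512)
pred_orig = rng.normal(size=512)
pred_edit = pred_orig + 0.5 * probe_dir + 0.1 * rng.normal(size=512)
score = intervention_score(pred_orig, pred_edit, probe_dir)
assert score > 0.5  # positive alignment in this constructed example
```

Per the success criteria, no threshold is imposed in the real run — the 10 paired scores are simply reported.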
9. Testing Strategy
This is a spike. The testing pyramid is shallow on purpose:
Smoke tests at every phase boundary — tests/smoke/test_imports.py (P1), tests/test_othello_load.py + tests/test_probe.py (P2), shape + invariant tests for activations/adapter/AR (P3, P5, P6), one GRPO-step-on-synthetic-data test (P7).
Per-phase invariant checks baked into the script — e.g., probe accuracy gate in 01_load_and_verify.py raises if < 95%; FVE-of-MeanAR check in 05_sft_ar.py.
No integration test for the full pipeline — the integration is the scientific result. The FVE table + LLM-judge hit-rate are the headline acceptance gate.
What we are not doing: end-to-end fixture-based pipeline tests, mutation testing, coverage targets. These are over-engineering for a 5-pt learning spike.
10. Risks (plan-level)
The FRD's risk table covers technique-level risks (KL collapse, confabulation, steganography). Plan-level execution risks are different:
| Risk | Mitigation |
| --- | --- |
| P4's Claude Code CLI rate-limits at scale (5k summaries even at 20-per-batch is ~250 calls) | Pace at 5 calls/min; checkpoint summaries.jsonl after every batch so a rate-limit hit doesn't lose progress; pin the model with --model. |
| The 8×H200 node access is delayed | P5, P6, P7 can be staged from the local Mac up to a point; the full RL run is gated on the node. Schedule node access early in the week. |
| P7 fails to find a working β (KL coefficient) | Sweep is built into the plan (kl_sweep.yaml); if no β in {0.01, 0.05, 0.1, 0.5} works, expand to {0.001, 1.0} before declaring failure. |
| Prompt-only AR baseline saturates → headline gap collapses to noise | Built-in check in P6 (eval FVE < 0.7 on prompt-only); if breached, shrink AR depth from 4 → 2 layers. |
| Confabulation grading is too lenient/strict (LLM-judge prompt sensitivity, ±10% per paper) | Run the judge prompt on 5 obvious-true and 5 obvious-false examples first; iterate the prompt until calibrated before running on the 50-position set. |
| Total time exceeds the 5-pt estimate | P10 is explicitly optional. Confabulation eval (P8) can also degrade from "50 + judge" to "15 + eyeball" if compute pressure forces it. |
11. Open Questions to Resolve During Execution
Tracked here rather than blocking the spec:
Exact AR encoder — from-scratch 4-layer or a truncated Qwen mini? Decide after seeing P6's first SFT eval curves; the call is "whichever gives lower prompt-only FVE while still being above the mean baseline."
Final β for the long RL run — decided during P7 sweep, not now.
Whether to run P10 — decided at the end of P8 based on time + whether the confabulation result motivates it.
Exact W&B project name — bikeshed inside P7.
Approval: see Linear ticket for sign-off. After Plan approval, execution begins per the DAG above; ticket moves to In Progress and each phase commit references the phase number for traceability.