TECH-3502 — Plan: Toy NLA on Othello-GPT

Ticket: TECH-3502 · FRD: FRD · Type: Learning spike (5 pts) · Date: 2026-05-12

TL;DR

1. Overview

This plan operationalizes the FRD. The end state is: a runnable, reproducible NLA training loop on Othello-GPT layer-6 residuals; an FVE table comparing four diagnostic reconstructor configurations; a confabulation audit on ~50 positions scored by an LLM judge; and a Notion writeup containing a one-sentence go/no-go on applying NLAs to Pleiades.

2. FRD Requirement → Phase Mapping

| FR | Requirement (summary) | Phase |
| --- | --- | --- |
| FR1 | Load Othello-GPT via TransformerLens, verify next-legal-move sanity | P2 |
| FR2 | Reproduce probe accuracy > 95% at layer 6 | P2 |
| FR3 | Activation dataset @ layer 6, unit-L2 normalized, train/eval split | P3 |
| FR4 | ~5k SFT (residual, summary) pairs via Claude Code CLI | P4 |
| FR5 | AV (Qwen2.5-1.5B) SFT + GRPO RL with KL anchor | P5, P7 |
| FR6 | AR (small text encoder + linear head) SFT + joint RL | P6, P7 |
| FR7 | Activation injection via learned affine adapter | P5 |
| FR8 | FVE on held-out split, primary headline metric | P8 (computed during P7 too) |
| FR9 | Three diagnostic baselines: mean / prompt-only / prompt+verb | P6 (build), P8 (report) |
| FR10 | Confabulation audit: ~50 positions, LLM judge | P8 |
| FR11 | Steganography sanity (paraphrase + shuffle) | P8 |
| FR12 | Writing-quality decay tracking | P8 |
| FR13 | Notion writeup + go/no-go | P9 |
| FR14 | Optional: one intervention experiment | P10 (optional) |

Every FR maps to at least one phase. No gaps.

3. Current State Analysis

The worktree at nla-worktrees/TECH-3502 contains only the Python project shell: no source code, no tests, no scaffolding. Everything below is new.

4. Desired End State

  1. src/nla/ package with submodules for data, models, training, eval.
  2. scripts/0[1-7]_*.py orchestration scripts, one per major step, runnable independently.
  3. tests/ with smoke tests gating each phase boundary.
  4. results/fve_table.json with FVE numbers for {mean, prompt-only AR, verb-only AR, prompt+verb AR} × {SFT-only, RL-steps {1k, 2k, 3k, final}}.
  5. results/confab_judge.jsonl with per-position judge scores + ~10 worked examples.
  6. results/steg_sanity.json with FVE delta under paraphrase + bullet-shuffle of AV outputs.
  7. A Notion page (linked from the writeup file) containing setup, hparams, FVE plots, surprises, and the go/no-go.
  8. All training logs in W&B for posterity.

5. What We're Not Doing

Inherits from the FRD's Non-Goals section. Summary:

6. Implementation Approach

The work is mostly serial because each phase depends on artifacts produced by the previous one: model load → activation cache → SFT data → SFT-warmed AV/AR → joint RL → eval → writeup. The single useful parallelization is in Wave 5: once SFT data exists, AV SFT (P5) and the four ARs' SFT (P6) can train simultaneously on disjoint GPUs of the 8×H200 node.

The single critical phase is P7 (joint GRPO+MSE RL). Everything before P7 is "plumbing the experiment"; everything after P7 is "report what happened." If P7 fails (KL collapse, AV degeneration, runaway divergence), the spike still has a story to tell — that becomes the surprise / lesson and feeds the go/no-go.

7. Execution DAG

Phases at the same vertical level can run in parallel. Dashed arrows show optional dependencies (e.g., P8 reads checkpoints during P7 for intermediate FVE plots).

Wave 1: P1 [trivial] — Dependencies + bootstrap
Wave 2: P2 [standard] — Othello-GPT + probe shakedown
Wave 3: P3 [standard] — Activation dataset @ L6
Wave 4: P4 [complex] — SFT data via Claude Code CLI (~5k pairs)
Wave 5 (parallel): P5 [complex] — AV + activation injection (SFT) · P6 [complex] — 4 ARs (SFT, data-parallel)
Wave 6: P7 [CRITICAL] — Joint GRPO + MSE RL loop (1 node-day cap)
Wave 7: P8 [complex] — Eval: FVE + steg sanity + confab audit
Wave 8: P9 [standard] — Notion writeup + go/no-go · P10 [COULD] — Intervention experiment (optional)

Complexity legend: trivial — config / deps; standard — well-trodden code; complex — multi-file, new abstractions; CRITICAL — novel algorithm + project-risk-bearing.

Solid arrow = hard dependency · Dashed arrow = soft/optional dependency · Dashed box = optional phase.

8. Phases

trivial   P1 — Dependencies + repo bootstrap

Goal: Make the repo runnable. Add deps, package skeleton, smoke test that everything imports.

Changes Required:

Success Criteria:

standard   P2 — Othello-GPT load + probe shakedown (FR1, FR2)

Goal: Confirm the entire downstream story is plausible by reproducing the published probe-accuracy numbers before we spend compute on activations.

Changes Required:

Success Criteria:

Gotchas to watch (from the research brief): two index spaces (string 0–63 vs. int 0–60), turn-parity in the probe (use tl_probing_v1.py's mode-split idiom), off-by-one on context length (n_ctx=59).
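To make the two-index-spaces gotcha concrete, here is a minimal sketch of the conversion. It assumes the board is 0–63 row-major, the four pre-occupied centre squares (27, 28, 35, 36) are never legal moves, and token 0 is reserved (hence the 0–60 int space); verify all three assumptions against the actual tokenizer and tl_probing_v1.py before relying on it.

```python
# Hypothetical mapping between the two Othello index spaces noted above.
CENTRE = {27, 28, 35, 36}  # d4, e4, d5, e5 on an 8x8 row-major board
PLAYABLE = [sq for sq in range(64) if sq not in CENTRE]  # 60 playable squares

def square_to_token(sq: int) -> int:
    """Map a 0-63 board square to a 1-60 move-vocab index (0 assumed reserved)."""
    if sq in CENTRE:
        raise ValueError(f"square {sq} is never a legal move")
    return PLAYABLE.index(sq) + 1

def token_to_square(tok: int) -> int:
    """Inverse map: 1-60 move-vocab index back to a 0-63 board square."""
    return PLAYABLE[tok - 1]

# Round-trip sanity check over every playable square.
assert all(token_to_square(square_to_token(s)) == s for s in PLAYABLE)
```

A paired unit test like this round-trip check is cheap insurance against the off-by-one bugs the brief warns about.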

standard   P3 — Activation dataset @ layer 6 (FR3)

Goal: A clean, reusable cache of layer-6 residuals on ~30k synthetic Othello games. Single source of truth for all downstream training.

Changes Required:

Success Criteria:

Notes: ~30k games × ~50 positions × 512 d_model in fp16 = ~1.5 GB on disk per layer. Negligible.
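The normalization-and-packing half of this phase can be sketched independently of the model. The random array below is a stand-in for the cached layer-6 residuals (e.g., from TransformerLens's run_with_cache); only the unit-L2 + fp16 convention is the point here.

```python
# Sketch of the FR3 cache convention: unit-L2 normalize each position's
# residual, store as fp16. Shapes (50 positions, 512 d_model) match the note.
import numpy as np

def normalize_and_pack(resid: np.ndarray) -> np.ndarray:
    """resid: (n_positions, d_model) float32 -> unit-L2 normalized fp16."""
    norms = np.linalg.norm(resid, axis=-1, keepdims=True)
    norms = np.maximum(norms, 1e-8)  # guard against all-zero vectors
    return (resid / norms).astype(np.float16)

resid = np.random.randn(50, 512).astype(np.float32)  # stand-in for one game
packed = normalize_and_pack(resid)
assert packed.dtype == np.float16
# fp16 rounding perturbs norms by ~1e-3, so check with loose tolerance.
assert np.allclose(np.linalg.norm(packed.astype(np.float32), axis=-1), 1.0, atol=1e-2)
```

At this dtype, the disk estimate above follows directly: 30k × 50 × 512 × 2 bytes ≈ 1.5 GB per layer.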

complex   P4 — SFT warm-start data via Claude Code CLI (FR4)

Goal: ~5k (residual, summary) pairs where each summary is a 2–3-snippet description of "what Othello-GPT is computing at layer 6 for this position." This is the single most important risk-reducer for AV training — the paper notes runs collapse without it.

Changes Required:

Success Criteria:

Risks: CLI rate limits, model-version drift between calls (mitigated by pinning --model), batched-JSON parse failures (mitigated by retry + small batch size).
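The retry + per-batch checkpoint pattern from the risks above can be sketched as follows. `call_claude_batch` is a hypothetical stand-in for the actual Claude Code CLI invocation (which this sketch does not attempt to reproduce); the resilience logic is the point.

```python
# Sketch: append each completed batch to summaries.jsonl immediately, so a
# rate-limit failure loses at most one batch, and resume from the line count.
import json
import pathlib
import time

def call_claude_batch(positions):
    # Hypothetical placeholder; replace with the real pinned-model CLI call.
    return [{"position": p, "summary": f"stub summary for {p}"} for p in positions]

def generate(positions, out_path="summaries.jsonl", batch_size=20, retries=3):
    out = pathlib.Path(out_path)
    done = sum(1 for _ in out.open()) if out.exists() else 0  # resume point
    for i in range(done, len(positions), batch_size):
        batch = positions[i:i + batch_size]
        for attempt in range(retries):
            try:
                rows = call_claude_batch(batch)
                break
            except Exception:
                time.sleep(2 ** attempt)  # back off on rate limits / parse errors
        else:
            raise RuntimeError(f"batch at offset {i} failed after {retries} tries")
        with out.open("a") as f:  # checkpoint: one JSON line per (residual, summary)
            for row in rows:
                f.write(json.dumps(row) + "\n")
```

One line per pair keeps the resume logic trivial: the number of completed pairs is just the line count of the file.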

complex   P5 — AV with activation injection (SFT warm-start) (FR5 SFT half, FR7)

Goal: A Qwen2.5-1.5B that, after SFT, produces a paper-style 2–3-snippet description when handed an Othello-GPT layer-6 residual.

Changes Required:

Success Criteria:

Risks: Adapter initialization scale — if too large, the AV's output collapses into garbled text; if too small, the AV ignores the activation. Verify via attention/logit attribution that the injected token actually influences the output.
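A minimal sketch of the FR7 adapter, assuming the residual is injected as a single soft token in the AV's embedding space. The hidden size of 1536 is an assumption about Qwen2.5-1.5B (check the model config), and the init scale is exactly the knob flagged in the risk above.

```python
# Sketch: learned affine map from Othello-GPT residual (d=512) to one
# soft token in the AV's embedding space. d_av=1536 assumes Qwen2.5-1.5B.
import torch
import torch.nn as nn

class InjectionAdapter(nn.Module):
    def __init__(self, d_resid: int = 512, d_av: int = 1536, init_scale: float = 0.02):
        super().__init__()
        self.proj = nn.Linear(d_resid, d_av)
        # Small init (per the risk note): too large garbles AV output,
        # too small lets the AV ignore the activation entirely.
        nn.init.normal_(self.proj.weight, std=init_scale)
        nn.init.zeros_(self.proj.bias)

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        """(batch, d_resid) -> (batch, 1, d_av): one soft token per residual."""
        return self.proj(resid).unsqueeze(1)

adapter = InjectionAdapter()
soft_tok = adapter(torch.randn(4, 512))
assert soft_tok.shape == (4, 1, 1536)
```

The soft token would then be concatenated into the AV prompt's embedding sequence; the attribution check in the risk note verifies it is actually attended to.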

complex   P6 — AR + 3 diagnostic baselines (SFT warm-start) (FR6, FR9 build)

Goal: Four reconstructor models, each trained to map their respective inputs to the layer-6 residual under MSE. The four are: (a) mean — trivial baseline, just predicts the train-set mean; (b) prompt-only — small encoder reading the Othello move sequence; (c) verb-only (the paper-faithful primary AR) — reads the AV's text; (d) prompt+verb — reads both. Headline gap = (d) FVE − (b) FVE.

Changes Required:

Success Criteria:

Risks: If prompt-only AR saturates to ceiling, our headline gap collapses. Mitigation: start small, shrink AR depth (4 → 2 layers) if needed.
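Since all four reconstructors are compared on the same metric, it is worth pinning down FVE itself. A minimal sketch, assuming the standard definition (1 − MSE over the variance of the held-out targets):

```python
# Sketch of the FVE metric (FR8): fraction of variance explained.
# 1.0 = perfect reconstruction; 0.0 = no better than the mean baseline.
import numpy as np

def fve(pred: np.ndarray, target: np.ndarray) -> float:
    """pred, target: (n, d_model) arrays of predicted / true residuals."""
    mse = np.mean((pred - target) ** 2)
    var = np.mean((target - target.mean(axis=0)) ** 2)
    return float(1.0 - mse / var)

target = np.random.randn(1000, 512)
assert fve(target, target) == 1.0                      # perfect reconstruction
mean_pred = np.broadcast_to(target.mean(axis=0), target.shape)
assert abs(fve(mean_pred, target)) < 1e-12             # mean baseline scores ~0
```

Under this definition the (a) mean baseline sits at FVE ≈ 0 by construction, which is what makes it the natural floor for the table in section 4.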

CRITICAL   P7 — Joint GRPO + MSE RL loop (rest of FR5 + FR6, primary FR8 driver)

Goal: Drive the verb-only AR's FVE up by RL'ing the AV with reward = −log MSE under the AR. This is the headline experiment; project risk concentrates here.

Changes Required:

Success Criteria:

Risks: KL collapse (mitigate via sweep), AV degeneration to gibberish (caught by writing-quality check every 500 steps), runaway compute (1 node-day hard cap), AR not improving fast enough to provide useful reward gradient (mitigated by alternating with extra AR-only steps when needed).

Reasoning note (Ultrathink applies): The thing that distinguishes this phase is that two things have to work simultaneously — the AV has to find better verbalizations and the AR has to keep up. A failure here can come from either side or from their interaction. Diagnostics (per-step reward, per-step KL, per-step AR loss, sample verbalizations) need to be wired up before the long run, not after a failed long run.
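The reward shaping at the heart of P7 can be sketched in isolation. This assumes reward = −log MSE under the AR and GRPO-style group normalization (z-scoring rewards within the group of verbalizations sampled for the same residual); group size and epsilon values are illustrative, not fixed by the plan.

```python
# Sketch of the P7 reward signal: -log MSE, normalized within each group
# of G verbalizations of the same residual (GRPO-style advantages).
import numpy as np

def grpo_advantages(mse_per_sample: np.ndarray) -> np.ndarray:
    """mse_per_sample: (n_groups, group_size) AR reconstruction MSEs."""
    rewards = -np.log(mse_per_sample + 1e-8)   # lower MSE -> higher reward
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-6
    return (rewards - mean) / std              # zero-mean advantage per group

adv = grpo_advantages(np.array([[0.5, 0.2, 0.9, 0.4]]))
assert abs(float(adv.mean())) < 1e-6           # normalized within the group
assert int(adv[0].argmax()) == 1               # lowest-MSE verbalization wins
```

The per-step diagnostics named above (reward, KL, AR loss) would all be computed on the same `rewards` tensor before normalization, which is why wiring them up first is cheap.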

complex   P8 — Final eval pipeline + confabulation audit (FR8, FR9, FR10, FR11, FR12)

Goal: Produce the headline numbers that go into the writeup.

Changes Required:

Success Criteria:
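The bullet-shuffle half of the steganography sanity check (FR11) can be sketched as follows. `ar_fve` is a hypothetical stand-in for "run the AR on these verbalizations and compute FVE"; the check itself is just a before/after delta.

```python
# Sketch of FR11: if FVE survives a random reordering of the AV's snippets,
# the information lives in the text content, not in positional steganography.
import random

def bullet_shuffle(verbalization: str, seed: int = 0) -> str:
    """Randomly reorder the non-empty lines (snippets) of a verbalization."""
    lines = [l for l in verbalization.splitlines() if l.strip()]
    rng = random.Random(seed)
    rng.shuffle(lines)
    return "\n".join(lines)

def steg_delta(verbalizations, residuals, ar_fve):
    """FVE drop under shuffling; a large positive delta is suspicious."""
    base = ar_fve(verbalizations, residuals)
    shuffled = [bullet_shuffle(v) for v in verbalizations]
    return base - ar_fve(shuffled, residuals)
```

The paraphrase variant is the same delta with a paraphrasing model in place of `bullet_shuffle`; both deltas land in results/steg_sanity.json.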

standard   P9 — Notion writeup + go/no-go (FR13)

Goal: The decision artifact. Everything else is plumbing for this.

Changes Required:

Success Criteria:

COULD   P10 — Intervention experiment (FR14, optional)

Goal: Push one verbalization to claim a square is "mine" when the probe says "empty", run AR, check whether the predicted residual moves in the direction the probe associates with the claimed feature. Decide at P8 whether to do this based on remaining time and confabulation results.

Changes Required:

Success Criteria:
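The measurement in this phase reduces to a single projection. A sketch, where `ar_predict` and `probe_direction` are hypothetical stand-ins for the trained AR and the layer-6 linear probe's direction for the claimed feature:

```python
# Sketch of the P10 check: does the AR-predicted residual move along the
# probe direction when the verbalization's claim is edited ("empty"->"mine")?
import numpy as np

def intervention_effect(orig_text, edited_text, ar_predict, probe_direction):
    """Signed movement of the predicted residual along the probe direction."""
    delta = ar_predict(edited_text) - ar_predict(orig_text)  # (d_model,)
    d = probe_direction / np.linalg.norm(probe_direction)
    return float(delta @ d)  # positive = moved toward the claimed feature
```

A clearly positive effect, compared against a null distribution from unedited re-runs, would be the minimal "the verbalization is causally load-bearing" result.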

9. Testing Strategy

This is a spike. The testing pyramid is shallow on purpose:

What we are not doing: end-to-end fixture-based pipeline tests, mutation testing, coverage targets. These are over-engineering for a 5-pt learning spike.

10. Risks (plan-level)

The FRD's risk table covers technique-level risks (KL collapse, confabulation, steganography). Plan-level execution risks are different:

| Risk | Mitigation |
| --- | --- |
| P4's Claude Code CLI rate-limits at scale (5k summaries even at 20 per batch is ~250 calls) | Pace at 5 calls/min; checkpoint summaries.jsonl after every batch so a rate-limit hit doesn't lose progress; pin the model with --model. |
| The 8×H200 node access is delayed | P5, P6, P7 can be staged from a local Mac up to a point; the full RL run is gated on the node. Schedule node access early in the week. |
| P7 fails to find a working β (KL coefficient) | The sweep is built into the plan (kl_sweep.yaml); if no β in {0.01, 0.05, 0.1, 0.5} works, expand to {0.001, 1.0} before declaring failure. |
| Prompt-only AR baseline saturates → headline gap collapses to noise | Built-in check in P6 (eval FVE < 0.7 on prompt-only); if breached, shrink AR depth from 4 → 2 layers. |
| Confabulation grading is too lenient/strict (LLM-judge prompt sensitivity, ±10% per the paper) | Run the judge prompt on 5 obvious-true and 5 obvious-false examples first; iterate the prompt until calibrated before running the 50-position set. |
| Total time exceeds the 5-pt estimate | P10 is explicitly optional. The confabulation eval (P8) can also degrade from "50 + judge" to "15 + eyeball" if compute pressure forces it. |

11. Open Questions to Resolve During Execution

Tracked here rather than blocking the spec:


Approval: see Linear ticket for sign-off. After Plan approval, execution begins per the DAG above; ticket moves to In Progress and each phase commit references the phase number for traceability.