TECH-3501 Plan: stream_mixing + ledger observability metrics

≈2200 words · ≈10 min read · last updated 2026-05-12 · 8 phases in 4 waves · ≈ 5pt

TL;DR

Overview

This plan delivers the 17 functional requirements in FRD v5 — 11 new wandb scalars (7 stream_mixing + 4 ledger, with auto _max siblings on the ledger metrics) across two subsystems, plus the trainer wiring that joins them. The breakdown maximises parallelism while keeping each phase atomic and committable on its own.

Critical sequencing constraint: P4 (ledger false-eviction) needs the SUM-mode extension from P3 (reduce_loss_metrics) to surface its counters correctly cluster-wide. So P3 must complete before P4 starts. Everything else is wider parallelism.

FRD Requirements Summary

From the FRD (currently local-only, see Linear comment for path):

Current State Analysis

Establishing the load-bearing baseline points from research:

Desired End State

After all phases land:

What We're NOT Doing

Execution DAG

Execution DAG Wave 1 fans out into three independent phases (P1 stream_mixing dataset, P3 reduce_loss_metrics SUM-mode, P7 diagnostic smoke configs); wave 2 runs P2 (aggregator + trainer wiring, depends on P1) and P4 (ledger false-eviction instrumentation, depends on P3); wave 3 runs P5 (stream_mixing tests, depends on P2) and P6 (ledger tests, depends on P4); wave 4 runs P8 (smoke verification, depends on P5, P6, and P7). Wave 1 Wave 2 Wave 3 Wave 4 P1 · standard stream_mixing dataset instrumentation P3 · standard reduce_loss_metrics SUM mode P7 · trivial diagnostic smoke configs P2 · standard aggregator + trainer wiring P4 · complex ledger false-eviction instrumentation P5 · standard stream_mixing unit tests P6 · standard ledger false-eviction tests P8 · trivial healthy smoke + 3 diagnostic smokes (wandb verification)
trivial standard complex (think hard) critical (ultrathink, n/a here)

Phases

Phase 1: stream_mixing dataset instrumentation standard

Goal: Land all dataset-side changes for stream_mixing observability — new metric population at the pick site, modality helper extraction, removal of the old debug log knob, resume-reshard INFO log. Pure dataset edits; no trainer wiring yet.

Maps to FRDs: FR1, FR2, FR3, FR6, FR7, FR9.

Changes Required:

Success Criteria:

Phase 2: stream_mixing aggregator + trainer wiring standard

Goal: Add the cross-rank aggregator that gives each metric its correct distributed reduction, and wire it into the trainer's per-step metric flow.

Maps to FRDs: FR4, FR5.

Changes Required:

Success Criteria:

Depends on: P1 (needs drain_step_metrics API).

Phase 3: reduce_loss_metrics SUM-mode extension standard

Goal: Add an opt-in SUM reduction mode to the shared reduce_loss_metrics utility so counter metrics emit cluster-wide totals instead of per-rank means. Pure shared-utility change; independent of stream_mixing and ledger work.

Maps to FRDs: FR13.

Changes Required:

Success Criteria:

Depends on: nothing (Wave 1).

Phase 4: Ledger false-eviction instrumentation complex · think hard

Goal: Detect false ledger evictions via the inverted check at _store_state fresh-write. Single instrumentation site, owner-local by construction, surfaced through the existing last_extra_metrics pipe with SUM-mode reduction (from P3).

Why complex: cross-subsystem invariant; ownership semantics enforced by recurrent_encoder.py:1108-1114; must respect the all-to-all-v paths that orchestrate ledger reads/writes across ranks; eviction count must match the existing local counter returned by _evict_lru_overflow at L1195; full-wipe path must clear _recently_evicted so we don't generate spurious false-eviction signal after intentional wipes.

Maps to FRDs: FR10, FR11, FR12.

Changes Required:

Success Criteria:

Depends on: P3 (needs sum_keys kwarg in reduce_loss_metrics).

Phase 5: stream_mixing unit tests standard

Goal: Lock in the stream_mixing metric contract with focused unit tests. Lean coverage per the user's testing preference — fewer high-value tests, not exhaustive accessor coverage.

Maps to FRDs: FR14.

Changes Required:

Success Criteria:

Depends on: P2 (needs aggregate_step_metrics to test).

Phase 6: Ledger false-eviction tests standard

Goal: Lock in the ledger false-eviction join contract and the SUM-mode round-trip for the cluster reduction.

Maps to FRDs: FR15.

Changes Required:

Success Criteria:

Depends on: P4 (needs ledger instrumentation), P3 (needs SUM-mode kwarg).

Phase 7: Diagnostic smoke configs trivial

Goal: Three deliberately-misconfigured TOMLs that drive specific metrics out of healthy range. Pure config files; no Python edits.

Maps to FRDs: FR17.

Changes Required:

Success Criteria:

Depends on: nothing for write; Wave 4 (P8) for actual run verification.

Phase 8: Smoke verification + wandb panel capture trivial

Goal: Run the healthy smoke + 3 diagnostic smokes; verify metrics in wandb; attach panel screenshots to the PR. No code change — this is the empirical verification gate.

Maps to FRDs: FR16 (healthy), FR17 (diagnostic — run side).

Changes Required (operational, no code):

Success Criteria:

Depends on: P5 + P6 (code lands), P7 (smoke configs exist).

FR → Phase mapping

FRDescription (short)Phase
FR1emit 7 stream_mixing scalars at pick siteP1
FR2track _last_picked_step per readerP1
FR3drain_step_metrics() APIP1
FR4aggregate_step_metrics() aggregatorP2
FR5trainer drain + aggregate + merge into extra_metricsP2
FR6_modality_for_reader helper extractionP1
FR7remove debug knob + log line + 3 testsP1
FR8per-step overhead < 50 µsP1 + P2 (verified manually)
FR9resume-reshard INFO logP1
FR10_recently_evicted ring + populate at 2 eviction sites + clear on full-wipeP4
FR11instrument _store_state fresh-write checkP4
FR12surface 4 ledger counters via last_extra_metricsP4
FR13reduce_loss_metrics SUM-mode kwargP3
FR14stream_mixing unit testsP5
FR15ledger false-eviction + SUM-mode round-trip testsP6
FR16healthy smoke run + wandb verificationP8
FR173 diagnostic smoke configs + verification runsP7 (configs) + P8 (runs)

Every FR maps to ≥ 1 phase; FR8 (cost budget) is enforced manually during P1 and P2 via a microbench in a REPL session — not a separate phase since it's a non-functional acceptance check, not a code deliverable.

Testing Strategy

Unit-test layout: two new test files (P5, P6) plus a small addition to tests/unit_tests/distributed/test_utils.py for the SUM-mode round-trip (part of P6).

Coverage discipline: per the user's "lean but comprehensive" preference, each test covers one specific contract — no accessor sprawl, no "test that the field exists" trivia. Mutation-sanity checked at the end of each test phase by deliberately breaking one line of the corresponding source phase and confirming the right test fails.

Distributed paths:

Integration verification: P8 is the end-to-end gate. A single-rank local-test (via scripts/test-local) during P2 and P4 catches the "metrics appear in the dict" failure mode; the multi-rank smoke catches "metrics appear in wandb with cluster-correct values."

Operational notes

Cost-budget verification protocol (FR8)

During P2, run a microbench in a REPL: instantiate the dataset, drain 1000 batches in a tight loop, measure wall time. With and without the new metric population. Delta should be < 50 µs/step at K=4. Record in the PR description. Acceptance: the delta is < 50 µs; if not, profile and revisit before P4.

Concurrent-edit safety

Wave 1's three phases all edit different files (dataset.py / factory.py / job_config.py for P1; distributed/utils.py for P3; new TOML files for P7). No conflict surface — safe to land independently. Wave 2's P2 and P4 also touch disjoint files (dataset.py + train.py:842-862 for P2; recurrent_encoder.py + loss.py + train.py:821 for P4). The two train.py edits live at different lines (L842 vs L821) so even within train.py there's no overlap.

Commit boundaries

Each phase = ≥ 1 atomic commit on the worktree branch. Suggested commits: