# V4-Flash on RTX 3060 Ti — Full Technical Brief
*Snapshot: 2026-04-25, by Claude Opus 4.7. Hand this to GPT-5.5 for optimization advice.*

---

## 0. TL;DR

DeepSeek-V4-Flash (154B params, 43-layer MoE, FP4 routed experts + FP8 backbone) is **running on a single RTX 3060 Ti** (sm_86, 8 GB VRAM) inside WSL2 with 13 GB RAM + 80 GB swap. Pure-PyTorch fallback for FP4/FP8 GEMM kernels (sm_86 lacks native FP8 mma.sync). After 19× of optimization on top of the naive baseline:

| Metric | Value |
|---|---|
| Cold prefill (1 prompt token) | 47.3 s |
| Warm decode best | **7.5 s/token** |
| Warm decode median (p50) | 9.6 s/token |
| Warm decode p95 | 27.7 s/token |
| Warm decode avg | 13.8 s/token |
| Steady-state (toks 9–16) | ~9.5 s/token |
| Output | semantically valid (PHP/Laravel scaffolding from `"hi"` prompt) |

Target: 3–4 s/token (per user). Stretch: 1 s/token.

---

## 1. Latest Run — Raw Numbers

**File**: `/tmp/flash43_batchunpack.log` (16 tokens, 43-layer full model, max_seq=64)

```
materialize done       273.4 s  (one-time setup, weight loader walks 46 safetensors shards)
cold forward (prefill) 47.3 s  → token 201 '\n'
warm forward (#2)      32.0 s  → 128818 '<｜end▁of▁file▁name｜>'
tok#3                  27.7 s  → 201 '\n'
tok#4                  18.2 s  → 128819 '<｜begin▁of▁file｜>'
tok#5                  19.8 s  → 201 '\n'
tok#6                  8.4 s   → 13989  '<?'
tok#7                  17.9 s  → 8276   'php'
tok#8                  8.3 s   → 271    '\n\n'
tok#9                  8.6 s   → 10972  'namespace'
tok#10                 11.3 s  → 2824   ' App'
tok#11                 11.1 s  → 27067  '\Http'
tok#12                 8.5 s   → 63083  '\Controllers'
tok#13                 7.5 s   → 62     '\'
tok#14                 10.4 s  → 32796  'Admin'
tok#15                 8.3 s   → 1048   ';\n\n'
tok#16                 9.6 s   → 3103   'use'
```

**Output (decoded)**:
```php
<?php

namespace App\Http\Controllers\Admin;

use
```
Real, syntactically coherent code. Confirms numerical pipeline correct (FP4 dequant + scaling + sparse_attn + HC Sinkhorn all sane).

### System state during/after

| Resource | Total | Used | Free | Notes |
|---|---|---|---|---|
| GPU VRAM | 8 192 MiB | 706 MiB (idle now) | 7 319 MiB | Peak during decode ≈ 6.6 GB (resident params 6.48 GB + scratch ~150 MB) |
| RAM | 13 GiB (WSL2 capped) | 1.0 GiB (idle) | 12 GiB | At runtime ≈ 8–10 GiB used; OS page cache for mmap'd 165 GB virtual |
| Swap | 80 GiB | 0.74 GiB | 79 GiB | Barely touched — mmap pages in/out lazily, not via swap |
| Disk read | — | — | — | Estimate 3–5 GB/token cold, 0.5–1.5 GB/token warm (tracked indirectly) |
| Page faults | — | — | — | Not measured. Likely thousands of minor + tens-hundreds major cold |

**Disk model size**: 148.65 GB across 46 safetensors shards (`model-{N}-of-46.safetensors`).

**Hardware**: RTX 3060 Ti (sm_86 = Ampere, GA104). 8 GB GDDR6 (~448 GB/s). PCIe 4.0 x16 (~32 GB/s theoretical, ~12–18 GB/s real on this WSL2 setup). NVMe SSD (Windows-side, exposed via 9P / DrvFs to WSL — this layer is slow and likely a major bottleneck for cold expert reads).

---

## 2. Architecture (V4-Flash)

From `deepseek_v4_flash/config.json`:

| Field | Value |
|---|---|
| `num_hidden_layers` | 43 |
| `hidden_size` | 4 096 |
| `num_attention_heads` | 64 (Q heads), `num_key_value_heads`=1 (extreme MQA) |
| `head_dim` | 512 |
| `qk_rope_head_dim` | 64 |
| `q_lora_rank` | 1 024 |
| `o_lora_rank` | 1 024, `o_groups` = 8 (grouped low-rank O proj) |
| `n_routed_experts` | 256 |
| `num_experts_per_tok` (top-k) | **6** |
| `n_shared_experts` | 1 |
| `moe_intermediate_size` | 2 048 (per expert FFN inner) |
| `vocab_size` | 129 280 |
| `max_position_embeddings` | 1 048 576 (with YaRN scaling, original 65 536) |
| `sliding_window` | 128 |
| `compress_ratios` | per-layer `[0,0,4,128,4,128,…,4,0]` (HCA — heavily compressed attention) |
| `index_n_heads` | 64, `index_topk` = 512 (CSA — compressed sparse attention) |
| `hc_mult` | 4 (Hyper-Connections multiplicity) |
| `hc_sinkhorn_iters` | 20 |
| `quantization_config.fmt` | `e4m3` FP8, `weight_block_size`=[128,128] |
| `expert_dtype` (in our load) | `float4_e2m1fn_x2` (native FP4) |
| `scale_fmt` | `ue8m0` (in checkpoint), but we run with FP32 scales (CUDA 12.0 nvcc can't construct fp8_e8m0fnu) |
| `scoring_func` | `sqrtsoftplus` + `noaux_tc` topk |

### Component sizes (per layer, then ×43)

| Component | Per-layer | Format | Per-layer bytes | × 43 |
|---|---|---|---|---|
| `attn.wq_a` | [1 024, 4 096] | FP8 e4m3 | 4 MB | 172 MB |
| `attn.wq_b` | [32 768, 1 024] | FP8 | 32 MB | 1 376 MB |
| `attn.wkv` | [512, 4 096] | FP8 | 2 MB | 86 MB |
| `attn.wo_a` | [8 192, 4 096] | FP8 → BF16 (we baked at load via convert.py formula) | 64 MB BF16 | 2 752 MB |
| `attn.wo_b` | [4 096, 8 192] | FP8 | 32 MB | 1 376 MB |
| `attn_norm`,`ffn_norm`,`q_norm`,`kv_norm`,`compressor.norm` | — | BF16 → cast FP32 in model | ~50 KB | ~2 MB |
| `attn.attn_sink` (per-head sink) | [64] | FP32 | 256 B | 11 KB |
| `compressor.wkv` (FP32 weight, only on layers with compress_ratio>0; coff=2 if ratio==4) | [1 024 or 512, 4 096] | FP32 | 16–32 MB | ~1.0–1.3 GB total |
| `compressor.wgate` | same as wkv | FP32 | same | ~1.0–1.3 GB |
| `indexer.wq_b` (only on ratio==4 layers, 21 of 43) | [64×128, 1 024] | BF16 | 16 MB | 336 MB |
| `indexer.weights_proj` | [4 096, 64] | BF16 | 0.5 MB | 11 MB |
| `indexer.compressor.*` | (own Compressor with rotate=True) | FP32 | ~32 MB | ~672 MB |
| HC params per Block: `hc_attn_fn`/`hc_ffn_fn` | [24, 16 384] FP32 each | FP32 | ~1.5 MB | 130 MB |
| `gate.weight` | [256, 4 096] | FP32 | 4 MB | 172 MB |
| Routed expert i.{w1, w3} | [2 048, 4 096] | FP4_e2m1fn_x2 packed = uint8 [2048, 2048] | 4 MB FP4 / 16 MB BF16 | per-layer 256 ×8 MB FP4 = 2 GB |
| Routed expert i.w2 | [4 096, 2 048] | FP4 | 4 MB FP4 | per-layer 1 GB |
| **Total routed (per layer)** | | | **3 GB FP4 / 12 GB BF16** | **132 GB FP4 / 528 GB BF16** |
| Shared expert (1 per layer) | three [2 048,4 096]/[4 096,2 048] FP8 | FP8 | 24 MB | 1 GB |
| Embed | [129 280, 4 096] | BF16 | 1 GB | 1 GB |
| Head | [129 280, 4 096] | BF16 (we cast) | 1 GB | 1 GB |
| Final norm + global HC head | tiny | FP32 | ~50 KB | 50 KB |

**Total backbone (everything except routed FP4 experts)**: ≈ 11.5 GB raw. We trimmed it to 6.48 GB GPU-resident by leaving `wo_a` (2.7 GB BF16) on CPU-mmap and head (1 GB BF16) on CPU-mmap with streaming patches.

### Routing properties (observed empirically)

- `top_k = 6` of 256 experts per token per layer.
- 6 × 43 = **258 unique (layer, expert) calls per token**.
- Hot-expert cache experiment: cap=128, **0.0 % hit rate** across 3 forwards (774 misses, 0 hits). This means consecutive tokens routed to almost entirely different experts. Either: model under-trained for our weights (post-hoc dequant noise), prompt was synthetic, or genuine high routing diversity. **TBD with longer real prompts.**
- Across ≥10 tokens we see SOME steady-state cache benefit from OS page cache (warm tokens drop from ~30s to ~8s), so disk-level reuse exists even if VRAM-level doesn't.

---

## 3. Where Time Goes Now (estimates from coarse `cuda.synchronize()` timing)

From an earlier run (before some optimizations), per-call breakdown (warm decode, summed across 3 forwards, divided):

| Phase | Time / forward (warm) | Notes |
|---|---|---|
| `Block.forward` (sum) | ~50 s | wraps attn + moe + HC |
| `Attention.forward` | ~8 s | 43 calls/forward, **186 ms/call** |
| `MoE.forward` | ~34 s | 43 calls/forward, **792 ms/call** ← dominant |
| HC pre/post + norms + embed + head | ~8 s | tiny ops, mostly Python overhead |

After current optimizations (batched FP4 unpack + GPU-resident backbone + Compressor on GPU), warm steady-state ≈ 8 s/token. Same ratio implies:

| Phase (post-opt estimate) | ms/layer | 43 layers |
|---|---|---|
| Attention | ~30 ms | 1.3 s |
| MoE (routed: FP4 unpack + bmm) | ~120 ms | 5.2 s |
| MoE (shared expert) | ~10 ms | 0.4 s |
| HC + norms + Python overhead | ~25 ms | 1.1 s |
| **Total** | | **~8 s** |

### Inside MoE (per layer, ~120 ms warm)

1. CPU→GPU upload of 18 weight tensors (6 experts × {w1,w3,w2}): ~12 MB FP4 + scales. **~7 ms PCIe** (warm RAM); much higher when SSD-cold.
2. Batched FP4 unpack into BF16: 3 kernel calls, ~144 MB GPU writes. **~3 ms** GPU.
3. Scale apply (per-32 multiplicative broadcast): **~2 ms**.
4. 3 batched bmms: `[6,1,4096] @ [6,4096,2048]` etc. Tiny. **~1 ms** compute, **~5 ms** with launch overhead.
5. SiLU * up (FP32 elementwise): ~1 ms.
6. **Python orchestration overhead** (build expert list, stack, view, transpose): **~30–60 ms** ← biggest single line item.
7. Sync / wait / scatter-add of weighted sum: ~5 ms.
8. Shared expert matmul (3 FP8 GEMMs, GPU resident weights): ~10 ms.

**~50 % of MoE time is Python-side orchestration & PyTorch dispatch overhead, not GPU compute.**

### Inside Attention (per layer, ~30 ms)

1. RMSNorm in/out: 2× ~0.5 ms.
2. wq_a, wq_b, wkv FP8 GEMM (GPU resident): ~3 × 2 ms = 6 ms.
3. apply_rotary_emb on q and kv (rope dim only): ~1 ms.
4. act_quant inplace on KV (FP8 simulate): ~0.5 ms.
5. Compressor.forward: wkv + wgate FP32 GEMM (now GPU resident), score softmax, kv_state update: ~5 ms.
6. Indexer (only on 21 of 43 layers): own Compressor + wq_b BF16 matmul + weights_proj + topk: ~10 ms.
7. sparse_attn (gather kv at topk_idxs + softmax + bmm): ~5 ms.
8. wo_a einsum (BF16, weight CPU-mmap streamed): **~3 ms** (this still streams).
9. wo_b FP8 GEMM (GPU resident): ~2 ms.

### CUDA launch count per token (estimate)

- Attention: ~25 launches/layer
- MoE: ~30 launches/layer (after batching, was ~80 before)
- HC + norms + misc: ~20 launches/layer
- Per token: 43 × ~75 = **~3 200 CUDA launches**.

Each launch ~30–80 µs Python+driver overhead → **0.1–0.25 s/token of pure dispatch overhead**, hard floor for eager-mode PyTorch.

### GPU↔CPU sync points per token (real cost at ms scale)

1. `gate.topk(...)[1]` returns indices on GPU but `.tolist()` for routing — **forces sync** in `MoE.forward`. 43× per token.
2. `torch.bincount(...).tolist()` in original MoE — patched away in batched version, but `indices[0].tolist()` still syncs.
3. `argmax(-1)[0].item()` at the end of forward to extract next token — **mandatory sync** (1 per forward).
4. Inside Compressor: `should_compress` is computed on Python int, no GPU sync, OK.

---

## 4. Current Optimizations (already applied)

### Code components

| File | Lines | Role |
|---|---|---|
| `flash_mvp.py` | 722 | Main entry: meta-init, HF safetensors loader, monkey-patches, forward driver |
| `kernel_pytorch.py` | 215 | Pure-PyTorch replacements for tilelang kernels (`fp8_gemm`, `fp4_gemm`, `act_quant`, `fp4_act_quant`, `sparse_attn`, `hc_split_sinkhorn`). Installs as `sys.modules["kernel"]`. |
| `fht_fallback.py` | 45 | Pure-PyTorch Walsh-Hadamard butterfly. Installs as `fast_hadamard_transform` shim. |

No custom CUDA / C++ in the V4-Flash path. The folder also has `planck_core_v3.so` (Q8 monolith IO), `planck_cuda.so` (Block-128 Q8 dequant), `xnor_matmul.so` (1-bit matmul) — **NONE of these are wired into V4-Flash** (they were for V2-Lite mycelium).

### What is GPU-resident (~6.48 GB)

- Embed (BF16, 1 GB)
- All 43 layers' attention FP8 weights: wq_a, wq_b, wkv, wo_b (~5.7 GB)
- All 43 layers' shared expert FP8 weights (~1 GB)
- All 43 layers' Compressor wkv/wgate **FP32** (~1.3 GB) ← biggest recent move
- All 43 layers' Indexer wq_b + weights_proj (~360 MB)
- Gate weights × 43 (FP32, ~170 MB)
- All RMSNorm weights (FP32 cast, ~2 MB)
- HC params (FP32, ~130 MB)
- attn_sink, biases, ape, tid2eid (small)
- KV caches, freqs_cis, kv_state, score_state buffers (allocated empty/zero on cuda)

### What stays CPU-mmap (~165 GB virtual)

- Routed expert weights (256 × 43 × 12 MB FP4 = 132 GB) + scales
- `attn.wo_a` BF16 baked at load (2.7 GB) — used via direct `einsum`, streaming patch uploads on demand
- `head.weight` BF16 (1 GB) — streaming patch in `ParallelHead.get_logits`
- Indexer's own Compressor weights (~672 MB FP32)? — TBD whether we caught these
- All scales for FP4 experts (~1 GB total)

### Patches (monkey-patch, no edits to upstream `model.py`)

1. `model.linear()` → wrapper that uploads CPU weights to GPU on demand before dispatch.
2. `Attention.forward` → wraps to upload `wo_a.weight` to GPU before einsum, restore CPU ref after.
3. `ParallelHead.get_logits` → uploads head weight, runs BF16 matmul (cast x to BF16), returns FP32 logits.
4. `MoE.forward` → batched bmm path for decode (B=1):
   - Stack 6 experts' weights into [k, out, in//2] FP4 tensor
   - Single CPU→GPU `.to()` per weight type
   - Single batched FP4 unpack (one kernel call, dequants 6 weights at once)
   - Batched scale apply via reshape+broadcast (lazy, no full expand)
   - 3 `torch.bmm` calls (w1, w3, w2)
5. Buffer migration: meta-init → real cuda buffers (freqs_cis recomputed, kv_caches/states zeroed).
6. dtype matching: cast Compressor wkv/wgate from BF16 ckpt to FP32 expected by model.
7. FP4 expert weight: HF stores as `int8`, we `.view(torch.float4_e2m1fn_x2)` (zero-copy reinterpret).
8. UE8M0 scales (`fp8_e8m0fnu` in checkpoint) → cast to FP32 at load (CUDA 12.0 nvcc bug).
9. `kernel_pytorch.fp8_gemm` and `fp4_gemm`: direct FP8/FP4 → BF16 cast (no FP32 intermediate), lazy view-broadcast scales (no `repeat_interleave` materialization).

### Software stack

- Python 3.12 system interpreter (no venv)
- torch 2.11.0+cu130 (Python prebuilt wheels, **CUDA 13.0 runtime** baked in)
- transformers 5.6.1
- safetensors 0.7.0
- tilelang 0.1.9 (installed via `--break-system-packages`, **bypassed** by our `kernel_pytorch` shim — its kernels won't compile against system nvcc 12.0 due to UE8M0 fp8 type)
- fast_hadamard_transform: **fails to build** (CUDA 12.0 vs torch cu130 mismatch); replaced by pure-PyTorch shim.
- WSL2: Ubuntu 24.04 on Windows 11. RAM cap 14.3 GB (kernel reports 13 GiB usable). NVMe accessed through 9P/DrvFs (slow random read).

---

## 5. Optimization Levers — Estimated Impact

Estimates assume current 8 s/token warm baseline.

| Lever | Est. speedup | Effort | Risk | Notes |
|---|---|---|---|---|
| **Async expert prefetch via cuda streams** | 1.2–1.5× | Low (~50 LOC) | Low | Issue all CPU→GPU expert uploads on `stream_prefetch` while shared expert (or attention next layer's prefix) runs on `stream_compute`. We have a single PCIe lane (~12 GB/s in this setup), so peak throughput limited; but compute-overlap during ~10ms of shared expert is real. |
| **Fused FP4-unpack + GEMM** (custom Triton/CUDA kernel) | 2–3× on MoE | High | Medium | Single kernel reads FP4 packed + scale, dequants in shared mem, runs WMMA BF16 matmul, writes output. Eliminates the `[k, out, in]` BF16 materialization (~144 MB/layer GPU bandwidth). Requires Triton or hand-written CUDA. **Best raw impact lever.** |
| **Persistent BF16 hot-cache in pinned RAM** | 1.1–1.3× | Medium | Low | After first dequant of expert (layer L, expert E), keep the BF16 [out, in] tensor in pinned host RAM. Re-use bypasses FP4 unpack but still pays PCIe upload. Only useful if some experts repeat across tokens — currently 0 % observed, but with longer real prompts (not synthetic "hi") expect 20–40 %. |
| **GPU LRU FP4 cache** (keep recent FP4 packed weights on GPU) | 1.05–1.2× | Medium | Low | 2–3 GB of FP4 (e.g. 200 entries × 12 MB) on GPU. Saves CPU→GPU transfer for hits. Hit-rate-dependent. |
| **Routing predictor / speculative prefetch** | 1.2–1.6× | High | Medium | Train a tiny classifier on prior gate scores → predict next-layer top-K experts → start their PCIe upload early. Could make compute and PCIe fully concurrent. |
| **CUDA Graph capture for decode** | 1.3–1.5× | High | High | Decode forward has identical shape per token, but routed expert IDs vary → graph topology changes per token. Solvable by capturing per (layer, set-of-expert-ids) pattern, but combinatorial. Could capture only the static parts (attention + HC + Compressor + gate) and call routed-experts eager outside the graph. |
| **`torch.compile` on sub-modules** | 1.2–2× on small ops | Medium | High | Triton-fuses HC pre/post, RMSNorm, SiLU, scale-apply. Compatibility risk: our monkey-patches and dtype tricks may confuse Dynamo. |
| **Expert bundling on disk** (one big file with O_DIRECT + io_uring) | 1.5–3× cold, marginal warm | Medium | Low | We already have `planck_io.cpp` (io_uring/O_DIRECT) in folder — used for V2 mycelium. Re-target to V4 routed experts. Eliminates 9P/DrvFs per-shard syscall overhead and enables async batched reads. **Big win for cold tokens; less for warm where OS page cache already hits.** |
| **Page-locked (pinned) routed expert RAM** | 1.1–1.3× | Medium | Medium | We dropped `pin_memory()` because 165 GB mmap can't be page-locked in 13 GB RAM. Selective pinning of an LRU slice (e.g. top-2000 experts × 12 MB = 24 GB) still needs >RAM. May only fit 500 entries (6 GB). Limited. |
| **Move WSL2 RAM cap to 24 GB** (`.wslconfig`) | 1.5–3× warm | Trivial config | Low | Larger OS page cache → far higher routed-expert hit rate. Currently 13 GB cap means OS page cache for the 132 GB FP4 monolith holds < 10 %. With 24 GB → ~18 %. With 50 GB → ~38 %. **Highest ratio of impact-to-effort.** Just edit `C:\Users\<user>\.wslconfig` `memory=24GB`. |
| **Native Linux instead of WSL2** | 1.3–2× | High (reboot, dual-boot or migration) | Low (data-side) | Eliminates 9P filesystem overhead, gives full RAM access, kernel-level page cache, real `O_DIRECT`. Probably +30–50 % throughput on warm decode, much more on cold. |
| **Faster NVMe (Gen5 instead of Gen3/4)** | 1.1–1.3× cold | Hardware $$ | Low | Cold tokens read 3–5 GB. Gen3 NVMe ~3 GB/s seq read; Gen4 ~7; Gen5 ~12. Warm tokens already RAM-bound, no gain. |
| **Skip / fallback-free experts** | 1.3–1.5× quality-degrading | Low | High | Drop top_k from 6 to 4, or skip experts whose routing weight < threshold. Saves 33–50 % of expert work. **Breaks model accuracy.** |
| **Speculative draft model + verifier** | 2–4× tokens/sec | Very high | High | Run a small/fast draft model (e.g. V2-Lite already on disk, 14 GB Q8) → propose K tokens → verify in one V4-Flash forward. If draft acceptance rate >50 %, big win. **Requires careful scaffolding.** |
| **Split V4-Flash across CPU+GPU layers** (e.g. last 10 layers on CPU AVX2 BF16) | 0.8–1.1× (often slower) | High | High | CPU BF16 matmul throughput ~50 GB/s on Zen3; far slower than GPU. Likely regression. |

---

## 6. Top-10 Bottlenecks (impact × tractability × low-risk)

Ranked by `impact / (effort × risk)`:

1. **Increase WSL2 RAM cap to 24 GB** — trivial config edit, +30–50 % on warm.
2. **Fused FP4-unpack + GEMM Triton kernel** — biggest pure-compute win, MoE drops from 5 s → 1.5 s/token.
3. **Async expert prefetch via cuda streams** — quick, hides ~10 ms/layer of PCIe.
4. **Move to native Linux** (or improve WSL2 NVMe path with ext4 vhdx) — kills 9P overhead.
5. **Switch routed-expert mmap to io_uring O_DIRECT** (reuse `planck_io.cpp`) — async batched reads, eliminates per-shard syscall.
6. **Persistent BF16 hot-cache for repeating experts** — depends on real prompts having reuse.
7. **Speculative decoding with V2-Lite draft model** — V2-Lite already runs at 3.4 s/token here; could draft 3–5 tokens for V4 to verify in one forward.
8. **CUDA Graph capture for the static prefix** (attention + HC + Compressor + gate) — eats ~1.5 s of Python overhead.
9. **`torch.compile` on RMSNorm + HC pre/post + scale_apply** — small ops, big launch-count reduction.
10. **Dropping top_k from 6 to 4 if routing weight is concentrated** — quality risk, but easy A/B.

---

## 7. Plan Sketches

### A. → 4 s/token (current 7.5–9 s)

Stack: WSL2 RAM bump + async prefetch + fused FP4 GEMM (just for MoE). Target reachable in ~1–2 days.

1. `.wslconfig: memory=24GB` → reboot WSL → re-run, expect 6 s warm floor (page cache effect alone).
2. CUDA stream overlap in `MoE.forward`: upload all 18 routed weight tensors on `stream_prefetch` while shared expert and gate compute on default stream. Sync at bmm. Expect 5–6 s.
3. Triton fused `fp4_gemm` kernel: `out[m,n] = sum_k unpack(fp4_packed[n,k//2]) * scale[n,k//32] * a_bf16[m,k]`. Replace `fp4_gemm` in `kernel_pytorch.py`. Expect 4 s.

### B. → 1 s/token

Adds: io_uring + persistent BF16 cache + CUDA Graph for static prefix + speculative decode.

4. `planck_io.cpp` rebuilt for routed-expert monolith: pack all FP4 expert weights into one 132 GB file with index, open O_DIRECT, batched async reads via io_uring.
5. Triton-fused FP8 attention path (combine `act_quant + fp8_gemm` into one kernel — eliminates ~6 ms/layer dequant scratch).
6. CUDA Graph capture for attention + HC + gate (everything except routed MoE expert calls). Expect ~1.5 s saved.
7. Speculative decoding: V2-Lite Mycelium (14 GB Q8 monolith, already runs at 3.4 s/token here) drafts 4 tokens; V4-Flash verifies in 1 forward. If 70 % acceptance: effective 1.0 s/token wall-clock.

### C. → 0.1 s/token

Requires hardware change OR major model surgery.

- Real path: rent RTX 4090 / 5090 cloud GPU (sm_89+). Native FP8 mma.sync, full 24/32 GB VRAM → entire backbone+experts fit, no streaming. ~0.1 s/token achievable trivially with their `inference/generate.py`. Cost: $0.30–0.80/hr.
- Local path: distill V4-Flash → smaller student model that fits VRAM entirely. Weeks of work.

### D. The pivot path: 100 000× via "ADAM/Planck distilled runtime"

(This is the user's actual goal — not raw V4 token-per-token, but using V4 as a *teacher* for a sovereign C++17 reasoning engine called ADAM that already runs in this repo.)

The 100 000× factor is achieved by **changing what we generate, not how fast**:

- **ADAM** is a 13–14 k concept graph with Clifford-algebra spinors, toroidal coordinates, HDC/HRR vectors, Physarum dynamics, Merkaba oscillators, GPU quantum clones. It runs at **microsecond response times** for typed-relation lookups, semantic walks, and reflex articulation. (See `ADAM_ARCHITECTURE_v5.md` in the parent `/asi/` repo.)
- The plan is: V4-Flash runs **off the hot path**, in the background, **enriching ADAM's graph** by extracting typed relations from its outputs (this is the "ribosome" pattern in user's memory).
- Online inference flow: user query → ADAM responds in microseconds from its graph → if uncertain, schedule a V4-Flash teacher pass (8 s) async → V4 output is parsed for new typed relations → ADAM's graph is updated → next time the same concept is queried, ADAM has it natively.
- Effective wall-clock per typical user turn: **<1 ms** (graph hit) for 95 %+ of queries; **8 s** when ADAM defers to V4. Average over time: <100 ms.
- That's the 100 000× — by **not running V4 on every token at all.**

Concrete next step for this path:
1. Write `flash_ribosome.py`: takes ADAM's "I don't know X" event → calls `flash_mvp.py.forward()` for ~50 tokens → parses output via existing ADAM relation-extraction code (already in `semantic.cpp` for Groq ribosome) → POSTs new relations to `/learn` endpoint.
2. Run V4 in a separate Python process, IPC via UNIX socket or named pipe (avoid model re-init per call: keep flash_mvp loaded as a daemon, ~5 GB VRAM idle).
3. Measure: how many V4 calls per typical user session? If <10 per minute, 8 s/call is fine.

---

## 8. Open questions for GPT-5.5

Things I'd like a second opinion on, ranked:

1. **Is the 0 % hot-cache hit rate (over 3 forwards on synthetic "hi" prompt) realistic, or a sign our dequant noise is breaking routing?** I.e. would a real prompt show 30 %+ reuse, validating a hot-cache strategy?
2. **For the Triton fused FP4 GEMM**, is there an existing reference (DeepGEMM? CUTLASS?) that handles `e2m1fn_x2` packed input + per-32 e8m0 scale + BF16 output, or do I write from scratch?
3. **Sparse_attn fallback** in `kernel_pytorch.py`: currently I do `einsum('bshd,bskd->bshk')` which materializes the full attention score matrix. Their tilelang version uses online softmax (FlashAttention-style) — is that worth porting in eager PyTorch, or only viable as a fused kernel?
4. **WSL2 `O_DIRECT` on 9P-mounted drvfs**: does it actually bypass the 9P caching layer, or is `O_DIRECT` silently ignored for cross-OS mounts?
5. **For the speculative-decode path**: V2-Lite's vocabulary differs from V4-Flash's tokenizer. Can we draft in a different tokenizer and re-tokenize before V4 verification, or does that defeat the speedup?
6. **Routing-weight thresholding** (drop experts with normalized weight < ε): is there any DeepSeek paper or internal discussion of how lossy this is on coding tasks vs general chat?

---

*End of brief. Source code: `/mnt/c/Users/pc/Desktop/folder/{flash_mvp.py, kernel_pytorch.py, fht_fallback.py}`. Run with `python3 flash_mvp.py --n-layers 43 --max-seq 64 --prompt "hi"`. Memory snapshot: `/home/pc/.claude/projects/-mnt-c-Users-pc-Desktop-asi/memory/project_v4flash_blocked_on_sm86.md`.*
