# DeepSeek-V4-Flash Python Pipeline — Open-Source Reference

**Status:** Working reference. Produces the correct top-1 answer (`Paris`) with a logit margin of 11.1 over the runner-up (`France`).

This document captures the working Python inference pipeline for DeepSeek-V4-Flash (284B / 13B-active MoE, FP4+FP8 mixed precision, 1M-token context) on a single RTX 3060 Ti via WSL2. It is intended for open-source release and for handoff to another agent.

---

## 1. What it does

End-to-end inference: tokenize chat prompt → run all 43 layers → produce next-token logits.

**Verified correctness** (2026-05-31):
- Prompt: `"What is the capital of France? Answer in one word."` (wrapped via DeepSeek-V4 chat template)
- Top-10 tokens by logit (token id, logit): `Paris` (51119, +40.75), `France` (51725, +29.62), `The` (671, +28.75), `Berlin` (93689, +28.50), `Capital` (59201, +28.25), followed by `London`, `**`, `Nice`, `L`, `PAR`
- Paris ranks #1 with +11.13 logit margin

**Resource profile on RTX 3060 Ti / 32 GB RAM / WSL2:**
- Materialize backbone to GPU: ~327 s (one-time)
- Single forward (16-token chat prompt, 43 layers): ~324 s
- Memory: ~14 GB system RAM, ~7 GB VRAM

---

## 2. Files & roles

All located in `/mnt/c/Users/pc/Desktop/folder/` (project root):

| File | Purpose |
|---|---|
| `flash_mvp.py` | Main inference entry: loads weights, runs forward, prints predicted token |
| `flash_mvp_chat.py` | Chat-template wrapper — applies DeepSeek-V4 chat template via `encoding_dsv4` before calling `flash_mvp.main` |
| `kernel_pytorch.py` | Pure-PyTorch replacements for tilelang kernels — `fp8_gemm`, `fp4_gemm`, `act_quant`, `fp4_act_quant`, `sparse_attn`, `hc_split_sinkhorn`. Registers itself as `sys.modules["kernel"]` |
| `fht_fallback.py` | Hadamard transform fallback for FP4 quantization path |
| `kernel_native_hook.py` | (Optional) wire-points for native CUDA kernels |
| `dump_ref_v4.py` | Monkey-patched forward that saves per-layer activations to `/tmp/ref_dump/` for numerical validation |
| `bpe_tokenizer.cpp`/`.h` | Pure C++ BPE tokenizer that matches HuggingFace PreTrainedTokenizerFast bit-for-bit (74/74 parity tests) |

**Model directory** (not in this repo — required separately):
- `/mnt/d/v4_original/` (or any path) — official DeepSeek-V4-Flash checkpoint from HuggingFace
- Contents: `config.json`, `generation_config.json`, 46× `model-XXXXX-of-00046.safetensors` (~152 GB total), `inference/model.py`, `inference/convert.py`, `inference/kernel.py`, `encoding/encoding_dsv4.py`

---

## 3. How to run (Paris-correctness end-to-end)

### Setup
```bash
# Python ≥ 3.10, torch with CUDA, transformers, safetensors, numpy
pip install torch transformers safetensors numpy

# Set model dir (with safetensors shards)
export V4_MODEL=/mnt/d/v4_original
```

### Minimal Paris test
```bash
cd /mnt/c/Users/pc/Desktop/folder
python3 flash_mvp_chat.py \
    --ckpt "$V4_MODEL" \
    --user "What is the capital of France? Answer in one word." \
    --mode chat \
    --max-seq 64
```
Expected: first predicted token id 51119 (`'Paris'`), logit ~40.75.

### Reference activation dump (for cross-validating ports / native runtime)
```bash
REF_N_LAYERS=43 python3 dump_ref_v4.py
```
Writes `/tmp/ref_dump/`:
- `final_logits_full.npy` (1, 129280) — full vocab logits
- `final_logits_top10_indices.npy` + `_values.npy`
- 18 layer-0 activation files (`L0_Block_*`, `L0_Attention_*`, `L0_MoE_*`, `L0_Gate_*`, `L0_RMSNorm_*`)
- Per-checkpoint-layer inputs: `L{1,2,3,5,7,10,20,30,42}_layer_input.npy`
- `final_pre_head.npy`

Configuration knobs:
- `REF_N_LAYERS` (environment variable): number of Block layers to run (default 1 for fast iteration, **must be 43 for correctness**)
- `CHECKPOINT_LAYERS` (constant inside the script, edit it directly): which layer indices to dump inputs at
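
Once the dump exists, the top-1 check can be reproduced from `final_logits_full.npy` alone. The following is a minimal sketch, assuming NumPy and the `(1, 129280)` shape listed above; `top_k_logits` and `top1_margin` are hypothetical helper names and are not part of `dump_ref_v4.py`.

```python
import numpy as np

def top_k_logits(logits, k=10):
    """Return [(token_id, logit), ...] for the k largest logits, best first."""
    flat = np.asarray(logits, dtype=np.float32).reshape(-1)
    idx = np.argpartition(flat, -k)[-k:]        # unordered indices of the top k
    idx = idx[np.argsort(flat[idx])[::-1]]      # order them by descending logit
    return [(int(i), float(flat[i])) for i in idx]

def top1_margin(logits):
    """Logit gap between the best and the second-best token."""
    (_, first), (_, second) = top_k_logits(logits, k=2)
    return first - second

# Usage against the real dump (path taken from this section):
#   logits = np.load("/tmp/ref_dump/final_logits_full.npy")
#   print(top_k_logits(logits)[0], top1_margin(logits))
```

For the test prompt, Section 1 reports token id 51119 as the top entry.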

---

## 4. Architecture nuances that matter

These are subtleties learned during native-port debugging. Anyone reimplementing V4-Flash should know them.

### 4.1 Chat template (critical — model gives garbage without it)
Wrap user input as:
```
<｜begin▁of▁sentence｜><｜User｜>{user_message}<｜Assistant｜></think>
```
The trailing `</think>` signals non-thinking mode → model answers directly. Available via `encoding_dsv4.encode_messages(messages, thinking_mode="chat")`.
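
For ports that cannot import `encoding_dsv4`, the single-turn layout above can be assembled by hand. This is only an illustrative sketch: `build_chat_prompt` is a hypothetical helper, it covers just the one-message non-thinking case shown here, and the tokenizer must still map each special marker to a single token id.

```python
# Special markers copied verbatim from the template above.
BOS = "<｜begin▁of▁sentence｜>"
USER = "<｜User｜>"
ASSISTANT = "<｜Assistant｜>"
END_THINK = "</think>"

def build_chat_prompt(user_message: str) -> str:
    """Hand-rolled single-turn, non-thinking prompt (illustration only)."""
    return f"{BOS}{USER}{user_message}{ASSISTANT}{END_THINK}"

prompt = build_chat_prompt("What is the capital of France? Answer in one word.")
```

Prefer `encoding_dsv4.encode_messages` whenever it is available, since it is the reference implementation of the template.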

### 4.2 First 3 MoE layers use Hash routing (not gate top-k)
- `n_hash_layers = 3` → L=0, 1, 2 route via `tid2eid[input_id]` lookup table
- Subsequent layers use gate top-k with `sqrt(softplus(x @ W^T))` scoring + bias adjustment (noaux_tc)
- **Even hash-routed layers compute their weights from `original_scores = sqrt(softplus(x @ W^T))`**: gather the scores at the `tid2eid` indices, normalize them, then multiply by `routed_scaling_factor = 1.5`. A naive implementation that uses uniform `1/top_k` weights produces wrong magnitudes.
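
The hash-routed weight computation described above can be sketched in NumPy as follows. The function name, argument names, and tensor layouts (`gate_w` as `[n_experts, hidden]`, `tid2eid` as `[vocab, top_k]`) are assumptions for illustration; `inference/model.py` remains the source of truth.

```python
import numpy as np

def hash_route_weights(x, gate_w, tid2eid, input_ids, routed_scaling_factor=1.5):
    """Weights for hash-routed MoE layers (sketch).

    x:         [T, hidden]        token activations
    gate_w:    [n_experts, hidden] gate weight (assumed layout)
    tid2eid:   [vocab, top_k]     token id -> expert ids lookup table
    input_ids: [T]                token ids of the same T tokens
    """
    # Scores are still computed from the gate, even though indices are hashed.
    scores = np.sqrt(np.logaddexp(0.0, x @ gate_w.T))    # sqrt(softplus(.)), [T, E]
    expert_ids = tid2eid[input_ids]                       # [T, top_k]
    w = np.take_along_axis(scores, expert_ids, axis=-1)   # gather at hashed experts
    w = w / w.sum(axis=-1, keepdims=True)                 # normalize per token
    return expert_ids, w * routed_scaling_factor
```

Each row of the returned weights sums to `routed_scaling_factor`, but the individual entries are generally not uniform, which is why the `1/top_k` shortcut gives wrong magnitudes.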

### 4.3 KV cache: separate per-layer slabs with different sizes
- `compress_ratios[L]` per layer: `[0, 0, 4, 128, 4, 128, …, 4, 0]` (43 entries)
- ratio=0: pure Sliding Window Attention (SWA), 128-token window, raw KV cache
- ratio=4: Compressed Sparse Attention (CSA), KV cache size `max_seq/4`
- ratio=128: Heavily Compressed Attention (HCA), KV cache size `max_seq/128`
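
To make the slab sizing concrete, here is a small sketch that maps a compression ratio to a cache length. It assumes floor division for the compressed layers and that a ratio-0 layer keeps exactly one 128-token window; `kv_cache_len` is a hypothetical helper, not a function from the pipeline.

```python
def kv_cache_len(ratio: int, max_seq: int, window: int = 128) -> int:
    """Number of KV entries a layer's slab holds (sketch)."""
    if ratio == 0:
        return window            # SWA: raw KV for the sliding window only
    return max_seq // ratio      # ASSUMPTION: floor division for CSA / HCA

# Illustrative prefix only, not the full 43-entry compress_ratios list.
ratios = [0, 0, 4, 128, 4, 128]
sizes = [kv_cache_len(r, max_seq=4096) for r in ratios]
# sizes == [128, 128, 1024, 32, 1024, 32]
```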

### 4.4 Compressor (CSA/HCA) — overlap split semantics
For CSA (m=4 with overlap):
- `wkv` weight shape `[2*head_dim=1024, hidden=4096]` BF16
- First 512 rows = "a-portion" (used for PREVIOUS chunk's tokens in the overlap window)
- Last 512 rows = "b-portion" (used for CURRENT chunk's tokens)
- Same split for `wgate`
- `ape` shape `[m=4, 2*head_dim=1024]`: split same way into `B_a` + `B_b`

For HCA (m=128, no overlap): single weight (no a/b split).
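
The a/b split can be expressed as plain slicing. The sketch below assumes the row-wise layout stated above for `wkv` and `wgate`, and assumes that `ape` is split along its last axis (the one of size `2*head_dim`); the helper names are hypothetical.

```python
import numpy as np

HEAD_DIM = 512

def split_overlap_weight(weight, head_dim=HEAD_DIM):
    """Split a CSA [2*head_dim, hidden] weight into (a_portion, b_portion).

    a_portion (first rows) applies to the PREVIOUS chunk's tokens,
    b_portion (last rows) applies to the CURRENT chunk's tokens.
    """
    assert weight.shape[0] == 2 * head_dim
    return weight[:head_dim], weight[head_dim:]

def split_overlap_ape(ape, head_dim=HEAD_DIM):
    """Split ape [m, 2*head_dim] into (B_a, B_b).

    ASSUMPTION: the split runs along the last axis.
    """
    assert ape.shape[-1] == 2 * head_dim
    return ape[:, :head_dim], ape[:, head_dim:]
```

HCA layers skip both helpers because their weights have no overlap portion.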

After compressor:
1. RMSNorm with `compressor.norm.weight`
2. RoPE on last 64 dims at positions `[0, m, 2m, 3m, …]` (= `freqs_cis[:cutoff:ratio]`)
3. FP8 round-trip simulation on first `head_dim - rope_dim = 448` dims (`act_quant(kv[..., :-rd], 64, …, True)`)
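
The indexing in steps 2 and 3 is easy to get off by one, so here is a short sketch of just the index arithmetic. The helper names are hypothetical; the numbers (`head_dim = 512`, `rope_dim = 64`) follow from the 448 figure above.

```python
def compressed_positions(cutoff: int, ratio: int) -> list[int]:
    """Positions whose RoPE frequencies are used, mirroring freqs_cis[:cutoff:ratio]."""
    return list(range(0, cutoff, ratio))

def nope_rope_slices(head_dim: int = 512, rope_dim: int = 64):
    """Slices for the FP8-simulated dims and the RoPE dims of a compressed KV vector."""
    nope = slice(0, head_dim - rope_dim)          # first 448 dims: FP8 round-trip
    rope = slice(head_dim - rope_dim, head_dim)   # last 64 dims: RoPE applied
    return nope, rope

positions = compressed_positions(cutoff=16, ratio=4)
# positions == [0, 4, 8, 12]
```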

### 4.5 Attention sink + per-head Q RMS + inverse RoPE on output
- Each layer has `attn_sink` learnable [n_heads] FP32 — added to softmax denominator (paper §2.3.3)
- After `wq_a` → `q_norm` → `wq_b`: apply per-head RMS (no learned weight): `q *= rsqrt(q.square().mean(-1, keepdim=True) + eps)` (the `keepdim=True` is needed so the per-head statistic broadcasts over `head_dim`)
- After attention, BEFORE grouped output projection: **apply INVERSE RoPE** to last 64 dims of attention output (paper §2.3.3 "Partial RoPE"): `apply_rotary_emb(o[..., -rd:], freqs_cis, inverse=True)`
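
The three pieces above can be sketched in NumPy as shown below. Several details are assumptions made for illustration and should be checked against `inference/model.py`: that the sink contributes `exp(attn_sink)` to the denominator only, the value of `eps`, and that RoPE pairs adjacent (even, odd) dims into one complex number.

```python
import numpy as np

def softmax_with_sink(scores, sink):
    """scores: [n_heads, S_q, S_k]; sink: [n_heads].
    ASSUMPTION: the sink adds exp(sink) to the denominator and has no value vector."""
    s = sink[:, None, None]
    m = np.maximum(scores.max(-1, keepdims=True), s)   # shift for numerical stability
    e = np.exp(scores - m)
    return e / (e.sum(-1, keepdims=True) + np.exp(s - m))

def per_head_rms(q, eps=1e-6):
    """q: [..., n_heads, head_dim]; weightless RMS normalization over head_dim."""
    return q / np.sqrt(np.mean(q * q, axis=-1, keepdims=True) + eps)

def rotary(x, freqs_cis, inverse=False):
    """x: [S, rope_dim] real; freqs_cis: [S, rope_dim // 2] complex with unit modulus.
    ASSUMPTION: adjacent (even, odd) dims form one complex pair."""
    xc = x[..., 0::2] + 1j * x[..., 1::2]
    yc = xc * (np.conj(freqs_cis) if inverse else freqs_cis)  # conjugate undoes rotation
    out = np.empty_like(x)
    out[..., 0::2], out[..., 1::2] = yc.real, yc.imag
    return out
```

With a sink present, each attention row sums to less than 1, and `rotary(..., inverse=True)` exactly undoes a forward `rotary` call at the same positions.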

### 4.6 mHC (Manifold-Constrained Hyper-Connections) per layer
- Residual stream has `n_hc = 4` channels (`[B, S, 4, hidden]`)
- Each block has TWO hc_pre/hc_post mixings — one around attention, one around MoE
- `A_l = sigmoid(Ã_l)` → input mapping `[1, hc]` ∈ [0, 1]
- `C_l = 2 * sigmoid(C̃_l)` → output mapping `[hc, 1]` ∈ [0, 2] (note the `2 *` factor)
- `B_l` = Sinkhorn-Knopp projected doubly-stochastic matrix `[hc, hc]`, 20 iterations
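
A compact sketch of the three mappings follows. It assumes the Sinkhorn-Knopp projection starts from `exp(logits)` and alternates row and column normalization; the real `hc_split_sinkhorn` kernel may differ in initialization and ordering, and `mhc_maps` is a hypothetical name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sinkhorn_knopp(logits, n_iters=20):
    """Project an [hc, hc] matrix towards the doubly-stochastic set.
    ASSUMPTION: start from exp(logits), then alternate row / column normalization."""
    b = np.exp(logits - logits.max())              # positive entries, overflow-safe
    for _ in range(n_iters):
        b = b / b.sum(axis=1, keepdims=True)       # rows sum to 1
        b = b / b.sum(axis=0, keepdims=True)       # columns sum to 1
    return b

def mhc_maps(a_raw, b_raw, c_raw):
    """a_raw: [1, hc], b_raw: [hc, hc], c_raw: [hc, 1] (unconstrained parameters)."""
    a = sigmoid(a_raw)             # input mapping, values in [0, 1]
    c = 2.0 * sigmoid(c_raw)       # output mapping, values in [0, 2]
    b = sinkhorn_knopp(b_raw)      # residual mixing, approximately doubly stochastic
    return a, b, c
```

Dropping the factor 2 in the output mapping halves its maximum value, which is an easy mistake to make when porting.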

### 4.7 SwiGLU clamping (asymmetric)
- Linear (up) component: clamp `[-swiglu_limit, swiglu_limit]` = `[-10, 10]`
- Gate component: cap **upper only** at `swiglu_limit = 10` (no lower cap)
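
The asymmetry is easiest to see in code. This is a minimal NumPy sketch, assuming the standard SwiGLU form `silu(gate) * up`; `clamped_swiglu` is a hypothetical name.

```python
import numpy as np

def clamped_swiglu(gate, up, limit=10.0):
    """SwiGLU with asymmetric clamping (sketch)."""
    gate = np.minimum(gate, limit)           # gate: upper cap only, no lower cap
    up = np.clip(up, -limit, limit)          # linear (up) branch: symmetric clamp
    silu = gate / (1.0 + np.exp(-gate))      # SiLU(gate) = gate * sigmoid(gate)
    return silu * up
```

Clamping the gate symmetrically would change results for strongly negative gate values, where SiLU is small but not identical across inputs.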

### 4.8 act_quant convention
Two distinct uses of `act_quant`:
- **For GEMM input** (default): returns `(y_bf16_quantized_raw, scale_fp32)`. `y` is the raw FP8 grid value cast to BF16. Subsequent GEMM applies `scale_a * scale_b` internally.
- **For inplace simulation** (`inplace=True`): writes back `y * scale`, the dequantized round-trip reconstruction. Used by the Compressor on non-RoPE dims.

The two outputs differ by exactly the `scale` factor, so they are not interchangeable. Feeding one where the other is expected causes a silent ~1000× underflow in MoE.
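
The two return conventions can be illustrated with a simplified block-wise quantizer. This sketch is not the kernel: `np.round` on an integer grid stands in for real FP8 E4M3 rounding, the block size of 64 and the `amax` floor are assumptions, and the function returns the reconstruction instead of writing in place.

```python
import numpy as np

FP8_MAX = 448.0  # largest finite value of FP8 E4M3

def act_quant_sketch(x, block_size=64, inplace=False):
    """x: [..., D] with D divisible by block_size.
    ASSUMPTION: np.round stands in for FP8 rounding; only the scaling logic is faithful."""
    blocks = x.reshape(*x.shape[:-1], -1, block_size)
    amax = np.abs(blocks).max(-1, keepdims=True)
    scale = np.maximum(amax, 1e-4) / FP8_MAX          # one scale per block
    y = np.clip(np.round(blocks / scale), -FP8_MAX, FP8_MAX)
    if inplace:
        # Round-trip convention: values are back in the original range.
        return (y * scale).reshape(x.shape)
    # GEMM-input convention: raw grid values plus scales, rescaled later by the GEMM.
    return y.reshape(x.shape), scale.squeeze(-1)
```

Passing the raw `y` to code that expects the round-trip output leaves activations off by a factor of `1 / scale` per block, and the reverse mistake applies the scale twice.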

---

## 5. Per-layer activation reference (dump_ref_v4.py output)

Layer-0 detailed dumps (18 `.npy` files as listed below, each `float32`):

| File | Shape | Meaning |
|---|---|---|
| `L0_Block_00_input.npy` | (1, 13, 4, 4096) | HC-stacked input to L=0 |
| `L0_Block_00_output.npy` | (1, 13, 4, 4096) | HC-stacked output of L=0 |
| `L0_RMSNorm_00_input.npy` | (1, 13, 4096) | Input to attn_norm (= HC-collapsed via hc_pre) |
| `L0_RMSNorm_00_output.npy` | (1, 13, 4096) | Output of attn_norm |
| `L0_Attention_00_input.npy` | (1, 13, 4096) | Same as RMSNorm_00_output |
| `L0_Attention_00_output.npy` | (1, 13, 4096) | Attention sub-block output |
| `L0_RMSNorm_01_*` | (1, 13, 1024) | q_norm (in/out) |
| `L0_RMSNorm_02_*` | (1, 13, 512) | kv_norm (in/out) |
| `L0_RMSNorm_03_*` | (1, 13, 4096) | ffn_norm (in/out) |
| `L0_Gate_00/01_*` | (13, 6) | Top-k gate weights (hash routing returns indices+weights twice — once for moe_batched, once for shared dispatch) |
| `L0_MoE_00_input.npy` | (1, 13, 4096) | MoE input |
| `L0_MoE_00_output.npy` | (1, 13, 4096) | MoE output |

Deeper-layer dumps: `L{1,2,3,5,7,10,20,30,42}_layer_input.npy` — flat 262144 = `1*16*4*4096` (Block input as `[B, S, HC, H]`).

Final outputs: `final_pre_head.npy` (4096 values, the hidden state before the output head) and `final_logits_full.npy` (129280 vocab logits, stored as `(1, 129280)`).
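
Because the deeper-layer dumps are stored flat, they need reshaping before comparison. A minimal sketch, assuming the `[B, S, HC, H]` layout given above with `B = 1`, `HC = 4`, `H = 4096`; `unflatten_layer_input` is a hypothetical helper.

```python
import numpy as np

def unflatten_layer_input(flat, n_hc=4, hidden=4096, batch=1):
    """Reshape a flat layer-input dump to [B, S, HC, H], inferring S from the size."""
    flat = np.asarray(flat).reshape(-1)
    seq, remainder = divmod(flat.size, batch * n_hc * hidden)
    if remainder:
        raise ValueError(f"size {flat.size} is not a multiple of {batch * n_hc * hidden}")
    return flat.reshape(batch, seq, n_hc, hidden)

# Usage against a real dump (file name pattern from this section):
#   x = unflatten_layer_input(np.load("/tmp/ref_dump/L1_layer_input.npy"))
```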

---

## 6. Numerical validation framework

`compare_dumps.py` cross-checks the C++ runtime (or any port) against this Python reference.

```bash
python3 compare_dumps.py
```

For each checkpoint, computes:
- Cosine similarity (sliced/reshaped to match)
- RMS ratio (`cpp_rms / ref_rms`)
- Max abs diff

It flags a checkpoint as `DIVERGED` if the cosine similarity is below 0.95 or the RMS ratio falls outside [0.7, 1.4]. It then reports the first divergence point and prints a binary-search hint for where to add intermediate checkpoints.

Finally, it compares the end-of-network top-5 logits and explicitly checks for `Paris`.
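
The three metrics and the divergence rule can be sketched as follows. This is an illustrative reimplementation using the thresholds stated above, not the code of `compare_dumps.py`; the function name and the returned dictionary keys are hypothetical.

```python
import numpy as np

def compare_checkpoint(ref, port, cos_min=0.95, ratio_range=(0.7, 1.4)):
    """Compare one activation dump from a port against the Python reference."""
    ref = np.asarray(ref, dtype=np.float64).reshape(-1)
    port = np.asarray(port, dtype=np.float64).reshape(-1)
    tiny = 1e-12                                   # guards against division by zero
    cos = float(ref @ port / (np.linalg.norm(ref) * np.linalg.norm(port) + tiny))
    rms_ratio = float(np.sqrt(np.mean(port ** 2)) / (np.sqrt(np.mean(ref ** 2)) + tiny))
    max_abs_diff = float(np.abs(ref - port).max())
    ok = cos >= cos_min and ratio_range[0] <= rms_ratio <= ratio_range[1]
    return {
        "cos": cos,
        "rms_ratio": rms_ratio,
        "max_abs_diff": max_abs_diff,
        "status": "OK" if ok else "DIVERGED",
    }
```

Cosine similarity catches direction errors such as a wrong permutation or sign, while the RMS ratio catches pure scaling errors that leave the cosine at 1.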

---

## 7. Known-correct token sequence for the test prompt

For `"What is the capital of France? Answer in one word."`:

```
Prompt tokenized (16 tokens):
[0=BOS, 128803=<User>, 3085=What, 344=' is', 270=' the',
 6102=' capital', 294=' of', 8760=' France', 33='?',
 9361=' Answer', 295=' in', 834=' one', 2004=' word', 16='.',
 128804=<Assistant>, 128822=</think>]

First predicted next-token: 51119 = 'Paris'  (logit +40.75)
```

---

## 8. What's NOT in this Python pipeline (deferred to native runtime)

- **Speed**: Python forward = ~324 s for 1 token. Not for production. The native C++/CUDA runtime in development targets 10 tok/sec.
- **CUDA Graph capture**: Python re-runs all kernels per token.
- **FP4 expert pool / managed memory caching**: Python lazy-loads weights per forward, no expert resident cache.
- **Custom indexer kernel**: Python uses `get_compress_topk_idxs` (rule-based) or full sparse_attn lookup; the Lightning Indexer FP4 gate path is not exercised at short context.

---

## 9. License / attribution

- DeepSeek-V4-Flash model: MIT license, © DeepSeek-AI 2026
- Surgery harness, dump scripts, comparison framework: this repository, MIT recommended

For citation: `DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence` (DeepSeek-AI, 2026), `DeepSeek_V4.pdf` in model dir.

---

## 10. Handoff checklist for another agent

When picking up this work, an agent should:

1. ✅ Verify Python ref produces Paris top-1 — `python3 flash_mvp_chat.py --ckpt $V4_MODEL --user "What is the capital of France? Answer in one word." --mode chat`
2. ✅ Regenerate ref activations — `REF_N_LAYERS=43 python3 dump_ref_v4.py` (≈11 min on 3060 Ti)
3. ✅ Read this document + `V4_FLASH_TECH_BRIEF.md` (in doctrine-pack)
4. ✅ Read `/mnt/d/v4_original/inference/model.py` end-to-end — single source of truth for architecture
5. ✅ Read `DeepSeek_V4.pdf` Sections 2.1 (MoE), 2.2 (mHC), 2.3 (Hybrid Attention CSA+HCA)
6. ⏭️ Use `compare_dumps.py` as test oracle for any port