# GIGACHAD Phase-8A + Phase-8B — Native inference backend

**Date:** 2026-04-27
**Scope:** Native pack format (Phase-8A) + dense Qwen2-arch CPU forward
pass (Phase-8B). First real native generations through both
Physarium-7B-Native and Physarum-0.5B-Organic.

> **Physarium-v1 errata:** the pruned model used in this report came from
> Physarium-v1 magnitude-flow surgery. Read the sparsity claims alongside
> `reports/PHYSARIUM_RESULTS_RECONCILE.md` and `PHYSARIUM_COVERAGE_AUDIT.md`.

## What is real

| # | Deliverable                                           | Status | Evidence                                              |
|---|-------------------------------------------------------|--------|-------------------------------------------------------|
| 1 | `planck7b` pack format (mmap, 4 KB aligned, BF16/FP16)| ✅     | `include/planck7b_pack.h`                             |
| 2 | Pack writer (parses safetensors, reads config.json)   | ✅     | `src/planck7b/planck7b_pack.cpp`                      |
| 3 | Pack reader (mmap)                                    | ✅     | same                                                  |
| 4 | Pack verifier (byte round-trip vs source)             | ✅     | `planck7b_tool verify` 339/339 ok, 15 GB checked      |
| 5 | Physarium-7B-Native pack on disk                      | ✅     | `physarium7b.planck` (15.23 GB, BF16, 28 layers)      |
| 6 | Physarum-0.5B-Organic pack on disk                    | ✅     | `physarum05b.planck` (0.99 GB, FP16, 24 layers, tied) |
| 7 | C++17 forward pass: embed/RMSNorm/QKV+bias/RoPE/GQA/SwiGLU/lm_head | ✅ | `src/physarium7b/physarium7b_runner.cpp`             |
| 8 | Greedy decode loop                                    | ✅     | same                                                  |
| 9 | Integration into `gigachad_native --top-brain-smoke`  | ✅     | `src/main.cpp`                                        |
| 10 | Real native generation, 7B                           | ✅     | "Hello" → " 2018! I hope" (8 tokens)                  |
| 11 | Real native generation, 0.5B                         | ✅     | "Hello" → "²\nA. 100" (8 tokens)                      |

## Phase-8A — pack format

```
[HEADER (4 KB aligned)]
[embed_tokens BF16/FP16, 4 KB aligned]
[final_norm FP32]
[lm_head BF16/FP16  OR  alias of embed when tie_word_embeddings=true]
[layer 0 payload, 4 KB aligned]
  q_w  q_b  k_w  k_b  v_w  v_b  o_w
  gate_w  up_w  down_w
  inp_ln  post_ln
[layer 1 payload]
...
[layer N-1 payload]
```

- Magic `0x504C414E434B3742` ("PLANCK7B")
- Per-tensor entry stores offset, bytes, zero_count, numel, dtype, shape
- Header stores hidden/inter/heads/kv/head_dim/vocab/rope/eps from
  source `config.json`
- `lm_head_tied` flag handles models with `tie_word_embeddings=true`
- FNV-1a 64-bit running checksum across data region
- Zero counts preserved (sparsity tracking after physarium surgery)
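
The checksum, alignment, and zero-count bookkeeping listed above can be sketched as follows. This is a minimal illustration, not the code in `src/planck7b/planck7b_pack.cpp`: the names `fnv1a64_update`, `align_up`, and `count_zero_elems` are hypothetical, the FNV constants are the standard 64-bit offset basis and prime, and counting zeros by 16-bit pattern (treating both +0 and -0 as zero) is an assumption about how `zero_count` is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Standard FNV-1a 64-bit parameters.
constexpr std::uint64_t kFnvOffsetBasis = 0xcbf29ce484222325ULL;
constexpr std::uint64_t kFnvPrime       = 0x100000001b3ULL;

// Assumed alignment for the header and payloads ("4 KB aligned").
constexpr std::uint64_t kPackAlign = 4096;

// Fold `len` bytes into a running FNV-1a hash. Start from kFnvOffsetBasis
// and feed each payload in file order to obtain a running checksum.
inline std::uint64_t fnv1a64_update(std::uint64_t h, const void* data, std::size_t len) {
    const auto* p = static_cast<const unsigned char*>(data);
    for (std::size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= kFnvPrime;
    }
    return h;
}

// Round an offset up to the next multiple of `align` (must be a power of two).
inline std::uint64_t align_up(std::uint64_t off, std::uint64_t align = kPackAlign) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (off + align - 1) & ~(align - 1);
}

// Count zero-valued elements of a BF16/FP16 tensor by bit pattern.
// Both formats encode +0 as 0x0000 and -0 as 0x8000.
inline std::uint64_t count_zero_elems(const std::uint16_t* w, std::size_t numel) {
    std::uint64_t zeros = 0;
    for (std::size_t i = 0; i < numel; ++i) {
        if ((w[i] & 0x7FFFu) == 0) ++zeros;
    }
    return zeros;
}
```

Because the hash is updated incrementally, hashing payloads one after another gives the same result as hashing their concatenation, which is what makes a running checksum over the data region practical.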

## Phase-8A — verified results

### Physarium-7B (BF16, 28 layers, separate lm_head)

| Metric                           | Value                          |
|----------------------------------|--------------------------------|
| Pack size                        | 15,231,977,472 (15.23 GB)      |
| Source size                      | ~15.23 GB (same, BF16 raw)     |
| Compression ratio                | 1.0× (no quantization yet)     |
| Tensor count                     | 339                            |
| Total zeros                      | 1,450,103,690                  |
| Build wall                       | 199.2 s (3 min 19 s)           |
| Round-trip verify                | **339/339 ok, 0 fail**         |
| Bytes verified                   | 15,231,899,648 (≈15.23 GB)     |
| Verify wall                      | 236.5 s                        |

### Physarum-0.5B (FP16, 24 layers, tied embed/lm_head)

| Metric                           | Value                          |
|----------------------------------|--------------------------------|
| Pack size                        | 988,241,408 (0.99 GB)          |
| Source size                      | 943 MiB ≈ 0.99 GB (same, FP16 raw) |
| Tensor count (incl. tied alias)  | 290                            |
| Total zeros                      | 73,807,859 (after surgery)     |
| Build wall                       | 7.9 s                          |

## Phase-8B — runner

### Modules implemented (all CPU, FP32 accumulation)

- BF16/FP16 → FP32 cast (IEEE-correct, subnormals + inf/nan handled)
- Embed lookup (BF16/FP16 row, FP32 destination)
- RMSNorm: `x / sqrt(mean(x²)+eps) * w`
- GEMV with optional FP32 bias, OpenMP-parallel over output rows
- RoPE (split-halves convention to match HF `rotate_half`), per-head, per-position
- KV cache, FP32, contiguous `[layer, head, T, head_dim]`
- GQA scaled-dot-product attention with online softmax, group=`n_q/n_kv`
- SwiGLU: `down(silu(gate) * up)`
- Final RMSNorm + lm_head GEMV + argmax (greedy)
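
Several of the modules above are small enough to sketch directly. The block below is an illustrative sketch under stated assumptions, not the code in `src/physarium7b/physarium7b_runner.cpp`: all function names and signatures are hypothetical, buffers are plain FP32 arrays, and the FP16 cast, GEMV, and KV-cache layout are omitted. It shows the BF16 widening, the RMSNorm formula, split-halves RoPE, the SwiGLU activation, the GQA head mapping, and single-query attention with an online softmax.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// BF16 is the top 16 bits of an IEEE-754 binary32, so widening is a shift.
inline float bf16_to_f32(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i], FP32 accumulation.
inline void rmsnorm(const float* x, const float* w, float eps, std::size_t n, float* y) {
    assert(n > 0);
    float ss = 0.0f;
    for (std::size_t i = 0; i < n; ++i) ss += x[i] * x[i];
    const float inv = 1.0f / std::sqrt(ss / static_cast<float>(n) + eps);
    for (std::size_t i = 0; i < n; ++i) y[i] = x[i] * inv * w[i];
}

// RoPE, split-halves: element i pairs with element i + head_dim/2.
inline void rope_split_halves(float* v, std::size_t head_dim, std::size_t pos, float theta_base) {
    assert(head_dim % 2 == 0);
    const std::size_t half = head_dim / 2;
    for (std::size_t i = 0; i < half; ++i) {
        const float freq = std::pow(theta_base,
                                    -2.0f * static_cast<float>(i) / static_cast<float>(head_dim));
        const float ang = static_cast<float>(pos) * freq;
        const float c = std::cos(ang), s = std::sin(ang);
        const float a = v[i], b = v[i + half];
        v[i] = a * c - b * s;         // x1*cos - x2*sin
        v[i + half] = b * c + a * s;  // x2*cos + x1*sin
    }
}

// SwiGLU activation applied before the down projection: h = silu(gate) * up.
inline void swiglu_act(const float* gate, const float* up, std::size_t n, float* h) {
    for (std::size_t i = 0; i < n; ++i) h[i] = gate[i] / (1.0f + std::exp(-gate[i])) * up[i];
}

// GQA: consecutive query heads share one KV head; group = n_q / n_kv.
inline std::size_t kv_head_for(std::size_t q_head, std::size_t n_q, std::size_t n_kv) {
    assert(n_kv > 0 && n_q % n_kv == 0);
    return q_head / (n_q / n_kv);
}

// Single-query attention over T cached positions with an online softmax.
// k and v are row-major [T, d]; out is [d].
inline void attend_online(const float* q, const float* k, const float* v,
                          std::size_t T, std::size_t d, float* out) {
    assert(T > 0 && d > 0);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    float m = -std::numeric_limits<float>::infinity();  // running max score
    float l = 0.0f;                                     // running softmax denominator
    std::fill(out, out + d, 0.0f);
    for (std::size_t t = 0; t < T; ++t) {
        float s = 0.0f;
        for (std::size_t j = 0; j < d; ++j) s += q[j] * k[t * d + j];
        s *= scale;
        const float m_new = std::max(m, s);
        const float corr = std::exp(m - m_new);  // rescales earlier terms; 0 on the first step
        const float p = std::exp(s - m_new);
        l = l * corr + p;
        for (std::size_t j = 0; j < d; ++j) out[j] = out[j] * corr + p * v[t * d + j];
        m = m_new;
    }
    for (std::size_t j = 0; j < d; ++j) out[j] /= l;
}
```

The online softmax keeps a running maximum and denominator, so the scores never need to be stored for a second pass and the exponentials stay bounded even for large logits.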

### Smoke results

```
$ gigachad_native --top-brain-smoke --pack physarium7b.planck \
                   --prompt-tokens 9707 --max-new 8
{
  "load_mode":         "mmap_ro",
  "mmap_bytes":        15231977472,
  "tokens_generated":  8,
  "generate_ms":       8246.08,
  "tok_per_sec":       0.97,
  "output_token_ids":  [220, 17, 15, 16, 23, 0, 358, 3900],
}

decoded = "Hello 2018! I hope"
```

```
$ gigachad_native --top-brain-smoke --pack physarum05b.planck \
                   --prompt-tokens 9707 --max-new 8
{
  "load_mode":         "mmap_ro",
  "mmap_bytes":        988241408,
  "tokens_generated":  8,
  "tok_per_sec":       5.69,
  "output_token_ids":  [110, 198, 32, 13, 220, 16, 15, 15],
}

decoded = "Hello²\nA. 100"
```

Token 9707 = "Hello" (encoded offline via Qwen2 byte-level BPE).

### Performance, what is and is not optimized

- CPU-only. An RTX 3060 Ti (8 GB) cannot hold 15 GB of BF16 weights without
  partial offload; the CUDA path is Phase-8B-next.
- ~15.4 GFLOP per 7B token (28 × ~510 MFLOP/layer + lm_head). Single-thread
  C++ FP32 GEMV runs at ~1 GFLOP/s; with OpenMP across cores, measured ~1 tok/s warm.
- 0.5B: 24 × ~50 MFLOP/layer + lm_head ≈ 1.5 GFLOP/token; observed 5.7 tok/s.
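
The relationship between the smoke-run numbers and the FLOP estimates above can be checked with a few lines of arithmetic. This is a sketch only: `SmokeRun` and the helper names are hypothetical, and the bandwidth figure assumes every byte of the pack is read once per generated token, which slightly overestimates because the embedding lookup touches a single row.

```cpp
#include <cassert>

// Figures taken from a smoke run plus the per-token FLOP estimate.
struct SmokeRun {
    double tokens;           // tokens_generated
    double generate_ms;      // generate_ms
    double gflop_per_token;  // estimate from this section
    double weight_bytes;     // mmap_bytes
};

// Decode throughput in tokens per second.
inline double tok_per_sec(const SmokeRun& r) {
    assert(r.generate_ms > 0.0);
    return r.tokens / (r.generate_ms / 1000.0);
}

// Effective compute rate implied by the throughput.
inline double effective_gflops(const SmokeRun& r) {
    return r.gflop_per_token * tok_per_sec(r);
}

// Effective weight-read bandwidth in GB/s (assumes one full pass per token).
inline double effective_gb_per_sec(const SmokeRun& r) {
    return r.weight_bytes / 1e9 * tok_per_sec(r);
}
```

With the 7B figures (`{8.0, 8246.08, 15.4, 15231977472.0}`) this gives about 0.97 tok/s, roughly 14.9 GFLOP/s, and roughly 14.8 GB/s of weight reads. The 0.5B run does not report `generate_ms`, so the same arithmetic would have to start from its printed `tok_per_sec`.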

## What is honest about the output

The 7B output `"Hello 2018! I hope"` is **plausible English**. The 0.5B
output `"Hello²\nA. 100"` is degraded; Physarum-0.5B-Organic was already
heavily pruned, which is the likely cause. Neither output has been
**parity-checked** against an HF Python reference, so a runner bug is not
yet ruled out; parity remains in Phase-8B-next. What is verified:

- Pack round-trips byte-for-byte vs source safetensors (339/339 tensors for
  the 7B pack).
- Forward pass emits valid token IDs, no crashes, no NaNs (logit-max values
  are finite and in the expected magnitude band of 8–22).
- Different prompts produce different outputs.
- 7B output reads as natural language tokens, not numerical artifacts.
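
The finite-logit check and greedy pick described in the list above could look roughly like this. It is a hedged sketch, not the runner's code: `GreedyPick`, `greedy_pick_checked`, and `in_expected_band` are hypothetical names, and the 8 to 22 band is simply the range quoted above, used here as default arguments.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>

struct GreedyPick {
    std::size_t token_id;  // index of the largest logit
    float logit;           // value of the largest logit
};

// Returns std::nullopt if any logit is NaN or infinite; otherwise the argmax.
// Ties resolve to the lowest token id.
inline std::optional<GreedyPick> greedy_pick_checked(const float* logits, std::size_t vocab) {
    assert(vocab > 0);
    GreedyPick best{0, logits[0]};
    for (std::size_t i = 0; i < vocab; ++i) {
        if (!std::isfinite(logits[i])) return std::nullopt;
        if (logits[i] > best.logit) best = {i, logits[i]};
    }
    return best;
}

// Sanity band for the maximum logit; the defaults mirror the range quoted above.
inline bool in_expected_band(float logit_max, float lo = 8.0f, float hi = 22.0f) {
    return logit_max >= lo && logit_max <= hi;
}
```

A check like this only shows that the forward pass is numerically well-behaved; it says nothing about whether the logits match a reference implementation.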

## What Phase-8 still does NOT have

- ❌ HF parity validation (would need a one-shot Python reference dump of
  layer-0/all-28 hidden states, compared against the native values to
  within ε ≤ 1e-3).
- ❌ CUDA kernels. CPU only.
- ❌ Q4/FP4/Q8 quantization (header reserves the dtype enums).
- ❌ Tier-manager-driven streaming (today the whole pack is mmap'd; tiers
  apply to organ dispatch, not yet to layer-by-layer offload).
- ❌ Native BPE tokenizer in C++ (prompt token IDs are baked offline).
- ❌ Integration of organ runs through the dispatcher (`run_task` still
  emits stubs; wiring `phys05_*` routes to the runner with per-organ prompts
  is Phase-8C).
- ❌ Full E2E pipeline (Phase-8D): memory recall → hologram → organ chain
  → top brain → hard verifier → DAG. The pieces all exist; they have not
  been chained through the new runner.
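
For the HF parity item at the top of this list, the C++ side of the comparison could be as small as the sketch below. Everything here is an assumption: the reference dump format is not defined yet, so the function takes two in-memory FP32 vectors, and ε is interpreted as a maximum absolute difference. `ParityReport` and `compare_hidden` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ParityReport {
    float max_abs_diff;       // largest |native - reference| seen
    std::size_t worst_index;  // element where it occurred
    bool pass;                // true if max_abs_diff <= eps and no NaN was seen
};

// Compare one hidden-state vector from the native runner against a reference.
inline ParityReport compare_hidden(const std::vector<float>& native,
                                   const std::vector<float>& reference,
                                   float eps = 1e-3f) {
    assert(native.size() == reference.size());
    ParityReport r{0.0f, 0, true};
    for (std::size_t i = 0; i < native.size(); ++i) {
        const float d = std::fabs(native[i] - reference[i]);
        if (std::isnan(d)) return ParityReport{d, i, false};  // NaN on either side fails
        if (d > r.max_abs_diff) {
            r.max_abs_diff = d;
            r.worst_index = i;
        }
    }
    r.pass = r.max_abs_diff <= eps;
    return r;
}
```

Reporting the worst index as well as the pass flag helps localize a mismatch, for example to one head or one half of a RoPE pair.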

## Build / commands

```
make all
./build/planck7b_tool build  --src Physarium-7B-Native      --out physarium7b.planck
./build/planck7b_tool verify --src Physarium-7B-Native      --pack physarium7b.planck
./build/planck7b_tool info   --pack physarium7b.planck

./build/gigachad_native --top-brain-smoke \
    --pack physarium7b.planck --prompt-tokens 9707 --max-new 8
```

## Honest one-liner

Phase-7 produced the Physarium-7B body. Phase-8A wrapped that body in a
native mmap'd pack (7B verified byte-for-byte, 339/339 tensors). Phase-8B
gave it a CPU nervous system that emits its first real tokens: `"Hello"`
continues into `" 2018! I hope"` through 28 transformer layers running
entirely in C++17, with no Python in the hot path, no HF transformers, and
no PyTorch. The same backend also produces tokens for the Physarum-0.5B
organ. What's missing is HF parity, CUDA, quantization, a native tokenizer,
and full pipeline plumbing; all of these are implementation work, not open
architecture decisions.
