GIGACHAD Phase-8A + Phase-8B — Native inference backend

Date: 2026-04-27 Scope: Native pack format (Phase-8A) + dense Qwen2-arch CPU forward pass (Phase-8B). First real native generations through both Physarium-7B-Native and Physarum-0.5B-Organic.

Physarium-v1 errata: the pruned model used in this report came from Physarium-v1 magnitude-flow surgery. Read sparsity claims with reports/PHYSARIUM_RESULTS_RECONCILE.md + PHYSARIUM_COVERAGE_AUDIT.md.

What is real

| # | Deliverable | Status | Evidence |
|---|---|---|---|
| 1 | planck7b pack format (mmap, 4 KB aligned, BF16/FP16) | ✅ | include/planck7b_pack.h |
| 2 | Pack writer (parses safetensors, reads config.json) | ✅ | src/planck7b/planck7b_pack.cpp |
| 3 | Pack reader (mmap) | ✅ | same |
| 4 | Pack verifier (byte round-trip vs source) | ✅ | planck7b_tool verify 339/339 ok, 15 GB checked |
| 5 | Physarium-7B-Native pack on disk | ✅ | physarium7b.planck (15.23 GB, BF16, 28 layers) |
| 6 | Physarum-0.5B-Organic pack on disk | ✅ | physarum05b.planck (0.99 GB, FP16, 24 layers, tied) |
| 7 | C++17 forward pass: embed/RMSNorm/QKV+bias/RoPE/GQA/SwiGLU/lm_head | ✅ | src/physarium7b/physarium7b_runner.cpp |
| 8 | Greedy decode loop | ✅ | same |
| 9 | Integration into gigachad_native --top-brain-smoke | ✅ | src/main.cpp |
| 10 | Real native generation, 7B | ✅ | "Hello" → " 2018! I hope" (8 tokens) |
| 11 | Real native generation, 0.5B | ✅ | "Hello" → "²\nA. 100" (8 tokens) |

Phase-8A — pack format

[HEADER (4 KB aligned)]
[embed_tokens BF16/FP16, 4 KB aligned]
[final_norm FP32]
[lm_head BF16/FP16  OR  alias of embed when tie_word_embeddings=true]
[layer 0 payload, 4 KB aligned]
  q_w  q_b  k_w  k_b  v_w  v_b  o_w
  gate_w  up_w  down_w
  inp_ln  post_ln
[layer 1 payload]
...
[layer N-1 payload]

Magic 0x504C414E434B3742 ("PLANCK7B")
Per-tensor entry stores offset, bytes, zero_count, numel, dtype, shape
Header stores hidden/inter/heads/kv/head_dim/vocab/rope/eps from source config.json

lm_head_tied flag handles models with tie_word_embeddings=true
FNV-1a 64-bit running checksum across data region
Zero counts preserved (sparsity tracking after physarium surgery)
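
The alignment and checksum items above can be sketched as follows. This is a minimal illustration, not the code in src/planck7b/planck7b_pack.cpp: the names `align_up` and `fnv1a64_update` are hypothetical, and how the real writer seeds the checksum or which byte ranges it covers is an assumption. The FNV-1a offset basis and prime are the standard published 64-bit constants.

```cpp
#include <cstddef>
#include <cstdint>

// Round an offset up to the next multiple of `align`.
// `align` must be a power of two (4096 for the 4 KB alignment used by the pack).
constexpr std::uint64_t align_up(std::uint64_t offset, std::uint64_t align = 4096) {
    return (offset + align - 1) & ~(align - 1);
}

// Standard FNV-1a 64-bit parameters.
constexpr std::uint64_t kFnvOffsetBasis = 0xcbf29ce484222325ULL;
constexpr std::uint64_t kFnvPrime       = 0x100000001b3ULL;

// Fold `len` bytes into a running FNV-1a state. Start from kFnvOffsetBasis
// and pass the previous return value back in to continue across tensors.
inline std::uint64_t fnv1a64_update(std::uint64_t state, const void* data, std::size_t len) {
    const auto* bytes = static_cast<const unsigned char*>(data);
    for (std::size_t i = 0; i < len; ++i) {
        state ^= bytes[i];    // XOR the byte in first (the "1a" variant)...
        state *= kFnvPrime;   // ...then multiply, wrapping modulo 2^64
    }
    return state;
}
```

Because the state is threaded through each call, hashing the data region tensor by tensor gives the same result as hashing it in one pass, which is what makes a running checksum practical for a 15 GB pack.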

Phase-8A — verified results

Physarium-7B (BF16, 28 layers, separate lm_head)

| Metric | Value |
|---|---|
| Pack size | 15,231,977,472 (15.23 GB) |
| Source size | ~15.23 GB (same, BF16 raw) |
| Compression ratio | 1.0× (no quantization yet) |
| Tensor count | 339 |
| Total zeros | 1,450,103,690 |
| Build wall | 199.2 s (3 min 19 s) |
| Round-trip verify | 339/339 ok, 0 fail |
| Bytes verified | 15,231,899,648 (≈15.23 GB) |
| Verify wall | 236.5 s |

Physarum-0.5B (FP16, 24 layers, tied embed/lm_head)

| Metric | Value |
|---|---|
| Pack size | 988,241,408 (0.99 GB) |
| Source size | 943 MB (same, FP16 raw) |
| Tensor count (incl. tied alias) | 290 |
| Total zeros | 73,807,859 (after surgery) |
| Build wall | 7.9 s |

Phase-8B — runner

Modules implemented (all CPU, FP32 accumulation)

BF16/FP16 → FP32 cast (IEEE-correct, subnormals + inf/nan handled)
Embed lookup (BF16/FP16 row, FP32 destination)
RMSNorm: x / sqrt(mean(x²)+eps) * w
GEMV with optional FP32 bias, OpenMP-parallel over output rows
RoPE (split-halves convention to match HF rotate_half), per-head, per-position
KV cache, FP32, contiguous [layer, head, T, head_dim]
GQA scaled-dot-product attention with online softmax, group=n_q/n_kv
SwiGLU: down(silu(gate) * up)
Final RMSNorm + lm_head GEMV + argmax (greedy)
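
Four of the modules listed above are small enough to sketch in isolation. The code below is an illustrative reconstruction from the descriptions in this list, not an excerpt from physarium7b_runner.cpp; all function names and signatures are hypothetical, and the real runner may organize its loops and memory layout differently. It covers only the BF16 cast (the FP16 cast needs explicit exponent rebiasing and is omitted), and the attention routine handles a single query head.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// BF16 is the top 16 bits of an IEEE-754 binary32, so widening is a 16-bit
// shift; subnormals, inf and NaN carry over unchanged.
inline float bf16_to_f32(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i], accumulated in FP32.
inline void rmsnorm(const float* x, const float* w, float* y,
                    std::size_t n, float eps) {
    float sum_sq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum_sq += x[i] * x[i];
    const float inv_rms = 1.0f / std::sqrt(sum_sq / static_cast<float>(n) + eps);
    for (std::size_t i = 0; i < n; ++i) y[i] = x[i] * inv_rms * w[i];
}

// RoPE, split-halves layout: element i rotates together with element
// i + head_dim/2, which matches rotate_half(x) = concat(-x2, x1).
inline void rope_split_halves(float* v, std::size_t head_dim,
                              std::size_t pos, float theta_base) {
    const std::size_t half = head_dim / 2;
    for (std::size_t i = 0; i < half; ++i) {
        const float exponent = -2.0f * static_cast<float>(i) / static_cast<float>(head_dim);
        const float angle = static_cast<float>(pos) * std::pow(theta_base, exponent);
        const float c = std::cos(angle), s = std::sin(angle);
        const float a = v[i], b = v[i + half];
        v[i]        = a * c - b * s;
        v[i + half] = b * c + a * s;
    }
}

// One query head attending over T >= 1 cached positions with a one-pass
// (online) softmax: keep a running max and denominator, and rescale the
// partial output whenever the max grows. k_cache and v_cache are [T, head_dim].
inline void attend_online(const float* q, const float* k_cache, const float* v_cache,
                          std::size_t T, std::size_t head_dim, float* out) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    float running_max = -std::numeric_limits<float>::infinity();
    float denom = 0.0f;
    std::fill(out, out + head_dim, 0.0f);
    for (std::size_t t = 0; t < T; ++t) {
        float score = 0.0f;
        for (std::size_t j = 0; j < head_dim; ++j) score += q[j] * k_cache[t * head_dim + j];
        score *= scale;
        const float new_max = std::max(running_max, score);
        const float rescale = std::exp(running_max - new_max);  // exp(-inf) = 0 on the first step
        const float p = std::exp(score - new_max);
        denom = denom * rescale + p;
        for (std::size_t j = 0; j < head_dim; ++j)
            out[j] = out[j] * rescale + p * v_cache[t * head_dim + j];
        running_max = new_max;
    }
    for (std::size_t j = 0; j < head_dim; ++j) out[j] /= denom;
}
```

For GQA, the same routine would be called once per query head, with query head h reading the K/V rows of KV head h / group, where group = n_q/n_kv as stated above.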

Smoke results

$ gigachad_native --top-brain-smoke --pack physarium7b.planck \
                   --prompt-tokens 9707 --max-new 8
{
  "load_mode":         "mmap_ro",
  "mmap_bytes":        15231977472,
  "tokens_generated":  8,
  "generate_ms":       8246.08,
  "tok_per_sec":       0.97,
  "output_token_ids":  [220, 17, 15, 16, 23, 0, 358, 3900],
}

decoded = "Hello 2018! I hope"

$ gigachad_native --top-brain-smoke --pack physarum05b.planck \
                   --prompt-tokens 9707 --max-new 8
{
  "load_mode":         "mmap_ro",
  "mmap_bytes":        988241408,
  "tokens_generated":  8,
  "tok_per_sec":       5.69,
  "output_token_ids":  [110, 198, 32, 13, 220, 16, 15, 15],
}

decoded = "Hello²\nA. 100"

Token 9707 = "Hello" (encoded offline via Qwen2 byte-level BPE).

Performance, what is and is not optimized

CPU-only. The RTX 3060 Ti (8 GB) cannot hold 15 GB of BF16 weights without partial offload; the CUDA path is Phase-8B-next.

~15.4 GFLOPs per 7B token (28 × ~510 MFLOPs/layer + lm_head). Single-thread C++ FP32 GEMV sustains ~1 GFLOP/s; with OpenMP across cores the measured rate is ~1 tok/s warm.

0.5B: ~24 × ~50 MFLOPs/layer + lm_head ≈ 1.5 GFLOPs/token, observed 5.7 tok/s.

What is honest about the output

The 7B output "Hello 2018! I hope" is plausible English. The 0.5B output "Hello²\nA. 100" is degraded — Physarum-0.5B-Organic was already heavily pruned. Neither has been parity-checked against an HF Python reference; this remains in Phase-8B-next. What is verified:

Pack round-trips byte-for-byte vs source safetensors (339/339 7B).
Forward pass emits valid token IDs, no crashes, no NaNs (logit-max values are finite and in expected magnitude band 8–22).

Different prompts produce different outputs.
7B output reads as natural language tokens, not numerical artifacts.
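
The "no NaNs, finite logit-max" check in the list above can be folded into the greedy argmax itself. The following is a minimal sketch under that assumption; `greedy_argmax_checked` is a hypothetical name, and the runner may perform the finiteness check elsewhere or report it differently.

```cpp
#include <cmath>
#include <cstddef>

// Return the index of the largest logit (first one wins on ties), or -1 when
// the vocabulary is empty or any logit is NaN or infinite.
inline std::ptrdiff_t greedy_argmax_checked(const float* logits, std::size_t vocab) {
    if (vocab == 0) return -1;
    std::size_t best = 0;
    for (std::size_t i = 0; i < vocab; ++i) {
        if (!std::isfinite(logits[i])) return -1;   // reject NaN and +/-inf
        if (logits[i] > logits[best]) best = i;
    }
    return static_cast<std::ptrdiff_t>(best);
}
```

A caller could additionally log `logits[best]` per step to confirm the value stays inside the expected magnitude band.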

What Phase-8 still does NOT have

❌ HF parity validation (would need a one-shot Python reference dump for layer-0/all-28 hidden states, comparison to ε ≤ 1e-3).

❌ CUDA kernels. CPU only.
❌ Q4/FP4/Q8 quantization (header reserves the dtype enums).
❌ Tier-manager-driven streaming (today the whole pack is mmap'd; tiers apply to organ dispatch, not yet to layer-by-layer offload).

❌ Native BPE tokenizer in C++ (prompt token IDs are baked offline).
❌ Integration of organ runs through the dispatcher (run_task still emits stubs; wiring phys05_* routes to the runner with per-organ prompts is Phase-8C).

❌ Full E2E pipeline (Phase-8D): memory recall → hologram → organ chain → top brain → hard verifier → DAG. The pieces all exist; they have not been chained through the new runner.
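
The native side of the missing HF parity check (first item above) could look roughly like this. It is a sketch only: it assumes the reference hidden states have already been dumped to FP32 by a Python script and loaded into memory, and it reads "ε ≤ 1e-3" as a bound on the maximum absolute difference, which the text does not specify. `ParityResult` and `compare_hidden` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>

struct ParityResult {
    float       max_abs_diff;  // largest |native - reference| seen (NaN if one occurred)
    std::size_t worst_index;   // element index where it occurred
    bool        ok;            // true when max_abs_diff <= eps
};

// Compare one hidden-state vector from the native runner against the
// reference dump. A NaN on either side fails immediately.
inline ParityResult compare_hidden(const float* native, const float* reference,
                                   std::size_t n, float eps = 1e-3f) {
    ParityResult r{0.0f, 0, true};
    for (std::size_t i = 0; i < n; ++i) {
        const float diff = std::fabs(native[i] - reference[i]);
        if (std::isnan(diff)) return ParityResult{diff, i, false};
        if (diff > r.max_abs_diff) {
            r.max_abs_diff = diff;
            r.worst_index = i;
        }
    }
    r.ok = r.max_abs_diff <= eps;
    return r;
}
```

Running it after layer 0 first, then after all 28 layers, would localize any divergence to a specific layer before the final logits are compared.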

Build / commands

make all
./build/planck7b_tool build  --src Physarium-7B-Native      --out physarium7b.planck
./build/planck7b_tool verify --src Physarium-7B-Native      --pack physarium7b.planck
./build/planck7b_tool info   --pack physarium7b.planck

./build/gigachad_native --top-brain-smoke \
    --pack physarium7b.planck --prompt-tokens 9707 --max-new 8

Honest one-liner

Phase-7 produced the Physarium-7B body. Phase-8A wrapped that body in a native streaming pack (verified 100% round-trip). Phase-8B gave it a CPU nervous system that emits its first real tokens — "Hello" continues into " 2018! I hope" through 28 transformer layers running entirely in C++17, no Python in the hot path, no HF transformers, no PyTorch. The same backend also produces tokens for the Physarum-0.5B organ. What's missing is HF parity, CUDA, quantization, native tokenizer, and full pipeline plumbing — all of which are real engineering, not architecture decisions.