GIGACHAD Phase-8A + Phase-8B — Native inference backend
Date: 2026-04-27 Scope: Native pack format (Phase-8A) + dense Qwen2-arch CPU forward pass (Phase-8B). First real native generations through both Physarium-7B-Native and Physarum-0.5B-Organic.
Physarium-v1 errata: the pruned model used in this report came from
Physarium-v1 magnitude-flow surgery. Read sparsity claims alongside
reports/PHYSARIUM_RESULTS_RECONCILE.md and PHYSARIUM_COVERAGE_AUDIT.md.
What is real
| # | Deliverable | Status | Evidence |
|---|---|---|---|
| 1 | planck7b pack format (mmap, 4 KB aligned, BF16/FP16) | ✅ | include/planck7b_pack.h |
| 2 | Pack writer (parses safetensors, reads config.json) | ✅ | src/planck7b/planck7b_pack.cpp |
| 3 | Pack reader (mmap) | ✅ | same |
| 4 | Pack verifier (byte round-trip vs source) | ✅ | planck7b_tool verify 339/339 ok, 15 GB checked |
| 5 | Physarium-7B-Native pack on disk | ✅ | physarium7b.planck (15.23 GB, BF16, 28 layers) |
| 6 | Physarum-0.5B-Organic pack on disk | ✅ | physarum05b.planck (0.99 GB, FP16, 24 layers, tied) |
| 7 | C++17 forward pass: embed/RMSNorm/QKV+bias/RoPE/GQA/SwiGLU/lm_head | ✅ | src/physarium7b/physarium7b_runner.cpp |
| 8 | Greedy decode loop | ✅ | same |
| 9 | Integration into gigachad_native --top-brain-smoke | ✅ | src/main.cpp |
| 10 | Real native generation, 7B | ✅ | "Hello" → " 2018! I hope" (8 tokens) |
| 11 | Real native generation, 0.5B | ✅ | "Hello" → "²\nA. 100" (8 tokens) |
Phase-8A — pack format
[HEADER (4 KB aligned)]
[embed_tokens BF16/FP16, 4 KB aligned]
[final_norm FP32]
[lm_head BF16/FP16 OR alias of embed when tie_word_embeddings=true]
[layer 0 payload, 4 KB aligned]
q_w q_b k_w k_b v_w v_b o_w
gate_w up_w down_w
inp_ln post_ln
[layer 1 payload]
...
[layer N-1 payload]
- Magic 0x504C414E434B3742 ("PLANCK7B")
- Per-tensor entry stores offset, bytes, zero_count, numel, dtype, shape
- Header stores hidden/inter/heads/kv/head_dim/vocab/rope/eps from
source config.json
- lm_head_tied flag handles models with tie_word_embeddings=true
- FNV-1a 64-bit running checksum across data region
- Zero counts preserved (sparsity tracking after physarium surgery)
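The magic value, the 4 KB alignment rule, and the running checksum listed above can be sketched in a few lines. This is a minimal illustration, not the code in include/planck7b_pack.h: the names make_magic, align_4k, and Fnv1a64 are hypothetical, and the byte order of the magic (first character in the most significant byte) is inferred from the hex constant rather than confirmed by the source. The FNV-1a offset basis and prime are the standard 64-bit constants.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration; not taken from planck7b_pack.h.

// Pack an 8-character ASCII tag into a 64-bit value, first character in the
// most significant byte, so "PLANCK7B" becomes 0x504C414E434B3742.
constexpr std::uint64_t make_magic(const char (&tag)[9]) {
    std::uint64_t m = 0;
    for (int i = 0; i < 8; ++i) {
        m = (m << 8) | static_cast<unsigned char>(tag[i]);
    }
    return m;
}

// Round a file offset up to the next 4 KB boundary (identity if already aligned).
constexpr std::uint64_t align_4k(std::uint64_t offset) {
    return (offset + 4095u) & ~std::uint64_t{4095};
}

// Running FNV-1a 64-bit checksum: feed chunks in file order and the state
// carries forward, so hashing in pieces equals hashing the whole region.
struct Fnv1a64 {
    std::uint64_t state = 0xcbf29ce484222325ULL;  // standard 64-bit offset basis

    void update(const void* data, std::size_t n) {
        const auto* p = static_cast<const unsigned char*>(data);
        for (std::size_t i = 0; i < n; ++i) {
            state ^= p[i];               // FNV-1a: XOR the byte first,
            state *= 0x100000001b3ULL;   // then multiply by the 64-bit FNV prime
        }
    }
};
```

A writer built this way would call update once per tensor payload as it streams the data region out, and a verifier would repeat the same sequence over the mmap'd bytes and compare the final state with the stored value.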
Phase-8A — verified results
Physarium-7B (BF16, 28 layers, separate lm_head)
| Metric | Value |
|---|---|
| Pack size | 15,231,977,472 bytes (15.23 GB) |
| Source size | ~15.23 GB (same, BF16 raw) |
| Compression ratio | 1.0× (no quantization yet) |
| Tensor count | 339 |
| Total zeros | 1,450,103,690 |
| Build wall | 199.2 s (3 min 19 s) |
| Round-trip verify | 339/339 ok, 0 fail |
| Bytes verified | 15,231,899,648 (≈15.23 GB) |
| Verify wall | 236.5 s |
Physarum-0.5B (FP16, 24 layers, tied embed/lm_head)
| Metric | Value |
|---|---|
| Pack size | 988,241,408 bytes (0.99 GB) |
| Source size | 943 MB (same, FP16 raw) |
| Tensor count (incl. tied alias) | 290 |
| Total zeros | 73,807,859 (after surgery) |
| Build wall | 7.9 s |
Phase-8B — runner
Modules implemented (all CPU, FP32 accumulation)
- BF16/FP16 → FP32 cast (IEEE-correct, subnormals + inf/nan handled)
- Embed lookup (BF16/FP16 row, FP32 destination)
- RMSNorm: x / sqrt(mean(x²)+eps) * w
- GEMV with optional FP32 bias, OpenMP-parallel over output rows
- RoPE (split-halves convention to match HF rotate_half), per-head, per-position
- KV cache, FP32, contiguous [layer, head, T, head_dim]
- GQA scaled-dot-product attention with online softmax, group=n_q/n_kv
- SwiGLU: down(silu(gate) * up)
- Final RMSNorm + lm_head GEMV + argmax (greedy)
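Several of the modules above are small enough to sketch directly. The block below is an illustrative reading of the list, not the contents of physarium7b_runner.cpp: every function name and signature is hypothetical, the attention routine handles a single query head over a row-major [T, head_dim] slice of the cache, and GEMV, OpenMP, and GQA head grouping are left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// BF16 is the upper 16 bits of an IEEE-754 binary32, so widening is a shift;
// subnormals, inf, and NaN carry over unchanged.
inline float bf16_to_f32(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i], FP32 accumulation.
inline void rmsnorm(const float* x, const float* w, float eps, std::size_t n, float* y) {
    assert(n > 0);
    float ss = 0.0f;
    for (std::size_t i = 0; i < n; ++i) ss += x[i] * x[i];
    const float inv = 1.0f / std::sqrt(ss / static_cast<float>(n) + eps);
    for (std::size_t i = 0; i < n; ++i) y[i] = x[i] * inv * w[i];
}

// RoPE, split-halves convention: element i is rotated together with element
// i + head_dim/2, which is what HF's rotate_half formulation computes.
inline void rope_split_halves(float* x, std::size_t head_dim, std::size_t pos, float theta_base) {
    const std::size_t half = head_dim / 2;
    for (std::size_t i = 0; i < half; ++i) {
        const float freq = std::pow(theta_base, -2.0f * static_cast<float>(i) / static_cast<float>(head_dim));
        const float angle = static_cast<float>(pos) * freq;
        const float c = std::cos(angle), s = std::sin(angle);
        const float a = x[i], b = x[i + half];
        x[i] = a * c - b * s;
        x[i + half] = b * c + a * s;
    }
}

// SwiGLU inner product: out[i] = silu(gate[i]) * up[i]; the down projection
// is a separate GEMV applied to `out`.
inline void swiglu_gate(const float* gate, const float* up, std::size_t n, float* out) {
    for (std::size_t i = 0; i < n; ++i) out[i] = gate[i] / (1.0f + std::exp(-gate[i])) * up[i];
}

// One query head over T cached positions using online softmax: a running max
// m and running sum l rescale the accumulator, so scores are never stored.
inline void attend_online(const float* q, const float* K, const float* V,
                          std::size_t T, std::size_t d, float* out) {
    assert(T > 0 && d > 0);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    float m = -std::numeric_limits<float>::infinity();
    float l = 0.0f;
    for (std::size_t j = 0; j < d; ++j) out[j] = 0.0f;
    for (std::size_t t = 0; t < T; ++t) {
        float s = 0.0f;
        for (std::size_t j = 0; j < d; ++j) s += q[j] * K[t * d + j];
        s *= scale;
        const float m_new = std::max(m, s);
        const float corr = std::exp(m - m_new);  // rescales earlier terms; 0 on the first step
        const float p = std::exp(s - m_new);
        l = l * corr + p;
        for (std::size_t j = 0; j < d; ++j) out[j] = out[j] * corr + p * V[t * d + j];
        m = m_new;
    }
    for (std::size_t j = 0; j < d; ++j) out[j] /= l;
}

// Greedy decode step: index of the largest logit (first one wins on ties).
inline std::size_t argmax(const float* v, std::size_t n) {
    assert(n > 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i) if (v[i] > v[best]) best = i;
    return best;
}
```

In a GQA layout, each group of n_q/n_kv query heads would call attend_online against the same K and V slice, which is why the cache only needs one entry per KV head.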
Smoke results
$ gigachad_native --top-brain-smoke --pack physarium7b.planck \
--prompt-tokens 9707 --max-new 8
{
"load_mode": "mmap_ro",
"mmap_bytes": 15231977472,
"tokens_generated": 8,
"generate_ms": 8246.08,
"tok_per_sec": 0.97,
"output_token_ids": [220, 17, 15, 16, 23, 0, 358, 3900],
}
decoded = "Hello 2018! I hope"
$ gigachad_native --top-brain-smoke --pack physarum05b.planck \
--prompt-tokens 9707 --max-new 8
{
"load_mode": "mmap_ro",
"mmap_bytes": 988241408,
"tokens_generated": 8,
"tok_per_sec": 5.69,
"output_token_ids": [110, 198, 32, 13, 220, 16, 15, 15],
}
decoded = "Hello²\nA. 100"
Token 9707 = "Hello" (encoded offline via Qwen2 byte-level BPE).
Performance, what is and is not optimized
- CPU-only. RTX 3060 Ti 8 GB cannot hold 15 GB BF16 weights without
partial offload — CUDA path is Phase-8B-next.
- ~15.4 GFLOPs per 7B token (28 × ~510 MFLOPs/layer + lm_head). Single-thread
C++ FP32 GEMV sustains ~1 GFLOP/s; with OpenMP across cores, the measured rate is ~1 tok/s warm.
- 0.5B: 24 × ~50 MFLOPs/layer + lm_head ≈ 1.5 GFLOPs/token; observed 5.7 tok/s.
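The per-token cost and the measured decode rate together imply an effective throughput, which makes the two models easier to compare. The sketch below only multiplies figures already quoted in this report (15.4 and 1.5 GFLOPs per token, 0.97 and 5.69 tok/s); the function names are hypothetical and the results inherit the roughness of those estimates.

```cpp
// Per-token cost in GFLOPs: layers * per-layer cost + lm_head cost.
constexpr double gflops_per_token(int layers, double per_layer_gflops, double lm_head_gflops) {
    return layers * per_layer_gflops + lm_head_gflops;
}

// Effective throughput in GFLOP/s implied by a measured decode rate.
constexpr double effective_gflops_per_sec(double gflops_per_tok, double tok_per_sec) {
    return gflops_per_tok * tok_per_sec;
}

// 7B:   15.4 GFLOPs/token * 0.97 tok/s -> roughly 14.9 GFLOP/s
// 0.5B:  1.5 GFLOPs/token * 5.69 tok/s -> roughly  8.5 GFLOP/s
```

On these figures the 0.5B run reaches a lower effective rate than the 7B run; the report does not establish why.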
What is honest about the output
The 7B output "Hello 2018! I hope" is plausible English. The 0.5B output "Hello²\nA. 100" is degraded — Physarum-0.5B-Organic was already heavily pruned. Neither has been parity-checked against an HF Python reference; this remains in Phase-8B-next. What is verified:
- Pack round-trips byte-for-byte vs source safetensors (339/339 7B).
- Forward pass emits valid token IDs, no crashes, no NaNs (logit-max values
are finite and in expected magnitude band 8–22).
- Different prompts produce different outputs.
- 7B output reads as natural language tokens, not numerical artifacts.
What Phase-8 still does NOT have
- ❌ HF parity validation (would need a one-shot Python reference dump for
layer-0/all-28 hidden states, comparison to ε ≤ 1e-3).
- ❌ CUDA kernels. CPU only.
- ❌ Q4/FP4/Q8 quantization (header reserves the dtype enums).
- ❌ Tier-manager-driven streaming (today the whole pack is mmap'd; tiers
apply to organ dispatch, not yet to layer-by-layer offload).
- ❌ Native BPE tokenizer in C++ (prompt token IDs are baked offline).
- ❌ Integration of organ runs through the dispatcher (run_task still
emits stubs; wiring phys05_* routes to the runner with per-organ prompts is Phase-8C).
- ❌ Full E2E pipeline (Phase-8D): memory recall → hologram → organ chain
→ top brain → hard verifier → DAG. The pieces all exist; they have not been chained through the new runner.
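The HF parity item in the list above reduces to comparing a native hidden-state vector with a reference dump. The following is a minimal sketch of that comparison under stated assumptions: the tolerance is read as a maximum absolute difference (the report only says ε ≤ 1e-3), both vectors are already loaded as FP32, and ParityResult and check_parity are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Outcome of one comparison: the largest deviation and where it occurred.
struct ParityResult {
    float max_abs_diff;
    std::size_t worst_index;
    bool ok;
};

// Compare a native hidden-state vector against a reference dump.
// Assumption: parity means max |native - reference| <= eps.
inline ParityResult check_parity(const std::vector<float>& native,
                                 const std::vector<float>& reference,
                                 float eps = 1e-3f) {
    assert(native.size() == reference.size());
    ParityResult r{0.0f, 0, true};
    for (std::size_t i = 0; i < native.size(); ++i) {
        const float d = std::fabs(native[i] - reference[i]);
        if (std::isnan(d)) {  // a NaN on either side can never count as a match
            r.max_abs_diff = d;
            r.worst_index = i;
            r.ok = false;
            return r;
        }
        if (d > r.max_abs_diff) {
            r.max_abs_diff = d;
            r.worst_index = i;
        }
    }
    r.ok = r.max_abs_diff <= eps;
    return r;
}
```

Running this after layer 0 first and then after every layer would localize the first layer at which the native path diverges, rather than only flagging a mismatch in the final logits.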
Build / commands
make all
./build/planck7b_tool build --src Physarium-7B-Native --out physarium7b.planck
./build/planck7b_tool verify --src Physarium-7B-Native --pack physarium7b.planck
./build/planck7b_tool info --pack physarium7b.planck
./build/gigachad_native --top-brain-smoke \
--pack physarium7b.planck --prompt-tokens 9707 --max-new 8
Honest one-liner
Phase-7 produced the Physarium-7B body. Phase-8A wrapped that body in a native streaming pack (verified 100% round-trip). Phase-8B gave it a CPU nervous system that emits its first real tokens — "Hello" continues into " 2018! I hope" through 28 transformer layers running entirely in C++17, no Python in the hot path, no HF transformers, no PyTorch. The same backend also produces tokens for the Physarum-0.5B organ. What's missing is HF parity, CUDA, quantization, native tokenizer, and full pipeline plumbing — all of which are real engineering, not architecture decisions.