Hardware calculator · real benchmarks · 2026

Will it run AI?

Tell us your GPU + RAM + SSD. We'll show every model that runs — natively, with CPU offload, with expert streaming, or via disk streaming. Almost any model can run somehow — the question is at what speed. We're upfront about each tier.
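The four tiers above can be pictured as a simple fit check. What follows is a minimal sketch, not the calculator's actual logic: the `Machine` type, the `pick_tier` function, the tier names, and the thresholds are all hypothetical, and the sketch ignores KV cache, activations, and OS overhead.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    vram_gb: float  # GPU memory
    ram_gb: float   # system memory
    ssd_gb: float   # free SSD space available for weights


def pick_tier(model_gb: float, is_moe: bool, m: Machine) -> str:
    """Return the fastest tier a model of `model_gb` weights could use.

    Assumption (ours, for illustration): a model "fits" a tier when its
    weight file is no larger than the memory that tier can draw on.
    """
    if model_gb <= m.vram_gb:
        return "native"                # everything lives in VRAM
    if model_gb <= m.vram_gb + m.ram_gb:
        return "cpu_offload"           # overflow layers run from system RAM
    if model_gb <= m.ssd_gb:
        # MoE models only need the active experts per token,
        # dense models must stream whole layers from disk.
        return "expert_streaming" if is_moe else "disk_streaming"
    return "does_not_fit"


if __name__ == "__main__":
    # 8 GB VRAM matches the reference card; 32 GB RAM and 1000 GB SSD are assumed.
    box = Machine(vram_gb=8, ram_gb=32, ssd_gb=1000)
    print(pick_tier(159.0, True, box))
```

The ordering is the point: each fallback tier trades speed for capacity, so the check stops at the first tier that fits.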

Anchor: DeepSeek V4-Flash, closed as Surgery Case 01. A 284 B-total / 13 B-active MoE (159 GB of weights, 46 shards, ~69 k tensors) running end-to-end on a single 8 GB RTX 3060 Ti through our own native C++/CUDA inference engine. Decode: 1.86 tok/s real-weight warm → 0.16 tok/s on the full 43-layer text loop. The bottleneck is disk I/O, not compute: our PLANCK_PACK contiguous expert layout gave a 6.5× DMA speed-up (0.13 → 0.85 GB/s), and we still call it disk-bound. This is a doctrine demonstration, not a chat benchmark. See the V4-Flash tech brief. Baseline check on the same card: Physarium-7B Q4 at 83.58 tok/s via Frankenstellm. That is a floor reference, not a feat.

⭐ marks our reference card — every number on this site is measured on it.

System RAM is used for CPU offload: layers that don't fit in VRAM run from system memory at ~3–8× the latency. This unlocks bigger models at lower speed.
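To get a feel for what the ~3–8× figure means for throughput, here is a rough sketch. It assumes every layer costs the same time when resident in VRAM and that an offloaded layer costs `slowdown` times as much; `offload_tok_s` and its parameters are illustrative names, not part of our engine.

```python
def offload_tok_s(native_tok_s: float, offloaded_fraction: float, slowdown: float) -> float:
    """Estimate tokens/s when a fraction of layers runs from system RAM.

    native_tok_s       -- speed with every layer in VRAM
    offloaded_fraction -- share of layers moved to system RAM, 0.0 to 1.0
    slowdown           -- latency multiplier for an offloaded layer (text says ~3-8)
    """
    if native_tok_s <= 0:
        raise ValueError("native_tok_s must be positive")
    if not 0.0 <= offloaded_fraction <= 1.0:
        raise ValueError("offloaded_fraction must be between 0 and 1")
    if slowdown < 1.0:
        raise ValueError("slowdown must be at least 1")

    native_time = 1.0 / native_tok_s  # seconds per token, all layers in VRAM
    # Resident layers keep their cost; offloaded layers cost `slowdown` times more.
    relative_cost = (1.0 - offloaded_fraction) + offloaded_fraction * slowdown
    return 1.0 / (native_time * relative_cost)


if __name__ == "__main__":
    # Hypothetical model: 40 tok/s fully in VRAM, half of its layers offloaded.
    for slowdown in (3.0, 8.0):
        print(slowdown, round(offload_tok_s(40.0, 0.5, slowdown), 2))
```

Under these assumptions, offloading half the layers of a hypothetical 40 tok/s model gives roughly 20 tok/s at 3× and roughly 9 tok/s at 8×. Real results depend on memory bandwidth and on which layers are moved.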

An NVMe SSD enables expert streaming: only the active expert weights are loaded from disk for each token. This lets a 284 B-total / 13 B-active MoE run on 8 GB of VRAM at ~2 tok/s warm (0.16 tok/s on the full 43-layer text loop).
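A back-of-envelope bound shows why streaming is disk-limited. The sketch below uses the V4-Flash figures quoted on this page (159 GB, 284 B total, 13 B active, 0.13 and 0.85 GB/s) and makes two simplifying assumptions of our own: weight bytes are spread evenly across parameters, and `resident_fraction` of the active weights is already in memory. The function names are illustrative and this is not how the engine schedules reads.

```python
WEIGHTS_GB = 159.0       # total weight size quoted for V4-Flash
TOTAL_PARAMS_B = 284.0   # total parameters, in billions
ACTIVE_PARAMS_B = 13.0   # active parameters per token, in billions


def bytes_per_param(weights_gb: float, total_params_b: float) -> float:
    """GB per billion parameters equals bytes per parameter (uniform-spread assumption)."""
    return weights_gb / total_params_b


def streamed_gb_per_token(active_params_b: float, bpp: float, resident_fraction: float = 0.0) -> float:
    """GB that must come from disk for one token, after discounting resident weights."""
    if not 0.0 <= resident_fraction < 1.0:
        raise ValueError("resident_fraction must be in [0, 1)")
    return active_params_b * bpp * (1.0 - resident_fraction)


def disk_bound_tok_s(read_gb_s: float, gb_per_token: float) -> float:
    """Upper bound on tokens/s when disk reads are the only cost."""
    return read_gb_s / gb_per_token


if __name__ == "__main__":
    bpp = bytes_per_param(WEIGHTS_GB, TOTAL_PARAMS_B)
    per_token = streamed_gb_per_token(ACTIVE_PARAMS_B, bpp)
    print(f"{bpp:.2f} bytes/param, {per_token:.2f} GB streamed per token")
    for read_gb_s in (0.13, 0.85):  # before and after the contiguous layout
        print(f"{read_gb_s} GB/s -> {disk_bound_tok_s(read_gb_s, per_token):.3f} tok/s")
```

With nothing resident, this gives about 0.56 bytes per parameter, about 7.3 GB read per token, and about 0.12 tok/s at 0.85 GB/s. That is the same order of magnitude as the 0.16 tok/s full-loop figure. Raising `resident_fraction` shows how caching hot weights lifts the bound.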