Practical guide · 2026-05-05

Run modern AI on your own hardware — not in 5 years, today.

On a single $300 RTX 3060 Ti with 8 GB VRAM, under WSL2 with 14 GB RAM, we ran DeepSeek V4-Flash (DeepSeek's 284 B-total / 13 B-active Mixture-of-Experts model, 159 GB of weights across 46 safetensors shards) end-to-end through our own native C++/CUDA inference engine. We closed it as Surgery Case 01: out-of-core inference of a 284 B-parameter MoE on a single consumer GPU. Decode runs at 1.86 tok/s on the warm real-weight chain and drops to 0.16 tok/s on the full 43-layer text loop. The bottleneck is random disk I/O, not compute: our PLANCK_PACK contiguous expert layout lifted DMA throughput 6.5× (0.13 → 0.85 GB/s) and the model is still disk-bound. (Baseline check, since people ask: a Q4 7B fits trivially and runs at 83.58 tok/s on the same card. That is the floor in 2026, not a feat.)

This page is the practical reference. Real numbers, real bottlenecks, every claim sourced. Companion to /ai (the field map) and /ai-faq (the Q&A).

1 · Why run AI locally at all.

Cloud AI APIs are excellent at one thing: getting you to a working prototype in twenty minutes. They are bad at almost everything that comes after. Local AI is what you want when:

  • Privacy: prompts contain customer data, source code, medical records, or anything that should not leave the building.
  • Cost at scale: at 10M+ tokens per month, a $300 GPU pays for itself in weeks to months, depending on which metered API it replaces.
  • Sovereignty: regulators, governments, and laboratories cannot tolerate a vendor revoking access on a Tuesday.
  • Latency: the round-trip to a US data centre is longer than the inference itself for short prompts. Local removes the network.
  • Air-gap: classified networks. Field deployments. Anything offline by design.
  • Tuning: cloud APIs do not let you fine-tune custom organs for your domain. Local does. See /surgery.

The only honest argument against local is engineering time. Setting up a runtime, picking a model, getting the first token out — historically six hours of friction. In 2026 that has dropped to about ten minutes if you use Ollama. We list every step below.

2 · The hardware floor.

Modern open-weights LLMs are quantised to 4-5 bits per weight, which collapses VRAM requirements. The current practical envelope:

VRAM | What runs | Speed (tok/s) | Card examples
4 GB | 1.5B–3B Q4 | ~40–80 | Old GPUs, integrated GPUs
8 GB | 7B Q4; 284B MoE streamed | 83.58 (7B) / 1.86 (MoE, warm) | RTX 3060 Ti, RTX 4060 Ti 8 GB
12 GB | 13B Q4, 7B Q8 | ~50–70 | RTX 3060 12 GB, RTX 4070, RTX 4070 Super
16 GB | 13B Q4 with long context | ~50–70 | RTX 4060 Ti 16 GB, RTX 4080
24 GB | 34B Q4, fits 70B Q3 | ~25–45 | RTX 3090, RTX 4090
48 GB+ | 70B Q5+, 120B Q4 | ~15–30 | RTX 6000 Ada, A6000, dual 3090
Apple Unified | Up to 70B Q4 on 64 GB Macs | ~6–25 | M2/M3/M4 Max/Ultra via MLX or llama.cpp Metal

The cheapest actually-usable entry point in 2026 is a used RTX 3060 Ti for around $250-280. The interesting fact about it is not that it runs 7B models (every modern card does that). The interesting fact is that, via expert streaming, it runs a model whose weights are about 20× larger than its VRAM (159 GB against 8 GB): a 284 B-parameter MoE, end-to-end. We measured this on a single card; the report is at /r/V4_FLASH_TECH_BRIEF.
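The envelope above follows from simple arithmetic: weight size is roughly parameter count times bits per weight. The sketch below is a back-of-the-envelope estimator, not a measurement; the function names and the fixed 1.5 GB allowance for KV cache and buffers are assumptions made for illustration.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, bits_per_weight, vram_gb, overhead_gb=1.5):
    """True if weights plus a fixed allowance fit in VRAM.

    overhead_gb is an assumed allowance for KV cache and scratch buffers;
    real overhead grows with context length.
    """
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for name, params in [("7B", 7), ("13B", 13), ("70B", 70)]:
        print(f"{name} at 4.5 bits/weight: {weights_gb(params, 4.5):.1f} GB")
    # Ratio of V4-Flash weights on disk to VRAM on the 3060 Ti.
    print(f"159 GB / 8 GB = {159 / 8:.1f}x")
```

A 7B model at 4.5 bits per weight comes out just under 4 GB, which is why it fits an 8 GB card with room for context, while a 13B model at the same precision does not.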

3 · Pick a runtime.

Ollama — the easiest path

One command installs everything. Best for "I want a working chatbot in five minutes." Wraps llama.cpp, hides quantisation choices, manages model downloads. Site: ollama.com.

curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen2.5:7b   # downloads + runs in one step

llama.cpp — the workhorse

MIT-licensed, written in C++, runs on CPU/CUDA/Metal/ROCm/Vulkan. Best for "I want full control over quantisation, prompt template, and KV cache." Source: github.com/ggerganov/llama.cpp. The GGUF file format is the de-facto standard for distributing local-friendly weights.

vLLM — production throughput

UC-Berkeley engine with PagedAttention and continuous batching. Best for "I am serving many users from one GPU." Less suited to laptop or single-user workflows. Site: docs.vllm.ai.

MLC-LLM, LM Studio, Jan

MLC-LLM compiles models for several backends (WebGPU, Vulkan, Metal, ROCm) and, through WebGPU, runs in the browser. LM Studio and Jan are GUI front-ends that wrap llama.cpp for non-technical users.

Frankenstellm (CyberdyneLabs)

Clean-room CUDA backend in the same class as llama.cpp — single C++/CUDA binary, no PyTorch, no cuBLAS. The reason it exists is that we wanted a runtime we wrote ourselves, with a hologram cache (860× speedup on identical prompts), a Black-Dog conductance router, and a hot-expert cache for MoE streaming. See /frankenstellm and /r/V4_FLASH_TECH_BRIEF.

4 · Pick a model.

For first-time local AI, download Qwen 2.5 7B Instruct Q4_K_M (~4.5 GB). It is permissively licensed (Apache 2.0), well-supported across runtimes, multilingual, and the strongest 7B-class general assistant in 2026.

Beyond the first model, the practical 2026 family map:

  • Qwen 2.5 (Alibaba; Apache 2.0 for most sizes, Qwen's own licenses for 3B and 72B): 0.5B / 1.5B / 3B / 7B / 14B / 32B / 72B. Strongest general assistant per parameter. Donor for our surgery work.
  • Llama 3 / 3.1 / 3.2 / 3.3 (Meta, custom community license): 1B / 3B / 8B / 70B / 405B. Heavy reasoning, English-leaning. Read the license before shipping.
  • Mistral / Mixtral (Mistral AI, Apache 2.0 except premier models): 7B Mistral, 8×7B and 8×22B Mixtral MoE, Codestral. Strong code, fast inference.
  • DeepSeek V3 / V4-Flash (DeepSeek, custom): large MoE families with very high capability per unit cost. V4-Flash is 284B total / 13B active and runs on 8 GB via expert streaming.
  • Gemma 2 / Gemma 3 (Google, Gemma license): 2B / 9B / 27B for Gemma 2, 1B / 4B / 12B / 27B for Gemma 3. Strong instruction-following, restrictive license; read it before commercial use.
  • Phi-4 (Microsoft, MIT): 14B with strong reasoning per parameter. Excellent for code and math.

For specialised tasks the field is wider: Codestral / DeepSeek-Coder for code, Qwen2.5-Coder 32B for code-with-reasoning, Whisper for ASR, SDXL / Flux for images, OpenVoice / XTTS for TTS, and Gemma-2-2B / Phi-3 mini for tiny resource budgets.

5 · Download the weights.

Most open-weights models are released through Hugging Face. The original release is usually in safetensors (PyTorch). Quantised GGUF mirrors are produced by community packagers (TheBloke historically, the model authors themselves more recently). For a runtime that uses GGUF, you just need the .gguf file.

# Direct download (ungated public model)
curl -L -o qwen2.5-7b-instruct-q4_k_m.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf"

# Or via Ollama (handles everything)
ollama pull qwen2.5:7b

For gated models (Llama 3, some DeepSeek tiers) you need a Hugging Face account and to accept the license once on the model page.
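A failed or unauthorised download often leaves an HTML error page saved under the .gguf name. A quick sanity check is to read the file header: GGUF files begin with the four ASCII bytes "GGUF" followed by a little-endian 32-bit version number. The helper below is a minimal sketch; the function name is hypothetical, and it checks only the header, not the integrity of the whole file.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF version number, or None if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    # A truncated file or an HTML error page will fail this check.
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling read_gguf_version("qwen2.5-7b-instruct-q4_k_m.gguf") right after the download should return an integer; None means the file is something else and should be fetched again.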

6 · First inference.

# Ollama
ollama run qwen2.5:7b
>>> What is the working definition of artificial intelligence?

# llama.cpp
./llama-cli -m qwen2.5-7b-instruct-q4_k_m.gguf \
            -p "What is the working definition of artificial intelligence?" \
            -n 256 -ngl 99

# llama.cpp server (OpenAI-compatible API on localhost:8080)
./llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf -ngl 99

# In a second terminal, once the server reports it is listening:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"hello"}]}'

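The -ngl flag sets how many layers llama.cpp places on the GPU; 99 is more than the model has, so every layer is offloaded. On an RTX 3060 Ti running Qwen 2.5 7B Q4_K_M the entire model fits, so set it to 99 and forget it. With less VRAM, lower -ngl until the model loads and the remaining layers run on the CPU; Ollama picks this split automatically.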
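The curl call to the local server translates directly into application code. The sketch below uses only the Python standard library and assumes llama-server is listening on localhost:8080 as in the command above; the helper names (build_chat_request, extract_reply, ask) are made up for this example.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=256):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a chat completions JSON response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


def ask(prompt, **kwargs):
    """Send one prompt to the local server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(resp.read())
```

With the server running, print(ask("hello")) should print the model's reply. Ollama also exposes an OpenAI-compatible route, so passing base_url="http://localhost:11434" together with a "model" field in the payload should work there too, though that variant is untested here.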

7 · Tuning for speed.

Most defaults are already good. The four tunes that matter:

  • DP4A int8 matmul. The DP4A dot-product instruction is available on NVIDIA GPUs from compute capability 6.1 (Pascal) onward, so every RTX card has it. If your runtime puts it behind a build flag, enable it. CyberdyneLabs Frankenstellm (binary: gigachad_native) sees +59% on this single change (18.27 → 28.99 tok/s on 7B Q4).
  • Batch > 1. Many runtimes can batch token generation across multiple sequences. On our tg128 run we see 41.69 tok/s, against 28.99 single-sequence, on the same hardware.
  • Hologram cache. Identical prompt repeats are common in agentic loops and chat. Caching the full response keyed on prompt hash takes the response time from 860 ms to 1 ms (860×). Frankenstellm ships this; llama.cpp does not natively but can be wrapped.
  • KV-cache offload. For long contexts (32k+) the KV cache may dominate VRAM. Keep it in CPU memory instead; on Apple Silicon the cache already shares unified memory with the weights, so the limit is total RAM.
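The exact-match caching idea in the list above can be wrapped around any runtime. What follows is a minimal sketch, not Frankenstellm's implementation: the class name, the key format (model name plus prompt), and the in-memory dict are all assumptions. It is only safe when a repeated prompt may legitimately receive the same answer, for example with deterministic decoding.

```python
import hashlib


class ResponseCache:
    """Exact-match response cache keyed on a SHA-256 of model name + prompt."""

    def __init__(self, generate):
        self._generate = generate  # callable: (model, prompt) -> str
        self._store = {}           # key -> cached reply
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        raw = f"{model}\x00{prompt}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def __call__(self, model, prompt):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        reply = self._generate(model, prompt)  # the slow path: real inference
        self._store[key] = reply
        return reply
```

Wrap whatever function performs inference, for instance cached = ResponseCache(my_generate), and call cached("qwen2.5:7b", prompt) in place of the original. Any change to the prompt, including whitespace, produces a miss.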

8 · Very large MoE on small VRAM.

The technique that makes a 284 B-total MoE fit on 8 GB is expert streaming: at each layer of an MoE model, only the active expert sub-networks for the current token are loaded from disk into VRAM. Inactive experts stay on disk. The bottleneck shifts from compute to I/O. The honest finding from V4-Flash: on consumer SSDs the disk side does not fully keep up — even after our PLANCK_PACK contiguous expert layout brought DMA throughput from 0.13 GB/s to 0.85 GB/s (6.5×), the full 43-layer text loop runs at 0.16 tok/s, with ~13.6 GB streamed per 8-token run. Synthetic warm decode where most experts hit cache: 1.86 tok/s. The path to faster is route-aware expert eviction and a cache-aware hot list, not bigger compute.
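Expert streaming with a hot-expert cache can be pictured as a least-recently-used cache sitting between the router and the disk. The toy below illustrates only that idea under stated assumptions: the class and function names are hypothetical, the loader is a stand-in for a real disk read, and the actual engine's eviction policy is not documented here.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache of expert weights held in (simulated) VRAM."""

    def __init__(self, capacity, load_from_disk):
        self.capacity = capacity        # max experts resident at once
        self._load = load_from_disk     # callable: (layer, expert_id) -> weights
        self._resident = OrderedDict()  # (layer, expert_id) -> weights
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self._resident:
            self.hits += 1
            self._resident.move_to_end(key)  # mark as most recently used
            return self._resident[key]
        self.misses += 1
        weights = self._load(layer, expert_id)  # the slow disk read
        self._resident[key] = weights
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)  # evict least recently used
        return weights


def run_token(cache, routed):
    """Fetch every expert the router chose for one token.

    `routed` maps layer index -> list of expert ids picked by the router.
    """
    return [cache.get(layer, e) for layer, ids in routed.items() for e in ids]
```

Every miss is a disk read, so the hit rate decides throughput. A route-aware policy would replace the plain LRU eviction with one informed by which experts the router tends to pick.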

Metric | Value | Notes
Model | 284 B | DeepSeek V4-Flash MoE
VRAM used | ~7 GB | Hot experts + KV + scaffolding
Disk used | 159 GB | 46 safetensors shards (~69 k tensors)
Warm decode | 1.86 tok/s | real-weight chain, ~14× Python warm
Full 43-layer text | 0.16 tok/s | DMA-bound, ~8.59 GB VRAM
PLANCK_PACK gain | 6.5× | DMA: 0.13 → 0.85 GB/s

We are honest about the bottleneck: 89 % of decode wall-time is expert IO. This is not "GPT-4 on a gamer card." It is a working demonstration that you can run a model whose dense equivalent would need 300+ GB of HBM on a single $300 GPU, with a real bottleneck that is solvable with NVMe RAID and prefetch. The PLANCK pack format that backs this — mmap-able, byte-verified — is documented at /glossary#planck.
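The disk-bound claim can be checked with arithmetic on the figures quoted on this page. The sketch below assumes that the 13.6 GB streamed per 8-token run, the 0.16 tok/s rate, and the 89 % IO share all describe the same run, which the page does not state explicitly. The function names are made up for this example.

```python
def effective_io_rate(gb_streamed, tokens, tok_per_s, io_fraction):
    """Average GB/s achieved by expert reads over a whole decode run."""
    wall_s = tokens / tok_per_s      # total wall-clock time of the run
    io_s = wall_s * io_fraction      # share of that time spent on expert IO
    return gb_streamed / io_s


def io_bound_ceiling(gb_streamed, tokens, disk_gb_per_s):
    """tok/s if the same bytes were read at a sustained disk rate, IO only."""
    return tokens / (gb_streamed / disk_gb_per_s)


if __name__ == "__main__":
    rate = effective_io_rate(13.6, 8, 0.16, 0.89)
    print(f"average expert-read rate: {rate:.2f} GB/s")
    ceiling = io_bound_ceiling(13.6, 8, 0.85)
    print(f"IO-only ceiling at 0.85 GB/s: {ceiling:.2f} tok/s")
```

Under that assumption the reads average roughly 0.31 GB/s over the run, below the 0.85 GB/s peak, and sustaining the peak would cap decode near 0.5 tok/s unless fewer bytes are streamed per token. Both results point the same way as the text: better caching and prefetch matter more than faster compute.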

The lesson. "Impossible on consumer hardware" claims about LLMs in 2024 turned out to be wrong by 2026. Quantisation, expert streaming, KV-cache compression, hologram caching, and clean-room kernel work compound. The 8 GB consumer GPU running a 284 B-total / 13 B-active MoE model is the data point. Treat all future "impossible" claims with the same suspicion.

9 · Cost — local versus cloud.

Approach | Setup cost | Cost per 1M tokens | Break-even
OpenAI gpt-4o | $0 | ~$5–15 in / $15–60 out | Ideal for <1M tok/month
Anthropic claude-sonnet-4 | $0 | ~$3–15 in / $15–75 out | Same as above
Together / Anyscale (Llama 70B) | $0 | ~$0.6–3 | Mid-volume sweet spot
Local: RTX 3060 Ti, 7B | $280 | ~$0.005 (electricity only) | Pays back at ~50–100M tok
Local: RTX 4090, 70B Q4 | $1500 | ~$0.02 | Pays back at ~100M–1B tok

Caveats: cloud pricing changes constantly, electricity costs vary by region, the labour to run local has a non-zero hourly cost. The honest summary: if you are doing real volume (more than ~10 million tokens per month) with non-trivial privacy needs, local wins on cost and sovereignty. Below that threshold, cloud APIs are cheaper than the engineering hours.
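The break-even column is hardware cost divided by the per-token saving. A small calculator makes the assumptions explicit; the function names are hypothetical. It ignores labour and assumes the local and cloud models are interchangeable for the task, which is rarely exactly true.

```python
def break_even_mtok(hardware_usd, api_usd_per_mtok, local_usd_per_mtok):
    """Millions of tokens after which buying the GPU beats the API."""
    saving = api_usd_per_mtok - local_usd_per_mtok
    if saving <= 0:
        raise ValueError("the API is already cheaper per token; no break-even")
    return hardware_usd / saving


def months_to_break_even(hardware_usd, api_usd_per_mtok, local_usd_per_mtok,
                         mtok_per_month):
    """Months of use needed to recover the hardware cost."""
    mtok = break_even_mtok(hardware_usd, api_usd_per_mtok, local_usd_per_mtok)
    return mtok / mtok_per_month


if __name__ == "__main__":
    # $280 card, $0.005 per million tokens in electricity (figures from the table).
    for api_price in (3.0, 5.0):
        mtok = break_even_mtok(280, api_price, 0.005)
        months = months_to_break_even(280, api_price, 0.005, 10)
        print(f"API at ${api_price}/Mtok: {mtok:.0f}M tok, "
              f"{months:.1f} months at 10M tok/month")
```

Against an API priced at $3 to $5 per million tokens the $280 card breaks even between roughly 56M and 93M tokens, consistent with the ~50–100M range in the table.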

10 · Our benchmarks.

Every number on this page comes from our own measurements. Reproducibility commands are in the linked report files. License: CC-BY-SA 4.0.

Test | Hardware | Result | Source
Q4 7B (Physarium-7B) | RTX 3060 Ti | 83.58 tok/s | /r/CURRENT_TRUTH_LEDGER
Q4 7B native default | RTX 3060 Ti | 18.27 tok/s | /r/CURRENT_TRUTH_LEDGER
Q4 7B + DP4A | RTX 3060 Ti | 28.99 tok/s (+59 %) | /r/CURRENT_TRUTH_LEDGER
Q4 7B DP4A tg128 | RTX 3060 Ti | 41.69 tok/s | /r/CURRENT_TRUTH_LEDGER
Hologram cache hit | RTX 3060 Ti | 860 ms → 1 ms (860×) | /r/CURRENT_TRUTH_LEDGER
DeepSeek V4-Flash 284B / 13B active (Surgery Case 01) | RTX 3060 Ti, 8 GB · WSL2 · 14 GB RAM | 1.86 → 0.16 tok/s | /r/V4_FLASH_TECH_BRIEF (159 GB / 46 shards / disk-I/O-bound)
HumanEval Mode-C | Frankenstellm 7B + organs | 81/164 | /r/CURRENT_TRUTH_LEDGER
MBPP Mode-C | Frankenstellm 7B + organs | 60/100 | /r/CURRENT_TRUTH_LEDGER
ARIZ TRIZ strict JSON | 0.5B organ | 88/100 | /r/BD7_TRIZ_SURGERY_FINAL
BD9 json_repair organ | 0.5B organ | 10/10 GREEN | /r/BD9_JSON_REPAIR_FINAL
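The derived figures in the table (percentage gains, speedups, pass rates) can be recomputed from the raw numbers. This is a convenience sketch with hypothetical helper names; it recomputes ratios only and does not reproduce any measurement.

```python
def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after / before - 1) * 100


def speedup(before, after):
    """How many times larger `after` is than `before`."""
    return after / before


def pass_rate(passed, total):
    """Share of passed cases, in percent."""
    return 100 * passed / total


if __name__ == "__main__":
    print(f"DP4A gain: {percent_gain(18.27, 28.99):.1f} %")
    print(f"PLANCK_PACK DMA speedup: {speedup(0.13, 0.85):.1f}x")
    print(f"HumanEval pass rate: {pass_rate(81, 164):.1f} %")
    print(f"MBPP pass rate: {pass_rate(60, 100):.1f} %")
```

The first two lines confirm the rounded +59 % and 6.5× figures; the last two express the HumanEval and MBPP counts as percentages for comparison with published leaderboards.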

The full reports archive (66 dated reports, CC-BY-SA 4.0) is at /r/. The doctrine pack (24 documents including the truth-ledger format) is at /downloads.