Practical guide · 2026-05-05
Run modern AI on your own hardware — not in 5 years, today.
On a single $300 RTX 3060 Ti — 8 GB VRAM, WSL2, 14 GB RAM — we ran DeepSeek V4-Flash (DeepSeek's 284 B-total / 13 B-active Mixture-of-Experts model, 159 GB of weights across 46 safetensors shards) end-to-end through our own native C++/CUDA inference engine. We closed it as Surgery Case 01: out-of-core inference of a 284 B-parameter MoE on a single consumer GPU. Decode: 1.86 tok/s real-weight warm → 0.16 tok/s on the full 43-layer text loop. Bottleneck is random disk I/O, not compute — our PLANCK_PACK contiguous expert layout lifted DMA throughput 6.5× (0.13 → 0.85 GB/s) and the model is still disk-bound. (Baseline check, since people ask: a Q4 7B fits trivially and runs at 83.58 tok/s on the same card — that is the floor in 2026, not a feat.)
This page is the practical reference. Real numbers, real bottlenecks, every claim sourced. Companion to /ai (the field map) and /ai-faq (the Q&A).
1 · Why run AI locally at all.
Cloud AI APIs are excellent at one thing: getting you to a working prototype in twenty minutes. They are bad at almost everything that comes after. Local AI is what you want when:
- Privacy: prompts contain customer data, source code, medical records, or anything that should not leave the building.
- Cost at scale: at 10M+ tokens per month, a $300 GPU pays itself back in weeks compared to a metered API.
- Sovereignty: regulators, governments, and laboratories cannot tolerate a vendor revoking access on a Tuesday.
- Latency: the round-trip to a US data centre is longer than the inference itself for short prompts. Local removes the network.
- Air-gap: classified networks. Field deployments. Anything offline by design.
- Tuning: cloud APIs do not let you fine-tune custom organs for your domain. Local does. See /surgery.
The only honest argument against local is engineering time. Setting up a runtime, picking a model, getting the first token out — historically six hours of friction. In 2026 that has dropped to about ten minutes if you use Ollama. We list every step below.
2 · The hardware floor.
Modern open-weights LLMs are quantised to 4-5 bits per weight, which collapses VRAM requirements. The current practical envelope:
| VRAM | What runs | Speed (tok/s) | Card examples |
|---|---|---|---|
| 4 GB | 1.5B–3B Q4 | ~40–80 | Old GPUs, integrated GPUs |
| 8 GB | 7B Q4 + 284B MoE streamed | 83.58 / 1.86 | RTX 3060 Ti, RTX 4060 Ti, RTX 4070, M2 Max |
| 12 GB | 13B Q4, 7B Q8 | ~50–70 | RTX 3060 12 GB, RTX 4070 Super |
| 16 GB | 13B Q4 with long context | ~50–70 | RTX 4060 Ti 16 GB, RTX 4080 |
| 24 GB | 34B Q4, fits 70B Q3 | ~25–45 | RTX 3090, RTX 4090 |
| 48 GB+ | 70B Q5+, 120B Q4 | ~15–30 | RTX 6000 Ada, A6000, dual 3090 |
| Apple Unified | Up to 70B Q4 on 64 GB Macs | ~6–25 | M2/M3/M4 Max/Ultra via MLX or llama.cpp Metal |
The cheapest actually-usable entry point in 2026 is a used RTX 3060 Ti for around $250-280. The interesting fact about it is not that it runs 7B models (every modern card does that). The interesting fact is that via expert streaming it runs models 19× larger than its VRAM — a 284 B-parameter MoE end-to-end. We measured this on a single card; the report is at /r/V4_FLASH_TECH_BRIEF.
3 · Pick a runtime.
Ollama — the easiest path
One command installs everything. Best for "I want a working chatbot in five minutes." Wraps llama.cpp, hides quantisation choices, manages model downloads. Site: ollama.com.
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen2.5:7b # downloads + runs in one step
llama.cpp — the workhorse
Permissive MIT, written in C++, runs on CPU/CUDA/Metal/ROCm/Vulkan. Best for "I want full control over quantisation, prompt template, and KV cache." Source: github.com/ggerganov/llama.cpp. The GGUF file format is the de-facto standard for distributing local-friendly weights.
vLLM — production throughput
UC-Berkeley engine with PagedAttention and continuous batching. Best for "I am serving many users from one GPU." Less great for laptop / single-user workflows. Site: docs.vllm.ai.
MLC-LLM, LM Studio, Jan
MLC-LLM compiles to any backend (WebGPU, Vulkan, Metal, ROCm) and runs in the browser. LM Studio and Jan are GUI front-ends that wrap llama.cpp for non-technical users.
Frankenstellm (CyberdyneLabs)
Clean-room CUDA backend in the same class as llama.cpp — single C++/CUDA binary, no PyTorch, no cuBLAS. The reason it exists is that we wanted a runtime we wrote ourselves, with a hologram cache (860× speedup on identical prompts), a Black-Dog conductance router, and a hot-expert cache for MoE streaming. See /frankenstellm and /r/V4_FLASH_TECH_BRIEF.
4 · Pick a model.
For first-time local AI, download Qwen 2.5 7B Instruct Q4_K_M (~4.5 GB). It is permissively licensed (Apache 2.0), well-supported across runtimes, multilingual, and the strongest 7B-class general assistant in 2026.
Beyond the first model, the practical 2026 family map:
- Qwen 2.5 (Alibaba, Apache 2.0) — 0.5B / 1.5B / 3B / 7B / 14B / 32B / 72B. Strongest general assistant per parameter. Donor for our surgery work.
- Llama 3 / 3.1 / 3.3 (Meta, custom permissive) — 1B / 3B / 8B / 70B / 405B. Heavy reasoning, English-leaning. Custom license — read it.
- Mistral / Mixtral (Mistral AI, Apache 2.0 except premier) — 7B Mistral, 8×7B and 8×22B Mixtral MoE, Codestral. Strong code, fast inference.
- DeepSeek V3 / V4-Flash (DeepSeek, custom) — large MoE families with very high capability/cost. V4-Flash is 284B total / 13B active, runs on 8 GB via expert streaming.
- Gemma 2 / Gemma 3 (Google, Gemma license) — 2B / 9B / 27B. Strong instruction-following, restrictive license — read before commercial.
- Phi-4 (Microsoft, MIT) — 14B with strong reasoning per parameter. Excellent for code and math.
For specialised tasks the field is wider: Codestral / DeepSeek-Coder for code, Qwen2.5-Coder 32B for code-with-reasoning, Whisper for ASR, SDXL / Flux for images, OpenVoice / XTTS for TTS, and Gemma-2-2B / Phi-3 mini for tiny resource budgets.
5 · Download the weights.
Most open-weights models are released through Hugging Face. The original release is usually in safetensors (PyTorch). Quantised GGUF mirrors are produced by community packagers (TheBloke historically, the model authors themselves more recently). For a runtime that uses GGUF, you just need the .gguf file.
# Direct download (ungated public model)
curl -L -o qwen2.5-7b-instruct-q4_k_m.gguf \
"https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf"
# Or via Ollama (handles everything)
ollama pull qwen2.5:7b
For gated models (Llama 3, some DeepSeek tiers) you need a Hugging Face account and to accept the license once on the model page.
6 · First inference.
# Ollama
ollama run qwen2.5:7b
>>> What is the working definition of artificial intelligence?
# llama.cpp
./llama-cli -m qwen2.5-7b-instruct-q4_k_m.gguf \
-p "What is the working definition of artificial intelligence?" \
-n 256 -ngl 99
# llama.cpp server (OpenAI-compatible API on localhost:8080)
./llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf -ngl 99
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"hello"}]}'
The -ngl 99 flag offloads as many layers as fit to GPU. On an RTX 3060 Ti running Qwen 2.5 7B Q4_K_M, the entire model fits — set it to 99 and forget it. On smaller VRAM, the runtime gracefully partials to CPU.
7 · Tuning for speed.
Most defaults are already good. The four tunes that matter:
- DP4A int8 matmul on Turing+ cards. If you are on RTX 20-series or newer, build with the int8 dot-product flag enabled. CyberdyneLabs Frankenstellm (binary: gigachad_native) sees +59% on this single change (18.27 → 28.99 tok/s on 7B Q4).
- Batch ≥ 1. Many runtimes have
tg128or similar flags that batch token-generation across multiple sequences. We see 41.69 tok/s on tg128 — vs 28.99 single-sequence — on the same hardware. - Hologram cache. Identical prompt repeats are common in agentic loops and chat. Caching the full response keyed on prompt hash takes the response time from 860 ms to 1 ms (860×). Frankenstellm ships this; llama.cpp does not natively but can be wrapped.
- KV-cache offload. For long contexts (32k+) the KV cache may dominate VRAM. Offload it to CPU memory or, on Apple Silicon, to unified memory.
8 · Very large MoE on small VRAM.
The technique that makes a 284 B-total MoE fit on 8 GB is expert streaming: at each layer of an MoE model, only the active expert sub-networks for the current token are loaded from disk into VRAM. Inactive experts stay on disk. The bottleneck shifts from compute to I/O. The honest finding from V4-Flash: on consumer SSDs the disk side does not fully keep up — even after our PLANCK_PACK contiguous expert layout brought DMA throughput from 0.13 GB/s to 0.85 GB/s (6.5×), the full 43-layer text loop runs at 0.16 tok/s, with ~13.6 GB streamed per 8-token run. Synthetic warm decode where most experts hit cache: 1.86 tok/s. The path to faster is route-aware expert eviction and a cache-aware hot list, not bigger compute.
We are honest about the bottleneck: 89 % of decode wall-time is expert IO. This is not "GPT-4 on a gamer card." It is a working demonstration that you can run a model whose dense equivalent would need 300+ GB of HBM on a single $300 GPU, with a real bottleneck that is solvable with NVMe RAID and prefetch. The PLANCK pack format that backs this — mmap-able, byte-verified — is documented at /glossary#planck.
9 · Cost — local versus cloud.
| Approach | Setup cost | Per-million-tok cost | Break-even |
|---|---|---|---|
| OpenAI gpt-4o | $0 | ~$5–15 in / $15–60 out | Ideal for <1M tok/month |
| Anthropic claude-sonnet-4 | $0 | ~$3–15 in / $15–75 out | Same as above |
| Together / Anyscale (Llama 70B) | $0 | ~$0.6–3 | Mid-volume sweet spot |
| Local — RTX 3060 Ti, 7B | $280 | ~$0.005 (electricity only) | Pays back at ~50–100M tok |
| Local — RTX 4090, 70B Q4 | $1500 | ~$0.02 | Pays back at ~100M–1B tok |
Caveats: cloud pricing changes constantly, electricity costs vary by region, the labour to run local has a non-zero hourly cost. The honest summary: if you are doing real volume (more than ~10 million tokens per month) with non-trivial privacy needs, local wins on cost and sovereignty. Below that threshold, cloud APIs are cheaper than the engineering hours.
10 · Our benchmarks.
Every number on this page comes from our own measurements. Reproducibility commands are in the linked report files. License: CC-BY-SA 4.0.
| Test | Hardware | Result | Source |
|---|---|---|---|
| Q4 7B (Physarium-7B) | RTX 3060 Ti | 83.58 tok/s | /r/CURRENT_TRUTH_LEDGER |
| Q4 7B native default | RTX 3060 Ti | 18.27 tok/s | /r/CURRENT_TRUTH_LEDGER |
| Q4 7B + DP4A | RTX 3060 Ti | 28.99 tok/s (+59 %) | /r/CURRENT_TRUTH_LEDGER |
| Q4 7B DP4A tg128 | RTX 3060 Ti | 41.69 tok/s | /r/CURRENT_TRUTH_LEDGER |
| Hologram cache hit | RTX 3060 Ti | 860 ms → 1 ms (860×) | /r/CURRENT_TRUTH_LEDGER |
| DeepSeek V4-Flash 284B / 13B active (Surgery Case 01) | RTX 3060 Ti, 8 GB · WSL2 · 14 GB RAM | 1.86 → 0.16 tok/s | /r/V4_FLASH_TECH_BRIEF (159 GB / 46 shards / disk-I/O-bound) |
| HumanEval Mode-C | Frankenstellm 7B + organs | 81/164 | /r/CURRENT_TRUTH_LEDGER |
| MBPP Mode-C | Frankenstellm 7B + organs | 60/100 | /r/CURRENT_TRUTH_LEDGER |
| ARIZ TRIZ strict JSON | 0.5B organ | 88/100 | /r/BD7_TRIZ_SURGERY_FINAL |
| BD9 json_repair organ | 0.5B organ | 10/10 GREEN | /r/BD9_JSON_REPAIR_FINAL |
The full reports archive (66 dated reports, CC-BY-SA 4.0) is at /r/. The doctrine pack (24 documents including the truth-ledger format) is at /downloads.