CyberdyneLabs Program 01 / 05

Surgery.

A laboratory for the precise refinement of large language models — where models are not trained, but operated upon.

Established 2026 · Status Active · Native runtime Online
SUBJECT DeepSeek V4-Flash · 284B / 13B active · MoE · 1M ctx
PROCEDURE Targeted distillation · expert streaming

About

We treat models as anatomy.

The world has produced extraordinary open foundation models, and almost no rigorous practice for shaping them after release. Our laboratory exists to close that gap.

Take an artefact like DeepSeek V4-Flash — 284 billion parameters in total, 13 billion of them active for any given token, an open-weight Mixture-of-Experts model trained on more than 32 trillion tokens, with a context window of one million tokens. A frontier-grade artefact distributed under an MIT license. The standard answer to making such a model behave the way you need it to is fine-tuning — and fine-tuning, in practice, is hope dressed in machinery. You feed in data; you wait; you ship whatever comes out; you cannot say with certainty what you changed.

Surgery operates differently. We treat the weight space as anatomy. We open the model up, we identify the structures responsible for a behaviour we want to change, and we modify them — locally, measurably, reversibly. Where the field at large grows models by accumulation, we refine them by intervention.
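
In code terms, an operation of this kind reduces to three steps: locate a structure, apply a bounded edit, and keep the original weights so the change can be undone. The sketch below shows the shape of such an intervention in PyTorch; the module path, the scaling edit, and the `evaluate` call are hypothetical stand-ins, not the laboratory's actual tooling.

```python
import torch

def patch_module(model, name, edit_fn):
    """Apply a local, reversible edit to one named parameter.

    Returns a restore() closure so the operation can be undone
    exactly: the original tensor is kept aside for reversal.
    """
    param = dict(model.named_parameters())[name]
    original = param.detach().clone()      # keep for reversal
    with torch.no_grad():
        param.copy_(edit_fn(param))        # the local intervention
    def restore():
        with torch.no_grad():
            param.copy_(original)
    return restore

# Hypothetical usage: dampen one expert's gate projection, measure
# the behavioural delta, and roll back if the change regresses.
# restore = patch_module(model, "layers.17.moe.gate.weight",
#                        lambda w: w * 0.5)
# if evaluate(model).regressed:   # evaluate() is assumed, not shown
#     restore()
```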

The work spans every layer of the stack. The compression formats that let very large models run on small machines. The native compiled runtime that replaces the research-grade scripting most laboratories ship with. The internal memory that gives a refined model something resembling continuity. The verifier that tells us, on every output, whether the answer is supported. None of this exists in isolation. All of it serves the same goal: to produce systems whose behaviour we understand and whose claims we can defend.

The Stack

Six systems. One operating theatre.

01
The Brain

A central reasoning model — refined, not retrained.

At the centre of every system we ship sits a single high-capacity reasoning model, derived from an open foundation through targeted intervention. It carries the breadth of a frontier model and the discipline of a system whose behaviour was shaped one circuit at a time.

Class · Top-level reasoning
Origin · Open foundation
02
The Organs

A farm of small specialists.

Around the central model sits a fleet of compact specialists — each one a sub-billion-parameter expert at a single narrow function: structured output, code skeleton, claim extraction, contradiction analysis. They run cheaply, in parallel, and answer to the model above them.

Population · 5+ specialists
Each · < 1B parameters
03
The Memory

A structured spine of persistent recall.

A reasoning system without memory begins every encounter from zero. Ours does not. A structured archive of hundreds of indexed volumes, addressable down to the line, gives the assembled system something other models lack: a continuous record of what it has thought, what it has been told, and where each fact came from.

Volumes · 578 indexed
Recall · Volume / line precise
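
As an illustration of what volume-and-line-precise recall implies, here is a minimal line-addressable archive in Python. The names (`Pointer`, `MemorySpine`, `vol-0217`) are invented for the example; the production spine is assumed to be considerably richer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pointer:
    volume: str   # e.g. "vol-0217"
    line: int     # 1-indexed line within the volume

class MemorySpine:
    """Line-addressable archive: every recall returns from a source
    that can be named, down to the line."""

    def __init__(self):
        self.volumes: dict[str, list[str]] = {}

    def ingest(self, volume_id: str, text: str) -> None:
        self.volumes[volume_id] = text.splitlines()

    def recall(self, ptr: Pointer) -> str:
        # A failed lookup raises rather than guesses: recall is auditable.
        return self.volumes[ptr.volume][ptr.line - 1]

spine = MemorySpine()
spine.ingest("vol-0217", "First fact.\nSecond fact.")
assert spine.recall(Pointer("vol-0217", 2)) == "Second fact."
```
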
04
The Bloodstream

A routing field that learns its own paths.

Every request flows through a self-organising routing layer that decides which organs to wake, which memories to retrieve, and which paths to reinforce or starve. It is, in effect, a circulatory system — quiet, adaptive, and the reason the assembled body responds as one.

Substrate · Adaptive routing
Property · Self-pruning
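
One plausible reading of reinforce-or-starve is a usage-weighted routing table with uniform decay and a pruning floor: paths that keep helping stay warm, paths that fall silent wither away. The sketch below is illustrative only; the organ names, decay rate, and floor are assumptions, not the production router.

```python
class AdaptiveRouter:
    """Usage-weighted routing: paths that fire and help are reinforced,
    every path decays a little each step, and paths that fall below a
    floor are pruned outright."""

    def __init__(self, organs, decay=0.99, floor=0.05):
        self.weights = {name: 1.0 for name in organs}
        self.decay, self.floor = decay, floor

    def route(self, k=2):
        # Wake the k strongest paths for this request.
        ranked = sorted(self.weights, key=self.weights.get, reverse=True)
        return ranked[:k]

    def feedback(self, used, helped):
        for name in list(self.weights):
            self.weights[name] *= self.decay        # uniform decay
            if name in used and helped:
                self.weights[name] += 0.1           # reinforcement
            if self.weights[name] < self.floor:
                del self.weights[name]              # starved: pruned

router = AdaptiveRouter(["structurer", "coder", "claims", "contradict"])
woken = router.route()
router.feedback(used=woken, helped=True)
```
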
05
The Verifier

A hard gate against fabrication.

No claim leaves the system without passing a strict verifier. Every assertion that references memory must carry a pointer to its source. Every output that cannot be backed up is flagged as such, in plain language, before it reaches the user. The default, in our laboratory, is suspicion.

Output · Source-pointed
Default · Skeptical
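
The gate itself can be stated in a few lines: a claim either carries a pointer that resolves, or it is labelled unverified before it leaves. A minimal sketch with invented types; the `resolve` hook stands in for whatever the real verifier consults.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source: tuple[str, int] | None = None   # (volume id, line), or nothing

def gate(claims, resolve):
    """Hard gate: a claim passes only if its pointer resolves through
    resolve(volume, line); anything else is flagged in plain language."""
    out = []
    for claim in claims:
        try:
            volume, line = claim.source        # no pointer -> TypeError
            resolve(volume, line)              # bad pointer -> KeyError
            out.append(claim.text)
        except (TypeError, KeyError, IndexError):
            out.append(f"[unverified] {claim.text}")
    return out

# Hypothetical usage against the memory sketch above:
#   gate(claims, resolve=lambda v, l: spine.recall(Pointer(v, l)))
```
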
06
The Body

A native compiled runtime.

Most laboratories ship Python. We ship a compiled native runtime. Memory, model loading, attention kernels, tiering between fast and slow storage — all of it written in low-level systems code. The result is a complete model deployed on a single consumer GPU: less hardware, less latency, no scripting language between the model and the machine.

Language · Native compiled
Target · Single consumer GPU
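
The tiering mentioned above can be pictured as a bounded fast tier (GPU memory) in front of a large slow one (host RAM or disk), with least-recently-used eviction between them. The Python below sketches the policy only; the shipped runtime is compiled, and the capacities and names here are invented.

```python
from collections import OrderedDict

class TieredStore:
    """Two-tier weight store: a bounded fast tier (think GPU memory)
    in front of a slow tier (think host RAM or disk), with least-
    recently-used eviction from fast to slow."""

    def __init__(self, fast_capacity):
        self.fast = OrderedDict()   # name -> tensor, kept in LRU order
        self.slow = {}              # everything lives here at rest
        self.capacity = fast_capacity

    def get(self, name):
        if name in self.fast:
            self.fast.move_to_end(name)       # refresh recency
            return self.fast[name]
        tensor = self.slow[name]              # slow-tier fetch
        self.fast[name] = tensor
        if len(self.fast) > self.capacity:
            self.fast.popitem(last=False)     # evict the coldest entry
        return tensor
```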

Selected Work

What this laboratory has done.

CASE 01

An open frontier-grade model, deployed on a single consumer GPU.

We took DeepSeek V4-Flash — an open-weight Mixture-of-Experts foundation model with 284 billion parameters in total, 13 billion active per token, a one-million-token context window, an MIT license, and the kind of inference profile that ordinarily requires multiple data-centre-class GPUs to operate — and we refined it down to a form that runs end-to-end on consumer hardware.

The refinement preserved the model's reasoning and removed almost everything else. The resulting system reasons through full-length tasks on a single workstation, behind a corporate firewall, without remote dependency. What the operation showed us, more than the deployment itself, was that the bottleneck of large-model inference is not the reasoning at all. It is the cost of moving the model's specialised parts in and out of memory.

Subject · DeepSeek V4-Flash
Total / active · 284 B / 13 B
Context window · 1 M tokens
Hardware target · 1 × consumer GPU
CASE 02

A small specialist with one fifth of its weights removed — and no measurable loss.

We performed targeted excision on a sub-billion-parameter open foundation model — specifically Qwen 2.5 0.5B, used as donor tissue for the operation. Roughly twenty per cent of its weights were removed, guided by an internal signal that identifies parameters with no measurable effect on the model's behaviour. Output quality remained within noise of the original. Throughput was preserved.

This was, for us, the proof of principle. A model is not a single inseparable thing. It is a structure with healthy and dead tissue, and the healthy tissue can be isolated.

Donor · Qwen 2.5 0.5B
Weights removed · ≈ 20 %
Quality drift · Within noise
Throughput · Preserved
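
The internal signal itself is not described; as a stand-in, the sketch below prunes by weight magnitude, the simplest published proxy for "parameters with no measurable effect", zeroing roughly the smallest twenty per cent of each weight matrix.

```python
import torch

@torch.no_grad()
def excise(model, fraction=0.20):
    """Targeted-excision stand-in: zero the smallest-magnitude weights.

    The laboratory's actual saliency signal is internal; magnitude
    pruning is used here purely as an illustration.
    """
    removed = total = 0
    for name, p in model.named_parameters():
        if p.dim() < 2:                        # skip biases and norms
            continue
        k = max(1, int(fraction * p.numel()))
        threshold = p.abs().flatten().kthvalue(k).values
        mask = p.abs() > threshold
        p.mul_(mask)                           # excision in place
        removed += (~mask).sum().item()
        total += p.numel()
    return removed / total                     # fraction actually removed
```
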
CASE 03

A custom expert-streaming format that gave the inference loop a six-fold speed-up.

The single largest cost in operating a model the size of DeepSeek V4-Flash on consumer hardware is not arithmetic. It is the choreography of moving the model's hundreds of specialised experts in and out of working memory, again and again, on every forward pass.

By reorganising the way these specialised parts are packed and streamed from disk, we turned the most expensive operation in the inference pipeline into a tractable one. The same model, on the same hardware, ran roughly six times faster on its decode loop. The format is internal, instrumented, and reproducible. We use it now as the substrate for everything else.

Bottleneck · Expert streaming
Speed-up · ≈ 6 ×
Hardware · Unchanged
Status · In production
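
The format is internal, but the general shape of the idea can be sketched: pack each layer's experts contiguously in a single file so a layer's working set is one sequential read rather than hundreds of seeks, and warm the next layer's experts on a background thread while the current layer computes. Everything below (layout, dtype, names) is an assumption.

```python
import threading
import numpy as np

class ExpertStream:
    """Sketch of an expert-streaming layout: experts for each layer are
    packed contiguously in one memory-mapped file, so fetching a layer's
    active experts is a sequential read, and the next layer's experts
    can be prefetched while the current layer computes."""

    def __init__(self, path, n_layers, n_experts, expert_size):
        # `path` is assumed to hold the packed float16 expert weights.
        self.pack = np.memmap(path, dtype=np.float16, mode="r",
                              shape=(n_layers, n_experts, expert_size))

    def fetch(self, layer, expert_ids):
        # Contiguous layout: one slice per layer, no per-expert seeks.
        return self.pack[layer][expert_ids]

    def prefetch(self, layer, expert_ids):
        # Touch the next layer's experts on a background thread so the
        # page cache is warm before the decode loop needs them.
        t = threading.Thread(
            target=lambda: self.pack[layer][expert_ids].sum())
        t.start()
        return t
```
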
CASE 04

A persistent memory of hundreds of structured volumes.

We built, for the assembled system, a structured archive of hundreds of indexed volumes: long-form raw material, micro-notes, decision logs, retrieval scaffolds. Every retrieval is auditable. Every claim that references a memory must carry a pointer to a specific volume and line, or it is flagged as unverified.

It is not a chat history and it is not a vector store. It is the spine of a system that, for the first time, can be asked where it learned that — and answer.

Volumes · 578
Micro-records · 783
Decision log · 366 entries
Retrieval · Volume / line precise

Open Tools

Engines we will open up.

Native runtime

The compiled inference engine.

The native, compiled inference loop that powers every refined model we ship. Memory tiering, attention kernels, expert streaming — written in low-level systems code, instrumented end-to-end.

In active use
Surgery toolkit

The model operating tools.

The set of utilities a researcher uses to perform model surgery — locate dead tissue, perform targeted excision, graft new behaviour, validate the result. Designed to be used by humans, audited by humans.

In active use
Memory spine

The structured archive.

The persistent memory layer — indexed volumes, micro-records, decision log. Volume-and-line precise retrieval, with provenance preserved at every level.

Under refinement
Verifier

The hard gate.

The strict verification layer that sits in front of every output. Source pointers, unverified-claim flagging, structured assertions. Refuses what it cannot back up.

In integration

Principles

How this laboratory operates.

01

Operate, don't retrain.

Where the field grows models by accumulation, we refine them by intervention. Local, targeted, measurable. We change the smallest set of parameters that produces the change we want, and we know which ones.

02

Compile what you ship.

Research code belongs in the laboratory. Production systems belong in compiled native code. The translation is not optional and it is not a future concern. It is the work.

03

Skeptical by default.

No claim leaves the system without a pointer to evidence. No memory is trusted without provenance. No operation is shipped without a reproducible benchmark. The default in this laboratory is doubt.
