CyberdyneLabs Program 01 / 05

Surgery.

A laboratory for the precise refinement of large language models — where models are not trained, but operated upon.

Established 2026 · Status Active · Native runtime Online
SUBJECT DeepSeek V4-Flash · 284B / 13B active · MoE · 1M ctx
PROCEDURE Targeted distillation · expert streaming

About

We treat models as anatomy.

The world has produced extraordinary open foundation models, and almost no rigorous practice for shaping them after release. Our laboratory exists to close that gap.

Take an artefact like DeepSeek V4-Flash — 284 billion parameters in total, 13 billion of them active for any given token, an open-weight Mixture-of-Experts model trained on more than 32 trillion tokens, with a context window of one million tokens. A frontier-grade artefact distributed under an MIT license. The standard answer to making such a model behave the way you need it to is fine-tuning — and fine-tuning, in practice, is hope dressed in machinery. You feed in data; you wait; you ship whatever comes out; you cannot say with certainty what you changed.

Surgery operates differently. We treat the weight space as anatomy. We open the model up, we identify the structures responsible for a behaviour we want to change, and we modify them — locally, measurably, reversibly. Where the field at large grows models by accumulation, we refine them by intervention.
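
In code terms, an operation of this kind reduces to three steps: locate a structure, apply a bounded edit, and keep the original weights so the change can be undone. The sketch below shows the shape of such an intervention in PyTorch; the module path, the scaling edit, and the `evaluate` call are hypothetical stand-ins, not the laboratory's actual tooling.

```python
import torch

def patch_module(model, name, edit_fn):
    """Apply a local, reversible edit to one named parameter.

    Returns a restore() closure so the operation can be undone
    exactly: the original tensor is kept aside for reversal.
    """
    param = dict(model.named_parameters())[name]
    original = param.detach().clone()      # keep for reversal
    with torch.no_grad():
        param.copy_(edit_fn(param))        # the local intervention
    def restore():
        with torch.no_grad():
            param.copy_(original)
    return restore

# Hypothetical usage: dampen one expert's gate projection, measure
# the behavioural delta, and roll back if the change regresses.
# restore = patch_module(model, "layers.17.moe.gate.weight",
#                        lambda w: w * 0.5)
# if evaluate(model).regressed:   # evaluate() is assumed, not shown
#     restore()
```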

The work spans every layer of the stack. The compression formats that let very large models run on small machines. The native compiled runtime that replaces the research-grade scripting most laboratories ship with. The internal memory that gives a refined model something resembling continuity. The verifier that tells us, on every output, whether the answer is supported. None of this exists in isolation. All of it serves the same goal: to produce systems whose behaviour we understand and whose claims we can defend.

The Stack

Six systems. One operating theatre.

01
The Brain

A central reasoning model — refined, not retrained.

At the centre of every system we ship sits a single high-capacity reasoning model, derived from an open foundation through targeted intervention. It carries the breadth of a frontier model and the discipline of a system whose behaviour was shaped one circuit at a time.

Class · Top-level reasoning
Origin · Open foundation
02
The Organs

A farm of small specialists.

Around the central model sits a fleet of compact specialists — each one a sub-billion-parameter expert at a single narrow function: structured output, code skeleton, claim extraction, contradiction analysis. They run cheaply, in parallel, and answer to the model above them.

Population · 5+ specialists
Each · < 1B parameters
03
The Memory

A structured spine of persistent recall.

A reasoning system without memory begins every encounter from zero. Ours does not. A structured archive of hundreds of indexed volumes, addressable down to the line, gives the assembled system something other models lack: a continuous record of what it has thought, what it has been told, and where each fact came from.

Volumes · 578 indexed
Recall · Volume / line precise
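
As an illustration of what volume-and-line-precise recall implies, here is a minimal line-addressable archive in Python. The names (`Pointer`, `MemorySpine`, `vol-0217`) are invented for the example; the production spine is assumed to be considerably richer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pointer:
    volume: str   # e.g. "vol-0217"
    line: int     # 1-indexed line within the volume

class MemorySpine:
    """Line-addressable archive: every recall returns from a source
    that can be named, down to the line."""

    def __init__(self):
        self.volumes: dict[str, list[str]] = {}

    def ingest(self, volume_id: str, text: str) -> None:
        self.volumes[volume_id] = text.splitlines()

    def recall(self, ptr: Pointer) -> str:
        # A failed lookup raises rather than guesses: recall is auditable.
        return self.volumes[ptr.volume][ptr.line - 1]

spine = MemorySpine()
spine.ingest("vol-0217", "First fact.\nSecond fact.")
assert spine.recall(Pointer("vol-0217", 2)) == "Second fact."
```
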
04
The Bloodstream

A routing field that learns its own paths.

Every request flows through a self-organising routing layer that decides which organs to wake, which memories to retrieve, and which paths to reinforce or starve. It is, in effect, a circulatory system — quiet, adaptive, and the reason the assembled body responds as one.

Substrate · Adaptive routing
Property · Self-pruning
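
One plausible reading of reinforce-or-starve is a usage-weighted routing table with uniform decay and a pruning floor: paths that keep helping stay warm, paths that fall silent wither away. The sketch below is illustrative only; the organ names, decay rate, and floor are assumptions, not the production router.

```python
class AdaptiveRouter:
    """Usage-weighted routing: paths that fire and help are reinforced,
    every path decays a little each step, and paths that fall below a
    floor are pruned outright."""

    def __init__(self, organs, decay=0.99, floor=0.05):
        self.weights = {name: 1.0 for name in organs}
        self.decay, self.floor = decay, floor

    def route(self, k=2):
        # Wake the k strongest paths for this request.
        ranked = sorted(self.weights, key=self.weights.get, reverse=True)
        return ranked[:k]

    def feedback(self, used, helped):
        for name in list(self.weights):
            self.weights[name] *= self.decay        # uniform decay
            if name in used and helped:
                self.weights[name] += 0.1           # reinforcement
            if self.weights[name] < self.floor:
                del self.weights[name]              # starved: pruned

router = AdaptiveRouter(["structurer", "coder", "claims", "contradict"])
woken = router.route()
router.feedback(used=woken, helped=True)
```
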
05
The Verifier

A hard gate against fabrication.

No claim leaves the system without passing a strict verifier. Every assertion that references memory must carry a pointer to its source. Every output that cannot be backed up is flagged as such, in plain language, before it reaches the user. The default, in our laboratory, is suspicion.

Output · Source-pointed
Default · Skeptical
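
The gate itself can be stated in a few lines: a claim either carries a pointer that resolves, or it is labelled unverified before it leaves. A minimal sketch with invented types; the `resolve` hook stands in for whatever the real verifier consults.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source: tuple[str, int] | None = None   # (volume id, line), or nothing

def gate(claims, resolve):
    """Hard gate: a claim passes only if its pointer resolves through
    resolve(volume, line); anything else is flagged in plain language."""
    out = []
    for claim in claims:
        try:
            volume, line = claim.source        # no pointer -> TypeError
            resolve(volume, line)              # bad pointer -> KeyError
            out.append(claim.text)
        except (TypeError, KeyError, IndexError):
            out.append(f"[unverified] {claim.text}")
    return out

# Hypothetical usage against the memory sketch above:
#   gate(claims, resolve=lambda v, l: spine.recall(Pointer(v, l)))
```
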
06
The Body

A native compiled runtime.

Most laboratories ship Python. We ship a compiled native runtime. Memory, model loading, attention kernels, tiering between fast and slow storage — all of it written in low-level systems code. The result is a complete model deployed on a single consumer GPU: less hardware, less latency, no scripting language between the model and the machine.

Language · Native compiled
Target · Single consumer GPU
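
The tiering mentioned above can be pictured as a bounded fast tier (GPU memory) in front of a large slow one (host RAM or disk), with least-recently-used eviction between them. The Python below sketches the policy only; the shipped runtime is compiled, and the capacities and names here are invented.

```python
from collections import OrderedDict

class TieredStore:
    """Two-tier weight store: a bounded fast tier (think GPU memory)
    in front of a slow tier (think host RAM or disk), with least-
    recently-used eviction from fast to slow."""

    def __init__(self, fast_capacity):
        self.fast = OrderedDict()   # name -> tensor, kept in LRU order
        self.slow = {}              # everything lives here at rest
        self.capacity = fast_capacity

    def get(self, name):
        if name in self.fast:
            self.fast.move_to_end(name)       # refresh recency
            return self.fast[name]
        tensor = self.slow[name]              # slow-tier fetch
        self.fast[name] = tensor
        if len(self.fast) > self.capacity:
            self.fast.popitem(last=False)     # evict the coldest entry
        return tensor
```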

Selected Work

What this laboratory has done.

CASE 01

An open frontier-grade model, deployed on a single consumer GPU.

We took DeepSeek V4-Flash — an open-weight Mixture-of-Experts foundation model with 284 billion parameters in total, 13 billion active per token, a one-million-token context window, an MIT license, and the kind of inference profile that ordinarily requires multiple data-centre-class GPUs to operate — and we refined it down to a form that runs end-to-end on consumer hardware.

The refinement preserved the model's reasoning and removed almost everything else. The resulting system reasons through full-length tasks on a single workstation, behind a corporate firewall, without remote dependency. What the operation showed us, more than the deployment itself, was that the bottleneck of large-model inference is not the reasoning at all. It is the cost of moving the model's specialised parts in and out of memory.

Subject · DeepSeek V4-Flash
Total / active · 284 B / 13 B
Context window · 1 M tokens
Hardware target · 1 × consumer GPU
CASE 02

A small specialist with one fifth of its weights removed — and no measurable loss.

We performed targeted excision on a sub-billion-parameter open foundation model — specifically Qwen 2.5 0.5B, used as donor tissue for the operation. Roughly twenty per cent of its weights were removed, guided by an internal signal that identifies parameters with no measurable effect on the model's behaviour. Output quality remained within noise of the original. Throughput was preserved.

This was, for us, the proof of principle. A model is not a single inseparable thing. It is a structure with healthy and dead tissue, and the healthy tissue can be isolated.

Donor · Qwen 2.5 0.5B
Weights removed · ≈ 20 %
Quality drift · Within noise
Throughput · Preserved
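
The internal signal itself is not described; as a stand-in, the sketch below prunes by weight magnitude, the simplest published proxy for "parameters with no measurable effect", zeroing roughly the smallest twenty per cent of each weight matrix.

```python
import torch

@torch.no_grad()
def excise(model, fraction=0.20):
    """Targeted-excision stand-in: zero the smallest-magnitude weights.

    The laboratory's actual saliency signal is internal; magnitude
    pruning is used here purely as an illustration.
    """
    removed = total = 0
    for name, p in model.named_parameters():
        if p.dim() < 2:                        # skip biases and norms
            continue
        k = max(1, int(fraction * p.numel()))
        threshold = p.abs().flatten().kthvalue(k).values
        mask = p.abs() > threshold
        p.mul_(mask)                           # excision in place
        removed += (~mask).sum().item()
        total += p.numel()
    return removed / total                     # fraction actually removed
```
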
CASE 03

A custom expert-streaming format that gave the inference loop a six-fold speed-up.

The single largest cost in operating a model the size of DeepSeek V4-Flash on consumer hardware is not arithmetic. It is the choreography of moving the model's hundreds of specialised experts in and out of working memory, again and again, on every forward pass.

By reorganising the way these specialised parts are packed and streamed from disk, we turned the most expensive operation in the inference pipeline into a tractable one. The same model, on the same hardware, ran roughly six times faster on its decode loop. The format is internal, instrumented, and reproducible. We use it now as the substrate for everything else.

Bottleneck · Expert streaming
Speed-up · ≈ 6 ×
Hardware · Unchanged
Status · In production
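
The format is internal, but the general shape of the idea can be sketched: pack each layer's experts contiguously in a single file so a layer's working set is one sequential read rather than hundreds of seeks, and warm the next layer's experts on a background thread while the current layer computes. Everything below (layout, dtype, names) is an assumption.

```python
import threading
import numpy as np

class ExpertStream:
    """Sketch of an expert-streaming layout: experts for each layer are
    packed contiguously in one memory-mapped file, so fetching a layer's
    active experts is a sequential read, and the next layer's experts
    can be prefetched while the current layer computes."""

    def __init__(self, path, n_layers, n_experts, expert_size):
        # `path` is assumed to hold the packed float16 expert weights.
        self.pack = np.memmap(path, dtype=np.float16, mode="r",
                              shape=(n_layers, n_experts, expert_size))

    def fetch(self, layer, expert_ids):
        # Contiguous layout: one slice per layer, no per-expert seeks.
        return self.pack[layer][expert_ids]

    def prefetch(self, layer, expert_ids):
        # Touch the next layer's experts on a background thread so the
        # page cache is warm before the decode loop needs them.
        t = threading.Thread(
            target=lambda: self.pack[layer][expert_ids].sum())
        t.start()
        return t
```
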
CASE 04

A persistent memory of hundreds of structured volumes.

We built, for the assembled system, a structured archive of hundreds of indexed volumes: long-form raw material, micro-notes, decision logs, retrieval scaffolds. Every retrieval is auditable. Every claim that references a memory must carry a pointer to a specific volume and line, or it is flagged as unverified.

It is not a chat history and it is not a vector store. It is the spine of a system that, for the first time, can be asked where it learned that — and answer.

Volumes · 578
Micro-records · 783
Decision log · 366 entries
Retrieval · Volume / line precise

Open Tools

Engines we will open up.

Native runtime

The compiled inference engine.

The native, compiled inference loop that powers every refined model we ship. Memory tiering, attention kernels, expert streaming — written in low-level systems code, instrumented end-to-end.

In active use
Surgery toolkit

The model operating tools.

The set of utilities a researcher uses to perform model surgery — locate dead tissue, perform targeted excision, graft new behaviour, validate the result. Designed to be used by humans, audited by humans.

In active use
Memory spine

The structured archive.

The persistent memory layer — indexed volumes, micro-records, decision log. Volume-and-line precise retrieval, with provenance preserved at every level.

Under refinement
Verifier

The hard gate.

The strict verification layer that sits in front of every output. Source pointers, unverified-claim flagging, structured assertions. Refuses what it cannot back up.

In integration

Principles

How this laboratory operates.

01

Operate, don't retrain.

Where the field grows models by accumulation, we refine them by intervention. Local, targeted, measurable. We change the smallest set of parameters that produces the change we want, and we know which ones.

02

Compile what you ship.

Research code belongs in the laboratory. Production systems belong in compiled native code. The translation is not optional and it is not a future concern. It is the work.

03

Skeptical by default.

No claim leaves the system without a pointer to evidence. No memory is trusted without provenance. No operation is shipped without a reproducible benchmark. The default in this laboratory is doubt.
