Self-hosted local AI · systems architecture

Offline, Self-Hosted AI Coding Agent

An air-gapped 'AI employee' that reads code, runs commands, and remembers what it learns — on a single GPU, at $0 per request.

AI ArchitectureLocal LLMOllamaQwen 3RAG / Vector Memory
$0.00
runtime cost per request after one-time hardware setup
100% local
air-gapped — code & data never leave the machine
32B / 14B
Qwen 3 models on a single 16GB-VRAM GPU

The problem

Cloud AI assistants are powerful but carry three costs that disqualify them for sensitive work: source code and business data leave the machine, every request is billed, and they stop working without a connection. Teams handling private code need agentic capability that runs entirely on their own hardware.

The approach

  • Designed a six-layer local stack with a single signal path: Open WebUI (interface) → an agent framework (OpenHands / Aider) that calls real tools → Qwen 3 (reasoning) → Ollama (GPU runtime) → Qdrant (vector memory) — all open-source and self-hosted.
  • Gave the agent real hands via tool/MCP integrations: read and edit code, run terminal commands, work with Git, process Excel/CSV, and query Postgres/SQLite — not chat-only output.
  • Added durable memory: Qdrant holds indexed code, docs, and learned decisions/fixes, so the agent keeps context across sessions and builds a model of how a project fits together.
  • Specified three hardware tiers (CPU-only 4B → RTX 3060/4060 Ti 14B → RTX 4090 32B) so the identical stack scales from a junior-dev laptop to a workstation — only the model size changes.

The outcome

  • A working, reproducible build: after a one-time setup the agent runs air-gapped, charges nothing per request, and keeps functioning with no network at all.
  • Documented end to end as a step-by-step developer build guide — driver setup, MCP wiring, model selection per tier, and the real troubleshooting issues hit along the way.
  • Reference configuration (RTX 4060 Ti 16GB, Ubuntu 24.04, Qwen 3 14B) runs the full agent stack — model, tools, and vector memory — on commodity hardware.

Honest note: A self-directed architecture and build guide validated on a reference machine. Throughput depends on hardware (a few tokens/sec on CPU-only, fast on a mid-range GPU); 32B on 16GB VRAM is tight (quantized / partial offload), so 14B is the recommended sweet spot.