The problem
Cloud AI assistants are powerful but carry three costs that disqualify them for sensitive work: source code and business data leave the machine, every request is billed, and they stop working without a connection. Teams handling private code need agentic capability that runs entirely on their own hardware.
The approach
- Designed a six-layer local stack with a single signal path: Open WebUI (interface) → an agent framework (OpenHands / Aider) that calls real tools → Qwen 3 (reasoning) → Ollama (GPU runtime) → Qdrant (vector memory) — all open-source and self-hosted.
- Gave the agent real hands via tool/MCP integrations: read and edit code, run terminal commands, work with Git, process Excel/CSV, and query Postgres/SQLite — not chat-only output.
- Added durable memory: Qdrant holds indexed code, docs, and learned decisions/fixes, so the agent keeps context across sessions and builds a model of how a project fits together.
- Specified three hardware tiers (CPU-only 4B → RTX 3060/4060 Ti 14B → RTX 4090 32B) so the identical stack scales from a junior-dev laptop to a workstation — only the model size changes.
The outcome
- A working, reproducible build: after a one-time setup the agent runs air-gapped, charges nothing per request, and keeps functioning with no network at all.
- Documented end to end as a step-by-step developer build guide — driver setup, MCP wiring, model selection per tier, and the real troubleshooting issues hit along the way.
- Reference configuration (RTX 4060 Ti 16GB, Ubuntu 24.04, Qwen 3 14B) runs the full agent stack — model, tools, and vector memory — on commodity hardware.
Honest note: A self-directed architecture and build guide validated on a reference machine. Throughput depends on hardware (a few tokens/sec on CPU-only, fast on a mid-range GPU); 32B on 16GB VRAM is tight (quantized / partial offload), so 14B is the recommended sweet spot.