2.8 KiB
Variet LLM: Dual-Orchestration AI Assistant
What This Is
A high-performance, locally-hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture where Machine A acts as a dedicated inference server running large language models, while Machine B handles the user interface (VS Code, Discord) and tool execution.
Problem / Core Value
Standard LLM set-ups on a single GPU often struggle with context switching and running multi-tools asynchronously. By dedicating an API server to raw inference (50-80 t/s with Qwen 35B), the system achieves extreme responsiveness for coding while preserving resources for tool execution (Calendar, Mail, Search) on the workstation.
Target Audience
Single developer working on complex coding tasks alongside daily administrative tasks.
Key Decisions
| Decision | Rationale | Outcome |
|---|---|---|
| 2+0 GPU Architecture | Placing both GPUs in Machine A allows Qwen 35B to fully load into VRAM, increasing speed from 30t/s to 50-80t/s. | Machine A: API Server only. Machine B: All orchestrations & tools. |
| Separation of Agent Logic | Machine A is a pure "brain" (llama-server). Machine B has the "hands and eyes" (VS Code extension and Discord Bot). | Simplified infrastructure; tools execute directly on the workstation. |
| 3-Tier Model Strategy | Need balanced speeds depending on the complexity of the task requested. | Fast: Gemma4 26B (~70t/s) Balanced: Qwen 35B (~50t/s) Deep: Qwen 122B (~11t/s) |
Requirements
Validated
(None yet — ship to validate)
Active
- Deploy headless llama-server setup on Machine A.
- Build a model hot-swap utility (Fast/Balanced/Deep) for Machine A.
- Develop a VS Code Extension (TypeScript) on Machine B for coding agent loop.
- Develop a Discord Bot (discord.py) on Machine B for personal assistant tools.
- Implement MCP tools (SearXNG, Google Calendar, Gmail) securely on Machine B.
Out of Scope
- Running inference directly on Machine B (It lacks VRAM/GPU resources in this architecture).
- Exposing Machine A to the public internet (LAN traffic only).
Last updated: 2026-04-05 after initialization
Evolution
This document evolves at phase transitions and milestone boundaries.
After each phase transition (via /gsd-transition):
- Requirements invalidated? → Move to Out of Scope with reason
- Requirements validated? → Move to Validated with phase reference
- New requirements emerged? → Add to Active
- Decisions to log? → Add to Key Decisions
- "What This Is" still accurate? → Update if drifted
After each milestone (via /gsd-complete-milestone):
- Full review of all sections
- Core Value check — still the right priority?
- Audit Out of Scope — reasons still valid?
- Update Context with current state