4.1 KiB
Variet LLM: Dual-Orchestration AI Assistant
What This Is
A high-performance, locally-hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture where Machine A acts as a dedicated inference server running large language models, while Machine B handles the user interface (VS Code, Discord) and tool execution.
Problem / Core Value
Standard LLM set-ups on a single GPU often struggle with context switching and running multi-tools asynchronously. By dedicating an API server to raw inference (up to 75 t/s with Gemma4 26B), the system achieves extreme responsiveness for coding while preserving resources for tool execution (Calendar, Mail, Search) on the workstation.
Target Audience
Single developer working on complex coding tasks alongside daily administrative tasks.
Key Decisions
| Decision | Rationale | Outcome |
|---|---|---|
| 2+0 GPU Architecture | Placing both GPUs in Machine A allows models to fully load into VRAM, increasing speed dramatically. | Machine A: API Server only. Machine B: All orchestrations & tools. |
| Separation of Agent Logic | Machine A is a pure "brain" (llama-server). Machine B has the "hands and eyes" (VS Code extension and Discord Bot). | Simplified infrastructure; tools execute directly on the workstation. |
| 5-Tier Model Strategy | Need balanced speeds depending on the complexity of the task requested. | Fast: Gemma4 26B (~75t/s) Balanced: Qwen 35B (~62t/s) Deep-Coder: Gemma4 31B (~16t/s) Deep-Logic: Qwen 27B (~17t/s) Ultra: Qwen 122B (~9t/s) |
| GPU 0 PCIe x4 제약 | GPU 0이 PCIe 3.0 x4 슬롯에 물려 대역폭이 1/8. MoE 모델(122B)은 GPU 1 단독 사용 필수. | Dense 모델은 듀얼 GPU, MoE Ultra는 GPU 1 전용 |
| Variet Engine (FastAPI 프록시) | 단일 포트(8000)에서 모든 API 중계 + 핫스왑. 개별 .bat 파일 난립 해소. | engine/variet_engine.py + config/engine_models.json |
| CLI-First 검증 전략 | VS Code Extension 개발 전 OpenClaude CLI로 먼저 에이전트 루프를 검증. 빠른 피드백 루프 확보. | openclaude/ 서브모듈 (v0.1.8) → Variet Engine 연결 |
Requirements
Validated
- Deploy headless llama-server setup on Machine A. (Phase 01)
- Build a model hot-swap utility (5-Tier) for Machine A. (Phase 02)
Active
- Connect OpenClaude CLI to Variet Engine for terminal-based coding agent. (Phase 03)
- Configure 5-Tier model routing (agentRouting) in OpenClaude. (Phase 03)
- Validate end-to-end agent loop: prompt → tool calls → code changes. (Phase 03)
- Develop a VS Code Extension (TypeScript) on Machine B for coding agent loop.
- Develop a Discord Bot (discord.py) on Machine B for personal assistant tools.
- Implement MCP tools (SearXNG, Google Calendar, Gmail) securely on Machine B.
Out of Scope
- Running inference directly on Machine B (It lacks VRAM/GPU resources in this architecture).
- Exposing Machine A to the public internet (LAN traffic only).
Current Milestone: v1.1 OpenClaude CLI Integration
Goal: OpenClaude CLI를 Variet Engine에 연결하여 Machine B에서 터미널 기반 코딩 에이전트를 가동하고 검증한다.
Target features:
- OpenClaude → Variet Engine (Machine A:8000) 프로바이더 연결
- 5-Tier 모델 라우팅 (agentRouting) 설정
- CLI 빌드 및 동작 검증
- Variet Engine 핫스왑 연동 테스트
Last updated: 2026-04-07 — Milestone v1.1 started
Evolution
This document evolves at phase transitions and milestone boundaries.
After each phase transition (via /gsd-transition):
- Requirements invalidated? → Move to Out of Scope with reason
- Requirements validated? → Move to Validated with phase reference
- New requirements emerged? → Add to Active
- Decisions to log? → Add to Key Decisions
- "What This Is" still accurate? → Update if drifted
After each milestone (via /gsd-complete-milestone):
- Full review of all sections
- Core Value check — still the right priority?
- Audit Out of Scope — reasons still valid?
- Update Context with current state