# Variet LLM: Dual-Orchestration AI Assistant

## What This Is

A high-performance, locally hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture: Machine A acts as a dedicated inference server running large language models, while Machine B handles the user interface (VS Code, Discord) and tool execution.

## Problem / Core Value

Standard single-GPU LLM setups often struggle with context switching and with running multiple tools asynchronously. By dedicating an API server to raw inference (up to 75 t/s with Gemma4 26B), the system stays highly responsive for coding while preserving workstation resources for tool execution (Calendar, Mail, Search).

## Target Audience

A single developer working on complex coding tasks alongside daily administrative tasks.

## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| 2+0 GPU Architecture | Placing both GPUs in Machine A lets models load fully into VRAM, which increases speed dramatically. | Machine A: API server only.<br>Machine B: all orchestration and tools. |
| Separation of Agent Logic | Machine A is a pure "brain" (llama-server). Machine B has the "hands and eyes" (VS Code extension and Discord Bot). | Simplified infrastructure; tools execute directly on the workstation. |
| 5-Tier Model Strategy | Speed has to be balanced against the complexity of the requested task. | Fast: Gemma4 26B (~75 t/s)<br>Balanced: Qwen 35B (~62 t/s)<br>Deep-Coder: Gemma4 31B (~16 t/s)<br>Deep-Logic: Qwen 27B (~17 t/s)<br>Ultra: Qwen 122B (~9 t/s) |
| GPU 0 PCIe x4 Constraint | GPU 0 sits in a PCIe 3.0 x4 slot, which cuts its bandwidth to 1/8. The MoE model (122B) must run on GPU 1 alone. | Dense models use both GPUs; the MoE Ultra tier uses GPU 1 only. |
| Variet Engine (FastAPI proxy) | Relays all API traffic and handles hot-swapping on a single port (8000). Eliminates the sprawl of individual .bat files. | `engine/variet_engine.py` + `config/engine_models.json` |
| CLI-First Validation Strategy | Validate the agent loop with the OpenClaude CLI before developing the VS Code Extension, to secure a fast feedback loop. | `openclaude/` submodule (v0.1.8) → connected to Variet Engine |

## Requirements

### Validated

- [x] Deploy headless llama-server setup on Machine A. *(Phase 01)*
- [x] Build a model hot-swap utility (5-Tier) for Machine A. *(Phase 02)*

### Active

- [ ] Connect OpenClaude CLI to Variet Engine for a terminal-based coding agent. *(Phase 03)*
- [ ] Configure 5-Tier model routing (agentRouting) in OpenClaude. *(Phase 03)*
- [ ] Validate end-to-end agent loop: prompt → tool calls → code changes. *(Phase 03)*
- [ ] Develop a VS Code Extension (TypeScript) on Machine B for the coding agent loop.
- [ ] Develop a Discord Bot (discord.py) on Machine B for personal assistant tools.
- [ ] Implement MCP tools (SearXNG, Google Calendar, Gmail) securely on Machine B.

### Out of Scope

- [ ] Running inference directly on Machine B (it lacks VRAM/GPU resources in this architecture).
- [ ] Exposing Machine A to the public internet (LAN traffic only).

## Current Milestone: v1.1 OpenClaude CLI Integration

**Goal:** Connect the OpenClaude CLI to Variet Engine to run and validate a terminal-based coding agent on Machine B.

**Target features:**

- OpenClaude → Variet Engine (Machine A:8000) provider connection
- 5-Tier model routing (agentRouting) configuration
- CLI build and functional verification
- Variet Engine hot-swap integration test

---

*Last updated: 2026-04-07 (Milestone v1.1 started)*

## Evolution

This document evolves at phase transitions and milestone boundaries.

**After each phase transition** (via `/gsd-transition`):

1. Requirements invalidated? → Move to Out of Scope with reason
2. Requirements validated? → Move to Validated with phase reference
3. New requirements emerged? → Add to Active
4. Decisions to log? → Add to Key Decisions
5. "What This Is" still accurate? → Update if drifted

**After each milestone** (via `/gsd-complete-milestone`):

1. Full review of all sections
2. Core Value check: still the right priority?
3. Audit Out of Scope: reasons still valid?
4. Update Context with current state
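
To make the speed trade-off behind the 5-Tier Model Strategy in Key Decisions concrete, here is a minimal Python sketch that estimates generation time per tier from the approximate throughput figures quoted there. The tier keys, the `Tier` class, and `estimate_seconds` are hypothetical names chosen for illustration; this is not the schema of `config/engine_models.json` or the routing logic in Variet Engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    model: str
    tokens_per_sec: float  # approximate figure from the Key Decisions table


# Hypothetical in-memory copy of the 5-Tier table; not the real config schema.
TIERS = {
    "fast": Tier("fast", "Gemma4 26B", 75.0),
    "balanced": Tier("balanced", "Qwen 35B", 62.0),
    "deep-coder": Tier("deep-coder", "Gemma4 31B", 16.0),
    "deep-logic": Tier("deep-logic", "Qwen 27B", 17.0),
    "ultra": Tier("ultra", "Qwen 122B", 9.0),
}


def estimate_seconds(tier_name: str, output_tokens: int) -> float:
    """Rough generation time: output tokens divided by the tier's throughput."""
    tier = TIERS[tier_name.lower()]
    return output_tokens / tier.tokens_per_sec


if __name__ == "__main__":
    # Print a rough per-tier estimate for a 900-token response.
    for name in TIERS:
        print(f"{name:<11} {estimate_seconds(name, 900):7.1f} s for 900 tokens")
```

Under these figures, a 900-token response takes roughly 12 seconds on the Fast tier and 100 seconds on Ultra, which is why routing by task complexity matters. The estimate ignores prompt processing and model hot-swap time.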