Variet LLM: Dual-Orchestration AI Assistant

What This Is

A high-performance, locally-hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture where Machine A acts as a dedicated inference server running large language models, while Machine B handles the user interface (VS Code, Discord) and tool execution.

Problem / Core Value

Standard LLM set-ups on a single GPU often struggle with context switching and running multi-tools asynchronously. By dedicating an API server to raw inference (up to 75 t/s with Gemma4 26B), the system achieves extreme responsiveness for coding while preserving resources for tool execution (Calendar, Mail, Search) on the workstation.

Target Audience

Single developer working on complex coding tasks alongside daily administrative tasks.

Key Decisions

Decision	Rationale	Outcome
2+0 GPU Architecture	Placing both GPUs in Machine A allows models to fully load into VRAM, increasing speed dramatically.	Machine A: API Server only. Machine B: All orchestrations & tools.
Separation of Agent Logic	Machine A is a pure "brain" (llama-server). Machine B has the "hands and eyes" (VS Code extension and Discord Bot).	Simplified infrastructure; tools execute directly on the workstation.
5-Tier Model Strategy	Need balanced speeds depending on the complexity of the task requested.	Fast: Gemma4 26B (~75t/s) Balanced: Qwen 35B (~62t/s) Deep-Coder: Gemma4 31B (~16t/s) Deep-Logic: Qwen 27B (~17t/s) Ultra: Qwen 122B (~9t/s)
GPU 0 PCIe x4 제약	GPU 0이 PCIe 3.0 x4 슬롯에 물려 대역폭이 1/8. MoE 모델(122B)은 GPU 1 단독 사용 필수.	Dense 모델은 듀얼 GPU, MoE Ultra는 GPU 1 전용
Variet Engine (FastAPI 프록시)	단일 포트(8000)에서 모든 API 중계 + 핫스왑. 개별 .bat 파일 난립 해소.	`engine/variet_engine.py` + `config/engine_models.json`
CLI-First 검증 전략	VS Code Extension 개발 전 OpenClaude CLI로 먼저 에이전트 루프를 검증. 빠른 피드백 루프 확보.	`openclaude/` 서브모듈 (v0.1.8) → Variet Engine 연결

Requirements

Validated

Deploy headless llama-server setup on Machine A. (Phase 01)
Build a model hot-swap utility (5-Tier) for Machine A. (Phase 02)

Active

Connect OpenClaude CLI to Variet Engine for terminal-based coding agent. (Phase 03)
Configure 5-Tier model routing (agentRouting) in OpenClaude. (Phase 03)
Validate end-to-end agent loop: prompt → tool calls → code changes. (Phase 03)
Develop a VS Code Extension (TypeScript) on Machine B for coding agent loop.
Develop a Discord Bot (discord.py) on Machine B for personal assistant tools.
Implement MCP tools (SearXNG, Google Calendar, Gmail) securely on Machine B.

Out of Scope

Running inference directly on Machine B (It lacks VRAM/GPU resources in this architecture).
Exposing Machine A to the public internet (LAN traffic only).

Current Milestone: v1.1 OpenClaude CLI Integration

Goal: OpenClaude CLI를 Variet Engine에 연결하여 Machine B에서 터미널 기반 코딩 에이전트를 가동하고 검증한다.

Target features:

OpenClaude → Variet Engine (Machine A:8000) 프로바이더 연결
5-Tier 모델 라우팅 (agentRouting) 설정
CLI 빌드 및 동작 검증
Variet Engine 핫스왑 연동 테스트

Last updated: 2026-04-07 — Milestone v1.1 started

Evolution

This document evolves at phase transitions and milestone boundaries.

After each phase transition (via /gsd-transition):

Requirements invalidated? → Move to Out of Scope with reason
Requirements validated? → Move to Validated with phase reference
New requirements emerged? → Add to Active
Decisions to log? → Add to Key Decisions
"What This Is" still accurate? → Update if drifted

After each milestone (via /gsd-complete-milestone):

Full review of all sections
Core Value check — still the right priority?
Audit Out of Scope — reasons still valid?
Update Context with current state

4.1 KiB Raw Blame History