variet_llm/.planning/PROJECT.md
Variet-Worker c111b3a9b0 feat: Variet Engine v1.0 + 5-model tuning complete
Phase 01 (LLM Tuning):
- Gemma4 26B: 74.65 t/s (fast)
- Qwen 35B: 61.62 t/s (balanced)
- Gemma4 31B: 16.0 t/s (deep-coder)
- Qwen 27B: 16.7 t/s (deep-logic)
- Qwen 122B: 8.95 t/s (ultra, GPU 1 only)

Phase 02 (API Engine):
- FastAPI reverse proxy on port 8000
- /engine/switch hot-swap with 503 protection
- config/engine_models.json as single source of truth
- Replaced 4 individual .bat files with unified engine

File cleanup:
- scripts/ 85 files -> 9 + _archive/
- Root .bat files -> _archive/
2026-04-07 18:08:58 +09:00


Variet LLM: Dual-Orchestration AI Assistant

What This Is

A locally hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture: both GPUs sit in Machine A, which acts as a dedicated inference server running large language models, while Machine B has no inference GPU and handles the user interface (VS Code, Discord) and tool execution.

Problem / Core Value

Standard single-GPU LLM setups often struggle with context switching and with running multiple tools asynchronously. By dedicating a separate API server to raw inference (up to 75 t/s with Gemma4 26B), the system stays responsive for coding while the workstation keeps its resources free for tool execution (Calendar, Mail, Search).
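To make the speed figures concrete, the sketch below converts the per-tier decode speeds reported in this document into wall-clock generation time. The 500-token response length is an arbitrary illustration, and the calculation ignores prompt processing and LAN latency, so treat the results as rough lower bounds rather than measurements.

```python
# Decode speeds (tokens/second) as reported in this document.
TIER_SPEEDS_TPS = {
    "fast": 74.65,        # Gemma4 26B
    "balanced": 61.62,    # Qwen 35B
    "deep-coder": 16.0,   # Gemma4 31B
    "deep-logic": 16.7,   # Qwen 27B
    "ultra": 8.95,        # Qwen 122B
}


def generation_seconds(tier: str, output_tokens: int) -> float:
    """Time to decode `output_tokens` at the tier's reported speed.

    Assumption of this sketch: prompt processing and network time are ignored.
    """
    return output_tokens / TIER_SPEEDS_TPS[tier]


if __name__ == "__main__":
    # 500 tokens is an arbitrary, illustrative response length.
    for tier in TIER_SPEEDS_TPS:
        print(f"{tier:>10}: {generation_seconds(tier, 500):6.1f} s")
```

At these speeds a 500-token answer takes roughly 7 seconds on the fast tier and just under a minute on the ultra tier, which is the trade-off the tier strategy is meant to manage.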

Target Audience

A single developer who handles complex coding tasks alongside daily administrative tasks.

Key Decisions

| Decision | Rationale | Outcome |
| --- | --- | --- |
| 2+0 GPU Architecture | Placing both GPUs in Machine A allows models to load fully into VRAM, which increases inference speed. | Machine A: API server only. Machine B: all orchestration and tools. |
| Separation of Agent Logic | Machine A is a pure "brain" (llama-server). Machine B has the "hands and eyes" (VS Code extension and Discord bot). | Simplified infrastructure; tools execute directly on the workstation. |
| 5-Tier Model Strategy | Tasks differ in complexity, so each needs a different balance of speed and depth. | Fast: Gemma4 26B (~75 t/s). Balanced: Qwen 35B (~62 t/s). Deep-Coder: Gemma4 31B (~16 t/s). Deep-Logic: Qwen 27B (~17 t/s). Ultra: Qwen 122B (~9 t/s). |
| GPU 0 PCIe x4 Constraint | GPU 0 sits in a PCIe 3.0 x4 slot, so its bandwidth is reduced to 1/8. The MoE model (122B) must run on GPU 1 alone. | Dense models use both GPUs; the MoE Ultra model is GPU 1 only. |
| Variet Engine (FastAPI proxy) | Relays all API traffic and handles hot-swapping on a single port (8000). Ends the sprawl of individual .bat files. | engine/variet_engine.py + config/engine_models.json |
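The GPU pinning decision and the single config file can be pictured with a small sketch. The schema below is hypothetical (the real contents of config/engine_models.json are not shown in this document), as are the model paths, the `build_launch` helper, and the backend port 8080. The sketch only assumes that GPU visibility is restricted through the standard `CUDA_VISIBLE_DEVICES` environment variable and that llama-server accepts `--model` and `--port`.

```python
import json
import os

# Hypothetical shape for config/engine_models.json; the real schema may differ.
EXAMPLE_CONFIG = json.loads("""
{
  "fast":  {"model_path": "models/fast.gguf",  "gpus": [0, 1]},
  "ultra": {"model_path": "models/ultra.gguf", "gpus": [1]}
}
""")


def build_launch(tier: str, config: dict, port: int = 8080):
    """Return (argv, env) for starting llama-server for the given tier.

    `port` is an assumed backend port; the proxy itself listens on 8000.
    """
    entry = config[tier]
    env = dict(os.environ)
    # Restrict the process to the GPUs listed for this tier.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in entry["gpus"])
    argv = ["llama-server", "--model", entry["model_path"], "--port", str(port)]
    return argv, env


if __name__ == "__main__":
    argv, env = build_launch("ultra", EXAMPLE_CONFIG)
    print(argv, "CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"])
```

The returned pair could be handed to `subprocess.Popen(argv, env=env)`. Keeping the GPU list in the config means the Ultra tier's GPU 1 restriction lives next to its model path rather than in a separate launcher script.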

Requirements

Validated

  • Deploy headless llama-server setup on Machine A. (Phase 01)
  • Build a model hot-swap utility (5-Tier) for Machine A. (Phase 02)
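One plausible reading of "hot-swap with 503 protection" is that the proxy answers 503 Service Unavailable while a model is being swapped, instead of forwarding requests to a backend that is not ready. The sketch below shows only that gating logic in plain Python. `EngineGate` and its method names are hypothetical and are not taken from engine/variet_engine.py.

```python
import threading


class EngineGate:
    """Tracks the active tier and rejects traffic while a swap is running."""

    def __init__(self, active_tier: str):
        self._lock = threading.Lock()
        self._active = active_tier
        self._switching = False

    def begin_switch(self) -> bool:
        """Mark a swap as started; return False if one is already running."""
        with self._lock:
            if self._switching:
                return False
            self._switching = True
            return True

    def finish_switch(self, new_tier: str) -> None:
        """Record the new active tier and reopen the gate."""
        with self._lock:
            self._active = new_tier
            self._switching = False

    def route(self):
        """Return (status, detail) for an incoming inference request."""
        with self._lock:
            if self._switching:
                return 503, "model switch in progress, retry shortly"
            return 200, self._active


if __name__ == "__main__":
    gate = EngineGate("fast")
    print(gate.route())
    gate.begin_switch()
    print(gate.route())
    gate.finish_switch("ultra")
    print(gate.route())
```

In a FastAPI proxy, a handler for /engine/switch could call `begin_switch()` before restarting llama-server and `finish_switch()` once the new model is loaded, while the forwarding handler turns a 503 result from `route()` into an HTTP 503 response.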

Active

  • Develop a VS Code Extension (TypeScript) on Machine B for coding agent loop.
  • Develop a Discord Bot (discord.py) on Machine B for personal assistant tools.
  • Implement MCP tools (SearXNG, Google Calendar, Gmail) securely on Machine B.

Out of Scope

  • Running inference directly on Machine B (it lacks VRAM/GPU resources in this architecture).
  • Exposing Machine A to the public internet (LAN traffic only).
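The LAN-only constraint could be enforced at the application level as well as at the network edge. As a minimal sketch, assuming the proxy can see the client's IP address, the hypothetical helper below uses the standard-library `ipaddress` module to accept only private or loopback addresses. It complements firewall or router rules and does not replace them.

```python
import ipaddress


def is_lan_client(addr: str) -> bool:
    """Return True for private-range or loopback client addresses.

    Note: `is_private` also covers some other non-global reserved ranges.
    """
    ip = ipaddress.ip_address(addr)
    return ip.is_private or ip.is_loopback


if __name__ == "__main__":
    for candidate in ("192.168.1.20", "127.0.0.1", "8.8.8.8"):
        print(candidate, is_lan_client(candidate))
```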

Last updated: 2026-04-07 after Phase 02 completion

Evolution

This document evolves at phase transitions and milestone boundaries.

After each phase transition (via /gsd-transition):

  1. Requirements invalidated? → Move to Out of Scope with reason
  2. Requirements validated? → Move to Validated with phase reference
  3. New requirements emerged? → Add to Active
  4. Decisions to log? → Add to Key Decisions
  5. "What This Is" still accurate? → Update if drifted

After each milestone (via /gsd-complete-milestone):

  1. Full review of all sections
  2. Core Value check — still the right priority?
  3. Audit Out of Scope — reasons still valid?
  4. Update Context with current state