Phase 01 (LLM Tuning):
- Gemma4 26B: 74.65 t/s (fast)
- Qwen 35B: 61.62 t/s (balanced)
- Gemma4 31B: 16.0 t/s (deep-coder)
- Qwen 27B: 16.7 t/s (deep-logic)
- Qwen 122B: 8.95 t/s (ultra, GPU 1 only)

Phase 02 (API Engine):
- FastAPI reverse proxy on port 8000
- /engine/switch hot-swap with 503 protection
- config/engine_models.json as single source of truth
- Replaced 4 individual .bat files with unified engine

File cleanup:
- scripts/ 85 files -> 9 + _archive/
- Root .bat files -> _archive/
# Variet LLM: Dual-Orchestration AI Assistant

## What This Is
A high-performance, locally hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture: Machine A holds both GPUs and acts as a dedicated inference server for large language models, while Machine B handles the user interface (VS Code, Discord) and tool execution.
## Problem / Core Value
Standard single-GPU LLM setups often struggle with context switching and with running multiple tools asynchronously. By dedicating a separate API server to raw inference (up to ~75 t/s with Gemma4 26B), the system stays highly responsive for coding while leaving the workstation's resources free for tool execution (Calendar, Mail, Search).
## Target Audience
A single developer who works on complex coding tasks alongside daily administrative tasks.
## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| 2+0 GPU Architecture | Placing both GPUs in Machine A lets models load fully into VRAM, which increases speed dramatically. | Machine A: API server only.<br/>Machine B: All orchestration & tools. |
| Separation of Agent Logic | Machine A is a pure "brain" (llama-server). Machine B has the "hands and eyes" (VS Code extension and Discord bot). | Simplified infrastructure; tools execute directly on the workstation. |
| 5-Tier Model Strategy | Tasks of different complexity need different speed/depth trade-offs. | Fast: Gemma4 26B (~75 t/s)<br/>Balanced: Qwen 35B (~62 t/s)<br/>Deep-Coder: Gemma4 31B (~16 t/s)<br/>Deep-Logic: Qwen 27B (~17 t/s)<br/>Ultra: Qwen 122B (~9 t/s) |
| GPU 0 PCIe x4 Constraint | GPU 0 sits in a PCIe 3.0 x4 slot, which cuts its bandwidth to 1/8. The MoE model (122B) must therefore run on GPU 1 alone. | Dense models use both GPUs; the MoE Ultra tier is pinned to GPU 1. |
| Variet Engine (FastAPI proxy) | Relays all API traffic through a single port (8000) and adds model hot-swapping. Ends the sprawl of individual .bat files. | `engine/variet_engine.py` + `config/engine_models.json` |
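As a rough illustration of the 5-tier strategy in the table above, the sketch below encodes the tiers as JSON and estimates decode time per tier. The tier names, models, and throughput figures are the Phase 01 measurements recorded for this project; the JSON field names (`model`, `tokens_per_s`, `gpus`) and both helper functions are hypothetical, since the actual schema of `config/engine_models.json` is not shown in this document.

```python
import json

# Tier names, models, and t/s values are the Phase 01 measurements.
# The JSON layout itself is an assumption made for illustration only;
# the real schema of config/engine_models.json may differ.
TIERS_JSON = """
{
  "fast":       {"model": "Gemma4 26B", "tokens_per_s": 74.65},
  "balanced":   {"model": "Qwen 35B",   "tokens_per_s": 61.62},
  "deep-coder": {"model": "Gemma4 31B", "tokens_per_s": 16.0},
  "deep-logic": {"model": "Qwen 27B",   "tokens_per_s": 16.7},
  "ultra":      {"model": "Qwen 122B",  "tokens_per_s": 8.95, "gpus": [1]}
}
"""


def load_tiers(text: str) -> dict:
    """Parse the tier table from a JSON string."""
    return json.loads(text)


def estimate_seconds(tiers: dict, tier: str, n_tokens: int) -> float:
    """Rough decode time for n_tokens, ignoring prompt processing."""
    return n_tokens / tiers[tier]["tokens_per_s"]


tiers = load_tiers(TIERS_JSON)
for name in tiers:
    seconds = estimate_seconds(tiers, name, 500)
    print(f"{name:<10} {seconds:6.1f} s for 500 tokens")
```

At these rates a 500-token reply takes about 7 seconds on the fast tier and close to a minute on ultra, which is the trade-off the tiering is meant to expose.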
## Requirements

### Validated

- [x] Deploy headless llama-server setup on Machine A. *(Phase 01)*
- [x] Build a model hot-swap utility (5-Tier) for Machine A. *(Phase 02)*
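The 503 protection around hot-swapping could work along the following lines. This is a framework-free sketch, not the code in `engine/variet_engine.py`: the class and method names are hypothetical, and the only behavior taken from the project notes is that requests receive a 503 while a swap is in progress.

```python
import threading


class EngineState:
    """Tracks the active tier and whether a model swap is in progress."""

    def __init__(self, tiers, active):
        self._lock = threading.Lock()
        self._tiers = set(tiers)
        self.active = active
        self.switching = False

    def begin_switch(self, tier: str) -> bool:
        """Mark the engine as busy. Returns False if a swap is already running."""
        with self._lock:
            if tier not in self._tiers:
                raise ValueError(f"unknown tier: {tier}")
            if self.switching:
                return False
            self.switching = True
            return True

    def finish_switch(self, tier: str) -> None:
        """Call once the new model has loaded and is ready to serve."""
        with self._lock:
            self.active = tier
            self.switching = False

    def proxy_status(self) -> int:
        """HTTP status the proxy would give an inference request right now."""
        with self._lock:
            return 503 if self.switching else 200


engine = EngineState(
    ["fast", "balanced", "deep-coder", "deep-logic", "ultra"], active="fast"
)
print(engine.proxy_status())                 # before the swap
engine.begin_switch("ultra")
print(engine.proxy_status())                 # while the swap is running
engine.finish_switch("ultra")
print(engine.proxy_status(), engine.active)  # after the swap
```

In a real proxy the model restart would happen between `begin_switch` and `finish_switch`, and a failed load would also need to clear the `switching` flag so the engine does not stay stuck on 503.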
### Active

- [ ] Develop a VS Code extension (TypeScript) on Machine B for the coding agent loop.
- [ ] Develop a Discord bot (discord.py) on Machine B for personal assistant tools.
- [ ] Implement MCP tools (SearXNG, Google Calendar, Gmail) securely on Machine B.
### Out of Scope

- [ ] Running inference directly on Machine B (it has no spare GPU/VRAM resources in this architecture).
- [ ] Exposing Machine A to the public internet (LAN traffic only).
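One possible way to back the LAN-only rule at the application level is to check the client address before proxying a request. The sketch below is an assumption, not something this document says the engine does: the function name and the list of accepted networks are hypothetical, and a firewall rule or binding to a LAN interface would serve the same goal.

```python
import ipaddress

# Private IPv4 ranges plus loopback. This list is an illustrative choice;
# adjust it to the actual LAN layout.
LAN_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8")
]


def is_lan_client(addr: str) -> bool:
    """Return True if addr falls inside one of the accepted LAN networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in LAN_NETWORKS)


for candidate in ("192.168.1.20", "10.1.2.3", "8.8.8.8"):
    print(candidate, is_lan_client(candidate))
```

A proxy could call `is_lan_client` on the peer address and reject anything that returns `False`.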
---
*Last updated: 2026-04-07 after Phase 02 completion*
## Evolution

This document evolves at phase transitions and milestone boundaries.

**After each phase transition** (via `/gsd-transition`):
1. Requirements invalidated? → Move to Out of Scope with reason
2. Requirements validated? → Move to Validated with phase reference
3. New requirements emerged? → Add to Active
4. Decisions to log? → Add to Key Decisions
5. "What This Is" still accurate? → Update if drifted

**After each milestone** (via `/gsd-complete-milestone`):
1. Full review of all sections
2. Core Value check: still the right priority?
3. Audit Out of Scope: reasons still valid?
4. Update Context with current state