variet_llm/.planning/PROJECT.md

# Variet LLM: Dual-Orchestration AI Assistant

## What This Is
A high-performance, locally-hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture where Machine A acts as a dedicated inference server running large language models, while Machine B handles the user interface (VS Code, Discord) and tool execution.

## Problem / Core Value
Standard LLM set-ups on a single GPU often struggle with context switching and running multi-tools asynchronously. By dedicating an API server to raw inference (50-80 t/s with Qwen 35B), the system achieves extreme responsiveness for coding while preserving resources for tool execution (Calendar, Mail, Search) on the workstation.

## Target Audience
Single developer working on complex coding tasks alongside daily administrative tasks.

## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| 2+0 GPU Architecture | Placing both GPUs in Machine A allows Qwen 35B to fully load into VRAM, increasing speed from 30t/s to 50-80t/s. | Machine A: API Server only.<br/>Machine B: All orchestrations & tools. |
| Separation of Agent Logic | Machine A is a pure "brain" (llama-server). Machine B has the "hands and eyes" (VS Code extension and Discord Bot). | Simplified infrastructure; tools execute directly on the workstation. |
| 3-Tier Model Strategy | Need balanced speeds depending on the complexity of the task requested. | Fast: Gemma4 26B (~70t/s)<br/>Balanced: Qwen 35B (~50t/s)<br/>Deep: Qwen 122B (~11t/s) |

## Requirements

### Validated

(None yet — ship to validate)

### Active

- [ ] Deploy headless llama-server setup on Machine A.
- [ ] Build a model hot-swap utility (Fast/Balanced/Deep) for Machine A.
- [ ] Develop a VS Code Extension (TypeScript) on Machine B for coding agent loop.
- [ ] Develop a Discord Bot (discord.py) on Machine B for personal assistant tools.
- [ ] Implement MCP tools (SearXNG, Google Calendar, Gmail) securely on Machine B.

### Out of Scope

- [ ] Running inference directly on Machine B (It lacks VRAM/GPU resources in this architecture).
- [ ] Exposing Machine A to the public internet (LAN traffic only).

---
*Last updated: 2026-04-05 after initialization*

## Evolution

This document evolves at phase transitions and milestone boundaries.

**After each phase transition** (via `/gsd-transition`):
1. Requirements invalidated? → Move to Out of Scope with reason
2. Requirements validated? → Move to Validated with phase reference
3. New requirements emerged? → Add to Active
4. Decisions to log? → Add to Key Decisions
5. "What This Is" still accurate? → Update if drifted

**After each milestone** (via `/gsd-complete-milestone`):
1. Full review of all sections
2. Core Value check — still the right priority?
3. Audit Out of Scope — reasons still valid?
4. Update Context with current state