Variet LLM: Dual-Orchestration AI Assistant

What This Is

A high-performance, locally-hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture where Machine A acts as a dedicated inference server running large language models, while Machine B handles the user interface (VS Code, Discord) and tool execution.

Problem / Core Value

Standard LLM set-ups on a single GPU often struggle with context switching and running multi-tools asynchronously. By dedicating an API server to raw inference (50-80 t/s with Qwen 35B), the system achieves extreme responsiveness for coding while preserving resources for tool execution (Calendar, Mail, Search) on the workstation.

Target Audience

Single developer working on complex coding tasks alongside daily administrative tasks.

Key Decisions

Decision	Rationale	Outcome
2+0 GPU Architecture	Placing both GPUs in Machine A allows Qwen 35B to fully load into VRAM, increasing speed from 30t/s to 50-80t/s.	Machine A: API Server only. Machine B: All orchestrations & tools.
Separation of Agent Logic	Machine A is a pure "brain" (llama-server). Machine B has the "hands and eyes" (VS Code extension and Discord Bot).	Simplified infrastructure; tools execute directly on the workstation.
3-Tier Model Strategy	Need balanced speeds depending on the complexity of the task requested.	Fast: Gemma4 26B (~70t/s) Balanced: Qwen 35B (~50t/s) Deep: Qwen 122B (~11t/s)

Requirements

Validated

(None yet — ship to validate)

Active

Deploy headless llama-server setup on Machine A.
Build a model hot-swap utility (Fast/Balanced/Deep) for Machine A.
Develop a VS Code Extension (TypeScript) on Machine B for coding agent loop.
Develop a Discord Bot (discord.py) on Machine B for personal assistant tools.
Implement MCP tools (SearXNG, Google Calendar, Gmail) securely on Machine B.

Out of Scope

Running inference directly on Machine B (It lacks VRAM/GPU resources in this architecture).
Exposing Machine A to the public internet (LAN traffic only).

Last updated: 2026-04-05 after initialization

Evolution

This document evolves at phase transitions and milestone boundaries.

After each phase transition (via /gsd-transition):

Requirements invalidated? → Move to Out of Scope with reason
Requirements validated? → Move to Validated with phase reference
New requirements emerged? → Add to Active
Decisions to log? → Add to Key Decisions
"What This Is" still accurate? → Update if drifted

After each milestone (via /gsd-complete-milestone):

Full review of all sections
Core Value check — still the right priority?
Audit Out of Scope — reasons still valid?
Update Context with current state

2.8 KiB Raw Blame History