feat: Variet Engine v1.0 + 5-model tuning complete

Phase 01 (LLM Tuning):
- Gemma4 26B: 74.65 t/s (fast)
- Qwen 35B: 61.62 t/s (balanced)
- Gemma4 31B: 16.0 t/s (deep-coder)
- Qwen 27B: 16.7 t/s (deep-logic)
- Qwen 122B: 8.95 t/s (ultra, GPU 1 only)

Phase 02 (API Engine):
- FastAPI reverse proxy on port 8000
- /engine/switch hot-swap with 503 protection
- config/engine_models.json as single source of truth
- Replaced 4 individual .bat files with unified engine

File cleanup:
- scripts/ 85 files -> 9 + _archive/
- Root .bat files -> _archive/
@@ -1,45 +0,0 @@
---
phase: 00-initialization
task: 0
total_tasks: 0
status: paused
last_updated: 2026-04-05T00:51:15+09:00
---

<current_state>
Completed project initialization and architecture planning.
GSD project state (.planning/PROJECT.md and config.json) corresponds to the 'Dual-Orchestration AI Assistant' structure using a 2+0 GPU division.
Phase 1 planning has not started yet; it is the next step.
</current_state>

<completed_work>
- Configured git repository, remote (`Variet/variet_llm`), and Vikunja
- Cleaned up previous `agent_guide` config
- Wrote `.planning/PROJECT.md` outlining the 3-Tier model strategy and the requirements
- Wrote `.planning/config.json`
- Committed everything to git
</completed_work>

<remaining_work>
- Plan Phase 1: Machine A LLM inference server setup and Hot-swap scripts (Fast/Balanced/Deep)
- Plan Phase 2: Machine B VS Code Extension
- Plan Phase 3: Machine B Discord Bot
- Plan Phase 4: MCP Tool integration
</remaining_work>

<decisions_made>
- Decided to use 2+0 GPU architecture because it gives single-user coding requests maximum throughput (50-80 t/s) while keeping orchestration neatly on Machine B.
- Picked a 3-tier model strategy: Gemma4 26B (Fast), Qwen 35B (Balanced), Qwen 122B (Deep).
</decisions_made>

<blockers>
- None.
</blockers>

<context>
We transitioned from pure Llama.cpp tuning to architectural layout. The logic for how tools are routed has been clarified (LLM thinks on Machine A, tools are executed locally on Machine B). Next logical step is to execute Phase 1 (infrastructure and hot swap on Machine A).
</context>

<next_action>
Start with: `/gsd-plan-phase 1` to design the Machine A startup and hot swap mechanism.
</next_action>
@@ -1,34 +1,56 @@
 {
-  "version": "1.0",
-  "timestamp": "2026-04-06T21:18:00+09:00",
-  "phase": "01",
-  "phase_name": "llm-tuning",
-  "phase_dir": ".planning/phases/01-llm-tuning",
-  "plan": 1,
-  "task": 3,
-  "total_tasks": 5,
-  "status": "paused",
-  "completed_tasks": [
-    {"id": 1, "name": "Evaluate 122B Single GPU", "status": "done", "commit": ""},
-    {"id": 2, "name": "Evaluate 122B Dual GPU memory geometric splitting", "status": "done", "commit": ""},
-    {"id": 3, "name": "Calculate theoretical limits of DDR4 MoE fetching", "status": "done", "commit": ""},
-    {"id": 4, "name": "Test Qwen 27B Dense context bounds limits", "status": "in_progress", "progress": "Confirmed -c 262144 boots successfully"}
+  "version": "2.0",
+  "timestamp": "2026-04-07T18:07:00+09:00",
+  "current_phase": "02-api-engine",
+  "phase_status": "complete",
+  "next_phase": "03-vscode-extension",
+
+  "completed": [
+    {
+      "phase": "01-llm-tuning",
+      "summary": "Optimal inference settings finalized for 5 models",
+      "key_output": "config/engine_models.json",
+      "metrics": {
+        "fast_gemma4_26b": "74.65 t/s, 256K ctx, dual GPU",
+        "balanced_qwen_35b": "61.62 t/s, 256K ctx, dual GPU",
+        "deep_coder_gemma4_31b": "16.0 t/s, 192K ctx, dual GPU",
+        "deep_logic_qwen_27b": "16.7 t/s, 256K ctx, dual GPU",
+        "ultra_qwen_122b": "8.95 t/s, 256K ctx, GPU 1 only"
+      }
+    },
+    {
+      "phase": "02-api-engine",
+      "summary": "Variet Engine v1.0: FastAPI reverse proxy + model hot-swap",
+      "key_output": "engine/variet_engine.py",
+      "features": [
+        "Single-port (8000) OpenAI-compatible API",
+        "/engine/switch/{role} hot-swap",
+        "503 + Retry-After client protection during a swap",
+        "llama-server process lifecycle management",
+        "Configuration driven by config/engine_models.json"
+      ],
+      "verified": true
+    }
   ],
-  "remaining_tasks": [
-    {"id": 5, "name": "Evaluate Gemma-4 31B max context and speed", "status": "not_started"}
-  ],
-  "blockers": [
-    {"description": "122B Q4_K_M 20t/s Generation Speed Limit", "type": "technical", "workaround": "Physical limitation of DDR4 RAM bandwidth (50GB/s) against 4+ GB of active weights. Cannot be bypassed. Shifted focus to smaller Dense models that fit completely into VRAM."}
-  ],
-  "human_actions_pending": [],
-  "decisions": [
-    {"decision": "Stop forcing Dual GPU symmetric utilization on MoE with n-cpu-moe", "rationale": "Model asymmetry forces OOM on one GPU and underutilization on the other.", "phase": "01"},
-    {"decision": "Shift focus to Qwen 27B / Gemma 4 31B dense models", "rationale": "They fit 100% into VRAM, bypassing WDDM/PCIe/DDR4 bottlenecks, guaranteeing ~20+ t/s generation speeds.", "phase": "01"}
-  ],
-  "uncommitted_files": [
-    "scripts/find_max_dense.mjs",
-    "scripts/tune_122b_20ts.mjs"
-  ],
-  "next_action": "Complete speed benchmark for Qwen 27B and find max context for Gemma 4 31B",
-  "context_notes": "We successfully shifted the user's focus away from physically impossible 122B Q4_K_M constraints by laying down concrete mathematical logic about VRAM/RAM bandwidth. We are now pivoting to dense models (27B/31B) to guarantee speed and context size."
+
+  "hardware_notes": {
+    "gpu0": "RTX 3060 12GB, PCIe 3.0 x4 (B550 chipset, ~4 GB/s)",
+    "gpu1": "RTX 3060 12GB, PCIe 4.0 x16 (CPU direct, ~32 GB/s)",
+    "constraint": "122B MoE must run on GPU 1 alone (-sm none --main-gpu 1)",
+    "ram": "96GB DDR4"
+  },
+
+  "project_structure": {
+    "config/engine_models.json": "CLI settings for 5 models (Single Source of Truth)",
+    "engine/variet_engine.py": "FastAPI proxy + llama-server manager",
+    "start_variet_engine.bat": "One-click engine launcher",
+    "scripts/optimal_configs.py": "Measured reference values (deprecated)",
+    "scripts/_archive/": "Archive of tuning/benchmark history"
+  },
+
+  "next_steps": [
+    "Set up start_variet_engine.bat to run persistently on Machine A (Task Scheduler or a service)",
+    "Start VS Code Extension development on Machine B (agent loop)",
+    "Start Discord Bot development on Machine B (personal assistant)"
+  ]
 }
@@ -4,7 +4,7 @@
 A high-performance, locally-hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture where Machine A acts as a dedicated inference server running large language models, while Machine B handles the user interface (VS Code, Discord) and tool execution.

 ## Problem / Core Value
-Standard LLM set-ups on a single GPU often struggle with context switching and running multi-tools asynchronously. By dedicating an API server to raw inference (50-80 t/s with Qwen 35B), the system achieves extreme responsiveness for coding while preserving resources for tool execution (Calendar, Mail, Search) on the workstation.
+Standard LLM set-ups on a single GPU often struggle with context switching and running multi-tools asynchronously. By dedicating an API server to raw inference (up to 75 t/s with Gemma4 26B), the system achieves extreme responsiveness for coding while preserving resources for tool execution (Calendar, Mail, Search) on the workstation.

 ## Target Audience
 Single developer working on complex coding tasks alongside daily administrative tasks.
@@ -13,20 +13,21 @@ Single developer working on complex coding tasks alongside daily administrative

 | Decision | Rationale | Outcome |
 |----------|-----------|---------|
-| 2+0 GPU Architecture | Placing both GPUs in Machine A allows Qwen 35B to fully load into VRAM, increasing speed from 30t/s to 50-80t/s. | Machine A: API Server only.<br/>Machine B: All orchestrations & tools. |
+| 2+0 GPU Architecture | Placing both GPUs in Machine A allows models to fully load into VRAM, increasing speed dramatically. | Machine A: API Server only.<br/>Machine B: All orchestrations & tools. |
 | Separation of Agent Logic | Machine A is a pure "brain" (llama-server). Machine B has the "hands and eyes" (VS Code extension and Discord Bot). | Simplified infrastructure; tools execute directly on the workstation. |
-| 3-Tier Model Strategy | Need balanced speeds depending on the complexity of the task requested. | Fast: Gemma4 26B (~70t/s)<br/>Balanced: Qwen 35B (~50t/s)<br/>Deep: Qwen 122B (~11t/s) |
+| 5-Tier Model Strategy | Need balanced speeds depending on the complexity of the task requested. | Fast: Gemma4 26B (~75t/s)<br/>Balanced: Qwen 35B (~62t/s)<br/>Deep-Coder: Gemma4 31B (~16t/s)<br/>Deep-Logic: Qwen 27B (~17t/s)<br/>Ultra: Qwen 122B (~9t/s) |
+| GPU 0 PCIe x4 constraint | GPU 0 sits in a PCIe 3.0 x4 slot, so it has 1/8 the bandwidth. The MoE model (122B) must run on GPU 1 alone. | Dense models use both GPUs; the MoE Ultra model uses GPU 1 only. |
+| Variet Engine (FastAPI proxy) | Relays all API traffic and handles hot-swap on a single port (8000). Replaces the sprawl of individual .bat files. | `engine/variet_engine.py` + `config/engine_models.json` |

 ## Requirements

 ### Validated

-(None yet - ship to validate)
+- [x] Deploy headless llama-server setup on Machine A. *(Phase 01)*
+- [x] Build a model hot-swap utility (5-Tier) for Machine A. *(Phase 02)*

 ### Active

-- [ ] Deploy headless llama-server setup on Machine A.
-- [ ] Build a model hot-swap utility (Fast/Balanced/Deep) for Machine A.
 - [ ] Develop a VS Code Extension (TypeScript) on Machine B for coding agent loop.
 - [ ] Develop a Discord Bot (discord.py) on Machine B for personal assistant tools.
 - [ ] Implement MCP tools (SearXNG, Google Calendar, Gmail) securely on Machine B.
@@ -37,7 +38,7 @@ Single developer working on complex coding tasks alongside daily administrative
 - [ ] Exposing Machine A to the public internet (LAN traffic only).

 ---
-*Last updated: 2026-04-05 after initialization*
+*Last updated: 2026-04-07 after Phase 02 completion*

 ## Evolution

40
.planning/STATE.md
Normal file
@@ -0,0 +1,40 @@
# Project State

## Project Reference
A high-performance, locally-hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture where Machine A acts as a dedicated inference server running large language models, while Machine B handles the user interface (VS Code, Discord) and tool execution.

## Current Position
Phase: 02-api-engine (Complete) -> Ready for Phase 3
Plan: None
Status: Transitioning to Phase 3

## Progress
[████████████████████] 100% (Phase 01: LLM Tuning)
[████████████████████] 100% (Phase 02: API Engine)

## Completed Phases
- Phase 01 (LLM Tuning): Optimal settings finalized for 5 models (74.65 / 61.62 / 16.0 / 16.7 / 8.95 t/s)
- Phase 02 (API Engine): Variet Engine v1.0 with FastAPI proxy, hot-swap, and 503 protection

## Recent Decisions
- 2+0 GPU Architecture (Machine A API Server, Machine B tools).
- 5-tier model strategy: fast/balanced/deep-coder/deep-logic/ultra.
- GPU 0 PCIe x4 constraint → 122B MoE runs on GPU 1 alone.
- Variet Engine: single-port (8000) FastAPI reverse proxy.
- config/engine_models.json → Single Source of Truth for all settings.

## Pending Todos
0 pending.

## Blockers/Concerns
None.

## Next Phases (Suggested)
- Phase 03: VS Code Extension (agent loop, tool integration)
- Phase 04: Discord Bot (personal assistant, slash commands)
- Phase 05: MCP Tools (SearXNG, Calendar, Gmail)

## Session Continuity
Last session: 2026-04-07T18:07:00+09:00
Stopped at: Phase 02 complete, GSD sync in progress
Resume file: .planning/HANDOFF.json
43
.planning/phases/01-llm-tuning/.continue-here.md
Normal file
@@ -0,0 +1,43 @@
---
phase: 01-llm-tuning
task: 3
total_tasks: 5
status: in_progress
last_updated: 2026-04-06T21:18:00+09:00
---

<current_state>
We are assessing the maximum context size and generation speed of the dense mid-sized models (Qwen 27B and Gemma 4 31B) in Q4_K_M format. Qwen 27B booted successfully with `-c 262144`. Next, run its benchmark, then find the context limit for Gemma 4 31B to see whether it also fits 256K.
</current_state>

<completed_work>

- Task 1: Evaluate 122B Dual GPU vs Single GPU dynamics - Done
- Task 2: Prove physical memory bandwidth limits of DDR4 on MoE architecture - Done
- Task 3: Test Qwen 27B Dense max context - In progress, booted successfully at -c 262144 inside 24GB VRAM
</completed_work>

<remaining_work>

- Task 3: Finish speed benchmark of Qwen 27B at 256K context.
- Task 4: Find maximum stable context for Gemma-4 31B Q4_K_M (17.0GB) and speed test.
</remaining_work>

<decisions_made>

- Concluded that hitting 20t/s on 122B Q4_K_M is physically impossible via system DDR4 RAM. The limit is ~10-12 t/s.
- Addressed `cudaMalloc failed` for dual GPU memory splitting. `n-cpu-moe` leaves a large asymmetry, so the two 12GB VRAM cards cannot both be filled efficiently.
- Pivoted entirely away from 122B and 35B optimization, redirecting efforts to dense models (27B and 31B) to guarantee speed and 256K context.
</decisions_made>
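
The ~10-12 t/s ceiling follows from bandwidth arithmetic: if every generated token has to stream the active weights out of system RAM once, decode speed cannot exceed bandwidth divided by bytes read per token. A rough sketch, using the 50 GB/s DDR4 figure and the "4+ GB of active weights" figure quoted in the handoff notes. The function name is illustrative, and it assumes all active weights come from system RAM; real throughput lands below this bound because of compute and PCIe overhead.

```python
def max_tokens_per_second(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed when each token reads the active weights once."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# Figures from the notes: ~50 GB/s DDR4, at least 4 GB of active weights per token.
DDR4_BANDWIDTH_GB_S = 50.0
ACTIVE_WEIGHTS_GB = 4.0

upper_bound = max_tokens_per_second(DDR4_BANDWIDTH_GB_S, ACTIVE_WEIGHTS_GB)
print(f"upper bound: {upper_bound:.1f} t/s")      # upper bound: 12.5 t/s
print(f"20 t/s reachable: {upper_bound >= 20}")   # 20 t/s reachable: False
```

Since the notes say "4+ GB", 12.5 t/s is the optimistic end of the bound; more active weights per token only lowers it.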

<blockers>
- None. Hardware limitations acknowledged and bounded.
</blockers>

<context>
The user demanded explicit proof and answers regarding hardware utilization and VRAM filling geometry. With those physically justified, they requested a shift to new assets (Qwen 27B, Gemma 4 31B). We found that 27B at Q4_K_M (15.5GB) fits 256K into the dual RTX 3060 perfectly.
</context>

<next_action>
Start with: Re-run `node scripts/find_max_dense.mjs` but make sure `CUDA_VISIBLE_DEVICES` correctly spans all GPUs or is explicitly blank (`$env:CUDA_VISIBLE_DEVICES=""`), to get the speed test output for Qwen 27B and Gemma 31B.
</next_action>
50
.planning/phases/01-llm-tuning/PLAN.md
Normal file
@@ -0,0 +1,50 @@
# Phase 01: LLM Tuning - PLAN

## Goal
Finalize the optimal inference settings for five models on a dual RTX 3060 setup (24GB VRAM, 96GB DDR4).

## Completed Tasks

### Task 1: Gemma 4 26B-A4B (Fast Tier) ✅
- Measured: **74.65 t/s** (AVG), 256K context
- Dual GPU, cache type f16, mlock enabled
- Fully loaded into VRAM (16.8GB)

### Task 2: Qwen 3.5 35B-A3B (Balanced Tier) ✅
- Measured: **61.62 t/s** (AVG), 256K context
- Dual GPU, tensor split 0.5/0.5, cache q4_0
- Asymmetric splits cost 12+ t/s, so 0.5/0.5 was adopted

### Task 3: Gemma 4 31B Dense (Deep Coder) ✅
- Measured: **16.0 t/s** (AVG), 192K context (limit)
- Dual GPU, fully loaded into VRAM
- OOM at 256K; 192K is the largest stable value

### Task 4: Qwen 3.5 27B Dense (Deep Logic) ✅
- Measured: **16.7 t/s** (AVG), full 256K context
- Dual GPU, tensor split 0.5/0.5
- Confirmed a bug: omitting the system prompt produces an empty response

### Task 5: Qwen 3.5 122B-A10B MoE (Ultra Heavy) ✅
- Measured: **8.95 t/s** (BEST), 256K context
- **GPU 1 only** (-sm none --main-gpu 1)
- Experts offloaded to CPU (n-cpu-moe=48)
- Found a PCIe x4 bottleneck; excluding GPU 0 roughly doubled speed (4.8 → 8.95 t/s)

## Key Findings

### GPU 0 PCIe x4 Bottleneck
- Motherboard: Gigabyte B550M AORUS ELITE
- GPU 0: PCIe 3.0 x4 (~4 GB/s), secondary slot
- GPU 1: PCIe 4.0 x16 (~32 GB/s), primary slot
- GPU 0 becomes the bottleneck when CPU↔GPU transfers are frequent, as with MoE models
- Negligible impact for dense models (100% resident in VRAM)
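
The ~4 GB/s and ~32 GB/s figures match the theoretical one-direction link rates. A quick check, assuming the standard per-lane rates of 8 GT/s (PCIe 3.0) and 16 GT/s (PCIe 4.0) with 128b/130b encoding; the helper name is illustrative, and achievable throughput is lower than these theoretical values.

```python
# Per-lane transfer rates in GT/s; both generations use 128b/130b line encoding.
GT_PER_S = {"3.0": 8.0, "4.0": 16.0}


def pcie_bandwidth_gb_s(generation: str, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    usable_gbit_per_lane = GT_PER_S[generation] * (128 / 130)  # after encoding overhead
    return usable_gbit_per_lane / 8 * lanes                    # bits -> bytes, times lanes


gpu0 = pcie_bandwidth_gb_s("3.0", 4)    # secondary slot
gpu1 = pcie_bandwidth_gb_s("4.0", 16)   # primary slot
print(f"GPU 0: {gpu0:.2f} GB/s, GPU 1: {gpu1:.2f} GB/s, ratio: {gpu1 / gpu0:.0f}x")
# GPU 0: 3.94 GB/s, GPU 1: 31.51 GB/s, ratio: 8x
```

The 8x gap is the "1/8 bandwidth" cited for GPU 0, which is why workloads that shuttle expert weights between CPU and GPU suffer on that slot.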

### Strategy Expanded from 3-Tier to 5-Tier
- The original three tiers (Fast/Balanced/Deep) were expanded to five
- Added deep-coder (Gemma 4 31B) and deep-logic (Qwen 27B)

## Deliverables
- `scripts/optimal_configs.py`: measured-value reference (deprecated, superseded by engine_models.json)
- `config/engine_models.json`: production settings (Single Source of Truth)
- `scripts/_archive/results/`: all benchmark result JSON files
23
.planning/phases/01-llm-tuning/VERIFICATION.md
Normal file
@@ -0,0 +1,23 @@
# Phase 01: LLM Tuning - Verification Report

## Verification Results

| # | Model | Role | Measured speed | Context | GPU config | Status |
|:-:|------|------|:--------:|:-------:|----------|:----:|
| 1 | Gemma 4 26B-A4B Q4_K_M | fast | 74.65 t/s | 256K | Dual | ✅ |
| 2 | Qwen 3.5 35B-A3B Q4_K_M | balanced | 61.62 t/s | 256K | Dual | ✅ |
| 3 | Gemma 4 31B Dense Q4_K_M | deep-coder | 16.0 t/s | 192K | Dual | ✅ |
| 4 | Qwen 3.5 27B Dense Q4_K_M | deep-logic | 16.7 t/s | 256K | Dual | ✅ |
| 5 | Qwen 3.5 122B-A10B Q4_K_M | ultra | 8.95 t/s | 256K | GPU 1 only | ✅ |

## UAT Criteria

- [x] Fast tier ≥ 70 t/s → **74.65 t/s** ✅
- [x] Balanced tier ≥ 50 t/s → **61.62 t/s** ✅
- [x] Deep tier runs stably → **16.0 / 16.7 t/s** ✅
- [x] Ultra tier can run → **8.95 t/s** ✅
- [x] All models fit within 2 x 12GB VRAM → ✅
- [x] Optimal settings organized as JSON → `config/engine_models.json` ✅
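
The numeric criteria above can be re-checked mechanically. A minimal sketch, with the measured values and thresholds copied from this report; the dictionary layout and the `check_uat` name are assumptions for illustration, not the format of `config/engine_models.json`.

```python
# Measured decode speeds (t/s) from the verification table.
MEASURED_TPS = {
    "fast": 74.65,
    "balanced": 61.62,
    "deep-coder": 16.0,
    "deep-logic": 16.7,
    "ultra": 8.95,
}

# Only the fast and balanced tiers have numeric thresholds in the UAT list.
MIN_TPS = {"fast": 70.0, "balanced": 50.0}


def check_uat(measured: dict[str, float], minimums: dict[str, float]) -> dict[str, bool]:
    """Return pass/fail for each role that has a numeric threshold."""
    return {role: measured[role] >= floor for role, floor in minimums.items()}


results = check_uat(MEASURED_TPS, MIN_TPS)
for role, passed in results.items():
    verdict = "PASS" if passed else "FAIL"
    print(f"{role}: {MEASURED_TPS[role]} t/s >= {MIN_TPS[role]} -> {verdict}")
```

The deep and ultra tiers are judged on stable operation rather than a speed floor, so they are left out of `MIN_TPS`.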

## Phase Status: ✅ COMPLETE
Completed: 2026-04-07
62
.planning/phases/02-api-engine/PLAN.md
Normal file
@@ -0,0 +1,62 @@
# Phase 02: API Engine - PLAN

## Goal
Build a FastAPI-based reverse proxy and hot-swap engine that consolidates Machine A's inference server behind a single port (8000).

## Architecture

```
Machine B (VS Code / Discord)
    │
    │  OpenAI-compatible API
    │  (always port 8000)
    ▼
Variet Engine (FastAPI, Port 8000)
    ├── /v1/*           → transparent relay to llama-server
    ├── /engine/status  → current model/state
    ├── /engine/models  → list of available models
    ├── /engine/switch  → model hot-swap request
    └── /engine/health  → health check
    │
    │  localhost only
    ▼
llama-server (Port 8080, not exposed externally)
```

## Completed Tasks

### Task 1: Configuration file (`config/engine_models.json`) ✅
- Stores the CLI arguments for all 5 models as exact arrays
- Single Source of Truth: parseable from Python, TypeScript, or Bash
- Spells out llama-server's dash conventions (-ngl vs --prio) explicitly, removing any guessing logic
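
To illustrate why list-form args remove the guessing: with a dict, code has to decide whether each key takes one dash or two, while a list is handed to the process verbatim. A sketch under assumptions: the `binary`, `model_path`, and `args` keys, the `.gguf` path, and the `build_command` helper are hypothetical and not taken from the actual `config/engine_models.json`; the flags shown are ones mentioned elsewhere in these notes.

```python
import json

# Hypothetical entry; the real schema of config/engine_models.json is not shown here.
EXAMPLE_CONFIG = json.loads("""
{
  "ultra": {
    "binary": "llama-server",
    "model_path": "models/example-122b.gguf",
    "args": ["-c", "262144", "-sm", "none", "--main-gpu", "1", "--n-cpu-moe", "48"]
  }
}
""")


def build_command(config: dict, role: str, port: int = 8080) -> list[str]:
    """Build an argv list for subprocess.Popen without rewriting any flag."""
    entry = config[role]
    # The stored args are appended as-is, so short (-c) and long (--main-gpu)
    # flags keep exactly the spelling written in the config file.
    return [entry["binary"], "-m", entry["model_path"], "--port", str(port), *entry["args"]]


print(build_command(EXAMPLE_CONFIG, "ultra"))
```

Because the result is already an argv list, it can go straight to `subprocess.Popen` without `shell=True` or string quoting.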

### Task 2: Engine core (`engine/variet_engine.py`) ✅
- FastAPI with lifespan events (the deprecated on_event is not used)
- llama-server lifecycle managed via subprocess.Popen
- httpx streaming proxy (including SSE)
- Clients are protected during a swap with a 503 + Retry-After response
- Process-death detection, waiting for port release, error handling
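
The 503 protection can be pictured as a small state gate in front of the proxy. The sketch below is framework-free so it runs on its own; the `EngineGate` class, its method names, and the 30-second default are illustrative assumptions rather than the code in `engine/variet_engine.py` (30 s mirrors the Retry-After value reported in the Phase 02 verification).

```python
import threading


class EngineGate:
    """Decides whether a proxied request may pass or must get a 503."""

    def __init__(self, role: str = "fast", retry_after_seconds: int = 30) -> None:
        self.role = role
        self.retry_after_seconds = retry_after_seconds
        self._switching = False
        self._lock = threading.Lock()

    def begin_switch(self) -> None:
        # Called before the old llama-server process is stopped.
        with self._lock:
            self._switching = True

    def finish_switch(self, new_role: str) -> None:
        # Called once the new llama-server process is answering requests.
        with self._lock:
            self.role = new_role
            self._switching = False

    def check(self) -> tuple[int, dict[str, str]]:
        """Return (status_code, headers) for an incoming /v1/* request."""
        with self._lock:
            if self._switching:
                return 503, {"Retry-After": str(self.retry_after_seconds)}
            return 200, {}


gate = EngineGate()
print(gate.check())              # (200, {})
gate.begin_switch()
print(gate.check())              # (503, {'Retry-After': '30'})
gate.finish_switch("balanced")
print(gate.check(), gate.role)   # (200, {}) balanced
```

In a FastAPI app the same check would sit at the top of the proxy handler, with the swap itself running in a background task so the switch request can return immediately.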

### Task 3: Launcher (`start_variet_engine.bat`) ✅
- One-click engine start script
- The 4 existing individual .bat files were moved to `_archive/`

### Task 4: File cleanup ✅
- `scripts/` reduced from 85 files to 9 plus `_archive/`
- 4 deprecated root .bat files moved to `_archive/`

## Technical Decisions

| Decision | Rationale |
|------|------|
| Changed args from dict to list | llama-server mixes short flags (-ngl) and long flags (--prio) irregularly, so inferring the dash style in code failed |
| Internal port 8080 | Kept separate from the external port (8000) for better security |
| Hot-swap via BackgroundTasks | The switch API responds immediately while the swap proceeds in the background |
| Streaming proxy | SSE streaming responses are forwarded chunk by chunk to minimize latency |

## Deliverables
- `engine/variet_engine.py`: FastAPI proxy + process manager (398 lines)
- `engine/__init__.py`: package initialization
- `config/engine_models.json`: settings for 5 models
- `start_variet_engine.bat`: one-click launcher
- `scripts/test_hotswap.py`: hot-swap verification script
58
.planning/phases/02-api-engine/VERIFICATION.md
Normal file
@@ -0,0 +1,58 @@
# Phase 02: API Engine - Verification Report

## Test Results (2026-04-07)

### Test 1: Boot and /engine/status ✅
```json
{
  "state": "ready",
  "role": "fast",
  "display_name": "Gemma 4 26B (Fast)",
  "measured_tps": 74.65,
  "context_size": "262144",
  "uptime_seconds": 40.5
}
```
- Default model (fast) auto-loaded in 14.5 s

### Test 2: /engine/models ✅
- All 5 roles are listed
- Shows each model's display_name, measured_tps, and context_size

### Test 3: /v1/chat/completions proxy ✅
- Transparent relay to llama-server (:8080) works
- Including streaming responses

### Test 4: Hot-swap fast → balanced ✅
```json
{
  "status": "switching",
  "from_role": "fast",
  "to_role": "balanced",
  "to_model": "Qwen 3.5 35B (Balanced)",
  "eta_seconds": 30
}
```
- Swap took 20 s
- Qwen 35B responded normally after the swap

### Test 5: 503 protection during a swap ✅
- Status: **503 Service Unavailable**
- Retry-After: **30**
- An error shape that clients can retry on
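
On the client side, a 503 with Retry-After is simple to handle: wait the advertised number of seconds and resend. A minimal sketch; the transport is injected as a callable so the example stays self-contained, and `send`, `call_with_retry`, and the attempt limit are hypothetical names, not part of the engine. It assumes Retry-After carries a number of seconds, as in the test above, rather than an HTTP date.

```python
import time
from typing import Callable

Response = tuple[int, dict[str, str], str]  # (status, headers, body)


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Resend while the server answers 503, honoring Retry-After (seconds)."""
    response = send()
    for _ in range(max_attempts - 1):
        status, headers, _body = response
        if status != 503:
            break
        sleep(float(headers.get("Retry-After", "1")))
        response = send()
    return response


# Simulated server: busy for the first two calls, then ready.
replies = iter([
    (503, {"Retry-After": "30"}, ""),
    (503, {"Retry-After": "30"}, ""),
    (200, {}, "ok"),
])
waits: list[float] = []
print(call_with_retry(lambda: next(replies), sleep=waits.append), waits)
# (200, {}, 'ok') [30.0, 30.0]
```

Passing `sleep` as a parameter keeps the example instant to run; a real client would leave the default `time.sleep` in place.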

### Test 6: New model works after the swap ✅
- Current model: Qwen 3.5 35B (Balanced)
- Speed: 19.7 t/s (first request, not yet warmed up)

## UAT Criteria

- [x] All APIs served from a single port (8000) → ✅
- [x] /v1/* requests are relayed transparently to llama-server → ✅
- [x] Models can be swapped via the hot-swap API → ✅
- [x] 503 + Retry-After returned during a swap → ✅
- [x] Settings for 5 models managed in JSON → ✅
- [x] One-click boot .bat → ✅

## Phase Status: ✅ COMPLETE
Completed: 2026-04-07