
Variet-Worker 0dee779a73 refactor(phase-01): v3 retune fast & balanced roles

fast (Gemma 4 26B-A4B):
- Enable mmproj GPU loading (vision ~1s, 12x faster than CPU)
- KV f16 → q8_0 (save ~2.5 GB VRAM for mmproj)
- Tensor split 0.5,0.5 → 0.43,0.57 (13/17 layers)
- Remove --mlock/--poll/--prio/-t/-tb (no measurable impact)
- measured_tps 74.65 → 71.89 (trade 3.7% speed for vision)

balanced (Qwen 3.5 35B-A3B):
- Tensor split 0.5,0.5 → 0.48,0.52 (enables pipeline parallelism)
- Ubatch 128 → 256 (prefill +78%: 649 → 1,157 t/s)
- mmproj + --no-mmproj-offload (CPU vision, VRAM headroom)
- Remove useless flags same as fast
- measured_tps 61.62 → 64.16 (+4.1%)

Other:
- Document full retuning in docs/v3_{fast,balanced}_retuning_log.md
- Session report at .planning/reports/20260411-session-report.md
- Add bench utilities: bench_short/bench_long/test_ts_ratios
- Speculative decoding (E2B draft) experimented but rejected
  (+14% gen vs -31% cold start + tokenizer mismatch + mmproj conflict)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-11 14:55:27 +09:00

2.5 KiB


gsd_state_version: 1.0
milestone: v1.1
milestone_name: milestone
status: planning
last_updated: 2026-04-11T10:30:00.000Z
last_activity: 2026-04-11
progress:
  total_phases: 3
  completed_phases: 2
  total_plans: 3
  completed_plans: 2

Project State

Project Reference

A high-performance, locally-hosted AI assistant system built on two RTX 3060 12GB GPUs. It uses a "2+0" architecture where Machine A acts as a dedicated inference server running large language models, while Machine B handles the user interface (VS Code, Discord) and tool execution.

Current Position

Phase: 05
Plan: 05-PLAN.md (1 of 1)
Status: Ready to execute
Last activity: 2026-04-08

Progress

[████████████████████] 100% (Phase 01: LLM Tuning)
[████████████████████] 100% (Phase 02: API Engine)
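
The bars above are 20 cells wide. A minimal sketch of how such a line could be rendered from completed/total counts follows; the function name `render_bar`, the empty-cell character, and the idea that each bar is driven by a simple done/total ratio are assumptions, not something this file states.

```python
def render_bar(done: int, total: int, label: str, width: int = 20) -> str:
    """Render a fixed-width text progress bar (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Clamp the ratio so over- or under-counting never breaks the bar width.
    ratio = min(max(done / total, 0.0), 1.0)
    filled = round(ratio * width)
    # The "░" empty cell is an assumption; the file only shows full bars.
    bar = "█" * filled + "░" * (width - filled)
    return f"[{bar}] {round(ratio * 100)}% ({label})"


print(render_bar(1, 1, "Phase 01: LLM Tuning"))
```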

Completed Phases

Phase 01 (LLM Tuning): optimal settings finalized for 5 models (71.89‡ / 64.16† / 16.0 / 16.7 / 8.95 t/s). † balanced and ‡ fast were retuned on 2026-04-11.
Phase 02 (API Engine): Variet Engine v1.0, a FastAPI proxy with hot-swap and 503 protection.
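
As a rough illustration of what hot-swap with 503 protection could look like, here is a minimal sketch of a gate that rejects requests while a model swap is in progress. The class `SwapGate` and its methods are hypothetical, and reading "503 protection" as "answer 503 during a swap" is an assumption; Variet Engine's actual implementation is not shown in this file.

```python
import threading
from typing import Tuple


class SwapGate:
    """Tracks whether a model swap is in progress (hypothetical helper)."""

    def __init__(self, active_role: str) -> None:
        self._lock = threading.Lock()
        self._active = active_role
        self._swapping = False

    def begin_swap(self) -> bool:
        """Mark a swap as started; return False if one is already running."""
        with self._lock:
            if self._swapping:
                return False
            self._swapping = True
            return True

    def finish_swap(self, new_role: str) -> None:
        """Record the newly loaded role and reopen the gate."""
        with self._lock:
            self._active = new_role
            self._swapping = False

    def check(self) -> Tuple[int, str]:
        """Return (HTTP status, detail) for an incoming request."""
        with self._lock:
            if self._swapping:
                # Assumed behaviour: refuse instead of proxying to a backend
                # that is still unloading or loading a model.
                return 503, "model swap in progress"
            return 200, self._active


gate = SwapGate("fast")
print(gate.check())
gate.begin_swap()
print(gate.check())
gate.finish_swap("balanced")
print(gate.check())
```

In a FastAPI proxy, a handler could call `check()` before forwarding and turn a 503 result into an error response, so clients retry rather than hang on a half-loaded model.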

Recent Decisions

2+0 GPU Architecture (Machine A API Server, Machine B tools).
5-tier model strategy: fast/balanced/deep-coder/deep-logic/ultra.
GPU 0 is limited to PCIe x4 → the 122B MoE runs on GPU 1 alone.
Variet Engine: single-port (8000) FastAPI reverse proxy.
config/engine_models.json → Single Source of Truth for all settings.
CLI-first verification strategy: validate the agent loop with the OpenClaude CLI before moving on to the VS Code Extension.
balanced role v3 retune (2026-04-11): -ub 256, -ts 0.48,0.52, --no-mmproj-offload, auxiliary options (mlock/poll/prio) removed. Measured 61.62 → 64.16 t/s. Prefill 649 → 1,157 t/s (+78%). Details: docs/v3_balanced_retuning_log.md.
fast role v3 retune (2026-04-11): cache-type q8_0, -ts 0.43,0.57, --mmproj loaded on GPU, auxiliary options removed. Measured 74.65 → 71.89 t/s (-3.7%). GPU vision support added (image encoding ~1 s). Speculative Decoding (E2B draft) was tested but not adopted. Details: docs/v3_fast_retuning_log.md.
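
To make the two retune entries above easier to check, here is a minimal sketch that records each role's flags and measured speeds, recomputes the percentage deltas, and assembles an argument list. The `RoleTuning` structure, the `llama-server` binary name, and the model/mmproj paths are assumptions for illustration; the real schema of config/engine_models.json is not shown here. The flag spellings `--cache-type-k` and `--cache-type-v` are the llama.cpp options assumed to correspond to "cache-type q8_0".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RoleTuning:
    """One role's retuned flags plus its before/after generation speed."""
    role: str
    flags: List[str]
    tps_before: float
    tps_after: float


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def build_args(model_path: str, tuning: RoleTuning,
               mmproj_path: Optional[str] = None) -> List[str]:
    """Assemble an argv list; binary name and paths are assumptions."""
    args = ["llama-server", "-m", model_path, *tuning.flags]
    if mmproj_path is not None:
        args += ["--mmproj", mmproj_path]
    return args


BALANCED = RoleTuning(
    role="balanced",
    flags=["-ub", "256", "-ts", "0.48,0.52", "--no-mmproj-offload"],
    tps_before=61.62,
    tps_after=64.16,
)
FAST = RoleTuning(
    role="fast",
    flags=["--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
           "-ts", "0.43,0.57"],
    tps_before=74.65,
    tps_after=71.89,
)

for t in (BALANCED, FAST):
    print(f"{t.role}: {pct_change(t.tps_before, t.tps_after):+.1f}%")
print(f"balanced prefill: {pct_change(649, 1157):+.1f}%")
```

The recomputed deltas agree with the figures recorded above: roughly +4.1% for balanced, -3.7% for fast, and +78% for balanced prefill.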

Roadmap Evolution

Phase 6 added: Install and evaluate Hermes Agent

Pending Todos

0 pending.

Blockers/Concerns

None.

Session Continuity

Last session: 2026-04-11T10:30:00+09:00
Stopped at: balanced role v3 retune complete. Evidence saved to config/engine_models.json, docs/v3_balanced_retuning_log.md, and Phase 01 VERIFICATION.md. Awaiting selection of the next task.
