variet_llm/scripts/bench_short.py
Variet-Worker 0dee779a73 refactor(phase-01): v3 retune fast & balanced roles
fast (Gemma 4 26B-A4B):
- Enable mmproj GPU loading (vision ~1s, 12x faster than CPU)
- KV f16 → q8_0 (save ~2.5 GB VRAM for mmproj)
- Tensor split 0.5,0.5 → 0.43,0.57 (13/17 layers)
- Remove --mlock/--poll/--prio/-t/-tb (no measurable impact)
- measured_tps 74.65 → 71.89 (trade 3.7% speed for vision)

balanced (Qwen 3.5 35B-A3B):
- Tensor split 0.5,0.5 → 0.48,0.52 (enables pipeline parallelism)
- Ubatch 128 → 256 (prefill +78%: 649 → 1,157 t/s)
- mmproj + --no-mmproj-offload (CPU vision, VRAM headroom)
- Remove the same no-impact flags as fast
- measured_tps 61.62 → 64.16 (+4.1%)

Other:
- Document full retuning in docs/v3_{fast,balanced}_retuning_log.md
- Session report at .planning/reports/20260411-session-report.md
- Add bench utilities: bench_short/bench_long/test_ts_ratios
- Speculative decoding (E2B draft) experimented but rejected
  (+14% gen vs -31% cold start + tokenizer mismatch + mmproj conflict)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 14:55:27 +09:00
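
The percentages quoted in the commit message can be re-derived from the raw figures it gives. The following is a minimal sketch of that arithmetic; `pct_change` is a hypothetical helper name, and reading "13/17 layers" as 30 layers divided across two GPUs is an assumption, not something the commit states outright.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the commit message.
fast_gen = pct_change(74.65, 71.89)        # fast role, generation t/s
balanced_gen = pct_change(61.62, 64.16)    # balanced role, generation t/s
balanced_prefill = pct_change(649, 1157)   # balanced role, prefill t/s

print(f"fast gen:         {fast_gen:+.1f}%")
print(f"balanced gen:     {balanced_gen:+.1f}%")
print(f"balanced prefill: {balanced_prefill:+.1f}%")

# Assumption: "13/17 layers" means 13 layers on one GPU and 17 on the other.
layers = (13, 17)
split = [round(n / sum(layers), 2) for n in layers]
print(f"tensor split: {split}")
```

Under these assumptions the results round to the values in the commit message: -3.7%, +4.1%, +78.3%, and a split of 0.43,0.57.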

88 lines
2.9 KiB
Python

"""Phase 01 style short-prompt benchmark using llama.cpp internal timings."""
import json
import sys
import urllib.request

# Best effort: switch stdout to UTF-8; ignore streams that cannot be reconfigured.
try:
    sys.stdout.reconfigure(encoding="utf-8")
except Exception:
    pass


def bench_text(model_name, n=200):
    """Send one short chat completion and return the server's timings dict."""
    payload = json.dumps({
        "model": model_name,
        "messages": [{"role": "user", "content": "Count from 1 to 50, each number on a new line."}],
        "max_tokens": n,
        "temperature": 0,
    }).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:8000/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as r:
        # Returns an empty dict if the response carries no "timings" field.
        return json.loads(r.read()).get("timings", {})


def bench_image(model_name, image_path, prompt):
    """Send one image + text chat completion and return the server's timings dict."""
    import base64

    # Embed the image as a base64 data URL (the MIME type is fixed to JPEG).
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = json.dumps({
        "model": model_name,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 100,
        "temperature": 0.3,
    }).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:8000/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # Longer timeout than the text benchmark: vision prefill can be slow.
    with urllib.request.urlopen(req, timeout=600) as r:
        return json.loads(r.read()).get("timings", {})


def main():
    # Positional args are label, then model. Flags such as --image are skipped
    # here so they are never mistaken for the label or the model name.
    positional = [a for a in sys.argv[1:] if not a.startswith("--")]
    label = positional[0] if len(positional) > 0 else "run"
    model = positional[1] if len(positional) > 1 else "fast"
    do_image = "--image" in sys.argv
    print(f"=== [{label}] model={model} do_image={do_image} ===")

    # One short request to load the model before timing anything.
    print("warmup...")
    try:
        bench_text(model, 10)
    except Exception as e:
        print(f"warmup err: {e}")

    print("text 5-run:")
    runs = []
    for i in range(5):
        t = bench_text(model, 200)
        if "predicted_per_second" not in t:
            # bench_text returns {} when the response has no timings.
            sys.exit("response has no timings data; cannot benchmark")
        runs.append(t["predicted_per_second"])
        print(f" Run {i+1}: gen {t['predicted_per_second']:.2f} t/s ({t['predicted_n']} tok, {t['predicted_ms']:.0f}ms) | prompt {t['prompt_per_second']:.1f} t/s ({t['prompt_n']} tok)")
    avg = sum(runs) / len(runs)
    print(f" TEXT AVG: {avg:.2f} t/s BEST: {max(runs):.2f} MIN: {min(runs):.2f}")

    if do_image:
        prompts = [
            "What do you see in this image? One sentence.",
            "Describe the subject and background in one sentence.",
            "What is the most prominent feature? One sentence.",
        ]
        print("vision 3-run (640x640 cat):")
        for i, p in enumerate(prompts):
            t = bench_image(model, "logs/vision_test/sample.jpg", p)
            print(f" Run {i+1}: prompt {t['prompt_n']} tok ({t['prompt_ms']:.0f}ms, {t['prompt_per_second']:.1f} t/s) | gen {t['predicted_n']} tok ({t['predicted_per_second']:.1f} t/s)")


if __name__ == "__main__":
    main()
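
The script reports only the average, best, and minimum of the five text runs. When comparing configurations that differ by a few percent, a spread figure can help judge whether a difference exceeds run-to-run noise. The following is a sketch of one way to summarize a list of tokens-per-second values; `summarize` is a hypothetical helper that is not part of the script, and the sample values are made up for illustration.

```python
import statistics


def summarize(tps_runs):
    """Return avg/best/min plus median and sample stdev for t/s values."""
    if not tps_runs:
        raise ValueError("no runs to summarize")
    return {
        "avg": statistics.fmean(tps_runs),
        "best": max(tps_runs),
        "min": min(tps_runs),
        "median": statistics.median(tps_runs),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(tps_runs) if len(tps_runs) > 1 else 0.0,
    }


# Illustrative values only, not measurements from the source.
sample = [71.2, 72.0, 71.9, 71.5, 72.4]
for name, value in summarize(sample).items():
    print(f"{name:>6}: {value:.2f} t/s")
```

In `main()`, the `runs` list could be passed to such a helper in place of the hand-computed average.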