# πŸ† 2x RTX 3060 (24GB) 졜적 μΆ”λ‘  μ„€μ • β€” μ‹€μΈ‘ ν™•μ •κ°’ # ⚠️ [DEPRECATED] ⚠️ # 이 νŒŒμΌμ€ 참쑰용으둜만 λ³΄μ‘΄λ©λ‹ˆλ‹€. # ν˜„μž¬ μ‹œμŠ€ν…œμ˜ μ‹€μ œ 운영 μ„€μ •(Single Source of Truth)은 `config/engine_models.json`을 μ°Έμ‘°ν•˜μ„Έμš”. # ν…ŒμŠ€νŠΈ μΌμ‹œ: 2026-04-06 # μ»¨ν…μŠ€νŠΈ: 256K (262144) # ν•˜λ“œμ›¨μ–΄: 2x RTX 3060 12GB (Machine A) ## ═══════════════════════════════════════════════════ ## 1. Gemma 4 26B-A4B (Q4_K_M) β€” 74.65 t/s ## ═══════════════════════════════════════════════════ # μ‹€μΈ‘: AVG 74.65 / BEST 75.07 / MIN 74.27 t/s # VRAM: ~16.8 GB (μ—¬μœ  μΆ©λΆ„) # 이전 기둝: 76.4 t/s (동일 μ„€μ •) # llama-server μ‹€ν–‰ μ»€λ§¨λ“œ: # llama-server --model models\gemma-4-26B-A4B-it-Q4_K_M.gguf \ # -ngl 999 -c 262144 -np 1 -fa on \ # --cache-type-k f16 --cache-type-v f16 \ # -ub 512 -b 2048 -t 6 -tb 6 \ # --prio 3 --mlock --poll 50 \ # --port 8000 --host 0.0.0.0 GEMMA4_CONFIG = { "model": "models\\gemma-4-26B-A4B-it-Q4_K_M.gguf", "ngl": 999, "context": 262144, "np": 1, "fa": True, "cache_type_k": "f16", "cache_type_v": "f16", "ub": 512, "b": 2048, "t": 6, "tb": 6, "prio": 3, "mlock": True, "poll": 50, "measured_avg_tps": 74.65, "measured_best_tps": 75.07, } ## ═══════════════════════════════════════════════════ ## 2. 
Qwen 3.5 35B-A3B (Q4_K_M) β€” 61.62 t/s ## ═══════════════════════════════════════════════════ # μ‹€μΈ‘: AVG 61.62 / BEST 62.12 / MIN 61.02 t/s # VRAM: ~23.0 GB (GPU 0: 12038, GPU 1: 10942 β€” 거의 ν•œκ³„) # 이전 기둝: 64.18 t/s (동일 μ„€μ •, 3회 평균) # ⚠️ λΉ„λŒ€μΉ­ μŠ€ν”Œλ¦Ώ (0.49/0.51 μ΄ν•˜) μ‹œ 12+ t/s ν•˜λ½ λ˜λŠ” ν¬λž˜μ‹œ # ⚠️ UD-IQ4_NL μ‚¬μš© κΈˆμ§€ (μ•ˆμ •μ„± 문제) # llama-server μ‹€ν–‰ μ»€λ§¨λ“œ: # llama-server --model models\Qwen3.5-35B-A3B-Q4_K_M.gguf \ # -ngl 999 -c 262144 -np 1 -fa on \ # --cache-type-k q4_0 --cache-type-v q4_0 \ # -ub 128 -b 512 -t 6 -tb 6 \ # --prio 3 --mlock --poll 50 \ # -ts 0.5,0.5 \ # --port 8000 --host 0.0.0.0 QWEN35B_CONFIG = { "model": "models\\Qwen3.5-35B-A3B-Q4_K_M.gguf", "ngl": 999, "context": 262144, "np": 1, "fa": True, "cache_type_k": "q4_0", "cache_type_v": "q4_0", "ub": 128, "b": 512, "t": 6, "tb": 6, "prio": 3, "mlock": True, "poll": 50, "tensor_split": "0.5,0.5", "measured_avg_tps": 61.62, "measured_best_tps": 62.12, } ## ═══════════════════════════════════════════════════ ## μŠ€ν”Œλ¦Ώ ν…ŒμŠ€νŠΈ κ²°κ³Ό (Qwen 3.5 35B Q4_K_M) ## ═══════════════════════════════════════════════════ # 0.3 / 0.7 β†’ λΆ€νŒ… μ‹€νŒ¨ ❌ ## ═══════════════════════════════════════════════════ ## 3. Deep Tier - μ½”λ”© 및 μ‹œμŠ€ν…œ 섀계 μ „λ‹΄ (Gemma 4 31B Q4_K_M) ## ═══════════════════════════════════════════════════ # ν…ŒμŠ€νŠΈ μΌμ‹œ: 2026-04-07 # μ‹€μΈ‘: 16.0 t/s (192K κ·Ήν•œ μ»¨ν…μŠ€νŠΈ μ„ΈνŒ… μ‹œ) # μš©λ„ [Primary Coder]: λ³΅μž‘ν•œ Python μ½”λ”©, ν”„λ ˆμž„μ›Œν¬ μ•„ν‚€ν…μ²˜ 섀계, μ•Œκ³ λ¦¬μ¦˜ μ΅œμ ν™”, λͺ¨μ˜ ν…ŒμŠ€νŠΈ μΌ€μ΄μŠ€ μž‘μ„± λ“± "μ‹œλ‹ˆμ–΄κΈ‰ μ—”μ§€λ‹ˆμ–΄λ§ λŠ₯λ ₯이 μ œμ•½μ μœΌλ‘œ μš”κ΅¬λ˜λŠ” μž‘μ—…" μ „λ‹΄ # νŠΉμ§•: 24GB VRAM ν™˜κ²½μ—μ„œ 단일 λͺ¨λΈ ν’€ λ‘œλ”© μ‹œ μ΅œλŒ€ 192K μ»¨ν…μŠ€νŠΈλ₯Ό μ§€μ›ν•©λ‹ˆλ‹€ (ub=128 μ„Έλ°€ 컨트둀 톡과). # System Prompt λˆ„λ½μ— μƒλŒ€μ μœΌλ‘œ μœ μ—°ν•˜λ©° 창의적인 문제 해결에 λ›°μ–΄λ‚©λ‹ˆλ‹€. 
GEMMA4_31B_DEEP_CONFIG = {
    "model": "models\\gemma-4-31B-it-Q4_K_M.gguf",
    "ngl": 999,
    "context": 196608,  # 192K limit
    "np": 1,
    "fa": True,
    "cache_type_k": "q4_0",
    "cache_type_v": "q4_0",
    "ub": 128,
    "b": 512,
    "t": 6,
    "tb": 6,
    "prio": 3,
    "mlock": True,
    "poll": 50,
    "measured_avg_tps": 16.0,
    "role_assignment": "Primary Coder & Architect",
}

## ═══════════════════════════════════════════════════
## 4. Deep Tier: Complex Logic and Very Large Document Analysis (Qwen 3.5 27B Q4_K_M)
## ═══════════════════════════════════════════════════
# Test date: 2026-04-07
# Measured: 16.7 t/s (at the 256K maximum context setting)
# Role [Logic Analyst]: careful engineering reasoning under ambiguous or limited information,
#   mathematical problem solving, and reading huge documents that fill the full 256K context
#   to extract the key rules
# Notes: stably supports the full 256K context (ub=512), absorbing the very large memory buffer this requires.
# ⚠️ Caution: every API request must include a System Prompt ("You are a...");
#   otherwise the model may hit the empty-response bug.
QWEN35_27B_DEEP_CONFIG = {
    "model": "models\\Qwen3.5-27B-Q4_K_M.gguf",
    "ngl": 999,
    "context": 262144,  # 256K full
    "np": 1,
    "fa": True,
    "cache_type_k": "q4_0",
    "cache_type_v": "q4_0",
    "ub": 512,
    "b": 1024,
    "t": 6,
    "tb": 6,
    "prio": 3,
    "mlock": True,
    "poll": 50,
    "tensor_split": "0.5,0.5",
    "measured_avg_tps": 16.7,
    "role_assignment": "Logic Analyst & Huge Context Reader",
}

## ═══════════════════════════════════════════════════
## 5. Qwen 3.5 122B-A10B MoE (Q4_K_M): 8.95 t/s
## ═══════════════════════════════════════════════════
# Test date: 2026-04-07
# Hardware issue: GPU0 is limited to PCIe 3.0 x4, so running across both GPUs (split) creates a severe bottleneck.
# Fix: use GPU1 (Gen4 x16) alone and offload the experts to the CPU.
# Measured: AVG 8.81 / BEST 8.95 t/s
# VRAM: stays at 6.5 GB on the single GPU
# Role [Ultra-Heavy Analyst]: top-difficulty reasoning and agent workflows that need the
#   knowledge pool of the full 122B parameters
QWEN35_122B_MOE_CONFIG = {
    "model": "models\\Q4_K_M\\Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf",
    "ngl": 999,
    "n_cpu_moe": 48,  # Experts of 16 layers on GPU, the rest on CPU
    "context": 4096,  # Physical memory limit: be careful when extending the context
    "np": 1,
    "fa": True,
    "cache_type_k": "q4_0",
    "cache_type_v": "q4_0",
    "ub": 512,
    "b": 2048,
    "t": 8,  # Matches the number of physical CPU cores
    "tb": 8,
    "prio": 3,
    "poll": 50,
    "main_gpu": 1,
    "split_mode": "none",
    "no_mmap": True,
    "measured_avg_tps": 8.81,
    "measured_best_tps": 8.95,
    "role_assignment": "Ultra-Heavy Reasoning Agent",
}
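
## ═══════════════════════════════════════════════════
## Sketch: building the llama-server command from a config dict
## ═══════════════════════════════════════════════════
# A minimal sketch, not part of the original measurements. It rebuilds the
# llama-server commands shown above from the config dicts, which keeps the
# dicts and the command lines from drifting apart. FLAG_MAP, SWITCH_MAP and
# build_llama_server_args are hypothetical names. The flags for n_cpu_moe,
# main_gpu, split_mode and no_mmap do not appear in this file's commands.
# They are assumed from current llama.cpp naming, so confirm them with
# `llama-server --help` before relying on them.

# Config keys that take a value, mapped to the llama-server flag used above.
FLAG_MAP = {
    "model": "--model",
    "ngl": "-ngl",
    "context": "-c",
    "np": "-np",
    "cache_type_k": "--cache-type-k",
    "cache_type_v": "--cache-type-v",
    "ub": "-ub",
    "b": "-b",
    "t": "-t",
    "tb": "-tb",
    "prio": "--prio",
    "poll": "--poll",
    "tensor_split": "-ts",
    # Assumed flag names (not in the commands above).
    "n_cpu_moe": "--n-cpu-moe",
    "main_gpu": "--main-gpu",
    "split_mode": "--split-mode",
}

# Boolean keys that become bare switches when True.
SWITCH_MAP = {
    "mlock": "--mlock",
    "no_mmap": "--no-mmap",  # assumed flag name
}


def build_llama_server_args(cfg, port=8000, host="0.0.0.0"):
    """Return an argv list for llama-server built from one config dict.

    Metadata keys such as measured_avg_tps and role_assignment have no
    entry in either map, so they are skipped.
    """
    args = ["llama-server"]
    for key, flag in FLAG_MAP.items():
        if key in cfg:
            args += [flag, str(cfg[key])]
    if cfg.get("fa"):
        # The commands above pass flash attention as "-fa on".
        args += ["-fa", "on"]
    for key, flag in SWITCH_MAP.items():
        if cfg.get(key):
            args.append(flag)
    args += ["--port", str(port), "--host", host]
    return args

# Usage, assuming this module is imported as-is:
#   import subprocess
#   subprocess.run(build_llama_server_args(QWEN35B_CONFIG))
# Returning a list instead of a shell string avoids quoting problems with the
# Windows-style backslash paths in the "model" entries.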