fix(bot): multi-project signal freeze — cache-only _get_channel + per-tick scanner cap
Root cause: When 3+ projects generated pending requests simultaneously, the bot's pending_approval_scanner made 20-40 sequentially awaited Discord API calls in one tick. That triggered Discord 429 rate limits, which blocked the entire scanner for 10-30s and froze ALL signal delivery.

Two fixes:

1. _get_channel(): Replace guild.fetch_channels() (API call) with discord.utils.get(guild.channels) (in-memory cache). This eliminates redundant API calls and lock contention when multiple projects arrive.
2. pending_approval_scanner: Per-tick caps (5 new + 5 status) prevent one tick from monopolizing the Discord API quota. Excess items are processed in subsequent 3-second ticks.
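The per-tick cap can be illustrated in isolation. The following is a minimal sketch, not the bot's actual scanner: `run_tick`, `drain`, `drain_batches`, and the in-memory `deque` backlog are hypothetical names introduced here, and `fake_send` stands in for a single Discord API call.

```python
import asyncio
from collections import deque

MAX_PER_TICK = 5  # mirrors the per-phase cap named in the commit message


async def run_tick(backlog: deque, send, cap: int = MAX_PER_TICK) -> int:
    """Process at most `cap` items; anything left stays queued for the next tick."""
    processed = 0
    while backlog and processed < cap:
        await send(backlog.popleft())  # stand-in for one Discord API call
        processed += 1
    return processed


async def drain(items, cap: int = MAX_PER_TICK) -> list:
    """Run ticks until the backlog is empty; return how many items each tick handled."""
    backlog = deque(items)
    sent = []

    async def fake_send(item):
        # Hypothetical sender: records the item instead of calling Discord.
        sent.append(item)

    batches = []
    while backlog:
        batches.append(await run_tick(backlog, fake_send, cap))
    return batches


def drain_batches(n: int, cap: int = MAX_PER_TICK) -> list:
    """Synchronous helper: batch sizes needed to drain `n` pending items."""
    return asyncio.run(drain(range(n), cap))
```

Under these assumptions, 15 pending items drain in three batches of 5, so with a 3-second tick the last batch goes out roughly two ticks after the first. The trade is a bounded per-item delay in exchange for never issuing more than `cap` calls per phase in one tick.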
@@ -514,6 +514,12 @@
 - **Fix**: (1) Moved writeChatSnapshot outside the loop (called once when resolvedCount > 0), (2) added the `pd.conversation_id === activeSessionId` condition, (3) excluded the 'Deny'/'Allow' text from primaryCommand
 - **Caution**: When a Bridge file loop sends messages to an external system (Discord), always aggregate outside the loop and send once
 
+### [2026-03-16] Multi-project simultaneous signal freeze: Scanner O(N) Discord API + fetch_channels bottleneck
+- **Symptom**: When several projects (3+) create pending requests at the same time, signal delivery for every project stops and the AI stalls
+- **Cause**: Two combined factors: (1) `pending_approval_scanner` processes every pending item (15+) sequentially in one tick → each item calls `_get_channel()` plus `channel.send/edit/fetch_message` against the Discord API → 20-40 API calls trigger the Discord 429 rate limit → discord.py waits internally → the scanner tick blocks for tens of seconds → signals for all projects are delayed. (2) On a cache miss, `_get_channel()` calls `guild.fetch_channels()` (Discord API) inside an asyncio Lock → lock waits and API calls are serialized
+- **Fix**: (1) Changed `_get_channel()` to use the `discord.utils.get(guild.channels)` cache, removing the API call, (2) added per-tick caps to the scanner (Phase 1: 5 items, Phase 2: 5 items); the remainder carries over to the next tick
+- **Caution**: `guild.channels` is discord.py's internal cache (refreshed automatically by Gateway events). Channel deletion/creation is reflected in the cache immediately. The per-tick cap introduces a delay of up to 3 seconds, but that is far better than a total freeze
+
 ### [2026-03-15] Earlier analysis misjudgment (False Positive): lessons learned
 - **Symptom**: Problems reported as P0/P1 during the system audit were in fact already guarded by defensive code logic (idempotency, try-catch, intentional exact-match)
 - **Cause**: Judged from local code snippets alone. The full data lifecycle was not traced end to end, so the defensive logic was missed
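The cache-first lookup with a lock only around creation can be sketched without discord.py. This is an illustrative model under stated assumptions, not the bot's code: `ChannelRegistry`, `cache`, `known`, and `_create` are hypothetical stand-ins for `guild.channels`, `project_channels`, and the channel-creating API call.

```python
import asyncio


class ChannelRegistry:
    """Hypothetical stand-in for the bot's channel bookkeeping."""

    def __init__(self, cached_names):
        # Plays the role of guild.channels (an in-memory cache).
        self.cache = {name: f"chan:{name}" for name in cached_names}
        self.known = {}         # plays the role of project_channels
        self.lock = asyncio.Lock()
        self.create_calls = 0   # counts simulated API calls

    async def _create(self, name):
        self.create_calls += 1
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"chan:{name}"

    async def get(self, name):
        # Fast path 1: already resolved for this project.
        if name in self.known:
            return self.known[name]
        # Fast path 2: in-memory cache lookup, no lock and no API call.
        cached = self.cache.get(name)
        if cached is not None:
            self.known[name] = cached
            return cached
        # Slow path: serialize creation so concurrent callers create only once.
        async with self.lock:
            if name in self.known:  # double-check after acquiring the lock
                return self.known[name]
            created = await self._create(name)
            self.known[name] = created
            return created


async def demo():
    reg = ChannelRegistry(cached_names=["alpha"])
    results = await asyncio.gather(
        reg.get("alpha"), reg.get("beta"), reg.get("beta"), reg.get("beta")
    )
    return results, reg.create_calls
```

In this sketch, three concurrent requests for the uncached `beta` result in a single simulated create call, while the cached `alpha` never touches the lock. The double-check inside the lock is what keeps the later waiters from creating duplicates.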
60
bot.py
@@ -356,32 +356,34 @@ class GravityBot(commands.Bot):
         logger.info(f"Discovered {len(self.project_channels)} project channels")
 
     async def _get_channel(self, project_name: str) -> discord.TextChannel:
-        """Get or create a channel for a project. Lock-protected."""
-        if project_name in self.project_channels:
-            return self.project_channels[project_name]
+        """Get or create a channel for a project.
 
-        async with self._channel_lock:
-            # Double-check after lock
+        Uses guild.channels cache first (NO API call), only locks + creates
+        if channel truly doesn't exist. This prevents O(N) fetch_channels()
+        API calls when multiple projects arrive simultaneously.
+        """
         if project_name in self.project_channels:
             return self.project_channels[project_name]
 
         channel_name = self._make_channel_name(project_name)
 
-            # Search existing channels FIRST (prevents duplicates)
-            try:
-                all_channels = await self.guild.fetch_channels()
-                for ch in all_channels:
-                    if (isinstance(ch, discord.TextChannel)
-                            and ch.name == channel_name
-                            and ch.category_id == self.session_category.id):
-                        self.project_channels[project_name] = ch
-                        self.channel_to_project[ch.id] = project_name
-                        logger.info(f"Found existing channel: #{channel_name}")
-                        return ch
-            except Exception as e:
-                logger.warning(f"fetch_channels failed: {e}")
+        # 1. Check guild channel cache (NO API call — instant)
+        existing = discord.utils.get(
+            self.guild.channels, name=channel_name,
+            category_id=self.session_category.id,
+        )
+        if existing and isinstance(existing, discord.TextChannel):
+            self.project_channels[project_name] = existing
+            self.channel_to_project[existing.id] = project_name
+            logger.info(f"Found channel (cache): #{channel_name}")
+            return existing
+
+        # 2. Only lock + API call if truly creating new channel
+        async with self._channel_lock:
+            # Double-check after lock (another coroutine may have created it)
+            if project_name in self.project_channels:
+                return self.project_channels[project_name]
 
-            # No existing channel — create new
             try:
                 ch = await self.guild.create_text_channel(
                     name=channel_name,
@@ -543,7 +545,11 @@ class GravityBot(commands.Bot):
 
     @tasks.loop(seconds=3)
     async def pending_approval_scanner(self):
-        """Scan bridge/pending/ for new approval requests + reload registrations."""
+        """Scan bridge/pending/ for new approval requests + reload registrations.
+
+        Per-tick caps prevent Discord API rate limit cascade when multiple
+        projects generate pending files simultaneously.
+        """
         try:
             # Reload conv→project registrations each cycle
             self._load_registrations()
@@ -551,8 +557,14 @@ class GravityBot(commands.Bot):
             # Channels are created on-demand when actual signals arrive
             # (via _get_channel in snapshot scanner / approval sender)
 
+            MAX_NEW_PER_TICK = 5  # Phase 1: max new pending to process per tick
+            MAX_STATUS_PER_TICK = 5  # Phase 2: max status changes to process per tick
+            phase1_processed = 0
+
             requests = self.bridge.get_pending_requests()
             for req in requests:
+                if phase1_processed >= MAX_NEW_PER_TICK:
+                    break
                 if req.request_id in self._sent_approval_ids:
                     continue
                 if req.discord_message_id != 0:
@@ -571,6 +583,7 @@ class GravityBot(commands.Bot):
                     if req.command.strip().lower() in reject_commands:
                         logger.warning(f"Auto-approve BLOCKED: command='{req.command}' is reject-word — skipping")
                         self._sent_approval_ids.add(req.request_id)
+                        phase1_processed += 1
                         continue
 
                     self._sent_approval_ids.add(req.request_id)
@@ -614,6 +627,7 @@ class GravityBot(commands.Bot):
                     embed.set_footer(text=f"auto-approve | {req.request_id[:12]}")
                     await channel.send(embed=embed)
                     logger.info(f"Auto-approved: {req.request_id[:12]} project={project} btn_idx={approve_btn_index}")
+                    phase1_processed += 1
                     continue
 
                 # Defer short-command pendings (e.g. "Run") by 4 cycles (~12s)
@@ -640,9 +654,13 @@ class GravityBot(commands.Bot):
                 self._sent_approval_ids.add(req.request_id)
                 self._sent_commands[req.request_id] = req.command
                 await self._send_approval_request(channel, req)
+                phase1_processed += 1
 
             # ── Single-pass: handle auto_resolved, expired, and MERGE in one glob ──
+            phase2_processed = 0
             for f in self.bridge.pending_dir.glob("*.json"):
+                if phase2_processed >= MAX_STATUS_PER_TICK:
+                    break
                 try:
                     data = json.loads(f.read_text(encoding="utf-8-sig"))
                     status = data.get("status", "pending")
@@ -675,6 +693,7 @@ class GravityBot(commands.Bot):
                         self._sent_commands.pop(rid, None)
                         self._approval_messages.pop(rid, None)
                         self._sent_approval_ids.discard(rid)
+                        phase2_processed += 1
 
                     elif status == "expired":
                         msg_id = data.get("discord_message_id", 0)
@@ -697,6 +716,7 @@ class GravityBot(commands.Bot):
                         self._deferred_ids.pop(rid, None)
                         self._sent_commands.pop(rid, None)
                         self._sent_approval_ids.discard(rid)
+                        phase2_processed += 1
 
                     elif status == "pending":
                         # MERGE check: step_probe updated command in already-sent pending