docs(planning): generate codebase map via gsd-map-codebase

2026-03-29 21:35:24 +09:00
parent 3377b5f68d
commit aca7bf592a
7 changed files with 96 additions and 0 deletions
--- a/.planning/codebase/ARCHITECTURE.md
+++ b/.planning/codebase/ARCHITECTURE.md
@@ -0,0 +1,17 @@
 # ARCHITECTURE
 ## System Design
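 The project is built as a sequential, multi-step **Data Processing Pipeline** that turns raw video into a formatted A4 PDF. The primary entry point pushes the video through five logical stages, labeled below with the script's own step numbers (deduplication and stitching together span Steps 4 and 5):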
 1. **Download (`Step 1`)**: Fetches the target YouTube video with `yt-dlp`.
 2. **Frame Extraction (`Step 2`)**: Opens an OpenCV `VideoCapture`, samples frames at `DEFAULT_FPS` (skipping the rest) to limit memory use, and crops out non-meaningful content.
 3. **Pattern Detection (`Step 3`)**: Classifies the scrolling behavior (e.g., `scroll` vs `overlay`) using temporal tracking and template matching.
 4. **Frame Dedup & Stitching (`Step 4/5`)**: The core logic engine. Filters out duplicate frames caused by video pauses or by repeated sections (rewinds, D.S. al Coda jumps). Tracks pixel movement with `TemporalTracker`, then either stitches horizontally scrolling tabs or stacks overlay pages.
 5. **PDF Tiling (`Step 6`)**: Splits the stitched panoramas into printable A4-sized chunks, using layout metrics to fit them to the page.
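A rough idea of what the Step 6 tiling involves is sketched below. It assumes a single horizontally stitched strip (8-bit grayscale or BGR), cuts it into fixed-width rows, and greedily packs those rows onto white-padded pages with A4 proportions. The function names, the white padding, and the greedy packing strategy are illustrative assumptions, not the project's actual implementation; embedding the finished pages into a PDF is left to `img2pdf`.

```python
import numpy as np

# A4 portrait proportions: 210 mm wide x 297 mm tall.
A4_ASPECT = 297 / 210
WHITE = 255  # padding value for 8-bit grayscale or BGR images (assumption)


def _pad_spec(ndim: int, below: int = 0, right: int = 0) -> list:
    """np.pad spec that pads below and to the right, leaving any channel axis alone."""
    return [(0, below), (0, right)] + [(0, 0)] * (ndim - 2)


def split_wide_strip(strip: np.ndarray, segment_width: int) -> list[np.ndarray]:
    """Cut a horizontally stitched tab strip into equal-width row segments.

    The last segment is padded with white on the right so every row shares one width.
    """
    segments = []
    for left in range(0, strip.shape[1], segment_width):
        segment = strip[:, left:left + segment_width]
        missing = segment_width - segment.shape[1]
        if missing:
            segment = np.pad(segment, _pad_spec(segment.ndim, right=missing), constant_values=WHITE)
        segments.append(segment)
    return segments


def _finish_page(rows: list[np.ndarray], page_height: int) -> np.ndarray:
    """Stack rows vertically and pad the bottom with white up to the page height."""
    page = np.vstack(rows)
    missing = page_height - page.shape[0]
    if missing > 0:
        page = np.pad(page, _pad_spec(page.ndim, below=missing), constant_values=WHITE)
    return page


def pack_into_pages(rows: list[np.ndarray]) -> list[np.ndarray]:
    """Greedily stack equal-width rows onto pages with A4 proportions.

    A single row taller than a page is kept whole on its own page rather than cut.
    """
    if not rows:
        return []
    page_height = int(rows[0].shape[1] * A4_ASPECT)
    pages: list[np.ndarray] = []
    current: list[np.ndarray] = []
    used = 0
    for row in rows:
        if current and used + row.shape[0] > page_height:
            pages.append(_finish_page(current, page_height))
            current, used = [], 0
        current.append(row)
        used += row.shape[0]
    pages.append(_finish_page(current, page_height))
    return pages
```

For example, a 100 x 1000 px strip split at 250 px yields four rows; with an A4 page height of 353 px for a 250 px width, three rows fit on the first page and the fourth starts a second page.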
 ## Key Subsystems
 - **Temporal Tracker** (`video_cv_tracker.py`): Tracks frame-to-frame differences over time by evaluating per-column and per-row variation rather than relying solely on brute-force whole-image subtraction. This detects page flips cleanly.
 - **Duplicate Prevention Engine**: Uses tiered validation:
   - Phase 1: Difference-hash (`_dhash`) clustering and Laplacian variance.
   - Phase 2: Template-match history, which catches identical choruses by weighing match distance against time.
   - Phase 3: OCR verification of measure numbers, ensuring they increase monotonically.
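Phase 1 can be illustrated with the minimal sketch below: a 64-bit difference hash per grayscale frame, consecutive frames grouped while their Hamming distance stays within `max_hamming`, and the sharpest frame of each group (highest Laplacian variance) kept. The name `dedup_by_hash` mirrors the `_dedup_by_hash` helper mentioned in the conventions, but this body is hypothetical: it uses NumPy nearest-neighbour sampling in place of a proper resize, and the project's real hash size and clustering rule are not stated here. The default of 20 simply reuses the `max_hamming` value cited in CONCERNS.

```python
import numpy as np


def dhash(gray: np.ndarray, hash_size: int = 8) -> int:
    """Difference hash: one bit per comparison of horizontally adjacent samples.

    Assumption: nearest-neighbour sampling stands in for the usual area resize.
    """
    height, width = gray.shape
    rows = np.linspace(0, height - 1, hash_size).astype(int)
    cols = np.linspace(0, width - 1, hash_size + 1).astype(int)
    small = gray[np.ix_(rows, cols)].astype(np.int16)
    bits = (small[:, 1:] > small[:, :-1]).ravel()
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def laplacian_variance(gray: np.ndarray) -> float:
    """Sharpness score: variance of the 4-neighbour Laplacian over interior pixels."""
    g = gray.astype(np.float64)
    lap = g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:] - 4.0 * g[1:-1, 1:-1]
    return float(lap.var())


def dedup_by_hash(frames: list[np.ndarray], max_hamming: int = 20) -> list[np.ndarray]:
    """Group consecutive near-identical frames; keep the sharpest frame of each group."""
    kept: list[np.ndarray] = []
    cluster: list[np.ndarray] = []
    anchor_hash = None  # hash of the first frame in the current group
    for frame in frames:
        h = dhash(frame)
        if anchor_hash is not None and hamming(h, anchor_hash) <= max_hamming:
            cluster.append(frame)
            continue
        if cluster:
            kept.append(max(cluster, key=laplacian_variance))
        cluster, anchor_hash = [frame], h
    if cluster:
        kept.append(max(cluster, key=laplacian_variance))
    return kept
```

Picking the sharpest frame per group, rather than the first, is one way to avoid keeping a motion-blurred frame captured mid-transition.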
--- a/.planning/codebase/CONCERNS.md
+++ b/.planning/codebase/CONCERNS.md
@@ -0,0 +1,10 @@
 # CONCERNS
 ## Refactoring Needs
 - **Monolithic Script**: `youtube_tab_to_pdf.py` is over 1,000 lines long. Some logic has already been extracted into `video_cv_tracker.py`; further decoupling OCR validation, PDF tiling, and downloading into dedicated modules would reduce technical debt.
 - **Heuristic Fragility**: The pipeline relies heavily on hardcoded CV heuristics (`OVERLAY_MIN_AREA_RATIO = 0.05`, `max_hamming: int = 20`, similarity thresholds of `0.85` or `0.50`). Small differences in a target video's compression can break these magic numbers.
 ## Data & Quality Issues
 - **Repeating Choruses (D.S. al Coda)**: Handling music that jumps back in time is extremely difficult. When a visual template rematched an earlier chorus segment exactly, the pipeline frequently overwrote its chronological data.
 - **OCR Instability**: Using EasyOCR to catch frame overlaps makes accuracy depend heavily on the video's original resolution. YouTube compression blurs small measure numbers, so the deduplication engine intermittently fails or mis-triggers.
 - **Performance**: Running EasyOCR and complex morphological CV operations across thousands of frames is computationally intensive, and the lack of parallel processing limits throughput.
--- a/.planning/codebase/CONVENTIONS.md
+++ b/.planning/codebase/CONVENTIONS.md
@@ -0,0 +1,11 @@
 # CONVENTIONS
 ## Python & Scripting
 - **Naming**: Pythonic `snake_case` for functions/variables. Internal/private helper functions are prefixed with `_` (e.g., `_dhash`, `_dedup_by_hash`).
 - **Typing**: Extensive use of Python 3 type hints (`List`, `Tuple`, `Optional`, `np.ndarray` for image matrices) to make the data passed between pipeline stages explicit.
 - **Logging**: Script execution uses numbered progress `print()` statements (e.g., `[1/5] ...`, `  → ...`) to give users immediate feedback on pipeline state, since processing can take minutes.
 - **Windows Safety**: stdout/stderr are manually reconfigured (`sys.stdout.reconfigure(encoding="utf-8")`) to prevent `UnicodeEncodeError` crashes when printing non-ASCII characters such as `→` to the Windows `cmd.exe` console.
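The logging and Windows-safety conventions above could be expressed as the small helpers sketched here. The function names `configure_console_encoding`, `log_step`, and `log_detail` are hypothetical; only the `[n/total]` and `  → ` print formats and the `reconfigure(encoding="utf-8")` call come from the conventions themselves.

```python
import sys


def configure_console_encoding() -> None:
    """On Windows, switch stdout/stderr to UTF-8 so characters like '→' don't crash print()."""
    if sys.platform != "win32":
        return
    for stream in (sys.stdout, sys.stderr):
        # reconfigure() exists on io.TextIOWrapper (Python 3.7+); replaced streams may lack it.
        if hasattr(stream, "reconfigure"):
            stream.reconfigure(encoding="utf-8")


def log_step(index: int, total: int, message: str) -> None:
    """Top-level progress line, e.g. '[1/5] Downloading video'."""
    print(f"[{index}/{total}] {message}", flush=True)


def log_detail(message: str) -> None:
    """Indented sub-step line under the current step."""
    print(f"  → {message}", flush=True)
```

Calling `configure_console_encoding()` once at startup, before the first `log_step`, keeps the arrow characters safe on `cmd.exe`; `flush=True` makes progress appear immediately during long-running steps.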
 ## Computer Vision Processing
 - Memory use is kept low by downscaling each captured frame to `MAX_FRAME_WIDTH` (1280 px) immediately after it is read.
 - Numeric thresholds (MSE margins, `TM_CCOEFF_NORMED` match scores) are hardcoded as upper-case constants at the top of scripts for quick tuning.
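The two memory measures (sampling frames at `DEFAULT_FPS` and capping width at `MAX_FRAME_WIDTH`) might combine roughly as in this sketch. The value of `DEFAULT_FPS`, the 30 fps fallback, and the helper names are assumptions for illustration; the OpenCV calls (`VideoCapture`, `CAP_PROP_FPS`, `read`, `resize` with `INTER_AREA`) are standard, and `cv2` is imported lazily so the pure helpers can be exercised without it.

```python
DEFAULT_FPS = 2.0        # hypothetical value; the project's actual constant is not stated here
MAX_FRAME_WIDTH = 1280   # the 1280 px cap described above


def frame_stride(source_fps: float, target_fps: float = DEFAULT_FPS) -> int:
    """Keep every Nth decoded frame so that roughly target_fps frames per second survive."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    return max(1, round(source_fps / target_fps))


def downscaled_size(width: int, height: int, max_width: int = MAX_FRAME_WIDTH) -> tuple[int, int]:
    """Target (width, height) after capping the width; keeps aspect ratio, never upscales."""
    if width <= max_width:
        return width, height
    return max_width, max(1, round(height * max_width / width))


def sample_frames(video_path: str):
    """Yield downscaled frames at about DEFAULT_FPS (requires opencv-python)."""
    import cv2  # lazy import so the helpers above work without OpenCV installed

    cap = cv2.VideoCapture(video_path)
    try:
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # assumed fallback when the container reports 0
        stride = frame_stride(fps)
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % stride == 0:
                h, w = frame.shape[:2]
                new_w, new_h = downscaled_size(w, h)
                if new_w != w:
                    # INTER_AREA is the usual OpenCV choice for shrinking images.
                    frame = cv2.resize(frame, (new_w, new_h), interpolation=cv2.INTER_AREA)
                yield frame
            index += 1
    finally:
        cap.release()
```

Because `sample_frames` is a generator, only the kept, already-downscaled frames are ever handed to later stages.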
--- a/.planning/codebase/INTEGRATIONS.md
+++ b/.planning/codebase/INTEGRATIONS.md
@@ -0,0 +1,10 @@
 # INTEGRATIONS
 ## YouTube
 - **Mechanism**: Integrated by invoking the `yt-dlp` binary through `subprocess`.
 - **Purpose**: Fetch the source guitar play-through videos before frame extraction.
 - **Data Flow**: Downloads an MP4 directly to the local filesystem (usually capped at `res:720` to balance OCR fidelity against memory and processing cost) before cv2 processing.
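A download step along these lines might look like the sketch below. `-S res:720` is yt-dlp's format-sort syntax for preferring the best resolution not above 720p, `--merge-output-format mp4` asks for an MP4 container when video and audio are merged, and `-o` sets the output template. The exact flags and output template the project passes are not recorded here, so this argument list is an assumption.

```python
import subprocess
from pathlib import Path


def build_ytdlp_command(url: str, out_dir: Path, max_height: int = 720) -> list[str]:
    """Assemble a yt-dlp argv list that prefers <= max_height video saved as MP4."""
    return [
        "yt-dlp",
        "-S", f"res:{max_height}",       # format sort: best resolution not above max_height
        "--merge-output-format", "mp4",  # container used when video and audio are merged
        "-o", str(out_dir / "%(id)s.%(ext)s"),  # name the file after the video ID
        url,
    ]


def download_video(url: str, out_dir: Path) -> None:
    """Run yt-dlp synchronously; raises CalledProcessError if the download fails."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ytdlp_command(url, out_dir), check=True)
```

Passing an argument list (rather than a shell string) avoids quoting problems with URLs that contain `&` or `?`.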
 ## EasyOCR Model
 - **Mechanism**: Lazy-loaded EasyOCR reader configured for `en` (English letters and digits).
 - **Lifecycle**: Initialized on the first call to `_get_ocr_reader()`. EasyOCR downloads its model weights on first use if they are not already cached locally (by default under `~/.EasyOCR/`).
--- a/.planning/codebase/STACK.md
+++ b/.planning/codebase/STACK.md
@@ -0,0 +1,17 @@
 # STACK
 ## Core Language & Environment
 - **Python**: Primary language for all scripts (v3.10+ expected).
 ## Principal Libraries
 - **OpenCV (`cv2`)**: Core computer vision operations: image thresholding, morphological operations, template matching, Laplacian variance, and the image processing behind the project's difference-hash helper (`_dhash`).
 - **NumPy (`np`)**: Matrix operations and mask manipulations for image arrays.
 - **EasyOCR (`easyocr`)**: Convolutional/recurrent neural-network OCR used to read measure numbers so that duplicated sections can be detected and dropped.
 ## Utilities
 - **yt-dlp**: Command-line tool called via `subprocess` to download videos from YouTube (720p preferred over 1080p to reduce processing cost).
 - **img2pdf**: Lightweight library that converts the combined tab images into the final A4 PDF.
 - **Pillow (`PIL`)**: Additional image manipulation where OpenCV is less ergonomic.
 ## Execution Requirements
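 - Windows is the primary target: the scripts check `sys.platform == "win32"` to apply console encoding fixes. The repository also ships `bootstrap.sh`, so POSIX shells may be supported as well.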
--- a/.planning/codebase/STRUCTURE.md
+++ b/.planning/codebase/STRUCTURE.md
@@ -0,0 +1,17 @@
 # STRUCTURE
 ## Root Directory (`/`)
 - `youtube_tab_to_pdf.py`: The large core application runner (over 1,000 lines). Handles downloading, cv2 matrix manipulation, OCR, stitching, and PDF rendering.
 - `video_cv_tracker.py`: Extracted temporal/motion tracker logic that detects page flips from median per-column changes instead of simple whole-frame differences.
 - `sim_stitch.py` / `simulate_ocr_pipeline.py`: Legacy or simulation scripts used to mock or prototype OCR and stitching behavior without re-downloading large videos.
 - `test_pipeline.py` / `diag_v2.py`: Pipeline integration testers and diagnostic tools for manual CV metric validation.
 - `bootstrap.bat` / `bootstrap.sh`: Environment setup scripts for CI and for first-time users.
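The idea behind `video_cv_tracker.py` (judging page flips from median per-column change rather than a whole-frame difference) can be sketched as follows. The function names and both thresholds are hypothetical. The reasoning the sketch encodes: a column-wise median ignores changes confined to a few rows, and requiring a large fraction of changed columns ignores narrow vertical movers such as a playhead cursor.

```python
import numpy as np


def column_change_profile(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """Median absolute brightness change per column between two equal-shape grayscale frames."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return np.median(diff, axis=0)


def is_page_flip(
    prev: np.ndarray,
    curr: np.ndarray,
    pixel_threshold: float = 30.0,   # hypothetical: minimum median change for a column to count
    column_fraction: float = 0.6,    # hypothetical: share of columns that must change
) -> bool:
    """Flag a page flip when most columns changed substantially along their full height."""
    profile = column_change_profile(prev, curr)
    changed = np.count_nonzero(profile > pixel_threshold) / profile.size
    return bool(changed >= column_fraction)
```

A full-page redraw changes almost every column, a three-column cursor changes only a few, and a thin horizontal highlight leaves each column's median at zero, so only the first case is reported as a flip.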
 ## `/scripts`
 - Contains focused debugging scripts (`debug/test_full_ocr.py`, `debug/rigorous_validator.py`) used to isolate OCR misfires or to strictly assert measure numbers.
 ## `/docs`
 - `devlog/`: Contains markdown logs of development insights (e.g., `2026-03-29_postmortem_duplicate_row_bug.md`).
 ## `/.planning`
 - Contains project requirements, roadmaps, and this codebase mapping.
--- a/.planning/codebase/TESTING.md
+++ b/.planning/codebase/TESTING.md
@@ -0,0 +1,14 @@
 # TESTING
 ## Test Suites & Scripts
 The application uses diagnostic and simulation scripts rather than traditional `unittest` or `pytest` suites because of its heavy reliance on computer vision and large video downloads.
 - `test_pipeline.py`: The primary integration test; runs end-to-end extraction over known sample URLs to check for missing sections and regressions.
 - `scripts/debug/rigorous_validator.py`: A strict assertion script run locally to confirm that extracted sequences pass OCR checks and that measure numbers stay strictly monotonic.
 - `scripts/debug/test_full_ocr.py`: Isolated test bench for verifying EasyOCR accuracy and tuning bounding box coordinates before baking them into the main pipeline.
 ## Validation Methodologies
 Because judging computer vision output is visually subjective, 'tests' in this repository focus on output metrics:
 - Number of discrete pages extracted versus the expected count.
 - A strictly ascending sequence of OCR-read measure numbers.
 - Absence of specific moving artifacts (e.g., the red/blue playhead cursor).
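The first two metrics lend themselves to a mechanical check, sketched below. The function names are hypothetical, and treating `None` (a frame where OCR read no number) as "skip" rather than "fail" is an assumption; the cursor-artifact check is omitted because it depends on image content.

```python
from typing import Optional, Sequence


def is_strictly_ascending(measures: Sequence[Optional[int]]) -> bool:
    """True if each OCR-read measure number exceeds the previous readable one.

    None entries (OCR found no number) are skipped; that policy is an assumption.
    """
    readable = [m for m in measures if m is not None]
    return all(a < b for a, b in zip(readable, readable[1:]))


def validate_run(
    measures: Sequence[Optional[int]], page_count: int, expected_pages: int
) -> list[str]:
    """Collect human-readable problems; an empty list means the run passed."""
    problems = []
    if page_count != expected_pages:
        problems.append(f"expected {expected_pages} pages, got {page_count}")
    if not is_strictly_ascending(measures):
        problems.append("measure numbers are not strictly ascending")
    return problems
```

A repeated chorus that overwrote later material would typically show up as a drop in the sequence, for instance `[1, 5, 9, 5]`.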