How It Works

Pipeline overview

Source (URL or file)
  |
  v
acquire.py -- Download via yt-dlp or resolve local file
  |
  v
extract.py -- Extract key frames + audio via ffmpeg
  |
  v
analyze.py -- Send frames/video/audio to vision backend
  |
  v
watch.py -- Check cache, orchestrate strategy, save intermediates
  |
  v
analyze.py -- Synthesize final report with codebase context
  |
  v
Structured markdown report

Acquisition

acquire.py handles two cases:

URLs: Downloads via yt-dlp, which supports YouTube, Loom, Vimeo, Twitter/X, and 1000+ other sites. Merges the streams to MP4 and extracts the video title from metadata.
Local files: Resolves the path. Detects media type from extension.

Returns: file path, media type (video or image), title, source URL.
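As a rough illustration of the local-file case, the sketch below detects the media type from the extension and returns the four fields listed above. The extension sets are copied from the Supported inputs table in this document; the names AcquiredMedia, is_url, detect_media_type and acquire_local are hypothetical, and using the file stem as the title is an assumption, not confirmed eyeroll behavior.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Extension sets copied from the "Supported inputs" table in this document.
VIDEO_EXTS = {".mp4", ".webm", ".mov", ".avi", ".mkv", ".flv", ".ts",
              ".m4v", ".wmv", ".3gp", ".mpg", ".mpeg"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff",
              ".heic", ".avif"}


@dataclass
class AcquiredMedia:  # hypothetical name, not eyeroll's actual type
    file_path: Path
    media_type: str            # "video" or "image"
    title: Optional[str]
    source_url: Optional[str]


def is_url(source: str) -> bool:
    """Treat anything with an http(s) scheme as a remote source."""
    return source.startswith(("http://", "https://"))


def detect_media_type(path: Path) -> str:
    """Classify a file as video or image by its (case-insensitive) extension."""
    ext = path.suffix.lower()
    if ext in VIDEO_EXTS:
        return "video"
    if ext in IMAGE_EXTS:
        return "image"
    raise ValueError(f"Unsupported file extension: {ext!r}")


def acquire_local(source: str) -> AcquiredMedia:
    """Resolve a local path; the title fallback (file stem) is an assumption."""
    path = Path(source).expanduser().resolve()
    return AcquiredMedia(path, detect_media_type(path), title=path.stem, source_url=None)
```

The URL case would go through yt-dlp instead and fill in source_url and the title from the downloaded metadata.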

Frame extraction

extract.py extracts key frames from videos using ffmpeg. The strategy:

Scene-change detection -- by default, eyeroll uses pixel-diff scene detection (--scene-threshold, default 30.0) to extract frames at visual transitions rather than fixed intervals. Set to 0 for fixed-interval extraction (1 frame every 2 seconds).
Deduplicate -- compare JPEG file sizes between consecutive frames. If the sizes differ by less than 5KB, the frames are treated as near-duplicates and the duplicate is dropped. This cheap heuristic removes static periods without needing OpenCV.
Enhance contrast -- apply eq=contrast=1.3:brightness=0.05 via ffmpeg. This helps vision models read text on screen, especially with local models.
Cap at max_frames -- if more than max_frames remain after dedup, evenly sample down.

A typical 30-second to 2-minute video produces 8-15 meaningful frames.
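The deduplication and capping steps can be sketched in plain Python, with no ffmpeg involved. This is a minimal sketch under stated assumptions: the function names are hypothetical, and comparing each frame against the last kept frame (rather than the immediately preceding one) is one plausible reading of the text, not confirmed behavior.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

SIZE_THRESHOLD_BYTES = 5 * 1024  # "under 5KB" from the text above


def dedupe_by_size(sizes: Sequence[int], threshold: int = SIZE_THRESHOLD_BYTES) -> List[int]:
    """Return the indices of frames to keep, given each frame's JPEG size in bytes.

    A frame is dropped when its size is within `threshold` bytes of the last
    kept frame (assumption: comparison is against the last kept frame).
    """
    kept: List[int] = []
    for i, size in enumerate(sizes):
        if not kept or abs(size - sizes[kept[-1]]) >= threshold:
            kept.append(i)
    return kept


def sample_evenly(items: Sequence[T], max_frames: int) -> List[T]:
    """Cap a frame list at max_frames by picking evenly spaced entries."""
    if max_frames <= 0:
        return []
    if len(items) <= max_frames:
        return list(items)
    if max_frames == 1:
        return [items[0]]
    # Spread picks across the full range so first and last frames survive.
    step = (len(items) - 1) / (max_frames - 1)
    return [items[round(i * step)] for i in range(max_frames)]
```

For example, sizes of 100000, 101000, 120000, 121000 and 90000 bytes keep frames 0, 2 and 4, because frames 1 and 3 differ from their kept predecessor by only about 1KB.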

Audio extraction

If the backend supports audio and the video has an audio track (detected via ffprobe), the audio is extracted as MP3 using ffmpeg. Silent or near-empty audio files are discarded. Use --min-audio-confidence (default 0.4) to filter low-confidence Whisper segments.
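The audio step might look roughly like the sketch below. The ffprobe and ffmpeg options used are standard ones, but the exact commands eyeroll runs are not stated in this document, so treat the argument lists as assumptions. The helper names and the "confidence" key on each segment are hypothetical too; how a confidence score is derived from Whisper output is not specified here.

```python
from typing import Dict, List


def ffprobe_audio_cmd(video_path: str) -> List[str]:
    """Build an ffprobe command that prints one line per audio stream.

    Empty output means the file has no audio track.
    """
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a",
        "-show_entries", "stream=codec_type",
        "-of", "csv=p=0",
        video_path,
    ]


def has_audio_track(ffprobe_stdout: str) -> bool:
    """Interpret the output of ffprobe_audio_cmd."""
    return "audio" in ffprobe_stdout.split()


def ffmpeg_mp3_cmd(video_path: str, out_path: str) -> List[str]:
    """Build an ffmpeg command that drops video (-vn) and encodes audio as MP3."""
    return ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "libmp3lame", out_path]


def filter_segments(segments: List[Dict], min_confidence: float = 0.4) -> List[Dict]:
    """Keep transcript segments at or above the confidence floor.

    Assumes each segment dict carries a 'confidence' value in [0, 1]
    (a hypothetical field for this sketch).
    """
    return [s for s in segments if s["confidence"] >= min_confidence]
```

The command lists are meant to be passed to subprocess.run; building them separately keeps the parsing and filtering logic testable without ffmpeg installed.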

Preflight check

Before any analysis, eyeroll runs a preflight check that:

Verifies the backend is reachable (API key valid, server running)
Detects capabilities (video upload, batch frames, audio transcription, max video size)

If the backend is not reachable, eyeroll fails fast with a clear error before wasting time on frame extraction.
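A minimal sketch of the fail-fast idea, assuming each backend exposes some probe function that either returns its capabilities or raises. The Capabilities fields mirror the list above; the class names, the probe callable and the choice of exception types are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Capabilities:  # hypothetical container mirroring the list above
    video_upload: bool
    batch_frames: bool
    audio: bool
    max_video_bytes: int


class BackendUnreachable(RuntimeError):
    """Raised before any extraction work when the backend cannot be used."""


def preflight(name: str, probe: Callable[[], Capabilities]) -> Capabilities:
    """Run the backend probe and convert low-level failures into one clear error."""
    try:
        return probe()
    except (ConnectionError, PermissionError, TimeoutError) as exc:
        raise BackendUnreachable(f"Backend {name!r} is not reachable: {exc}") from exc
```

Calling preflight before acquisition and extraction is what makes the failure cheap: nothing has been downloaded or decoded yet when the error surfaces.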

Analysis strategy

The orchestrator (watch.py) uses preflight capabilities to choose the best strategy:

Direct video upload

Used when:

Backend supports video upload (Gemini or TwelveLabs)
Video is within size limits (2GB for Gemini API key, 20MB for Vertex AI service account, 200MB local files for TwelveLabs)
Video is under 1 hour

Gemini API key users get the File API, which handles resumable uploads up to 2GB. The model sees motion, transitions, and timing.

TwelveLabs uses direct asset upload and Pegasus analysis to produce the final structured report directly. It does not run frame extraction, a separate audio pass, or a second synthesis step.

Multi-frame batch

Used when:

Backend supports batch frame analysis (OpenAI, OpenRouter, Groq, Grok, Cerebras, openai-compat)
Video exceeds direct upload limits

All extracted frames are sent as images in a single API call, with a timestamp per frame. The model sees every frame at once with temporal context, and the run costs one API call instead of one per frame.

Frame-by-frame

Used when:

Backend does not support batch frames (Ollama)
Fallback for any other case

Each extracted frame is analyzed individually with a structured prompt that asks for page/URL, UI state, exact text on screen, error messages, user actions, and what is being demonstrated.
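The three strategies above amount to a short decision function. The sketch below is an illustration under assumptions: the type and function names are hypothetical, and the real orchestrator in watch.py may check additional conditions.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    DIRECT_UPLOAD = "direct_upload"
    MULTI_FRAME_BATCH = "multi_frame_batch"
    FRAME_BY_FRAME = "frame_by_frame"


@dataclass(frozen=True)
class Capabilities:  # hypothetical subset of what preflight reports
    video_upload: bool
    batch_frames: bool
    max_video_bytes: int


MAX_DIRECT_SECONDS = 60 * 60  # "Video is under 1 hour"


def choose_strategy(caps: Capabilities, size_bytes: int, duration_s: float) -> Strategy:
    """Pick the richest strategy the backend and the video allow."""
    fits = size_bytes <= caps.max_video_bytes and duration_s < MAX_DIRECT_SECONDS
    if caps.video_upload and fits:
        return Strategy.DIRECT_UPLOAD
    if caps.batch_frames:
        return Strategy.MULTI_FRAME_BATCH
    return Strategy.FRAME_BY_FRAME  # fallback for every other case
```

With a 2GB upload limit, a 50MB two-minute clip goes to direct upload; a backend that only supports batch frames gets a single multi-frame call; a backend with neither capability falls through to frame-by-frame.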

Parallel analysis

Frame-by-frame analysis uses a pool of workers by default:

API backends: 3 concurrent workers
Ollama: 1 worker (single GPU)

Override with the --parallel flag (short form -p):

eyeroll watch video.mp4 -p 5

Results are sorted back into frame order after completion.
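One way to get parallel calls with stable ordering is a thread pool that tags each result with its frame index. This is a sketch using the standard library, not eyeroll's actual code; analyze_one stands in for whatever function calls the vision backend.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Sequence


def default_workers(backend: str) -> int:
    """Defaults from the text: 1 worker for Ollama, 3 for API backends."""
    return 1 if backend == "ollama" else 3


def analyze_frames(
    frames: Sequence[str],
    analyze_one: Callable[[str], str],
    workers: int,
) -> List[str]:
    """Analyze frames concurrently and return results in original frame order."""
    indexed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the index of the frame it is processing.
        futures = {pool.submit(analyze_one, frame): idx for idx, frame in enumerate(frames)}
        for future in as_completed(futures):
            indexed.append((futures[future], future.result()))
    # Completion order is arbitrary, so sort back into frame order.
    indexed.sort(key=lambda pair: pair[0])
    return [text for _, text in indexed]
```

Threads suit this workload because each worker spends most of its time waiting on a network or local model response rather than computing.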

Caching

eyeroll caches intermediate results, not final reports. This is a deliberate design choice.

What gets cached

Stored in ~/.eyeroll/cache/<key>.json (global). Legacy local .eyeroll/cache/ is checked for backward compatibility.

Frame-by-frame analyses (text per frame)
Direct video analysis text
Audio transcript
Source URL, title, media type, timestamp

TwelveLabs final reports are not cached as reusable intermediates because they include the user/context-specific final synthesis.
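A lookup that prefers the global cache and falls back to the legacy local directory could be sketched as follows. The two directory layouts come from the text above; the function names and the JSON field names in the usage note are hypothetical.

```python
import json
from pathlib import Path
from typing import List, Optional


def cache_paths(key: str, home: Optional[Path] = None, cwd: Optional[Path] = None) -> List[Path]:
    """Return candidate cache files: global first, then the legacy local one."""
    home = home or Path.home()
    cwd = cwd or Path.cwd()
    name = f"{key}.json"
    return [home / ".eyeroll" / "cache" / name, cwd / ".eyeroll" / "cache" / name]


def load_intermediates(key: str, home: Optional[Path] = None, cwd: Optional[Path] = None) -> Optional[dict]:
    """Load cached intermediates, or return None on a cache miss."""
    for path in cache_paths(key, home, cwd):
        if path.is_file():
            return json.loads(path.read_text(encoding="utf-8"))
    return None


def save_intermediates(key: str, data: dict, home: Optional[Path] = None) -> Path:
    """Always write to the global cache; the legacy directory is read-only here."""
    path = cache_paths(key, home)[0]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return path
```

A cached entry might hold keys such as frame_analyses, transcript, source_url and title, but never the final report.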

What does NOT get cached

The final synthesized report
Context text or codebase context

Why intermediates only

The expensive part is frame analysis (multiple vision API calls). The synthesis step is a single text generation call that is cheap and fast. By caching only intermediates:

You can re-run with different --context without re-analyzing frames
Codebase context changes are reflected immediately
No stale reports -- synthesis always runs fresh

Cache key

The cache key is a SHA-256 hash of:

File content hash (for local files) or URL (for remote sources)
Backend name
Model name

Same file + same backend + same model = cache hit.
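The key derivation can be illustrated with hashlib. The three inputs come from the list above; how eyeroll actually joins them before hashing is not stated, so the newline separator and the function names here are assumptions.

```python
import hashlib


def file_content_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's bytes, read in chunks so large videos fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(source_id: str, backend: str, model: str) -> str:
    """Hash the source identity together with backend and model names.

    source_id is the content hash for local files or the URL for remote
    sources. The newline separator is an assumption of this sketch.
    """
    payload = "\n".join([source_id, backend, model]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Hashing file content rather than the path means a renamed or moved file still hits the cache, while changing the backend or model always produces a new key.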

Auto-discovery of codebase context

Before synthesis, eyeroll automatically discovers project context from well-known files (CLAUDE.md, AGENTS.md, CURSOR.md, .eyeroll/context.md, etc.). This means you get grounded file paths in reports without any setup. Disable with --no-context.
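Discovery of this kind is essentially a scan for known filenames. The sketch below covers only the four files named above (the text's "etc." implies eyeroll checks more); the function name, the per-file character cap and the output format are hypothetical.

```python
from pathlib import Path

# Only the files named in the text; the real list is longer.
WELL_KNOWN_FILES = ("CLAUDE.md", "AGENTS.md", "CURSOR.md", ".eyeroll/context.md")


def discover_context(root: str, enabled: bool = True, max_chars: int = 20_000) -> str:
    """Concatenate any well-known context files found under root.

    enabled=False models the --no-context flag. max_chars is a hypothetical
    per-file cap to keep the synthesis prompt bounded.
    """
    if not enabled:
        return ""
    parts = []
    for rel in WELL_KNOWN_FILES:
        path = Path(root) / rel
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")[:max_chars]
            parts.append(f"# {rel}\n{text}")
    return "\n\n".join(parts)
```

An empty string means no context was found, in which case the synthesis step has no real file paths to ground its claims in.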

Cost estimates

After each analysis, eyeroll prints a cost estimate to stderr showing tokens used and approximate USD cost. Suppress with --no-cost. Ollama runs are always free.
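The arithmetic behind such an estimate is simple: token counts multiplied by per-token rates. The sketch below shows the shape of the calculation only; the function names are hypothetical, and the rates are parameters because this document does not list eyeroll's pricing table.

```python
import sys
from typing import TextIO, Tuple


def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Approximate USD cost from token counts and per-million-token rates."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def report_cost(backend: str, input_tokens: int, output_tokens: int,
                rates: Tuple[float, float], show: bool = True,
                stream: TextIO = sys.stderr) -> float:
    """Print the estimate to stderr unless suppressed (show=False models --no-cost)."""
    # Local Ollama runs are always free, regardless of the rates passed in.
    cost = 0.0 if backend == "ollama" else estimate_cost(input_tokens, output_tokens, *rates)
    if show:
        print(f"tokens: {input_tokens} in / {output_tokens} out, ~${cost:.4f}", file=stream)
    return cost
```

Writing to stderr keeps the estimate out of the report itself when stdout is piped to a file or another tool.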

Synthesis

The synthesis step combines all signals into a structured report. It receives:

Frame analyses or direct video analysis
Audio transcript
User-provided context text
Codebase context (auto-discovered or from .eyeroll/context.md)

The prompt first classifies the content type (bug report, tutorial, feature demo, feature request, code review, or general notes) based on visual evidence, then adapts the analysis sections accordingly. For bug reports, evidence is categorized into confidence tiers:

Evidence confidence tiers (bug reports)

Tier	Meaning	Example
Visible in recording	Directly observed on screen	"Error toast reads: TypeError: Cannot read properties of undefined"
Informed by codebase context	References real files from the project	"In `src/checkout/handler.py` (from codebase context), the `process_payment` function..."
Hypothesis	Educated guess, not confirmed	"The user object may not have a Stripe customer ID, which would cause this error"

This tiered approach prevents the coding agent from treating guesses as facts. Without codebase context, all file paths are explicitly labeled as hypotheses.
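The rule in the last sentence can be expressed as a small normalization step. This is an illustrative sketch: in eyeroll the tiers are produced by the synthesis prompt, and the Tier, Evidence and normalize names are hypothetical.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Tier(Enum):
    VISIBLE = "Visible in recording"
    CODEBASE = "Informed by codebase context"
    HYPOTHESIS = "Hypothesis"


@dataclass(frozen=True)
class Evidence:
    text: str
    tier: Tier


def normalize(evidence: Evidence, has_codebase_context: bool) -> Evidence:
    """Downgrade codebase-tier claims to hypotheses when no context was supplied."""
    if not has_codebase_context and evidence.tier is Tier.CODEBASE:
        return replace(evidence, tier=Tier.HYPOTHESIS)
    return evidence
```

Directly observed evidence is never downgraded; only claims about project files lose their standing when there is no codebase context to back them.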

Content-adaptive suggestions

The report's suggested next steps adapt to the content type:

Bug report → investigate and fix, raise a PR
Tutorial → create a reusable skill or automation
Feature demo → document, create notes
Feature request → spec it out, create tasks

Supported inputs

Type	Formats
Video	.mp4, .webm, .mov, .avi, .mkv, .flv, .ts, .m4v, .wmv, .3gp, .mpg, .mpeg
Image	.png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .heic, .avif
URL	YouTube, Loom, Vimeo, Twitter/X, Reddit, and 1000+ sites via yt-dlp