Let Claude — or any LLM — actually watch a video.
Most AI tools don't really see a video. Paste a YouTube link into ChatGPT and it reads the transcript, not the picture. Claude won't take a video file at all. Even Gemini, which can read video natively, has to send it up to Google and samples frames at a fixed interval (1 fps by default), so fast cuts slip past.
claude-real-video does it differently, and locally: point it at a URL or a
file, and it pulls the frames that actually matter (every scene change, not a
fixed quota), throws away the near-duplicates, transcribes the audio, and hands
you a clean folder any LLM can read — on your own machine, nothing uploaded.
crv "https://www.youtube.com/watch?v=..."
# → crv-out/frames/*.jpg + crv-out/transcript.txt + crv-out/MANIFEST.txtThen drop the frames + MANIFEST.txt into Claude / ChatGPT / Gemini and ask away.
Most "let an LLM watch a video" scripts (and Gemini's own pipeline) grab frames
at a fixed interval — e.g. one per second. That over-samples a static
screencast and under-samples a fast-cut reel. claude-real-video is smarter:
| fixed-interval sampling | claude-real-video | |
|---|---|---|
| Frame selection | every N seconds | scene-change detection + density floor |
| Repeated shots (A-B-A cuts) | sent again every time | sliding-window dedup sends each shot once |
| Static slide (10 min) | ~600 near-identical frames | collapses to 1 (dedup) |
| Fast-cut reel | misses frames between samples | catches each visual change |
| Audio | often ignored | Whisper transcript w/ language detect |
| Where the video goes | often uploaded to a cloud | stays on your machine |
| Input | usually local file only | URL (yt-dlp) or local file |
You feed the model fewer, more meaningful frames — cheaper context, better understanding.
pip install claude-real-video # core (frames + dedup)
pip install "claude-real-video[whisper]" # + audio transcriptionffmpeg / ffprobe are used for frame extraction and audio, and aren't
pip-installable. Install them once:
| OS | command |
|---|---|
| macOS | brew install ffmpeg |
| Linux | sudo apt install ffmpeg (or your distro's package manager) |
| Windows | winget install Gyan.FFmpeg — or choco install ffmpeg — or download a build and add its bin\ folder to your PATH |
Verify it's on your PATH:
Transcription uses the whisper CLI (installed by the [whisper] extra, or
pip install openai-whisper). Whisper also relies on ffmpeg.
Works on macOS, Windows, and Linux — Python 3.10+.
# A YouTube / Instagram / TikTok / ... link
crv "https://www.instagram.com/reel/XXXX/"
# A local file, English transcript, output to ./out
crv lecture.mp4 -o out --lang en
# Frames only, no transcription
crv clip.mp4 --no-transcribe
# A login-gated video (your own / authorised use): pass a Netscape cookie file
crv "https://..." --cookies cookies.txtpython -m claude_real_video ... works as an alias for crv too.
| flag | default | meaning |
|---|---|---|
-o, --out |
crv-out |
output directory |
--scene |
0.30 |
scene-change sensitivity (lower = more frames) |
--fps-floor |
1.0 |
at least one frame every N seconds |
--max-frames |
150 |
hard cap on total frames |
--lang |
auto |
Whisper language (en, zh, auto, ...) |
--dedup-threshold |
8 |
% of pixels that must change for a frame to count as new; higher = fewer frames |
--dedup-window |
4 |
compare against the last N kept frames — a shot the model already saw doesn't come back after a cutaway (1 = consecutive-only) |
--report |
off | keep dropped frames in ./dropped + write report.html visualising every keep/drop decision |
--no-transcribe |
off | skip audio |
--keep-audio |
off | also save the full soundtrack (audio.m4a) so audio models can hear it |
--cookies |
– | Netscape cookie file for login-gated sources |
from claude_real_video import process
r = process("https://youtu.be/...", "out", lang="en")
print(r.frame_count, r.transcript_path)- Fetch —
yt-dlpfor URLs (optional cookies), or copy a local file. - Extract — one chronological
ffmpeg selectpass grabs every scene change plus a density floor (at least one frame every--fps-floorseconds), so fast cuts and slow screencasts are both covered. - Dedup — real pixel difference (downscaled RGB, not a perceptual hash — hashes
go blind on flat colours and equal-luma hue changes) against a sliding window
of the last
--dedup-windowkept frames, so an A-B-A cutaway doesn't re-send a shot the model has already seen.--reportwritesreport.htmlshowing every keep/drop decision with its diff %, for tuning. - Text — if the video already has subtitles (a sidecar
.srt/.vttnext to a local file, or an embedded subtitle track), those are used as the transcript — faster and more accurate than re-transcribing. Only when there are no subtitles does it fall back to Whisper on the audio (skipped cleanly if there's no audio). - Audio (optional,
--keep-audio) — save the full original soundtrack (audio.m4a: music + speech + effects, copied losslessly when possible). The transcript only has the words; the audio file lets a model that can listen (Gemini, GPT-4o, …) actually hear the music and tone. - Manifest —
MANIFEST.txtsummarises everything for the model.
So the model can see (key frames), read (transcript) and — with --keep-audio —
hear (full soundtrack) the video. The transcript is plain text any model can read;
the tool doesn't burn subtitles into the video — burning is a presentation choice,
not something needed to make a video AI-readable.
- Only download content you have the right to. The
--cookiesoption is for your own, authorised access — don't ship credentials in a repo. - Re-running overwrites the output directory.
MIT