Claude-real-video - any LLM can watch a video

Original link: https://github.com/HUANGCHIHHUNGLeo/claude-real-video

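**claude-real-video** is a tool that lets large language models obtain high-quality, meaningful summaries of videos. Unlike the standard approach of sampling frames at a fixed interval, which tends to miss fast cuts or overflow the context with repeated static images, it processes the video intelligently on the local machine.

**Key features:**

* **Scene-aware extraction:** It identifies real scene changes and uses a density floor to ensure full coverage regardless of the video's pacing.
* **Smart deduplication:** It removes near-identical frames and avoids re-sending shots the model has already seen (for example, A-B-A cuts), which saves context space.
* **Audio integration:** It can generate a transcript with Whisper or extract existing subtitles, and can save the original soundtrack for multimodal models.
* **Privacy and efficiency:** All processing runs locally, with nothing uploaded to the cloud.

Give it a URL or a local file and the tool outputs a tidy folder containing key frames, a transcript, and a manifest. You can then drop those files into ChatGPT, Claude, or Gemini for accurate, context-efficient analysis. The tool requires `ffmpeg` and runs on Windows, macOS, and Linux.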

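`claude-real-video` is a new open-source Python tool designed to improve how large language models (LLMs) handle video content. Responding to the limits of existing approaches (for example, ChatGPT relying on subtitles and Gemini using inefficient fixed-interval sampling), the developer built a smarter preprocessing pipeline.

The tool prepares video for LLMs in the following ways:

* **Scene-aware extraction**: identifies frames from actual scene changes rather than fixed time intervals.
* **Smart deduplication**: uses a sliding-window pixel-difference algorithm to ignore redundant shots (for example, repeated interview cutaways).
* **Full transcription**: extracts embedded subtitles or processes the audio with Whisper.
* **Efficient processing**: reduces a 10-minute video from hundreds of frames to just 5-15 key frames, saving more than 90% of token usage while improving comprehension.

Users can generate a "MANIFEST.txt" file for use with any LLM, or use the `--report` option to visually review which frames were kept or dropped. The tool is MIT-licensed, requires `ffmpeg`, and can be installed with `pip`.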

Original text

Let Claude — or any LLM — actually watch a video.

Most AI tools don't really see a video. Paste a YouTube link into ChatGPT and it reads the transcript, not the picture. Claude won't take a video file at all. Even Gemini, which can read video natively, has to send it up to Google and samples frames at a fixed interval (1 fps by default), so fast cuts slip past.

claude-real-video does it differently, and locally: point it at a URL or a file, and it pulls the frames that actually matter (every scene change, not a fixed quota), throws away the near-duplicates, transcribes the audio, and hands you a clean folder any LLM can read — on your own machine, nothing uploaded.

crv "https://www.youtube.com/watch?v=..."
# → crv-out/frames/*.jpg  +  crv-out/transcript.txt  +  crv-out/MANIFEST.txt

Then drop the frames + MANIFEST.txt into Claude / ChatGPT / Gemini and ask away.


Why not just sample frames?

Most "let an LLM watch a video" scripts (and Gemini's own pipeline) grab frames at a fixed interval — e.g. one per second. That over-samples a static screencast and under-samples a fast-cut reel. claude-real-video is smarter:

| | fixed-interval sampling | claude-real-video |
|---|---|---|
| Frame selection | every N seconds | scene-change detection + density floor |
| Repeated shots (A-B-A cuts) | sent again every time | sliding-window dedup sends each shot once |
| Static slide (10 min) | ~600 near-identical frames | collapses to 1 (dedup) |
| Fast-cut reel | misses frames between samples | catches each visual change |
| Audio | often ignored | Whisper transcript w/ language detect |
| Where the video goes | often uploaded to a cloud | stays on your machine |
| Input | usually local file only | URL (yt-dlp) or local file |

You feed the model fewer, more meaningful frames — cheaper context, better understanding.
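The "~600 near-identical frames" figure in the table follows from simple arithmetic. Here is a minimal sketch of that count, assuming one sample per interval starting at t = 0; the function name `fixed_interval_frame_count` is illustrative and not part of claude-real-video.

```python
import math


def fixed_interval_frame_count(duration_s: float, interval_s: float = 1.0) -> int:
    """Number of frames grabbed when sampling one frame every `interval_s` seconds."""
    if duration_s < 0 or interval_s <= 0:
        raise ValueError("duration must be >= 0 and interval must be > 0")
    return math.floor(duration_s / interval_s)


# A 10-minute static slide sampled once per second:
print(fixed_interval_frame_count(10 * 60))  # 600
```

Every one of those frames costs context even though, for a static slide, they all show the same picture.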


pip install claude-real-video              # core (frames + dedup)
pip install "claude-real-video[whisper]"   # + audio transcription

System requirement: ffmpeg

ffmpeg / ffprobe are used for frame extraction and audio, and aren't pip-installable. Install them once:

| OS | command |
|---|---|
| macOS | brew install ffmpeg |
| Linux | sudo apt install ffmpeg (or your distro's package manager) |
| Windows | winget install Gyan.FFmpeg, or choco install ffmpeg, or download a build and add its bin\ folder to your PATH |

Verify it's on your PATH, for example by running `ffmpeg -version` and `ffprobe -version`.

Transcription uses the whisper CLI (installed by the [whisper] extra, or pip install openai-whisper). Whisper also relies on ffmpeg.
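If you drive the tool from a script, the same PATH check can be done up front in Python. This is a small sketch using only the standard library; the helper name `missing_tools` is hypothetical, and the default tool list simply mirrors the executables named above.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe", "whisper")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("ffmpeg, ffprobe and whisper are all on PATH")
```

Drop `"whisper"` from the list if you only plan to extract frames.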

Works on macOS, Windows, and Linux — Python 3.10+.


# A YouTube / Instagram / TikTok / ... link
crv "https://www.instagram.com/reel/XXXX/"

# A local file, English transcript, output to ./out
crv lecture.mp4 -o out --lang en

# Frames only, no transcription
crv clip.mp4 --no-transcribe

# A login-gated video (your own / authorised use): pass a Netscape cookie file
crv "https://..." --cookies cookies.txt

python -m claude_real_video ... works as an alias for crv too.

| flag | default | meaning |
|---|---|---|
| -o, --out | crv-out | output directory |
| --scene | 0.30 | scene-change sensitivity (lower = more frames) |
| --fps-floor | 1.0 | at least one frame every N seconds |
| --max-frames | 150 | hard cap on total frames |
| --lang | auto | Whisper language (en, zh, auto, ...) |
| --dedup-threshold | 8 | % of pixels that must change for a frame to count as new; higher = fewer frames |
| --dedup-window | 4 | compare against the last N kept frames, so a shot the model already saw doesn't come back after a cutaway (1 = consecutive-only) |
| --report | off | keep dropped frames in ./dropped + write report.html visualising every keep/drop decision |
| --no-transcribe | off | skip audio |
| --keep-audio | off | also save the full soundtrack (audio.m4a) so audio models can hear it |
| --cookies | | Netscape cookie file for login-gated sources |
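To build intuition for how --scene, --fps-floor and --max-frames might interact, here is a rough sketch in plain Python. It is not the tool's implementation: it assumes each candidate frame already comes with a scene-change score between 0 and 1, and it assumes the cap simply stops selection once it is reached. The function `select_frames` and its parameter names are hypothetical.

```python
def select_frames(scored, scene=0.30, floor_s=1.0, max_frames=150):
    """Pick timestamps to keep from (timestamp_s, scene_score) pairs in time order.

    A frame is kept if its score exceeds `scene` (a scene change), or if at
    least `floor_s` seconds have passed since the last kept frame (the floor).
    """
    kept = []
    last_kept = None
    for t, score in scored:
        if len(kept) >= max_frames:
            break  # assumed behaviour: stop once the hard cap is hit
        is_cut = score > scene
        floor_due = last_kept is None or (t - last_kept) >= floor_s
        if is_cut or floor_due:
            kept.append(t)
            last_kept = t
    return kept


# Three seconds of a static screencast, scored at 4 candidates per second.
static = [(i * 0.25, 0.0) for i in range(12)]
print(select_frames(static))  # [0.0, 1.0, 2.0]
```

With no scene changes, only the floor fires; lowering `scene` or `floor_s` yields more frames, which matches the direction described in the table.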

from claude_real_video import process

r = process("https://youtu.be/...", "out", lang="en")
print(r.frame_count, r.transcript_path)
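After a run, the output folder can be gathered into one structure before handing it to a model. The sketch below relies only on the folder layout shown earlier (frames/*.jpg, transcript.txt, MANIFEST.txt); the helper `collect_outputs` is hypothetical, and sorting by filename is an assumption about how frames are named.

```python
from pathlib import Path


def collect_outputs(out_dir):
    """Gather frames, manifest and transcript from a crv output folder."""
    out = Path(out_dir)
    manifest = out / "MANIFEST.txt"
    transcript = out / "transcript.txt"
    return {
        # Assumption: frame filenames sort in chronological order.
        "frames": sorted((out / "frames").glob("*.jpg")),
        "manifest": manifest.read_text(encoding="utf-8") if manifest.is_file() else None,
        # The transcript may be absent, e.g. when transcription was skipped.
        "transcript": transcript.read_text(encoding="utf-8") if transcript.is_file() else None,
    }
```

Calling `collect_outputs("out")` after the `process(...)` example above would return the frame paths plus the two text files, ready to attach to a prompt.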

  1. Fetch: yt-dlp for URLs (optional cookies), or copy a local file.
  2. Extract: one chronological ffmpeg select pass grabs every scene change plus a density floor (at least one frame every --fps-floor seconds), so fast cuts and slow screencasts are both covered.
  3. Dedup: real pixel difference (downscaled RGB, not a perceptual hash, since hashes go blind on flat colours and equal-luma hue changes) against a sliding window of the last --dedup-window kept frames, so an A-B-A cutaway doesn't re-send a shot the model has already seen. --report writes report.html showing every keep/drop decision with its diff %, for tuning.
  4. Text: if the video already has subtitles (a sidecar .srt/.vtt next to a local file, or an embedded subtitle track), those are used as the transcript, which is faster and more accurate than re-transcribing. Only when there are no subtitles does it fall back to Whisper on the audio (skipped cleanly if there's no audio).
  5. Audio (optional, --keep-audio): save the full original soundtrack (audio.m4a: music + speech + effects, copied losslessly when possible). The transcript only has the words; the audio file lets a model that can listen (Gemini, GPT-4o, …) actually hear the music and tone.
  6. Manifest: MANIFEST.txt summarises everything for the model.
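The dedup step is the least obvious of the six, so here is a simplified sketch of a sliding-window pixel-difference check. It is an illustration under stated assumptions, not the tool's code: frames are flat lists of (R, G, B) tuples already downscaled to the same size, a pixel counts as changed when any channel moves by more than a tolerance `tol` (a value chosen here for illustration), and the names `diff_percent` and `dedup` are hypothetical.

```python
from collections import deque


def diff_percent(a, b, tol=16):
    """Percentage of pixels whose R, G or B value differs by more than `tol`."""
    if not a or len(a) != len(b):
        raise ValueError("frames must be non-empty and the same size")
    changed = sum(
        1 for p, q in zip(a, b)
        if any(abs(x - y) > tol for x, y in zip(p, q))
    )
    return 100.0 * changed / len(a)


def dedup(frames, threshold=8.0, window=4):
    """Indices of frames that differ from every one of the last `window` kept frames."""
    recent = deque(maxlen=window)  # sliding window of kept frames
    kept = []
    for i, frame in enumerate(frames):
        if all(diff_percent(frame, seen) >= threshold for seen in recent):
            kept.append(i)
            recent.append(frame)
    return kept


shot_a = [(255, 0, 0)] * 100  # a flat red frame
shot_b = [(0, 0, 255)] * 100  # a flat blue frame
print(dedup([shot_a, shot_a, shot_b, shot_a]))            # [0, 2]
print(dedup([shot_a, shot_a, shot_b, shot_a], window=1))  # [0, 2, 3]
```

With a window of 4 the return to shot A after the cutaway is recognised and dropped; with a window of 1 only consecutive frames are compared, so shot A is sent a second time.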

So the model can see (key frames), read (transcript) and, with --keep-audio, hear (full soundtrack) the video. The transcript is plain text any model can read; the tool doesn't burn subtitles into the video, since burning is a presentation choice, not something needed to make a video AI-readable.
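The subtitle-first rule from the Text step can be sketched for the sidecar case. This assumes the sidecar shares the video's base name (lecture.mp4 next to lecture.srt), which the description above does not spell out; the helper `find_sidecar_subtitles` is hypothetical, and embedded-track detection and the Whisper fallback are left out.

```python
from pathlib import Path


def find_sidecar_subtitles(video_path, extensions=(".srt", ".vtt")):
    """Return the first subtitle file sharing the video's base name, or None."""
    video = Path(video_path)
    for ext in extensions:
        candidate = video.with_suffix(ext)  # e.g. lecture.mp4 -> lecture.srt
        if candidate.is_file():
            return candidate
    return None
```

A caller would use the returned file as the transcript and only consider transcribing the audio when the result is `None`.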


  • Only download content you have the right to. The --cookies option is for your own, authorised access — don't ship credentials in a repo.
  • Re-running overwrites the output directory.

License: MIT
