Netflix VOID Model: Video Object and Interaction Deletion
VOID: Video Object and Interaction Deletion

Original link: https://github.com/Netflix/void-model

## VOID: Realistic Video Object and Interaction Removal

VOID is a new AI system that removes objects from videos and realistically simulates the consequences of the removal; for example, a guitar falls when the guitarist holding it is removed. Built on CogVideoX, VOID uses a two-pass transformer pipeline for high-quality video inpainting, conditioned on interaction-aware masks.

**How it works:** VOID first generates masks that identify the object to remove and the regions affected by its presence (e.g., objects that should fall). It then uses these masks across two inference passes: Pass 1 performs base inpainting, while Pass 2 uses warped noise to refine temporal consistency.

**Key requirements:** VOID demands substantial compute; it needs a GPU with 40GB+ VRAM (e.g., an A100). It also requires installing dependencies such as SAM2 and setting a Gemini API key for mask generation.

**Data and training:** Pretrained models are provided, but VOID's training data is not released directly due to licensing constraints. Instead, code is provided for generating training data (counterfactual videos with and without the object) using HUMOTO and Kubric.

**Availability:** The models and code are available on HuggingFace and GitHub, enabling community contributions and further development. ([https://arxiv.org/abs/2604.02296](https://arxiv.org/abs/2604.02296))


Original text

VOID removes objects from videos along with all interactions they induce on the scene — not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed. It is built on top of CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning.

Example: If a person holding a guitar is removed, VOID also removes the person's effect on the guitar — causing it to fall naturally.

teaser-with-name.mp4

VOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal consistency.

| Model | Description | HuggingFace |
|---|---|---|
| VOID Pass 1 | Base inpainting model | Download |
| VOID Pass 2 | Warped-noise refinement model | Download |

Place checkpoints anywhere and pass the path via --config.video_model.transformer_path (Pass 1) or --model_checkpoint (Pass 2).


The fastest way to try VOID is the included notebook — it handles setup, downloads the models, runs inference on a sample video, and displays the result:

Open in Colab

Note: Requires a GPU with 40GB+ VRAM (e.g., A100).

For more control over the pipeline (custom videos, Pass 2 refinement, mask generation), see the full setup and instructions below.


pip install -r requirements.txt

Stage 1 of the mask pipeline uses Gemini via the Google AI API. Set your API key:

export GEMINI_API_KEY=your_key_here

Also install SAM2 separately (required for mask generation):

git clone https://github.com/facebookresearch/sam2.git
cd sam2 && pip install -e .

Download the pretrained base inpainting model from HuggingFace:

hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir ./CogVideoX-Fun-V1.5-5b-InP

The inference and training scripts expect it at ./CogVideoX-Fun-V1.5-5b-InP relative to the repo root by default.

If ffmpeg is not available on your system, you can use the binary bundled with imageio-ffmpeg:

ln -sf $(python -c "import imageio_ffmpeg; print(imageio_ffmpeg.get_ffmpeg_exe())") ~/.local/bin/ffmpeg

📁 Expected directory structure

After cloning the repo and downloading all assets, your directory should look like this:

VOID/
├── config/
├── datasets/
│   └── void_train_data.json
├── inference/
├── sample/                         # included sample sequences for inference
├── scripts/
├── videox_fun/
├── VLM-MASK-REASONER/
├── README.md
├── requirements.txt
│
├── CogVideoX-Fun-V1.5-5b-InP/     # hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP
├── void_pass1.safetensors          # download from huggingface.co/void-model (see Models above)
├── void_pass2.safetensors          # download from huggingface.co/void-model (see Models above)
├── training_data/                  # generated via data_generation/ pipeline (see Training section)
└── data_generation/                # data generation code (HUMOTO + Kubric pipelines)

Each video sequence lives in its own folder under a root data directory:

data_rootdir/
└── my-video/
    ├── input_video.mp4      # source video
    ├── quadmask_0.mp4       # quadmask (4-value mask video, see below)
    └── prompt.json          # {"bg": "background description"}

The prompt.json contains a single "bg" key describing the scene after the object has been removed — i.e. what you want the background to look like. Do not describe the object being removed; describe what remains.

{ "bg": "A table with a cup on it." }         // ✅ describes the clean background
{ "bg": "A person being removed from scene." } // ❌ don't describe the removal

A few examples from the included samples:

| Sequence | Removed object | bg prompt |
|---|---|---|
| lime | the glass | "A lime falls on the table." |
| moving_ball | the rubber duckie | "A ball rolls off the table." |
| pillow | the kettlebell being placed on the pillow | "Two pillows are on the table." |

The quadmask encodes four semantic regions per pixel:

| Value | Meaning |
|---|---|
| 0 | Primary object to remove |
| 63 | Overlap of primary + affected regions |
| 127 | Affected region (interactions: falling objects, displaced items, etc.) |
| 255 | Background (keep) |
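For illustration, here is a minimal sketch (plain NumPy, not part of the repo) of how a single quadmask frame could be assembled from a binary primary-object mask and a binary affected-region mask, following the value table above. The actual quadmask_0.mp4 is produced by the mask pipeline below; this only shows the encoding.

```python
import numpy as np

def make_quadmask_frame(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Encode two boolean masks of shape (H, W) into one quadmask frame.

    Values follow the table above: 0 = primary object to remove,
    63 = overlap of primary + affected, 127 = affected region,
    255 = background to keep.
    """
    quad = np.full(primary.shape, 255, dtype=np.uint8)  # background
    quad[affected] = 127                                 # interaction-affected region
    quad[primary] = 0                                    # primary object
    quad[primary & affected] = 63                        # overlap
    return quad
```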

🎭 Stage 1 — Generate Masks

The VLM-MASK-REASONER/ pipeline generates quadmasks from raw videos using SAM2 segmentation and a VLM (Gemini) for reasoning about interaction-affected regions.

🖱️ Step 0 — Select points (GUI)

python VLM-MASK-REASONER/point_selector_gui.py

Load a JSON config listing your videos and instructions, then click on the objects to remove. Saves a *_points.json with the selected points.

Config format:

{
  "videos": [
    {
      "video_path": "path/to/video.mp4",
      "output_dir": "path/to/output/folder",
      "instruction": "remove the person"
    }
  ]
}

⚡ Steps 1–4 — Run the full pipeline

After saving the points config, run all remaining stages automatically:

bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json

Optional flags:

bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json \
    --sam2-checkpoint path/to/sam2_hiera_large.pt \
    --device cuda

This runs the following stages in order:

| Stage | Script | Output |
|---|---|---|
| 1 — SAM2 segmentation | stage1_sam2_segmentation.py | black_mask.mp4 |
| 2 — VLM analysis | stage2_vlm_analysis.py | vlm_analysis.json |
| 3 — Grey mask generation | stage3a_generate_grey_masks_v2.py | grey_mask.mp4 |
| 4 — Combine into quadmask | stage4_combine_masks.py | quadmask_0.mp4 |

The final quadmask_0.mp4 in each video's output_dir is ready to use for inference.


🎬 Stage 2 — Inference

VOID inference runs in two passes. Pass 1 is sufficient for most videos; Pass 2 adds a warped-noise refinement step for better temporal consistency on longer clips.

✨ Pass 1 — Base inference

python inference/cogvideox_fun/predict_v2v.py \
    --config config/quadmask_cogvideox.py \
    --config.data.data_rootdir="path/to/data_rootdir" \
    --config.experiment.run_seqs="my-video" \
    --config.experiment.save_path="path/to/output" \
    --config.video_model.model_name="path/to/CogVideoX-Fun-V1.5-5b-InP" \
    --config.video_model.transformer_path="path/to/void_pass1.safetensors"

To run multiple sequences at once, pass a comma-separated list:

--config.experiment.run_seqs="video1,video2,video3"

Key config options:

| Flag | Default | Description |
|---|---|---|
| --config.data.sample_size | 384x672 | Output resolution (HxW) |
| --config.data.max_video_length | 197 | Max frames to process |
| --config.video_model.temporal_window_size | 85 | Temporal window for multidiffusion |
| --config.video_model.num_inference_steps | 50 | Denoising steps |
| --config.video_model.guidance_scale | 1.0 | Classifier-free guidance scale |
| --config.system.gpu_memory_mode | model_cpu_offload_and_qfloat8 | Memory mode (model_full_load, model_cpu_offload, sequential_cpu_offload) |

The output is saved as <save_path>/<sequence_name>.mp4, along with a *_tuple.mp4 side-by-side comparison.

🔁 Pass 2 — Warped noise refinement

Uses optical flow-warped latents from the Pass 1 output to initialize a second inference pass, improving temporal consistency.
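Conceptually, warping noise with optical flow means resampling the Pass 1 latent (or noise) at flow-displaced coordinates so that the initialization moves with the scene. Below is a minimal, illustrative sketch of that idea using PyTorch's grid_sample; it is not the repo's implementation (VOID uses the Go-with-the-Flow warping utilities, which differ in detail).

```python
import torch
import torch.nn.functional as F

def warp_with_flow(latent: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a latent/noise tensor (1, C, H, W) by a backward optical flow (1, 2, H, W).

    flow[:, 0] is the horizontal displacement in pixels, flow[:, 1] the vertical one.
    """
    _, _, h, w = latent.shape
    # Base sampling grid in pixel coordinates
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0)  # (1, H, W, 2)
    grid = grid + flow.permute(0, 2, 3, 1)                     # displace by the flow
    # Normalize coordinates to [-1, 1] as expected by grid_sample
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(latent, grid, mode="bilinear", align_corners=True)
```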

Single video:

python inference/cogvideox_fun/inference_with_pass1_warped_noise.py \
    --video_name my-video \
    --data_rootdir path/to/data_rootdir \
    --pass1_dir path/to/pass1_outputs \
    --output_dir path/to/pass2_outputs \
    --model_checkpoint path/to/void_pass2.safetensors \
    --model_name path/to/CogVideoX-Fun-V1.5-5b-InP

Batch: Edit the video list and paths in inference/pass_2_refine.sh, then run:

bash inference/pass_2_refine.sh

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --pass1_dir | | Directory containing Pass 1 output videos |
| --output_dir | ./inference_with_warped_noise | Where to save Pass 2 results |
| --warped_noise_cache_dir | ./pass1_warped_noise_cache | Cache for precomputed warped latents |
| --temporal_window_size | 85 | Temporal window size |
| --height / --width | 384 / 672 | Output resolution |
| --guidance_scale | 6.0 | CFG scale |
| --num_inference_steps | 50 | Denoising steps |
| --use_quadmask | True | Use quadmask conditioning |

✏️ Stage 3 — Manual Mask Refinement (Optional)

If the auto-generated quadmask does not accurately capture the object or its interaction region, use the included GUI editor to refine it before running inference.

python VLM-MASK-REASONER/edit_quadmask.py

Open a sequence folder containing input_video.mp4 (or rgb_full.mp4) and quadmask_0.mp4. The editor shows the original video and the editable mask side by side.

Tools:

  • Grid Toggle — click a grid cell to toggle the interaction region (127 ↔ 255)
  • Grid Black Toggle — click a grid cell to toggle the primary object region (0 ↔ 255)
  • Brush (Add / Erase) — freehand paint or erase mask regions at pixel level
  • Copy from Previous Frame — propagate the black or grey mask from the previous frame

Keyboard shortcuts: ← / → navigate frames, Ctrl+Z / Ctrl+Y undo/redo.

Save overwrites quadmask_0.mp4 in place. Rerun inference from Pass 1 after saving.


🏋️ Training

Due to licensing constraints on the underlying datasets, we release the data generation code instead of the pre-built training data. The code produces paired counterfactual videos (with/without object, plus quad-masks) from two sources:

Source 1: HUMOTO (Human-Object Interaction)

Generates counterfactual videos from the HUMOTO motion capture dataset using Blender. A human (Remy/Sophie character) interacts with objects; removing the human causes objects to fall via physics simulation.

Prerequisites:

  1. HUMOTO dataset — Request access from the authors at adobe-research/humoto. Once approved, download and place under data_generation/humoto_release/
  2. Blender — Install Blender (tested with 3.x and 4.x). Also install opencv-python-headless in Blender's Python (see data_generation/README.md)
  3. Remy & Sophie characters — Download from Mixamo (free Adobe account). Search for "Remy" and "Sophie", download each as FBX, and place at:
    data_generation/human_model/Remy_mixamo_bone.fbx
    data_generation/human_model/Sophie_mixamo_bone.fbx
    
  4. PBR textures (optional) — Download texture packs from ambientCG or Poly Haven. Without textures, objects render with realistic solid colors as fallback

Expected directory structure after setup:

data_generation/
├── humoto_release/
│   ├── humoto_0805/                    # HUMOTO sequences (.pkl, .fbx, .yaml per sequence)
│   └── humoto_objects_0805/            # Object meshes (.obj, .fbx per object)
├── human_model/
│   ├── Remy_mixamo_bone.fbx            # ← download from Mixamo
│   ├── Sophie_mixamo_bone.fbx          # ← download from Mixamo
│   ├── bone_names.py                   # included
│   └── *.json                          # included (bone structure definitions)
├── textures/                           # ← optional, user-provided PBR textures
├── physics_config.json                 # included (manual per-sequence physics settings)
├── render_paired_videos_blender_quadmask.py   # main renderer
├── convert_split_remy_sophie.sh               # character conversion script
└── ...

Pipeline:

cd data_generation

# 1. Convert HUMOTO sequences to Remy/Sophie characters
bash convert_split_remy_sophie.sh

# 2. Render paired videos (with human, without human, quad-mask)
blender --background --python render_paired_videos_blender_quadmask.py -- \
    -d ./humoto_release/humoto_0805 \
    -o ./output \
    -s <sequence_name> \
    -m ./humoto_release/humoto_objects_0805 \
    --use_characters --enable_physics --add_walls \
    --target_frames 60 --fps 12

A pre-configured physics_config.json is included specifying which objects are static vs. dynamic per sequence. See data_generation/README.md for full details.

Source 2: Kubric (Object-Only Interaction)

Generates counterfactual videos using Kubric with Google Scanned Objects. Objects are launched at a target; removing them alters the target's physics trajectory. No external dataset download required — assets are fetched from Google Cloud Storage.

cd data_generation
pip install kubric pybullet imageio imageio-ffmpeg

python kubric_variable_objects.py --num_pairs 200 --resolution 384

Both pipelines output the same format expected by the training scripts:

training_data/
└── sequence_name/
    ├── rgb_full.mp4       # input video (with object)
    ├── rgb_removed.mp4    # target video (object removed, physics applied)
    ├── mask.mp4           # quad-mask (0/63/127/255)
    └── metadata.json

Point the training scripts at your generated data by updating datasets/void_train_data.json.
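Before editing datasets/void_train_data.json, it can help to sanity-check that each generated sequence folder contains the four files listed above. A small helper script (assuming nothing beyond the directory layout shown):

```python
from pathlib import Path

REQUIRED = ["rgb_full.mp4", "rgb_removed.mp4", "mask.mp4", "metadata.json"]

def check_training_data(root: str = "training_data") -> None:
    """Report sequence folders that are missing any of the expected files."""
    for seq in sorted(Path(root).iterdir()):
        if not seq.is_dir():
            continue
        missing = [name for name in REQUIRED if not (seq / name).exists()]
        status = "OK" if not missing else f"missing {', '.join(missing)}"
        print(f"{seq.name}: {status}")

if __name__ == "__main__":
    check_training_data()
```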


Training proceeds in two stages. Pass 1 is trained first, then Pass 2 fine-tunes from that checkpoint.

Pass 1 — Base inpainting model

Does not require warped noise. Trains the model to remove objects and their interactions from scratch.

bash scripts/cogvideox_fun/train_void.sh

Key arguments:

| Argument | Description |
|---|---|
| --pretrained_model_name_or_path | Path to base CogVideoX inpainting model |
| --transformer_path | Optional starting checkpoint |
| --train_data_meta | Path to dataset metadata JSON |
| --train_mode="void" | Enables void inpainting training mode |
| --use_quadmask | Trains with 4-value quadmask conditioning |
| --use_vae_mask | Encodes mask through VAE |
| --output_dir | Where to save checkpoints |
| --num_train_epochs | Number of epochs |
| --checkpointing_steps | Save a checkpoint every N steps |
| --learning_rate | Default 1e-5 |

Pass 2 — Warped noise refinement model

Continues training from a Pass 1 checkpoint with optical flow-warped latent initialization, improving temporal consistency on longer videos. Requires warped noise to have been precomputed for the training data.

bash scripts/cogvideox_fun/train_void_warped_noise.sh

Set TRANSFORMER_PATH to your Pass 1 checkpoint before running:

TRANSFORMER_PATH=path/to/pass1_checkpoint.safetensors bash scripts/cogvideox_fun/train_void_warped_noise.sh

Additional arguments specific to this stage:

| Argument | Description |
|---|---|
| --use_warped_noise | Enables warped latent initialization during training |
| --warped_noise_degradation | Noise blending factor (default 0.3) |
| --warped_noise_probability | Fraction of steps using warped noise (default 1.0) |
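One plausible reading of the degradation factor (an assumption for illustration, not the repo's exact formulation) is a convex blend between the flow-warped noise and fresh Gaussian noise, applied on the fraction of training steps given by --warped_noise_probability:

```python
import torch

def blend_warped_noise(warped: torch.Tensor, degradation: float = 0.3) -> torch.Tensor:
    """Hypothetical blend: keep (1 - degradation) of the warped noise and mix in a
    'degradation' share of fresh Gaussian noise, rescaled back to roughly unit variance."""
    fresh = torch.randn_like(warped)
    mixed = (1.0 - degradation) * warped + degradation * fresh
    # For independent unit-variance components, this restores unit variance.
    return mixed / ((1.0 - degradation) ** 2 + degradation ** 2) ** 0.5
```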

Training was run on 8× A100 80GB GPUs using DeepSpeed ZeRO stage 2.


We are excited to see the community build on VOID!
Below we showcase selected demos, tools, and extensions.

If you’ve built something using VOID, feel free to submit a PR to add it here.

This implementation builds on code and models from aigc-apps/VideoX-Fun, Gen-Omnimatte, Go-with-the-Flow, Kubric and HUMOTO. We thank the authors for sharing the code and pretrained inpainting models for CogVideoX, Gen-Omnimatte, and the optical flow warping utilities.


Star History Chart

If you find our work useful, please consider citing:

🔗 https://arxiv.org/abs/2604.02296

@misc{motamed2026void,
  title={VOID: Video Object and Interaction Deletion},
  author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
  year={2026},
  eprint={2604.02296},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02296}
}