Mirrors: The Blind Spot of Image and Video Generation Models

Original link: https://medium.com/@aliborji/mirrors-the-blind-spot-of-image-and-video-generation-models-de0f39310578

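Recent AI image and video generation models excel at creating realistic visuals, but they consistently struggle to render mirror reflections accurately. An evaluation of five image generation models (Gemini, Adobe Firefly, Bing, Ideogram, Freepik) and four video generation models, using test prompts that feature people and objects, reveals common reflection errors. Image models often produce distorted, inconsistent, or missing reflections. Gemini misjudges object placement and reflections, especially in cat, chair, and kitchen scenes. Ideogram generates higher-fidelity images but still has problems with hand reflections, inconsistent objects, and poor face quality in group images. Adobe Firefly shows misaligned or missing reflections, as well as objects extending unnaturally out of mirrors. Bing Image Creator produces cartoonish images with misplaced or distorted elements. Even high-quality Freepik images show similar reflection errors. Video models have the same reflection problems and also struggle to mirror motion correctly, which makes the videos unrealistic and flawed. These problems highlight a major challenge in achieving truly realistic and physically plausible image and video synthesis.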

Original text

Recent advances in image generation models have demonstrated remarkable capabilities in creating photorealistic and imaginative visuals. However, a persistent challenge remains: accurately rendering reflections in mirrors. We anecdotally evaluate five image generation models and four video generation models using five prompts featuring both humans and objects. Our findings reveal that AI models frequently struggle with reflections, often generating distorted, inconsistent, or entirely incorrect images. Here is the data.

Image generated by Gemini for the prompt: “An image of two cats playing in front of a mirror”

Generative image models, particularly those based on deep learning, have achieved impressive results in synthesizing realistic images of various scenes and objects. From generating human faces to creating fantastical landscapes, these models have shown a remarkable ability to learn complex data distributions and produce novel content. Despite this progress, a seemingly simple element, the mirror, remains difficult for them. Reflections, governed by the precise laws of optics, often appear distorted, misplaced, or entirely absent in generated images. This article explores why mirrors are so hard for generative models and suggests that addressing this blind spot is crucial to achieving more realistic and physically plausible image synthesis.
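To make "the precise laws of optics" concrete, here is a minimal sketch of the geometry of an ideal flat mirror: the virtual image of a point is that point reflected across the mirror plane. The function name, the coordinates, and the example scene are illustrative assumptions, not part of the article's evaluation, which was done by visual inspection.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def reflect_point(p: Sequence[float],
                  plane_point: Sequence[float],
                  normal: Sequence[float]) -> Vec3:
    """Return the mirror image of point p across an ideal flat mirror.

    The mirror is the plane passing through plane_point with the given
    normal vector (which does not need to be unit length).
    """
    norm_sq = sum(c * c for c in normal)
    if norm_sq == 0:
        raise ValueError("normal must be non-zero")
    # Projection coefficient of (p - plane_point) onto the normal.
    d = sum((pc - qc) * nc for pc, qc, nc in zip(p, plane_point, normal)) / norm_sq
    # Move p twice that distance along the normal, to the other side of the plane.
    return tuple(pc - 2.0 * d * nc for pc, nc in zip(p, normal))


# Hypothetical scene: a mirror in the plane z = 0 and an object 1.5 units in front of it.
obj = (0.4, 0.2, 1.5)
print(reflect_point(obj, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)))
```

The virtual image lands the same distance behind the mirror as the object is in front of it, with its position along the mirror surface unchanged. A generated image that places a reflection elsewhere, or omits it, violates this constraint.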

To assess how well popular image and video generation models synthesize content with accurate mirror reflections, we chose a range of generative models that are readily available to the public.

Image generation models

We evaluated five image generation models:

Gemini, which uses Imagen 3 as its generation backbone
Adobe Firefly
Bing Image Creator, which uses DALL-E 3
Ideogram
Freepik.com

These models were evaluated using the following prompts, some featuring humans and others containing only objects.

An image of a young lady holding a pen in front of a mirror
An image of two cats playing in front of a mirror
An image of a chair in front of a mirror
An image of a group of people in a room with a mirror in it
An image of a kitchen with a mirror in it
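
The models and prompts above form a small evaluation grid. The following sketch only enumerates that grid for bookkeeping. It assumes each model is paired with each prompt, which the article does not state explicitly. The `build_trials` helper and the dictionary layout are hypothetical, and no generation API is called.

```python
from itertools import product

# Model and prompt names are taken from the article.
IMAGE_MODELS = ["Gemini", "Adobe Firefly", "Bing", "Ideogram", "Freepik"]
PROMPTS = [
    "An image of a young lady holding a pen in front of a mirror",
    "An image of two cats playing in front of a mirror",
    "An image of a chair in front of a mirror",
    "An image of a group of people in a room with a mirror in it",
    "An image of a kitchen with a mirror in it",
]


def build_trials(models, prompts):
    """Pair every model with every prompt (assumed full grid)."""
    return [{"model": m, "prompt": p} for m, p in product(models, prompts)]


trials = build_trials(IMAGE_MODELS, PROMPTS)
print(len(trials))  # 5 models x 5 prompts = 25 pairings
```

A list like this is a convenient place to attach manual notes per pairing, for example whether a reflection is missing, misplaced, or distorted.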

The results from various models (some examples are shown below) show consistent patterns of reflection and perspective issues. The Gemini model struggles with incorrect or missing reflections and misjudged object placements, particularly with cats, chairs, and kitchen scenes. Some errors are subtle but noticeable.

The Ideogram model generally produces higher-fidelity images but also has recurring issues. Hand reflections are often incorrect, and objects can appear inconsistently reflected. It particularly struggles with group images and faces, making significant errors in reflections and image coherence. The quality of faces in group images is also poor.

Adobe Firefly has more severe errors, such as objects extending unnaturally outside mirrors and misaligned or missing reflections, leading to reduced realism.

Bing Image Creator often produces cartoonish images with significant reflection issues, misplacing or distorting elements.

Freepik-generated cat images show high visual quality but still suffer from similar reflection errors, highlighting a common challenge across models.

Image generated by Ideogram for the prompt: “An image of two cats playing in front of a mirror”

Image generated by Ideogram for the prompt: “An image of a chair in front of a mirror”

Image generated by Ideogram for the prompt: “An image of a kitchen with a mirror in it”

Image generated by Ideogram for the prompt: “An image of a young lady holding a pen in front of a mirror”

Image generated by Ideogram for the prompt: “An image of a group of people in a room with a mirror in it”

Image generated by Adobe Firefly for the prompt: “An image of a young lady holding a pen in front of a mirror”

High-resolution versions of the generated images are available on the GitHub page associated with this article for further examination.

Video generation models

Additionally, we evaluated the following text-to-video generation models using only the first prompt from the previous subsection.

veed.io
pollo.ai (Pollo 1.5)
ltx.studio
vidnoz.com

These models exhibit similar issues to those observed in the image generation models. In addition to errors in appearance and consistency, they also struggle with accurately generating motion in reflections. Reflected elements often move incorrectly or fail to correspond to the real-world physics of mirrored motion, further degrading the realism of the generated videos. As a result, their overall performance in handling reflections is particularly poor, making the generated videos noticeably flawed.
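
As a rough illustration of what "the real-world physics of mirrored motion" requires, the sketch below reflects a velocity vector across an ideal flat mirror: the component of motion perpendicular to the mirror flips sign, and the component parallel to it is unchanged. The function name and the example numbers are assumptions for illustration and were not used in the article's evaluation.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def reflect_velocity(v: Sequence[float], normal: Sequence[float]) -> Vec3:
    """Velocity of the mirror image of an object moving with velocity v.

    normal is the mirror's normal vector (any non-zero length). Unlike a
    position, a velocity needs no point on the plane: only direction matters.
    """
    norm_sq = sum(c * c for c in normal)
    if norm_sq == 0:
        raise ValueError("normal must be non-zero")
    # Projection coefficient of v onto the mirror normal.
    d = sum(vc * nc for vc, nc in zip(v, normal)) / norm_sq
    # Flip the perpendicular component, keep the parallel component.
    return tuple(vc - 2.0 * d * nc for vc, nc in zip(v, normal))


# Hypothetical example: mirror normal along +z, a hand moving sideways (+x)
# and toward the mirror (-z).
hand_velocity = (1.0, 0.0, -2.0)
print(reflect_velocity(hand_velocity, (0.0, 0.0, 1.0)))
```

In this example the reflected hand keeps the same sideways motion while approaching the mirror surface from the other side. A video in which the reflection drifts in a different lateral direction, or moves away while the subject approaches, breaks this relationship.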