(Comments)

Original link: https://news.ycombinator.com/item?id=41467704

Infinity AI has introduced a new video diffusion transformer model focused on generating realistic humanoid characters driven by audio input. Unlike conventional generative AI video models or talking-avatar tools, the technology produces expressive speech and motion in synthesized characters. Users can create unique characters on Infinity's platform and write scripts for them, and the system generates the corresponding video. Examples include the Mona Lisa speaking, a Pixar-style gnome reciting the Declaration of Independence, and Elon Musk singing "Fly Me to the Moon." The team says it has spent roughly $500k and about 11 GPU years of compute training the model, and training is still ongoing. Earlier attempts at creating familiar video characters faced problems such as facial expressions mismatched with the audio, a limited library of actors, and difficulty animating imaginary characters. To overcome these challenges, Infinity designed an end-to-end video diffusion transformer that takes a single image, audio, and additional conditioning information and outputs video frames. Although the model remains slow despite optimizations such as rectified flow and a 3D VAE embedding layer, the design effectively captures complex human motion and emotion. Notably, the model supports multiple languages, shows some understanding of physics (for example, earrings that dangle correctly), and can animate media types such as paintings and sculptures without having been trained on them. The current version still has limitations, including no support for animals, hands frequently intruding into the frame, weakness on cartoon images, and potential identity distortion. Try the model here: https://studio.infinity.ai/try-inf2. The authors invite feedback on their work.

Related Articles

Original Text
Hey HN, this is Lina, Andrew, and Sidney from Infinity AI (https://infinity.ai/). We've trained our own foundation video model focused on people. As far as we know, this is the first time someone has trained a video diffusion transformer that’s driven by audio input. This is cool because it allows for expressive, realistic-looking characters that actually speak. Here’s a blog with a bunch of examples: https://toinfinityai.github.io/v2-launch-page/

If you want to try it out, you can either (1) go to https://studio.infinity.ai/try-inf2, or (2) post a comment in this thread describing a character and we’ll generate a video for you and reply with a link. For example: “Mona Lisa saying ‘what the heck are you smiling at?’”: https://bit.ly/3z8l1TM “A 3D pixar-style gnome with a pointy red hat reciting the Declaration of Independence”: https://bit.ly/3XzpTdS “Elon Musk singing Fly Me To The Moon by Sinatra”: https://bit.ly/47jyC7C

Our tool at Infinity allows creators to type out a script with what they want their characters to say (and eventually, what they want their characters to do) and get a video out. We’ve trained for about 11 GPU years (~$500k) so far and our model recently started getting good results, so we wanted to share it here. We are still actively training.

We had trouble creating videos of good characters with existing AI tools. Generative AI video models (like Runway and Luma) don’t allow characters to speak. And talking avatar companies (like HeyGen and Synthesia) just do lip syncing on top of the previously recorded videos. This means you often get facial expressions and gestures that don’t make sense with the audio, resulting in the “uncanny” look you can’t quite put your finger on. See blog.

When we started Infinity, our V1 model took the lip syncing approach. In addition to mismatched gestures, this method had many limitations, including a finite library of actors (we had to fine-tune a model for each one with existing video footage) and an inability to animate imaginary characters.

To address these limitations in V2, we decided to train an end-to-end video diffusion transformer model that takes in a single image, audio, and other conditioning signals and outputs video. We believe this end-to-end approach is the best way to capture the full complexity and nuances of human motion and emotion. One drawback of our approach is that the model is slow despite using rectified flow (2-4x speed up) and a 3D VAE embedding layer (2-5x speed up).
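To make the architecture described above concrete, here is a minimal, hypothetical PyTorch sketch of a diffusion transformer conditioned on a single reference image and audio features, sampled with a few rectified-flow Euler steps. All module names, feature sizes, and the toy backbone are assumptions for illustration only, not Infinity's actual model.

import torch
import torch.nn as nn

class AudioConditionedVideoDiT(nn.Module):
    # Hypothetical sketch: one transformer over image, audio, and noisy video-latent tokens.
    def __init__(self, latent_dim=16, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.latent_proj = nn.Linear(latent_dim, d_model)   # noisy video latents -> tokens
        self.image_proj = nn.Linear(latent_dim, d_model)    # single reference-image latent -> token
        self.audio_proj = nn.Linear(128, d_model)           # audio features (e.g. mel frames) -> tokens
        self.time_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, latent_dim)           # predicts a velocity per video-latent token

    def forward(self, noisy_latents, image_latent, audio_feats, t):
        # noisy_latents: (B, T_video, latent_dim), e.g. from a 3D VAE encoder (assumed)
        # image_latent:  (B, 1, latent_dim) single conditioning image
        # audio_feats:   (B, T_audio, 128) audio conditioning
        # t:             (B, 1) flow time in [0, 1]
        tokens = torch.cat([
            self.image_proj(image_latent),
            self.audio_proj(audio_feats),
            self.latent_proj(noisy_latents) + self.time_embed(t).unsqueeze(1),
        ], dim=1)
        hidden = self.backbone(tokens)
        # Only the video-latent positions are decoded back to velocities.
        return self.out(hidden[:, -noisy_latents.shape[1]:])

@torch.no_grad()
def rectified_flow_sample(model, image_latent, audio_feats, n_frames=16, steps=8):
    # Rectified flow integrates a nearly straight ODE from noise (t=1) toward data (t=0),
    # which is why it needs far fewer steps than standard diffusion sampling.
    B = image_latent.shape[0]
    x = torch.randn(B, n_frames, 16)
    for i in range(steps):
        t = torch.full((B, 1), 1.0 - i / steps)
        v = model(x, image_latent, audio_feats, t)
        x = x - v / steps  # one Euler step along the learned velocity field
    return x  # video latents; a 3D VAE decoder (not shown) would turn these into frames

model = AudioConditionedVideoDiT()
latents = rectified_flow_sample(model,
                                image_latent=torch.randn(1, 1, 16),
                                audio_feats=torch.randn(1, 50, 128))
print(latents.shape)  # torch.Size([1, 16, 16])

The point of the sketch is the data flow the post describes: the image and audio enter as conditioning tokens in the same sequence as the noisy video latents, so gestures and expressions are generated jointly with the speech rather than lip-synced on afterward.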

Here are a few things the model does surprisingly well on: (1) it can handle multiple languages, (2) it has learned some physics (e.g. it generates earrings that dangle properly and infers a matching pair on the other ear), (3) it can animate diverse types of images (paintings, sculptures, etc) despite not being trained on those, and (4) it can handle singing. See blog.

Here are some failure modes of the model: (1) it cannot handle animals (only humanoid images), (2) it often inserts hands into the frame (very annoying and distracting), (3) it’s not robust on cartoons, and (4) it can distort people’s identities (noticeable on well-known figures). See blog.

Try the model here: https://studio.infinity.ai/try-inf2

We’d love to hear what you think!
