Show HN: Infinity – Realistic AI characters that can speak

Original link: https://news.ycombinator.com/item?id=41467704


Infinity AI introduces a new foundation video model called Inf2, which creates lifelike, speaking characters driven by audio input. Unlike other AI tools, Inf2 lets users create unique speaking characters without the uncanny feel caused by mismatched facial expressions and gestures. Users write scripts detailing their characters' dialogue and actions, then receive videos back. Currently the model focuses on humanoid images, but it may expand to other types of visual elements in the future. The team behind Inf2 has invested over $500K in its development across approximately 11 GPU-years of training, with further improvements expected. The model still struggles in some areas: handling animals, adding unwanted hand gestures, a lack of robustness on cartoonish imagery, and maintaining identity accuracy, particularly for recognizable figures. You can try the model at https://studio.infinity.ai/try-inf2. The team welcomes feedback.


Hey HN, this is Lina, Andrew, and Sidney from Infinity AI (https://infinity.ai/). We've trained our own foundation video model focused on people. As far as we know, this is the first time someone has trained a video diffusion transformer that’s driven by audio input. This is cool because it allows for expressive, realistic-looking characters that actually speak. Here’s a blog with a bunch of examples: https://toinfinityai.github.io/v2-launch-page/

If you want to try it out, you can either (1) go to https://studio.infinity.ai/try-inf2, or (2) post a comment in this thread describing a character and we’ll generate a video for you and reply with a link. For example:

“Mona Lisa saying ‘what the heck are you smiling at?’”: https://bit.ly/3z8l1TM
“A 3D Pixar-style gnome with a pointy red hat reciting the Declaration of Independence”: https://bit.ly/3XzpTdS
“Elon Musk singing Fly Me To The Moon by Sinatra”: https://bit.ly/47jyC7C

Our tool at Infinity allows creators to type out a script with what they want their characters to say (and eventually, what they want their characters to do) and get a video out. We’ve trained for about 11 GPU years (~$500k) so far and our model recently started getting good results, so we wanted to share it here. We are still actively training.

We had trouble creating videos of good characters with existing AI tools. Generative AI video models (like Runway and Luma) don’t allow characters to speak. And talking avatar companies (like HeyGen and Synthesia) just do lip syncing on top of previously recorded videos. This means you often get facial expressions and gestures that don’t make sense with the audio, resulting in the “uncanny” look you can’t quite put your finger on. See blog.

When we started Infinity, our V1 model took the lip syncing approach. In addition to mismatched gestures, this method had many limitations, including a finite library of actors (we had to fine-tune a model for each one with existing video footage) and an inability to animate imaginary characters.

To address these limitations in V2, we decided to train an end-to-end video diffusion transformer model that takes in a single image, audio, and other conditioning signals and outputs video. We believe this end-to-end approach is the best way to capture the full complexity and nuances of human motion and emotion. One drawback of our approach is that the model is slow despite using rectified flow (2-4x speed up) and a 3D VAE embedding layer (2-5x speed up).
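To make the setup above more concrete, here is a minimal, hypothetical PyTorch sketch of the pieces such a pipeline typically involves: a 3D VAE that compresses video into a latent grid, a transformer that predicts a velocity field over those latents conditioned on a reference image and audio features, and a rectified-flow training objective. This is not Infinity's actual code; all module names, shapes, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of an audio-conditioned video diffusion transformer
# trained with rectified flow. Shapes and layer sizes are illustrative only.
import torch
import torch.nn as nn

class Toy3DVAEEncoder(nn.Module):
    """Stand-in for a 3D VAE: compresses (T, H, W) video into a smaller latent grid."""
    def __init__(self, in_ch=3, latent_ch=8):
        super().__init__()
        self.net = nn.Conv3d(in_ch, latent_ch, kernel_size=4, stride=4)  # 4x temporal/spatial compression

    def forward(self, video):           # video: (B, 3, T, H, W)
        return self.net(video)          # latents: (B, 8, T/4, H/4, W/4)

class ToyAudioConditionedDiT(nn.Module):
    """Stand-in for a diffusion transformer over video latents, driven by audio."""
    def __init__(self, latent_ch=8, dim=256, audio_dim=128):
        super().__init__()
        self.patch = nn.Linear(latent_ch, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.time_proj = nn.Linear(1, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=2)
        self.out = nn.Linear(dim, latent_ch)

    def forward(self, noisy_latents, t, audio_feats, ref_latents):
        # Flatten the (C, T, H, W) latent grid into a token sequence.
        B, C, T, H, W = noisy_latents.shape
        tokens = self.patch(noisy_latents.permute(0, 2, 3, 4, 1).reshape(B, T * H * W, C))
        # Conditioning tokens: audio features plus latents of the single reference image.
        ref_tokens = self.patch(ref_latents.permute(0, 2, 3, 4, 1).reshape(B, -1, C))
        cond = torch.cat([self.audio_proj(audio_feats), ref_tokens], dim=1)
        t_emb = self.time_proj(t.view(B, 1)).unsqueeze(1)
        x = self.blocks(torch.cat([tokens + t_emb, cond], dim=1))
        x = x[:, : T * H * W]                                   # drop conditioning tokens
        return self.out(x).reshape(B, T, H, W, C).permute(0, 4, 1, 2, 3)  # predicted velocity

def rectified_flow_loss(model, latents, audio_feats, ref_latents):
    """One rectified-flow training step: regress the straight-line velocity (noise - data)."""
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)
    t_b = t.view(-1, 1, 1, 1, 1)
    x_t = (1 - t_b) * latents + t_b * noise                     # linear path between data and noise
    v_pred = model(x_t, t, audio_feats, ref_latents)
    return torch.mean((v_pred - (noise - latents)) ** 2)
```

The rectified-flow parameterization matters at inference time: because the learned velocity field follows a roughly straight path, the sampling ODE can be integrated in far fewer steps than a standard diffusion sampler, which is presumably where the quoted 2-4x speedup comes from.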

Here are a few things the model does surprisingly well on: (1) it can handle multiple languages, (2) it has learned some physics (e.g. it generates earrings that dangle properly and infers a matching pair on the other ear), (3) it can animate diverse types of images (paintings, sculptures, etc) despite not being trained on those, and (4) it can handle singing. See blog.

Here are some failure modes of the model: (1) it cannot handle animals (only humanoid images), (2) it often inserts hands into the frame (very annoying and distracting), (3) it’s not robust on cartoons, and (4) it can distort people’s identities (noticeable on well-known figures). See blog.

Try the model here: https://studio.infinity.ai/try-inf2

We’d love to hear what you think!
