Do they have a local model?
I keep getting burned by APIs with stupid restrictions that make use cases impossible which would be trivial if you could run the thing locally.
> For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI.

Let me show you the future: https://www.youtube.com/watch?v=eVlXZKGuaiE
This is an LLM controlling an embodied VR body in a physics simulation. It responds to human voice input not only with voice but with body movements. Transformers aren't just chatbots; they are general symbolic manipulation machines. Anything that can be expressed as a series of symbols is something they can do.
> anyone can easily see the unrealistic outputs without complex statistical tests.

This is key: we're all pre-wired with fast correctness tests. Are there other data types that match this?
Thanks for the correction. Not being well-versed in AI tech, I misinterpreted what you wrote and assumed it might enable more granular feedback and iteration.
What stuck out to me from this release was this:

> Optional quad or triangle remeshing (adding only 100-200ms to processing time)

But it seems to have been optional. Did you try it with that turned on? I'd be very interested in those results, as I had the same experience as you: the models don't generate good enough meshes, so I was hoping this one would be a bit better at that.

Edit: I just tried it out myself on their Huggingface demo, and even with the predefined images they have there, the mesh output is just not good enough: https://i.imgur.com/e6voLi6.png
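If anyone else wants to poke at the mesh itself rather than just the web viewer, here's a minimal sketch of one way to do it, assuming you download the generated GLB from the demo (trimesh and the file name aren't part of the release, they're just for illustration):

```python
import trimesh

# Load the GLB downloaded from the demo (path is a placeholder).
mesh = trimesh.load("stable_fast_3d_output.glb", force="mesh")

# Basic topology stats: vertex/face counts and watertightness give a rough
# sense of whether the remesher produced a usable surface.
print("vertices:  ", len(mesh.vertices))
print("faces:     ", len(mesh.faces))
print("watertight:", mesh.is_watertight)
print("euler num: ", mesh.euler_number)
```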
Perhaps it'll require a series of segmentations and transforms that improve individual components and then work up towards the full 3D model of the image.
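As a rough sketch of what that kind of staged pipeline could look like (every helper here is hypothetical and stubbed out so the sketch runs, not an existing API):

```python
def segment_image(image):
    """Hypothetical segmenter: returns labeled crops of the input image."""
    # A real version would split the image into components; here we just
    # return the whole image as a single "part" so the sketch runs end to end.
    return [("whole", image)]


def reconstruct_part(label, crop):
    """Hypothetical per-part reconstructor (e.g. an image-to-3D model per crop)."""
    return {"label": label, "mesh": None}


def assemble_scene(parts):
    """Hypothetical assembler that poses the reconstructed parts into one scene."""
    return {"parts": parts}


def image_to_3d_staged(image):
    # 1. Segment into components, 2. reconstruct each on its own,
    # 3. work back up to the full 3D model of the image.
    parts = segment_image(image)
    meshes = [reconstruct_part(label, crop) for label, crop in parts]
    return assemble_scene(meshes)
```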
I'm really excited for something in this area to deliver, and it's really cool that I can just drag pictures into the demo on HuggingFace [0] to try it.

However... mixed success. It's not good with (real) cats yet, which was obvs the first thing I tried. It did reasonably well with a simple image of an iPhone, actually pretty impressively with a pancake with fruit on top, terribly with a rocket, and impressively again with a rack of pool balls.

[0] https://huggingface.co/spaces/stabilityai/stable-fast-3d
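If you'd rather script a batch of test images than drag them in one at a time, the Space can also be called with a recent gradio_client; a minimal sketch (the endpoint name and argument list below are assumptions, so check them against client.view_api() first):

```python
from gradio_client import Client, handle_file

client = Client("stabilityai/stable-fast-3d")

# Print the Space's actual endpoints and parameters; the predict() call
# below is a guess and should be adjusted to match this output.
client.view_api()

for path in ["cat.jpg", "rocket.png", "pool_balls.jpg"]:
    # api_name and argument order are assumptions for a typical image-to-3D Space.
    result = client.predict(handle_file(path), api_name="/run")
    print(path, "->", result)
```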
Hopefully that's in the (near) future, but as of now 'retopo' still exists in 3D work for a reason, just like roto and similar menial tasks. We're getting there with automation, though.
It really looks like they've been doing that classic infomercial tactic of desaturating the images of the things they're comparing against to make theirs seem better.
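That's easy enough to check if you grab crops of the comparison figure; a rough sketch using Pillow and NumPy (neither the tools nor the file names come from the post, they're just illustrative):

```python
import numpy as np
from PIL import Image


def mean_saturation(path):
    """Average S channel (0-255) after converting the image to HSV."""
    hsv = np.asarray(Image.open(path).convert("RGB").convert("HSV"))
    return hsv[..., 1].mean()


# File names are placeholders for crops of the side-by-side comparison.
print("theirs:    ", mean_saturation("theirs.png"))
print("competitor:", mean_saturation("competitor.png"))
```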
High-res images from multiple perspectives should be sufficient. If you have a consumer drone, this product (no affiliation) is extremely impressive:

https://www.dronedeploy.com/

You basically select an area on a map that you want to model in 3D; it flies your drone (take-off, flight path, landing), takes pictures, uploads them to their servers for processing, generates a point cloud, etc. Very powerful.
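If you'd rather keep the processing local instead of uploading to their servers, the same overlapping multi-view photos can be turned into a sparse point cloud with COLMAP; a minimal sketch via its pycolmap bindings (pycolmap isn't part of that product, it's just one open-source option, and the paths are placeholders):

```python
from pathlib import Path

import pycolmap

image_dir = Path("drone_photos")       # folder of overlapping aerial photos
output_dir = Path("reconstruction")
output_dir.mkdir(exist_ok=True)
database = output_dir / "database.db"

# Detect keypoints, match them across images, then run incremental SfM
# to recover camera poses and a sparse point cloud.
pycolmap.extract_features(database, image_dir)
pycolmap.match_exhaustive(database)
maps = pycolmap.incremental_mapping(database, image_dir, output_dir)
maps[0].write(output_dir)
```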
What I'd really like to see in these kinds of articles is examples of it not working as well. I don't necessarily want to see it being perfect; I'd quite like to see its limitations too.
Great result. Just had a play around with the demo models, and they preserve structure really nicely, although the textures are still not great. It's kind of a voxelized version of the input image.
For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI:

* so-called "hallucination" (actually just how generative models work) is a feature, not a bug.
* anyone can easily see the unrealistic and biased outputs without complex statistical tests.
* human intuition is useful for evaluation, and not fundamentally misleading (i.e. the equivalent of "this text sounds fluent, so the generator must be intelligent!" hype doesn't really exist for imagery. We're capable of treating it as technology and evaluating it fairly, because there's no equivalent human capability.)
* even lossy, noisy, collapsed and over-trained methods can be valuable for different creative pursuits.
* perfection is not required. You can easily see distorted features in output, and iteratively try to improve them.
* consistency is not required (though it will unlock hugely valuable applications, like video, should it ever arrive).
* technologies like LoRA allow even unskilled users to train character-, style-, or concept-specific models with ease (a minimal loading sketch is at the end of this comment).
I've been amazed at how much better image / visual generation models have become in the last year, and IMO the pace of improvement has not been slowing as much as it has for text models. Moreover, it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc., but rather a generation of crazy AI-based power tools that can do things like add and remove concepts in imagery with a few text prompts. It's insanely useful, and just like Photoshop in the 90s, a new generation of power users is already emerging and doing wild things with the tools.
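On the LoRA point above, loading a community-trained character or style adapter into a standard diffusers pipeline is roughly this much code (the base model and LoRA path are placeholders, not anything from this thread):

```python
import torch
from diffusers import StableDiffusionPipeline

# Any Stable Diffusion checkpoint plus a character/style LoRA trained on a
# handful of images works the same way; names below are illustrative only.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/character_lora")

image = pipe("my character riding a bicycle, watercolor style").images[0]
image.save("out.png")
```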