Do they have a local model?
I keep getting burned by APIs with stupid restrictions that make use cases impossible which would be trivial if you could run the thing locally.
> For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI.

Let me show you the future: https://www.youtube.com/watch?v=eVlXZKGuaiE
This is an LLM controlling an embodied VR body in a physics simulation. It responds to human voice input not only with voice but with body movements. Transformers aren't just chatbots; they are general symbolic manipulation machines. Anything that can be expressed as a series of symbols is something they can do.
> anyone can easily see the unrealistic outputs without complex statistical tests.

This is key: we're all pre-wired with fast correctness tests. Are there other data types that match this?
Thanks for the correction. Not being well-versed in AI tech, I misinterpreted what you wrote and assumed it might enable more granular feedback and iteration.
What stuck out to me from this release was this:

> Optional quad or triangle remeshing (adding only 100-200ms to processing time)

But it seems to have been optional. Did you try it with that turned on? I'd be very interested in those results, as I had the same experience as you: the models don't generate good enough meshes, so I was hoping this one would be a bit better at that.

Edit: I just tried it out myself on their Huggingface demo, and even with the predefined images they have there, the mesh output is just not good enough: https://i.imgur.com/e6voLi6.png
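If anyone else wants to poke at the mesh itself rather than just the web viewer, here's a minimal sketch of one way to do it, assuming you download the generated GLB from the demo (trimesh and the file name aren't part of the release, they're just for illustration):

```python
import trimesh

# Load the GLB downloaded from the demo (path is a placeholder).
mesh = trimesh.load("stable_fast_3d_output.glb", force="mesh")

# Basic topology stats: vertex/face counts and watertightness give a rough
# sense of whether the remesher produced a usable surface.
print("vertices:  ", len(mesh.vertices))
print("faces:     ", len(mesh.faces))
print("watertight:", mesh.is_watertight)
print("euler num: ", mesh.euler_number)
```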
Perhaps it'll require a series of segmentations and transforms that improve individual components and then work up towards the full 3D model of the image.
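As a rough sketch of what that kind of staged pipeline could look like (every helper here is hypothetical and stubbed out so the sketch runs, not an existing API):

```python
def segment_image(image):
    """Hypothetical segmenter: returns labeled crops of the input image."""
    # A real version would split the image into components; here we just
    # return the whole image as a single "part" so the sketch runs end to end.
    return [("whole", image)]


def reconstruct_part(label, crop):
    """Hypothetical per-part reconstructor (e.g. an image-to-3D model per crop)."""
    return {"label": label, "mesh": None}


def assemble_scene(parts):
    """Hypothetical assembler that poses the reconstructed parts into one scene."""
    return {"parts": parts}


def image_to_3d_staged(image):
    # 1. Segment into components, 2. reconstruct each on its own,
    # 3. work back up to the full 3D model of the image.
    parts = segment_image(image)
    meshes = [reconstruct_part(label, crop) for label, crop in parts]
    return assemble_scene(meshes)
```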
I'm really excited for something in this area to deliver, and it's really cool that I can just drag pictures into the demo on HuggingFace [0] to try it.

However... mixed success. It's not good with (real) cats yet, which was obvs the first thing I tried. It did reasonably well with a simple image of an iPhone, actually pretty impressively with a pancake with fruit on top, terribly with a rocket, and impressively again with a rack of pool balls.

[0] https://huggingface.co/spaces/stabilityai/stable-fast-3d
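If you'd rather script a batch of test images than drag them in one at a time, the Space can also be called with a recent gradio_client; a minimal sketch (the endpoint name and argument list below are assumptions, so check them against client.view_api() first):

```python
from gradio_client import Client, handle_file

client = Client("stabilityai/stable-fast-3d")

# Print the Space's actual endpoints and parameters; the predict() call
# below is a guess and should be adjusted to match this output.
client.view_api()

for path in ["cat.jpg", "rocket.png", "pool_balls.jpg"]:
    # api_name and argument order are assumptions for a typical image-to-3D Space.
    result = client.predict(handle_file(path), api_name="/run")
    print(path, "->", result)
```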
Hopefully that's in the (near) future, but as of now 'retopo' still exists in 3D work for a reason, just like roto and similar menial tasks. We're getting there with automation, though.
It really looks like they've been doing that classic infomercial tactic of desaturating the images of the things they're comparing against to make theirs seem better.
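That's easy enough to check if you grab crops of the comparison figure; a rough sketch using Pillow and NumPy (neither the tools nor the file names come from the post, they're just illustrative):

```python
import numpy as np
from PIL import Image


def mean_saturation(path):
    """Average S channel (0-255) after converting the image to HSV."""
    hsv = np.asarray(Image.open(path).convert("RGB").convert("HSV"))
    return hsv[..., 1].mean()


# File names are placeholders for crops of the side-by-side comparison.
print("theirs:    ", mean_saturation("theirs.png"))
print("competitor:", mean_saturation("competitor.png"))
```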
High-res images from multiple perspectives should be sufficient. If you have a consumer drone, this product (no affiliation) is extremely impressive:

https://www.dronedeploy.com/

You basically select an area on a map that you want to model in 3D; it flies your drone (take-off, flight path, landing), takes pictures, uploads them to their servers for processing, generates a point cloud, etc. Very powerful.
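If you'd rather keep the processing local instead of uploading to their servers, the same overlapping multi-view photos can be turned into a sparse point cloud with COLMAP; a minimal sketch via its pycolmap bindings (pycolmap isn't part of that product, it's just one open-source option, and the paths are placeholders):

```python
from pathlib import Path

import pycolmap

image_dir = Path("drone_photos")       # folder of overlapping aerial photos
output_dir = Path("reconstruction")
output_dir.mkdir(exist_ok=True)
database = output_dir / "database.db"

# Detect keypoints, match them across images, then run incremental SfM
# to recover camera poses and a sparse point cloud.
pycolmap.extract_features(database, image_dir)
pycolmap.match_exhaustive(database)
maps = pycolmap.incremental_mapping(database, image_dir, output_dir)
maps[0].write(output_dir)
```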
What I'd really like to see in these kinds of articles is examples of it not working as well. I don't necessarily want to see it being perfect; I'd quite like to see its limitations too.
Great result. Just had a play around with the demo models, and they preserve structure really nicely, although the textures are still not great. It's kind of a voxelized version of the input image.
For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI:

* so-called "hallucination" (actually just how generative models work) is a feature, not a bug.
* anyone can easily see the unrealistic and biased outputs without complex statistical tests.
* human intuition is useful for evaluation, and not fundamentally misleading (i.e. the equivalent of "this text sounds fluent, so the generator must be intelligent!" hype doesn't really exist for imagery. We're capable of treating it as technology and evaluating it fairly, because there's no equivalent human capability.)
* even lossy, noisy, collapsed and over-trained methods can be valuable for different creative pursuits.
* perfection is not required. You can easily see distorted features in output, and iteratively try to improve them.
* consistency is not required (though it will unlock hugely valuable applications, like video, should it ever arrive).
* technologies like LoRA allow even unskilled users to train character-, style-, or concept-specific models with ease (a minimal loading sketch is at the end of this comment).
I've been amazed at how much better image / visual generation models have become in the last year, and IMO the pace of improvement has not been slowing as much as it has for text models. Moreover, it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc., but rather a generation of crazy AI-based power tools that can do things like add and remove concepts in imagery with a few text prompts. It's insanely useful, and just like Photoshop in the 90s, a new generation of power users is already emerging and doing wild things with the tools.
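On the LoRA point above, loading a community-trained character or style adapter into a standard diffusers pipeline is roughly this much code (the base model and LoRA path are placeholders, not anything from this thread):

```python
import torch
from diffusers import StableDiffusionPipeline

# Any Stable Diffusion checkpoint plus a character/style LoRA trained on a
# handful of images works the same way; names below are illustrative only.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/character_lora")

image = pipe("my character riding a bicycle, watercolor style").images[0]
image.save("out.png")
```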