(comments)

Original link: https://news.ycombinator.com/item?id=39360106

Based on the text above, we can draw the following conclusions:

1. Stable Cascade delivers a high-quality image-generation approach at a significantly lower level of resource consumption, enabling faster inference and lower system requirements.
2. There is a growing trend toward smaller, more efficient AI models, as evidenced by the increasing number of such products appearing on the market.
3. Some debate centers on whether Stable Cascade produces the exact same image on every render iteration or slightly different versions depending on the angle; one commenter notes that, since Stable Cascade is designed for more complex compositions, it stands to reason that small differences may appear between iterations.
4. An anonymous poster asked whether prompt alignment in Stable Cascade is better than in previous models, suggesting an improvement over earlier versions.
5. While some criticism targets the way Stable Cascade's capabilities are described, others compare it to an efficient image-compression file format, achieving 42x spatial compression without sacrificing visual detail.
6. One commenter mentions running into difficulties conditioning Stage B on the latent, and therefore choosing a simpler alternative based on plain channel concatenation.
7. The same discussion mentions experiments with other forms of positional embeddings to improve conditioning efficacy, aiming for better results with fewer steps and less overall compute.

Overall, the discussion suggests that increasingly refined and sophisticated AI algorithms represent a significant technological advance, exemplified by products like Stable Cascade, which demonstrate a marked reduction in computational overhead in exchange for more efficient generation of visually rich output.

Related articles

Original text
Stable Cascade (github.com/stability-ai)
623 points by davidbarker 17 hours ago | 135 comments










Been using it for a couple of hours and it seems it’s much better at following the prompt. Right away it seems the quality is worse compared to some SDXL models but I’ll reserve judgement until a couple more days of testing.

It’s fast too! I would reckon about 2-3x faster than non-turbo SDXL.



I'll take prompt adherence over quality any day. The machinery otherwise isn't worth it, i.e. the ControlNets, OpenPose, depth maps just to force a particular look or to achieve depth. The solution becomes bespoke for each generation.

Had a test of it and my opinion is it's an improvement when it comes to following prompts, and I do find the images more visually appealing.



Can we use its output as input to SDXL? Presumably it would just fill in the details, and not create whole new images.


I was thinking exactly that. You could use the same trick as the hires-fix for an adherence-fix.


Yeah, chain it in Comfy to a turbo model for detail.
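For anyone who wants to experiment with that chain outside ComfyUI, here is a minimal sketch assuming the diffusers Stable Cascade pipelines plus the standard SDXL img2img pipeline; the model IDs, step counts, and strength value are illustrative choices, not official recommendations:

    import torch
    from diffusers import (
        StableCascadePriorPipeline,
        StableCascadeDecoderPipeline,
        StableDiffusionXLImg2ImgPipeline,
    )

    prompt = "a cinematic photo of an antique brass telescope on a cliff at sunset"

    # Stage C (the prior) turns the prompt into compact image embeddings.
    prior = StableCascadePriorPipeline.from_pretrained(
        "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
    ).to("cuda")
    prior_out = prior(prompt=prompt, height=1024, width=1024, num_inference_steps=20)

    # Stages B + A decode those embeddings into a 1024x1024 image.
    decoder = StableCascadeDecoderPipeline.from_pretrained(
        "stabilityai/stable-cascade", torch_dtype=torch.float16
    ).to("cuda")
    cascade_image = decoder(
        image_embeddings=prior_out.image_embeddings.to(torch.float16),
        prompt=prompt,
        num_inference_steps=10,
    ).images[0]

    # A low-strength SDXL img2img pass adds detail without re-composing the image.
    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    refined = refiner(prompt=prompt, image=cascade_image, strength=0.3).images[0]
    refined.save("cascade_then_sdxl.png")

A turbo checkpoint would slot into the last step the same way; the low strength is what keeps the composition from the Cascade output instead of generating a whole new image.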


How much VRAM does it need? They mention that the largest model uses 1.4 billion parameters more than SDXL, which in turn needs a lot of VRAM.


Should use no more than 6GiB for FP16 models at each stage. The current implementation is not RAM optimized.


The large C model uses 3.6 billion parameters which is 6.7 GiB if each parameter is 16 bits.
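For reference, the arithmetic behind that figure (fp16 means two bytes per parameter):

    params = 3.6e9               # large stage C variant
    weight_bytes = params * 2    # 2 bytes per fp16 parameter
    print(weight_bytes / 2**30)  # ~6.7 GiB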


The large C model has a fair bit of parameters tied to text conditioning, not to the main denoising process. Similar to how we split the network for SDXL Base, I am pretty confident we can split a non-trivial amount of parameters off to text conditioning, hence loading less than 3.6B parameters during the denoising process.


What's more, they can presumably be swapped in and out like the SDXL base + refiner, right?


There was a leak from Japan yesterday, prior to this release, in which 20 GB was suggested for the largest model.

This text was part of the Stability Japan leak (the 20 GB VRAM reference was dropped in the release today):

"Stages C and B will be released in two different models. Stage C uses parameters of 1B and 3.6B, and Stage B uses parameters of 700M and 1.5B. However, if you want to minimize your hardware needs, you can also use the 1B parameter version. In Stage B, both give great results, but 1.5 billion is better at reconstructing finer details. Thanks to Stable Cascade's modular approach, the expected amount of VRAM required for inference can be kept at around 20GB, but can be reduced even further by using the smaller variations (which, as mentioned earlier, may reduce the final output quality)."



Thanks. I guess this means that fewer people will be able to use it on their own computer, but the improved efficiency makes it cheaper to run on servers with enough VRAM.

Maybe running stage C first, unloading it from VRAM, and then doing B and A would make it fit in 12 or even 8 GB, but I wonder if the memory transfers would negate any time saving. Might still be worth it if it produces better images though.



Sequential model offloading isn’t too bad. It adds about a second or less to inference, assuming it still fits in main memory.
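If you end up using the diffusers wrappers rather than the research repo, this is the usual knob; a rough sketch, with the pipeline names assumed rather than verified against this exact release:

    import torch
    from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

    prior = StableCascadePriorPipeline.from_pretrained(
        "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
    )
    decoder = StableCascadeDecoderPipeline.from_pretrained(
        "stabilityai/stable-cascade", torch_dtype=torch.float16
    )

    # Each sub-model is moved to the GPU only while it is actually running,
    # trading some PCIe transfer time for a much smaller VRAM footprint.
    prior.enable_model_cpu_offload()
    decoder.enable_model_cpu_offload()

    # enable_sequential_cpu_offload() is even more aggressive (per submodule),
    # slower again, but fits into even less VRAM.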


Sometimes I forget how fast modern computers are. PCIe v4 x16 has a transfer speed of 31.5 GB/s, so theoretically it should take less than 100 ms to transfer stage B and A. Maybe it's not so bad after all, it will be interesting to see what happens.
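Back-of-the-envelope version of that estimate, assuming the 1.5B-parameter stage B at fp16 and the theoretical PCIe 4.0 x16 rate (real transfers are slower, and stage A is small enough to ignore here):

    stage_b_params = 1.5e9             # large stage B variant
    weight_bytes = stage_b_params * 2  # fp16 weights, ~3 GB
    pcie4_x16 = 31.5e9                 # bytes/s, theoretical peak
    print(weight_bytes / pcie4_x16)    # ~0.095 s before protocol overhead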


If it worked I imagine large batching could make it worth the load/unload time cost.


Can one run it on CPU?


Stable Diffusion on a 16-core AMD CPU takes me about 2-3 hours to generate an image, just to give you a rough idea of the performance. (On the same chip's iGPU it takes 2 minutes or so.)


Even older GPUs are worth using then I take it?

For example I pulled a (2GB I think, 4 tops) 6870 out of my desktop because it's a beast (in physical size, and power consumption) and I wasn't using it for gaming or anything, figured I'd be fine just with the Intel integrated graphics. But if I wanted to play around with some models locally, it'd be worth putting it back & figuring out how to use it as a secondary card?



One counterintuitive advantage of the integrated GPU is it has access to system RAM (instead of using a dedicated and fixed amount of VRAM). That means I'm able to give the iGPU 16 GB of RAM. For me SD takes 8-9 GB of RAM when running. The system RAM is slower than VRAM which is the trade-off here.


Yeah I did wonder about that as I typed, which is why I mentioned the low amount (by modern standards anyway) on the card. OK, thanks!


No, I don't think so. I think you would need more VRAM to start with.


2GB is really low. I've been able to use A1111 Stable Diffusion on my old gaming laptop's 1060 (6GB VRAM) and it takes a little bit less than a minute to generate an image. You would probably need to try the --lowvram flag on startup.


Which AMD CPU/iGPU are these timings for?


SDXL Turbo is much better, albeit kinda fuzzy and distorted. I was able to get decent single-sample response times (~80-100s) from my 4 core ARM Ampere instance, good enough for a Discord bot with friends.


SD Turbo runs nicely on an M2 MacBook Air (as does Stable LM 2!)

Much faster models will come



If that is true, then the CPU variant must be a much worse implementation of the algorithm than the GPU variant, because the true ratio of the GPU and CPU performances is many times less than that.


Not if you want to finish the generation before you have stopped caring about the results.


You can run any ML model on CPU. The question is the performance


Very impressive.

From what I understand, Stability AI is currently VC funded. It’s bound to burn through tons of money and it’s not clear whether the business model (if any) is sustainable. Perhaps worthy of government funding.



Stability AI has been burning through tons of money for a while now, which is the reason newer models like Stable Cascade are no longer open source under a commercially friendly license.

> The company is spending significant amounts of money to grow its business. At the time of its deal with Intel, Stability was spending roughly $8 million a month on bills and payroll and earning a fraction of that in revenue, two of the people familiar with the matter said.

> It made $1.2 million in revenue in August and was on track to make $3 million this month from software and services, according to a post Mostaque wrote on Monday on X, the platform formerly known as Twitter. The post has since been deleted.

https://fortune.com/2023/11/29/stability-ai-sale-intel-ceo-r...



I get the impression that a lot of open source adjacent AI companies, including Stability AI, are in the "???" phase of execution, hoping the "Profit" phase comes next.

Given how much VC money is chasing the AI space, this isn't necessarily a bad plan. Give stuff away for free while developing deep expertise, then either figure out something to sell, or pivot to proprietary, or get acquihired by a tech giant.



That is indeed the case, hence the more recent pushes toward building moats by every AI company.


> which is the reason newer models like Stable Cascade are not commercially-friendly-licensed open source anymore.

The main reason is probably Midjourney and OpenAI using their tech without any kind of contribution back. AI desperately needs a GPL equivalent…



It's highly doubtful that Midjourney and OpenAI use Stable Diffusion or other Stability models.


Midjourney 100% at least used to use Stable Diffusion: https://twitter.com/EMostaque/status/1561917541743841280

I am not sure if that is still the case.



It trialled it as an explicitly optional model for a moment a couple years ago. (or only a year? time moves so fast. somewhere in v2/v3 timeframe and around when SD came out). I am sure it is no longer the case.


DALL-E shares the same autoencoders as SD v1.x. It is probably similar to how Meta's Emu-class models work though. They tweaked the architecture quite a bit, trained on their own dataset, reused some components (or in Emu case, trained all the components from scratch but reused the same arch).


How do you know though?


You can't use off-the-shelf models to get the results Midjourney and DALL-E generate, even with strong finetuning.


I pay for both MJ and DALL-E (though OpenAI mostly gets my money for GPT) and don't find them to produce significantly better images than popular checkpoints on CivitAI. What I do find is that they are significantly easier to work with. (Actually, my experience with hundreds of DALL-E generations is that it's actually quite poor in quality. I'm in several IRC channels where it's the image generator of choice for some IRC bots, and I'm never particularly impressed with the visual quality.)

For MJ in particular, knowing that they at least used to use Stable Diffusion under the hood, it would not surprise me if the majority of the secret sauce is actually a middle layer that processes the prompt and converts it to one that is better for working with SD. Prompting SD to get output at the MJ quality level takes significantly more tokens, lots of refinement, heavy tweaking of negative prompting, etc. Also a stack of embeddings and LoRAs, though I would place those more in the category of finetuning like you had mentioned.



If you try DiffusionGPT with regional prompting added and a GAN corrector, you can get a good idea of what is possible: https://diffusiongpt.github.io


That looks very impressive, unless the demo is cherry-picked. It would be great if this could be implemented in a frontend like Fooocus: https://github.com/lllyasviel/Fooocus


What do you use it for? I haven't found a great use for it myself (outside of generating assets for landing pages / apps, where it's really really good). But I have seen endless subreddits / instagram pages dedicated to various forms of AI content, so it seems lots of people are using it for fun?


Nothing professional. I run a variety of tabletop RPGs for friends, so I mostly use it for making visual aids there. I've also got a large format printer that I was no longer using for its original purpose, so I bought a few front-loading art frames that I generate art for and rotate through periodically.

I've also used it to generate art for deskmats I got printed at https://specterlabs.co/

For commercial stuff I still pay human artists.



Whose frames do you use? Do you like them? I print my photos to frame and hang, and wouldn't at all mind being able to rotate them more conveniently and inexpensively than dedicating a frame to each allows.


https://www.spotlightdisplays.com/

I like them quite a bit, and you can get basically any size cut to fit your needs even if they don't directly offer it on the site.



Perfectly suited to go alongside the style of frame I already have lots of, and very reasonably priced off the shelf for the 13x19 my printer tops out at. Thanks so much! It'll be easier to fill that one blank wall now.


What IRC Channels do you frequent?


Largely some old channels from the 90s/00s that really only exist as vestiges of their former selves - not really related to their original purpose, just rooms for hanging out with friends made there back when they had a point besides being a group chat.


Midjourney has absolutely nothing to offer compared to proper finetunes. DALL-E does: it generalizes well (it can make objects interact properly, for example) and has great prompt adherence. But it can also be unpredictable as hell because it rewrites the prompts. DALL-E's quality is meh: it has terrible artifacts on all pixel-sized details, hallucinations on small details, and limited resolution. ControlNets, finetuning/zero-shot reference transfer, and open tooling would have made a beast of a model of it, but they aren't available.


That's not really true, MJ and DALL-E are just more beginner friendly.


I think it'd be interesting to have a non-profit "model sharing" platform, where people can buy/sell compute. When you run someone's model, they get royalties on the compute you buy.


More specifically, it's so Stability AI can theoretically make a business on selling commercial access to those models through a membership: https://stability.ai/news/introducing-stability-ai-membershi...


The net flow of knowledge about text-to-image generation from OpenAI has definitely been outward. The early open source methods used CLIP, which OpenAI came up with. Dall-e (1) was also the first demonstration that we could do text to image at all. (There were some earlier papers which could give you a red splotch if you said stop sign or something years earlier).


> AI desperately needs a GPL equivalent

Why not just the GPL then?



The GPL was intended for computer code that gets compiled to a binary form. You can share the binary, but you also have to share the code that the binary is compiled from. Pre-trained model weights might be thought of as analogous to compiled code, and the training data may be analogous to program code, but they're not the same thing.

The model weights are shared openly, but the training data used to create these models isn't. This is at least partly because all these models, including OpenAI's, are trained on copyrighted data, so the copyright status of the models themselves is somewhat murky.

In the future we may see models that are 100% trained in the open, but foundational models are currently very expensive to train from scratch. Either prices would need to come down, or enthusiasts will need some way to share radically distributed GPU resources.



Tbh I think these models will largely be trained on synthetic datasets in the future. They are mostly trained on garbage now. We have been doing opt-outs on these; it has been interesting to see the quality differential (or lack thereof), e.g. removing books3 from StableLM 3B Zephyr https://stability.wandb.io/stability-llm/stable-lm/reports/S...


Why aren’t the big models trained on synthetic datasets now? What’s the bottleneck? And how do you avoid amplifying the weaknesses of LLMs when you train on LLM output vs. novel material from the comparatively very intelligent members of the human species. Would be interesting to see your take on this.


We are starting to see that, see phi2 for example

There are approaches to get the right type of augmented and generated data to feed these models; check out the QDAIF paper we worked on, for example:

https://arxiv.org/pdf/2310.13032.pdf



I’ve wondered whether books3 makes a difference, and how much. If you ever train a model with a proper books3 ablation I’d be curious to know how it does. Books are an important data source, but if users find the model useful without them then that’s a good datapoint.


We did try StableLM 3b4 with books3 and it got worse in general and on benchmarks.

Just did some pes2o ablations too which were eh



What I mean is, it’s important to train a model with and without books3. That’s the only way to know whether it was books3 itself causing the issue, or some artifact of the training process.

One thing that’s hard to measure is the knowledge contained in books3. If someone asks about certain books, it won’t be able to give an answer unless the knowledge is there in some form. I’ve often wondered whether scraping the internet is enough rather than training on books directly.

But be careful about relying too much on evals. Ultimately the only benchmark that matters is whether users find the model useful. The clearest test of this would be to train two models side by side, with and without books3, and then ask some people which they prefer.

It’s really tricky to get all of this right. But if there’s more details on the pes2o ablations I’d be curious to see.



What about CC licenses for model weights? It's common for files ("images", "video", "audio", ...) So maybe appropriate.


I've seen Emad (Stability AI founder) commenting here on HN somewhere about this before, what exactly their business model is/will be, and similar thoughts.

HN search doesn't seem to agree with me today though and I cannot find the specific comment/s I have in mind, maybe someone else has any luck? This is their user https://news.ycombinator.com/user?id=emadm



https://x.com/EMostaque/status/1649152422634221593?s=20

We now have top models of every type, sites like www.stableaudio.com, memberships, custom model deals etc so lots of demand

We're the only AI company that can make a model of any type for anyone from scratch & are the most liked / one of the most downloaded on HuggingFace (https://x.com/Jarvis_Data/status/1730394474285572148?s=20, https://x.com/EMostaque/status/1727055672057962634?s=20)

It's going OK, team working hard and shipping good models; the team is accelerating their work on building ComfyUI to bring it all together.

My favourite recent model was CheXagent, I think medical models should be open & will really save lives: https://x.com/Kseniase_/status/1754575702824038717?s=20



Exactly my thought. Stability should be receiving research grants.


We should, we haven't yet...

Instead we've given 10m+ supercomputer hours in grants to all sorts of projects, now we have our grant team in place & there is a huge increase in available funding for folk that can actually build stuff we can tap into.



None of the researchers are associated with stability.ai, but with universities in Germany and Canada. How does this work? Is this exclusive work for stability.ai?


Dom and Pablo both work for Stability AI (Dom finishing his degree).

All the original Stable Diffusion researchers (Robin Rombach, Patrick Esser, Dominik Lorenz, Andreas Blattmann) also work for Stability AI.



Finally a good use to burn VC money!


I see in the commits that the license was changed from MIT to their own custom one: https://github.com/Stability-AI/StableCascade/commit/209a526...

Is it legal to use an older snapshot before the license was changed in accordance with the previous MIT license?



It seems pretty clear the intent was to use a non-commercial license, so it’s probably something that would go to court, if you really wanted to press the issue.

Generally courts are more holistic and look at intent, and understand that clerical errors happen. One exception to this is if a business claims it relied on the previous license and invested a bunch of resources as a result.

I believe the timing of commits is pretty important— it would be hard to claim your business made a substantial investment on a pre-announcement repo that was only MIT’ed for a few hours.



If I clone/fork that repo before the license change, and start putting any amount of time into developing my own fork in good faith, they shouldn't be allowed to claim a clerical error when they lied to me upon delivery about what I was allowed to do with the code.

Licenses are important. If you are going to expose your code to the world, make sure it has the right license. If you publish your code with the wrong license, you shouldn't be allowed to take it back. Not for an organization of this size that is going to see a new repo cloned thousands of times upon release.



No, sadly this won’t fly in court.

For the same reason you cannot publish a private corporate repo with an MIT license and then have other people claim in “good faith” to be using it.

All they need is to assert that the license was published in error, or that the person publishing it did not have the authority to publish it.

You can’t “magically” make a license stick by putting it in a repo, any more than putting a “name here” sticker on someone’s car and then claiming to own it.

The license file in the repo is simply the notice of the license.

It does not indicate a binding legal agreement.

You can, of course, challenge it in court, and IANAL, but I assure you, there is precedent for incorrectly labelled repos removing and changing their licenses.



There’s no case law here, so if you’re volunteering to find out what a judge thinks we’d surely appreciate it!


Yes, you can continue to do what you want with that commit^ in accordance with the MIT licence it was released under. Kind of like if you buy an ebook, and then they publish a second edition but only as a hardback - the first edition ebook is still yours to read.


I think the model architecture (training code etc.) itself is still under MIT, while the weights (which are the result of training on a huge GPU cluster, as well as the dataset they used [not sure if they have publicly talked about it]) are under this new license.


Code is MIT, weights are under the NC license for now.


MIT license is not parasitic like GPL. You can close an MIT licensed codebase, but you cannot retroactively change the license of the old code.

Stability's initial commit had an MIT license, so you can fork that commit and do whatever you want with it. It's MIT licensed.

Now, the tricky part here is that they committed a change to the license that changes it from MIT to proprietary, but they didn't change any code with it. That is definitely invalid, because they cannot license the exact same codebase with two different contradictory licenses. They can only license the changes made to the codebase after the license change. I wouldn't call it "illegal", but it wouldn't stand up in court if they tried to claim that the software is proprietary, because they already distributed it verbatim with an open license.



> they didn't change any code with it. That is definitely invalid, because they cannot license the exact same codebase with two different contradictory licenses.

Why couldn't they? Of course they can. If you are the copyright owner, you can publish/sell your stuff under as many licenses as you like.



we have an optimized playground here: https://www.fal.ai/models/stable-cascade


"sign in to run"

That's a marketing opportunity being missed, especially given how crowded the space is now. The HN crowd is more likely to run it themselves when presented with signing up just to test out a single generation.



Uh, thanks for noticing it! We generally turn it off for popular models so people can see the underlying inference speed and the results but we forgot about it for this one, it should now be auth-less with a stricter rate limit just like other popular models in the gallery.


I just got rate-limited on my first generation. The message is "You have exceeded the request limit per minute". This was after showing me cli output suggesting that my image was being generated.

I guess my zero attempts per minute was too much. You really shouldn't post your product on HN if you aren't prepared for it to work. Reputations are hard to earn, and you're losing people's interest by directing them to a broken product.



Are you using a vpn or at a large campus or office?


It uses github auth, it’s not some complex process. I can see why they would need to require accounts so it’s harder to abuse it.


After all the bellyaching from the HN crowd when PyPI started requiring 2FA, nothing surprises me anymore.


Like every other image generator I've tried, it can't do a piano keyboard [1]. I expect that some different approach is needed to be able to count the black keys groups.

[1] https://fal.ai/models/stable-cascade?share=13d35b76-d32f-45c...



I think it's more than this. In my case, most of the images I made about basketball had more than one ball. I'm not an expert, but some fundamental constraints of human (cultural) life (like all piano keyboards being the same, or there being only one ball in a game) are not grasped by the training, or only partially grasped.


As with human hands, coherency is fixed by scaling the model and the training.


I'm very impressed by the recent AI progress on making models smaller and more efficient. I just have the feeling that every week there's something big in this space (like what we saw previously from ollama, llava, mixtral...). Apparently the space of on-device models is not fully explored yet. Very excited to see future products in that direction.


> I'm very impressed by the recent AI progress on making models smaller and more efficient.

That's an odd comment to place in a thread about an image generation model that is bigger than SDXL. Yes, it works in a smaller latent space, yes it's faster in the hardware configuration they've used, but it's not smaller.



Is there any way this can be used to generate multiple images of the same model? e.g. a car model rotated around (but all images are of the same generated car)


Someone with resources will have to train Zero123 [1] with this backbone.

[1] https://zero123.cs.columbia.edu/





Yes, input image => embedding => N images, and if you're thinking 3D perspectives for rendering, you'd ControlNet the N.

ref.: "The model can also understand image embeddings, which makes it possible to generate variations of a given image (left). There was no prompt given here."



The model looks different in each of those variations though. Which seems to be intentional, but the post you're responding to is asking whether it's possible to keep the model exactly the same in each render, varying only by perspective.


Was anyone able to get this running on Colab? I got as far as loading extras in text-to-inference, but it was complaining about a dependency.


Will this work on AMD? Found no mention of support. Kinda an important feature for such a project, as AMD users running Stable Diffusion will be suffering diminished performance.


The way it's written about in the Image Reconstruction section, like it is just an image compression thing, is kind of interesting. That part and its presented use are very much about storing images and reconstructing them. Yet "it doesn't actually store original images" and "it can't actually give out original images" are points that get used so often in arguments as a defense for image generators. So it is just a multi-image compression file format, just a very efficient one. Sure, it's "redrawing"/"rendering" its output and makes things look kinda fuzzy, but any other compressed image format does that as well. What was all that "well it doesn't do those things" nonsense about then? Clearly it can do that.


> What was all that "well it doesn't do those things" nonsense about then? Clearly it can do that.

There is a model that is trained to compress (very lossily) and decompress the latent, but it's not the main generative model, and of course the model doesn't store images in it. You just give the encoder an image, it encodes it, and then you can decode it with the decoder and get a very similar image. This encoder and decoder are used during training so that stage C can work on a compressed latent instead of directly at the pixel level, because that would be expensive. But the main generative model (stage C) should be able to generate any of the images that were present in the dataset, or it fails to do its job. Stages C, B, and A do not store any images.

The B and A stages work like an advanced image decoder, so unless you have something wrong with image decoders in general, I don't see how this could be a problem (a JPEG decoder doesn't store images either, of course).



Ultimately this is abstraction not compression.


In a way it's just an algorithm that can compress either text or an image. The neat trick is that if you compress the text "brown bear hitting Vladimir Putin" and then decompress it as an image, you get an image of a bear hitting Vladimir Putin.

This principle is the idea behind all Stable Diffusion models, this one "just" achieved a much better compression ratio



Well yeah. But it's not so much about what it actually does as about how they talk about it. Maybe (probably) I missed them putting out something described like that before, but it's the open admission and demonstration of it. I guess they're getting more brazen, given that they're not really getting punished for what they're doing, be it piracy or infringement or whatever.


Why are they benchmarking it with 20+10 steps vs. 50 steps for the other models?


Prior generations usually take fewer steps than vanilla SDXL to reach the same quality.

But yeah, the inference speed improvement is mediocre (until I take a look at exactly what computation is performed, to have a more informed opinion on whether it is an implementation issue or a model issue).

The prompt alignment should be better though. It looks like the model has more parameters to work with for text conditioning.



In my observation, it yields amazing perf at higher batch sizes (4, or better, 8). I assume it is due to memory bandwidth and the constrained latent space helping.


I think this model used a consistency loss during training so that it can yield better results with fewer steps.


It is pretty good; I shared a comparison on Medium:

https://medium.com/@furkangozukara/stable-cascade-prompt-fol...

My Gradio app even works amazingly well on an 8 GB GPU with CPU offloading.



I haven't been following the image generation space since the initial excitement around stable diffusion. Is there an easy to use interface for the new models coming out?

I remember setting up the python env for stable diffusion, but then shortly after there were a host of nice GUIs. Are there some popular GUIs that can be used to try out newer models? Similarly, what's the best GUI for some of the older models? Preferably for macos.



Fooocus is the fastest way to try SDXL/SDXL turbo with good quality.

ComfyUI is cool but very DIY. You don't get good results unless you wrap your head around all the augmentations and defaults.

No idea if it will support cascade.



ComfyUI is similar to Houdini in complexity, but immensely powerful. It's a joy to use.

There are also a large amount of resources available for it on YouTube, GitHub (https://github.com/comfyanonymous/ComfyUI_examples), reddit (https://old.reddit.com/r/comfyui), CivitAI, Comfy Workflows (https://comfyworkflows.com/), and OpenArt Flow (https://openart.ai/workflows/).

I still use AUTO1111 (https://github.com/AUTOMATIC1111/stable-diffusion-webui) and the recently released and heavily modified fork of AUTO1111 called Forge (https://github.com/lllyasviel/stable-diffusion-webui-forge).



Our team at Stability AI builds ComfyUI, so yeah, it is supported.


Auto1111 and Comfy both get updated pretty quickly to support most of the new models coming out. I expect they'll both support this soon.


Check out invoke.com


Thanks for calling us out - I'm one of the maintainers.

Not entirely sure we'll be in the Stable Cascade race quite yet. Since Auto/Comfy aren't really built for businesses, they'll get it incorporated sooner vs later.

Invoke's main focus is building open-source tools for the pros using this for work that are getting disrupted, and non-commercial licenses don't really help the ones that are trying to follow the letter of the license.

Theoretically, since we're just a deployment solution, it might come up with our larger customers who want us to run something they license from Stability, but we've had zero interest on any of the closed-license stuff so far.



fal.ai is nice and fast: https://news.ycombinator.com/item?id=39360800 Both in performance and for how quickly they integrate new models apparently: they already support Stable Cascade.


What are the system requirements to run this, particularly how much VRAM would it take?


I'd say I'm most impressed by the compression. Being able to compress an image 42x is huge for portable devices or bad internet connectivity (or both!).


I have to imagine at this point someone is working toward a fast AI based video codec that comes with a small pretrained model and can operate in a limited memory environment like a tv to offer 8k resolution with low bandwidth.


I am 65% sure this is already extremely similar to LG's upscaling approach in their most recent flagship.


I would be shocked if Netflix was not working on that.


That is 42x spatial compression, but it needs 16 channels instead of 3 for RGB.


Even assuming 32 bit floats (the extra 4 on the end):

4*16*24*24*4 = 147,456

vs (removing the alpha channel as it's unused here)

3*3*1024*1024 = 9,437,184

Or 1/64 raw size, assuming I haven't fucked up the math/understanding somewhere (very possible at the moment).



Furthermore, each of those 16 channels would typically be multibyte floats, as opposed to single-byte RGB channels. (Speaking generally, I haven't read the paper.)
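A rough byte-level comparison under those assumptions (a 16-channel 24x24 latent for a 1024x1024 RGB image; my own back-of-the-envelope, not a figure from the paper):

    latent_fp16 = 16 * 24 * 24 * 2   # 18,432 bytes
    latent_fp32 = 16 * 24 * 24 * 4   # 36,864 bytes
    rgb_uint8 = 3 * 1024 * 1024      # 3,145,728 bytes, uncompressed

    print(rgb_uint8 / latent_fp16)   # ~171x fewer bytes at fp16
    print(rgb_uint8 / latent_fp32)   # ~85x fewer bytes at fp32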


A 42x compression is also impressive, as it matches the answer to the ultimate question of life, the universe, and everything; maybe there is some deep universal truth within this model.


Does anyone have a link to a demo online?




Wow, I like the compression part. A fixed 42x compression, that is really nice. Slow to unpack on the fly, but the future is waiting.


That is a very tiny latent space. Wow!


Where can I run it if I don't have a GPU? Colab didn't work


RunPod, Kaggle, Lambda Labs, or pretty much any other server provider that gives you one or more GPUs.


I remember doing some random experiments with these two researchers to find the best way to condition the stage B on the latent, my very fancy cross-attn with relative 2D positional embeddings didn't work as well as just concatenating the channels of the input with the nearest upsample of the latent, so I just gave up ahah.

This model used to be known as Würstchen v3.
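For the curious, a rough PyTorch sketch of the channel-concatenation conditioning described above; the shapes are made up for illustration and are not the exact Stable Cascade dimensions:

    import torch
    import torch.nn.functional as F

    stage_b_input = torch.randn(1, 4, 256, 256)   # hypothetical stage B noisy input
    stage_c_latent = torch.randn(1, 16, 24, 24)   # hypothetical compressed semantic latent

    # Nearest-neighbour upsample the small latent to stage B's spatial size,
    # then stack it onto the input as extra conditioning channels.
    upsampled = F.interpolate(stage_c_latent, size=stage_b_input.shape[-2:], mode="nearest")
    conditioned = torch.cat([stage_b_input, upsampled], dim=1)  # shape (1, 20, 256, 256)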



Will this get integrated into Stable Diffusion Web UI?


Surely within days. ComfyUI's maintainer said he is readying the node for release, perhaps by this weekend. The Stable Cascade model is otherwise known as Würstchen v3 and has been floating around the open source generative image space since fall.








