(comments)

Original link: https://news.ycombinator.com/item?id=41130620

One user raises concerns about people comparing AI to Nazi Germany in AI discussions, arguing that the practice is repetitive and unserious. They argue that AI is just a tool, like Microsoft Paint or oil on canvas, and that any tool can be misused. Another user shares their experience with Ideogram, an online platform that generates artistic images from user prompts, saying it produces high-quality images and follows instructions closely; however, they want a similar program they can run locally without filters. They describe trying Ideogram, praising the image quality and prompt adherence while noting some difficulty rendering complex text clearly, and share an example image depicting four creative categories: a struggling writer, a copy-and-paste artist, a retriever, and a remixer. The thread also discusses the evolution of Stable Diffusion (SD) models, comparing SD1 and SD3, describing SD3's improvements in consistency and scene composition, and predicting further progress toward the level of detail needed for engineering diagrams. Finally, one commenter argues that general-purpose models will improve over time without the need for specialist work.

Related Articles

Original Article


The playground is a drag. After accepting being forced to sign up, attaching my GitHub, and handing over my email address, I entered the desired prompt and waited with anticipation... only to see a black screen and how much it's going to cost per megapixel.

Bummer. After seeing what was generated in the blog post I was excited to try it! Now feeling disappointed.

I was hoping it'd be more like https://play.go.dev.

Good luck.



I was inspired by the SD3 problems to use this prompt:

"a woman lying on her back wearing a blouse and shorts."

But it wouldn't render the image - I instead got a NSFW warning. That's one way to hide the fact that it cannot render it properly, I guess...

PS: after a few tries it rendered "a woman lying on her back" correctly.



Ah. I tried the following:

> A Gary Larson, "Far Side" comic of a raccoon disguising itself by wearing a fedora and long trench coat. The raccoon's face is mostly hidden by the fedora. There are extra paws sticking out of the front of the trench coat from between the buttons, suggesting that the raccoon is in fact a stack of several raccoons.

Every human I've ever described this to has no problem picturing what I mean. It's a classic comic trope. AIs still struggle.



A rough rule of thumb is that if a text-generator AI model of some size would struggle to understand your sentence, then an image-generator model a couple of times the size or even bigger would also struggle.

The intelligence just doesn't "fit" in there.

Personally I'm curious to see what would happen if someone burnt $100M of compute time on training a truly enormous image generator model, something the same-ish size as GPT4...



>Every human I've ever described this to has no problem picturing what I mean. It's a classic comic trope. AIs still struggle.

But AIs learn and therefore create in exactly the same way as humans, ostensibly on the same data. How can this be possible? /s



Sure. Normally I try a few variants, but "lamb with seven horns" was what I tried when I made that post.

For what it's worth, I've previously asked in the Stable Diffusion Discord server for help generating a "lamb with seven horns and seven eyes" but the members there were also unsuccessful.



I never said it was useless, just that it fails at this specific problem. One of my complaints with many of these image generation tools is that there's not much communication as to what should be expected from them, nor do they explain the areas where they're expected to succeed or fail.

Recently Claude began to allow generation of SVG drawings, and asking it to draw a unicorn and later add extra tails or horns worked correctly.

A fork exists in physical space and it's pretty intuitive to understand what it can do. These models exist within digital space and are incredibly opaque by comparison.



"Recently Claude began to allow generation of SVG drawings, and asking it to draw a unicorn and later add extra tails or horns worked correctly."

That sounds interesting! Were the results somewhat clean and clear SVG or rather a mess that just looked decent?



You also might want to "clarify" that it is not open source (and neither are any of the other "open source" models). If you want to call it something, try "open weights", although the usage restrictions make even that a HUGE FUCKING STRETCH.

Also, everybody should remember that these models are not copyrightable and you should never agree to any license for them...



When I read "open source" i thought they actually are doing open source instead of "open weights" this time. Surely they would expect to be called out on hackernews if they label it incorrectly...

Thanks for pointing that out @Hizonener



It's certainly not true that models are not copyrightable; databases have copyright protection if creativity was involved in creating them.

That said, I don't think outputs of the model are derivative works of it, any more than the model is a derivative of its training data, so it's not clear to me they can actually enforce what you do with them.



> It's certainly not true that models are not copyrightable; databases have copyright protection if creativity was involved in creating them.

Are you talking about https://en.wikipedia.org/wiki/Database_right or plain old copyright?

I'm no IP lawyer, but I've always thought that copyright put "requirements" on the artefact (i.e. the threshold of originality), not the process.

In my jurisdiction we have database rights, meaning that you get IP protections for the artefact based on the work put into the process. For example, a database of distances between address pairs or something is probably not copyrightable, but can be protected under database rights if enough work was done to compile the data.

EDIT: Saw another place in the thread discussing the https://en.wikipedia.org/wiki/Sweat_of_the_brow doctrine, which relates to database rights. (Notably, neither is applicable in the U.S.)



A personal bugbear is the AI field's fascination with calling these releases open source; virtue signalling, I guess. Open weights is exactly right. Source code and, arguably more important, the datasets are both required to replicate the work, which is more in the spirit of open source (and science). I think Meta is especially egregious here, given their history.

Never underestimate the value of getting hordes of unpaid workers to refine your product. (See also React, others)



> it is not open source

It would be nice here if you gave some examples of what you call an open source model. Please ;) Because the impression is that these things do not exist; it's just a dream which does not deserve such a nice term...



As far as I know, none have been released. And it doesn't even really make sense, because, as I said, the models aren't copyrightable to begin with and therefore aren't licensable either.

However, plenty of open source software exists. The fact that open source models don't exist doesn't excuse attempts to falsely claim the prestige of the phrase "open source".



> models aren't copyrightable to begin with

You are wrong about that. It's a file with numbers. Which makes it a database or dataset and very much protected by copyright. That's why licenses are needed. For the phone book, things like open street maps, and indeed AI models.

> The fact that open source models don't exist

The fact that many people (myself included) routinely download and use models distributed under OSI approved licenses (Apache V2, MIT, etc.) makes that statement verifiably wrong. And yes, I do check the license of stuff that I use as I work with companies that care about such matters.

> As far as I know ...

Now you know better.



> You are wrong about that. It's a file with numbers. Which makes it a database or dataset and very much protected by copyright. That's why licenses are needed. For the phone book, things like open street maps, and indeed AI models.

This is only true in jurisdictions that follow the sweat of the brow doctrine, where effort alone without creativity is considered enough for copyright. In other places, such as the USA, collections of facts are not copyrightable and a minimal amount of creativity is required for something to qualify as copyrightable. The phone book is an example that is often used, actually, to demonstrate the difference.

https://en.wikipedia.org/wiki/Sweat_of_the_brow



I agree, but that can't happen with the vast majority of these models because they're trained on unlicensed data so they can't slap an open source license on the training data and distribute it.

I've decided to draw my personal line at Open Source Initiative compliance for the license they release the model itself under.

I respect the opinion that it's not truly open source unless they release the training data as well, but I've decided not to make that part of my own personal litmus test here.

My reasoning is that knowing something is "open source" helps me decide what I legally can or cannot do with it when building my own software. Not having access to the training data doesn't affect my legal rights, it just affects my ability to recompile it myself. And I don't have millions of dollars of GPUs, so that isn't so important to me, personally.



> We are excited to introduce Flux

I'd suggest re-wording the blog post intro, it reads as if it was created by Fal.

Specific phrases to change:

> Announcing Flux

(from the title)

> We are excited to introduce Flux

> Flux comes in three powerful variations:

This section also comes across as if you created it

> We invite you to try Flux for yourself.

Reads as if you're the creator



I would give them a break; so many things exist in the tech sector that being completely original is basically impossible, unless you name your thing something nonsensical.

Also, search engines are context aware: if your search history is full of Julia questions, it will know what you're searching for.



There was a looong distracting thread a month ago about something similar, niche language, might have been Julia, had a package with the same name as $NEW_THING.

I hope this one doesn't stir as much discussion. It has 4,000 stars; there isn't a large mass of people who view the world through the lens of "Flux is the ML library". No one will end up in a "Who's on First?" discussion because of it. If this line of argument is held sacrosanct, it ends up in an infinite loop until everyone gives up and starts using UUIDs.



I think we've generally run out of names to give projects and need to start reusing names. Maybe use letters to disambiguate them.

Flux A is the ML library

Flux B is the T2I model

Flux C is the React library

Flux D is the physics concept of power per unit area

Flux E is the goo you put on solder



It would be nice to understand limits of the free tier. I couldn't find that anywhere. I see pricing, but I'm generating images without swiping my credit card.

If it's unlimited or "throttled for abuse," say that. Right now, I don't know if I can try it six times or experiment to my heart's desire.



Congrats Burkay - the model is very impressive. One area I'd like to see improved in a Flux v2 is knowledge of artist styles. Flux cannot respond to requests asking for paintings in the style of David Hockney, Norman Rockwell, or Edgar Degas; it seems to have no fine art training at all.

I’d bet that fine art training would further improve the compositional skills of the model, plus it would open up a range of uses that are (to me at least) a bit more interesting than just illustrations.



>Flux cannot respond to requests asking for paintings in the style of David Hockney, Norman Rockwell

Does it respond to any names? I noticed SD3 removed all names to prevent recreating famous people but as a side effect lost the very powerful ability to infer styles from artist names too.



It's "just" another diffusion model, although a very good one. Those people are probably in there even if its text encoder doesn't know about them. So you can find them with textual inversion.



Thanks for hosting the model! I created an account to try it out; you started emailing me with "important notice: low account balance - action required" and now it seems like there's no way for me to unsubscribe or delete my account. Is that the case? Thanks!



It's not really fair to conclude that the training data contains Vanity Fair images just because the prompt includes "by Vanity Fair".

I could write "with text that says Shutterstock" in the prompt, but that doesn't necessarily mean the dataset contains that.



The logo has the exact same copyrighted typography as the real Vanity Fair logo. I've also reproduced the same copyrighted typography with other brands, with composition identical to copyrighted images. Just asking it for a "Vanity Fair cover story about Shrek" at a 3:2 ratio very consistently gives a composition identical to a Vanity Fair cover (the subject is in front of the logo typography, partially obscuring it).

The image linked has a traditional www watermark in the lower-left as well. Even something as innocuous as a "Super Mario 64" prompt shows a copyright watermark: https://x.com/minimaxir/status/1819093418246631855



Must we always jump to Nazis?

This is like the fifth time I've seen someone paraphrasing Niemöller in an AI context, and it's exhausting. It's also near impossible to take the paraphraser seriously.

More to the point, AI is a tool. I could just as well infringe on Vanity Fair IP using MS Paint. Someone more artistic than me could make an oil-on-canvas copy of their logo too.

Or, to turn your own annoying "argument" against you:

First they came for AI models, and I did not speak out, because I wasn't using them. Then they came for Photoshop, and I did not speak out, because I had never learned to use it. Then they came for oil and canvas, and now there are no art forms left for me.



Just to be clear: you're comparing the collapse of the creative restrictions which the state has cleverly branded "intellectual property" to... the holocaust?

Of all of the instances on HN of Godwin's law playing out that I've ever seen, this one is the new cake-taker.



You don't need an A100, you can get a used 32GB V100 for $2K-$3K. It's probably the absolute best bang-for-buck inference GPU at the moment. Not for speed but just the fact that there are models you can actually fit on it that you can't fit on a gaming card, and as long as you can fit the model, it is still lightyears better than CPU inference.



>FLUX.1 [dev]: The base model

>FLUX.1 [schnell]: A distilled version of the base model that operates up to 10 times faster

It should also be noted that "schnell" is the German word for "fast".



What's the difference between pro and dev? Is the pro one also 12B parameters? Are the example images on the site (the Patagonia guy, the Lego, and the beach potato) generated with dev or pro?



I think the openly released ones are mainly -dev and -schnell. Both models are 12B. -pro is the most powerful and raw; -dev is a guidance-distilled version of it, and -schnell is a step-distilled version (where you can get pretty good results with 2-8 steps).
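(For concreteness, here is roughly how the two open checkpoints are driven with the diffusers library - a sketch assuming a diffusers version with Flux support; the step counts and guidance values are the commonly suggested defaults rather than anything official from this announcement.)

    import torch
    from diffusers import FluxPipeline

    # schnell: step-distilled, so very few steps and no guidance.
    schnell = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
    )
    schnell.enable_model_cpu_offload()  # helps fit the 12B model on smaller GPUs
    fast = schnell("a lighthouse at dusk", num_inference_steps=4,
                   guidance_scale=0.0).images[0]

    # dev: guidance-distilled, so it still uses a normal step count and a
    # guidance value, just without a separate unconditional pass.
    dev = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    )
    dev.enable_model_cpu_offload()
    quality = dev("a lighthouse at dusk", num_inference_steps=28,
                  guidance_scale=3.5).images[0]

    fast.save("schnell.png")
    quality.save("dev.png")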



I think they may have turned on the gating some time after this was submitted to HackerNews. Earlier this morning I definitely ran the model several times without signing in at all (not via GitHub, not via anything). But now it says "Sign in to run".



Tested it using prompts from Ideogram (login-walled), which has great prompt adherence. Flux generated very, very good images. I have been playing with Ideogram, but I don't want their filters and want a similarly powerful system running locally.

If this runs locally, this is very very close to that in terms of both image quality and prompt adherence.

It did fail at writing text clearly when the text was a bit complicated. This Ideogram image's prompt, for example: https://ideogram.ai/g/GUw6Vo-tQ8eRWp9x2HONdA/0

> A captivating and artistic illustration of four distinct creative quarters, each representing a unique aspect of creativity. In the top left, a writer with a quill and inkpot is depicted, showcasing their struggle with the text "THE STRUGGLE IS NOT REAL 1: WRITER". The scene is comically portrayed, highlighting the writer's creative challenges. In the top right, a figure labeled "THE STRUGGLE IS NOT REAL 2: COPY ||PASTER" is accompanied by a humorous comic drawing that satirically demonstrates their approach. In the bottom left, "THE STRUGGLE IS NOT REAL 3: THE RETRIER" features a character retrieving items, complete with an entertaining comic illustration. Lastly, in the bottom right, a remixer, identified as "THE STRUGGLE IS NOT REAL 4: THE REMI

Otherwise, the quality is great. I stopped using Stable Diffusion a long time ago; the tools and tech around it became very messy, and it's not fun anymore. I've been using Ideogram for fun, but I want something like Ideogram that I can run locally without any filters. This is looking perfect so far.

This is not Ideogram, but it's very, very good.



Ideogram handles text really well but I don’t want to be on some weird social network.

If this thing can mint memes with captions in it on a single node I guess that’s the weekend gone.

Thanks for the useful review.



Whenever I see a new model I always check whether it can do engineering diagrams (e.g. "two square boxes at a distance of 3.5mm"); still no dice on this one. https://x.com/seveibar/status/1819081632575611279

Would love to see an AI company attack engineering diagrams head on, my current hunch is that they just aren't in the training dataset (I'm very tempted to make a synthetic dataset/benchmark)
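(If someone does build that benchmark, the synthetic side seems straightforward; here is just my guess at a minimal version with matplotlib, pairing each rendered diagram with the caption a model would be asked to reproduce.)

    import random
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches

    def make_sample(idx: int) -> str:
        """Render two squares separated by a random gap; return the caption."""
        gap_mm = round(random.uniform(1.0, 10.0), 1)
        side_mm = 10.0
        fig, ax = plt.subplots(figsize=(4, 2))
        ax.add_patch(patches.Rectangle((0, 0), side_mm, side_mm, fill=False))
        ax.add_patch(patches.Rectangle((side_mm + gap_mm, 0), side_mm, side_mm,
                                       fill=False))
        ax.set_xlim(-2, 2 * side_mm + gap_mm + 2)
        ax.set_ylim(-2, side_mm + 2)
        ax.set_aspect("equal")
        ax.axis("off")
        fig.savefig(f"diagram_{idx:04d}.png", dpi=150)
        plt.close(fig)
        return f"two square boxes at a distance of {gap_mm}mm"

    # image/caption pairs a text-to-image model would be scored against
    captions = [make_sample(i) for i in range(100)]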



It'll probably come suddenly. It has been fascinating to me watching the journey from Stable Diffusion 1 to 3. SD1 was a very crude model, where putting a word in the prompt might or might not add representations of the word to the image. Eg, using the word "hat" somewhere in the prompt might do literally nothing or suddenly there were hats everywhere. The context of the word didn't mean much to SD1.

SD2 was more consistent about the word appearing in the image. "hat" would add hats more reliably. Context started to matter a little bit.

SD3 seems to be getting a lot better at the idea of scene composition, so now specific entities can be prompted to wear hats. Not perfect, but noticeably improved from SD2.

Extrapolating from that, we're still a few generations from being able to describe things with the precision of an engineering diagram - but we're heading in the right direction at a rapid clip. I doubt there needs to be any specialist work yet, just time and the improvement of general purpose models.



Can't you get this done via an LLM and have it generate code for Mermaid or D2 or something? I've been fiddling around with that a bit in order to create flowcharts and data models, and I'm pretty sure I've seen at least one of those languages handle absolute positioning of objects.
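(Something like the sketch below is what I had in mind - the model name, prompt wording, and the choice of D2 are just assumptions on my part; the LLM emits diagram source instead of pixels, and you render it separately.)

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    spec = "two square boxes, 10mm per side, placed 3.5mm apart, with the gap dimensioned"

    # Ask a chat model for diagram source code rather than an image.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You output only D2 diagram source, no prose."},
            {"role": "user", "content": f"Draw this as a D2 diagram: {spec}"},
        ],
    )

    # Save the generated source and render it with the d2 CLI,
    # e.g. `d2 diagram.d2 diagram.svg`.
    with open("diagram.d2", "w") as f:
        f.write(resp.choices[0].message.content)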



I have likewise been utterly unable to get it to generate images that look like preliminary rapid pencil sketches. Suggestions by experienced prompters welcome!



>> Would love to see an AI company attack engineering diagrams head on, my current hunch is that they just aren't in the training dataset (I'm very tempted to make a synthetic dataset/benchmark)

That seems like a good use for a speech-driven assistant that knows how to use PC desktop software. Just talk to a CAD program and say what you want. This seems like a long way off but could be very useful.



It appears the model does have some "sanity" restrictions from whatever its training process is that limits some of the super weird outputs.

"A horse sitting on a dog" doesn't work but "A dog sitting on a horse" works perfectly.



The quality is difficult to judge consistently, as there are variations between seeds with the same prompt. And then there's the problem of cherry-picked examples making the news. So I'm building a community gallery to generate Pro images for free; hope this at least increases the sample size: https://fluxpro.art/
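(For anyone who would rather script against the hosted endpoint than click through a gallery or the playground, the shape of it is probably something like this; the fal-client package, the "fal-ai/flux-pro" app id, and the response fields are my assumptions about the hosting setup, so check the provider's docs.)

    import fal_client  # pip install fal-client; expects FAL_KEY in the environment

    # Submit a prompt to the hosted pro endpoint and wait for the result.
    result = fal_client.subscribe(
        "fal-ai/flux-pro",
        arguments={"prompt": "a lighthouse at dusk, oil painting"},
    )

    # The endpoint returns URLs of the generated images (field names assumed).
    for image in result["images"]:
        print(image["url"])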


You're not. I'm surprised at their selections, because neither the cooking one nor the beach one adheres to the prompt very well, and the first one only does because its prompt largely avoids much detail altogether. Overall, the announcement gives the sense that it can make pretty pictures but not very precise ones.



Well, that's nothing new, but it doesn't matter to dedicated users because they don't control it just by typing in text prompts. They use ComfyUI, which is a node editor.



The vast majority of comparisons aren't really putting these new models through their paces.

The best prompt adherence on the market right now BY FAR is DALL-E 3 but it still falls down on more complicated concepts and obviously is hugely censored - though weirdly significantly less censored if you hit their API directly.

I quickly mocked up a few weird/complex prompts and did some side-by-side comparisons with Flux and DALL-E 3. Flux is impressive and performs notably well, particularly since both the dev/schnell models have been confirmed by Black Forest to be runnable via ComfyUI.

https://mordenstar.com/blog/flux-comparisons



I did put them through pro/dev as well just to be safe. The quality changes and you can play with guidance (cranking it all the way to 10) but it doesn't make a significant difference for these prompts from what I could tell.

Several iterations and these were the best I got out of schnell, dev and pro respectively for the following prompt:

"a fantasy creature with the body of a dragon and a beachball for a head, hybrid, best quality, shadows and lighting, fantasy illustration muted"

https://gondolaprime.pw/pictures/schnell-dev-pro.jpg



> Nearby, anthropomorphic fruits play beach volleyball.

This is missing from the image. The generated image looks good, but while reading the prompt I was surprised it was missing.



Bit of an annoying signup... GitHub only... and GitHub account creation is currently broken ("Something went wrong"). Took two tries and two browsers...



I had the same "something went wrong" experience, but on retrying the "sign in to run" button, it was fine and had logged me in.

Gave me a credit of 2USD to play with.



Mmmh, trying my recent test prompts, still pretty shit. For example, whereas Midjourney or SD have no problem creating a pencil sketch, with this model (pro) it always looks more like a black-and-white photograph, digital illustration, or render. Like all the others, it is also apparently not able to follow instructions on the position of characters (i.e. X and Y are turned away from each other).



> FLUX.1 [dev]: The base model, open-sourced with a non-commercial license

...then it's not open source. At least the others are Apache 2.0 (real open source) and correctly labeled proprietary, respectively.



Hey, great work over at fal.ai to run this on your infrastructure and for building in a free $2 in credits to try before buying. For those thinking of running this at home, I'll save you the trouble. Black Forest Flux did not run easily on my Apple Silicon MacBook at this time. (Please let me know if you have gotten this to run for you on similar hardware.) Specifically, it falls back to using CPU which is very slow. Changing device to 'mps' causes error "BFloat16 is not supported on MPS"
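(If anyone else wants to poke at it on a Mac, the obvious thing to try is float16 instead of bfloat16 - a sketch under the assumption that your diffusers build already has Flux support; I haven't verified how much quality suffers in fp16.)

    import torch
    from diffusers import FluxPipeline

    # bf16 weights are rejected on MPS ("BFloat16 is not supported on MPS"),
    # so cast to fp16 and move the pipeline to the MPS device instead.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.float16
    )
    pipe.to("mps")

    image = pipe("a watercolor fox in a forest",
                 num_inference_steps=4, guidance_scale=0.0).images[0]
    image.save("fox.png")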



This gives you no info on how the model works. What is being applied is fal's post-inference "is this NSFW?" filter model.

So your censorship investigation (via boobs) is testing a completely different, unrelated, model.



It does provide information. Regardless of whether they use a post-inference filter, we now know that the model itself was trained on and can produce NSFW content. Compare this to SD3 which produces a noise pattern if you request naked bodies.

(Also you can download the model itself to check the local behaviour without extra filters. Unfortunately I don't have time to do it right now, but I'd love to know)



Right, that (the black bars) gives no info on how the model works. Thus, you'd love to "know more". ;)

The rest is groping for a reason to make "the model is censored [a classifier made the POST return a black image instead of boobs]" into something sensical.



So I'm forced to signup and give my email for a supposed trial, only to be immediately told by email that I have a "Low Account Balance - Action Required"? Seriously?



I wonder if the key behind the quality of the MidJourney models, and this model, is less about size + architecture and more about the quality of the images they were trained on.

It looks like this is the case for LLMs: the quality of the training data has a significant impact on the output quality of the model, which makes sense.

So the real magic is in designing a system to curate that high quality data.



Midjourney unquestionably has heavy data set curation and uses RLHF from users.

You don't have to speculate on this as you can see that custom models for SDXL for instance perform vastly better than vanilla SDXL at the same number of parameters. It's all data set and tagging.



That is technically true, but when the base model is wasting parameter information on poorly tagged, watermarked stock art and other garbage images, it's not really a meaningful distinction. Better data makes for better models, nobody cares about how well a model outputs trash.



Ok, but you're severely misrepresenting the importance of things. Base SDXL is a fine model. Base SDXL is going to be much better than a materially smaller model that you've retrained with "good data".



It's the quality of the image-text pair, not the image alone. But Midjourney is not a model; it's a suite of models that work in conjunction. They have an LLM in front to optimize the user prompts, they use SAM models, ControlNet models for poses that are in high demand, and so much more. That's why you can't really compare foundation models anymore, because there are none.



No, it's definitely the size. Tiny LLMs are shit. Stable Diffusion 3's problem is not that its training set was wildly different, it's that it's just too small (because the one released so far is not the full size).

You can get better results with better data, for sure. And better architecture, for sure. But raw size is really important; the difference in quality between model sizes, all else held equal, is HUGE and obvious if you play with them.



I would agree. Midjourney is getting free labour, since many of their generations are not in secret mode (which requires a pro/mega subscription), so prompts and outputs are visible to everyone. Midjourney rewards users for rating those generations. I wouldn't be surprised if there are some bots on their Discord scraping that data to train their own models.



Is the architecture outlined anywhere? Any publications, or word on whether they will publish something in the future? To be fair to them, they seem to have launched this company today, so I doubt they have a lot of time right now. Or maybe I just missed it?



I don't have anything to compare it to as I'm not that familiar with other diffusion models in the first place. I was kind of hoping to read the key changes they made to the diffusion architecture and how they collected and curated their dataset. I'd assume they're also using LAION, but I wonder if they are doing anything to filter out low-quality images (separate from what LAION's aesthetic scoring already does). Or maybe they have their own dataset.



This is great and unbelievably fast! I noticed a small note saying how much this would cost and how many images you can create for $1.

I assume you’re offering this as an API? Would be nice to have pricing page as I didn’t see one on your website.



These venture funded startups keep releasing models for free without a business model in sight. I am all for open source but worry it is not sustainable long term.



At this point the only thing an AI startup has to do to get people to spend money on the model is to:

-not censor it

-not be doing prompt injection

It's very easy, which is why no other firm is capable of it.



I'm really impressed at its ability to output pixel art sprites. Maybe the best general-purpose model I've seen capable of that. In many cases it's better than purpose-built models.



Great product. BTW, I am new to this technology; can you please tell me what parameter is given to the model to make it look like a real-life image?



Wow.

I have seen a lot of promises made by diffusion models.

This is in a whole different world. I legitimately feel bad for the people still at StabilityAI.

The playground testing is really something else!

The licensing model isn’t bad, although I would like to see them promise to open up their old closed source models under Apache when they release new API versions.

The prompt adherence and the breadth of topics it seems to know without a finetune and without any LORAs, is really amazing.



How long until nsfw fine tunes? Don’t pretend like it’s not on all of y’all’s minds, since over half of all the models on Civit.ai are NSFW. That’s what folks in the real world actually do with these models.



Works great as is right now, I can see some workflows being affected or having to wait for an update, but even those can do with some temporary workarounds (like having to load another model for later inpainting steps).

So if you're wanting to experiment and have a 24GB card, have at it!



Yeah, I mean like ControlNet / IP-Adapter / AnimateDiff / inpainting stuff.

I don’t feel like base models are super useful. Most real use cases depend on being able to iterate on consistent outputs imo.

I have had a very bad experience trying to use other models to modify images but I mostly do anime shit and maybe styles are less consistently embedded into language for those models



