(comments)

Original link: https://news.ycombinator.com/item?id=39974374

This thread discusses OLMo (Open Language Model), a large language model developed by AI2, and compares it with other similar models. The author is surprised that AI2 did not include Mistral 7B in its comparisons, and notes that not all of OLMo's training data is public. They also raise concerns about the copyright implications of unlicensed code found in the training repositories. In addition, the author notes that OLMo appears to be one of the first successful large language model training runs on AMD GPUs, though they question how smooth that transition was. The text also touches on restricted access to media articles on the topic, arguing that a subscription offers little value given the thin content. Finally, the author mentions running into repeating tokens during inference and calls for continued research into the capabilities and ethical considerations of large language models.

Related Articles

Original Text


If I read the license correctly, it seems that if you want to use the LLM, you need to tell the authors what you are doing with it.

Am I reading this correctly? https://allenai.org/licenses/impact-mr

“Derivative Impact Reports. AI2 seeks to encourage transparency around Derivatives through the use of Derivative Impact Reports, available here. Before releasing a Model Derivative or Data Derivative, You will share with AI2 the intended use(s) of Your Derivative by completing a Derivative Impact Report or otherwise providing AI2 with substantially similar information in writing. You agree that AI2 may publish, post, or make available such information about Your Derivative for review by the general public.

You will use good faith efforts to be transparent about the intended use(s) of Your Derivatives by making the information freely available to others who may access or use Your Derivatives. You acknowledge that Derivative Impact Reports are not intended to penalize any good faith disclosures about Derivatives. Accordingly, if You initiate or participate in any lawsuit or other legal action against a Third Party based on information in such Third Party’s Derivative Impact Report, then this MR Agreement will terminate immediately as of the date such lawsuit or legal action is filed or commenced.”



> if You initiate or participate in any lawsuit or other legal action ... this MR Agreement will terminate immediately

Is this legal? Restricting legal options by making an agreement dependent on it?



Weird. So even if these things are well intentioned, seems like they don't have any teeth.

Are there any out there that have licenses which are (dare I say) simpler, like the GPL?



Great to see e2e openness. One of the only true OSS models out there, vs most of the models releasing the binaries (weights). Surprised that they didn’t mention Mistral 7b in the comparisons.


Notably “The Pile” doesn’t seem to be part of the training data. So this might be more sound legally than many other “open” LLMs


It is absolutely packed with unlicensed, copyrighted data.

Books3 is the most notable example - nearly 200,000 pirated ebooks - but a lot of the rest of it is (unlicensed) scraped web data.

The legal questions over whether this is a problem are currently still unresolved. Many people are also bothered by the ethical implications, which is a separate issue from the legal questions.



Ethics are a lot more nuanced and change a lot faster than laws.

Heck, a large fraction of ethics seems to be so fickle that it's subject to potential revision by every generation.

In fact, I’d argue that those revisions are a significant portion of how one generation distinguishes itself from their parents.

Yet strangely every generation feels like they have arrived at a set of “universal laws” in their ethics.



I took a quick peek at this last time it was mentioned and it had dozens of my own repos of unlicensed source code in it. All of that was published on GitHub and made public, but much of it has no license specified.


Is this one of the first LLMs of note that was successfully trained on AMD GPUs? I wonder how seamless the process was and if they faced any issues there.
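
For what it's worth, the swap can be fairly seamless at the framework level because PyTorch's ROCm builds reuse the familiar torch.cuda API. A minimal, hedged sketch of what that looks like (a general observation about ROCm builds of PyTorch, not a description of AI2's actual training setup):

    import torch

    # On a ROCm build of PyTorch the CUDA API surface is reused, which is
    # part of why AMD training can be close to a drop-in swap. This is a
    # generic check, not AI2's setup.
    print(torch.version.hip)          # set on ROCm builds, None otherwise
    print(torch.cuda.is_available())  # also returns True on supported AMD GPUs
    device = torch.device("cuda")     # "cuda" addresses the AMD GPU under ROCm
    x = torch.randn(2, 2, device=device)
    print(x @ x)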


They often require you to log in to see the whole article. Later they cap your access at N articles per some period of time. The only way around that is to purchase a subscription. Given Medium's weak offering, it's seldom worth the $/month cost of a subscription for the few jewels that might appear.


> 1. No biases. Following LLaMA, PaLM, and others, we exclude all bias terms from our architecture in order to improve training stability.

What does this mean? What is a "bias term"?
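
For anyone else wondering: a "bias term" is the constant vector b that most layers add after their matrix multiply (y = Wx + b), and "no biases" means every linear layer is built without it. A minimal PyTorch sketch of the difference (the hidden size below is an arbitrary placeholder, not OLMo's actual configuration):

    import torch.nn as nn

    d_model = 4096  # placeholder hidden size, not OLMo's real config

    with_bias = nn.Linear(d_model, d_model)                 # y = Wx + b (the default)
    without_bias = nn.Linear(d_model, d_model, bias=False)  # y = Wx, no bias term

    print(with_bias.bias.shape)   # torch.Size([4096])
    print(without_bias.bias)      # None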



What does the risk classification applied to the dataset actually mean? The licensing page [1] AI2 provides for their datasets is really nice, but it doesn't really explain [2] what "risk" means in this context.

Does it mean "risk that the items contained in this set are licensed in a manner incompatible with its use in a training dataset"?

[1] https://allenai.org/impact-license

[2] "the AI2 ImpACT Licenses are artifact-agnostic and are instead structured according to the risk level we’ve assigned a given artifact"



It's odd. I'm running inference on this (and other models in its class) and I keep running into a "repeating token" situation with moderate-to-long context windows.

It feels almost as if, during inference, the model hits some form of local minimum that it careens around, and while temperature seems to affect this, it doesn't really fix it.

at temp 0.2:

> [{'generated_text': 'What follows is a transcript of a talk between a mysterious man and an agent of a bureau dedicated to investigating things which is typically referred to by some assortment of letters in the alphabet. The identity, origins, and motivations of the man were not known then and remain so. This transcript is not meant to scare, but provided simply to enlighten the concerned citizen of all the various and sundry things that may or may not go bump in the night. AGENT: Please state your name for the record. MYSTERIOUS STRANGER: I am the man. AGENT: Thank you. I am an agent of the Bureau of Investigation. I am here to investigate the following: 1. The following: 2. The following: 3. The following: 4. The following: 5. The following: 6. The following: 7. The following: 8. The following: 9. The following: 10. The following: 11. The following: 12. The following: 13. The following: 14. The following: 15. The following: 16. The following: 17. The following: 18. The following: 19. The following: 20. The following: 21. The following: 22. The following: 23. The following: 24. The following'}]

...and at temp 0.4:

> [{'generated_text': 'What follows is a transcript of a talk between a mysterious man and an agent of a bureau dedicated to investigating things which is typically referred to by some assortment of letters in the alphabet. The identity, origins, and motivations of the man were not known then and remain so. This transcript is not meant to scare, but provided simply to enlighten the concerned citizen of all the various and sundry things that may or may not go bump in the night. AGENT: Please state your name for the record. MYSTERIOUS STRANGER: My name is not important. AGENT: My name is Agent Cyanide. MYSTERIOUS STRANGER: Agent Cyanide. AGENT: I am an agent of the Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations'}]
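
Not a fix for whatever the model itself is doing, but a hedged sketch of the standard decoding knobs I'd reach for against this kind of loop with the Hugging Face pipeline API. The model id and prompt are placeholders, and repetition_penalty / no_repeat_ngram_size are generic generate() arguments, nothing OLMo-specific:

    from transformers import pipeline

    # Placeholder model id; check the actual OLMo model card on Hugging Face.
    generator = pipeline("text-generation", model="allenai/OLMo-7B",
                         trust_remote_code=True)

    prompt = "What follows is a transcript of a talk between a mysterious man ..."

    out = generator(
        prompt,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.4,
        repetition_penalty=1.2,   # down-weights tokens already in the context
        no_repeat_ngram_size=3,   # hard-blocks any 3-gram from repeating verbatim
    )
    print(out[0]["generated_text"])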



From what I heard through the grapevine, OLMo is not nearly the best model for its size or compute budget. Apparently something didn’t quite go right and AI2 didn’t have the money to train until they got it right.


... this can get a little goofy even with do_sample=False and no temp:

| [{'generated_text': "DAUGHTER: tell me a story FATHER: but it's late DAUGHTER: please? FATHER: okay, once upon a time there was a little girl who lived in a little house with her mother and father and her brother and sister and her dog and her cat and her hamster and her fish and her bird and her rabbit and her horse and her cow and her sheep and her goat and her pig and her chicken and her duck and her turkey and her goose and her llama and her alpaca and her camel and her zebra and her giraffe and her elephant and her hippopotamus and her rhinoceros and her kangaroo and her koala and her panda and her bear and her wolf and her fox and her cat and her dog and her bird and her fish and her hamster and her cat and her dog and her bird and her fish and her hamster and her cat and her dog and her bird and her fish and her hamster and her cat and her dog and her bird and her fish and her hamster and"}]



There actually was a podcast around that concept when (I think) GPT2 was current.

Basically one generated story per day. Absurd in places.



I commented this somewhere else, but word in the ether is that OLMo is not actually that good of a model given its size and compute budget. I am not entirely sure why, and it’s still good to have the full recipe for at least one model out in the open, but the current OLMo definitely is a cautionary tale for people training their own model.


This is the only LLM that is exciting to me. Clearly, LLMs are powerful tools that may end up replacing search, and may go much further than simple searches by performing the research for you and producing final answers. Closed models like those from OpenAI (ironically) or Anthropic cannot be audited. When most users end up blindly hitting Microsoft’s Copilot button, which Microsoft is forcing OEMs to adopt, who’s to say how the information a user gets is being curated or manipulated by OpenAI or Microsoft or whoever?

We’ve already seen real world examples of severe bias injected into LLMs. For example, Google’s Gemini had secret meta prompts that biased it towards certain types of answers and also caused it to produce hallucinated images that were funny but also dystopian (https://arstechnica.com/information-technology/2024/02/googl...). I don’t think we can just let closed AI systems take over society when they can easily be manipulated by the model owners without transparency.

What I like about AI2’s approach with OLMo is that they are actually open, not just trading on the marketing benefits of the word “open”. Most “open” models are just open weights not open source. That’s like sharing an executable and not the source code. In my view, being open means that others have to be able to reproduce the final product (the model) if they wanted to and had the means (in terms of training hardware). It also means that they should be able to use whatever is provided freely for any purpose, rather than being subject to proprietary licensing. AI2 shares the training source code, training data, evaluation suite, and the model weights that they’ve produced by running the training process. It all uses the Apache license. And it’s also interesting that they used AMD hardware to train this LLM rather than Nvidia/CUDA.

Open-weight models like Llama keep catching up to the best closed models from OpenAI, Anthropic, and others. My hope is that truly open models like OLMo keep developing quickly enough to keep up as well. Lastly, I hope that regulation does not block open-source, private development of AI systems. These systems will be the vehicle for speech for much of society in the future, so blocking private AI systems is a lot like restricting speech. But leaving that aside, open development will also drive innovation, and reducing competitive pressure would hurt it.



Pet peeve: Google's Gemini LLM model was not to blame for the image generation weirdness.

That would be like blaming DALL-E weirdness on GPT-4.

Unfortunately, Google marketing decided to slap the "Gemini" brand on both the end-user interface used to interact with the model AND the actual model itself, hence people constantly calling out Gemini-the-model for weird decisions made as part of Gemini-the-user-interface.



> That would be like blaming DALL-E weirdness on GPT-4.

Actually when you trigger DALL-E through GPT-4 (i.e. with the LLM generating the prompt to give the diffusion model then returning the resulting image to the user) the LLM's system instructions [1] say "7. Diversify depictions of ALL images with people to always include always DESCENT and GENDER for EACH person using direct terms." and a bunch of stuff along those lines.

In OpenAI's system this doesn't always trigger; if the user asks for an image of trash being collected, the user hasn't explicitly asked for any people to be depicted, so the LLM doesn't find anything in the prompt that needs diversity added. The trash-being-collected prompt gets passed to DALL-E unmodified, and the resulting image has all male workers.

[1] https://raw.githubusercontent.com/spdustin/ChatGPT-AutoExper...



> Google's Gemini LLM model was not to blame for the image generation weirdness. That would be like blaming DALL-E weirdness on GPT-4.

The way I read the Gemini technical report, it seemed like, unlike GPT-4 vs DALL-E, Gemini was pretrained with multimodal outputs. Is that not the case?



Is that right? I didn't think Gemini was generating images directly, I assumed it was using a separate image generation tool.

The paper here https://arxiv.org/pdf/2403.05530.pdf has a model card for Gemini 1.5 Pro that says:

    Output(s): Generated text in response to the input
    (e.g., an answer to the question, a summary of
    multiple documents, comparing documents/videos).


Huh, that is true in the model cards of both Gemini 1.5 Pro and Gemini 1.0.

That feels like it runs counter to this statement from the Gemini 1.0 technical report[0]:

> Gemini models are trained to accommodate textual input interleaved with a wide variety of audio and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce text and image outputs

[0]: https://arxiv.org/pdf/2312.11805.pdf



Do we even know if these licenses are binding? AFAIK we have no ruling on whether model weights are even eligible for copyright. They're machine-produced derivatives of other work, so it's not a guarantee that copyright protects them.


That’s a great point and I hope more people speak up to treat models as just numerical derivative works so they aren’t automatically granted these protections. It’s better if society meaningfully debates this and chooses the right approach.


> Open weight models like Llama keep repeatedly catching up to the best closed models from OpenAI or Anthropic or others.

Since when? I’ve had the complete opposite experience.



> For example, Google’s Gemini had secret meta prompts that biased it towards certain types of answers and also caused it to produce hallucinated images that were funny but also dystopian (https://arstechnica.com/information-technology/2024/02/googl...).

Such a bizarre take to call this "dystopian".

The model happened to create some out-there pictures. I mean, it's no more outlandish than giant dragons and snakes and such being generated, yet the thought of a person of color appearing somewhere historically inaccurate sparks this massive outcry about revisionism? Who cares?

Besides, the article identifies the probable goal, which was to eliminate well-known biases in existing models (i.e. when generating "angry person" you mainly got black people). Clearly this one wasn't tuned well for that goal, but the objective is not only noble but absolutely should be required for anyone producing LLMs.



If I may explain: the dystopian part to me is the lack of transparency around training code, training data sources, tuning, meta prompting, and so forth. In Google’s case, they’re a large corporation that controls how much of society accesses information. If they’re secretly curating what that information is, rather than presenting it as neutrally as they can, it does feel dystopian to me. I’d like transparency as a consumer of information, so I know to the extent possible, what the sources of information were or how I am being manipulated by choices the humans building these systems made.

I appreciate the issue you’re drawing attention to in the example you shared about images of an angry person. I think I agree that focused tuning for situations like that might be noble and I would be okay with a model correcting for that specific example you shared. But I also struggle with how to clearly draw that line where such tuning may go too far, which is why I favor less manual biasing. But I disagree that such tuning should be required, if you meant required by the law. Like with speech or art in general, I think anyone should be able to produce software systems that generate controversial or offensive speech or art. Individual consumers can choose what they want to interact with, and reject LLMs that don’t meet their personal standards.



The hype around LLMs won't last past 2030, I suppose. With LLMs we have a statistical-inference soup that goes stale like stagnant pond water, becoming less accurate with each passing day.

I am curious how long the hype wave lasts. One I saw recently was K8s; it settled down and won, TBH.



The transformer architecture probably won't last and we might start calling them something else, but I can't see something that could reasonably be called an LLM going away any time soon.