(Comments)

Original link: https://news.ycombinator.com/item?id=40996058

The new Mistral NeMo, developed by Mistral AI in collaboration with NVIDIA, is a capable language model with a context window of up to 128K tokens. Compared with models in the same size range, it offers improved reasoning, world knowledge, and coding accuracy. Mistral NeMo uses a standard architecture, which makes it easy to adopt and compatible with systems that use Mistral 7B. Pre-trained base and instruction-tuned checkpoints are available under the Apache 2.0 license for research and enterprise use. The model is designed to run efficiently on a single NVIDIA L40S, GeForce RTX 4090, RTX 4500, or equivalent hardware; it is fast and offers enhanced security and privacy features. It exceeds expectations in performance, licensing, and energy requirements, and needs less compute thanks to its 8-bit quantization awareness. While larger models such as 70B ones may require reduced precision to fit on consumer GPUs, Mistral NeMo delivers impressive results in a package comparable to smaller models while also offering a much larger context window.

Related Articles

Original Article


> Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on standard architecture, Mistral NeMo is easy to use and a drop-in replacement in any system using Mistral 7B.

> We have released pre-trained base and instruction-tuned checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.

So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 refuses to show me files unless I login and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.



You could consider the improvement in model performance a bit of a cheat - they beat other models "in the same size category" that have 30% fewer parameters.

I still welcome this approach. 7B seems like a dead end in terms of reasoning and generalization. They are annoyingly close to statistical parrots, a world away from the moderate reasoning you get in 70B models. Any use case where that's useful can increasingly be filled by even smaller models, so chasing slightly larger models to get a bit more "intelligence" might be the right move.



Aren't small models useful for providing a language-based interface - spoken or in writing - to any app? Tuned specifically for that app or more likely enriched via RAG and possibly also by using function calling?

It doesn't have to be intelligent like we expect it from the top-tier, huge models, just capable of understanding some words in sentences, mostly commands, and how to react to them.



I wonder if a "mixture of models" is going to become more common for real-world use cases (i.e. where latency & dollar budgets are real constraints). Chain together a huge model for reasoning, a small model for function calling/RAG, a medium model for decoding language generation. I'm definitely not dismissing 7B models as irrelevant just yet.



I usually tell the model that I will be testing its reasoning capabilities by describing a scenario and then asking questions about the evolving scenario.

I typically give it a description of a limited environment with objects in it, and say that “we” are in this environment. I then describe actions that I take within the environment and ask questions about the updated world-state that must be inferred from the actions. This tests a lot of “common sense” reasoning skills, which I find to be more important for real world tasks than logic puzzle type reasoning.



Except Llama 3 8B is a significant improvement over Llama 2, which was so weak that a whole community sprang up building fine-tunes better than what the multi-billion-dollar company could do, on a much smaller budget. With Llama 3 8B things have shifted: far fewer community fine-tunes actually beat it. The fact that Mistral AI can still build models that beat it means the company isn't falling too far behind a significantly better-equipped competitor.

What's more irritating is that they decided to do quantization aware training for fp8. int8 quantization results in an imperceptible loss of quality that is difficult to pick up in benchmarks. They should have gone for something more aggressive like 4-bit, where quantization leads to a significant loss in quality.



Easy head math: parameter count times parameter size, plus 20-40% for inference slop space. Anywhere from 8-40 GB of VRAM required depending on the quantization level being used.
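A minimal sketch of that head math (the 20-40% overhead is the parent's rule of thumb, and the byte counts assume plain weight-only quantization, ignoring context):

    def estimate_vram_gb(params_billion, bits_per_param, overhead=0.3):
        """Rough VRAM estimate: weights plus ~20-40% slop for activations/runtime."""
        weight_gb = params_billion * bits_per_param / 8  # 1e9 params -> GB
        return weight_gb * (1 + overhead)

    # Mistral NeMo (12B) at common quantization levels
    for bits in (16, 8, 4):
        print(f"{bits}-bit: ~{estimate_vram_gb(12, bits):.1f} GB")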



They did quantization-aware training for fp8, so you won't get any benefit from using more than 12GB of RAM for the parameters. Where you might use more RAM is the much bigger context window.



Welp, my data point of one shows you need more than 8 GB of VRAM.

When I run mistral-chat with Nemo-Instruct it crashes in 5 seconds with the error: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU"

This is on Ubuntu 22.04.4 with an NVIDIA GeForce RTX 3060 Ti with 8192MiB. I ran "nvidia-smi -lms 10" to see what it maxed out with, and it last recorded max usage of 7966MiB before the crash.



What about for fine-tuning? Are the memory requirements comparable to inference? If not, is there a rule of thumb for the difference? Would it be realistic to do it on a macbook with 96G of unified memory?



> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.

Does anyone have a good answer why everyone went back to SentencePiece in the first place? Byte-pair encoding (which is what tiktoken uses: https://github.com/openai/tiktoken) was shown to be a more efficient encoding as far back as GPT-2 in 2019.



SentencePiece is a tool and library for training and using tokenizers, and supports two algorithms: Byte-Pair Encoding (BPE) and Unigram. You could almost say it is the library for tokenizers, as it has been standard in research for years now.

Tiktoken is a library which only supports BPE. It has also become synonymous with the tokenizer used by GPT-3, ChatGPT and GPT-4, even though this is actually just a specific tokenizer included in tiktoken.

What Mistral is saying here (in marketing speak) is that they trained a new BPE model on data that is more balanced multilingually than their previous BPE model. It so happens that they trained one with SentencePiece and the other with tiktoken, but that really shouldn't make any difference in tokenization quality or compression efficiency. The switch to tiktoken probably had more to do with latency, or something similar.
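If you want to see what "compression efficiency" means in practice, tiktoken's public API is easy to poke at. The sketch below uses the cl100k_base encoding as a stand-in vocabulary (Tekken itself isn't shipped in tiktoken's registry); the sample strings are arbitrary:

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # stand-in BPE vocab, not Tekken

    samples = {
        "english": "The quick brown fox jumps over the lazy dog.",
        "code": "for i in range(10):\n    print(i * i)",
    }
    for name, text in samples.items():
        tokens = enc.encode(text)
        # fewer tokens per character = better compression for that kind of text
        print(name, len(tokens), "tokens for", len(text), "characters")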



SentencePiece is not a different algorithm to WordPiece or BPE, despite its naming.

One of the main pulls of the SentencePiece library was the pre-tokenization being less reliant on white space and therefore more adaptable to non-Western languages.



Nvidia has a blogpost about Mistral Nemo, too. https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/

> Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines.

> *Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.

> The model was trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of NVIDIA AI architecture, including accelerated computing, network fabric and software to increase training efficiency.



These big models are getting pumped out like crazy; that is the business of these companies. Basically, it feels like private industry just figured out how to scale up a scalable process (deep learning), and it required not $M research grants but $BB "research grants"/funding, and the scaling laws seem to be fun to play with, tweaking more interesting things out of these models and finding cool "emergent" behavior as billions of data points get correlated.

But pumping out models and putting artifacts on HuggingFace, is that a business? What are these models being used for? There is a new one at a decent clip.



There are a lot of models coming out, but in my view, most don't really matter or move the needle. There are the frontier models which aren't open (like GPT-4o) and then there are the small "elite" local LLMs like Llama3 8B. The rest seem like they are mostly about manipulating benchmarks. Whenever I try them, they are worse in actual practice than the Llama3 models.



I don’t see any indication this beats Llama3 70B, but still requires a beefy GPU, so I’m not sure the use case. I have an A6000 which I use for a lot of things, Mixtral was my go-to until Llama3, then I switched over.

If you could run this on say, stock CPU that would increase the use cases dramatically, but if you still need a 4090 I’m either missing something or this is useless.



You don't need a 4090 at all. 16-bit requires about 24GB of VRAM; 8-bit quants (99% the same performance) require only 12GB of VRAM.

That's without the context window, so depending on how much context you want to use you'll need some more GB.

That is, assuming you'll be using llama.cpp (which is the standard for consumer inference; Ollama is also llama.cpp under the hood, as is Kobold).

This thing will run fine on a 16GB card, and a q6 quantization will run fine on a 12GB card.

You'll still get good performance on an 8GB card with offloading, since you'll be running most of it on the gpu anyway.



Comparing this to 70b doesn't make sense: this is a 12b model, which should easily fit on consumer GPUs. A 70b will have to be quantized to near-braindead to fit on a consumer GPU; 4bit is about as small as you can go without serious degradation, and 70b quantized to 4bit is still ~35GB before accounting for context space. Even a 4090 can't run a 70b.

Supposedly Mistral NeMo is better than Llama-3-8b, which is the more apt comparison, although benchmarks usually don't tell the full story; we'll see how it does on the LMSYS Chatbot Arena leaderboards. The other (huge) advantage of Mistral NeMo over Llama-3-8b is the massive context window: 128k (and supposedly 1MM with RoPE scaling, according to their HF repo), vs 8k.

Also, this was trained with 8bit quantization awareness, so it should handle quantization better than the Llama 3 series in general, which will help more people be able to run it locally. You don't need a 4090.



> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.

From Mistral's page about Tekken:

> Our newest tokenizer, tekken, uses the Byte-Pair Encoding (BPE) with Tiktoken.

Does that mean that Mistral found that BPE is more efficient than unigram models?

Because otherwise, I don't understand why AI companies keep using BPE for their token sets. Unigram methods lead to more legible tokens, fewer glitch tokens, fewer super-long outlier tokens, etc.



I believe that if Mistral is serious about advancing in open source, they should consider sharing the corpus used for training their models, at least the base models pretraining data.



I doubt they could. Their corpus almost certainly is mostly composed of copyrighted material they don't have a license for. It's an open question whether that's an issue for using it for model training, but it's obvious they wouldn't be allowed to distribute it as a corpus. That'd just be regular copyright infringement.

Maybe they could share a list of the content of their corpus. But that wouldn't be too helpful and makes it much easier for all affected parties to sue them for using their content in model training.



no, not the actual content, just the titles of the content. like "book title" by "author". the tool just simply can't be taken seriously by anyone until they release that information. this is the case for all these models. it's ridiculous, almost insulting.



I’m AI stupid. Does anyone know if training on multiple languages provides “cross-over” — so training done in German can be utilized when answering a prompt in English? I once went through various Wikipedia articles in a couple languages and the differences were interesting. For some reason I thought they’d be almost verbatim (forgetting that’s not how Wikipedia works!) and while I can’t remember exactly I felt they were sometimes starkly different in tone and content.



Generally yes, with caveats.

There was some research showing that training a model on facts like "the mother of John Smith is Alice" but in German allowed it to answer questions like "who's the mother of John Smith", but not questions like "what's the name of Alice's child", regardless of language. Not sure if this holds at larger model sizes though, it's the sort of problem that's usually fixable by throwing more parameters at it.

Language models definitely do generalize to some extent and they're not "stochastic parrots" as previously thought, but there are some weird ways in which we expect them to generalize but they don't.



> Language models definitely do generalize to some extent and they're not "stochastic parrots" as previously thought, but there are some weird ways in which we expect them to generalize but they don't.

Do you have any good sources that explain this? I was always thinking LLMs are indeed stochastic parrots, but language (that is, the unified corpus of all languages in the training data) already inherently contains the "generalization". So the intelligence is encoded in the language humans speak.



> Do you have any good sources that explain this?

The most famous result is OthelloGPT, where they trained a transformer to complete lists of Othello moves, and the transformer generated an internal model of where the pieces were after each move.

The rough consensus is that if you train a model to predict the output of a system for long enough with weight decay and some nebulous conditions are met (see "lottery ticket hypothesis"), eventually your model develops an internal simulation of how the system works because that simulation uses fewer weights than "memorize millions of patterns found in the system", and weight decay "incentivizes" lower-weight solutions.



I don't have explanations but I can point you to one of the papers: https://arxiv.org/pdf/2309.12288 which calls it "the reversal curse" and does a bunch of experiments showing models that are successful at questions like "Who is Tom Cruise’s mother?" (Mary Lee Pfeiffer) will not be equally successful at answering "Who is Mary Lee Pfeiffer’s son?"


Isn't that specific case just a matter of not having enough data _explicitly_ stating the reverse? Seems as if they are indeed stochastic parrots from that perspective.



Anecdata, but I did some continued pretraining on a toy LLM using a machine-translated version of the original dataset.

Performance improved across all benchmarks, in English (the original language).



Am I understanding correctly? You took an English dataset, trained an LLM, machine translated the English dataset to e.g. Spanish, continued training the model, and performance for queries in English improved? That's really interesting.



no, it is basically an 'auto-correct' spell checker from the phone. It only knows what it was trained on. But it has been shown that a coding LLM that has never seen a programming language or a library can "learn" a new one faster than, say, a generic LLM.



That's not true, LLMs can answer questions in one language even if they were only trained on that data in another language.

I.e., if you train an LLM on both English and French in general but only teach it a specific fact in French, it can give you that fact in English.



I have to say, the experience of trying to sign up for Nvidia Enterprise so you can try the "NIM" packaged version of this model is just icky and awful now that I've gotten used to actually free and open models and software. It feels much nicer and more free to be able to clone llama.cpp and wget a .gguf model file from huggingface without any registration at all. Especially since it has now been several hours since I signed up for the Nvidia account and it still says on the website "Your License Should be Active Momentarily | We're setting up your credentials to download NIMs."

I really don't get Nvidia's thinking with this. They basically have a hardware monopoly. I shelled out the $4,000 or so to buy two of their 4090 GPUs. Why are they still insisting on torturing me with jumping through these awful hoops? They should just be glad that they're winning and embrace freedom.



Pardon me if this is a dumb question, but is it possible for me to download these models onto my computer (I have a 1080ti and a [2|3]070ti) and generate some sort of API interface? That way I can write programs that call this API, and I find this appealing.

EDIT: This is a 1W light bulb moment for me, thank you!



I think it's mostly the memory bandwidth though that makes the GPUs so fast with LLMs. My card does about 1TB/s. CPU RAM won't come near that. I'm sure a lot of optimisations can be had but I think GPUs will still be significantly ahead.

Macs are so good at it because Apple solders the memory on top of the SoC for a really wide and low-latency connection.



This is a good and valid comment. It is difficult to predict the future, but I would be curious what the best-case theoretical performance of an LLM would be on a typical x86 or ARM system with DDR4 or DDR5 RAM. My uneducated guess is that it can be very good, perhaps 50% the speed of a specialized GPU/RAM device. In practical terms, the CPU approach is required for very large contexts, up to as large as the lifetime of all interactions you have with your LLM.
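For a back-of-envelope answer (a sketch only: single-stream decoding is roughly memory-bandwidth bound because every weight is streamed once per generated token; the bandwidth figures below are ballpark assumptions):

    def decode_ceiling_tok_s(model_gb, bandwidth_gb_s):
        """Upper bound on single-stream decode speed: each token reads all weights once."""
        return bandwidth_gb_s / model_gb

    model_gb = 12  # 12B parameters at 8-bit
    for name, bw in [("RTX 4090, ~1000 GB/s", 1000),
                     ("dual-channel DDR5, ~80 GB/s", 80),
                     ("8-channel DDR5 server, ~300 GB/s", 300)]:
        print(f"{name}: ~{decode_ceiling_tok_s(model_gb, bw):.0f} tok/s ceiling")

On this crude measure, a typical desktop lands closer to 10% of a 4090's ceiling and a many-channel server closer to a third, though prompt processing and batching change the picture.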



DIMM slots won't work for GPU VRAM due to the higher speeds, tighter signalling, and dense packing of memory on wide buses. Take a look at the speeds DDR5 is running at in a typical Xeon server, and compare to GDDR6. This is the problem LPCAMM2 was developed to solve for modern x86 CPUs in laptops and desktops. Seeing it applied to GPUs would be great.



We're working on it, except that there is a change to the tokenizer which we're still working through in our conversion scripts. Unfortunately we don't get a heads up from Mistral when they drop a model, so sometimes it takes a little bit of time to sort out the differences.

Also, I'm not sure if we'll call it mistral-nemo or nemo yet. :-D



Adding to this: If the default is too slow look at the more heavily quantized versions of the model, they are smaller at moderate cost in output quality. Ollama can split models between GPU and host memory but the throughput dropoff tends to be pretty severe.



Ollama depends on llama.cpp as its backend, so if there are any changes that need to be made to support anything new in this model architecture or tokenizer, then it will need to be added there first.

Then the model needs to be properly quantized and formatted for GGUF (the model format that llama.cpp uses), tested, and uploaded to the model registry.

So there's some length to the pipeline that things need to go through, but overall the devs in both projects generally have things running pretty smoothly, and I'm regularly impressed at how quickly both projects get updated to support such things.



You will need enough VRAM, 1080ti is not going to work very well, maybe get a 3090 with 24GB VRAM.

I think it should also run well on a 36GB MacBook Pro or probably a 24GB Macbook Air



Yes.

If you're on a Mac, check out LM Studio.

It's a UI that lets you load and interact with models locally. You can also wrap your model in an OpenAI-compatible API and interact with it programmatically.
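If the local runner exposes an OpenAI-compatible server (LM Studio and Ollama both do), calling it from a program is only a few lines. A sketch with a placeholder URL and model id, assuming the server is already running locally:

    import requests  # pip install requests

    # LM Studio's local server typically listens on http://localhost:1234/v1;
    # Ollama's OpenAI-compatible endpoint is http://localhost:11434/v1.
    # Adjust the base URL and model id to whatever your server reports.
    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "mistral-nemo-instruct",  # placeholder model id
            "messages": [{"role": "user", "content": "Summarize RAG in one sentence."}],
            "temperature": 0.7,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])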



I still don’t understand the business model of releasing open source gen AI models. If this took 3072 H100s to train, why are they releasing it for free? I understand they charge people when renting from their platform, but why permit people to run it themselves?



> but why permit people to run it themselves?

I wouldn't worry about that if I were them: it's been shown again and again that people will pay for convenience.

What I'd worry about is Amazon/Cloudflare repackaging my model and outcompeting my platform.



They could just create a custom license based on Apache 2.0 that allows sharing but constrains some specific behavior. It won't be formally Open Source, but it will have enough open-source spirit that academics or normal people will be happy to use it.



I wonder why Mistral et al don't prepare GGUF versions of these for launch day?

If I were them I'd want to be the default source of the versions of my models that people use, rather than farming that out to whichever third party races to publish the GGUF (and other formats) first.



llama.cpp is still under development and they sometimes come out with breaking changes or new quantization methods, and it can be a lot of work to keep up with these changes as you publish more models over time. It's easier to just publish a standard float32 safetensors that works with PyTorch, and let the community deal with other runtimes and file formats.

If it's a new architecture, then there's also additional work needed to add support in llama.cpp, which means more dev time, more testing, and potentially the loss of a surprise model release if the development work has to be done out in the open.



Some of the major vendors _do_ create the GGUFs for their models, but often they have the wrong parameter settings, need changes in the inference code, or don't include the correct prompt template. We (i.e. Ollama) have our own conversion scripts and we try to work with the model vendors to get everything working ahead of time, but unfortunately Mistral doesn't usually give us a heads up before they release.



Interesting that the benchmarks they show have it outperforming Gemma 2 9B and Llama 3 8B, but it does a lot worse on my NYT Connections benchmark (5.1 vs 16.3 and 12.3). The new GPT-4o mini also does better at 14.3. It's just one benchmark though, so looking forward to additional scores.



Can you help me understand why people seem to think of Connections as a more robust indicator of (general) performance than benchmarks typically used for eval?

It seems to me that while the game is very challenging for people it’s not necessarily an indicator of generalization. I can see how it’s useful - but I have trouble seeing how a low score on it would indicate low performance on most tasks.

Thanks and hopefully this isn’t perceived as offensive. Just trying to learn more about it.

edit: I realize you yourself indicate that it's "just one benchmark" - I am more asking about the broader usage I have seen here on HN comments from several people.



In practice, it's fine to stick with "just" 8k or 16k or 32k. If you're working with data of over 128k tokens I'd personally not recommend using an open model anyway unless you know what you're doing. The models are kinda there, but the hardware mostly isn't.

This is only realistic right now for people with those unified-memory MacBooks or for enthusiasts with Epyc servers or a very high-end workstation built for inference.

Anything above that I don't consider "consumer" inference.



Keep in mind that Gemma is a larger model but it only has 8k context. The Mistral 12B will need less VRAM to store the weights, but you'll need a much larger KV cache if you intend to use the full 128k context, especially if the KV is unquantized. Not sure if this new model has GQA, but those without it absolutely eat memory when you increase the context size (looking at you, Command R).
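To put rough numbers on the KV cache: its size is 2 (K and V) × layers × KV heads × head dim × context length × bytes per element, which is why GQA (fewer KV heads than attention heads) matters so much at 128k. A sketch, treating the Mistral NeMo shape (40 layers, 8 KV heads, head dim 128) as an assumption from its reported config:

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
        """K and V caches: 2 * layers * kv_heads * head_dim elements per token."""
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

    # Assumed NeMo-like shape, fp16 cache
    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens: ~{kv_cache_gb(40, 8, 128, ctx):.1f} GB")
    # Without GQA (32 KV heads instead of 8) the 128k figure would be ~4x larger.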



The last time I tried a Mistral model, it didn't answer most of my questions, because of "policy" reasons. I hope they fixed that. OpenAI at least only tells me that it's a policy issue but still answers most of the time.



Is "Parameter Creep" going to becomes a thing? They hold up Llama-8b as a competitor despite NeMo having 50% more parameters.

The same thing happened with gemma-27b, where they compared it to all the 7-9b models.

It seems like an easy way to boost benchmarks while coming off as "small" at first glance.



They specifically call out fp8-aware training, and TensorRT-LLM is really good (efficient) with fp8 inference on H100 and other Hopper cards. It's possible that they run the 7B natively in fp16, as smaller models suffer more from even "modest" quantization like this.



For the benchmarks, it depends on how you interpret them. The other models are quite popular, so many people have a starting point. Now, if you regularly use them you can assess: "just 3% better on some benchmark, 80% to 83%, at the cost of nearly double the inference time and base RAM requirement, but a 16x context window, and for commercial usage..." and at the end, "for my use case, is it worth it?"



Yeah it will be interesting to see if we ever settle on standard sizes here. My preference would be:

- 3B for CPU inference or running on edge devices.

- 20-30B for maximizing single consumer GPU potential.

- 70B+ for those who can afford it.

7-9B never felt like an ideal size.



I find it interesting how coding/software development still appears to be the one category that the most popular model makers release specialised models for. Where are the finance or legal models from Mistral or Meta or OpenAI?

Perhaps it's just confirmation bias, but programming really does seem to be the ideal use case for LLMs in a way that other professions just haven't been able to crack. Compared to other types of work, it's relatively more straightforward to tell if code is "correct" or not.



I work in the field. The reason has not been mentioned yet.

It's because (for an unknown reason), having coding and software development in the training mix is really helpful at most other tasks. It improves everything to do with logical thinking by a large margin, and that seems to help with many other downstream tasks.

Even if you don't need the programming, you want it in the training mix to get that logical thinking, which is hard to get from other resources.

I don't know how much that is true for legal or financial resources.



It's just easier to iterate and improve on a coding specialist AI when that is also the skill required to iterate on said AI.

Products that build on general LLM tech are already being used in other fields. For example, my lawyer friend has started using one by LexisNexis[0] and is duly impressed by how it works. It's only a matter of time before models like that get increasingly specialized for that kind of work, it's just harder for lawyers to drive that kind of change alone. Plus, there's a lot more resistance in 'legacy' professions to any kind of change, much less one that is perceived to threaten the livelihoods of established professionals.

Current LLMs are already not bad at a lot of things, but lawyer bots, accountant bots and more are likely coming.

[0] https://www.lexisnexis.com/en-us/products/lexis-plus-ai.page



Those are regulated industries, whereas software development is not.

An AI spitting back bad code won't compile. An AI spitting back bad financial/legal advice bankrupts people.



Generally I agree! I saw a guy shamefully admit he didn't read the output carefully enough when using generated code (that ran), but there was a min() instead of a max(), and it messed up a month of his metrics!



Coding models solve a clear problem and have a clear integration into a developer's workflow - it's like your own personal StackOverflow and it can autocomplete code for you. It's not as clear when it comes to finance or legal, you wouldn't want to rely on an AI that may hallucinate financial numbers or laws. These other professions are also a lot slower to react to change, compared to software development where people are already used to learning new frameworks every year



Generating code has significant economic benefit. The code, once generated, can be executed many times without requiring heavy computing resources, unlike an AI model.



> Where's the finance or legal models from Mistral or Meta or OpenAI?

Programming is "weird" in that it requires both specialized knowledge and specialized languages, and the languages are very different from any language that humans speak.

Legal requires specialized knowledge, but legal writing is still just English and it follows English grammar rules, although it's sometimes a very strange "dialect" of English.

Finance is weird in its own way, as that requires a lot more boring, highly-precise calculations, and LLMs are notoriously bad at those. I suspect that finance is always going to be some hybrid of an LLM driving an "old school" computer to do the hard math, via a programming language or some other, yet-unenvisioned protocol.

> programming really does seem to be the ideal usecase for LLMs in a way that other professions just haven't been able to crack.

This is true, mostly because of programmers' love of textual languages, textual protocols, CLI interfaces and generally all things text. If we were all coding in Scratch, this would be a lot harder.



Yes, it appears to be the clear successful usecase for the technology, in a way that hasn't been replicated for other professions.

I remain very sceptical that a chat-like interface is the ideal form for LLMs, yet it seems very optimal for programming specifically, along with Copilot-like interfaces of just outputting text.



Finance already has their own models and has had them for decades. Market predictions and high frequency trading is literally what all the hedge funds and the like have been doing for a few decades now. Including advanced sources of information like (take with a grain of salt, I've heard it on the internet) using satellite images to measure factory activity and thus predict results.

Understandably they're all quite secretive about their tooling because they don't want the competition to have access to the same competitive advantages, and an open source model / third party developing a model doesn't really make sense.



The explanation is easier, I think. Consider what data these models are trained on, and who are the immediate developers of these models.

The models are trained on a vast set of whatever is available on the internet. They are developed by tech people/programmers who are surprisingly blind to their own biases and interests. There's no surprise that one of the main things they want to try and do is programming, using vast open quantities of Stack Overflow, GitHub and various programming forums.

For finance and legal you need to:

- think a bit outside the box

- be interested in finance and legal

- be prepared to carry actual legal liability for the output of your models



> - be prepared to carry actual legal liability for the output of your models

Section 230.

It's been argued that a response by a LLM, to user input, is "user-generated content" and hence the platform has generally no liability (except CSAM).

Nobody has successfully sued.



Then again, we just had this on the front page: https://news.ycombinator.com/item?id=40957990

> We first document a significant decline in stock trading volume during ChatGPT outages and find that the effect is stronger for firms with corporate news released immediately before or during the outages. We further document similar declines in the short-run price impact, return variance, and bid-ask spreads, consistent with a reduction in informed trading during the outage periods. Lastly, we use trading volume changes during outages to construct a firm-level measure of the intensity of GAI-assisted trading and provide early evidence of a positive effect of GAI-assisted trading on long-run stock price informativeness.

They're being used, but nobody is really saying anything because the stock market is a zero sum game these days and letting anyone else know that this holds water is a recipe for competition. Programming is about the opposite, the more you give, the more you get, so it makes sense to popularize it as a feature.



I just checked Hugging Face and the model files download is about 25GB, but in a comment below someone mentioned it is an fp8-quantized model. Trying to understand how the quantization affects the model (and RAM) size. Can someone please enlighten me?



Sure. The talk about 8bit refers to quantization-aware training. Pretty common in image models these days to reduce the impact of quantization on accuracy.

Typically this might mean that you simulate an 8bit forward pass to ensure that the model is robust to quantization ‘noise’. You still use FP16/32 for backward pass & weight updates for numerical stability.

It’s just a way to optimize the model in anticipation of future quantization. The experience of using an 8-bit Nemo quant should more closely mirror that of using the full-fat bf16 model compared to if they hadn’t used QAT.
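Roughly, the "simulate quantization in the forward pass" part looks like the sketch below: a generic fake-quant with a straight-through estimator, shown int8-style for simplicity rather than Mistral's actual FP8 recipe:

    import torch

    def fake_quant(w, bits=8):
        """Quantize-dequantize in the forward pass; gradients pass through unchanged (STE)."""
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
        return w + (w_q - w).detach()  # forward sees w_q, backward sees identity

    # QAT computes with fake-quantized weights but updates the full-precision copy.
    w = torch.randn(1024, 1024, requires_grad=True)
    x = torch.randn(1, 1024)
    y = x @ fake_quant(w).T
    y.sum().backward()  # gradient lands on the full-precision w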



Does anyone know whether the 128K is input tokens only? There are a lot of models that have a large context window for input but a small output context. If this actually has 128k tokens shared between input and output, that would be a game changer.



Congrats. Very exciting to see continued innovation around smaller models, that can perform much better than larger models. This enables faster inference and makes them more ubiquitous.



Worth noting this model has 50% more parameters than Llama 3 8B. There are performance gains, but some of the gains might be from using more compute rather than more performance per unit of compute.



Well it's not on Le Chat, it's not on LMSys, it has a new tokenizer that breaks llama.cpp compatibility, and I'm sure as hell not gonna run it with Crapformers at 0.1x speed which as of right now seems to be the only way to actually test it out.



Interested in the new base model for fine-tuning. Despite Llama 3 being a better instruct model overall, it's been highly resistant to fine-tuning, either owing to some bugs or to being trained on so much data (there's an ongoing debate about this in the community). Mistral's base models are still best in class for a small model you can specialize.



1) Rule of thumb is # of params = GB at Q8. So a 12B model generally takes up 12GB of VRAM at 8 bit precision.

But 4bit precision is still pretty good, so 6GB VRAM is viable, not counting additional space for context. Usually about an extra 20% is needed, but 128K is a pretty huge context so more will be needed if you need the whole space.



The model has 12 billion parameters and uses FP8, so 1 byte each. With some working memory I'd bet you can run it on 24GB.

> Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU



What's the reason for measuring the model size in context window length and not GB?

Also, are these small models OSS? Easier self-hosting seems to be the main benefit of small models.



I suspect you might be confusing the numbers: 12B (which is the very first number they give) is not context length, it's parameter count.

The reason to use parameter count is because final size in GB depends on quantization. A 12B model at 8 bit parameter width would be 12Gbytes (plus some % overhead), while at 16 bit would be 24Gbytes.

Context length here is 128k, which is orthogonal to model size. You can notice they specify both parameter count and context size because you need both to characterize an LLM.

It's also interesting to know what parameter width it was trained on because you cannot get more information by "quantizing wider" -- it only makes sense to quantize into a narrower parameter width to save space.



>What's the reason for measuring the model size in context window length and not GB?

These are 2 different things.

The context window is how many tokens its context can contain. On a big-context model you could put a few books and articles in the context and then start your questions; on a small-context model you can start a conversation and after a short time it will start forgetting the first prompts. A big context will use more memory and cost some performance, but imagine you could give it your entire code project and then ask it questions; often I know there is already some function that does something, but I can't remember the name.



The reason that companies align models is so that they don't get on the front page of The New York Times with a headline like "Techaro's AI model used by terrorists to build a pipe bomb that destroyed the New York Stock Exchange datacentre".
