(comments)

Original link: https://news.ycombinator.com/item?id=39383446

However, to avoid losing important historical details, researchers have developed methods to compactly store historical information within the model structure, for example using weight matrices or compression algorithms. These methods allow models to retain key information over long spans while still running efficiently. In addition, researchers use caching mechanisms and progressive training strategies to reduce computational cost, enabling longer retention. Ultimately, these techniques let models operate effectively in resource-constrained environments and surface meaningful insights from large amounts of historical data.


Original text
Our next-generation model: Gemini 1.5 (blog.google)
1090 points by todsacerdoti 19 hours ago | 516 comments










The white paper is worth a read. The things that stand out to me are:

1. They don't talk about how they get to 10M token context

2. They don't talk about how they get to 10M token context

3. The 10M context ability wipes out most RAG stack complexity immediately. (I imagine creating caching abilities is going to be important for a lot of long token chatting features now, though). This is going to make things much, much simpler for a lot of use cases.

4. They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.

5. It seems like 1.5 Ultra is going to be highly capable. 1.5 Pro is already very very capable. They are running up against very high scores on many tests, and took a minute to call out some tests where they scored badly as mostly returning false negatives.

Upshot, 1.5 Pro looks like it should set the bar for a bunch of workflow tasks, if we can ever get our hands on it. I've found 1.0 Ultra to be very capable, if a bit slow. Open models downstream should see a significant uptick in quality using it, which is great.

Time to dust out my coding test again, I think, which is: "here is a tarball of a repository. Write a new module that does X".

I really want to know how they're getting to 10M context, though. There are some intriguing clues in their results that this isn't just a single ultra-long vector; for instance, their audio and video "needle" tests, which just include inserting an image that says "the magic word is: xxx", or an audio clip that says the same thing, have perfect recall across up to 10M tokens. The text insertion occasionally fails. I'd speculate that this means there is some sort of compression going on; a full video frame with text on it is going to use a lot more tokens than the text needle.



"The 10M context ability wipes out most RAG stack complexity immediately."

I'm skeptical; my past experience is that just because the context has room for whatever you want to stuff in it, the more you stuff into the context, the less accurate your results are. There seems to be a balance between providing enough that you'll get high-quality answers, but not so much that the model is overwhelmed.

I think a large part of developing better models is not just better architectures that support larger and larger context sizes, but also models that can properly leverage that context. That's the test for me.



They explicitly address this in page 11 of the report. Basically perfect recall for up to 1M tokens; way better than GPT-4.


I don't think recall really addresses it sufficiently: the main issue I see is answers getting "muddy". Like it's getting pulled in too many directions and averaging.


I'd urge caution in extending generalizations about "muddiness" to a new context architecture. Let's use the thing first.


I'm not saying it applies to the new architecture, I'm saying that's a big issue I've observed in existing models and that so far we have no info on whether it's solved in the new one (i.e. accurate recall doesn't imply much in that regard).


Ah, apologies for the misunderstanding. What tests would you suggest to evaluate "muddiness"?

What comes to my mind: run the usual gamut of tests, but with the excess context window saturated with irrelevant(?) data. Measure test answer accuracy/verbosity as a function of context saturation percentage. If there's little correlation between these two variables (e.g. 9% saturation is just as accurate/succinct as 99% saturation), then "muddiness" isn't an issue.
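A minimal sketch of that saturation sweep, assuming hypothetical ask_model() and score_answer() helpers and a crude one-word-per-token approximation:

    # Pad the prompt with irrelevant filler until the window is X% full,
    # then ask the same question and record accuracy/verbosity per level.
    def saturation_sweep(question, relevant_context, window_tokens,
                         ask_model, score_answer, filler_words):
        results = []
        for saturation in (0.09, 0.25, 0.50, 0.75, 0.99):
            n_filler = max(0, int(window_tokens * saturation)
                           - len(relevant_context.split()))
            padding = " ".join(filler_words[:n_filler])  # ~1 word per token
            prompt = f"{padding}\n\n{relevant_context}\n\nQuestion: {question}"
            answer = ask_model(prompt)
            results.append({"saturation": saturation,
                            "accuracy": score_answer(answer),
                            "verbosity": len(answer.split())})
        return results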



Manual testing on complex documents. A big legal contract for example. An issue can be referred to in 7 different places in a 100 page document. Does it give a coherent answer?

A handful of examples show whether it can do it. For example, GPT-4 turbo is downright awful at something like that.



You need to use relevant data. The question isn't random sorting/pruning, but being able to apply large numbers of related hints/references/definitions in a meaningful way. To me this would be the entire point of a large context window. For entirely different topics you can always just start a new instance.


Would be awesome if it is solved but seems like a much deeper problem tbh.


I am skeptical of benchmarks in general, to be honest. It seems to be extremely difficult to come up with benchmarks for these things (it may be true of intelligence as a quality...). It's almost an anti-signal to proclaim good results on benchmarks. The best barometer of model quality has been vibes, in places like /r/localllama where cracked posters are actively testing the newest models out.

Based on Google's track record in the area of text chatbots, I am extremely skeptical of their claims about coherency across a 1M+ context window.

Of course none of this even matters anyway because the weights are closed the architecture is closed nobody has access to the model. I'll believe it when I see it.



I believe that's a limitation of using vectors of high dimensions. It'll be muddy.


Not unlike trying to keep the whole contents of the document in your own mind :)


It's amazing we are in 2024 discussing the degree a machine can reason over millions of tokens of context. The degree, not the possibility.


Unfortunately Google's track record with language models is one of overpromising and underdelivering.


It's only their web-interface LLMs of the past few years that have been lackluster. That statement isn't correct for their overall history, though: their W2V-based language models and the BERT/Transformer models of the early days (publicly available, but not in a web interface) were far ahead of the curve, since they were the ones that produced these innovations. Effectively, DeepMind/Google are academics (where the real innovations are made), but they struggle to produce corporate products (where OpenAI shines).


Did you think the extraction of information from the Buster Keaton film was muddy? I thought it was incredibly impressive to be this precise.


LLMs are able to utilize "all the world's" knowledge during training and give seemingly magical answers. While providing context in the query is different than training models, is it possible that more context will give more material to the LLM and it will be able to pick out the relevant bits on its own?

What if it was possible, with each query, to fine tune the model on the provided context, and then use that JIT fine-tuned model to answer the query?



Would like to see the latency and cost of parsing entire 10M context before throwing out the RAG stack which is relatively cheap and fast.


Also, unless they significantly change their pricing model, we're talking about $0.50 per API call at current prices


I think there are also a lot of people who are only interested in RAG if they can self-host and keep their documents private.


Yes and the ability to have direct attribution matters so you know exactly where your responses come from. And costs as others point out, but RAG is not gone in fact it just got easier and a lot more powerful.


Have to consider cost for all of this. Big value of RAG already even given the size of GPT-4’a largest context size is it decreases cost very significantly.


costs rise on a per-token basis. So you CAN use 10M tokens, but it's probably not usually a good idea. A database lookup is still better than a few billion math operations.


I think the unspoken goal is to just lay off your employees and dump every doc and email they’ve ever written as one big context.

Now that Google has tasted the previously forbidden fruit of layoffs themselves, I think their primary goal in ML is now headcount reduction.



also costs are always based on context tokens; you don't want to put in 10M of context for every request (it's just nice to have that option when you want to do big things that don't scale)


How much would a lawyer charge to review your 10M-token legal document?


10M tokens is something like 14 copies of War and Peace, or maybe the entire Harry Potter series seven times over. That'd be some legal document!


Hmm I don’t know but I feel like the U.S. Congress has bills that would push that limit.


> They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.

They try to push that, but it's not the most convincing. Look at Table 8 for text evaluations (math, etc.) - they don't even attempt a comparison with GPT-4.

GPT-4 is higher than any Gemini model on both MMLU and GSM8K. Gemini Pro seems slightly better than GPT-4 original in Human Eval (67->71). Gemini Pro does crush naive GPT-4 on math (though not with code interpreter and this is the original model).

All in 1.5 Pro seems maybe a bit better than 1.0 Ultra. Given that in the wild people seem to find GPT-4 better for say coding than Gemini Ultra, my current update is Pro 1.5 is about equal to GPT-4.

But we'll see once released.



> people seem to find GPT-4 better for say coding than Gemini Ultra

For my use cases, Gemini Ultra performs significantly better than GPT-4.

My prompts are long and complex, with a paragraph or two about the general objective followed by 15 to 20 numbered requirements. Often I'll include existing functions the new code needs to work with, or functions that must be refactored to handle the new requirements.

I took 20 prompts that I'd run with GPT-4 and fed them to Gemini Ultra. Gemini gave a clearly better result in 16 out of 20 cases.

Where GPT-4 might miss one or two requirements, Gemini usually got them all. Where GPT-4 might require multiple chat turns to point out its errors and omissions and tell it to fix them, Gemini often returned the result I wanted in one shot. Where GPT-4 hallucinated a method that doesn't exist, or had been deprecated years ago, Gemini used correct methods. Where GPT-4 called methods of third-party packages it assumed were installed, Gemini either used native code or explicitly called out the dependency.

For the 4 out of 20 prompts where Gemini did worse, one was a weird rejection where I'd included an image in the prompt and Gemini refused to work with it because it had unrecognizable human forms in the distance. Another was a simple bash script to split a text file, and it came up with a technically correct but complex one-liner, while GPT-4 just used split with simple options to get the same result.

For now I subscribe to both. But I'm using Gemini for almost all coding work, only checking in with GPT-4 when Gemini stumbles, which isn't often. If I continue to get solid results I'll drop the GPT-4 subscription.



I have a very similar prompting style to yours and share this experience.

I am an experienced programmer and usually have a fairly exact idea of what I want, so I write detailed requirements and use the models more as typing accelerators.

GPT-4 is useful in this regard, but I also tried about a dozen older prompts on Gemini Advanced/Ultra recently and in every case preferred the Ultra output. The code was usually more complete and prod-ready, with higher sophistication in its construction and somewhat higher density. It was just closer to what I would have hand-written.

It's increasingly clear though LLM use has a couple of different major modes among end-user behavior. Knowledge base vs. reasoning, exploratory vs. completion, instruction following vs. getting suggestions, etc.

For programming I want an obedient instruction-following completer with great reasoning. Gemini Ultra seems to do this better than GPT-4 for me.



I’m going to have to try Gemini for code again. It just occurred to me as a Xoogler that if they used Google’s code base as the training data it’s going to be unbeatable. Now did they do that? No idea, but quality wins over quantity, even with LLM.


There is no way NTK data is in the training set, and google3 is NTK.


I dunno, leadership is desperate and they can de-NTK if and when they feel like it.


What is “NTK”?


"Need To Know" I.e. data that isn't open within the company.


It constantly hallucinates APIs for me, I really wonder why people's perceptions are so radically different. For me it's basically unusable for coding. Perhaps I'm getting a cheaper model because I live in a poorer country.


Are you using Gemini Advanced? (The paid tier.) The free one is indeed very bad.


I asked Gemini Advanced, the paid one, to "Write a script to delete some files" and it told me that it couldn't do that because deleting files was unethical. At that point I cancelled my subscription since even GPT-4 with all its problems isn't nearly as broken as Gemini.


> My prompts are long and complex, with a paragraph or two about the general objective followed by 15 to 20 numbered requirements. Often I'll include existing functions the new code needs to work with, or functions that must be refactored to handle the new requirements.

I guess this is a tough request if you're working on a proprietary code base, but I would love to see some concrete examples of the prompts and the code they produce.

I keep trying this kind of prompting with various LLM tools including GPT-4 (haven't tried Gemini Ultra yet, I admit) and it nearly always takes me longer to explain the detailed requirements and clean up the generated code than it would have taken me to write the code directly.

But plenty of people seem to have an experience more like yours, so I really wonder whether (a) we're just asking it to write very different kinds of code, or (b) I'm bad at writing LLM-friendly requirements.



Not OP but here is a verbatim prompt I put into these LLMs. I'm learning to make flutter apps, and I like to try make various UIs so I can learn how to compose some things. I agree that Gemini Ultra (aka the paid "advanced" mode) is def better than ChatGPT-4 for this prompt. Mine is a bit more terse than OP's huge prompt with numbered requirements, but I still got a super valid and meaningful response from Gemini, while GPT4 told me it was a tricky problem, and gave me some generic code snippets, that explicitly don't solve the problem asked.

> I'm building a note-taking app in flutter. I want to create a way to link between notes (like a web hyperlink) that opens a different note when a user clicks on it. They should be able to click on the link while editing the note, without having to switch modalities (eg. no edit-save-view flow nor a preview page). How can I accomplish this?

I also included a follow-up prompt after getting the first answer, which again for Gemini was super meaningful, and already included valid code to start with. Gemini also showed me many more projects and examples from the broader internet.

> Can you write a complete Widget that can implement this functionality? Please hard-code the note text below:



Is there any chance you could share an example of the kind of prompt you're writing?

I'm always reluctant to write long prompts because I often find GPT4 just doesn't get it, and then I've wasted ten minutes writing a prompt



I've found Gemini generally equal with the .Net and HTML coding I've been doing.

I've never had Gemini give me a better result than GPT, though, so it does not surpass it for my needs.

The UI is more responsive, though, which is worth something.



How do you interact with Gemini for coding work? I am trying to paste my code in the web interface and when I hit submit, the interface says "something went wrong" and the code does not appear in the chat window. I signed up for Gemini Advanced and that didn't help. Do you use AI Studio? I am just looking in to that now.


> Gemini Pro seems slightly better than GPT-4 original in Human Eval (67->71).

Though they talk a bunch about how hard it was to filter out Human Eval, so this probably doesn't matter much.



I mean i don't see GPT4 watching a 44 minute movie and being able to exactly pinpoint a guy taking a paper out of his pocket..


    > The 10M context ability wipes out most RAG stack complexity immediately.
Remains to be seen.

Large contexts are not always better. For starters, it takes longer to process. But secondly, even with RAG and the large context of GPT4 Turbo, providing it a more relevant and accurate context always yields better output.

What you get with RAG is faster response times and more accurate answers by pre-filtering out the noise.



Hopefully we can get a better RAG out of it. Currently people do incredibly primitive stuff, like splitting text into fixed-size chunks and adding them to a vector DB.

An actually useful RAG would be to convert text to Q&A and use Q's embeddings as an index. Large context can make use of in-context learning to make better Q&A.
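A minimal sketch of that questions-as-index idea, with hypothetical generate_questions(), embed() and vector_index helpers:

    # Index each chunk by the questions it can answer, not by its raw text.
    def build_question_index(chunks, generate_questions, embed, vector_index):
        for chunk_id, chunk_text in enumerate(chunks):
            for q in generate_questions(chunk_text):   # e.g. 3-5 questions per chunk
                vector_index.add(vector=embed(q),
                                 payload={"chunk_id": chunk_id, "question": q})

    # At query time, match the user's question against the stored questions
    # and return the chunks they point back to.
    def retrieve(query, chunks, embed, vector_index, k=5):
        hits = vector_index.search(embed(query), top_k=k)
        return [chunks[h.payload["chunk_id"]] for h in hits]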



A lot of people in RAG already do this. I do this with my product: we process each page and create lists of potential questions that the page would answer, and then embed that.

We also embed the actual text, though, because I found that only doing the questions resulted in inferior performance.



So in this case, what your workflow might look like is:

    1. Get text from page/section/chunk
    2. Generate possible questions related to the page/section/chunk
    3. Generate an embedding using { each possible question + page/section/chunk }
    4. Incoming question targets the embedding and matches against { question + source }
Is this roughly it? How many questions do you generate? Do you save a separate embedding for each question? Or just stuff all of the questions back with the page/section/chunk?


Right now I just throw the different questions together in a single embedding for a given chunk, with the idea that there’s enough dimensionality to capture them all. But I haven’t tested embedding each question, matching on that vector, and then returning the corresponding chunk. That seems like it’d be worth testing out.
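For what it's worth, a sketch of the two variants being compared here, with hypothetical embed() and index helpers:

    # (a) One embedding per chunk, built from the chunk plus all its questions.
    def index_combined(chunk_id, chunk_text, questions, embed, index):
        blob = chunk_text + "\n" + "\n".join(questions)
        index.add(vector=embed(blob), payload={"chunk_id": chunk_id})

    # (b) One embedding per question, each pointing back to its source chunk.
    def index_per_question(chunk_id, questions, embed, index):
        for q in questions:
            index.add(vector=embed(q), payload={"chunk_id": chunk_id})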


Don't forget that Gemini also has access to the internet, so a lot of RAGging becomes pointless anyway.


Internet search is a form of RAG, though. 10M tokens is very impressive, but you're not fitting a database, let alone the entire internet into a prompt anytime soon.


You shouldn't fit an entire database in the context anyway.

btw, 10M tokens is 78 times more context window than the newest GPT-4-turbo (128K). In a way, you don't need 78 GPT-4 API calls, only one batch call to Gemini 1.5.



I don't get why people think you need to put an entire database into the AI's short-term memory for it to be useful. When you work with a DB, are you memorizing the entire f*cking database? No, you know summaries of it and how to access and use it.

People also seem to forget that the average person reads about 1B words in their entire LIFETIME, and at 10M with nearly 100% recall that's pretty damn amazing; I'm pretty sure I don't have perfect recall of 10M words myself lol



You certainly don't need that much context for it to be useful, but it definitely opens up a LOT more possibilities without the compromises of implementing some type of RAG. In addition, don't we want our AI to have superhuman capabilities? The ability to work on 10M+ tokens of context at a time could enable superhuman performance in many tasks. Why stop at 10M tokens? Imagine if AI could work on 1B tokens of context like you said?


It increases the use cases.

It can also be a good alternative for fine-tuning.

And the use case of a code base is a good example: if the ai understands the whole context, it can do basically everything.

Let me pay 5€ for a android app rewritten into iOS.



Well it's nice, just sad nobody can use it


This may be useful in a generalized use case, but a problem is that many of those results again will add noise.

For any use case where you want contextual results, you need to be able to either filter the search scope or use RAG to pre-define the acceptable corpus.



> you need to be able to either filter the search scope or use RAG ...

Unless you can get nearly perfect recall with millions of tokens, which is the claim made here.



> The 10M context ability wipes out most RAG stack complexity immediately.

The video queries they show take around 1 minute each, this probably burns a ton of GPU. I appreciate how clearly they highlight that the video is sped up though, they're clearly trying to avoid repeating the "fake demo" fiasco from the original Gemini videos.



The youtube video of the Multimodal analysis of a video is insane, imagine feeding in movies or tv shows and being able to autosummary or find information about them dynamically, how the hell is all this possible already? AI is moving insanely fast.


> imagine feeding in movies or tv shows

Google themselves have such a huge footprint of various businesses, that they alone would be an amazing customer for this, never mind all the other cool opportunities from third parties...

Imagine that they can ingest the entirety of YouTube and then dump that into Google Search's index AND use it to generate training data for their next LLM.

Imagine that they can hook it up to your security cameras (Nest Cam), and then ask questions about what happened last night.

Imagine that you can ask Gemini how to do something (eg. fix appliance), and it can go and look up a YouTube video on how to accomplish that ask, and explain it to you.

Imagine that it can apply summarization and descriptions to every photo AND video in your personal Google Photos library. You can ask it to find a video of your son's first steps, or a graduation/diploma walk for your 3rd child (by name) and it can actually do that.

Imagine that Google Meet video calls can have the entire convo itself fed into an LLM (live?), instead of just a transcription. You can have an AI assistant there with you that can interject and discuss, based on both the audio and video feed.



I'd love to see that applied to the Google ecosystem, the question is - why haven't they already done this?


Regarding how they’re getting to 10M context, I think it’s possible they are using the new SAMBA architecture.

Here’s the paper: https://arxiv.org/abs/2312.00752

And here’s a great podcast episode on it: https://www.cognitiverevolution.ai/emergency-pod-mamba-memor...



As a Brazilian, I approve that choice. Vambora amigos!


Regarding the 10M tokens context, RingAttention has been shown [0] recently (by researchers, not ML engineers in a FAANG) to be able to scale to comparable (1M) context sizes (it does take work and a lot of GPUs).

[0]: https://news.ycombinator.com/item?id=39367141



> researchers, not ML engineers in a FAANG

Why did you point out this distinction?



It means they have significantly less means (to get a lot of GPUs letting them scale up in context length) and are likely less well-versed in optimization (which also helps with scaling up)[0].

I believe those two things together are likely enough to explain the difference between a 1M context length and a 10M context length.

[0]: Which is not looking down on that particular research team, the vast majority of people have less means and optimization know-how than Google.



Probably to indicate that its research and not productized?


Is 10M token context correct? The blog post I see 1M but I'm not sure if these are different things

Edit: Ah, I see, it's 1M reliably in production, up to 10M in research:

> Through a series of machine learning innovations, we’ve increased 1.5 Pro’s context window capacity far beyond the original 32,000 tokens for Gemini 1.0. We can now run up to 1 million tokens in production.

> This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens.



How could one hour of video fit in 1M tokens? 1 hour at 30 fps is 3600*30 ≈ 100k frames. Each frame is converted into 256 tokens. So either they are not processing each frame, or each frame is converted into fewer tokens.


The model can probably perform fine at 1 frame per second (3600*256=921600 tokens), and they could probably use some sort of compression.
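Rough arithmetic behind that number (the 256-tokens-per-frame figure is the assumption from the comment above):

    frames_per_second = 1        # sample 1 fps instead of all 30
    seconds = 3600               # one hour of video
    tokens_per_frame = 256       # assumed per-frame token count
    print(frames_per_second * seconds * tokens_per_frame)
    # 921600 -- just under the 1M-token production window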


I know how I’m going to evaluate this model. Upload my codebase and ask it to “find all the bugs”.


> 1. They don't talk about how they get to 10M token context

> 2. They don't talk about how they get to 10M token context

Yes. I wonder if they're using a "linear RNN" type of model like Linear Attention, Mamba, RWKV, etc.

Like Transformers with standard attention, these models train efficiently in parallel, but their compute is O(N) instead of O(N²), so in theory they can be extended to much longer sequences much more efficiently. They have shown a lot of promise recently at smaller model sizes.

Does anyone here have any insight or knowledge about the internals of Gemini 1.5?
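A back-of-the-envelope comparison of the two scaling regimes mentioned above (simplified; constants and attention heads ignored):

    def flops_standard_attention(n_tokens, d_model):
        return n_tokens ** 2 * d_model      # pairwise scores dominate: O(N^2 * d)

    def flops_linear_variant(n_tokens, d_model):
        return n_tokens * d_model ** 2      # per-token state update: O(N * d^2)

    N, d = 10_000_000, 4096
    print(flops_standard_attention(N, d) / flops_linear_variant(N, d))
    # ~2441x -- why linear-attention/state-space models look attractive at 10M tokens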



The fact they are getting perfect recall with millions of tokens rules out any of the existing linear attention methods.


They do give a hint:

"This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture."

One thing you could do with MoE is giving each expert different subsets of the input tokens. And that would definitely do what they claim here: it would allow search. You want to find where someone said "the password is X" in a 50 hour audio file, this would be perfect.

If your question is "what is the first AND last thing person X said" ... it's going to suck badly. Anything that requires taking 2 things into account that aren't right next to each other is just not going to work.



> Anything that requires taking 2 things into account that aren't right next to eachother is just not going to work.

They kinda address that in the technical report[0]. On page 12 they show results from a "multiple needle in a haystack" evaluation.

https://storage.googleapis.com/deepmind-media/gemini/gemini_...



> One thing you could do with MoE is giving each expert different subsets of the input tokens.

Don't MoE's route tokens to experts after the attention step? That wouldn't solve the n^2 issue the attention step has.

If you split the tokens before the attention step, that would mean those tokens would have no relationship to each other - it would be like inferring two prompts in parallel. That would defeat the point of a 10M context



Is MOE then basically divide and conquer? I have no deep knowledge of this so I assumed MOE was where each expert analyzed the problem in a different way and then there was some map-reduce like operation on the generated expert results. Kinda like random forest but for inference.




Re RAG: aren't you ignoring the fact that no one wants to put confidential company data into such LLMs? Private RAG infrastructure remains a need for the same reason that privacy of data of all sorts remains a need. Huge context solves the problem for large open-source context material, but that's only part of the picture.


There will always be more data that could be relevant than fits in a context window, and especially for multi-turn conversations, huge contexts incur huge costs.

GPT-4 Turbo, using its full 128k context, costs around $1.28 per API call.

At that pricing, 1m tokens is $10, and 10m tokens is an eye-watering $100 per API call.

Of course prices will go down, but the price advantage of working with less will remain.
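The arithmetic behind those figures, assuming the same ~$0.01 per 1K input tokens:

    price_per_1k_input = 0.01
    for ctx in (128_000, 1_000_000, 10_000_000):
        print(f"{ctx:>10,} tokens -> ${ctx / 1000 * price_per_1k_input:,.2f} per call")
    #    128,000 tokens -> $1.28 per call
    #  1,000,000 tokens -> $10.00 per call
    # 10,000,000 tokens -> $100.00 per call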



I don't see a problem with this pricing. At 1m tokens you can upload the whole proceedings of a trial and ask it to draw an analysis. Paying $10 for that sounds like a steal.


Unfortunately the whole context has to be reprocessed fully for each query, which means that if you "chat" with the model you'll incur that $10 fee for every interaction, which quickly adds up.

It may still be worth it for some use cases



While it's hard to say what's possible on the cutting edge, historically models tend to get dumber as the context size gets bigger. So you'd get a much more intelligent analysis of a 10,000-token excerpt of the trial than of a million-token complete transcript. I have not spent the money testing big token sizes in GPT-4 Turbo, but it would not surprise me if it gets dumber. Think of it this way: if the model is limited to 3,000-token replies and an analysis requires a more detailed response than that, it cannot provide it; it'll just give you insufficient information. What it'll probably do is ignore parts of the trial transcript because it can't analyze all that information in 3,000 tokens. And asking a follow-up question is another million tokens.


Of course, if you get exactly the answer you want in the first reply.


Would the price really increase linearly? Isn't the demands on compute and memory increasing steeper than that as a function of context length?


> The 10M context ability wipes out most RAG stack complexity immediately.

This may not be true. My experience is that the complexity of RAG lies in how to properly connect to various unstructured data sources and run a data-transformation pipeline for large-scale data sets (which means GB, TB or even PB). It's on the critical path rather than a "nice to have", because the quality of the data and the pipeline is a major factor in the final generated result; i.e., in RAG, the importance of R >>> G.



RE: RAG - they haven't released pricing, but if input tokens are priced at GPT-4 levels - $0.01/1K then sending 10M tokens will cost you $100.


In the announcements today they also halved the pricing of Gemini 1.0 Pro to $0.000125 / 1K characters, which is a quarter of GPT3.5 Turbo so it could potentially be a bit lower than GPT-4 pricing.


If you think the current APIs will stay that way, then you're right. But when they start offering dedicated chat instances or caching options, you could be back in the penny region.

You probably need a couple GB to cache a conversation. That's not so easy at the moment because you have to transfer that data to and from the GPUs and store the data somewhere.



The tokens need to be fed into the model along with the prompt and this takes time. Naive attention is O(N^2). They probably use at least flash attention, and likely something more exotic to their hardware.

You'll notice in their video [1] that they never show the prompts running interactively. This is for a roughly 800K context. They claim that "the model took around 60s to respond to each of these prompts".

This is not really usable as an interactive experience. I don't want to wait 1 minute for an answer each time I have a question.

[1] https://www.youtube.com/watch?v=SSnsmqIj1MI



> They don't talk about how they get to 10M token context

I don't know how either but maybe https://news.ycombinator.com/item?id=39367141

Anyway I mean, there is plenty of public research on this so it's probably just a matter of time for everyone else to catch up



Why do you think this specific variant (RingAttention)? There are so many different variants for this.

As far as I know, the problem in most cases is that while the context length might be high in theory, the actual ability to use it is still limited. E.g. recurrent networks even have infinite context, but they actually only use 10-20 frames as context (longer only in very specific settings; or maybe if you scale them up).



There are ways to test the neural network’s ability to recall from a very long sequence. For example, if you insert a random sentence like “X is Sam Altman” somewhere in the text, will the model be able to answer the question “Who is X?”, or maybe somewhat indirectly “Who is X (in another language)” or “Which sentence was inserted out of context?” “Which celebrity was mentioned in the text?”

Anyways the ability to generalize to longer context length is evidenced by such tests. If every token of the model’s output is able to answer questions in such a way that any sentence from the input would be taken into account, this gives evidence that the full context window indeed matters. Currently I find Claude 2 to perform very well on such tasks, so that sets my expectation of how a language model with an extremely long context window should look like.
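A minimal sketch of that needle-in-a-haystack style test, with a hypothetical ask_model() client:

    # Bury one planted fact at a known depth in distractor text and check
    # whether the model can surface it when asked.
    def needle_test(distractor_docs, ask_model, depth=0.5,
                    needle="X is Sam Altman", question="Who is X?"):
        haystack = list(distractor_docs)
        haystack.insert(int(len(haystack) * depth), needle)
        answer = ask_model("\n".join(haystack) + "\n\n" + question)
        return "sam altman" in answer.lower()

    # Sweep context length and insertion depth to map where recall breaks down:
    # [needle_test(docs[:n], ask_model, depth=d)
    #  for n in (1_000, 10_000, 100_000) for d in (0.1, 0.5, 0.9)]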



> The 10M context ability wipes out most RAG stack complexity immediately.

1. People mention accuracy issues with longer contexts.

2. People mention processing time issues with longer contexts.

3. Something people haven't mentioned in this thread is cost: even though prompt tokens are usually cheaper than generated tokens, and Gemini seems to be cheaper than GPT-4, putting a whole knowledge base or 80-page document in the context is going to make every run of that prompt quite expensive.



> The 10M context ability wipes out most RAG stack complexity immediately.

RAG is needed for the same reason you don't `SELECT *` all of your queries.



>3. The 10M context ability wipes out most RAG stack complexity immediately.

I'd imagine RAG would still be much more efficient computationally



This might be a stupid question - even if there's no quality degradation from 10M context, will it be extremely slow at inference?


RAG would still be useful for cost savings assuming they charge per token, plus I'm guessing using the full-context length would be slower than using RAG to get what you need for a smaller prompt


This is going to be the real differentiator.

HN is very focused on technical feasibility (which remains to be seen!), but in every LLM opportunity, the CIO/CFO/CEO are going to be concerned with the cost modeling.

The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

Maybe this changes with managed vector search offerings that are opaque to the user. The context goes to a preprocessing layer, an efficient cache understands which parts haven't been embedded (new bloom filter use case?), embeds the other chunks, and extracts the intent of the prompt.
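A minimal sketch of such a cache keyed on content hashes (embed_batch() is hypothetical; a Bloom filter could replace the dict's membership check to save memory):

    import hashlib

    _cache = {}   # content hash -> embedding vector

    def _key(text):
        return hashlib.sha256(text.encode()).hexdigest()

    def embed_with_cache(chunks, embed_batch):
        missing = [c for c in chunks if _key(c) not in _cache]
        if missing:                              # only embed unseen chunks
            for chunk, vec in zip(missing, embed_batch(missing)):
                _cache[_key(chunk)] = vec
        return [_cache[_key(c)] for c in chunks]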



Agreed with this.

The leading ability AI (in terms of cognitive power) will, generally, cost more per token than lower cognitive power AI.

That means that at a given budget you can choose more cognitive power with fewer tokens, or less cognitive power with more tokens. For most use cases, there's no real point in giving up cognitive power to include useless tokens that have no hope of helping with a given question.

So then you're back to the question of: how do we reduce the number of tokens, so that we can get higher cognitive power?

And that's the entire field of information retrieval, which is the most important part of RAG.



The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

Really? Because to my understanding the compute necessary to generate a token grows linearly with the context, and doesn't the OpenAI billing reflect that by seperating prompt and output tokens?



I assume using this large of a context window instead of RAG would mean the consumption of many orders of magnitude more GPU.


RAG doesn’t go away at 10 Million tokens if you do esoteric sources like shodan API queries.


Even 1M tokens eliminates the need for RAG, unless it's for cost.


1 million might sound like a lot, but it's only a few megabytes. I would want RAG, somehow, to be able to process gigabytes or terabytes of material in a streaming fashion.


RAG will not change how many tokens the LLM can produce at once.

Longer context, on the other hand, could put some RAG use cases to sleep: if your instructions are, like, literally a manual long, then there is no need for RAG.



I think RAG could be used to do that. If you have a one-time retrieval in the beginning, basically amending the prompt, then I agree with you. But there are projects (a classmate is doing his master's thesis on one implementation of this) that retrieve once every few tokens and make the retrieved information available to the generation somehow. That would not take a toll on the context window.


Or accuracy


I just hope at some point we get access to mostly uncensored models. Both GPT-4 and Gemini are extremely shackled, and a slightly inferior model that hasn’t been hobbled by a very restricting preprompt would handily outperform them.


You can customize the system prompt with ChatGPT or via the completions API, just fyi.
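For example, with the OpenAI Python SDK it looks roughly like this (sketch; the model name is just an illustration):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system",
             "content": "You are a concise coding assistant. Answer directly."},
            {"role": "user",
             "content": "Write a script to delete some files."},
        ],
    )
    print(resp.choices[0].message.content)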


What's RAG?


Retrieval Augmented Generation. In basic terms, it optimizes output of LLMs by using additional external data sources before answering queries. (That actually might be too basic of a description)

Here:

https://blogs.nvidia.com/blog/what-is-retrieval-augmented-ge...



Retrieval augmented generation.

> Retrieval Augmented Generation (RAG) is a technique where the capabilities of a large language model (LLM) are augmented by retrieving information from other systems and inserting them into the LLM’s context window via a prompt.

(stolen from: https://github.com/psychic-api/rag-stack)
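In code, the whole loop is roughly this (all helpers hypothetical):

    def rag_answer(question, embed, vector_index, ask_model, k=4):
        # 1. Retrieve: find the stored chunks most similar to the question.
        hits = vector_index.search(embed(question), top_k=k)
        context = "\n\n".join(h.payload["text"] for h in hits)
        # 2. Augment: insert the retrieved text into the prompt.
        prompt = ("Answer using only the context below.\n\n"
                  f"Context:\n{context}\n\nQuestion: {question}")
        # 3. Generate: have the LLM answer from the augmented prompt.
        return ask_model(prompt)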



> They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting

I fully disagree: they compare Gemini 1.5 Pro and GPT-4 only on context length, not on other tasks, where they compare it only to other Gemini models, which is a strange self-own.

I'm convinced that if they do not show the results against GPT4/Claude, it is because they do not look good.



For #1 and #2 it is some version of mixture of experts. This is mentioned in the blog post. So each expert only sees a subset of the tokens.

I imagine they have some new way to route tokens to the experts that probably computes a global context. One scalable way to compute a global context is by a state space model. This would act as a controller and route the input tokens to the MoEs. This can be computed by convolution if you make some simplifying assumptions. They may also still use transformers as well.

I could be wrong but there are some Mamba-MoEs papers that explore this idea.



After their giant fib with the Gemini video a few weeks back I'm not believing anything til I see it used by actual people. I hope it's that much better than GPT-4, but I'm not holding my breath there isn't an asterisk or trick hiding somewhere.


It takes 60 seconds to process all of that context in their three.js demo, which is, I will say, not super interactive. So there is still room for RAG and other faster alternatives to narrow the context.


How do you know it isn't RAG?


FYI, MM is the standard for million. 10MM not 10M I’m reading all these comments confused as heck why you are excited about 10M tokens


Maybe for accountants, but for everyone else a single M is much more common.


Personally, I've given up on Gemini, as it seems to have been censored to the point of uselessness. I asked it yesterday [0] about C++ 20 Concepts, and it refused to give actual code because I'm under 18 (I'm 17, and AFAIK that's what the age on my Google account is set to). I just checked again, and it gave a similar answer [1]. When I tried ChatGPT 3.5, it did give an answer, although it was a little confused, and the code wasn't completely correct.

This seems to be a common experience, as apparently it refuses to give advice on copying memory in C# [2], and I tried to do what was suggested in this comment [3], but by the next prompt it was refusing again, so I had to stick to ChatGPT.

[0] https://g.co/gemini/share/238032386438

[1] https://g.co/gemini/share/6880989ddfaf

[2] https://news.ycombinator.com/item?id=39312896

[3] https://news.ycombinator.com/item?id=39313567



One interesting tidbit from the technical report:

>HumanEval is an industry standard open-source evaluation benchmark (Chen et al., 2021), but we found controlling for accidental leakage on webpages and open-source code repositories to be a non-trivial task, even with conservative filtering heuristics. An analysis of the test data leakage of Gemini 1.0 Ultra showed that continued pretraining on a dataset containing even a single epoch of the test split for HumanEval boosted scores from 74.4% to 89.0%, highlighting the danger of data contamination. We found that this sharp increase persisted even when examples were embedded in extraneous formats (e.g. JSON, HTML). We invite researchers assessing coding abilities of these models head-to-head to always maintain a small set of truly held-out test functions that are written in-house, thereby minimizing the risk of leakage. The Natural2Code benchmark, which we announced and used in the evaluation of Gemini 1.0 series of models, was created to fill this gap. It follows the exact same format of HumanEval but with a different set of prompts and tests.



Massive whoa if true from technical report

"Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens"

https://storage.googleapis.com/deepmind-media/gemini/gemini_...



10M tokens is absolutely jaw dropping. For reference, this is approximately thirty books of 500 pages each.

Having 99% retrieval is nuts too. Models tend to unwind pretty badly as the context (tokens) grows.

Put these together and you are getting into the territory of dumping all your company documents, or all your department's documents, into a single GPT (or whatever Google will call it) and everyone working with that. Wild.



Seems like Google caught up. Demis is again showing an incredible ability to lead a team to make groundbreaking work.


If any of this is remotely true, not only did it catch up, it’s wiping the floor with how useful it can be compared to GPT4. Not going to make a judgement until I can actually try it out though.


In the demo videos gemini needs about a minute to answer long context questions. Which is better than reading thousands of pages yourself. But if it has to compete with classical search and skimming it might need some optimization.


Replacing grep or `ctrl+F` with Gemini would be the user's fault, not Gemini's. If classical search is already a performant solution for a job, use classical search. Save your tokens for jobs worthy of solving with a general intelligence!


That’s a compute problem, something that involves just throwing money at the problem.


Another whoa for me

>Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.

Results - https://imgur.com/a/qXcVNOM



I think this is mostly due to the ability to handle long context lengths better. Note how Claude 2.1 already strongly outperforms GPT-4 on this task.


GPT-4V turbo outperforms Claude on long contexts, IIRC. Unless that's mistaken, I'd suspect a different explanation for that task.


Did you watch the video of the Gemini 1.5 video recall after it processed the 44 minute video... holy shit


So, will this outperform any RAG approach as long as the data fits inside the context window?


A perfect RAG system would probably outperform everything in a larger context due to prompt dilution, but in the real world putting everything in context will win a lot of the time. The large context system will also almost certainly be more usable due to elimination of retrieval latency. The large context system might lose on price/performance though.


Outperform is dependent on the RAG approach (and this would be a RAG approach anyways, you can already do this with smaller context sizes). A simplistic one, probably, but dumping in data that you don't need dilutes the useful information, so I would imagine there would be at least _some_ degradation.

But there is also the downside that by "tuning" the RAG to return fewer tokens you will miss extra context that could be useful to the model.



Doesn't their needle/haystack benchmark seem to suggest there is almost no dilution? They pushed that demo out to 10M tokens.


Cost would still be a big concern


are you going to upload 10M tokens to Gemini on every request? That's a lot of data moving around when the user is expecting a near realtime response. Seems like it would still be better to only set the context with information relevant to the user's prompt which is what plain rag does.


basically, yes. Pinecone? Dead. Azure AI Search? Dead. Qdrant? Dead.


Prompt token cost still a variable.


Could you (or someone) explain what this means?


It's how much text it can consider at a time when generating a response. Basically the size of the prompt. A token is not quite a word, but you can think of it as roughly that. Previously, the best most LLMs could do was around 32K. This new model does 1M, and in testing they could push it up to 10M with near-perfect retrieval.

As the other comment mentions, you can paste the content of entire books or documents and ask very pointed question about it. Last year, Anthropic was showing off their 100K context window, and that's exactly what they did, they gave it the content of The Great Gatsby and asked it questions about specific lines of the book.

Similarly, imagine giving it hundreds of documents and asking it to spot some specific detail in there.



Awesome explanation, thanks for the comparison


The input you give it can be very long. This can qualitatively change the experience. Imagine, for example, copy pasting the entire lord of the rings plus another 100 books you like and asking it to write a similar book...


I just googled it, and the LOTR trilogy apparently has a total of 480,000 words, which brings home how huge 1M is! It'd be fascinating to see how well Gemini could summarize the plot or reason about it.

One point I'm unclear on is how these huge context sizes are implemented by the various models. Are any of them the actual raw "width of the model" that is propagated through it, or are these all hierarchical summarization and chunk embedding index lookup type tricks?



For another reference, Shakespeare’s complete works are ~885k words.

The Encyclopedia Britannica is ~44M words.



Reading Lord of the Rings, and writing a quality book in the same style, are almost wholly unrelated tasks. Over 150 million copies of Lord of the Rings have been sold, but few readers are capable of "writing a similar book" in terms of quality. There's no reason to think this would work well.


I mean, Terry Brooks did it with the Sword of Shannara. (/s)


I doubt it’s smart enough to write another (coherent, good) book based on 103 books. But you could ask it questions about the books and it would search and synthesize good answers.


Until I can talk to it, I care exactly zero.


you can buy their stock if you think they'll make a lot of money with their tech


Well that's really the right question .. what can, and will, Google do with this that can move their corporate earnings needle in a meaningful way? Obviously they can sell API access and integrate it into their Google docs suite, as well as their new Project IDX IDE, but do any of these have potential to make a meaningful impact ?

It's also not obvious how these huge models will fare against increasingly capable open source ones like Mixtral, perhaps especially since Google are confirming here that MoE is the path forward, which perhaps helps limit how big these models need to be.



In the long run it could move the needle in enterprise market share of Workspace and GCP. They have a lot of room to grow and IMO have a far superior product to O365/Azure which could be exacerbated by strong AI products. Only problem is this sales cycle can take a decade or more, and Google hasn’t historically been patient or strategic about things like this.


0 trust to what they put out until I see it live. After the last "launch" video which was fundamentally a marketing edit not showing the real product, I don't trust anything coming out of Google that isn't an instantly testable input form.


Essentially, the focus seems to be on leveraging the media buzz around Gemini 1.0 by highlighting the development of version 1.5. While GPT-4's position relative to Gemini 1.5 remains unclear, and the specifics of ChatGPT 4.5 are yet to be disclosed, it's worth noting that no release really counts until the functionality is directly accessible in user chats.

Google appears to be making strides in catching up.

When it comes to my personal workflow and accomplishing tasks, I still find ChatGPT to be the most effective tool. My familiarity with its features has made it indispensable. The integration of mentions and tailored GPTs seamlessly enhances my workflow.

While Gemini may match the foundational capabilities of LLMs, it falls short in delivering a product that efficiently aids in task completion.



I completely share the same views as you after their last video - and it appears that they've learnt their lesson this time.

If you watch the videos in the blog post, you can see it's a screen recording on a computer without any editing/stitching of different scenes together.

It's good to be sceptical but as engineers we should all remain open.



The videos shown in these demos have clearly learnt from that as they're using a real live product, filmed on their computers with timers in the bottom showing how long the computations take.


100%. Google continues to underwhelm. Not buying it until I can try it.


>Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.

Results - https://imgur.com/a/qXcVNOM

From the technical report https://storage.googleapis.com/deepmind-media/gemini/gemini_...



what if we ask it to translate an undeciphered language


It produces basically random translations. This is covered in the 0-shot case where no translation manual was included in the context. Due to how rare this language is, it’s essentially untranslated in the training corpus.


If you mean to dump random passages of text with no parallel corpora or grammar instructions then it won't do better than random.

That said, I believe that if you gave an LLM text in that language to predict during training, then even with no parallel corpora in the training data, we could have an LLM that could still translate that language into another language it also trained on.



What if we added a bunch of linguistic analysis books or something


> at a similar level to a person learning from the same content.

That's an incredibly low bar



:muffled sounds of goalposts being shifted in the distance:

Just a few years ago we used to clap if an NLP model could handle negation reliably or could generate even a paragraph of text in English that was natural sounding.

Now we are at a stage where it is basically producing reams of natural sounding text, performing surprisingly well on reasoning problems and translation of languages with barely any data despite being a markov chain on steroids, and what does it hear? "That's an incredibly low bar".



I'm going to keep beating this dead horse, but if you were a philosophy nerd in the 80s, 90s, 00s etc you may know that debates RAGED over whether computers could ever, even in principle do things that are now being accomplished on a weekly basis.

And as you say, the goalposts keep getting moved. It used to be claimed that computers could never play chess at the highest levels because that required "insight". And whatever a computer could do, it could never do that extra special thing, that could only be described in magical undefined terms.

I just hope there's a moment of reckoning for decades upon decades of arguments, deemed academically respectable, that insisted that days like these would never come.



Forget goalpost shifting, people frequently refuse to admit that it can do things that it obviously does, because they've never used it themselves.


Listen, you little ...


Honestly. I am ok with having greater and greater goals to accomplish but this sort of dismissive attitude really puts me off.


It's incredible how fast goalposts are moving.

The same feat one year ago would have been almost unbelievable.



> The author (the human learner) has some formal experience in linguistics and has studied a variety of languages both formally and informally, though no Austronesian or Papuan languages

From the language benchmark (parentheses mine).



Since when are we expecting super-human capabilities?


And in fact it already is super human. Show me a single human who can translate amongst 10+ languages across specialized domains in the blink of an eye.


Chat GPT has been super human in a lot of tasks even since 3.5.

People point out mistakes it makes that no human would make, but that doesn't negate the super-human performance it has at other tasks -- and the _breadth_ of what it can do is far beyond any single person.



Where exactly does it have super-human performance? Above average and expert-level? Sure, I'd agree, but I haven't experienced anything above that.


indeed, or a human who can analyze a hundred page text document in less than a minute and provide answers in less than a second.

the issue remains on accuracy. i think a human in that scenario is still more accurate with their responses, and i do not yet see that being overcome in this multi-year llm battle.



The model does already have superhuman ability by knowing hundreds of languages


Jarring you're not adding more context to your comment.


you are insane if you actually think this.


I've always been suspicious of any announcement from Demis Hassabis since way back in his video game days when he did a monthly article in Edge magazine about the game he was developing. "Infinite Polygons" became a running joke in the industry because of his obvious snake-oil. The game itself, Republic [1], was an uninteresting failure.

He learned how to promote himself from working for Peter "Project Milo" Molyneux and I see similar patterns of hype.

[1] https://en.wikipedia.org/wiki/Republic:_The_Revolution#Marke...



Funny read about his game.

Nonetheless, while still underwhelming in comparison to GPT-4 (excluding this announcement, as I haven't tried it yet), AlphaGo, AlphaZero and especially AlphaFold were tremendous!



Yeah, it's funny. I used to think "Demis Hassabis...where have I heard that name before?" And then I realized I saw him in the manuals for old Bullfrog games.


The line between delusional and visionary is thin! I know I'm too grounded in "expected value" math to do super outlier stuff like starting a video game company...


10M tokens is an absolute game changer, especially if there's no noticeable decay in quality with prompt size. We're going to see things like entire domain specific languages embedded in prompts. IMO people will start thinking of the prompt itself as a sort of runtime rather than a static input.

Back when OpenAI still supported raw text completion with text-davinci-003 I spent some time experimenting with tiny prompt-embedded DSLs. The results were very, very, interesting IMO. In a lot of ways, text-davinci-003 with embedded functions still feels to me like the "smartest" language model I've ever interacted with.

I'm not sure how close we are to "superintelligence" but for baseline general intelligence we very well could have already made the prerequisite technological breakthroughs.



It's pretty slow, though; it looks like up to 60 seconds for some of the answers, and it uses god knows how much compute, so there are probably going to be some trade-offs. You're going to want to make sure that that much context is actually useful for what you want.


TBF: when talking about the first "superintelligence", I'd expect it to take unreasonable amounts of compute and/or be slow -- that can always be optimized. Bringing it into existence in the first place is the hardest part.


Yea. Of course for some tasks we need speed, but I've been kinda surprised that we haven't seen very slow models which perform far better than faster models. We're treading new territory, and everyone seems to make models that are "fast enough".

I wanna see how far this tech can scale, regardless of speed. I don't care if it takes 24h to formulate a response. Are there "easy" variables which drastically improve output?

I suspect not. I imagine people have tried that. Though I'm still curious as to why.



I think the problem is that 24 hours of compute to run a response would be incredibly expensive. I mean, hell, how would that even be trained?


I gotta say, I've been trying out Gemini recently and it's embarrassingly bad. I can't take anything google puts out seriously when their current offerings are so so much worse than ChatGPT (or even local llama!).

As a particularly egregious example, yesterday night I gave Gemini a list of drinks and other cocktail ingredients I had laying around and asked for some recommendations for cute drinks that I could make. It's response:

> I'm just a language model, so I can't help you with that.

ChatGPT 3.5 came up with several delicious options with clear instructions, but it's not just this instance, I've NEVER gotten a response from Gemini that I even felt was more useful than just a freaking bing search! Much less better than ChatGPT. I'm just going to assume they're using cherrypicked metrics to make themselves feel better until proven otherwise. I have zero confidence in Google's AI plays, and I assume all their competent talent is now at OpenAI or Anthropic.



My experiences are similar, but I think we are talking about the Gemini free model, available on the Google Gemini website. I think the rest of the comments are saying the paid versions (Pro / Ultra) are significantly better, though I haven't tested it myself to compare.


I have the 2 months trial for the paid version, and find myself going back to free ChatGPT often. Gemini loves to put everything in bullet point lists and short paragraphs with subheadings for example, even when asking for a letter. I'm not a heavy user, but it seems to not quite get what I want often. Not important but annoying: It starts almost every answer with "Absolutely!", even when it doesn't match the question (e.g. "How does x work?").


I don't think "I'm just a language model, I can't help you with that" comes from Gemini. Google has a separate censorship model that blocks you from receiving Gemini's response in certain situations.

When Gemini (Ultra) refuses to do something itself, it is more verbose and specific as to why it won't do it, in my experience.



This got my trying Gemini, but doing so is such a hassle that I'm almost ready to give up. Trying out ChatGPT is as simple as signing up (either for pro, or the API), and getting a single API key.

Google requires me to navigate their absolutely insane console (seriously, I thought the AWS console was bad, but GCP takes the cake), only to tell me there is not even a way to get an API key... I had to ask Gemini through the built in interface to figure that out.



API keys are fairly straightforward in GCP though - there's an entire section for that, and even if you're stuck, the search console works.


If I understand correctly, they're releasing this for Pro but not Ultra, which I think is akin to GPT 3.5 vs 4? Sigh, the naming is confusing...

But my main takeaway is the huge context window! Up to a million, with more than 100k tokens right now? Even just GPT 3.5 level prediction with such a huge context window opens up a lot of interesting capabilities. RAG can be super powerful with that much to work with.



It's sizes.

Nano/Pro/Ultra are model SIZES; 1.0/1.5 are generations of the architecture.



The announcement suggests that 1.5 Pro is similar to 1.0 Ultra.


I am reaching a bit; however, I think it's a bit of a marketing technique. The Pro 1.5 being compared to the Ultra 1.0 model seems to imply that they will be releasing an Ultra 1.5 model, which will presumably have similar characteristics to the new Pro 1.5 model (MoE architecture with a huge context window).


Apparently the technical report implies that Ultra 1.5 is a step up again. I'm not sure it's just context length; that seems to be orthogonal, going by everything I've read so far.


Maybe this analogy would help: iPhone 15, iPhone Pro 15, iPhone Pro Max 15 and then iPhone Pro 15.5


So Pro and Ultra are, from my understanding, linked to the number of parameters. More parameters means more reasoning capability, but more compute needed.

So Pro is like the light and fast version and Ultra the advanced and expensive one.



I just watched the demo with the Apollo 11 transcript. (sidenote: maybe Gemini is named after the space program?).

Wouldn't the transcript, or at least a timeline, of Apollo 11 be part of the training corpus? So even without the 400 pages in the context window, just given the drawing, I would assume a prompt like "In the context of Apollo 11, what moment does the drawing refer to?" would yield the same result.



Gemini is named that way because of the collaboration between Google Brain and DeepMind.


Correct except that it spits out the timestamp


Gemini is named after NASA's second crewed spaceflight program -- pretty aptly named, but I'm not sure if this was the intention.


Google needs their Apollo.


I asked ChatGPT-4 to identify three humorous moments in the Apollo 11 transcript and it hallucinated all 3 of them (I think -- I can't find what it's referring to). Presumably the transcript is in its corpus, too.

> The "Snoopy" Moment: During the mission, the crew had a small, black-and-white cartoon Snoopy doll as a semi-official mascot, representing safety and mission success. At one point, Collins joked about "Snoopy" floating into his view in the spacecraft, which was a light moment reflecting the camaraderie and the use of humor to ease the intense focus required for their mission.

The "Biohazard" Joke: After the successful moon landing and upon preparing for re-entry into Earth's atmosphere, the crew humorously discussed among themselves the potential of being quarantined back on Earth due to unknown lunar pathogens. They joked about the extensive debriefing they'd have to go through and the possibility of being a biohazard. This was a light-hearted take on the serious precautions NASA was taking to prevent the hypothetical contamination of Earth with lunar microbes.

The "Mailbox" Comment: In the midst of their groundbreaking mission, there was an exchange where one of the astronauts joked about expecting to find a mailbox on the Moon, or asking where they should leave a package, playing on the surreal experience of being on the lunar surface, far from the ordinary elements of Earthly life. This comment highlighted the astronauts' ability to find humor in the extraordinary circumstances of their journey.



The context window size - if it really works as advertised - is pretty ground-breaking. It would replace the need for RAG or fine-tuning for one-off (or few-off) analys{is,es} of input streams, cheaper and faster. I wonder how they got past the input-token-stuffing problems everyone else runs into.


They are almost certainly using some form of sparse attention. If you linearize the attention operation, you can scale up to around 1-10M tokens, depending on hardware, before hitting memory constraints. Linearization works off the assumption that for a subsequence of X tokens out of M tokens, where M is much greater than X, there are likely only K tokens which are useful for the attention operation.

There are a bunch of techniques to do this, but it's unclear how well any of them scale.
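
As a rough illustration of the general idea (and explicitly not a claim about Google's actual mechanism), here is a toy top-K sparse attention in NumPy, where each query only attends to the K highest-scoring keys:

    # Toy top-K sparse attention. Illustrative only: a real implementation
    # (blockwise, ring, or linearized attention) avoids materializing the
    # full score matrix, which this sketch still does.
    import numpy as np

    def topk_sparse_attention(Q, K, V, k=16):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                      # (n_q, n_kv)
        kth = np.partition(scores, -k, axis=-1)[:, -k][:, None]  # k-th largest per query
        masked = np.where(scores >= kth, scores, -np.inf)  # drop everything else
        weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                                 # (n_q, d_v)

    rng = np.random.default_rng(0)
    Q = rng.standard_normal((4, 64))        # 4 queries
    K = rng.standard_normal((100_000, 64))  # long context of 100k keys
    V = rng.standard_normal((100_000, 64))
    print(topk_sparse_attention(Q, K, V).shape)  # (4, 64)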



Not "almost", but certainly. Dense attention is quadratic, not even Google would be able to run it at an acceptable speed. Their model is not recurrent - they did not have the time yet (or resources - believe it or not, Google of 2023-24 is very compute constrained) to train newer SSM or recurrent based models at practical parameter counts. Then there's the fact that those models are far harder to train due to instabilities, which is one of the reasons why you don't yet see FOSS recurrent/SSM models that are SOTA at their size or tokens/sec. With sparse attention, however, long context recall will be far from perfect, and the longer the context the worse the recall. That's better than no recall at all (as in a fully dense attention model which will simply lop off the preceding parts of the conversation), but not by a hell of a lot.


Maybe they are using ring attention on top of their 128k model.


More likely some clever take on RAG. There’s no way that 1M context is all available at all times. More likely parts of it are retrievable on demand. Hence the retrieval-like use cases you see in the demos. The goal is to find a thing, not to find patterns at a distance


RAG will stick around; at some point you want to retrieve grounded information samples to inject into the context window. RAG + long context just gives you more room for grounded context.

Think building huge relevant context on topics before answering.



Tbh, I haven't read the paper, but I think it's pretty self-evident that large contexts aren't cheap - the AI has to comb through every word of the context for each successive generated token at least once, so it's going to be at least linear.


vs RAG: RAG is good for searching across >billions of tokens and providing up-to-date information to a static model. Even with huge context lengths it's a good idea to submit high quality inputs to prevent the model from going off on tangents, getting stuck on contradictory information, etc..

vs fine tuning: smaller, fine-tuned models can perform better than huge models in a decent number of tasks. Not strictly fine-tuning, but for throughput limited tasks it'll likely still be better to prune a 70B model down to 2B, keeping only the components you need for accurate inference.

I can see this model being good for taking huge inputs and compressing them down for smaller models to use.
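
For context on what the retrieval step looks like, here is a minimal sketch using plain cosine similarity over embeddings; the embeddings below are random stand-ins for whatever embedding model you would actually use, and no specific vector database or product API is assumed.

    # Minimal RAG retrieval sketch: rank chunks by cosine similarity to the
    # query embedding and prepend the best ones to the prompt. Embeddings
    # here are random placeholders, not a real embedding model.
    import numpy as np

    def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
        q = query_vec / np.linalg.norm(query_vec)
        c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
        best = np.argsort(c @ q)[::-1][:k]   # indices of the most similar chunks
        return [chunks[i] for i in best]

    def build_prompt(question, retrieved):
        context = "\n\n".join(retrieved)
        return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    chunks = ["doc chunk A ...", "doc chunk B ...", "doc chunk C ..."]
    rng = np.random.default_rng(0)
    chunk_vecs = rng.standard_normal((len(chunks), 8))
    query_vec = rng.standard_normal(8)
    print(build_prompt("What does the design doc say about caching?",
                       top_k_chunks(query_vec, chunk_vecs, chunks, k=2)))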



It won't remove the use of RAG at all. That's like saying, "wow, now that I've upgraded my 128GB HDD to 1TB, I'll never run out of space again."


It's more like saying "I've upgraded to 128GB of RAM, I'll never use my disk again".


10 TB for an accurate proportion.

And I think people who buy a laptop with a 1TB SSD generally don't run out of space, at least I don't.



Saw testing earlier that suggested the context does indeed work right


This is the first time I've been legitimately impressed by one of Google's LLMs (with the obvious caveat that I'm taking the results reported in their tech report at face value).


It’s just marketing at this point, nothing to be impressed by. It’s a mistake to take at face value.


I see a lot of talk about retrieval over long context. Some even think this replaces RAG.

I don't care if the model can tell me which page in the book or which code file has a particular concept. RAG already does this. I want the model to notice how a concept is distributed throughout a text, and be able to connect, compare, contrast, synthesize, and understand all the ways that a book touches on a theme, or to rewrite multiple code files in one pass, without introducing bugs.

How does Gemini 1.5's reasoning compare to GPT-4? GPT-4 already has superhuman memory; its bottleneck is its relatively weak reasoning.



In my experience (I work mostly and deeply with Bard/Gemini), the reasoning capability of Gemini is quite good. Gemini Pro is already much better than ChatGPT 3.5, but it still makes quite a few mistakes along the way. What is more worrying is that when these models make mistakes, they try really hard to justify their reasoning (errors), practically misleading the user. Because of their high mimicry ability, users really have to pay attention to validate and eventually spot the errors. Of course, this is still far below the human level, so I'm not sure whether they add value or are more of a burden.


The most impressive demonstration of long context is this in my opinion,

https://imgur.com/a/qXcVNOM

Testing language translation abilities of an extremely obscure language after passing in one grammar book as context.



The long context length is of course incredible, but I'm more shocked that the Pro model is now on par with Ultra (~GPT-4, at least the original release). That implies when they release 1.5 Ultra, we'll finally have a GPT-4 killer. And assuming that 1.5 Pro is priced similarly to the current Pro, that's a 4x price advantage per-token.

Not surprising that OpenAI shipped a blog post today about their video generation — I think they're feeling considerable heat right now.



Gemini 1.0 Ultra was also said to be on par with GPT-4, and it's not really there, so let's see for ourselves when we can get our hands on it.


Ultra benchmarked around the original release of GPT-4, not the current model. My understanding is that was fairly accurate — it's close to current GPT-4 but not quite equal. However, close-to-GPT-4 but 4x cheaper and 10x context length would be very impressive and IMO useful.


No, it benchmarked around the original release of GPT-4 given 32 attempts versus GPT-4's 5.


Feeling the heat? Did you actually watch the videos? That was a huge leap forward compared to anything existing at the moment. Orders of magnitude away from a blog post discussing a model that maybe will finally be on par with GPT-4...


The OpenAI announcement is also more or less a blog post, isn't it?

Do we know how much time or money it takes to create a movie clip?



There was Sam Altman taking live prompt requests on Twitter and generating videos. They were not the same quality as some of the ones on the website, but they were still incredibly impressive.


And how much compute were those requests using?


I remember one of the biggest advantages with Google Bard was the heavily limited context window. I am glad Google is now actually delivering some exciting news with Gemini and this gigantic token size.

Sure, it's a bummer that they slap on the "Join the waiting list", but it's still interesting to read about their progress and competition with ClosedAI (OpenAI).

One last thing I hope they fix is the heavy moral and ethical guardrails; sometimes I can barely ask proper questions without triggering Gemini to educate me about what's right and wrong. And when I try the same prompt with ChatGPT and Bing AI, they happily answer.



"biggest advantages with Google Bard"

Did you mean disadvantages?



Yes, thanks.


For reference, here is the technical report: https://storage.googleapis.com/deepmind-media/gemini/gemini_...


Can anyone explain how context length is tested? Do they prompt something like:

"Remember val="XXXX" .........10M tokens later....... Print val"



Yep that's pretty much it! That's what they call needle in a haystack. See: https://github.com/gkamradt/LLMTest_NeedleInAHaystack
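
A minimal version of that test is easy to sketch; the model call below is left as a commented placeholder rather than a specific API, and the linked repo does the same thing far more rigorously by sweeping needle depth and context length.

    # Sketch of a needle-in-a-haystack eval: bury one fact at a random
    # position inside filler text, then check whether the model retrieves it.
    import random

    def build_haystack(n_filler_sentences, needle):
        filler = ["The sky was a pleasant shade of blue that day."] * n_filler_sentences
        pos = random.randint(0, n_filler_sentences)
        filler.insert(pos, needle)
        return " ".join(filler), pos

    needle = "The magic number is 48291."
    haystack, depth = build_haystack(5000, needle)
    prompt = haystack + "\n\nWhat is the magic number?"

    # answer = ask_model(prompt)   # placeholder for whatever LLM you are testing
    # print("PASS" if "48291" in answer else f"FAIL (needle at sentence {depth})")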


Yep, they hide things throughout the prompt and then ask it about that specific thing. Imagine hiding passwords in a giant block of text and then asking "what was Bob's password?" 10 million tokens later.

According to this it's remembering with 99% accuracy, which if you think about it is NUTS. Can you imagine reading 22 1,000-page books and remembering pretty much every single word that was said with that accuracy, lol.



Interestingly, there's a decent chance I'd remember if there was an out of context passage saying "the password is FooBar". I wonder if it would be better to test with minor edits? E.g., "what color shirt was X wearing when..."


I think instead you could just do a full doc of relationships. "Tina and Chris have five children named ..."

Then you can ask it who is Tina's (great)^57 grandmother's twice removed cousin on her father's side?

It would have to be able to remember the context of the relationships up and down the document and there'd be nothing to key into as you could ask about any relationship.



I feel you would recognise that more as a quirk of how humans think; remember that LLMs think fundamentally differently from you and me. I would be curious about someone making a benchmark like that and using it to compare as an experiment, however.


I'm not trying to anthropomorphize the model, but it's not hard to imagine that a model would attribute significance to something completely out of context, and hence "focus" on it when computing attention.

Another possible synthetic benchmark would be to present a list of key value pairs and then ask it for the value corresponding to different keys. Or present a long list of distinct facts and then ask it about them. This latter one could probably be sourced from something like a trivia question and answers data set. I bet there's something like that from Jeopardy.
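
A sketch of that key-value variant is below; it only generates the synthetic data and probe questions, with no particular model or API assumed.

    # Synthetic key-value recall benchmark: a long list of random key=value
    # pairs, plus probe questions about randomly chosen keys.
    import random
    import string

    def rand_token(n=8):
        return "".join(random.choices(string.ascii_lowercase, k=n))

    pairs = {rand_token(): rand_token() for _ in range(50_000)}
    context = "\n".join(f"{k} = {v}" for k, v in pairs.items())

    probe_keys = random.sample(list(pairs), 20)
    questions = [f"What value is assigned to '{k}'?" for k in probe_keys]
    expected = {k: pairs[k] for k in probe_keys}

    # Each question would be appended to `context` and sent to the model;
    # accuracy is the fraction of answers containing the expected value.
    print(len(pairs), "pairs,", len(questions), "probes")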



Yep, that’s actually a common one


Very simplified: there are arrays (matrices) of length 10M inside the model.

It’s difficult to make that array longer because training time explodes.
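
To make that concrete, here is some back-of-the-envelope arithmetic for a naive dense attention score matrix (rough, illustrative numbers only; nothing here describes Gemini's actual implementation):

    # Memory for a dense n x n attention score matrix in fp16
    # (2 bytes per entry), per head and per layer.
    for n in (128_000, 1_000_000, 10_000_000):
        terabytes = n * n * 2 / 1e12
        print(f"{n:>12,} tokens -> {terabytes:10.2f} TB per head per layer")
    # ~0.03 TB at 128k tokens, ~2 TB at 1M, ~200 TB at 10M

Which is one way to see why nobody runs truly dense attention at these lengths, and why the training-time point above matters.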


