Yeah, but the other side of the coin is that they only explain the very basic concepts that have been settled for several years, not any of these "latest trends".
Llamaindex docs are absolutely terrible IMO. I have gone through them so many times but still do not understand the terms and organization. A router for querying? A router query engine?
I suppose, with an eye on open source, an interesting 'rule' would be to set a cut-off at models that can run locally, and/or are considered likely to be feasible to run locally soon.
> It is not designed to tell you the "most likely" text that follows, and we don't actually want that. This would mean you have no diversity in your outputs.

No, we specifically do want "most likely" to follow; the goal is to approximate Solomonoff induction as well as possible. See this recent paper by Hutter's team: https://arxiv.org/pdf/2401.14953

Quote from the paper: "LLMs pretrained on long-range coherent documents can learn new tasks from a few examples by inferring a shared latent concept. They can do so because in-context learning does implicit Bayesian inference (in line with our CTW experiments) and builds world representations and algorithms (necessary to perform SI [Solomonoff Induction]). In fact, one could argue that the impressive in-context generalization capabilities of LLMs is a sign of a rough approximation of Solomonoff induction."

> In your example, sampling a 0 in 40% of cases and a 1 in 60% of cases does[n't] make sense for chat applications.

I didn't say anything about sampling. A sequence prediction model represents a mapping between an input sequence and a probability distribution over all possible output sequences up to a certain length. My example uses a binary alphabet, but LLMs use an alphabet of tokens. Any chat application that expresses its output as a string of concatenated symbols from a given alphabet has a probability distribution defined over all possible output sequences. I'm simply comparing the fundamental limitations of any approach to inference that restricts its outcome space to sequences consisting of one symbol (and then layers on a meta-model to generate longer sequences by repeatedly calling the core inference capability) vs an approach that performs inference over an outcome space consisting of sequences longer than one symbol.
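To make the single-symbol vs. whole-sequence point concrete, here is a toy sketch with made-up numbers (a hypothetical illustration, not the comment's own example): greedily committing to the most likely next symbol one step at a time can miss the most likely full sequence, even though both views describe the same distribution.

```python
# Toy illustration (hypothetical numbers): a binary alphabet {0, 1} and
# output sequences of length 2. The same distribution can be viewed as a
# joint over whole sequences or factored into per-symbol conditionals, but
# an inference procedure that only ever commits to the single most likely
# *next* symbol can miss the most likely *sequence*.

joint = {          # direct distribution over the four length-2 sequences
    (0, 0): 0.36,
    (0, 1): 0.04,
    (1, 0): 0.25,
    (1, 1): 0.35,
}

# Per-symbol view derived from the joint: P(x1) and P(x2 | x1).
p_x1 = {s: sum(p for (a, _), p in joint.items() if a == s) for s in (0, 1)}
p_x2_given_x1 = {s: {t: joint[(s, t)] / p_x1[s] for t in (0, 1)} for s in (0, 1)}

# Greedy, one symbol at a time: most likely first symbol, then the most
# likely second symbol given the first.
first = max(p_x1, key=p_x1.get)                                   # 1 (0.60 vs 0.40)
second = max(p_x2_given_x1[first], key=p_x2_given_x1[first].get)  # 1 (0.583 vs 0.417)
greedy_seq = (first, second)

# Inference over the whole outcome space of length-2 sequences.
best_seq = max(joint, key=joint.get)

print(greedy_seq, joint[greedy_seq])   # (1, 1) 0.35
print(best_seq, joint[best_seq])       # (0, 0) 0.36
```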
> Why it's maximal is not in the model at all, nor the data

> It replays the data to us and we suppose the LLM must have the property that generates this data originally.

So to clarify, what you're saying is that under the hood, an LLM is essentially just performing a search for similar strings in its training data and regurgitating the most commonly found one? Because that is demonstrably not what's happening. If this were 2019 and we were talking about GPT-2 it would be more understandable, but SoTA LLMs can in-context learn and translate entire languages which aren't in their dataset.

Also, re: inference time, when you give transformers more compute for an individual token, they perform better: https://openreview.net/forum?id=ph04CRkPdC
I think what you suggest would be very similar to an encoder-decoder architecture, which has been abandoned in favor of decoder-only architectures (https://cameronrwolfe.substack.com/p/decoder-only-transforme...). So I am guessing that what you suggest has already been tried and didn't work out, but I am not sure why (the problems I mentioned above, or something else). Sorry, that's where the limit of my knowledge is. I work on ML stuff, but mostly on "traditional" deep learning, so I am not up to speed with the genAI field (also, the sheer amount of papers coming out makes it basically impossible to stay up to date if you're not in the field).
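For readers unfamiliar with the distinction, here is a rough structural sketch in PyTorch (layer sizes and depths are arbitrary; this is not a faithful reproduction of T5, GPT, or any particular model): an encoder-decoder stack cross-attends to a separately encoded input, while a decoder-only stack is just causally masked self-attention over one sequence.

```python
# Rough structural sketch of the two architectures; shapes and depths are
# arbitrary assumptions for illustration only.
import torch
import torch.nn as nn

d_model, nhead, seq_len = 64, 4, 10
x = torch.randn(seq_len, 1, d_model)  # (sequence, batch, features)
causal = nn.Transformer.generate_square_subsequent_mask(seq_len)

# Encoder-decoder: the decoder cross-attends to a separately encoded input
# sequence ("memory"), which is processed bidirectionally.
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead), num_layers=2)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead), num_layers=2)
memory = encoder(x)
y = decoder(x, memory, tgt_mask=causal)

# Decoder-only: a single causally masked self-attention stack; prompt and
# continuation live in the same sequence and there is no cross-attention.
# (In PyTorch terms this is just an "encoder" stack with a causal mask.)
decoder_only = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead), num_layers=2)
z = decoder_only(x, mask=causal)
```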
You still need to convert between words and sentence vectors somehow. You could try using a faster model for that, but I suspect that the output quality will suffer.
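As a minimal illustration of the words-to-sentence-vector direction (the toy vocabulary, embedding size, and mean-pooling choice below are assumptions, not a recommendation), one could pool token embeddings; the reverse direction, decoding a vector back into words, needs a trained decoder and is where a faster model would likely cost quality.

```python
# Minimal sketch: words -> sentence vector via mean-pooled token embeddings.
# Vocabulary, dimensions, and pooling are arbitrary illustrative choices.
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}   # toy vocabulary
embed = nn.Embedding(len(vocab), 32)                  # token embedding table

def sentence_vector(words):
    ids = torch.tensor([vocab.get(w, vocab["<unk>"]) for w in words])
    return embed(ids).mean(dim=0)                     # one 32-d vector per sentence

print(sentence_vector(["the", "cat", "sat"]).shape)   # torch.Size([32])
```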
I wonder if, instead of just predicting the next n tokens, it could also predict, say, 128, 512, or 2048 tokens ahead, thus learning long-term discourse structure.
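A hypothetical sketch of how such multi-horizon prediction might look as auxiliary training losses; the horizons, the linear stand-in for a transformer trunk, and the head shapes are all illustrative assumptions, not taken from any paper.

```python
# Hypothetical sketch: besides the usual next-token head, add auxiliary heads
# that predict the token k positions ahead for several horizons, so the
# training signal also rewards longer-range structure.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
horizons = (1, 128, 512, 2048)

trunk = nn.Linear(d_model, d_model)   # stand-in for a real transformer trunk
heads = nn.ModuleDict({str(k): nn.Linear(d_model, vocab_size) for k in horizons})

def multi_horizon_loss(hidden, tokens):
    """hidden: (seq, d_model) trunk inputs; tokens: (seq,) target token ids."""
    h = trunk(hidden)
    loss = torch.tensor(0.0)
    for k in horizons:
        if h.shape[0] <= k:
            continue                                  # sequence too short for this horizon
        logits = heads[str(k)](h[:-k])                # predict the token at position t + k
        loss = loss + F.cross_entropy(logits, tokens[k:])
    return loss

loss = multi_horizon_loss(torch.randn(300, d_model), torch.randint(0, vocab_size, (300,)))
```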
How does that still end up making grammatical sense? If token/word +1 and +2 are predicted independently, then surely it often won't?
After inventing multi-token prediction, one then invents a useful language-oriented hierarchy (such as sections, paragraphs, sentences, and words).
It's interesting that they got good results on a 200B, 0.8-epoch training set, but once they scaled it to 1T and 4 epochs, they got degradation on the vast majority of benchmarks (Table 1).