No opinion on the specifics of this distinction, but it's worth noting that in research, an awful lot of successful projects have their origins in failed projects of decades ago...
That is exactly how LLM inference is performed, so I'm being cheeky. (I'm 99% sure anyone proposing anything in this thread is handwaving based on limited understanding.)
Yes, I think training this model would be hard. Perhaps something akin to how MoEs are trained, where you impose an auxiliary loss on the routing distribution to encourage equitable routing, but for recursion.
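For concreteness, here is a minimal sketch of the kind of auxiliary load-balancing loss used in Switch-Transformer-style MoE training; the function name and shapes are illustrative, and the idea of applying it to recursion depth is pure speculation on my part:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss over router logits of shape
    (num_tokens, num_experts). It is minimized when both the fraction of
    tokens routed to each expert and the mean router probability per expert
    are uniform (1 / num_experts)."""
    probs = F.softmax(router_logits, dim=-1)                           # (tokens, experts)
    chosen = probs.argmax(dim=-1)                                      # hard assignment per token
    frac_tokens = F.one_hot(chosen, num_experts).float().mean(dim=0)   # f_i: fraction routed to expert i
    mean_probs = probs.mean(dim=0)                                     # P_i: mean router prob for expert i
    return num_experts * torch.sum(frac_tokens * mean_probs)

# For the recursion idea above, one could treat the number of extra passes
# (0, 1, 2, ...) as the "expert" axis and apply the same loss to a depth
# router's logits, so tokens don't all collapse onto a single depth.
```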
Attention is basically routing; these other routing schemes give the model a less fine-grained choice, which potentially makes it easier to train.
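As a toy contrast (illustrative shapes and names only): attention makes a soft, fine-grained routing decision over every token, whereas an MoE-style router makes one coarse expert choice per token:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d, E = 6, 16, 4                                     # tokens, hidden dim, experts

# Attention: fine-grained soft routing, one weight per (query, key) pair.
q, k = torch.randn(T, d), torch.randn(T, d)
attn_weights = F.softmax(q @ k.T / d ** 0.5, dim=-1)   # (T, T) routing matrix

# MoE-style router: a much coarser decision, one expert index per token.
x = torch.randn(T, d)
router = torch.nn.Linear(d, E)
expert_choice = router(x).argmax(dim=-1)               # (T,) hard routing decision

print(attn_weights.shape, expert_choice.shape)         # torch.Size([6, 6]) torch.Size([6])
```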
The trendline is definitely toward increasingly dynamic routing, but I suspect it's more that MoE/MoD/MoDE let models embed additional facts with less superposition within their weights than that they enable deeper reasoning. Instead I expect deeper reasoning will come through token-wise dynamism rather than layer-wise, e.g. this recent Quiet-STaR paper, in which the model outputs throwaway rationale tokens: https://arxiv.org/abs/2403.09629
I understand this is ELI5, but doesn’t attention already do this, in the way you described? It puts specific focus on the most contextually relevant words in the prior sequence.
Nice writing. Reminds me of New Scientist style (I like NS, so that is a compliment): the “explain as you go along, but be brief” style, which is nice for getting a feel for the space.
It’s very similar to Mixture of Experts, but instead of routing tokens to multiple experts, you "deploy to a single expert which can be dynamically skipped".
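A minimal sketch of that idea (Mixture-of-Depths-style skip routing), assuming a PyTorch block and a fixed top-k capacity; the class and argument names are made up for illustration, and the real method also scales the block output by the router score so the router gets gradients:

```python
import torch

class MoDLayer(torch.nn.Module):
    """One 'expert' (the wrapped block) that most tokens may skip:
    a router scores every token, only the top `capacity` fraction is
    processed by the block, the rest ride the residual stream unchanged."""

    def __init__(self, block: torch.nn.Module, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.block = block
        self.router = torch.nn.Linear(d_model, 1)
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = self.router(x).squeeze(-1)                  # (batch, seq)
        k = max(1, int(self.capacity * x.shape[1]))
        topk = scores.topk(k, dim=1).indices                 # tokens that use the block
        out = x.clone()
        for b in range(x.shape[0]):
            idx = topk[b]
            out[b, idx] = x[b, idx] + self.block(x[b, idx])  # process only selected tokens
        return out                                           # skipped tokens pass through unchanged

d = 32
layer = MoDLayer(torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d)), d_model=d)
print(layer(torch.randn(2, 10, d)).shape)                    # torch.Size([2, 10, 32])
```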
The funny thing is, I have 8 3090s, which last epoch would have put me in something like the top 1% of compute. That's still a lot of compute, but it pales in comparison to the 100x H100 GPU clusters we're seeing today.
It doesn’t. It simply trades away compute efficiency by transposing matrix multiplications into “the future.” It doesn’t actually save FLOPs (it uses more), and it doesn’t work at large batch sizes.
Specifically, I think at some point we are going to move to recursive routing, i.e. passing back through a set of experts again. In the future, 'chain-of-thought' will happen internally to the model, recursively.
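Purely as a toy sketch of what 'recursive routing' might look like (the names and the halting rule are invented for illustration; this is closer to adaptive-computation-time-style halting than anything in current LLMs):

```python
import torch

class RecursiveExperts(torch.nn.Module):
    """Toy recursive routing: keep looping the hidden state through the
    experts until a halting head says stop (or max_passes is reached)."""

    def __init__(self, experts: torch.nn.ModuleList, d_model: int, max_passes: int = 4):
        super().__init__()
        self.experts = experts
        self.halt = torch.nn.Linear(d_model, 1)      # "keep thinking?" score
        self.max_passes = max_passes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        for _ in range(self.max_passes):
            for expert in self.experts:
                x = x + expert(x)                    # residual pass through each expert
            p_continue = torch.sigmoid(self.halt(x)).mean()
            if p_continue < 0.5:                     # crude inference-time halting rule
                break
        return x

d = 32
experts = torch.nn.ModuleList([
    torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d))
    for _ in range(2)
])
model = RecursiveExperts(experts, d_model=d)
print(model(torch.randn(2, 8, d)).shape)             # torch.Size([2, 8, 32])
```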