One of the maintainers has a video demo on his Twitter claiming iOS, Android, and Linux support. Some of the code is not released, and I wish they were advertising that properly.
That's exciting. So we could build a SETI@home-style network out of even the largest models.
I wonder if training could be done this way too.
Whoops! I would have thought the number of neurons was roughly equal to the number of parameters, but you are right: the number of parameters is much higher.
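To make the gap concrete, here is a toy calculation (my own illustration with a made-up layer width, not anything from the thread): in a single dense layer, every output neuron carries one weight per input plus a bias, so parameters outnumber neurons by roughly the input width.

```python
# Toy illustration of neurons vs. parameters in one dense layer.
# Each output neuron has one weight per input plus a bias, so the
# parameter count is roughly (input width) times the neuron count.

inputs, outputs = 4096, 4096             # hypothetical layer width
neurons = outputs                        # one "neuron" per output unit
parameters = inputs * outputs + outputs  # weight matrix + bias vector

print(f"neurons:    {neurons:,}")        # 4,096
print(f"parameters: {parameters:,}")     # 16,781,312
```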
I might just be too tired from waking up, but I can’t for the life of me find any details on that site about what models are actually being served by the horde?
Excellent points. Being able to use available hardware in unison is amazing, and I guess we are not far away from botnets utilising this kind of technology the way they did with coin mining.
GP is likely referring to network latency here. There's a tradeoff between smaller GPUs and the like at home, which add no latency to use, and beefier hardware in the cloud, which has a minimum latency to use.
We are running smaller models with software we wrote (self-plug alert: https://github.com/singulatron/singulatron) with great success. These models sometimes make obvious mistakes (such as the one in our repo image, haha), but they can also be surprisingly versatile in areas you don't expect them to be, like coding.
Our demo site uses two NVIDIA GeForce RTX 3090s and our whole team is hammering it all day. The only problem is occasionally high GPU temperature. I don't think the picture is as bleak as you paint it. I actually expect Moore's Law and better AI architectures to bring on a self-hosted AI revolution in the next few years.
Also customizability. Sure, you can fine-tune the cloud-hosted models (to a certain degree of freedom), but it will probably be expensive, inefficient, difficult, and unmaintainable.
Cloud cannot be beaten on compute per dollar, but moving to local could solve privacy issues, and the world needs a second amendment for compute anyway.
Not necessarily: you can fine-tune on a general domain of knowledge (people already do this and open-source the results), then use on-device RAG to give it specific knowledge within the domain.
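For what it's worth, here is a minimal sketch of that pattern (the embedding model, the documents, and the prompt format are my own illustrative assumptions, not anything the commenter specified): embed a handful of local documents, retrieve the closest ones for a query, and prepend them to the prompt for whatever generally fine-tuned model runs on the device.

```python
# Minimal on-device RAG sketch: retrieve local documents by embedding
# similarity, then prepend them to the prompt for a locally hosted LLM.
import numpy as np
from sentence_transformers import SentenceTransformer  # runs fully offline once downloaded

documents = [
    "Our VPN config lives in /etc/wireguard/wg0.conf.",
    "The quarterly report template is stored on the shared drive under /reports.",
    "Support tickets are triaged every Monday at 9am.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # small model that fits on-device
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "Where is the VPN configuration?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` would then be fed to whichever general-purpose model is running locally.
print(prompt)
```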
That’s where we want to get eventually. There’s a lot of work that needs to be done, but I’m confident we’ll get there. Give us three months and it’ll be as simple as running Dropbox.
So I just tried with two MacBook Pros (M2 64GB and M3 128GB) and it was exactly the same speed as with just one MacBook Pro (M2 64GB). Not exactly a common setup, but at least it's something.
I look forward to something similar being developed on top of Bumblebee and Axon, which I expect is just around the corner. Because, for me, Python does not spark joy.
I avoid touching Apple devices, but anything that can expose a Linux shell can run the BEAM. There are two main projects for small devices: https://nerves-project.org/ for more ordinary SoC computers and https://www.atomvm.net/ for things like ESP32 chips.
On Android you've got Termux in F-Droid and can pull in whatever BEAM setup you want. That's how I first started dabbling with the BEAM: I was using a tablet for most of my recreational programming, happened to try it out, and got hooked. Erlang is pretty weird, but it just clicks for some people, so it's worth spending some time checking it out. Elixir is a really nice Python-/Ruby-like on the BEAM, but with pattern matching, real macros, and all the absurdly powerful stuff in the Open Telecom Platform.
Try it out - don't trust me!
The way this works is that each device holds a partition of the model (for now, a contiguous set of layers). For example, say you have 3 devices and the model has 32 layers. Device 1 could hold layers 1-10, device 2 layers 11-20, and device 3 layers 21-32. Each device executes the layers it's responsible for and passes the output of its last layer (the activations) on to the next device. The activations are ~8KB for Llama-3-8B and ~32KB for Llama-3-70B (activation size is roughly linear in that layer's parameter count, and Llama-3-70B also has more layers). Generally, the larger the model gets in terms of total parameters, the more layers it ends up having, so activation size scales sub-linearly with parameter count; I expect Llama-3-405B to have activations on the order of ~100KB. This is totally acceptable to send over a local network. The main issue you run into is latency, not bandwidth. Since LLMs are autoregressive (tokens are generated serially), additional latency limits throughput. However, over a local network latency is generally very low (<5ms in my experience). And even if it isn't, this can still be useful depending on the use case, since you can get a lot of throughput with pipeline parallelism (overlapping requests): https://pytorch.org/docs/stable/pipeline.html
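As an illustration of the scheme described above, here is a toy simulation (plain NumPy matrices stand in for transformer blocks and function calls stand in for network hops; the 3-device/32-layer split mirrors the example in the comment, while the toy hidden size is my own assumption to keep it fast):

```python
# Toy simulation of the partitioning scheme: each "device" owns a contiguous
# slice of layers, runs them, and passes only the activations to the next one.
# HIDDEN is kept small so this runs instantly; Llama-3-8B's real hidden size
# is 4096, which at fp16 gives the ~8 KB per token mentioned above.
import numpy as np

HIDDEN = 256                 # toy width (real Llama-3-8B: 4096)
N_LAYERS = 32
rng = np.random.default_rng(0)

# Stand-ins for transformer blocks: one small weight matrix each.
layers = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.05 for _ in range(N_LAYERS)]

# Contiguous split across 3 devices, as in the example above.
partitions = [layers[0:10], layers[10:20], layers[20:32]]

def run_device(partition, activations):
    """Run this device's slice of layers; the return value is all that crosses the network."""
    for w in partition:
        activations = np.tanh(activations @ w)   # placeholder for a full transformer block
    return activations

x = rng.standard_normal(HIDDEN)                  # activations for a single token
for device_id, part in enumerate(partitions):
    x = run_device(part, x)
    print(f"device {device_id} sends {x.astype(np.float16).nbytes} bytes of activations")

# With the real hidden size, the per-hop payload is still tiny:
print(f"fp16 payload at hidden size 4096: {4096 * 2 / 1024:.0f} KB per token")
```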
Is this sensible? Transformers are memory-bandwidth bound. Schlepping activations around your home network (which is liable to be lossy) seems like it would result in atrocious TPS.
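As a rough back-of-envelope on that concern (all numbers below are my own illustrative assumptions, not measurements from the project): a few tens of KB of activations per token costs a couple of milliseconds to transfer even on modest Wi-Fi, so the fixed per-hop latency tends to matter more than bandwidth, which is consistent with the latency point made upthread.

```python
# Rough back-of-envelope: per-token cost of shipping activations between two
# devices on a home network, versus local generation time. All numbers are
# illustrative assumptions, not measurements.
activation_kb  = 32     # ~Llama-3-70B activations per token (figure from the comment above)
bandwidth_mbps = 100    # pessimistic Wi-Fi throughput
latency_ms     = 5      # per-hop network latency
local_tok_ms   = 100    # ~10 tokens/s generation on the local hardware

transfer_ms = activation_kb * 8 / (bandwidth_mbps * 1000) * 1000  # KB -> ms at given Mbps
per_hop_ms  = latency_ms + transfer_ms

print(f"transfer time per hop:  {transfer_ms:.2f} ms")                      # ~2.6 ms
print(f"total per-hop overhead: {per_hop_ms:.2f} ms on a {local_tok_ms} ms token")
```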
It requires mlx, but that is an Apple-silicon-only library as far as I can tell. How is it supposed to be (I quote) "iPhone, iPad, Android, Mac, Linux, pretty much any device"? Has it been tested on anything other than the author's MacBook?