I dropped VSCode when I found out that the remote editing and language server extensions were both proprietary. Back to vim, and sorry I strayed.

The way I read it, the message you replied to was a complaint about parts of VSCode being proprietary. Do you mean to say JetBrains is pretty OK on the "not being proprietary" front?

If anyone is interested in trying local AI, you can give https://recurse.chat/ a spin.

It lets you use local llama.cpp without any setup, chat with PDFs offline, and organize chat history in nested folders; it can handle thousands of conversations. You can also import your ChatGPT history and continue those chats with local AI.

Often you'll find there are '-chat-' and '-instruct-' variants of an LLM available.

Trying to chat with an INSTRUCT model will be disappointing, much as you describe.

The post you're replying to couldn't have made it any easier to answer these questions yourself. No, it won't be as good as the state of the art with massive cloud infrastructure behind an HTTP API.

@the_gorilla: I don't consider BDSM to be 'degenerate' or violent; it's all incredibly consensual and careful.

It's just that the LLMs trigger immediately on minor words and shut down completely.

I should have qualified the meaning of “works perfectly” :) No 70B for me, but I am able to experiment with many quantized models (and I am using a Llama successfully; latency isn’t terrible).

llamafile will run on all architectures because it is compiled with Cosmopolitan.

https://github.com/jart/cosmopolitan: "Cosmopolitan Libc makes C a build-once run-anywhere language, like Java, except it doesn't need an interpreter or virtual machine. Instead, it reconfigures stock GCC and Clang to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + NetBSD + BIOS with the best possible performance and the tiniest footprint imaginable."

I use it just fine on a Mac M1. The only bottleneck is how much RAM you have. I use Whisper for podcast transcription and Llama for code completion, general Q&A, and code assistance. You can use the LLaVA models to ingest images and describe them.

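For anyone who wants to script against it: recent llamafiles embed llama.cpp's server, which exposes an OpenAI-compatible endpoint. A minimal sketch in Python, assuming the server is listening on the default port 8080 (the model name field below is a placeholder; single-model servers generally accept any value):

```python
# Query a llamafile's OpenAI-compatible chat endpoint.
# Assumes the llamafile is running in server mode on localhost:8080;
# check your llamafile's --help output if the port differs.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # placeholder model name
        "messages": [
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": "Explain what a GGUF file is in two sentences."},
        ],
        "temperature": 0.2,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```
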
How do we rate whether the smaller models are any good? How many questions do we need to ask one to know that it can be trusted and that we didn't waste our time on it?

I'm interested, but I can't find any documentation for it. Can I give it local content (documents, spreadsheets, code, etc.) and ask questions?

Isn’t there also some Firefox AI integration that’s being tested by one dev out there? I forgot the name and wonder if it got any traction.

Same. My husky/pyr mix needs a lot of exercise, so I'm outside a minimum of a few hours a day. As a result I do a lot of dictation on my phone.

I put together a script that takes any audio file (mp3, wav), normalizes it, runs it through ggerganov's whisper.cpp, and then cleans it up using a local LLM. This has saved me a tremendous amount of time. Even modestly sized 7B-parameter models can handle syntactical/grammatical work relatively easily. Here's the gist: https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...

EDIT: I've always talked out loud through problems anyway; throw a BT earbud on and you'll look slightly less deranged.

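Not the gist above, but a rough sketch of the same pipeline in Python, assuming ffmpeg and a whisper.cpp CLI build are installed and an Ollama server is running locally; the binary name, model file, and cleanup prompt are all placeholders:

```python
# Audio -> normalized WAV -> whisper.cpp transcript -> LLM cleanup.
import subprocess
import sys
import requests

audio_in = sys.argv[1]          # e.g. memo.mp3
wav_16k = "normalized.wav"

# 1. Normalize to 16 kHz mono WAV, which whisper.cpp expects.
subprocess.run(
    ["ffmpeg", "-y", "-i", audio_in, "-ar", "16000", "-ac", "1", wav_16k],
    check=True,
)

# 2. Transcribe with whisper.cpp (binary and model names vary by build).
subprocess.run(
    ["./whisper-cli", "-m", "models/ggml-base.en.bin",
     "-f", wav_16k, "-otxt", "-of", "transcript"],
    check=True,
)
raw_text = open("transcript.txt").read()

# 3. Clean up grammar/punctuation with a local LLM via Ollama.
prompt = ("Fix grammar and punctuation in this dictation without adding "
          "or removing ideas:\n\n" + raw_text)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
    timeout=300,
)
print(resp.json()["response"])
```
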
My hope is to make this easy with a GH repo or at least detailed instructions.

I'm on a Mac, and I found the easiest way to run and use local models is Ollama, since it has a REST interface: https://github.com/ollama/ollama/blob/main/docs/api.md

I just have a local script that pulls the audio file from Voice Memos (after it syncs from my iPhone), runs it through OpenAI's Whisper (really the best at speech-to-text; excellent results), and then makes sense of it all with a prompt that asks for organized summary notes and todos in GitHub-flavored Markdown. That final output goes into my Obsidian vault.

The model I use is llama3.1, but I haven't spent much time testing others. I find you don't really need the largest models, since the task is to organize text rather than augment it with a lot of external knowledge.

Humorously, the harder part of the process was finding where the hell Voice Memos actually stores these audio files. I wish you could set the location yourself! They live deep inside ~/Library/Containers. Voice Memos has no export feature, but I found you can drag any audio recording out of the left sidebar to the desktop or a folder. So I just drag the voice memo into a folder my script watches, and then it runs the automation. If anyone has another, better option for recording your voice on an iPhone, let me know!

The nice thing about all this is you don't ever have to start/stop the recording on your walk... just leave it going. Dead space, side conversations, and commands to your dog are all handled well and never seem to pollute my notes.

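A hedged sketch of the watched-folder part of that workflow; the folder paths, model name, and the `transcribe()` helper are placeholders, and it assumes Ollama's documented /api/generate endpoint on the default port:

```python
# Poll a drop folder for voice memos, transcribe, summarize, save to vault.
import time
from pathlib import Path
import requests

WATCH_DIR = Path.home() / "VoiceMemoDrops"      # where memos get dragged
VAULT_DIR = Path.home() / "Obsidian" / "Inbox"  # Obsidian vault folder
seen: set[Path] = set()

def transcribe(path: Path) -> str:
    """Placeholder: run Whisper here and return the transcript text."""
    raise NotImplementedError

def summarize(transcript: str) -> str:
    prompt = ("Turn this transcript into organized summary notes and a TODO "
              "list in GitHub-flavored Markdown:\n\n" + transcript)
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return r.json()["response"]

while True:
    for memo in WATCH_DIR.glob("*.m4a"):
        if memo in seen:
            continue
        (VAULT_DIR / f"{memo.stem}.md").write_text(summarize(transcribe(memo)))
        seen.add(memo)
    time.sleep(30)
```
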
This is exactly why I think the AI pins are a good idea. The Humane pin seems too big/too expensive/not quite there yet, but for exactly what you're doing, I would like some type of brooch.

You can use llama.cpp; it runs on almost all hardware. whisper.cpp is similar, but unless you have a mid- or high-end Nvidia card it will be a bit slower.

Still very reasonable on modern hardware.

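For the llama.cpp side, a minimal CPU-only sketch using the llama-cpp-python bindings (the GGUF path is a placeholder; raise n_gpu_layers if you have a GPU build):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=0,  # 0 = pure CPU, the "runs on almost all hardware" case
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "One-line summary of what mmap does?"}]
)
print(out["choices"][0]["message"]["content"])
```
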
It’s mostly "how would you solve this programming problem?", reminders on syntax, scaffolding a configuration file, etc.

Often it’s a form of rubber-duck programming, with a smarter rubber duck.

| "Dropbox will never work, you can already build such a system yourself quite trivially by getting an FTP account and mounting it locally with curlftpfs" |
I would prefer to have some personal recommendations - I've had some success with Llama3.1-8B/8-bit and Llama3.1-70B/1-bit, but this is a fast-moving field, so I think the details are worth asking for.

You moved the goalposts when you added 'multimodal' there; also, nothing reads PDF tables and illustrations perfectly, at any price, AFAIK.

Supposedly submitting screenshots of PDFs (at a large enough zoom per tile/page) to OpenAI's GPT-4o or Google's equivalent is currently the best way of handling charts and tables.

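A sketch of that approach, rendering one PDF page at high DPI with PyMuPDF and sending it to GPT-4o; the file name, DPI, and prompt are arbitrary choices, and it needs OPENAI_API_KEY set:

```python
import base64
import fitz  # PyMuPDF
from openai import OpenAI

# Render page 1 of a placeholder PDF to a PNG at 200 DPI.
png = fitz.open("report.pdf")[0].get_pixmap(dpi=200).tobytes("png")
data_url = "data:image/png;base64," + base64.b64encode(png).decode()

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract every table on this page as Markdown."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
)
print(resp.choices[0].message.content)
```
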
Synthetic data (data from some kind of generative AI) has been used in some form or another for quite some time [0]. The license for LLaMA 3.1 has been updated to specifically allow its use for generating synthetic training data. Famously, OpenAI's ToS has a clause prohibiting the use of its models to generate training data for other models, but it's not enforced at the moment. It's pretty common to look through a model card, paper, etc. and see an LLM or other generative AI used for some form of synthetic data generation in the development process - various stages of data prep, training, evaluation, etc.

Phi is another really good example, but that's already covered in the article.

[0] - https://www.latent.space/i/146879553/synthetic-data-is-all-y...

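A toy illustration of the idea (not any particular lab's recipe): use a larger local model, here via Ollama with an assumed model name, to mass-produce Q&A pairs that could later be used to train or fine-tune a smaller model:

```python
import json
import requests

seed_topics = ["binary search", "HTTP caching", "SQL joins"]
pairs = []

for topic in seed_topics:
    prompt = (f"Write one question and a concise answer about {topic}. "
              'Respond as JSON: {"question": "...", "answer": "..."}')
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1:70b", "prompt": prompt,
              "format": "json", "stream": False},
        timeout=300,
    )
    pairs.append(json.loads(r.json()["response"]))

# JSONL is a common format for downstream fine-tuning pipelines.
with open("synthetic_pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```
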
I think it can be a tradeoff to get to smaller models: use larger models trained on the whole internet to produce output that trains the smaller model.

One use case I've found very convenient: partial screenshot |> minicpm-v

Covers 90% of OCR needs with 10% of the effort. No API keys, scripting, or network required.

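If you do want to script that step, here is roughly what it looks like through Ollama's local HTTP API (localhost:11434 assumed; the screenshot path is whatever your screenshot tool just saved):

```python
# partial screenshot |> minicpm-v, approximately
import base64
import requests

img_b64 = base64.b64encode(open("region.png", "rb").read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "minicpm-v",
        "prompt": "Transcribe all text in this image exactly as written.",
        "images": [img_b64],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```
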
What specs do people here recommend to run small models like Llama 3.1 or mistral-nemo?

Also, is it sensible to wait for the newer Mac, AMD, and Nvidia hardware releasing soon?

M4s are releasing in probably a month or two; if you’re going Apple, it might be worth waiting for either those or the price drop on the older models.

You actually need a lot less than that if you use the mmap option, because then only the activations need to live in RAM; the model weights are memory-mapped and read from disk as needed.

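In llama-cpp-python terms, that's the use_mmap flag; the model path below is a placeholder:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-70b.Q4_K_M.gguf",  # placeholder
    use_mmap=True,    # map weights from disk and page them in on demand
    use_mlock=False,  # don't pin pages, so the OS can evict cold weights
    n_ctx=2048,
)
```
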
What local models is everyone using?

The last one I used was Llama 3.1 8B, which was pretty good (I have an old laptop). Has there been any major development since then?

That is... probable, if you bought a newish M2 to replace your 5-6 year old MacBook Pro, which is now just lying around. Or maybe you and your spouse can share CPU hours.

And then the successful ChatGPT wrappers with traction will become more valuable than the companies creating proprietary LLMs. I bet OpenAI will start buying many AI apps to find profitable niches.

Edit: I just found this. I'll give it a try today: https://github.com/0ssamaak0/SiriLLama

---

Open WebUI has a voice chat, but the voices are not great. I'm sure they'd love a PR that integrates StyleTTS2. You can give it a Serper API key and it will search the web to use as context.

It connects to Ollama running on a Linux box with a $300 RTX 3060 with 12GB of VRAM. The 4-bit quant of Llama 3.1 8B takes up a bit more than 6GB of VRAM, which means the card can also run embedding models and STT at the same time. 12GB is the minimum I'd recommend for running quantized models. The RTX 4070 Ti Super is 3x the cost but 7 times "faster" on matmuls. The AMD cards do inference OK, but they are a constant source of frustration when trying to do anything else. I bought one and tried for 3 months before selling it. It's not worth the effort.

I don't have any interest in allowing it to run shortcuts. Open WebUI has pipelines for integrating function calling. HomeAssistant has some integrations if that's the kind of thing you are thinking about.

It isn't clear whether you can know when the task gets handed off to their servers. But yeah, that'd be the closest I know of. I'm not sure it would build a local knowledge base, though.

I run a fine-tuned multimodal LLM as a spam filter (it reads emails as images). Game changer. It removes all the stuff I wouldn't read anyway, not only spam.

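Not the poster's fine-tuned model, but a sketch of the same idea with a stock vision model through Ollama: render the email to a PNG first (e.g. with wkhtmltoimage), then ask for a keep/discard verdict:

```python
import base64
import requests

email_png = base64.b64encode(open("email.png", "rb").read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",  # stand-in for a fine-tuned multimodal model
        "prompt": ("You are an aggressive email triage filter. Reply with "
                   "exactly one word, KEEP or DISCARD, for this email."),
        "images": [email_png],
        "stream": False,
    },
    timeout=120,
)
print("discard" if "DISCARD" in resp.json()["response"].upper() else "keep")
```
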
For anyone looking for a simple alternative for running local models beyond just text, Nexa AI has built an SDK that supports text, audio (STT, TTS), image generation (e.g., Stable Diffusion), and multimodal models! It also has a model hub to help you easily find local models optimized for your device.

Nexa AI local model hub: https://nexaai.com/
Toolkit: https://github.com/NexaAI/nexa-sdk

It comes with a built-in local UI to get started with local models easily, plus an OpenAI-compatible API (with JSON schema for function calling and streaming) for local development. You can run the Nexa SDK on any device with a Python environment—and GPU acceleration is supported!

Local LLMs, and especially multimodal local models, are the future. They're the only way to make AI accessible (cost-efficient) and safe.

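If the OpenAI-compatible API is what you're after, the usual pattern is to point the standard openai client at the local server; the base_url and model id below are placeholders, so check the Nexa SDK docs for the actual port and model names:

```python
from openai import OpenAI

# Any local OpenAI-compatible server works the same way.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # placeholder model id
    messages=[{"role": "user", "content": "Hello from a local endpoint!"}],
)
print(resp.choices[0].message.content)
```
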
I saw a demo a few months back - and then lost it - of LLM autocompletion that responded in a few milliseconds; it opened up a whole new way to explore it... any ideas?

The newest laptops are supposed to have 40-50 TOPS of performance with the new AI/NPU features. Wondering what that will mean in practice.

Approximately how many tokens per second would the (edited: roughly $40k x 8 ≈ $320k) version process? Would this result in a ~32x boost in performance compared to other setups? Thanks!

It's kinda funny how nowadays an AI with 8 billion parameters is something "small". Especially when just two years back entire racks were needed to run something with far worse performance.

I'm not sure why anybody would respect that licence term, given the whole field rests on the rapacious misappropriation of other people's intellectual property.

I don't think local LLMs are being marketed "for the casual user", nor do I think the casual user will care at all about running LLMs locally, so I am not sure why this comparison matters.

It is true that VS Code has some non-optional telemetry, and if VSCodium works for people, that is great. However, VS Code's telemetry consists of non-personal metrics, and some of the most popular extensions are only available with VS Code, not with Codium.