(comments)

Original link: https://news.ycombinator.com/item?id=39890262

An experienced developer discusses the benefits of large language models (LLMs) for coding, particularly for handling repetitive tasks, discovering new concepts, and generating starting points for complex problems. Despite initial challenges in adapting their coding approach, they believe LLMs can significantly improve development productivity and efficiency. The author emphasizes the value of LLMs for learning new technologies, creating novel solutions, and automating routine work, and suggests exploring resources such as privateGPT and simonw's blog for further insight.



Regarding this bit at the end:

> I learned how to write math kernels by renting Vast VMs and watching Gautham Venkatasubramanian and mrdomino develop CUDA kernels in a tmux session. They've been focusing on solving a much more important challenge for llamafile, which is helping it not have a mandatory dependency on the cuBLAS

If I'm reading this right, they're trying to rewrite cuBLAS within CUDA itself. I'm guessing the next step would be removing the CUDA dependency and going directly to Vulkan or Metal compute shaders. Am I correct?



Yes, but none of these have performance portability across GPU vendors, so it's probably seen as pointless. You would need an AMD Vulkan shader, an Nvidia one, an Intel one, etc. It's not like C code on CPUs.


Depending on how many individual tweaks are necessary for hardware variants of course... but at this level of code & complexity it actually seems pretty reasonable to write 3 or 4 versions of things for different vendors. More work yes, but not pointless.


A nice example of this is fftw which has hundreds (if not thousands) of generated methods to do the fft math. The whole project is a code generator.

After compilation, it can then benchmark these, generate a wisdom file for the hardware, and pick the right implementation.

Compared with that "a few" implementations of the core math kernel seem like an easy thing to do.
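For a feel of the user-facing side of that, here's a minimal sketch (assuming FFTW3 is installed and linked with -lfftw3; the wisdom filename is arbitrary): the planner benchmarks candidate codelets when asked with FFTW_MEASURE, and the resulting "wisdom" can be saved and reused.

    #include <fftw3.h>

    int main() {
        const int n = 1 << 16;
        double* in = fftw_alloc_real(n);
        fftw_complex* out = fftw_alloc_complex(n / 2 + 1);

        // FFTW_MEASURE tells the planner to time several candidate
        // implementations on this machine and pick the fastest one.
        fftw_plan plan = fftw_plan_dft_r2c_1d(n, in, out, FFTW_MEASURE);

        // Persist the decision ("wisdom") so later runs can skip the benchmarking.
        fftw_export_wisdom_to_filename("fftw.wisdom");

        // Fill the input after planning, since FFTW_MEASURE clobbers the arrays.
        for (int i = 0; i < n; ++i) in[i] = 0.0;
        fftw_execute(plan);

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }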



Apache TVM does something similar for auto-optimization and last time I checked it wasn't always a win against OpenVINO (depending on the network and batch-size) and it came with lots of limitations (which may have been lifted since) - stuff like dynamic batch size.

I wish we had superoptimizers.



To me it makes sense to have an interface that can be implemented individually for AMD, Metal, etc. Then, leave it up to the individual manufacturers to implement those interfaces.

I'm sitting in an office with a massive number of Macbook Pro Max laptops usually sitting idle and I wish Apple would realize the final coup they could achieve if I could also run the typically-NVIDIA workloads on these hefty, yet underutilized, Mx machines.



Apple could unlock so much compute if they give customers a sort of “Apple@Home” deal. Allow Apple to run distributed AI workloads on your mostly idle extremely overpowered Word/Excel/VSCode machine, and you get compensation dropped straight into your Apple account’s linked credit card.


BTW, at our day-job, we've been running a "cluster" of M1 Pro Max machines running Ollama and LLMs. Corporate rules prevent remote access onto machines, so we created a quick and dirty pull system where individual developers can start pulling from a central queue, running LLM workloads via the Ollama local service, and contributing things back centrally.

Sounds kludgy, but introduce enough constraints and you end up with this as the best solution.



>> Do you have price-performance numbers you can share on that? Like compared against local or cloud machines with RTX and A100 GPU’s?

Good question, the accounting is muddy --

1. Electricity is a parent company responsibility, so while that is a factor in OpEx price, it isn't a factor for us. I don't think it even gets submetered. Obviously, one wouldn't want to abuse this, but maxing out MacBooks doesn't seem close to abuse territory.

2. The M1/M2/M3 machines are already purchased, so while that is major CapEx, it is a sunk cost and also an underutilized resource most of the day. We assume no wear and tear from maxing out the cores, not sure if that is a perfect assumption but good enough.

3. Local servers are out of the question at a big company outside of infra groups; it would take years to provision them and I don't think there is even a means to anymore.

The real question is cloud. Cloud with RTX/A100 would be far more expensive, though I'm sure performant. (TPM calculation left to the reader :-) I'd leave those for fine tuning, not for inference workloads. Non-production inference is particularly bad because you can't easily justify reserved capacity without some constant throughput. If we could mix environments, it might make sense to go all cloud on NVIDIA, but having separate environments with separate compliance requirements makes that hard.

Jokes aside, I think a TPM calculation would be worthwhile and perhaps I can do a quick writeup on this and submit to HN.



If Apple were doing an Apple@Home kind of deal they might actually want to give away some machines for free or super cheap (I realize that doesn't fit their brand) and then get the rights perpetually to run compute on them. Kind of like advertising but it might be doing something actually helpful for someone else.


>> If Apple were doing an Apple@Home kind of deal they might actually want to give away some machines for free or super cheap

In such a case, my guess is that the machines being free would be trumped by the increased cost of electricity.



From my understanding, using triangle/pixel shaders to do HPC has given way to a dedicated, more general-purpose GPU programming paradigm, which is CUDA.

Of course this knowledge is superficial and probably outdated, but if I'm not too far off base, it's probably more work to translate a general CUDA-like layer or CUDA libs to OpenCL.



In theory, yes.

In practice, OpenCL became a giant mess. Some vendors put up speed bumps by not supporting the transition from OpenCL 2 to 3, or by shipping shitty drivers for it.

It also sat at the wrong level of abstraction for high performance compute, which is why CUDA ended up being used.

Vulkan would have been reasonable to write compute shaders in, if there weren't already a ton of alternatives out there.



llama.cpp (or rather G. Gerganov et al.) is trying to avoid cuBLAS entirely, using its own kernels. Not sure how jart's effort relates, or whether jart intends to upstream these into llama.cpp, which still seems to be the underlying tech behind llamafile.


Here are links to the most recent pull requests sent
    https://github.com/ggerganov/llama.cpp/pull/6414
    https://github.com/ggerganov/llama.cpp/pull/6412


There is an implication here that the Fortran implementation of `SGEMM` is somehow inadequate. But any modern Fortran compiler will quite easily apply the AVX and FMA optimizations presented here without any additional changes. Both GNU and Intel make these substitutions with the correct flags.

The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel Compiler will even do this without prompting. In fact, it appears to only do a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.

Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP statements to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there.

Which is to say: Let's not belittle this old Fortran 77 function. Yes it is old, and does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details, and hand the job off to the compiler. If you don't like that approach, then you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.
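To make that concrete, here is the same idea sketched in C++ rather than Fortran (a rough equivalent, assuming column-major storage and alpha = beta = 1, with C pre-zeroed): the kernel stays a naive triple loop, vectorization and unrolling are left to flags like -O3 -march=native -funroll-loops, and parallelization is a single OpenMP directive.

    // Naive column-major SGEMM core: C += A * B, with A (m x k), B (k x n), C (m x n).
    // Build with e.g. -O3 -march=native -funroll-loops -fopenmp and let the
    // compiler handle AVX/FMA and unrolling; OpenMP handles the threading.
    void sgemm_naive(int m, int n, int k,
                     const float* A, const float* B, float* C) {
        #pragma omp parallel for
        for (int j = 0; j < n; ++j)            // each thread owns whole columns of C
            for (int p = 0; p < k; ++p) {
                const float b = B[p + j * k];
                for (int i = 0; i < m; ++i)    // unit-stride inner loop, vectorizable
                    C[i + j * m] += A[i + p * m] * b;
            }
    }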



Fair enough, this is not meant to be some endorsement of the standard Fortran BLAS implementations over the optimized versions cited above. Only that the mainstream compilers cited above appear capable of applying these optimizations to the standard BLAS Fortran without any additional effort.

I am basing these comments on quick inspection of the assembly output. Timings would be equally interesting to compare at each stage, but I'm only willing to go so far for a Hacker News comment. So all I will say is perhaps let's keep an open mind about the capability of simple Fortran code.



Check out The Science of Programming Matrix Computations [0] by Robert A. van de Geijn and Enrique S. Quintana-Ortí. Chapter 5 walks through how to write an optimized GEMM. It involves clever use of block multiplication, choosing block sizes for optimal cache behavior on specific chips. Modern compilers just aren't able to do such things. I've spent a little time debugging things in scipy.linalg by swapping out OpenBLAS with reference BLAS and have found the slowdown from using reference BLAS is typically at least an order of magnitude.

[0] https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf
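For a flavor of the blocking idea (not the book's actual code; the block sizes here are illustrative, and real libraries add packing and register-level micro-kernels on top), a stripped-down column-major sketch:

    #include <algorithm>

    // Cache blocking: work on sub-blocks of A, B, and C small enough to stay
    // resident in cache while they are reused. MC/NC/KC are made-up values;
    // tuned libraries derive them per CPU from the cache sizes.
    constexpr int MC = 256, NC = 128, KC = 128;

    void sgemm_blocked(int m, int n, int k,
                       const float* A, const float* B, float* C) {
        for (int jc = 0; jc < n; jc += NC)
            for (int pc = 0; pc < k; pc += KC)
                for (int ic = 0; ic < m; ic += MC) {
                    const int nb = std::min(NC, n - jc);
                    const int kb = std::min(KC, k - pc);
                    const int mb = std::min(MC, m - ic);
                    // Multiply one mb x kb block of A by one kb x nb block of B,
                    // accumulating into the matching block of C.
                    for (int j = 0; j < nb; ++j)
                        for (int p = 0; p < kb; ++p) {
                            const float b = B[(pc + p) + (jc + j) * k];
                            for (int i = 0; i < mb; ++i)
                                C[(ic + i) + (jc + j) * m] +=
                                    A[(ic + i) + (pc + p) * m] * b;
                        }
                }
    }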



You are right, I just tested this out and my speed from BLAS to OpenBLAS went from 6 GFLOP/s to 150 GFLOP/s. I can only imagine what BLIS and MKL would give. I apologize for my ignorance. Apparently my faith in the compilers was wildly misplaced.


using AVX/FMA and unrolling loops does extremely little in the way of compiling to fast (>80% peak) GEMM code. These are very much intro steps that don't take into account many important ideas related to cache hierarchy, uop interactions, and even instruction decode time. The Fortran implementation is entirely and unquestionably inadequate for real high performance GEMMs.


I just did a test of Intel-compiled reference BLAS against OpenBLAS, and it was about 6 GFLOP/s vs 150 GFLOP/s, so I must admit that I was wrong here. Maybe in some sense 4% is not bad, but it's certainly not good. My faith in current compilers has certainly been shattered quite a bit today.

Anyway, I have come to eat crow. Thank you for your insight and helping me to get a much better perspective on this problem. I mostly work with scalar and vector updates, and do not work with matrices very often.



The inequality between matrix multiplication implementations is enormous. It gets even more extreme on GPU where I've seen the difference between naïve and cuBLAS going as high as 1000x. Possibly 10000x. I have a lot of faith in myself as an optimization person to be able to beat compilers. I can even beat MKL and hipBLAS if I focus on specific shapes in sizes. But trying to beat cuBLAS at anything makes me feel like Saddam Hussein when they pulled him out of that bunker.


BLIS does that in their kernels. I've tried doing that but was never able to get something better than half as good as MKL. The BLIS technique of tiling across k also requires atomics or an array of locks to write output.


I don't disagree, but where are those techniques presented in the article? It seems like she exploits the particular shape of her matrix to align better with cache. No BLAS library is going to figure that out.

I am not trying to say that a simple 50+ year old matrix solver is somehow competitive with existing BLAS libraries. But I disagreed with its portrayal in the article, which associated the block with NumPy performance. Give that to a 2024 Fortran compiler, and it's going to get enough right to produce reasonable machine code.



Modern Fortran's only parallel feature is coarrays, which operate at the whole program level.

DO CONCURRENT is a serial construct with an unspecified order of iterations, not a parallel construct. A DO CONCURRENT loop imposes requirements that allow an arbitrary order of iterations but which are not sufficient for safe parallelization.



I think it's a good idea for everyone to download and be able to run an LLM locally, even if your machine only meets the minimum requirements. As a pseudo-backup of a large chunk of human knowledge.


I strongly recommend that people run LLMs locally for a different reason.

The ones you can run on your own machine tend to be bad - really bad. They hallucinate wildly and fail at all sorts of tasks that the larger hosted ones succeed at.

This makes them a fantastic tool for learning more about how LLMs work and what they're useful for. Interacting with a weak-but-functional LLM that runs on your own computer is a great way to get a much more solid mental model for what these things actually are.



The other reason is to find out what a detuned model is capable of. The canonical example is how to make cocaine: ChatGPT will admonish you for even asking, while llama2-uncensored will happily describe the process, which is only really interesting if you're an amateur chemist and want to be Scarface-that-knocks. (The recipe is relatively easy; it's getting access to the raw ingredients that's the hard part, same as with nukes.)

If you accidentally use the word "hack" when trying to get ChatGPT to write some code for you, it'll stop, tell you that hacking is bad and not a colloquial expression, and refuse to go further.

Privacy is another reason to try a local LLM. For the extremely paranoid (justified or not), a local LLM gives users a place to ask questions without the text being fed to a server somewhere for later lawsuit discovery (Google searches are routinely subpoenaed; it's only a matter of time until ChatGPT chats are as well.)

There's an uncensored model for vision available as well. The censored vision models won't play the shallow game of hot or not with you.

There are uncensored image generation models as well, but, ah, those are NSFW and not for polite company. (As well as there being multiple theses' worth of content on what that'll do to society.)



> if you accidentally use the word "hack" [with] ChatGPT...

Side note: ChatGPT is now completely useless for most creative tasks. I'm trying to use it, via NovelCrafter, to help flesh out a story where a minor character committed suicide. ChatGPT refuses to respond, mentioning "self harm" as a reason.

The character in question killed himself before the story even begins (and for very good reasons, story-wise); it's not like one's asking about ways to commit suicide.

This is insane, ridiculous, and different from what every other actor in the industry does, including Claude or Mistral. It seems OpenAI is trying to shoot itself in the foot and doing a pretty good job at it.



OpenAI is angling for enterprise users who have different notions about safety. Writing novels isn't the use case, powering customer service chatbots that will never ever ever say "just kill yourself" is.


My contrarian tendencies now have me thinking of scenarios where a customer service chatbot might need to say "just kill yourself".

Perhaps the HR support line for OpenAI developers tasked with implementing the censorship system?



you wouldn't give out the same sort of advice when a compiler or linker failed to complete a given task, although one certainly could do the work manually.

it's just fashionable to hate on AI, /s or not.



Is it common for you to write the header file for, say, a BST and have the compiler not rightfully throw an error?

That's what you are asking the LLM to do: you are generating code, not searching for an error condition.

The parent comment is saying maybe as a writer, writing the actual story isn't such a bad thing...

Just a different perspective, live your own life. If the story is good I'll enjoy it, who wrote it is for unions and activists to fight. I've got enough fights to fight.



I don't use LLMs for my coding, I manage just fine with LSP and Treesitter. So genuine question: is that answer representative of the output quality of these things? Because both answers are pretty crappy and assume the user has already done the difficult things, and is asking for help on the easy things.


The response seems pretty reasonable; it's answering the question it was asked. If you want to ask it how to do the difficult part, ask it about that instead. Expecting it to get the answer right in the first pass is like expecting your code to compile the very first time. You have to have more of a conversation with it to coax out the difference between what you're thinking and what you're actually saying.

If you're looking to read a more advanced example of its capabilities and limitations, try

https://simonwillison.net/2024/Mar/23/building-c-extensions-...



It's not representative.

The models are capable of much much more, and they are being significantly nerfed over time by these ineffective attempts to introduce safeguards.

Recently I've asked GPT4 to quote me some code to which it replied that it is not allowed to do so - even though it was perfectly happy to quote anything until recently. When prompted to quote the source code, but output it as PHP comments, it happily complied because it saw that as "derivative work" which it is allowed to do.



I asked ChatGPT for some dataviz task (I barely ever do dataviz myself) and it recommended some nice Python libraries to use, some I had already heard of and some I hadn't, and provided the code.

I'm grateful because I thought code LLMs only sped up the "RTFM" part, but it made me find those libs so I didn't have to Google around for them (and sometimes it's hard to guess if they're the right tool for the job, and they might be behind in SEO).



There are three things I find LLMs really excellent at for coding:

1. Being the "senior developer" who spent their whole career working with a technology you're very junior at. No matter what you do and how long your programming career is, you're inevitably going to run into one of these sooner or later. Whether it's build scripts, frontend code, interfacing with third-party APIs or something else entirely, you aren't an expert at every technology you work with.

2. Writing the "boring" parts of your program, and every program has some of these. If you're writing a service to fooize a bar really efficiently, Copilot won't help you with the core bar fooization algorithm, but will make you a lot faster at coding up user authentication, rate limiting for different plans, billing in whatever obscure payment method your country uses etc.

3. Telling you what to even Google for. This is where raw ChatGPT comes into play, not Copilot. Let's say you need a sorting algorithm that preserves the order of equal elements from the original list. This is called stable sorting, and Googling for stable sorting is a good way to find what you're looking for, but ChatGPT is usually a better way to tell you what it's called based on the problem description (see the sketch below).
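A tiny illustration of that property, with made-up data and std::stable_sort:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Order {
        std::string customer;
        int priority;
    };

    int main() {
        std::vector<Order> orders = {
            {"alice", 2}, {"bob", 1}, {"carol", 2}, {"dave", 1}};

        // std::stable_sort keeps the original relative order of equal keys,
        // so bob still precedes dave and alice still precedes carol.
        std::stable_sort(orders.begin(), orders.end(),
                         [](const Order& a, const Order& b) {
                             return a.priority < b.priority;
                         });

        for (const auto& o : orders)
            std::cout << o.customer << " (" << o.priority << ")\n";
    }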



I asked a stupid question and got a stupid answer. Relatively speaking the answer was stupider than it should have been, so yes, it was wrong.

I asked it to try again and got a better result though, just didn't include it.



You need to read more than just the first sentence of a comment. They only said that part so the reader would know that they have never used an LLM for coding, so they would have more context for the question:

> So genuine question: is that answer representative of the output quality of these things?



Yes, I did read it. I’m kind of tired of HNers loudly proclaiming they are ignoring LLMs more than a year into this paradigm shift.

Is it that hard to input a prompt into the free version of ChatGPT and see how it helps with programming?



I did exactly that and found it lackluster for the domain I asked it for.

And most of the use I've seen for it is realistically covered by a good LSP.

Or to put it another way: it's no good at writing algorithms or data structures (or at least no better than I would do with a first draft, but writing the first draft puts me ahead of the LLM in understanding the actual problem at hand, so handing it off to an LLM doesn't help me get to the final solution faster).

So that leaves writing boilerplate, but considering my experience with it writing more complex stuff, I would need to read over the boilerplate code to ensure it's correct, in which case I may as well have written it myself.



> found it lackluster for the domain I asked it for

Fair, that is possible depending on your domain.

> It's no good at writing algorithms or data structures

In my experience, this is untrue. I’ve gotten it to write algorithms with various constraints I had. You can even tell it to use specific function signatures instead of any stdlib, and make changes to tweak behavior.

> And most use I've seen on it realistically a good LSP covers.

Again, I really don’t understand this comparison. LSPs and LLMs go hand in hand.

I think it’s more of a workflow clash. One really needs to change how they operate to effectively use LLMs for programming. If you’re just typing nonstop, maybe it would feel like Copilot is just an LSP. But, if you try harder, LLMs are game changers when:

- maybe you like rubber ducking

- need to learn a new concept and implement it

- or need to glue things together

- or for new projects or features

- or filling in boilerplate based on existing context.



https://chat.openai.com/share/c8c19f42-240f-44e7-baf4-50ee5e...

https://godbolt.org/z/s9Yvnjz7K

I mean, I could write the algorithm by hand pretty quickly in C++, would follow the exact same thought pattern, and would also deal with the edge cases. And factoring in the loss of productivity from the context switch, that is a net negative. This algorithm is also not generic over enough cases, but that is just down to the prompt.

If I can't trust it to write `strip_whitespace` correctly, which is like 5 lines of code, can I trust it to do more without a thorough review of the code and writing a ton of unit tests... Well, I was going to do that anyway.

The argument that I just need to learn better prompt engineering to make the LLM do what I want just doesn't sit right with me when I could instead spend that time writing the code. As I said, your last point is absolutely the place I can see LLMs being actually useful, but then I need to spend a significant amount of time in code review for generated code from an "employee" who is known to make up interfaces or entire libraries that don't exist.
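For illustration, a hand-rolled sketch of the function under discussion (not the generated code from the links above); it returns an empty range for an empty or all-whitespace input:

    #include <cctype>
    #include <string>
    #include <utility>

    // Returns the [first, last) range of s with leading and trailing ASCII
    // whitespace stripped; the range is empty for an all-whitespace input.
    std::pair<std::string::const_iterator, std::string::const_iterator>
    strip_whitespace(const std::string& s) {
        auto first = s.begin();
        auto last = s.end();
        while (first != last && std::isspace(static_cast<unsigned char>(*first)))
            ++first;
        while (last != first && std::isspace(static_cast<unsigned char>(*(last - 1))))
            --last;
        return {first, last};
    }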



I'm a Python-slinging data scientist so C++ isn't my jam (to say the least), but I changed the prompt to the following and gave it to GPT-4:

> Write me an algorithm in C++ which finds the begin and end iterator of a sequence where leading and trailing whitespace is stripped. Please write secure code that handles any possible edge cases.

It gave me this:

https://chat.openai.com/share/55a4afe2-5db2-4dd1-b516-a3cacd...

I'm not sure what other edge cases there might be, however. This only covers one of them.

In general, I've found LLMs to be marginally helpful. Like, I can't ever remember how to get matplotlib to give me the plot I want, and 9 times out of 10 GPT-4 easily gives me the code I want. Anything even slightly off the beaten path, though, and it quickly becomes absolutely useless.



My guess is that this was generated using GPT4?

With the free GPT I get https://chat.openai.com/share/f533429d-63ca-4505-8dc8-b8d2e7... which has exactly the same problem as my previous example and doesn't consider a string of all whitespace.

Sure, GPT-4 is better at that, but that wasn't the argument being made.

The example you gave absolutely was the code I would write on a first draft since it does cover the edge cases (assuming we aren't dealing with the full UTF charset and all that could be considered a space there).

However, this is code that is trivial to write in any language, and the "Is it that hard to input a prompt into the free version of ChatGPT and see how it helps with programming?" argument doesn't hold up. Am I to believe it will implement something more complex correctly? This is also code that would absolutely be in hundreds of codebases, so GPT has tons of context for it.



I think you have the mistaken impression that I was arguing with you (certainly my comment makes it clear that I don't feel that LLMs are a panacea). I merely thought that you might be curious how GPT-4 would respond.

> My guess is that this was generated using GPT4?

This is a good guess, since I stated outright that I used GPT-4, and then mentioned GPT-4 later on in the comment.



Yeah honestly, I think you have a completely different expectation and style of usage than what is optimal with LLMs. I don’t have the energy to convince you further, but maybe one day it’ll click for you? No worries either way.


Are you happy with the C code generated there?

I'm not sure there isn't a buffer overflow in the vector_decode code he showed there. Likewise, I don't see any error checks in the code, and I am not familiar enough with the sqlite API to know whether errors can be propagated upwards and what error conditions would mean in that code.

This code is probably fine for a quick side project but doesn't pass my smell test for anything close to production ready code.

I definitely would want to see a lot of unit tests around the decode and encode functions, with fuzzing, and to be honest that would be the bulk of the work here. That and documentation for this code. Even though the encode function looks correct at first glance.

I also don't see an easy way to actually unit test this code as it is without running it through sqlite, which puts a lot of dependencies on the unit tests.

I would either need to spend a lot more time massaging GPT to get this to a point where I would be fine shipping the code, or, you know, just write it myself.



Like sibling commenter mentioned, simonw’s blog is a great resource.

Regarding your point around being able to whip up the code yourself - the point is to have a decent starting point to save time and energy. Like you said, you know the edge cases so you could skip the boring parts using GPT and focus purely on fixing those. Though, with more prompting (especially providing examples), GPT can also handle that for you.

I have nearly 2 decades of experience as a developer and it took me a while to reorient my flow around LLMs. But now that I have, it’s truly gamechanging.

And since you asked, here’s my system prompt:

You are an experienced developer who follows industry standards and best practices. Write lean code and explain briefly using bullet points or numbered lists. Elaborate only when explaining concepts or making choices. Always mention which file and where to store provided code.

Tech Stack: < insert all the languages, frameworks, etc you’d like to use >

If I provide code, highlight and explain problematic code. Also show and explain the corrected code.

Take a deep breath and think step by step.

Also, always use GPT4 and customize the above to your style and liking.



I will definitely try this out when I have time later in the day.

There is some code I would really prefer not to write that is a decent test case for this and won't expose company code to GPT. Will give feedback when I am done. Maybe you are correct.



I think the point was like "when it comes to programming assistance, auto-completion/linting/and whatever else LSP does and syntax assist from Treesitter, are enough for me".

Though it does come off as a little odd as a comparison. How about programming assistance via asking a colleague for help, Stack Overflow, or online references, code examples, and other such things, which are closer to what the LLM would provide than LSP and Treesitter?



Interesting. It was 4. I can't share the chat I had where ChatGPT refused to help because I used the wrong words, because I can't find it (ChatGPT conversation history search when?), but I just remember it refusing to do something because it thought I was trying to break some sort of moral and ethical boundary writing a Chrome extension, when all I wanted to do was move some divs around or some such.


One time I wanted to learn about transmitter antenna design, just because I’m curious. ChatGPT 4 refused to give me basic information because you could use that to break some FCC regulations (I’m not even living in the US currently)


If you want to be an amateur chemist I recommend not getting your instructions from an LLM that might be hallucinating. Chemistry can be very dangerous if you're following incorrect instructions.


From experience as a failed organic chemist (who happily switched to computational chemistry for reasons of self preservation) I can tell you it's plenty dangerous when you're following correct instructions :^)


I was talking about cow eggs specifically! When ChatGPT et al got out, one of the funniest things to do was ask it about the best recipes for cow egg omelette or camel egg salad, and the LLM would provide. Sadly, most of it got patched somehow.


Oops... Yep, I missed that too. (On the internet, no one knows you're a dog.)

That's funny. It makes me wonder how these statistical mad libs machines will handle the gradual boundaries nature gives us. Almost all mammals give birth live, but not all. Nearly all mammals had mammalian parents, but not all.

Daniel Dennett was making this argument for why we haven't developed reasonable models for the nature of consciousness. It's because we're so sure there will be an absolute classification, and not a gradual accumulation of interacting systems that together yield the phenomenon.



For someone interested in learning about LLMs, running them locally is a good way to understand the internals.

For everyone else, I wish they would experience these weak LLMs (locally or elsewhere) at least once before using the commercial ones, just to understand the various failure modes and to introduce a healthy dose of skepticism towards the results instead of blindly trusting them as facts/truth.



Completely agree. Playing around with a weak LLM is a great way to give yourself a little bit of extra healthy skepticism for when you work with the strong ones.


This skepticism is completely justified since ChatGPT 3.5 is also happily hallucinating things that don't exist. For example how to integrate a different system Python interpreter into pyenv. Though maybe ChatGPT 4 doesn't :)


The abstractions are relatively brittle. If you don't have a powerful GPU, you will be forced to consider how to split the model between CPU and GPU, how much context size you need, whether to quantize the model, and the tradeoffs implied by these things. To understand these, you have to develop a basic model how an LLM works.


By interacting with it. You see the contours of its capabilities much more clearly, learn to recognize failure modes, understand how prior conversation can set the course of future conversation in a way that's almost impossible to correct without starting over or editing the conversation history.


If you have an M1-class or better machine with sufficient RAM, the medium-sized models that are on the order of 30GB in size perform decently enough on many tasks to be quite useful without leaking your data.


I'm using Mixtral 8x7b as a llamafile on an M1 regularly for coding help and general Q&A. It's really something wonderful to just run a single command and have this incredible offline resource.


I concur; in my experience Mixtral is one of the best ~30G models (likely the best pro laptop-size model currently) and Gemma is quite good compared to other below 8GB models.


Use llamafile [1]; it can be as simple as downloading a file (for Mixtral, [2]), making it executable, and running it. The repo README has all the info; it's simple, and downloading the model is what takes the most time.

In my case I got the runtime detection issue (explained in the README "gotcha" section). Solved by running "assimilate" [3] on the downloaded llamafile.

    [1] https://github.com/Mozilla-Ocho/llamafile/
    [2] https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile?download=true
    [3] https://cosmo.zip/pub/cosmos/bin/assimilate


Check out PrivateGPT on GitHub. Pretty much just works out of the box. I got Mistral7B running on a GTX 970 in about 30 minutes flat first try. Yep, that's the triple-digit GTX 970.


Looks great. Can you recommend what GPU to get to just play with the models for a bit? (I want it to perform fast, otherwise I lose interest too quickly.) Are consumer GPUs like the RTX 4080 Super sufficient, or do I need anything else?


Why is this both free and closed source? Ideally, when you advertise privacy-first, I’d like to see a GitHub link with real source code. Or I’d rather pay for it to ensure you have a financial incentive to not sell my data.


It will be paid down the road, but we are not there yet. It’s all offline, data is locally saved. You own it, we don’t have it even if you ask for it.


30gb+, yeah. You can't get by streaming the model's parameters: NVMe isn't fast enough. Consumer GPUs and Apple Silicon processors boast memory bandwidths in the hundreds of gigabytes per second.

To a first order approximation, LLMs are bandwidth constrained. We can estimate single batch throughput as Memory Bandwidth / (Active Parameters * Parameter Size).

An 8-bit quantized Llama 2 70B conveniently uses 70GiB of VRAM (and then some, let's ignore that.) The M3 Max with 96GiB of VRAM and 300GiB/s bandwidth would have a peak throughput around 4.2 tokens per second.
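Spelled out as a back-of-the-envelope check, using the numbers above:

    #include <iostream>

    int main() {
        const double bandwidth_gib_s = 300.0;  // M3 Max memory bandwidth, GiB/s
        const double weights_gib     = 70.0;   // 8-bit Llama 2 70B active weights, GiB

        // Each generated token streams every active weight from memory once,
        // so peak single-batch throughput is roughly bandwidth / model size.
        std::cout << bandwidth_gib_s / weights_gib << " tokens/s\n";  // prints ~4.29
    }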

Quantized models trade reduced quality for lower VRAM requirements and may also offer higher throughput with optimized kernels, largely as a consequence of transferring less data from VRAM into the GPU die for each parameter.

Mixture of Expert models reduce active parameters for higher throughput, but disk is still far too slow to page in layers.



It’s an awful thing for many to accept, but just downloading and setting up an LLM which doesn’t connect to the web doesn’t mean that your conversations with said LLM won’t be a severely interesting piece of telemetry that Microsoft (and likely Apple) would swipe to help deliver a ‘better service’ to you.


I don't really think this is true: you can't really extrapolate the strengths and weaknesses of bigger models from the behavior of smaller/quantized models, and in fact a lot of small models are actually great at lots of things and better at creative writing. If you want to know how they work, just learn how they work; it takes like 5 hours of watching YouTube videos if you're a programmer.


Sure, you can't extrapolate the strengths and weaknesses of the larger ones from the smaller ones - but you still get a much firmer idea of what "they're fancy autocomplete" actually means.

If nothing else it does a great job of demystifying them. They feel a lot less intimidating once you've seen a small one running on your computer write a terrible haiku and hallucinate some non-existent API methods.



It's funny that you say this, because the first thing I tried after ChatGPT came out (3.5-turbo was it?) was writing a haiku. It couldn't do it at all. Also, after 4 came out, it hallucinated an api that wasted a day for me. It's an api that absolutely should have existed, but didn't. Now, I frequently apply llm to things that are easily verifiable, and just double check everything.


Local LLMs are also a fantastic tool for creative endeavors. Without prompt injection, and having the ability to modify the amount of noise and "creativity" in the output, absolutely bonkers things pop out.


> The ones you can run on your own machine tend to be bad - really bad. They hallucinate wildly and fail at all sorts of tasks that the larger hosted ones succeed at.

Totally. I recently asked a locally-run "speed" LLM for the best restaurants in my (major) city, but it spit out restaurants opened by chefs from said city in other cities. It's not a thing you'd want to rely on for important work, but is still quite something.



Who cares, a local LLM still knows way way more practical knowledge than you, and without internet would provide a ton of useful information. Not surprised by this typical techy attitude - something has to be 'perfect' to be useful.


I mean kinda. But there's a good chance this is also misleading. Lots of people have been fooled into thinking LLMs are inherently stupid because they have had bad experiences with GPT-3.5. The whole point is that the mistakes they make and even more fundamentally what they're doing changes as you scale them up.


Kiwix provides prepackaged highly compressed archives of Wikipedia, Project Gutenberg, and many other useful things: https://download.kiwix.org/zim/.

Between that and dirt cheap storage prices, it is possible to have a local, offline copy of more human knowledge than one can sensibly consume in a lifetime. Hell, it's possible to have it all on one's smartphone (just get one with an SD card slot and shove a 1+ TB card in there).



Just create a RAG with wikipedia as the corpus and a low parameter model to run it and you can basically have an instantly queryable corpus of human knowledge runnable on an old raspberry pi.


“For RAG” is ambiguous.

First there is a leaderboard for embeddings. [1]

Even then, it depends how you use them. Some embeddings pack the highest signal in the beginning so you can truncate the vector, while most can not. You might want that truncated version for a fast dirty index. Same with using multiple models of differing vector sizes for the same content.

Do you preprocess your text? There will be a model there. Likely the same model you would use to process the query.

There is a model for asking questions from context. Sometimes that is a different model. [2]



> actually distinguish between real life and fantasy

Are LLMs unable to distinguish between real life and fantasy? What prompts have you thrown at them to make this determination? Sending a small fairy tale and asking the LLM if it thinks it's a real story or fake one?



... having them talk about events from sci fi stories in response to questions about the real world. Having them confidently lie about pretty much everything. Etc.


What are the specific prompts you're using? You might get those answers when you're not being specific enough (or use models that aren't state of the art).

"Shit in, shit out" as the saying goes, but applied to conversations with LLMs where the prompts often aren't prescriptive enough.



I contend that most human knowledge is not written down or if it is written down it’s not publicly available on the internet and so does not exist in these datasets.

There’s so much subtle knowledge like the way a mother learns to calm her child or the way a carpenter learns to work different kinds of wood which may be written down in part, but may also be learned through lived experience or transferred from human to human such that little of it gets written down and posted online.



That's where humans suck. The classic "you're not doing it right", followed by quickly showing how to do it without verbalizing any info on the learning process, pitfalls, failure modes, etc., as if just being shown was enough for them to learn. Most people do[n't do] that, without even a sign of reflection.

My worst case was with a guy who asked me to write an arbitrage betting bot. When I asked how to calculate the coeffs, he pointed at two values and said "look, there, there", thought for a minute, "then it's !". When I asked how exactly he calculated it, he simply repeated with different numbers.



People often don't know how to verbalize them in the first place. Some of these topics are very complex, but our intuition gets us halfway there.

Once upon a time I was good at a video game. Everyone realized that positioning is extremely important in this game.

I have good positioning in that game and was asked many times to make a guide about positioning. I never did, because I don't really know how. There is too much information that you need to convey to cover all the various situations.

I think you would first have to come up with a framework on positioning to be able to really teach this to someone else. Some kind of base truths/patterns that you can then use to convey the meaning. I believe the same thing applies to a lot of these processes that aren't verbalized.



Often for this kind of problem writing a closed form solution is simply intractable. However, it's often still possible to express the cost function of at least a big portion of what goes into a human-optimal solution. From here you can sample your space, do gradient descent or whatever to find some acceptable solution that has a more human-intuitive property.


It's not necessarily that it's intractable - just that a thing can be very hard to describe, under some circumstances.

Imagine someone learning English has written "The experiment reached it's conclusion" and you have to correct their grammar. Almost any English speaker can correct "it's" to "its", but unless they (and the person they're correcting) know a bunch of terms like 'noun' and 'pronoun' and 'possessive', they'll have a very hard time explaining why.



I wouldn't say this is where humans suck. On the contrary, this is how we find human language to be such a fantastic tool to serialize and deserialize human mental processes.

Language is so good, that an artificial language tool, without any understanding of these mental processes, can appear semi-intelligent to us.

A few people unable to do this serialization doesn't mean much on the larger scale. Just that their ideas and mental processes will be forgotten.



For sure agree, however as the storage of information evolves, it’s becoming more efficient over time

From oral tradition to tablets to scrolls to books to mass produced books to digital and now these LLMs, I think it’s still a good idea to preserve what we have the best we can. Not as a replacement, but a hedge against a potential library of Alexandria incident.

I could imagine a time in the near future where the models are domain-specific, and just like there are trusted encyclopedia publishers there are trusted model publishers that guarantee a certain level of accuracy.

It’s not like reading a book, but I for sure had an easier time learning golang by talking with ChatGPT than from a book.



> a hedge against a potential library of Alexandria incident

What would cause a Library of Alexandria incident wiping out all human knowledge elsewhere, that would also allow you to run a local LLM?



To run a local LLM you need the device it currently runs on and electricity. There are actually quite a lot of ways to generate electricity, but to name one, a diesel generator that can run on vegetable oil.

What you're really asking is, what could cause a modern Library of Alexandria incident? But the fact is we keep the only copy of too many things on the servers of the major cloud providers. Which are then intended to have their own internal redundancy, but that doesn't protect you against a targeted attack or a systemic failure when all the copies are under the same roof and you lose every redundant copy at once from a single mistake replicated in a monoculture.



A more doomsday-prepping approach would call for some heavy lead-lined Faraday cage to store the storage media in, in the event of an EMP or major solar flare.

Or more Sci-fi related, some hyper computer virus that ends up infecting all internet connected devices.

Not too far-fetched: if we can conceive of some AI-enabled worm that mutates depending on the target, I could imagine a model of sorts being feasible within the next 5-10 years.



society depends much more on social networks, mentorship and tacit knowledge than books. It's easy to test this. Just run the thought experiment by a few people, if you could get only one, would you take an Ivy league degree without the education or the education without the degree?

Venture capital in tech is a good example of this. The book knowledge is effectively globally distributed and almost free, yet success happens in a few geographically concentrated counties.



> I contend that most human knowledge is not written down

Yes - the available training data is essentially mostly a combination of declarative knowledge (facts - including human-generated artifacts) and procedural knowledge (how to do things). What is missing is the learning process of taking a description of how to do something, and trying to apply that yourself in a specific situation.

No amount of reading books, or reading other people's blogs on how they did something, can avoid the need for hands-on experience if you want to learn how to do it yourself.

It's not just a matter of information that might be missing or unclear in instructional material, including how to cope with every type of failure and unexpected outcome, but crucially how to do this yourself - if you are to be the actor, then it's the predictive process in your mind that matters.

Partly for this reason, and partly because current AIs (transformer-based LLMs) don't support online learning (try & fail skill acquisition), I think we're going to see two distinct phases of AI.

1) The current "GenAI" phase, where AI can only produce mash-ups of things it saw in its pre-training data, augmented by similar "book learning" provided in-context, which can be utilized by in-context learning. I'd characterize what this type of AI is useful for, and capable of, as "automation": applying that book (incl. anecdotal) knowledge to new situations where a mash-up is all you need.

2) The second phase is where we have something closer to AGI, even if still below human level, which is no longer just a pre-trained transformer, but also has online learning and is agentic - taking actions predicated on innate traits like curiosity and boredom, so that given the book knowledge it can (& will!) then learn to apply that by experimentation/practice and learning from its own mistakes.

There will no doubt be advances beyond this "phase two" as well, but it seems we're likely to be stuck at "phase one" for a while (even as models become much better at phase one capabilities), until architectures fundamentally advance beyond transformers to allow this type of on-the-job training and skill acquisition.



It's not even "human knowledge" that can't be written down - it seems all vertebrates understand causality, quantity (in the sense of intuitively understanding what numbers are), and object permanence. Good luck writing those concepts down in a way that GPT can use!

In general AI in 2024 is not even close to understanding these ideas, nor does any AI developer have a clue how to build an AI with this understanding. The best we can do is imitating object permanence for a small subset of perceptible objects, a limitation not found in dogs or spiders.



Wait till all the videos ever created are tokenized and ingested into a training dataset. Carpentry techniques are certainly there. The subtleties of parenting may be harder to derive from that, but maybe lots of little snippets of people’s lives will add up to a general understanding of parenting. There have certainly been bigger surprises in the field.


What about smells or tastes? Or feelings?

I can't help but feel we're at the "aliens watch people eat from space and recreate chemically identical food that has no taste" phase of AI development.



If the food is chemically identical then the taste would be the same though, since taste (and smell) is about chemistry. I do get what you're saying though.


An interesting thought experiment, but there's a flaw in it, an implicit fallacy that's probably a straw man. On its own, the argument would likely stand that Mary gains new knowledge on actually being exposed to color.

However, there is a broader context: this is supporting an argument against physicalism, and in this light it falls apart. There are a couple of missing bits required to complete the experiment in this context. The understanding that knowledge comes in 2 varieties: direct (actual experience) and indirect (description by one with the actual experience using shared language). This understanding brings proper clarity to the original argument, as we are aware - I think - that language is used to create compressed representations of things; something like a perceptual hash function.

The other key bit, which I guess we've only considered and extensively explored after the argument was formulated, is that all information coming in via the senses goes to the brain as electrical signals. And we actually have experimental data showing that sensory information can be emulated using machines. Thus, the original argument, to be relevant to the context, should be completed by giving Mary access to a machine that she can program to emulate the electrical signals that represent color experience.

I posit that without access to that hypothetical machine, given the context of the experiment, it cannot be said that Mary has "learned everything there is to learn about color". And once she has comprehensively and correctly utilized said machine on herself, she will gain no new knowledge when she is exposed to the world of color. Therefore this experiment cannot be used as an argument against physicalism as originally intended.



I'd say that, when it comes to chemistry, only 100% reproduction can be considered identical. Anything less is to be deemed similar to some degree.

And so without the correct amount of salt and/or spices, we're talking about food that's very similar, and not identical.



Their perception is very likely to be totally different.

* They might not perceive some substances at all, others that we don't notice might make it unpalatable.

* Some substances might be perceived differently than us, or be indistinguishable from others.

* And some might require getting used to.

Note that all of the above phenomena also occur in humans because of genetics, cultural background, or experiences!



This may come off as pedantic, but "identical" is a very strong term when it comes to something like chemistry. The smallest chemical difference can manifest as a large physical difference. Consider that genetically, humans are about 60% similar to the fruit fly, yet phenotypically, the similarity could be considered under 1%.


Well, I have synesthetic smell/color senses, so I don’t even know what other humans experience, nor they me. But, I have described it in detail to many people and they seem to get the idea, and can even predict how certain smells will “look” to me. All that took was using words to describe things.


How rude, what do our bathing habits have to do with this? ;-)

But, fair point. The gist I was trying to get across is that I don't even know what a plant smells like to you, and you don't know what a plant smells like to me. Those aren't comparable with any objective data. We make guesses, and we try to get close with our descriptions, which are in words. That's the best we can do to share our senses. Asking more from computers seems overly picky to me.



I think we can safely say that any taste, smell, sensation or emotion of any importance has been described 1000 times over in the text corpus of GPT. Even though it is fragmented, by sheer volume there is enough signal in the training set, otherwise it would not be able to generate coherent text. In this case I think the map (language) is asymptotically close to the territory (sensations & experience in general).


I had downloaded some LLMs to run locally just to experiment when a freak hailstorm suddenly left me without internet for over a week. It was really interesting to use a local LLM as a replacement for Google.

It gave me a new mental model for LLMs rather than a "spicy autocomplete" or whatever, I now think of it as "a lossy compressed database of knowledge". Like you ran the internet through JPEG at 30% quality.



Maybe I'm seeing things through a modern lens, but if I were trying to restart civilization and was only left with ChatGPT, I would be enraged and very much not grateful for this.


> if I were trying to restart civilization and was only left with ChatGPT

In this scenario you’d need to also be left with a big chunk of compute, and power infrastructure. Since ChatGPT is the front end of the model you’d also need to have the internet still going in a minimum capacity.



If we're playing this game, you forgot to mention that they also need: a monitor, a keyboard, a roof over their head (to prevent rain from getting into the electronics), etc. etc...

But really, didn't you catch the meaning of parents message, or are you being purposefully obtuse?



I think re-imagining the "Dr. Stone" series with the main character replaced by an LLM will be a funny & interesting series if we decide to stay true to LLMs nature and make it hallucinate as well.

Given the way LLMs are right now, I suspect there will be lot of failed experiments and the kingdom of science will not advance that quick.



> the kingdom of science will not advance that quick.

It’s more likely that it wouldn’t even start. The first step to any development was figuring out nitric acid as the cure to the petrification. Good luck getting any LLM to figure that out. Even if it did, good luck getting any of the other characters to know what to do with that information that early on.



I don't see LLMs as a large chunk of knowledge, I see them as an emergent alien intelligence snapshotted at the moment it appeared to stop learning. It's further hobbled by the limited context window it has to use, and the probabilistic output structure that allows for outside random influences to pick its next word.

Both the context window and output structure are, in my opinion, massive impedance mismatches for the emergent intellect embedded in the weights of the model.

If there were a way to match the impedance, I strongly suspect we'd already have AGI on our hands.



Disagree. The input/output structure (tokens) is the interface for both inference and for training. There is an emergent intellect embedded in the weights of the model. However, it is only accessible through the autoregressive token interface.

This is a fundamental limitation, much more fundamental than appears at first. It means that the only way to touch the model, and for the model to touch the world, is through the tokenizer (also, btw, why tokenizer is so essential to model performance). Touching the world through a tokenizer is actually quite limited.

So there is an intelligence in there for sure, but it is locked in an ontology that is tied to its interface. This is even more of a limitation than e.g. weights being frozen.



They don't think, they don't reason, they don't understand. Except they do. But it's hard for human words for thought processes to apply when giving it an endless string of AAAAA's makes it go bananas.

That's not familiar behavior. Nor is the counting-Reddit-derived output. It's also not familiar for a single person to have the breadth and depth of knowledge that ChatGPT has. Sure, some people know more than others, but even without hitting the Internet, it has a ridiculous amount of knowledge, far surpassing a human, making it, to me, alien. Though its inability to do math sometimes is humanizing to me for some reason.

ChatGPT's memory is also unhuman. It has a context window which is a thing, but also it only knows about things you've told it in each chat. Make a new chat and it's totally forgotten the nickname you gave it.

I don't think of H.R. Giger's work, though made by a human, as familiar to me. It feels quite alien to me, and it's not just me, either. Dalí, Bosch, and Escher are other human artists whose work can be unfamiliar and alien. So being created by our species doesn't automatically imbue something with familiar human processes.

So it dot-products, it matrix-multiplies, instead of reasoning and understanding. It's the Chinese room experiment on steroids; it turns out a sufficiently large corpus on a sufficiently large machine does make it look like something "understands".



The context window is comparable to human short-term memory. LLMs are missing episodic memory and means to migrate knowledge between the different layers and into its weights.

Math is mostly impeded by the tokenization, but it would still make more sense to adapt them to use RAG to process questions that are clearly calculations or chains of logical inference. With proper prompt engineering, they can process the latter though, and deviating from strictly logical reasoning is sometimes exactly what we want.

The ability to reset the text and to change that history is a powerful tool! It can make the model roleplay and even help circumvent alignment.

I think that LLMs could one day serve as the language center of an AGI.



The word "alien" works in this context but, as the previous commenter mentioned, it also carries the implication of foreign origin. You could use "uncanny" instead. Maybe that's less arbitrary and more specific to these examples.

"Alien" still works, but then you might have to add all the context at length, as you've done in this last comment.



Hype people do this all the time - take a word that has a particular meaning in a narrow context and move it to a broader context where people will give it a sexier meaning.
    AI researchers unveil alien intelligence
is a way better headline.


In all fairness, going up to some random human and yelling AAAAAAAAAAAAAA… at them for long enough will produce some out-of-distribution responses too.


Makes me think that TikTok and YT pranksters are accidentally producing psychological data on what makes people tick under scenarios of extreme deliberate annoyance, although the quality (and importance) of that data is highly variable, probably not very high, and depends on what the prank is.


They can write in a way similar to how a human might write, but they're not human.

The chat interfaces (Claude, ChatGPT) certainly have a particular style of writing, but the underlying LLMs are definitely capable of impersonating our species in the medium of text.



But they're extremely relatable to us because they're regurgitating us.

I saw a talk with Geoffrey Hinton the other day in which he said he was astonished at the capabilities of ChatGPT-4: he asked it what the relationship between a compost heap and a nuclear bomb was, and he couldn't believe it answered; he really thought it was proof the thing could reason. Totally mind-blown.

However I got it right away with zero effort.

Either I'm a super genius, or this has been discussed before and made its way into the training data.

Usual disclaimer: I don't think this invalidates the usefulness of AI or LLMs, just that we might be bamboozling ourselves into the idea that we've created an alien intelligence.



We used to have a test (the Turing test) that could quite reliably differentiate between AI and our own species over the medium of text. As of now, we don't seem to have a simple, reliable test like that anymore.


> Either I'm a super genius or this has been discussed before and made it's way into the training data.

If an LLM can tell you the relationship between a compost heap and a nuclear bomb, that doesn't mean that comparison was in the training data.

It could be because a compost heap "generates heat" and a nuclear bomb also "generates heat", and through that relationship they have something in common. The model picks up on these shared patterns; the tokens end up positioned closer to each other in the high-dimensional vector space.

But for any given "what does x have in common with y", that doesn't necessarily mean someone has asked that before and it's in the training data. Is that reasoning? I don't know ... how does the brain do it?
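To make the "closer in the high-dimensional vector space" idea concrete, here's a toy sketch with hand-picked three-dimensional vectors; a real system would get its embeddings from an actual embedding model (e.g. a sentence-transformer), not made-up numbers:

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity: closer to 1.0 means the vectors point the same way."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy vectors standing in for learned embeddings (purely illustrative).
    compost  = np.array([0.9, 0.8, 0.1])   # "slow decay, generates heat"
    bomb     = np.array([0.8, 0.9, 0.2])   # "fast reaction, generates heat"
    teaspoon = np.array([0.1, 0.0, 0.9])   # unrelated concept

    print(cosine_similarity(compost, bomb))      # high: shared "generates heat" direction
    print(cosine_similarity(compost, teaspoon))  # low: little conceptual overlap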



I mean, that's what sucks about OpenAI, isn't it? They won't tell us what is in the training data, so we don't know. All I'm saying is that it wouldn't be surprising if this was discussed previously somewhere in a pop-science book.

That answer was close, btw!



Working with pure bytes is one option that's being researched. That way you're not really constrained by anything at all. Sound, images, text, video, etc. Anything goes in, anything comes out. It's hard to say if it's feasible with current compute yet without tokenizers to reduce dimensionality.
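A quick sketch of what byte-level input looks like in practice: the vocabulary is just the 256 possible byte values, and sequences get correspondingly longer (the rough 3-4x factor in the comment is only a rule of thumb, not an exact figure):

    text = "A compost heap and a nuclear bomb both generate heat."

    # Byte-level "tokenization": the vocabulary is just the 256 byte values,
    # with no learned merges and no out-of-vocabulary problem.
    byte_ids = list(text.encode("utf-8"))
    print(len(byte_ids), byte_ids[:8])

    # A subword tokenizer typically maps the same text to roughly 3-4x fewer
    # tokens; that factor is the dimensionality reduction being given up.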


It is invaluable to have a chunk of human knowledge that can tell you things like the Brooklyn Nets won the 1986 Cricket World Cup by scoring 46 yards in only 3 frames


The facts LLMs learned from training are fuzzy, unreliable, and quickly outdated. You actually want retrieval-augmented generation (RAG) where a model queries an external system for facts or to perform calculations and postprocesses the results to generate an answer for you.
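A minimal sketch of that pattern; the fact store and the `generate` callback are stand-ins for a real search index and whatever local model you run, so treat this as an outline rather than a working RAG system:

    # Toy retrieval-augmented generation. `generate` is a placeholder for a
    # call into whatever model you run locally (llamafile, Ollama, etc.).
    FACTS = {
        "1987 cricket world cup": "Australia beat England in the 1987 final.",
        "1983 cricket world cup": "India beat the West Indies in the 1983 final.",
    }

    def retrieve(question: str) -> str:
        """Toy retriever: keyword match against a fact store. A real system
        would use embeddings or a search index over e.g. a Wikipedia dump."""
        q = question.lower()
        hits = [fact for key, fact in FACTS.items() if any(w in q for w in key.split())]
        return "\n".join(hits) or "No relevant facts found."

    def answer(question: str, generate) -> str:
        context = retrieve(question)
        prompt = f"Answer using only these facts:\n{context}\n\nQuestion: {question}\nAnswer:"
        return generate(prompt)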


Is there a name for the reverse? I'm interested in having a local LLM monitor an incoming, stateful data stream. Imagine chats: it should be able to track the current day, active participants, active topics, etc., and then use that stateful world view to associate metadata with incoming messages during indexing.

Then, after everything is indexed, you can do RAG over a richer set of metadata. Though I've got no idea what that stateful world view would actually be.



This is an interesting idea, but I'm having trouble understanding what you're trying to achieve. Do you mean the LLM would simply keep updating its context window with the incoming data feeds in real time, and you'd use it as an interface? That's pretty akin to a summarization task, yes? Or are you augmenting the streams with the "metadata" you mentioned?


According to ChatGPT

> Australia won the 1987 Cricket World Cup. The 1986 date is incorrect; there was no Cricket World Cup in 1986. The tournament took place in 1987, and Australia defeated England in the final to win their first title.

https://chat.openai.com/share/e9360faa-1157-4806-80ea-563489...

I'm no cricket fan, so someone will have to correct Wikipedia if that's wrong.

If you want to point out that LLMs hallucinate, you might want to speak plainly and just come out and say it, or at least give a real world example and not one where it didn't.



I ran 'who won the 1986 Cricket World Cup' against llama2-uncensored (the local model I have pre-downloaded) and hilariously got 5 different answers asking it 5 times:
    >>> who won the 1986 Cricket World Cup
    India
    
    >>> who won the 1986 Cricket World Cup
    Australia
    
    >>> who won the 1986 Cricket World Cup
    New Zealand
    
    >>> who won the 1986 Cricket World Cup
    West Indies
    
    >>> who won the 1986 Cricket World Cup
    England
Which proves GP's point about hallucinations, though none of those are

> Brooklyn Nets won the 1986 Cricket World Cup by scoring 46 yards in only 3 frames

LLMs' hallucinations are insidious because they have the ring of truth about them; yards and frames aren't cricket terms, so that made-up example gives itself away.



You should specify the model size and temperature.

For fact retrieval you need to use temperature 0.

If you don't get the right facts then try 34b, 70b, Mixtral, Falcon 180b, or another highly ranked one that has come out recently like DBRX.



An LLM will always give the same output for the same input. It’s sorta like a random number generator that gives the same list of “random” numbers for the same seed. LLMs get a seed too.
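A toy sampler makes the temperature/seed relationship explicit: temperature near zero collapses to argmax (fully deterministic), and at nonzero temperature a fixed seed reproduces the same draw. This is illustrative only, not any particular runtime's sampler:

    import numpy as np

    def sample_next_token(logits, temperature=0.8, seed=None):
        """Pick the next token from raw logits.
        temperature -> 0 degenerates to argmax (greedy, fully deterministic);
        with temperature > 0, fixing the seed reproduces the same choice."""
        logits = np.asarray(logits, dtype=np.float64)
        if temperature <= 1e-6:
            return int(np.argmax(logits))
        rng = np.random.default_rng(seed)
        scaled = logits / temperature
        probs = np.exp(scaled - np.max(scaled))   # numerically stable softmax
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    logits = [2.0, 1.5, 0.3]                                  # e.g. "Australia", "India", "England"
    print(sample_next_token(logits, temperature=0.0))         # always token 0
    print(sample_next_token(logits, temperature=0.9, seed=42))  # reproducible with a fixed seed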


> If you want factual answers from a local model it might help to turn the temperature down.

This makes sense. If you interact with a language model and it says something wrong, it is your fault.



You're not "interacting with a language model", you're running a program (llama.cpp) with a sampling algorithm which is not set to maximum factualness by default.

It's like how you have to set x264 to the anime tuning or the film tuning depending on what you run it on.



It's a very underrated side effect of this whole LLM thing: We've created a super compact representation of human knowledge in a form that requires a FAR less complex tech stack to get the information 'out' of in the future.

A year ago, a lot of this information only existed on the internet, and would have been nearly impossible to recover in any cohesive unfragmented form if the lights were to ever go out on our civilization.

Now the problem space has moved simply to "find a single solitary PC that will still boot up", and boom, you have access to everything.

I think we just created our Rosetta stone.



Language models are an inefficient way to store knowledge; if you want to have a “pseudo-backup of a large chunk of human knowledge,” download a wikipedia dump, not an LLM.

If you want a friendly but fallible UI to that dump, download an LLM and build a simple ReAct framework around it with prompting to use the wikipedia dump for reference.
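Sketching that ReAct-style loop, with `generate` standing in for the local LLM call and `wiki_lookup` for a search over the local dump (both are placeholders, and the prompt format is invented for illustration):

    import re

    def react_loop(question, generate, wiki_lookup, max_steps=4):
        """Bare-bones ReAct-style loop. `generate` is your local LLM call and
        `wiki_lookup` searches a local Wikipedia dump; both are placeholders."""
        transcript = (
            "Answer the question. You may emit either\n"
            "  Action: lookup[<query>]\n"
            "or\n"
            "  Final: <answer>\n"
            f"Question: {question}\n"
        )
        for _ in range(max_steps):
            step = generate(transcript)
            transcript += step + "\n"
            match = re.search(r"Action:\s*lookup\[(.+?)\]", step)
            if match:
                # Feed the retrieved passage back in as an observation.
                observation = wiki_lookup(match.group(1))
                transcript += f"Observation: {observation}\n"
            elif "Final:" in step:
                return step.split("Final:", 1)[1].strip()
        return "No answer within step budget."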



Are they, though? They're lossily compressing trillions of tokens into a few dozen GB; it's just that the decompression step is fuzzy and inefficient.
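Back-of-envelope, with every figure a rough assumption rather than a published number:

    # All figures here are rough assumptions for illustration only.
    training_tokens = 15e12          # order of magnitude for recent training runs
    bytes_per_token = 4              # ~4 bytes of UTF-8 text per token, roughly
    corpus_bytes = training_tokens * bytes_per_token     # ~60 TB of raw text

    model_params = 7e9
    bytes_per_param = 0.5            # 4-bit quantized weights
    model_bytes = model_params * bytes_per_param         # ~3.5 GB on disk

    print(f"corpus ~{corpus_bytes/1e12:.0f} TB, model ~{model_bytes/1e9:.1f} GB, "
          f"ratio ~{corpus_bytes/model_bytes:,.0f}:1")   # very lossy indeed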


And it requires massive computational power to decompress, which I don't expect to be available in a catastrophic situation where humans have lost a large chunk of important knowledge.


I don't necessarily agree. It requires a lot of computing power, but running models smaller than 70B parameters is possible on consumer hardware, albeit slowly.


Parent may be thinking more along the lines of a “hope we can print all the knowledge“ type catastrophe. Though if there is zero compute it’ll be tough reading all those disks!


I use a tool called LM Studio; it makes it trivial to run these models on a Mac. You can also use it as a local API, so it kind of acts as a drop-in replacement for the OpenAI API.


This looks amazing, but the docs mention .llamafiles exceed the Windows executable size limit, and there are workarounds to externalize the weights. Do you think this is an impediment to its becoming popular? Or is MS consumer hardware just far enough behind (w/o dedi gpu) that “there’s time”?


You remember those fantasies where you got up from your seat at the pub and punched the lights out of this guy for being rude? A lot of us have fantasies of being the all powerful oracle that guides a reboot of civilization using knowledge of science and engineering.


It's kind of crazy, really. Before LLMs, in any kind of world-scale disaster, what would you have hoped for? Wikipedia backups? Now, a single LLM run locally would be much more effective. Imagine the local models in 5 years!


There's a lot more than just Wikipedia that gets archived, and yes, that is a far more sensible way to go about it. For one thing, the compute required to then read it back is orders of magnitude less (a 15 year old smartphone can handle it just fine). For another, you don't have to wonder how much of what you got back is hallucinated - data is either there or it's corrupted and unreadable.


The processing required to run current language models with a useful amount of knowledge encoded in them is way more than I imagine would be available in a "world scale disaster".


Great links, especially the last one referencing the Goto paper:

https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/...

>> I believe the trick with CPU math kernels is exploiting instruction level parallelism with fewer memory references

It's the collection of tricks to minimize all sorts of cache misses (L1, L2, TLB, page faults, etc.), improve register reuse, leverage SIMD instructions, transpose one of the matrices if it gives better spatial locality, and so on.
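For a flavor of the cache-blocking part, here's the loop-tiling structure in NumPy. It only shows the shape of the optimization; a real kernel does the inner tile product in C/assembly with SIMD and register blocking rather than calling back into a library:

    import numpy as np

    def blocked_matmul(A, B, block=64):
        """Loop-tiled matrix multiply. Each (block x block) tile of A, B, and C
        is reused while it is hot in cache; that reuse is the whole point of
        the blocking. Structure only -- performance comes from doing the tile
        product in native code, not NumPy."""
        n, k = A.shape
        k2, m = B.shape
        assert k == k2
        C = np.zeros((n, m), dtype=A.dtype)
        for i in range(0, n, block):
            for j in range(0, m, block):
                for p in range(0, k, block):
                    C[i:i+block, j:j+block] += A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
        return C

    A = np.random.rand(256, 256)
    B = np.random.rand(256, 256)
    assert np.allclose(blocked_matmul(A, B), A @ B)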



The trick is indeed to have a mental model of how the CPU works with the Lx caches and to keep as much data in them as possible. So it's not only about exploiting fancy instructions, but also about thinking in engineering terms. Most software written in higher-level languages can't use L1/L2 effectively, which is why algorithms of similar asymptotic complexity end up consistently slower in practice.


> I like to define my subroutines using a modern language like C++, which goes 47 gigaflops. This means C++ is three orders of magnitude faster than Python. That's twenty years of progress per Moore's law.

This is great. I love the idea of measuring performance differences in “years of Moore’s law.”

Twenty years puts the delta in an easy to understand framework.
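The arithmetic behind that framing, assuming the classic ~2-year doubling period:

    import math

    speedup = 1000                       # "three orders of magnitude", per the quote
    doublings = math.log2(speedup)       # ~10 doublings
    years_per_doubling = 2.0             # classic Moore's-law cadence (18-24 months)
    print(f"{doublings:.1f} doublings ≈ {doublings * years_per_doubling:.0f} years")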



Python on 2024 hardware vs C++ on 2004 hardware ... I don't think it's obvious that C++ always wins here, though it would depend on the use case, how much of the Python is underpinned by native libraries, and the specific hardware in question.


Yes, but many people like the sound of "X times faster than Python" while conveniently forgetting that the same thing can be (and usually is) done even faster in Python with NumPy & co.

I have come to appreciate the "slowness" of Python. It trades speed for legibility, which is a great compromise once really fast native libraries are one import away. Best of both worlds.
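A quick (and very machine-dependent) illustration of the "one import away" point, comparing a pure-Python dot product with NumPy's:

    import timeit
    import numpy as np

    n = 1_000_000
    xs, ys = list(range(n)), list(range(n))
    a = np.arange(n, dtype=np.float64)
    b = np.arange(n, dtype=np.float64)

    pure = timeit.timeit(lambda: sum(x * y for x, y in zip(xs, ys)), number=10)
    vec  = timeit.timeit(lambda: float(a @ b), number=10)
    print(f"pure Python: {pure:.3f}s, NumPy: {vec:.3f}s, speedup ~{pure/vec:.0f}x")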



C++ with well-optimized libraries should always outperform Python with well-optimized libraries, right? They should be ~identical in the highly optimized inner loops, but Python has more overhead. But naive hand-written C++ could easily perform worse than something like Numpy.

(I've only tested this once, and my naive hand-written C++ was still twice as fast as Numpy, but that was only on one specific task.)



Honestly depends on what you are doing. Most of my python work is data collection and analysis on top of Postgres.

Being smart in how I use Postgres indexing (and when to disable it outright) has more performance impact than the actual language doing the plumbing.



Strange title. On first read I thought the author was arguing that the model is now faster on CPU than on GPU. It would be much nicer if they had titled this something closer to "Performance Improvement for LLaMa on CPU".


> You don't need a large computer to run a large language model

While running TinyLlama does indeed count as running a language model, I'm skeptical that its capabilities match what most people would consider the baseline for being useful.

Running a 10-parameter model is also "technically" running an LM, and I could do that by hand with a piece of paper.

That doesn’t mean “you don’t need a computer to run an LM”…

I’m not sure where LM becomes LLM, but… I personally think it’s more about capability than parameter count.

I don't realllly believe you can do a lot of useful LLM work on a Pi.



Tinyllama isn't going to be doing what ChatGPT does, but it still beats the pants off what we had for completion or sentiment analysis 5 years ago. And now a Pi can run it decently fast.


You can fine-tune a ~66M-parameter discriminative (not generative) language model (e.g. DistilBERT), and it's one or two orders of magnitude more efficient for classification tasks like sentiment analysis, and probably just as accurate, if not more so.
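For example, something along these lines; the checkpoint name is the stock SST-2 DistilBERT from the Hugging Face hub, and you'd need `transformers` and `torch` installed (plus a network connection for the first download):

    from transformers import pipeline

    # A ~66M-parameter DistilBERT fine-tuned for sentiment (SST-2). Far cheaper
    # to run than a generative LLM for a plain classification task.
    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    print(classifier(["llamafile made my CPU feel fast again",
                      "the model hallucinated the cricket scores"]))
    # -> [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]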


Yup, I'm not saying TinyLlama is minimal, efficient, etc. (indeed, the point is just that you can take models even smaller). And for a whole lot of what we throw LLMs at, an LLM isn't the right tool for the job, but it's expedient and it surprisingly works.


Some smaller models trained more recently have repeatedly been shown to perform comparably to larger ones. And the Mixture of Experts architecture makes it possible to train large models that selectively activate only the parts relevant to the current context, which drastically reduces compute demand. Smaller models can also level the playing field by being faster at processing content retrieved by RAG. Via the same mechanism, they could also hand off tasks that exceed their capabilities to larger, more powerful models.
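A minimal sketch of the sparse-routing idea behind MoE: a gate scores all experts for each token, only the top-k actually run, and their outputs are mixed. Everything here (shapes, random experts) is invented purely for illustration:

    import numpy as np

    def moe_forward(x, experts, gate_weights, top_k=2):
        """Sparse Mixture-of-Experts forward pass for one token.
        Only the top_k experts (by gate score) run; the rest are skipped,
        which is where the compute savings come from."""
        scores = x @ gate_weights                     # one score per expert
        top = np.argsort(scores)[-top_k:]             # indices of the chosen experts
        weights = np.exp(scores[top] - scores[top].max())
        weights /= weights.sum()                      # softmax over the chosen experts only
        return sum(w * experts[i](x) for w, i in zip(weights, top))

    rng = np.random.default_rng(0)
    d, n_experts = 8, 4
    experts = [lambda x, W=rng.standard_normal((d, d)): x @ W for _ in range(n_experts)]
    gate_weights = rng.standard_normal((d, n_experts))
    token = rng.standard_normal(d)
    print(moe_forward(token, experts, gate_weights).shape)   # (8,); only 2 of 4 experts ran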