(comments)

Original link: https://news.ycombinator.com/item?id=39973467

The Candle developer discusses their work on improving the training and inference efficiency of deep learning models, with particular attention to nanoGPT implementations in Mojo versus JAX or CUDA. They acknowledge the challenges of converting projects from Python to other programming languages, given Python's convenience and rich feature set. One debate centers on whether important, complex algorithms are usually discovered by humans or mostly generated by machines, with the observation that most historically famous algorithms tend to be relatively simple. The discussion also touches on the limits of human cognition when processing large amounts of information and our preference for conceptual simplicity. Commenters mention ongoing efforts to understand how specific code works internally, such as Andrej Karpathy's training approach, and recommend resources like his beginner tutorials. There are questions about the differences between various forms of computer memory, in particular between discrete GPUs with on-board memory and conventional socketed memory, along with speculation about the associated trade-offs in cost, power, and performance. The conversation also includes remarks on the importance of effective communication and knowledge sharing within the scientific and technical community, encouraging learners to dig deeper into these topics through educational materials and tutorials. Finally, there is a brief mention of NVIDIA CUDA's dominance in AI applications, along with some criticism of apparent compatibility and support problems at AMD and Intel.


Original text


Candle dev here, we also support training/backprop! We certainly focus on optimizing inference performance, but hopefully that should improve training efficiency too.


Very nice.

In my experience much of the complexity of numerical software is to enable the search for the algorithm that works well with the problem/data you have. Once you know the exact algorithm you want, it is possible to make a nice clean minimalistic implementation, but that does not mean such an implementation would have been easy at the beginning.



I've seen his nano GPT implemented using JAX, now we have C/CUDA. I'd love to see if nano GPT could be doable in Mojo. I took a stab at a Mojo conversion of his Wavenet project (Andrej's zero to hero course) and I gotta say... python has so many nice features lol. Stating the obvious I know but what you see done in 6 lines of python takes so much more work in other languages.


They only made Mojo available outside the preview circle a couple of months ago, and it has yet to run on the Windows laptops researchers use.

I love the attitude of considering 0.x languages production ready for all imaginable kinds of workloads.



A JPEG decoder or TCP stack are very clearly not individual concepts though. There's obviously some subjectivity as to what constitutes a single "concept" or "algorithm", but I'm not sure either of those two examples are in a gray area.

A single concept might be implementing just ARP or a discrete cosine transform. If you wanted to do a full TCP stack or JPEG decoder, that would make a lot more sense after building their internal components one by one.
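For a sense of scale, the DCT example really is that small; a minimal, unoptimized DCT-II sketch (illustrative only, not production JPEG code) fits in a dozen lines of C:

```c
#include <math.h>
#include <stddef.h>

/* Naive O(N^2) DCT-II, the transform at the heart of JPEG:
   out[k] = sum_{n=0}^{N-1} in[n] * cos(pi/N * (n + 0.5) * k).
   Real codecs use scaled, fast O(N log N) variants. */
void dct_ii(const double *in, double *out, size_t N) {
    const double pi = acos(-1.0);
    for (size_t k = 0; k < N; k++) {
        double sum = 0.0;
        for (size_t n = 0; n < N; n++) {
            sum += in[n] * cos(pi / (double)N * ((double)n + 0.5) * (double)k);
        }
        out[k] = sum;
    }
}
```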



That's a good question. Unfortunately I think you're asking to compute the Kolmogorov complexity of every interesting concept we have that doesn't yet have an implementation of fewer than n = 1000 lines, which is equivalent to solving the halting problem (modulo unbounded memory).
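(For reference, the quantity being invoked here: relative to a fixed universal machine U, the Kolmogorov complexity of x is the length of the shortest program that outputs x, and it is uncomputable in general.)

```latex
K_U(x) = \min\{\, |p| \;:\; U(p) = x \,\}
```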

If you could exhaustively list all the interesting algorithms (hard but feasible), you could potentially prove an upper bound on each one's complexity by writing a shorter-than-n implementation (hard, probably infeasible) and show positively that GP's proposition isn't true. On the other hand, showing that it was true would require either some very clever proof that can't apply to all programs, but somehow only to these interesting ones (very likely impossible), or enumerating all C^n programs, where C is the number of possible lines (something like 64^80), and showing that none of them implements at least one of the interesting algorithms (absurdly impossible).



You are right but I think that there's a more interesting question: do humans stumble upon those large interesting/great algorithms in practice?

The key point here is that we are looking at algorithms already discovered in human history rather than enumerating all possible interesting algorithms. Of course there are interesting algorithms that are very large, but humans don't discover them in practice. If you look up a list of the greatest algorithms in history, they will be rather small in length. Many of them can be sketched on a whiteboard.

I think that what is happening here is that our minds just can't hold billions of concepts at once. So if you have an algorithm with billions of moving parts, it was most likely produced by a machine. Handcrafted things, on the other hand, are smaller in comparison.

Another thing is that our minds like conceptual simplicity and view simplicity as a kind of beauty. So if we have a great algorithm but it is too large, we look for ways to express it succinctly (the right abstractions can help with that, and also help with understanding the algorithm better). We tend to succeed because the algorithms themselves have low Kolmogorov complexity (and thus, if their current expression is too large, it can probably be compressed further).



> direct CUDA implementation, which will be significantly faster and probably come close to PyTorch.

It almost hurts to read that PyTorch is faster.

But then again, with these GPU RAM prices, let's see how fast it can get on the CPU.

We really need SO-DIMM slots on the RTX series (or AMD/Intel equivalent) so that we can expand the RAM as we need it to. Is there a technical problem to it?



Memory speed is more or less directly proportional to how close the memory is to the processor, with the fastest memory being literally inside the processor (SRAM cache), followed by memory on the same package as the processor (HBM GPUs, Apple M-series), followed by soldered down discrete memory chips (regular GPUs, games consoles), followed by socketed DIMMs in distant last place. There's not really any getting around it, the bandwidth that GPUs crave just isn't compatible with modularity.

Even CPUs are starting to move their memory closer to the core in the name of performance, as mentioned Apple is already doing it, Intel is making Xeons with on-chip memory now, and they have a version aimed at consumers on their roadmap.



FYI, most discrete GPUs with discrete memory packages soldered to the board near the GPU are running at substantially higher memory frequencies than the on-package DRAM in Apple's chips. But running GDDR at those speeds costs a lot of power.


I watched a presentation on this today. The presenter focused on the soldering and proximity as well. Is this really the only difference or is this transistor based memory (like L1, L2, etc.)? I get the proximity factor of course (1ft / ns EE rule of thumb). In any case, soldering and proximity don't seem like breakthrough innovations (but maybe I am wrong).


GPU RAM is typically GDDR6 or GDDR6X, which is a different standard from the chips used for DDR5, for example. GPUs have terrible latency to RAM but enormous throughput, and I assume the chips are internally optimized for that. Many aspects of a design change when you choose different latency or clock-speed targets, translating into different power/area calculations.


That's true, it has an impact, but I think there's still space available for "slightly slower with 2x memory" models. For many local uses, new cards are way past the "fast enough" line, but having 64 GB on them would be really beneficial.

I'd love to see some experiments / different SKUs in this area, given people are already DIY-ing extra memory onto NVIDIA cards. (https://hackaday.com/2021/01/29/add-an-extra-8gb-of-vram-to-... there were stable experiments later on, but I don't have a link now)



Graphics card manufacturers believe that selling high-memory consumer graphics cards would cut into the market for commercial compute cards, so they won't do it; that's all.


Problem is, making a board design using an existing GPU chip and sticking more RAM into it is (relatively) simple but of course none of the GPU chip makers would allow partners to do that. Making your own GPU chip that’s competitive with Nvidia or AMD’s current offerings is a massive undertaking and pretty much impossible for a newcomer.

Just look at how much trouble Intel has had breaking into the discrete GPU market or even just how hard it’s been for AMD to compete with Nvidia even with decades of experience in the market.

And if some newcomer could make a competitive GPU with large memory capacity, they'd be crazy not to sell it at datacenter prices, maybe just undercutting the others by a few grand but still way more expensive than any consumer GPU you can buy today, even a 4090.



For data rates, as in bandwidth per IO pin, distance is really only a secondary factor. HBM memory, for example, runs at substantially lower data rates than GDDR, yet it sits right next to the GPU die compared to centimeters for the GDDR. And high-speed serial links run at speeds that are an order of magnitude higher than even the internal register files of a CPU.


Check out PCB back drilling. It's a process where you remove a few hundred microns from the vias that are used to connect GDDR RAMs to the GPUs, to avoid reflections due to the impedance mismatch that's caused by the stub.

When you have a pulse coded signal traveling at close to 10GHz, everything becomes an antenna. The technical problem is that you can't do this with a flimsy connector like the ones used for DIMMs. The reason GDDR can have a bandwidth per pin that is 4 times higher than regular DDR is because they are soldered down on the PCB.
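As a rough worked example of that per-pin gap (figures here are for an RTX 4090 class board and a DDR5-6000 DIMM; exact rates vary by generation):

```latex
\begin{aligned}
\text{GDDR6X, 384 data pins at } 21~\text{Gbit/s per pin:}\quad & 384 \times 21 / 8 \approx 1008~\text{GB/s} \\
\text{DDR5-6000, 64-bit channel at } 6~\text{Gbit/s per pin:}\quad & 64 \times 6 / 8 = 48~\text{GB/s per channel}
\end{aligned}
```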



> We really need SO-DIMM slots on the RTX series (or AMD/Intel equivalent) so that we can expand the RAM as we need it to. Is there a technical problem to it?

I imagine it would incur a non trivial latency and cost penalty. The memory modules are placed pretty close to the compute die right now. Cooling would also have to change (the memory modules produce a lot of heat).

But there is also no reason for any of the GPU manufacturers to do this. A SKU with twice as much memory can go for a lot more than the difference in memory cost alone.



I don't disagree but (I know nothing about this btw...) would it not benefit in terms of, say, an L3 cache kind of thing?

Imagine you could stick 2 x 64GB DDR5 DIMMs on the GPU in sockets, would that not be faster to access than the motherboard DIMMs? It won't be as fast as on-die memory of course, but could it not act like a sort of halfway house?
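For a ballpark sense of the gap being asked about (illustrative numbers; exact figures depend on the platform), local DDR5 would indeed sit between the PCIe path to host RAM and the card's own GDDR:

```latex
\begin{aligned}
\text{PCIe 4.0 x16 (GPU} \leftrightarrow \text{host RAM):}\quad & \approx 32~\text{GB/s per direction} \\
\text{Dual-channel DDR5-5600 on the card:}\quad & 2 \times 5600~\text{MT/s} \times 8~\text{B} \approx 90~\text{GB/s} \\
\text{GDDR6X on an RTX 4090:}\quad & \approx 1000~\text{GB/s}
\end{aligned}
```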



And especially the "interesting" combinations of GPU and memory they ship.

Like a lower-end GPU with 16 GB of VRAM, but only 8 or 12 GB in the mid-range, and then 16 GB again at the upper end of the GPU lineup.



Karpathy's code, teaching and contribution to the body of knowledge in this area really is admirable.

Sadly I am a generalist, but if I were a specialist, I would hope to contribute as openly and widely as Karpathy.

Not clout chasing, click-bait, "top 5 javascript frameworks of 2023!" ... just high quality output that marks a specialist.

Sorry to gush.



Question, apologies if slightly off-topic, it's something I'd like to use this project for: is there an example of how to train GPT-2 on time series, in particular with covariates?

As far as my basic understanding of LLMs goes, it's predicting the next token from previous tokens, which sounds directionally similar to time series forecasting (perhaps setting aside periodicity).
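One way people make this concrete (not something this repo provides; the function and parameters below are made up for illustration) is to discretize the real-valued series into a small vocabulary so that ordinary next-token training applies; covariates could then be interleaved as extra tokens or added as embeddings:

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical sketch: map a real-valued series onto integer tokens by
   uniform binning over [lo, hi], so a next-token model can train on it. */
void quantize_series(const float *x, int *tokens, size_t n,
                     float lo, float hi, int vocab_size) {
    for (size_t t = 0; t < n; t++) {
        float u = (x[t] - lo) / (hi - lo);             /* normalize to [0, 1] */
        int tok = (int)floorf(u * (float)vocab_size);  /* pick a bin index    */
        if (tok < 0) tok = 0;                          /* clamp out-of-range  */
        if (tok >= vocab_size) tok = vocab_size - 1;
        tokens[t] = tok;
    }
}
```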



When Lex recently talked to Andrej, Andrej said that he gets positively obsessed with a problem and says "this must exist". I imagine this must be one of those outputs.


Wow, and this is done right after a recent trip to Bhutan to clear his head! I follow karpathy on twitter, and he posted that two weeks without constantly checking his phone kind of turned off the always-on radio in his head.


This is an implementation of a transformer, and in the README it's presented as text->text. Tokens are just integers going in and out.

Is it possible to use it to train other types of LLMs (text->image, image->text, speech->text, etc.)?



The transformer itself just takes arrays of numbers and turns them into arrays of numbers. What you are interested in is the process that happens before and after the transformer.


Or rather GLSL... The C++ code looks like it's mostly just scaffolding to kick off the actually important GPU work, and for that it's a surprising amount of code. Quite typical both for Vulkan and C++ though ;)


Is this able to replace PyTorch, ... in normal practice? No.

Does this show that in general the most used ML frameworks are a mess? Yes.



> Does this show that in general the most used ML frameworks are a mess? Yes.

Not really ... there is little to no overlap with what a framework like PyTorch does. There is no tensor class, no autograd, etc. Just malloc, a bunch of hand calculated pointers into that chunk of memory, and hand written gradient functions. I assume the intent here is to be educational by stripping away the layers of abstraction to make it clearer what is going on.
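A minimal sketch of that style (the names and layout here are illustrative, not the repo's actual parameter order): one large allocation, with each "tensor" being a hand-computed offset into it.

```c
#include <stdlib.h>

/* Illustrative only: no tensor class and no autograd, just one malloc
   and pointers carved out of it by hand. */
typedef struct {
    float *params;  /* single allocation holding every parameter */
    float *wte;     /* token embedding table, V * C floats       */
    float *wpe;     /* positional embeddings, T * C floats       */
    float *lnw;     /* layernorm weights,     C floats           */
} TinyParams;

TinyParams alloc_params(int V, int T, int C) {
    TinyParams p;
    size_t total = (size_t)V * C + (size_t)T * C + (size_t)C;
    p.params = (float *)malloc(total * sizeof(float));
    p.wte = p.params;                    /* offset 0                   */
    p.wpe = p.wte + (size_t)V * C;       /* right after the embeddings */
    p.lnw = p.wpe + (size_t)T * C;       /* right after the positions  */
    return p;
}
```

The backward pass is handled the same way: a mirror-image buffer of gradients and hand-written functions that fill it in.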

Frankly though, this code (all that pointer math!) is a mess too, maybe written this way to make it easy to port to cuDNN which is at a similarly low level (other than having tensor descriptors which make the memory layout more flexible).

If you want to write your own tensor class and reusable NN framework, then the lines of code go up very rapidly. I did one in C++ a while back, and the tensor class alone was 20K LOC.



This post is about training, not inference. And llama.cpp has similarly simple LoRA training code. There is nothing in neural networks themselves so complex that it justifies the amount of complexity the Python ML community has piled up. MLX, for instance, is a similarly general-purpose research framework that is a fraction of the size.


It would be great if someone created a tutorial around this explaining exactly how it works and how to do a test training run. I’m aware it’s not feasible to train a “real” model on personal hardware but it would be nice to have a practical learning experience. I’m not sure if there are good alternatives for that.


The author has a whole series where he does exactly that. YouTube videos, code examples, documentation, everything. Explains the math, explains how to code it, explains the architecture. Everything.


If I was starting from scratch, what resources should I start with to build up an understanding of what this code does and how to read it? It's quite dense and my knowledge of LLMs is quite minimal. Are these terse variable names standard in LLM-land?


Terse variables are a C thing.

“What resources would I need” -> you’re literally commenting on a teacher’s content. Karpathy (the author) has a very informative YouTube channel where he goes step by step through everything. He has a ton of repos and tutorials. Dig a little.

If all else fails… Google it.



> you’re literally commenting on a teacher’s content.

How am I supposed to know that?

> Karpathy (the author) has a very informative YouTube channel where he goes step by step through everything.

Or that, without knowing that he's a teacher?

> Terse variables are a C thing.

I didn't realize variables had to be so short in C. Glad I write C++ professionally where they've added support for longer variable names.

> If all else fails… Google it.

There's a lot of LLM garbage out there. I got an answer here in a few minutes pointing to Karpathy's course which seems very high quality.

Be kinder.



> How am I supposed to know that?

You’re not supposed to know that. You asked a question, and this is you being told the answer.

It’s very convenient that the author of the post is quite literally the world’s most prolific teacher on this topic. Makes it easy to find Karpathy. You shouldn’t be expected to otherwise know that (or else why ask if you knew).

> I didn't realize variables had to be so short in C. Glad I write C++ professionally where they've added support for longer variable names.

This feels like a joke, but old C compilers did have identifier length limits. This is part of why C historically had shorter variable names than other, more modern languages.

Sorry if it came off rude, the internet is hard to communicate over.

https://publications.gbdirect.co.uk/c_book/chapter2/keywords...



As siblings have said, his video series are quite good. But if you're just looking at this repo only, you probably want to look at the python reference implementation. (The C is designed to exactly replicate its functionality.)


On one hand, really nice to see the whole thing in 1000 lines of C code.

On the other hand, that malloc function low key terrifies me. :)



Better to be explicit than to hide unsafe memory accesses under C++ stdlib classes like std::vector, which doesn't do range checking in operator[] either. And in this sort of code, automatically injected runtime range checks would most likely hurt performance enough to matter.

I would still run the code through the Clang static analyzer and a couple of test runs in ASAN and UBSAN to be sure that nothing slipped through.
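As a toy illustration of the kind of bug those tools catch in hand-indexed code like this (hypothetical snippet, not taken from llm.c):

```c
#include <stdlib.h>

/* Compile with: clang -g -fsanitize=address,undefined oob.c
   ASAN reports a heap-buffer-overflow pointing at the write of acts[8]. */
int main(void) {
    float *acts = malloc(8 * sizeof(float));
    for (int i = 0; i <= 8; i++)  /* off-by-one: touches acts[8] */
        acts[i] = 0.0f;
    free(acts);
    return 0;
}
```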



OT, but a question from someone curious... is CUDA still entrenched as the only option for doing AI, or is there growing support for AMD/Intel/other ways of doing AI?


You can run inference today on pretty much any card.

Download Ollama on a modern MacBook and you can run 13B and even larger models (if your RAM allows) at fast speeds. People run smaller models locally on their phones.

Google has trained their latest models on their own TPUs... not using Nvidia to my knowledge.

So, no, there are alternatives. CUDA has the largest mindshare on the training side though.



He (geohot) loudly gave up on AMD after they did not fix a blocker of his for 5+ months and gave him the runaround the entire time when he asked for the code so he could fix it himself. He is still shipping the AMD tinybox, with huge warning labels.


Randomly stumbled over this[1] post from another fed-up open-source contributor, about several serious issues with AMD's GPU drivers and firmware that have remained unresolved for years. It also references the geohot decision you mention.

Some quotes:

I find it incredible that these companies that have large support contracts with you and have invested hundreds of thousands of dollars into your products, have been forced to turn to me, a mostly unknown self-employed hacker with very limited resources to try to work around these bugs (design faults?) in your hardware.

In the VFIO space we no longer recommend AMD GPUs at all, in every instance where people ask for which GPU to use for their new build, the advise is to use NVidia.

[1]: https://www.reddit.com/r/Amd/comments/1bsjm5a/letter_to_amd_...



there are obv alternatives from both intel and amd, performant blas/dnn packages, but small teams don’t use them bc cuda is easier to use and has more support, and larger teams don’t use them bc they have deals w/ nvidia or not enough GPUs are available or they’re after the absolute best performance (which is still nvidia) or bc of other stuff like unstable drivers or smth


Taking a peek inside the package it seems to mostly be the libraries - CuFFT alone is about 350MB for example, twice over for the debug and release versions. I'm guessing those are probably fat binaries pre-compiled for every generation of Nvidia hardware rather than just the PTX bytecode, which would help to speed up fresh builds, at the expense of being huge.


I don't think it's about the byte size, but the inherent complexity of the implementation. 1000 lines of C code is extremely simple by any standard. Whereas a sundry collection of Python and PyTorch libraries is anything but.


Python has been popular for this because it’s convenient to quickly hack on and experiment with, not because it’s the most efficient thing.


The overhead really isn't that bad, is it? The Python code is mostly about saying "multiply matrix A with matrix B", and the actual computation is done by optimized low-level code.


It depends on how you define overhead. Runtime overhead and memory usage are absolutely marginal, and the tightest, most perfect implementation will have trouble beating it.

Instead people are trying to optimize install size of dependencies, which while maybe a fun hacking project...who really cares?



For that stuff, yeah you're correct.

What I've seen is issues with the implementation of those libraries in a project.

I don't remember exactly, but I was playing with someone's wrapper for some kind of machine-learning snake game, and it was taking way longer than it should have based on back-of-the-napkin math.

The issue was using either a dict or a list in a hot loop and changing it to the other sped it up like 1000x.

So it's easy to think "yeah, this library is optimized", but then you build something on top of it that slows it down in a non-obvious way.

But, that's the Python tradeoff.



> The issue was using either a dict or a list in a hot loop and changing it to the other sped it up like 1000x.

The programmer using the wrong data structure is not a problem with the language.



Kinda. I guess my native tongue is C/C++ and I wouldn't expect such a huge performance difference when using an array vs a linked list or something.

It's not like I had millions of items in that structure either, it was like 100. I think it contained the batch training data from each round. I tried to find the project but couldn't.

I was just shocked that there was such a huge difference between primitive data structures. In that situation, I wouldn't have guessed it would make a difference.



But then again, if your program has places where choosing the right Python primitive is important for performance, then using Python is affecting performance there, since even the best algorithm in Python would be slower than the equivalent C.

Most of the time it doesn't matter because there's nothing hot on the Python side, but if there is, then Python is going to be slowing your stuff down.



I suspect that this has a high chance of running afoul of Amdahl's Law. Even if you can parallelize the bulk of the computation, the serial parts remain single-threaded and start to dominate the total runtime.
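For reference, Amdahl's Law with parallelizable fraction p and a speedup of s on that fraction (the p = 0.95 figure below is only an illustration):

```latex
S = \frac{1}{(1 - p) + \frac{p}{s}}, \qquad
\text{e.g. } p = 0.95,\ s \to \infty \;\Rightarrow\; S \le \frac{1}{1 - 0.95} = 20
```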


But in this particular application, if you look e.g. through the training code, there's very little going on in terms of resource management. A handful of mallocs and some file handling.

That doesn't mean you can't have bugs even with a small number of allocations, but if automatic resource management is your main argument, perhaps llm.c isn't the most prominent case that could benefit from it.



RAII sucks the way C++ does it. Magical background BS with massive unintended complexity consequences requiring obtuse intricate crap like the copy-and-swap idiom, a mudball of pointer and reference types, etc. They should have added a defer/scope-exit statement and been done with it.


The plan is to eventually implement with CUDA:

"Currently, I am working on [...] direct CUDA implementation, which will be significantly faster and probably come close to PyTorch."
