(Comments)

Original link: https://news.ycombinator.com/item?id=39443965

After analyzing and synthesizing various articles and resources, here is a summarized version: In artificial intelligence, machine learning (ML), and natural language processing (NLP), long short-term memory (LSTM) neural networks became increasingly popular for modeling sequential data in tasks such as speech recognition, language translation, and stock-market prediction. These recurrent neural networks process sequences efficiently by considering past inputs while weighing recent ones, but they require large datasets for training and can suffer from vanishing and exploding gradients. More recently, the Transformer architecture, built around self-attention mechanisms, has emerged as the dominant paradigm for NLP tasks. Self-attention captures dependencies across all positions in an input sentence, leading to superior performance compared to RNNs. Breakthrough models such as GPT (Generative Pretrained Transformer) have achieved state-of-the-art results with unprecedented capacity and scalability through massive unsupervised pretraining on large corpora of text. Nevertheless, building a transformer model involves several complex aspects, such as training a deep language model, masking input features during training, decoupling supervision signals with joint teacher and student loss functions, and dealing with challenges like overfitting and generalization. Overall, deep learning and NLP continue to evolve rapidly, requiring continuous research to push the boundaries further. As cutting-edge techniques like GPT and BERT demonstrate, the possibilities for novel methods and applications appear endless and will shape the landscape of AI for years to come.
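
As a loose illustration of the self-attention mechanism the summary refers to, here is a minimal sketch of scaled dot-product attention in plain NumPy; the shapes, weight matrices, and function name are illustrative, not taken from any particular model.

  import numpy as np

  def self_attention(x, Wq, Wk, Wv):
      # x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)
      q, k, v = x @ Wq, x @ Wk, x @ Wv
      scores = q @ k.T / np.sqrt(k.shape[-1])           # pairwise position-to-position scores
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
      return weights @ v                                # (seq_len, d_head)

  x = np.random.randn(5, 8)
  Wq, Wk, Wv = (np.random.randn(8, 4) for _ in range(3))
  print(self_attention(x, Wq, Wk, Wv).shape)            # (5, 4)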

Related articles

Original article
Let's Build the GPT Tokenizer [video] (youtube.com)
556 points by davidbarker 17 hours ago | 38 comments

No wonder GPT does so horribly on anything involving spelling, or the exact specifications of letters.

To fix it, I'd throw a few gigabytes of synthetic data into the training mix before fine-tuning, covering the alphabets of all the relevant languages, things like:

  A is an upper case a
  a is a lower case A
  the sequence of numbers is 0 1 2 3 4 5 6 7 8 9 10 11 12
  0 + 1 = 1
  1 + 1 = 2
etc.
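
A rough sketch of how that kind of synthetic spelling and arithmetic data could be generated; the file name, line templates, and counts below are made up for illustration, not anything OpenAI is known to use.

  import random
  import string

  lines = []
  # Spelling facts: upper/lower-case pairs for the Latin alphabet.
  for ch in string.ascii_uppercase:
      lines.append(f"{ch} is an upper case {ch.lower()}")
      lines.append(f"{ch.lower()} is a lower case {ch}")
  # Simple arithmetic facts.
  for _ in range(1000):
      a, b = random.randint(0, 99), random.randint(0, 99)
      lines.append(f"{a} + {b} = {a + b}")

  with open("synthetic_spelling_math.txt", "w") as f:
      f.write("\n".join(lines))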

It still amazes me that Word2Vec is as useful as it is, let alone LLMs. The structure inherent in language really does convey far more meaning than we assume. We're like fish, not being aware of water, when we use language.
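
For anyone who wants to see that structure directly, here is a hedged example using pretrained word vectors via gensim's downloader (assumes gensim 4.x and network access; these happen to be GloVe vectors, but the same vector arithmetic works with Word2Vec embeddings):

  import gensim.downloader as api

  vectors = api.load("glove-wiki-gigaword-50")  # small pretrained KeyedVectors (~66 MB download)
  # Vector arithmetic recovers analogies surprisingly well:
  print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
  print(vectors.similarity("cat", "dog"))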



We know OpenAI trains on significant amounts of synthetic data, they probably have something like this.


Andrej's video on building nanoGPT is an excellent tutorial on all of the steps involved in a modern LLM.




His earlier videos on micrograd and makemore are a gold mine as well.
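
For a taste of what micrograd covers, here is a tiny scalar autograd sketch in that spirit (illustrative only, not the actual micrograd code): each value records how to push gradients back to its inputs, and backward() applies the chain rule in reverse topological order.

  class Value:
      def __init__(self, data, children=()):
          self.data = data
          self.grad = 0.0
          self._children = children
          self._grad_fn = None  # distributes self.grad to children

      def __add__(self, other):
          out = Value(self.data + other.data, (self, other))
          def grad_fn():
              self.grad += out.grad
              other.grad += out.grad
          out._grad_fn = grad_fn
          return out

      def __mul__(self, other):
          out = Value(self.data * other.data, (self, other))
          def grad_fn():
              self.grad += other.data * out.grad
              other.grad += self.data * out.grad
          out._grad_fn = grad_fn
          return out

      def backward(self):
          # Build topological order, then apply the chain rule from the output back.
          topo, visited = [], set()
          def build(v):
              if v not in visited:
                  visited.add(v)
                  for c in v._children:
                      build(c)
                  topo.append(v)
          build(self)
          self.grad = 1.0
          for v in reversed(topo):
              if v._grad_fn:
                  v._grad_fn()

  x, y = Value(2.0), Value(3.0)
  z = x * y + x
  z.backward()
  print(z.data, x.grad, y.grad)  # 8.0 4.0 2.0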


It’s pretty wild how little discussion there's been about this core component of these models. It's as if this aspect of their development has been solved. Basically all NLP publications today take these BPE tokens as a starting point, and if they are mentioned at all, it's in passing.

https://blog.seanbethard.net/meanings-are-tiktokens-in-space
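
For reference, the BPE training loop those tokens come from is conceptually small. Here is a minimal sketch (names and the merge count are illustrative, not the exact code from the video): start from raw bytes, repeatedly find the most frequent adjacent pair, and replace it with a new token id.

  from collections import Counter

  def get_pair_counts(ids):
      # Count how often each adjacent pair of token ids occurs.
      return Counter(zip(ids, ids[1:]))

  def merge(ids, pair, new_id):
      # Replace every occurrence of `pair` in `ids` with the new token id.
      out, i = [], 0
      while i < len(ids):
          if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
              out.append(new_id)
              i += 2
          else:
              out.append(ids[i])
              i += 1
      return out

  text = "aaabdaaabac"
  ids = list(text.encode("utf-8"))   # start from raw bytes, ids 0..255
  merges = {}
  for step in range(3):              # number of merges = target vocab size - 256
      pair = get_pair_counts(ids).most_common(1)[0][0]
      new_id = 256 + step
      ids = merge(ids, pair, new_id)
      merges[pair] = new_id
  print(ids, merges)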



Very grateful that he puts out this kind of education. The one nit I have is that he didn't explain the more abstract questions at the beginning, which left a bit of a bad taste, I guess. I hope I am not being disrespectful.


I can't recommend enough the whole series, zero to hero: https://karpathy.ai/zero-to-hero.html

No metaphors trying to explain "complex" ideas, making them seem scary and overly complex. Instead, hands-on implementations with accompanying explanations, where you can actually understand the ideas and see how simple they are.

Steeper learning curve at first, but it is much more satisfying, and you actually earn the ability to reason about this stuff instead of writing over-the-top influencer BS.



One thing I like about that zero-to-hero series is how he almost never handwaves over seemingly minor details.

Definitely recommend watching those videos and doing the exercises, if you have any interest in how LLMs work.



Thanks for this link - I have some free time coming up, and this seems like a great use of it!


A noob question: do you all intend to work on LLMs, or are you watching the content out of curiosity? I am asking how anyone like me, a software generalist, can make use of this amazing content. Anyone with insights on how to transition from a generalist backend engineer to an AI engineer? Or is it a niche where the only path is the route of a PhD …


Speaking for myself, and except for just being curious, it's mostly for similar reasons as to why you'd want to read, for example, CLRS, even though you'll probably never implement an algorithm like that in a real production environment yourself. It's not so much about learning how, but rather why, because it'll help you answer your why's in the future (not that the how can't also be important, of course).


I was not really interested in LLMs till a month back. I had an earlier product where I wanted a no-code app for business insights on any data source. Plug in MySQL, PostgreSQL, APIs like Stripe, Salesforce, Shopify, even CSV files, and it would generate queries from the user's GUI interactions. Like Airtable, but for your own data sources. I was generating SQL, including JOINs, or HTTPS API calls.

Then I abandoned it in 2021. This year, it struck me that LLMs would be great for inferring business insights from the schema. I could create reports and dashboards automatically, surfacing critical action points straight from the schema/data and from users chatting with the app.

So for the last couple of weeks, I have been building it, running tests on LLMs (CodeLlama, Zephyr, Mistral, Llama 2, Claude and ChatGPT). The results are quite good. There is a lot of tech that I need to handle: schema analysis, SQL or API calls, and the whole UI. But without LLMs, there was no clear way for me to infer business insights from schema + user chats.

To me, this is not a niche anymore, now that I have found a problem I already wanted to tackle.
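
A rough sketch of the schema-to-SQL flow described above; the schema, prompt wording, and call_llm helper are hypothetical stand-ins for whichever model API is used (ChatGPT, Claude, a local Mistral, etc.), not a real library call.

  SCHEMA = """
  orders(id INT, customer_id INT, total NUMERIC, created_at TIMESTAMP)
  customers(id INT, name TEXT, country TEXT)
  """

  def build_prompt(schema: str, question: str) -> str:
      return (
          "You are a SQL assistant. Given this schema:\n"
          f"{schema}\n"
          f"Write a single PostgreSQL query answering: {question}\n"
          "Return only SQL."
      )

  def call_llm(prompt: str) -> str:
      # Hypothetical: swap in your provider's client here.
      raise NotImplementedError

  if __name__ == "__main__":
      prompt = build_prompt(SCHEMA, "monthly revenue by country for 2023")
      print(prompt)            # inspect what the model will see
      # sql = call_llm(prompt) # then validate/execute the returned SQL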



Just a guess, but understanding how LLMs are built may also help you if you want to fine-tune a model. Someone who knows more may confirm or contradict this.


The best thing is that I know Andrej reads all these comments. Hi Andrej! This is your calling. Miss you though!


There should be awards for this type of content. Andrew Ng series and Karpathy series as first inductees to the hall of fame.


His video on Backpropagation was a revelation to me :

https://www.youtube.com/watch?v=q8SA3rM6ckI



Probably more coming soon, given he just left OpenAI to pursue other things.


Even if you pay, it is hard to get content of such high quality!


I've been learning a few new CS things recently, and honestly I mostly find an inverse correlation between cost and quality.

There are books from O'Reilly and paid MOOC courses that are just padded with lots of unnecessary text or silly "concept definition" quizzes to make them seem worth the price.

And there are excellent free YT video lectures, free books or blog posts.

Andrej's YT videos are one great example. https://course.fast.ai is another.



It's not only about the cost, though. There's an inverse correlation with the glossiness of the content as well.

If the web page/content is too polished, it's most likely optimized for wooing users.

Unlike a lot of the examples I gave in the sibling comments, where the optimization comes only from love for the topic being discussed.



  There's an inverse correlation with the glossiness of the content as well.

This is probably due to survivorship bias. Sites that have poor content and poor visual appeal (glossiness) never get on your radar.

i.e. Berkson's Paradox: https://en.wikipedia.org/wiki/Berkson%27s_paradox



Full ACK. I have also grown weary of paid course offerings, because many of the ones I have checked out were basically low quality or shallow.


There are some extremely good CS textbooks which cost money. That being said, many good ML/AI texts are free. But it's not easy reading.


> And there are excellent free YT video lectures, free books or blog posts.

There's also a tremendous amount of extremely low quality YouTube and blog content.



Sure. I don't claim the free content is all good.

But from my limited sample size, the best free content is better than the best paid content.



Do you have recommendations for other high quality courses teaching CS things?


- Operating Systems: Three Easy Pieces (https://pages.cs.wisc.edu/~remzi/OSTEP) is incredible for learning OS internals

- Beej's networking guide is the best thing for network-layer stuff: https://beej.us/guide/

- Explained from First Principles is great too: https://explained-from-first-principles.com/

- Pintos from Stanford: https://web.stanford.edu/class/cs140/projects/pintos/pintos_...



Wow. Thanks for sharing. I had no idea that Professor Remzi and his wife Andrea wrote a book on Operating Systems. I loved his class (took it almost 22 years ago.) Will have to check his book out.


I can highly recommend CS50 from Harvard (https://www.youtube.com/@cs50). Even after being involved in tech for 25+ years, I learnt a lot from just the first lecture alone.

Disclosure: Professor Malan is a friend of mine, but I was a fan of CS50 long before that!



Replying to bookmark (hoard) all the thread links for later.

Fellow hackers might also enjoy:

https://www.nand2tetris.org/



Build an 8-bit computer from scratch https://eater.net/8bit/ https://www.youtube.com/playlist?list=PLowKtXNTBypGqImE405J2...

Andreas Kling. OS hacking: Making the system boot with 256MB RAM https://www.youtube.com/watch?v=rapB5s0W5uk

MIT 6.006 Introduction to Algorithms, Spring 2020 https://www.youtube.com/playlist?list=PLUl4u3cNGP63EdVPNLG3T...

MIT 6.824: Distributed Systems https://www.youtube.com/@6.824

MIT 6.172 Performance Engineering of Software Systems, Fall 2018 https://www.youtube.com/playlist?list=PLUl4u3cNGP63VIBQVWguX...

CalTech cs124 Operating Systems https://duckduckgo.com/?t=ffab&q=caltech+cs124&ia=web

try searching here at HN for recommendations https://hn.algolia.com



Thank you a ton for the links.


nand2tetris: https://www.nand2tetris.org/

I like the book better than the online course.



His previous video on LLM transformer foundations is extremely useful.


Had to double check my playback speed - he talks like a 1.25x playback speaker sounds.


Let him teach; we don't need one more rapper.


> you see when it's a space egg, it's a single token

I'm not sure if the crew of the Nostromo would agree ;)
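
For anyone who wants to poke at that themselves, a quick check with OpenAI's tiktoken library (whether a given string maps to a single token depends on the encoding you pick):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding
  for s in ["egg", " egg", "Egg", "EGG"]:
      print(repr(s), enc.encode(s))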









