Original link: https://news.ycombinator.com/item?id=40269489

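Summary: The poster shares their journey of self-studying Deep Reinforcement Learning (DRL) and creating a hands-on tutorial to complement existing resources. Although they drew on many helpful resources [1], they found that none struck the right balance between theoretical understanding and practical application for their needs. They rewrote all the code from scratch with a focus on pedagogy. The project, "Deep Reinforcement Learning: Zero to Hero", covers leading DRL methods such as Q-Learning, Deep Q-Network, Soft Actor-Critic, and Proximal Policy Optimization. The goal is to offer an engaging, continuous learning path that combines theory with coding exercises. Inspired by Andrej Karpathy's "Neural Networks: Zero to Hero" [2], the poster plans to produce accompanying YouTube videos but has not yet had the time.

[1] GitHub repository with the introductory material: https://github.com/alessiodm/drl-zh/blob/main/00_Intro.ipynb [2] Neural Networks: Zero to Hero web page: https://karpathy.ai/zero-to-hero.html

In the discussion, one commenter says that after studying RL extensively in college they were disappointed that state-of-the-art RL techniques could not beat a simple heuristic on a specific task. Another commenter describes successfully applying RL concepts inside their company, which saved costs and surfaced valuable insights about product features. The poster argues that RL offers a distinctive perspective, an intelligent agent that interacts with its environment and receives feedback, which is an intuitive way to model complex problems. They acknowledge how hard it is to make RL work well, while encouraging beginners to learn the fundamentals as a valuable stepping stone in their machine learning journey. They also point to recent progress in LLM fine-tuning (such as RLHF) and in robotics projects that use RL.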

While trying to learn the latest in Deep Reinforcement Learning, I was able to take advantage of many excellent resources (see credits [1]), but I couldn't find one that provided the right balance between theory and practice for my personal experience. So I decided to create something myself, and open-source it for the community, in case it might be useful to someone else.

None of that would have been possible without all the resources listed in [1], but I rewrote all algorithms in this series of Python notebooks from scratch, with a "pedagogical approach" in mind. It is a hands-on step-by-step tutorial about Deep Reinforcement Learning techniques (up to ~2018/2019 SoTA) guiding through theory and coding exercises on the most utilized algorithms (Q-Learning, DQN, SAC, PPO, etc.).

I shamelessly stole the title from a hero of mine, Andrej Karpathy, and his "Neural Networks: Zero to Hero" [2] work. I also meant to work on a series of YouTube videos, but haven't had the time yet. If this post gets any kind of interest, I might go back to it. Thank you.

P.S.: A friend of mine suggested that I post here, so I followed their advice: this is my first post, and I hope it properly abides by the rules of the community.

[1] https://github.com/alessiodm/drl-zh/blob/main/00_Intro.ipynb [2] https://karpathy.ai/zero-to-hero.html

Yes, the material relies heavily on Python. I intentionally used popular open-source libraries (such as Gymnasium for RL environments, and PyTorch for deep learning) and Python itself given their popularity in the field, so that the content and learnings could be readily applicable to real-world projects.

The theory and algorithms per se are general: they can be re-implemented in any language, as long as there are comparable libraries to use. But the notebooks are primarily in Python, and the (attempted) "frictionless" learning experience would lose a bit if the setup were in a different language, so it would likely take a little more effort to follow along.

Thank you so much! And very good advice: I have an extremely brief and non-descriptive list in the "Next" notebook, initially intended for that. But it definitely falls short.

I may actually expand it in a second "more advanced" series of notebooks, to explore model-based RL, curiosity, and other recent topics: even if not comprehensive, some hands-on basic coding exercises on those topics might be of interest nonetheless.

In case you want to expand to more chapters one day: there are lots of tutorials on doing the simple things that have been verified to work, but when I'm struggling it's normally with something people barely ever mention: what to do when things go wrong. For example, your actions just consistently get stuck at maximum. Or the exploration doesn't kick in, regardless of how noisy you make the off-policy training. Or ...

I wish there were more practical resources for when you've got the basics usually working, but suddenly get issues nobody really talks about. (beyond "just tweak some stuff until it works" anyway)

Thanks a lot, and another great suggestion for improvement. I also found that the common advice is "tweak hyperparameters until you find the right combination". That can definitely help. But usually issues hide in different "corners": the problem space and its formulation, the algorithm itself (e.g., different random seeds alone can cause big variance in performance), and more.
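To make the random-seed point concrete, here is a minimal sketch of running the same configuration under several seeds and reporting the spread rather than a single number. `train_and_evaluate` is a hypothetical stand-in (it just draws a noisy score); in practice it would be a full training run that returns an evaluation score.

```python
import random
import statistics


def train_and_evaluate(seed):
    """Hypothetical stand-in for a full training run; returns a final score."""
    rng = random.Random(seed)
    # Placeholder: a real run would train an agent and average its evaluation returns.
    return 100.0 + rng.gauss(0.0, 25.0)


def summarize_across_seeds(run, seeds):
    """Run the same configuration under several seeds and summarize the spread."""
    scores = [run(seed) for seed in seeds]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


summary = summarize_across_seeds(train_and_evaluate, seeds=range(5))
print(summary)
```

If the standard deviation is large compared to the gap between two configurations, a hyperparameter "improvement" may just be seed noise.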

As you mentioned, in real applications of DRL things tend to go wrong more often than right: "it doesn't work just yet" [1]. And my short tutorial definitely lacks in the area of troubleshooting, tuning, and "productionisation". If I carve out time for an expansion, this will likely be at the top of the list. Thanks again.

[1] https://www.alexirpan.com/2018/02/14/rl-hard.html

Thank you for the amazing links as well! You are right that the article [1] is 6 years old now, and indeed the field has evolved. But the algorithms and techniques I share in the GitHub repo are the "classic" ones (dating back to that time too), for which that post is still relevant - at least from a historical perspective.

You bring up a very good point though: more recent advancements and assessments should be linked and/or mentioned in the repo (e.g., in the resources and/or an appendix). I will try to do that sometime.

I spent three semesters in college learning RL only to be massively disappointed in the end after discovering that the latest and greatest RL techniques can’t even beat a simple heuristic in Tetris.

I modeled part of my company's business problem as a MAB problem and saved my company 10% off their biggest cost and, just as important, showcased an automated truth signal that helped us understand what was, and wasn't, working in several of our features. Like all tools, finding the right place to use RL concepts is a big deal. I think one thing that is often missed in a classroom setting is pushing more real world examples of where powerful ideas can be used. Talking about optimal policies is great, but if you don't help people understand where those ideas can be applied then it is just a bunch of fun math. (which is often a good enough reason on its own :)

For those not in the know, "MAB" is short for Multi-Armed Bandit [1], which is a decision-making framework that is often discussed in the broader context of reinforcement learning.

In my limited understanding, MAB problems are simpler than those tackled by Deep Reinforcement Learning (DRL), because typically there is no state involved in bandit problems. However, I have no idea about their scale in practical applications, and would love to know more about said business problem.

[1] https://en.wikipedia.org/wiki/Multi-armed_bandit

There are often times when you have n possible providers of service y, each with strengths and weaknesses. If you have some ultimate truth signal (like follow-on costs that are linked to quality, which is what I used), then you can model the providers as bandits and use something like UCB1 to choose which to use. If you then apply this to every individual customer, what you end up doing is learning the optimal vendor for each customer, which gives you a higher efficiency than had you picked just one 'best all around' vendor for all customers. So the pattern here is: if you have n_service_providers and n_customers and a value signal to optimize, then maybe MAB is the place to go for some possible quick gains. Of course, if you have a huge state space to explore instead of just n_service_providers, for instance because you want to model combinations of choices, using something like a NN to learn the state space value function is also a great way to go.
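The per-customer bandit pattern described above can be sketched roughly as follows. This is only an illustration under assumptions: the `UCB1` class, the `simulate` helper, the customer names, and the Bernoulli reward (standing in for the "truth signal", scaled to [0, 1]) are all hypothetical and not taken from the commenter's actual system.

```python
import math
import random
from collections import defaultdict


class UCB1:
    """UCB1 bandit over a fixed set of arms (here: service providers)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # times each arm was chosen
        self.means = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Play every arm once before trusting the confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return scores.index(max(scores))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(true_quality, rounds=2000, seed=0):
    """Run one bandit per customer against hypothetical Bernoulli rewards."""
    rng = random.Random(seed)
    n_providers = len(next(iter(true_quality.values())))
    bandits = defaultdict(lambda: UCB1(n_providers))
    for _ in range(rounds):
        for customer, qualities in true_quality.items():
            bandit = bandits[customer]
            arm = bandit.select()
            reward = 1.0 if rng.random() < qualities[arm] else 0.0
            bandit.update(arm, reward)
    return {customer: bandit.counts for customer, bandit in bandits.items()}


# Hypothetical data: each customer has a different best provider.
pull_counts = simulate({
    "customer_a": [0.9, 0.2, 0.2],
    "customer_b": [0.2, 0.2, 0.9],
})
print(pull_counts)
```

Keeping one bandit per customer is what lets the two customers settle on different providers; a single shared bandit would average their rewards together and could only pick one overall winner.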

RL seems to be in this weird middle ground right now where nobody knows how to make it work all that well but almost everybody at the top levels of ML research agrees it's a vital component of further advances in AI.

RL can be massively disappointing, indeed. And I agree with you (and with the amazing post I already referenced [1]) that it is hard to get it to work at all. Sorry to hear you have been disappointed so much!

Nonetheless, I would personally recommend even just learning the basics and fundamentals of RL. Beyond supervised, unsupervised, and the most recent and deservedly hyped self-supervised learning (generative AI, LLMs, and so on), reinforcement learning indeed models the learning problem in a very elegant way: an agent interacting with an environment and getting feedback. Which is, arguably, a very intuitive and natural way of modeling it. You could consider backward error correction / propagation as an implicit reward signal, but that would be a very limited view.

On a positive note, RL has very practical successful applications today - even if in niche fields. For example, LLM fine-tuning techniques like RLHF successfully apply RL to modern AI systems, companies like Covariant are working on large robotics models which definitely use RL, and generally as a research field I believe (but I may be proven wrong!) there is so much more to explore. For example, check out Nvidia Eureka, which combines LLMs with RL [2]: pretty cool stuff IMHO!

Far from attempting to convince you of the strength and capabilities of DRL, I'm just recommending that folks not discard it right away and at least give learning the basics a chance, even just as an intellectual exercise :) Thanks again!

[1] https://www.alexirpan.com/2018/02/14/rl-hard.html

[2] https://blogs.nvidia.com/blog/eureka-robotics-research/

This looks really interesting! I tried exploring deep RL myself some time ago but could never get my agents to make any meaningful progress, and as someone with very little stats/ML background it was difficult to debug what was going wrong. Will try following this and seeing what happens!

Thank you very much! I'd be really interested to know if your agents will eventually make progress, and if these notebooks help - even if a tiny bit!

If you just want to see if these algorithms can even work at all, feel free to jump into the `solution` folder, pick any algorithm you think could work, and just try it out there. If it does, then you can have all the fun rewriting it from scratch :) Thanks again!

I mean, resources like these are great, but RL in itself is quite dense and topic-heavy, so I'm not sure there is any way to reduce the inherent difficulty level, and that should be made clear to any beginner. That's my primary gripe with ML topics (especially RL-related ones).

Thank you. It is true: the material does assume some prior knowledge (which I mention in the introduction). In particular: being proficient in Python, or at least in one high-level programming language, being familiar with deep learning and neural networks, and - to get into the theory and mathematics (optional) - knowing basic calculus, algebra, statistics, and probability theory.

Nonetheless, especially for RL foundations, I found that a practical understanding of the algorithms at a basic level, writing them yourself, and "playing" with them and their results (especially in small toy settings like the grid world) provided the best way to start getting a basic intuition in the field. Hence, this resource :)
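As an illustration of the kind of small toy setting meant here, below is a minimal sketch of tabular Q-learning on a made-up one-dimensional corridor. It is not code from the notebooks: the `Corridor` environment, its reward of 1.0 at the goal, and the hyperparameter values are assumptions chosen only to keep the example short.

```python
import random


class Corridor:
    """Hypothetical toy environment: walk right from cell 0 to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        delta = 1 if action == 1 else -1
        self.position = min(max(self.position + delta, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def q_learning(env, episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(env.length)]  # q[state][action]
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy with random tie-breaking: mostly exploit, sometimes explore.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = env.step(action)
            # The feedback from the environment drives the update.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning(Corridor())
# Greedy action for every non-terminal cell (1 means "move right").
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
print(greedy_policy)
```

Printing `q_table` after a handful of episodes, and again after a few hundred, shows the reward at the goal propagating backwards along the corridor, which is a good first intuition for how value estimates are learned.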

This is really nice, great idea. I am going to make a suggestion which I hope is helpful - I don't mean to be critical of this nice project.

After going through the MDP example, I have one comment on the way you introduce the non-deterministic transition function. In your example the non-determinism comes from the agent making "mistakes", it can mistakenly go left or right when trying to go up or down:

1) You could introduce the mistakes more clearly, as the text doesn't really explain that the agent makes mistakes, and so the comment about mistakes in the transition() function is initially a bit confusing.

2) I think the way this introduces non-determinism could be more didactic if the non-determinism came from the environment, not the agent. For example, the agent might be moving on a rough surface, and moving its tracks/limbs/whatever might not always produce the intended outcome. As you present it, the transition is a function from an action to a random action to a random state, whereas the definition is just a function from an action to a random state.

Thank you so much for this feedback! Indeed, this is definitely confusing in the notebook. I pushed a small commit to make it a little bit more clear that the non-determinism comes from the probabilistic nature of the environment dynamics (and not b/c the agent chooses a different action by mistake).
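For readers following along, here is a minimal sketch of what "the non-determinism comes from the environment dynamics" can look like in code. It is not the notebook's `transition()` implementation: the 3x3 grid, the slip probability of 0.2, and all function names are assumptions made for illustration. The agent's action is never replaced; the environment simply maps a (state, action) pair to a distribution over next states.

```python
import random

# Hypothetical 3x3 grid world; states are (row, col) tuples.
ROWS, COLS = 3, 3
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
# Directions perpendicular to each action: where a rough surface may push the agent.
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}


def move(state, direction):
    """Deterministic displacement, staying in place when hitting a wall."""
    row, col = state
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < ROWS and 0 <= new_col < COLS:
        return (new_row, new_col)
    return state


def transition_probs(state, action, slip_prob=0.2):
    """Return P(s' | s, a) as a dict; the slip belongs to the environment."""
    outcomes = [(action, 1.0 - slip_prob)]
    outcomes += [(side, slip_prob / 2) for side in SLIPS[action]]
    probs = {}
    for direction, prob in outcomes:
        next_state = move(state, direction)
        # Different outcomes can land on the same cell (e.g. against a wall).
        probs[next_state] = probs.get(next_state, 0.0) + prob
    return probs


def sample_next_state(state, action, rng=random):
    """Draw s' from P(. | s, a)."""
    probs = transition_probs(state, action)
    states = list(probs)
    return rng.choices(states, weights=[probs[s] for s in states], k=1)[0]


print(transition_probs((1, 1), "up"))
```

Exposing the full distribution through `transition_probs` (rather than only sampling) also makes it usable by planning methods such as value iteration, which need the probabilities themselves.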

As a side note, initially I meant to go through it in a video to fill the gaps in the text with my voice. But given that I didn't have time for those, I am fixing those gaps first :) Thanks again!

Thanks for making this!

Note: I was carefully reading along and well into the third notebook before I realized that the code sections marked "TODO" were actual exercises for the reader to implement! (And the tests which follow are for the reader to check their work.)

This is a clever approach. It just wasn't obvious to me from the outset.

(I thought the TODOs were just some fiddly details you didn't want distracting readers from the big picture. But in fact, those are the important parts.)

Great feedback, it didn't even occur to me that the TODOs could be confusing! I updated the instructions in the README.md to call them out explicitly as the coding sections to be completed. Thanks again!

Awesome, I've been sort of stuck in the limbo of doing courses that taught me some theory but missing the hands on knowledge I need to really use RL. This looks like exactly the type of course I'm looking for!

Thank you so much! Unfortunately, that is a mistake in the README that I just noticed (thank you for pointing it out!) :( As I mentioned in the first post, I didn't get to make the YouTube videos yet. But it seems the community would be indeed interested.

I will try to get to them (and in the meantime fix the README, sorry about that!)

"Shamelessly stole the title from a hero of mine". Your Shamelessness is all fine. But at first I thought this is a post from Andrej Karpathy. He has one of the best personal brands out there on the internet, while personal brands can't be enforced, this confused me at first.

TL;DR: If more folks feel this way, please upvote this comment: I'll be happy to take down this post, change the title, and either re-post it or not - the GitHub repo is out there, and that should be more than enough. Sorry again for the confusion (I just upvoted it).

I am deeply sorry about the confusion. The last thing I intended was to grab any attention away from Andrej, and/or to be confused with him.

I tried to find a way to edit the post title, but I couldn't find one. Is there just a limited time window to do that? If you know how to do it, I'd be happy to edit it right away in case.

I didn't even think this post would get any attention at all - it is indeed my first post here, and I really did it just b/c I was happy to share in case anybody could use this project to learn RL.

Throwing in my vote - I wasn’t confused: I saw your GH link and a “Zero to Hero” course name on RL, and it seemed clear to me. “Zero to Hero” is a classic title for a first course, and it's nice that you gave props to Andrej too! Multiple people can and should make ML guides and reference each other. Thanks for putting in the time to share your learnings and make a fantastic resource out of it!

Thanks a lot. It makes me feel better to hear that the post is not completely confusing and appropriating - I really didn't mean that, or to use it as a trick for attention.
