Supervised fine-tuning teaches a model from example outputs. Reinforcement learning (RL) teaches from *rewards* -- the model generates its own outputs, and a reward function scores them. The model then learns to produce outputs that earn higher reward.
In tutorial 04, we wrote a GRPO training loop from scratch: sample completions, grade them, compute advantages, build datums, train. That works, but every new task would repeat the same boilerplate.
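That per-task boilerplate can be sketched end to end. The names below (`sample_completions`, `reward_fn`, and so on) are hypothetical stand-ins for illustration, not the tutorial's actual API; a real loop would sample from the model and call an optimizer step instead of the toy versions here:

```python
import random

def sample_completions(prompt, n):
    # Stand-in for model sampling; real code would query the model being trained.
    return [f"{prompt} -> answer {random.randint(0, 9)}" for _ in range(n)]

def reward_fn(completion):
    # Toy grader: reward 1.0 if the final digit is even, else 0.0.
    return 1.0 if int(completion.split()[-1]) % 2 == 0 else 0.0

def grpo_advantages(rewards):
    # Group-relative advantage: each completion's reward minus the group mean,
    # so above-average completions get positive advantage.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def build_datums(prompt, completions, advantages):
    # Pair each completion with its advantage for the training step.
    return [{"prompt": prompt, "completion": c, "advantage": a}
            for c, a in zip(completions, advantages)]

prompt = "What is 2 + 2?"
completions = sample_completions(prompt, 4)   # sample
rewards = [reward_fn(c) for c in completions] # grade
advantages = grpo_advantages(rewards)         # compute advantages
datums = build_datums(prompt, completions, advantages)  # build datums
# A real loop would now pass `datums` to a training step and repeat.
```

Note that the advantages sum to zero within each group: GRPO's baseline is the group mean, so the update pushes probability toward the better-than-average completions and away from the worse ones.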