Unsloth has introduced reasoning capabilities to its platform using Group Relative Policy Optimization (GRPO). GRPO lets users transform standard models into reasoning models locally with as little as 7GB of VRAM. Previously, GRPO was only supported for full fine-tuning, but it now works with QLoRA and LoRA as well.
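As a rough illustration of what the low-VRAM setup might look like, here is a minimal sketch following Unsloth's published GRPO workflow. The model name, LoRA rank, and sequence length below are placeholder choices, not prescriptions:

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit (QLoRA) so GRPO training fits a small
# VRAM budget; model name and hyperparameters are illustrative.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,      # QLoRA: 4-bit quantized base weights
    fast_inference=True,    # fast generation for GRPO rollouts
    max_lora_rank=32,
)

# Attach LoRA adapters so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",  # trades compute for memory
)
```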

Unlike Proximal Policy Optimization (PPO), GRPO optimizes responses efficiently without requiring a separate value function: it samples a group of completions per prompt and scores each one relative to the group's average reward, so no critic network has to be trained or held in memory. Use cases for GRPO include creating customized models tuned with reward functions, and generating reasoning traces for existing input-output data.
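To make the reward-function use case concrete, here is a sketch of the training-loop wiring using the `GRPOTrainer` from TRL, which Unsloth integrates with. The reward function is a deliberately toy, hypothetical example, and `model`, `tokenizer`, and `dataset` are assumed to come from a setup like the one above (with `dataset` containing a "prompt" column):

```python
from trl import GRPOConfig, GRPOTrainer

# GRPO takes plain Python callables as rewards; no learned value
# function is involved. This toy reward favors ~200-character answers.
def reward_len(completions, **kwargs):
    return [-abs(200 - len(completion)) for completion in completions]

training_args = GRPOConfig(
    output_dir="grpo-out",
    num_generations=8,           # group size: completions per prompt
    max_prompt_length=256,
    max_completion_length=512,
    learning_rate=5e-6,
    max_steps=250,
)

trainer = GRPOTrainer(
    model=model,                 # LoRA model from the earlier sketch
    processing_class=tokenizer,
    reward_funcs=[reward_len],   # one or more reward callables
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

Because each completion's advantage is measured against its own group's mean reward, swapping in a domain-specific reward function is all that is needed to steer the model toward a custom behavior.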