10th International Conference on Computer Science and Engineering, UBMK 2025, İstanbul, Türkiye, 17-21 September 2025, pp. 645-650 (Full Text Paper)
Enhancing the reasoning capabilities of Large Language Models (LLMs) remains a core AI challenge. This work examines how Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), particularly Group Relative Policy Optimization (GRPO), affect LLM performance on logic-based tasks. We train Llama-3 (3B, 8B) and Qwen-2.5 (3B) on datasets enriched with synthetically generated Chain-of-Thought (CoT) traces, including single-solution and multi-solution variants, denotation-verified trial-and-error traces, and self-generated reasoning data, all capturing intermediate reasoning steps. We also explore GRPO with reward functions of varying restrictiveness, as well as combinations of these approaches. Evaluated on the 24 and Countdown Games, as well as GSM8K, several approaches surpass standard Chain-of-Thought methods. Notably, GRPO with reasoning-rich SFT achieves 71.6% accuracy on the 24 Game, well above GPT-4o's 54%, and boosts GSM8K scores by 4-10 points over baseline SFT. Results show that explicitly modeling reasoning not only enhances problem solving but also enables smaller models, when trained appropriately, to outperform much larger and more capable models on specific logic and reasoning tasks.
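To make "reward functions with varying restrictiveness" concrete, below is a minimal sketch of what a verifiable reward for the 24 Game could look like in a GRPO setup. The `<answer>` tag format, the `strict` flag, and the partial-credit value are illustrative assumptions, not the paper's actual reward design: a strict variant rewards only a fully correct, well-formed solution, while a lenient variant also grants partial credit for format compliance.

```python
import re

def reward_24_game(completion: str, numbers: list[int], strict: bool = True) -> float:
    """Hypothetical verifiable reward for the 24 Game.

    strict=True:  reward only an exactly correct, well-formed answer.
    strict=False: also give small partial credit for format compliance.
    """
    # Assumed output format: the final expression inside <answer>...</answer>.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0

    expr = match.group(1).strip()
    format_bonus = 0.0 if strict else 0.1  # lenient variant: credit for format alone

    # The expression must use exactly the four given numbers, each once.
    used = sorted(int(tok) for tok in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return format_bonus

    # Evaluate the arithmetic expression with builtins disabled.
    try:
        value = eval(expr, {"__builtins__": {}}, {})
    except Exception:
        return format_bonus

    return 1.0 if abs(value - 24) < 1e-6 else format_bonus

# Example: a correct completion scores 1.0 under either variant.
print(reward_24_game("Let me try... <answer>(8-4)*(7-1)</answer>", [8, 4, 7, 1]))  # 1.0
print(reward_24_game("<answer>8+4+7+1</answer>", [8, 4, 7, 1], strict=False))      # 0.1
```

Varying restrictiveness then amounts to toggling how much reward the policy receives for intermediate properties (format, valid number usage) versus the verified final answer, which shapes the exploration behavior GRPO optimizes over.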