Applied Sciences (Switzerland), cilt.16, sa.4, 2026 (SCI-Expanded, Scopus)
Manual code review is essential for software quality, but often slows down development cycles due to the high time demands on developers. In this study, we propose an automated solution for Python (version 3.13) projects that generates code review comments by combining Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG). To achieve this, we first curated a dataset from GitHub pull requests (PRs) using the GitHub REST Application Programming Interface (API) (version 2022-11-28) and classified comments into semantic categories using a semi-supervised Support Vector Machine (SVM) model. During the review process, our system uses a vector database to retrieve the top-k most relevant historical comments, providing context for a diverse spectrum of open-weights LLMs, including DeepSeek-Coder-33B, Qwen2.5-Coder-32B, Codestral-22B, CodeLlama-13B, Mistral-Instruct-7B, and Phi-3-Mini. We evaluated the system using a multi-step validation that combined standard metrics (BLEU-4, ROUGE-L, cosine similarity) with an LLM-as-a-Judge approach, and verified the results through targeted human review to ensure consistency with expert standards. The findings show that retrieval augmentation improves feedback relevance for larger models, with DeepSeek-Coder’s alignment score increasing by 17.9% at a retrieval depth of k = 3. In contrast, smaller models such as Phi-3-Mini suffered from context collapse, where too much context reduced accuracy. To manage this trade-off, we built a hybrid expert system that routes each task to the most suitable model. Our results indicate that the proposed approach improved performance by 13.2% compared to the zero-shot baseline (k = 0). In addition, our proposed system reduces hallucinations and generates comments that closely align with the standards expected from the experts.