Repository-Level Code Understanding by LLMs via Hierarchical Summarization: Improving Code Search and Bug Localization


Oskooei A. R., Yukcu S., Bozoglan M. C., AKTAŞ M. S.

Workshops of the International Conference on Computational Science and Its Applications, ICCSA 2025, İstanbul, Türkiye, 30 June - 03 July 2025, vol. 15886 LNCS, pp. 88-105, (Full-Text Paper)

  • Publication Type: Conference Paper / Full-Text Paper
  • Volume: 15886 LNCS
  • DOI: 10.1007/978-3-031-97576-9_6
  • City of Publication: İstanbul
  • Country of Publication: Türkiye
  • Pages: pp. 88-105
  • Keywords: Applied Machine Learning, Automatic Program Repair, Defect Detection, Large Language Models (LLMs), Semantic Code Search, Software Engineering
  • Affiliated with Yıldız Technical University: Yes

Abstract

Bug localization and semantic code search within large software repositories are significant, time-consuming challenges for developers, particularly when dealing with bug reports from end-users who lack technical expertise. Traditional similarity-based code search methods struggle with the inherent domain and vocabulary mismatch between end-user reports and codebase semantics, while directly applying Large Language Models (LLMs) is hampered by their limited context windows and lack of repository-level understanding. To address these limitations, this paper introduces a novel, structure-aware methodology for creating repository-aware LLMs using hierarchical summarization. Our approach comprises a pre-processing phase that constructs an abstract repository tree, creates a context-aware LLM primed with project knowledge, and generates hierarchical summaries at the project, directory, and file levels. The inference phase employs a top-down search strategy, guiding the LLM to progressively narrow the search space from the directory level to the file level, effectively localizing bug-relevant code. This method mitigates the context window bottleneck and leverages LLMs’ semantic understanding to overcome domain gap issues. Evaluated on a real-world dataset of Jira issues from a large-scale industrial project, our approach significantly outperforms both Flat Retrieval baselines and state-of-the-art LLM + Retrieval-Augmented Generation (RAG) systems, achieving a Pass@10 of 0.89 and a Recall@10 of 0.33. The results demonstrate the efficacy of hierarchical summarization in enabling scalable, task-agnostic, and structure-aware repository-level code comprehension for improved bug localization and code search, particularly in scenarios involving non-technical end-user bug reports.
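
The top-down search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Node` class, the `beam` parameter, the toy repository, and the word-overlap `relevance` function (a stand-in for the LLM's judgement over summaries) are all assumptions made for illustration. The `pass_at_k` and `recall_at_k` helpers use common definitions of these metrics, which may differ from the ones used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a (hypothetical) abstract repository tree."""
    path: str
    summary: str                      # in practice this would be LLM-generated
    children: list["Node"] = field(default_factory=list)
    is_file: bool = True


def relevance(query: str, summary: str) -> float:
    """Stand-in for the LLM: fraction of query words that appear in the summary."""
    q = set(query.lower().split())
    s = set(summary.lower().split())
    return len(q & s) / len(q) if q else 0.0


def top_down_search(root: Node, query: str, beam: int = 2, k: int = 10) -> list[str]:
    """Descend level by level, keeping only the `beam` best directories each time."""
    frontier = [root]
    files: list[Node] = []
    while frontier:
        candidates = [c for node in frontier for c in node.children]
        files.extend(c for c in candidates if c.is_file)
        dirs = [c for c in candidates if not c.is_file]
        dirs.sort(key=lambda n: relevance(query, n.summary), reverse=True)
        frontier = dirs[:beam]        # prune the search space before going deeper
    files.sort(key=lambda n: relevance(query, n.summary), reverse=True)
    return [f.path for f in files[:k]]


def pass_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant file is in the top k, else 0.0 (assumed definition)."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant files that appear in the top k (assumed definition)."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Toy repository with hand-written summaries (purely illustrative).
repo = Node("repo", "web shop backend", [
    Node("src/auth", "login password session user authentication", [
        Node("src/auth/login.py", "validate user password on login form and report error"),
        Node("src/auth/session.py", "create and expire session tokens"),
    ], is_file=False),
    Node("src/billing", "invoice payment tax checkout", [
        Node("src/billing/invoice.py", "generate invoice pdf with tax totals"),
    ], is_file=False),
], is_file=False)

query = "error when I type my password on the login page"
result = top_down_search(repo, query, beam=1)
print(result)
```

With `beam=1`, only the best-matching directory is expanded, so files under `src/billing` are never scored. This is the sense in which a top-down strategy keeps each step within a limited context window, at the cost of missing relevant files in pruned directories.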