Workshops of the International Conference on Computational Science and Its Applications, ICCSA 2025, İstanbul, Türkiye, 30 June - 3 July 2025, vol. 15886 LNCS, pp. 88-105 (Full-Text Paper)
Bug localization and semantic code search within large software repositories are significant, time-consuming challenges for developers, particularly when bug reports come from end-users who lack technical expertise. Traditional similarity-based code search methods struggle with the domain and vocabulary mismatch between end-user reports and codebase semantics, while applying Large Language Models (LLMs) directly is hampered by their limited context windows and lack of repository-level understanding. To address these limitations, this paper introduces a novel, structure-aware methodology for creating repository-aware LLMs using hierarchical summarization. The approach comprises a pre-processing phase that constructs an abstract repository tree, creates a context-aware LLM primed with project knowledge, and generates hierarchical summaries at the project, directory, and file levels. The inference phase employs a top-down search strategy that guides the LLM to progressively narrow the search space from the directory level to the file level, thereby localizing bug-relevant code. This method mitigates the context window bottleneck and leverages LLMs’ semantic understanding to bridge the domain gap. Evaluated on a real-world dataset of Jira issues from a large-scale industrial project, the approach significantly outperforms both flat retrieval baselines and state-of-the-art LLM + Retrieval-Augmented Generation (RAG) systems, achieving a Pass@10 of 0.89 and a Recall@10 of 0.33. The results demonstrate the efficacy of hierarchical summarization in enabling scalable, task-agnostic, and structure-aware repository-level code comprehension for bug localization and code search, particularly for bug reports written by non-technical end-users.
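The top-down search over hierarchical summaries described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `beam` parameter, and the toy repository are hypothetical, and the `score` function is a simple word-overlap stand-in for the LLM's relevance judgment, used here only so the example runs without a model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in the abstract repository tree (hypothetical structure)."""
    path: str
    summary: str  # pre-computed summary for this project, directory, or file
    is_file: bool = False
    children: list["Node"] = field(default_factory=list)


def score(report: str, summary: str) -> float:
    """Stand-in for the LLM: fraction of report words that occur in the summary."""
    report_words = set(report.lower().split())
    summary_words = set(summary.lower().split())
    if not report_words:
        return 0.0
    return len(report_words & summary_words) / len(report_words)


def top_down_search(root: Node, report: str, beam: int = 2, k: int = 10) -> list[str]:
    """Descend level by level, keeping the `beam` best children of each node,
    and return up to k candidate file paths ranked by score."""
    frontier = [root]
    files: list[tuple[float, str]] = []
    while frontier:
        next_frontier = []
        for node in frontier:
            ranked = sorted(
                node.children,
                key=lambda child: score(report, child.summary),
                reverse=True,
            )
            for child in ranked[:beam]:
                if child.is_file:
                    files.append((score(report, child.summary), child.path))
                else:
                    next_frontier.append(child)  # narrow into this directory next
        frontier = next_frontier
    files.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in files[:k]]


# Toy repository with summaries at project, directory, and file level.
repo = Node("project", "billing and login web app", children=[
    Node("auth", "user login password session handling", children=[
        Node("auth/login.py", "checks password and starts login session", is_file=True),
        Node("auth/tokens.py", "issues and refreshes api tokens", is_file=True),
    ]),
    Node("billing", "invoice payment and tax calculation", children=[
        Node("billing/invoice.py", "builds invoice pdf totals", is_file=True),
    ]),
])

candidates = top_down_search(repo, "cannot login with correct password", beam=1)
print(candidates)  # ['auth/login.py']
```

Because only the summaries of one level are compared at a time, each step needs a small amount of text regardless of repository size, which is the intuition behind mitigating the context window bottleneck. A wider `beam` trades more comparisons for a lower chance of pruning the correct directory early.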