14th International Workshop on Computer Science and Engineering, WCSE 2024, Phuket Island, Thailand, 19-21 June 2024, pp. 290-296
Large Language Models (LLMs) have attracted significant attention for their capabilities across a wide range of applications. This study examines their potential for code generation, a task for which specialized LLMs are emerging. Evaluating these models, however, remains a challenge. This work proposes a novel approach to evaluating LLMs on code generation. We use a dataset of real-world programming tasks drawn from university student homework assignments, which offers a window into practical programming experience. We assess recent LLMs using informal natural language prompts written by non-native English speakers, reflecting the variety of user inputs encountered in practice. Our findings show that while these LLMs are promising, their success rates in solving problems on the first attempt remain modest, especially for more complex tasks. This highlights the need for continued research on fine-tuning LLMs for code generation. Overall, the study contributes an evaluation methodology based on real-world problems and natural language prompts, and offers insight into the current capabilities and limitations of LLMs in code generation.
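As an illustration of the headline metric, the first-attempt success rate per difficulty level could be computed roughly as follows. This is a minimal sketch and not the authors' evaluation code: the function name `first_attempt_success_rate`, the `(difficulty, passed)` record layout, the difficulty labels, and the sample records are all assumptions made for the example, and the numbers are placeholders rather than results from the study.

```python
from collections import defaultdict


def first_attempt_success_rate(results):
    """Return the fraction of tasks solved on the first attempt, per difficulty.

    `results` is an iterable of (difficulty, passed) pairs, where `passed`
    is True if the model's first generated solution passed the task's checks.
    This record layout is a hypothetical one chosen for illustration.
    """
    totals = defaultdict(int)
    solved = defaultdict(int)
    for difficulty, passed in results:
        totals[difficulty] += 1
        if passed:
            solved[difficulty] += 1
    # Every key in `totals` has a count of at least 1, so no division by zero.
    return {difficulty: solved[difficulty] / totals[difficulty] for difficulty in totals}


# Placeholder records for illustration only; not data from the study.
example_results = [
    ("easy", True), ("easy", True), ("easy", False), ("easy", True),
    ("hard", False), ("hard", True), ("hard", False), ("hard", False),
]

rates = first_attempt_success_rate(example_results)
for difficulty, rate in rates.items():
    print(f"{difficulty}: {rate:.0%} solved on first attempt")
```

Grouping by difficulty rather than reporting a single aggregate makes it visible when success drops on more complex tasks, which is the pattern the abstract reports.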