Computer Science > Computation and Language
Title: PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLM
(Submitted on 8 Jan 2024 (v1), last revised 26 Apr 2024 (this version, v3))
Abstract: Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, with most other concepts neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate estimates of model performance. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts with a balanced representation of 38 programming concepts across diverse difficulty levels.
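For context, benchmarks such as HumanEval and MBPP are typically scored by executing each model-generated solution against the task's unit tests and reporting a pass rate. The sketch below is only an illustration of that general scoring scheme, not the authors' evaluation harness; the function names (passes_tests, pass_at_1) and the task fields (completion, tests) are hypothetical.

    # Minimal sketch of a HumanEval/MBPP-style scoring loop (illustrative only).
    from typing import Dict, List

    def passes_tests(candidate_code: str, test_code: str) -> bool:
        """Run a generated solution together with its assert-based unit tests."""
        namespace: Dict = {}
        try:
            exec(candidate_code, namespace)  # define the candidate function
            exec(test_code, namespace)       # raise AssertionError on failure
            return True
        except Exception:
            return False

    def pass_at_1(tasks: List[Dict[str, str]]) -> float:
        """Fraction of tasks whose single generated sample passes all tests."""
        solved = sum(passes_tests(t["completion"], t["tests"]) for t in tasks)
        return solved / len(tasks) if tasks else 0.0

    # One trivially easy task, the kind the paper argues is over-represented:
    tasks = [{
        "completion": "def add(a, b):\n    return a + b",
        "tests": "assert add(2, 3) == 5",
    }]
    print(pass_at_1(tasks))  # 1.0

Because scores are simple pass rates over the included tasks, a benchmark dominated by easy prompts will yield high pass rates regardless of how well a model handles harder or less common programming concepts, which is the inflation concern raised in the abstract.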
Submission history
From: Ankit Yadav
[v1] Mon, 8 Jan 2024 12:36:43 GMT (9603kb,D)
[v2] Fri, 23 Feb 2024 04:29:06 GMT (11277kb,D)
[v3] Fri, 26 Apr 2024 04:53:51 GMT (11277kb,D)