
Title: PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLM

Abstract: Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, with most other concepts neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks, which can inflate estimates of model performance. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts with a balanced representation of 38 programming concepts across diverse difficulty levels.
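
As an illustration of the evaluation setting the abstract refers to, the minimal sketch below shows how HumanEval/MBPP-style benchmarks typically score a model-generated solution: the completion is executed against hidden unit tests, and results across samples are aggregated with the unbiased pass@k estimator of Chen et al. (2021). The prompt, tests, and function names here are illustrative placeholders, not drawn from PythonSaga itself, and a real harness would sandbox and time-limit execution rather than calling exec() directly.

    # Sketch of unit-test-based functional-correctness scoring,
    # the evaluation style used by HumanEval/MBPP-like benchmarks.
    # Candidate and test strings below are hypothetical examples.
    from math import comb


    def run_candidate(candidate_code: str, test_code: str) -> bool:
        """Return True if the generated solution passes every assertion."""
        namespace: dict = {}
        try:
            exec(candidate_code, namespace)   # define the candidate function
            exec(test_code, namespace)        # run the benchmark's assertions
            return True
        except Exception:
            return False


    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
        given c correct samples out of n generated."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)


    if __name__ == "__main__":
        candidate = "def add(a, b):\n    return a + b\n"
        tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
        print(run_candidate(candidate, tests))  # True
        print(pass_at_k(n=20, c=5, k=1))        # estimated pass@1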
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2401.03855 [cs.CL]
  (or arXiv:2401.03855v3 [cs.CL] for this version)

Submission history

From: Ankit Yadav [view email]
[v1] Mon, 8 Jan 2024 12:36:43 GMT (9603kb,D)
[v2] Fri, 23 Feb 2024 04:29:06 GMT (11277kb,D)
[v3] Fri, 26 Apr 2024 04:53:51 GMT (11277kb,D)
