Computer Science > Computation and Language
Title: PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLM
(Submitted on 8 Jan 2024 (v1), last revised 26 Apr 2024 (this version, v3))
Abstract: Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, with most other concepts neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate estimates of model performance. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts with a balanced representation of 38 programming concepts across diverse difficulty levels.
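For context, benchmarks such as HumanEval and MBPP are typically scored by executing each model-generated solution against the task's unit tests and reporting a pass rate. The sketch below is only an illustration of that general scoring scheme, not the authors' evaluation harness; the function names (passes_tests, pass_at_1) and the task fields (completion, tests) are hypothetical.

    # Minimal sketch of a HumanEval/MBPP-style scoring loop (illustrative only).
    from typing import Dict, List

    def passes_tests(candidate_code: str, test_code: str) -> bool:
        """Run a generated solution together with its assert-based unit tests."""
        namespace: Dict = {}
        try:
            exec(candidate_code, namespace)  # define the candidate function
            exec(test_code, namespace)       # raise AssertionError on failure
            return True
        except Exception:
            return False

    def pass_at_1(tasks: List[Dict[str, str]]) -> float:
        """Fraction of tasks whose single generated sample passes all tests."""
        solved = sum(passes_tests(t["completion"], t["tests"]) for t in tasks)
        return solved / len(tasks) if tasks else 0.0

    # One trivially easy task, the kind the paper argues is over-represented:
    tasks = [{
        "completion": "def add(a, b):\n    return a + b",
        "tests": "assert add(2, 3) == 5",
    }]
    print(pass_at_1(tasks))  # 1.0

Because scores are simple pass rates over the included tasks, a benchmark dominated by easy prompts will yield high pass rates regardless of how well a model handles harder or less common programming concepts, which is the inflation concern raised in the abstract.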
Submission history
From: Ankit Yadav
[v1] Mon, 8 Jan 2024 12:36:43 GMT (9603kb,D)
[v2] Fri, 23 Feb 2024 04:29:06 GMT (11277kb,D)
[v3] Fri, 26 Apr 2024 04:53:51 GMT (11277kb,D)