Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores

Virk, Yuvraj; Devanbu, Premkumar; Ahmed, Toufique

(Submitted on 30 Apr 2024)

Abstract: A good summary can often be very useful during program comprehension. While a brief, fluent, and relevant summary can be helpful, it requires significant human effort to produce. Often, good summaries are unavailable in software projects, which makes maintenance more difficult. There has been a considerable body of research into automated AI-based methods, using large language models (LLMs), to generate summaries of code; there has also been quite a bit of work on ways to measure the performance of such summarization methods, with special attention paid to how closely these AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been suggested and evaluated with human-subject studies.
However, LLMs often err and generate something quite unlike what a human might say. Given an LLM-produced code summary, is there a way to gauge whether it is likely to be sufficiently similar to a human-produced summary? In this paper, we study this question as a calibration problem: given a summary from an LLM, can we compute a confidence measure that is a good indication of whether the summary is sufficiently similar to what a human would have produced in this situation? We examine this question using several LLMs, for several languages, and in several different settings. We suggest an approach which provides well-calibrated predictions of the likelihood of similarity to human summaries.
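
To make the calibration framing concrete, here is a minimal sketch of how one might check whether confidence scores are well calibrated. It is not the paper's method, which the abstract does not specify. It assumes each summary has a confidence in [0, 1] and a similarity score against a human reference (for example BERTScore), that "sufficiently similar" means meeting a chosen threshold, and that calibration is measured with binned expected calibration error (ECE). The function names, the threshold, and the bin count are all illustrative assumptions.

```python
def label_by_threshold(similarities, threshold):
    """Mark a summary as 'sufficiently similar' (1) when its similarity
    score meets the threshold. The threshold is an assumption, not a
    value taken from the paper."""
    return [1 if s >= threshold else 0 for s in similarities]


def expected_calibration_error(confidences, labels, n_bins=10):
    """Binned ECE with equal-width bins over [0, 1].

    For each non-empty bin, compare the mean confidence with the observed
    fraction of positive labels, then average the absolute gaps weighted
    by bin size. A value of 0 means confidence matches observed frequency
    in every bin.
    """
    if len(confidences) != len(labels):
        raise ValueError("confidences and labels must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one sample is required")

    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, label))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(positive_rate - mean_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical similarity scores and confidences, for illustration only.
    similarities = [0.92, 0.88, 0.41, 0.35]
    confidences = [0.9, 0.9, 0.1, 0.1]
    labels = label_by_threshold(similarities, threshold=0.8)
    print(expected_calibration_error(confidences, labels))
```

Under these assumptions, a lower ECE means the confidence score is a better guide to how often a summary actually clears the similarity threshold.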

Subjects:	Software Engineering (cs.SE); Computation and Language (cs.CL)
Cite as:	arXiv:2404.19318 [cs.SE]
	(or arXiv:2404.19318v1 [cs.SE] for this version)

Submission history

From: Yuvraj Virk
[v1] Tue, 30 Apr 2024 07:38:08 GMT (2680kb,D)
