ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Brassard, Ana; Heinzerling, Benjamin; Kudo, Keito; Sakaguchi, Keisuke; Inui, Kentaro

(Submitted on 8 May 2024)

Abstract: Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanations. We observed that replacing one of the human ratings with an LLM rating sometimes maintained, but more often lowered, the inter-annotator agreement across different settings and quality aspects, suggesting that LLM judgments are not always consistent with those of human raters. We further quantified this difference by comparing the correlation between LLM-generated ratings and majority-voted human ratings across different quality aspects. With the best system, Spearman's rank correlation ranged from 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered the alternative of using an LLM as an additional rater when human raters are scarce, and measured how well majority-voted labels from a limited human pool plus an LLM rater correlate with the original gold labels. While GPT-4 improved the outcome when there were only two human raters, in all other observed cases, LLMs were neutral to detrimental when there were three or more human raters. We publicly release the dataset to support future improvements in LLM-in-the-loop evaluation here: this https URL
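
The comparison between LLM ratings and majority-voted human ratings mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the rating values, the 1-5 scale, the tie-breaking rule in `majority_vote`, and all function names are assumptions made for illustration only. Spearman's rank correlation is computed here as the Pearson correlation of average ranks, using only the standard library.

```python
from collections import Counter
from statistics import mean


def majority_vote(ratings):
    """Return the most common rating; ties go to the lowest value (an assumption)."""
    counts = Counter(ratings)
    top = max(counts.values())
    return min(r for r, c in counts.items() if c == top)


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up toy ratings (not taken from ACORN):
# one row per explanation, one column per hypothetical human rater.
human_ratings = [[1, 1, 2], [3, 3, 3], [2, 3, 2], [5, 4, 5], [4, 4, 3]]
llm_ratings = [1, 3, 3, 5, 4]

gold = [majority_vote(row) for row in human_ratings]
rho = spearman(gold, llm_ratings)
print(f"gold labels: {gold}, Spearman's rho: {rho:.3f}")
```

In this toy setup the same `spearman` function could be applied once per quality aspect and the results averaged, which mirrors the per-aspect range and average reported in the abstract.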

Comments:	18 pages, 7 figures, under review. Data available here: this https URL
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2405.04818 [cs.CL]
	(or arXiv:2405.04818v1 [cs.CL] for this version)

Submission history

From: Ana Brassard
[v1] Wed, 8 May 2024 05:36:52 GMT (1236kb,D)
