
Title: A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice

Authors: Juri Opitz
Abstract: Classification systems are evaluated in countless papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without justification, and blurry terminology invites misconceptions. For instance, many works use so-called 'macro' metrics to rank systems (e.g., 'macro F1') but do not clearly state what they expect from such a 'macro' metric. This is problematic, since the choice of metric can affect paper findings as well as shared-task rankings, so the selection process should be made as clear as possible.
Starting from the intuitive concepts of bias and prevalence, we analyze common evaluation metrics against the expectations expressed in papers. Equipped with a thorough understanding of the metrics, we survey metric selection in recent shared tasks in Natural Language Processing. The results show that metric choices are often not supported by convincing arguments, an issue that can make any ranking seem arbitrary. This work aims to provide an overview and guidance for more informed and transparent metric selection, fostering meaningful evaluation.
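For readers unfamiliar with the terminology, the 'macro' averaging that the abstract refers to is commonly defined as the unweighted mean of per-class scores (this is the standard textbook definition, not a formula taken from the paper itself):

\mathrm{F1}_c = \frac{2\, P_c R_c}{P_c + R_c}, \qquad \text{macro F1} = \frac{1}{|C|} \sum_{c \in C} \mathrm{F1}_c

where P_c and R_c denote the precision and recall of class c, and C is the set of classes. Because each class contributes equally to the average regardless of how frequent it is, such metrics are often chosen for imbalanced data; the abstract's concern is that this expectation is rarely made explicit when the metric is selected.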
Comments: To appear in TACL; this is a pre-MIT Press publication version.
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2404.16958 [cs.LG]
  (or arXiv:2404.16958v1 [cs.LG] for this version)

Submission history

From: Juri Opitz
[v1] Thu, 25 Apr 2024 18:12:43 GMT (404 KB)
