
Title: A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice

Authors: Juri Opitz
Abstract: Classification systems are evaluated in countless papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without justification, and blurry terminology invites misconceptions. For instance, many works use so-called 'macro' metrics to rank systems (e.g., 'macro F1') but do not clearly state what they expect from such a 'macro' metric. This is problematic, since the choice of metric can affect paper findings as well as shared-task rankings, so the selection process should be made as clear as possible.
Starting from the intuitive concepts of bias and prevalence, we analyze common evaluation metrics against the expectations expressed in papers. Equipped with a thorough understanding of the metrics, we survey metric selection in recent shared tasks in Natural Language Processing. The results show that metric choices are often not supported by convincing arguments, an issue that can make any ranking seem arbitrary. This work aims to provide an overview and guidance for more informed and transparent metric selection, fostering meaningful evaluation.
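For readers unfamiliar with the terminology, the 'macro' averaging that the abstract refers to is commonly defined as the unweighted mean of per-class scores (this is the standard textbook definition, not a formula taken from the paper itself):

\mathrm{F1}_c = \frac{2\, P_c R_c}{P_c + R_c}, \qquad \text{macro F1} = \frac{1}{|C|} \sum_{c \in C} \mathrm{F1}_c

where P_c and R_c denote the precision and recall of class c, and C is the set of classes. Because each class contributes equally to the average regardless of how frequent it is, such metrics are often chosen for imbalanced data; the abstract's concern is that this expectation is rarely made explicit when the metric is selected.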
Comments: To appear in TACL; this is a pre-MIT Press publication version.
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2404.16958 [cs.LG]
  (or arXiv:2404.16958v1 [cs.LG] for this version)

Submission history

From: Juri Opitz
[v1] Thu, 25 Apr 2024 18:12:43 GMT (404 KB)
