Computer Science > Computation and Language
Title: PrOnto: Language Model Evaluations for 859 Languages
(Submitted on 22 May 2023 (v1), last revised 28 Mar 2024 (this version, v2))
Abstract: Evaluation datasets are critical resources for measuring the quality of pretrained language models. However, due to the high cost of dataset annotation, these resources are scarce for most languages other than English, making it difficult to assess the quality of language models. In this work, we present a new method for evaluation dataset construction which enables any language with a New Testament translation to receive a suite of evaluation datasets suitable for pretrained language model evaluation. The method critically involves aligning verses with those in the New Testament portion of English OntoNotes, and then projecting annotations from English to the target language, with no manual annotation required. We apply this method to 1051 New Testament translations in 859 languages and make them publicly available. Additionally, we conduct experiments which demonstrate the efficacy of our method for creating evaluation tasks which can assess language model quality.
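The core projection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the verse-ID keys, label format, and function name are all hypothetical, and real OntoNotes annotations are token-level rather than a single verse-level label.

```python
def project_annotations(english_annotated, target_verses):
    """Project verse-level annotations from English to a target language.

    english_annotated: dict mapping verse ID -> (english_text, label)
    target_verses:     dict mapping verse ID -> target_text
    Returns (verse_id, target_text, label) tuples for verses present in
    both resources; verses without an aligned English counterpart are
    skipped. No manual annotation of the target language is needed.
    """
    projected = []
    for verse_id, target_text in target_verses.items():
        if verse_id in english_annotated:
            _, label = english_annotated[verse_id]
            projected.append((verse_id, target_text, label))
    return projected

# Toy example with made-up data (verse IDs and labels are illustrative):
english = {"JHN 3:16": ("For God so loved the world...", "NARRATIVE")}
target = {
    "JHN 3:16": "Car Dieu a tant aimé le monde...",
    "JHN 3:17": "Car Dieu n'a pas envoyé...",  # no English annotation; dropped
}
print(project_annotations(english, target))
```

Because alignment is keyed on shared verse identifiers rather than on word order, the same projection applies uniformly to any of the 1051 translations.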
Submission history
From: Luke Gessler
[v1] Mon, 22 May 2023 00:33:52 GMT (7579kb,D)
[v2] Thu, 28 Mar 2024 14:23:08 GMT (1736kb,D)