Title: Modeling Orthographic Variation in Occitan's Dialects

Authors: Zachary William Hopton (Language and Space Lab, University of Zurich), Noëmi Aepli (Department of Computational Linguistics, University of Zurich)
Abstract: Effectively normalizing textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging and Universal Dependency parsing, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
Comments: Accepted at VarDial 2024: The Eleventh Workshop on NLP for Similar Languages, Varieties and Dialects
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2404.19315 [cs.CL]
  (or arXiv:2404.19315v1 [cs.CL] for this version)

Submission history

From: Zachary William Hopton
[v1] Tue, 30 Apr 2024 07:33:51 GMT (1543kb,D)
