Title: A diverse Multilingual News Headlines Dataset from around the World

Abstract: Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide, with English translations of all articles included. Designed for natural language processing and media studies, it serves as a high-quality dataset for training or evaluating language models and offers a simple, accessible collection of articles for uses such as analyzing global news coverage and cultural narratives. As a simple demonstration of the analyses this dataset facilitates, we apply a basic procedure based on a TF-IDF weighted similarity metric to group articles into clusters about the same event. We then visualize the event signature of each event, showing which languages' articles appear over time, revealing intuitive features based on the proximity and unexpectedness of the event. The dataset is available on Kaggle and HuggingFace, with accompanying code on GitHub.
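The clustering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the whitespace tokenization, the smoothed IDF formula, the similarity threshold, and the connected-component grouping are all assumptions chosen for brevity, and the function names (`tfidf_vectors`, `cosine`, `cluster_headlines`) are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(headlines):
    """Return one L2-normalized sparse TF-IDF vector (token -> weight) per headline."""
    docs = [Counter(h.lower().split()) for h in headlines]  # naive tokenization (assumption)
    n = len(docs)
    # Document frequency: number of headlines that contain each token.
    df = Counter(tok for d in docs for tok in d)
    vectors = []
    for d in docs:
        # Smoothed IDF, ln((1 + n) / (1 + df)) + 1 (an assumed weighting scheme).
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cluster_headlines(headlines, threshold=0.3):
    """Group headline indices whose pairwise similarity reaches the threshold.

    Clusters are the connected components of the similarity graph,
    found with a small union-find structure.
    """
    vecs = tfidf_vectors(headlines)
    parent = list(range(len(headlines)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(headlines)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Called on a list of English headline strings, `cluster_headlines` returns lists of indices, one list per candidate event. The pairwise comparison is quadratic in the number of headlines, so a corpus of millions of headlines would need some form of blocking first, for example comparing only headlines published within a few days of each other.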
Comments: Published in NAACL 2024 Proceedings (Short Paper track)
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2403.19352 [cs.CL]
  (or arXiv:2403.19352v1 [cs.CL] for this version)

Submission history

From: Felix Leeb
[v1] Thu, 28 Mar 2024 12:08:39 GMT (2795kb,D)
