Computer Science > Computation and Language

Title: TextGram: Towards a better domain-adaptive pretraining

Abstract: For green AI, it is crucial to measure and reduce the carbon footprint of training large language models. In NLP, pre-training Transformer models requires significant computational resources: the model consumes large amounts of text to build prior knowledge for downstream tasks. It is therefore important to select the right domain-specific data from this vast corpus so that the results align with our domain-specific tasks. Although training on large unsupervised corpora is expensive, it can be optimized by performing a data selection step before pre-training. Selecting the most relevant data reduces both the storage overhead and the substantial time required to pre-train the model, without sacrificing accuracy. We investigate existing selection strategies and propose our own domain-adaptive data selection method, TextGram, which effectively selects essential data from large corpora. We compare and evaluate the results of fine-tuned models for the text classification task with and without data selection, and show that the proposed strategy outperforms other selection methods.
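The abstract describes selecting domain-relevant documents from a large corpus before pre-training, but this page does not spell out how TextGram scores documents. The sketch below is only an illustration of the general idea, assuming a simple n-gram-overlap score against a small target-domain sample; the function names (ngram_profile, overlap_score, select_for_pretraining) and the keep_ratio parameter are hypothetical and not taken from the paper.

```python
# Minimal sketch of n-gram-based data selection before pre-training.
# NOT the paper's TextGram algorithm; a generic overlap heuristic for illustration.
from collections import Counter


def ngram_profile(texts, n=3):
    """Build a character n-gram frequency profile from a small domain sample."""
    profile = Counter()
    for text in texts:
        for tok in text.lower().split():
            padded = f" {tok} "  # pad so short tokens still yield n-grams
            profile.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return profile


def overlap_score(text, profile, n=3):
    """Fraction of the document's n-gram mass that also appears in the domain profile."""
    grams = Counter()
    for tok in text.lower().split():
        padded = f" {tok} "
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    if not grams:
        return 0.0
    hits = sum(count for gram, count in grams.items() if gram in profile)
    return hits / sum(grams.values())


def select_for_pretraining(candidates, domain_texts, keep_ratio=0.2):
    """Keep the top fraction of the large corpus most similar to the domain sample."""
    profile = ngram_profile(domain_texts)
    ranked = sorted(candidates, key=lambda doc: overlap_score(doc, profile), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]


if __name__ == "__main__":
    domain = ["the model classifies clinical notes",
              "patient symptoms and diagnosis text"]
    corpus = ["stock prices fell sharply today",
              "the patient reported new symptoms after diagnosis",
              "recipe for a quick pasta dinner"]
    # Keeps the document(s) whose n-grams best match the clinical domain sample.
    print(select_for_pretraining(corpus, domain, keep_ratio=0.34))
```

The point of such a step, as the abstract argues, is that pre-training then runs on a much smaller, domain-aligned subset, cutting storage and compute while preserving downstream accuracy.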
Comments: Accepted at SPELLL 2023
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI: 10.1007/978-3-031-58495-4_12
Cite as: arXiv:2404.18228 [cs.CL]
  (or arXiv:2404.18228v1 [cs.CL] for this version)

Submission history

From: Raviraj Joshi
[v1] Sun, 28 Apr 2024 15:44:57 GMT (917kb,D)
