ChatGPT "contamination": estimating the prevalence of LLMs in the scholarly literature

Gray, Andrew

(Submitted on 25 Mar 2024)

Abstract: The use of ChatGPT and similar Large Language Model (LLM) tools in scholarly communication and academic publishing has been widely discussed since they became easily accessible to a general audience in late 2022. This study uses keywords known to be disproportionately present in LLM-generated text to provide an overall estimate for the prevalence of LLM-assisted writing in the scholarly literature. For the publishing year 2023, it is found that several of those keywords show a distinctive and disproportionate increase in their prevalence, individually and in combination. It is estimated that at least 60,000 papers (slightly over 1% of all articles) were LLM-assisted, though this number could be extended and refined by analysis of other characteristics of the papers or by identification of further indicative keywords.
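The keyword-based estimate described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual procedure: it assumes a single keyword, a baseline taken as the mean yearly share of papers containing it in the years before LLM tools were widely available, and an "excess" defined as the observed count minus the count expected at that baseline share. The function name `excess_papers` and all numbers below are hypothetical placeholders invented for illustration, not data from the study.

```python
from statistics import mean


def excess_papers(keyword_counts, total_counts, baseline_years, target_year):
    """Estimate papers using a keyword beyond what the baseline rate predicts.

    keyword_counts / total_counts: dicts mapping year -> number of papers.
    Hypothetical helper; the baseline is a simple mean of yearly shares.
    """
    # Share of papers containing the keyword in each baseline year
    baseline_share = mean(
        keyword_counts[y] / total_counts[y] for y in baseline_years
    )
    # Papers we would expect in the target year at the baseline share
    expected = baseline_share * total_counts[target_year]
    observed = keyword_counts[target_year]
    # Clamp at zero: a drop below baseline is not evidence of LLM use
    return max(0.0, observed - expected)


# Placeholder figures, invented purely for illustration
keyword_counts = {2020: 1000, 2021: 1100, 2022: 1200, 2023: 3000}
total_counts = {2020: 100_000, 2021: 110_000, 2022: 120_000, 2023: 125_000}

excess = excess_papers(keyword_counts, total_counts, [2020, 2021, 2022], 2023)
share = excess / total_counts[2023]
print(f"excess papers: {excess:.0f} ({share:.1%} of the 2023 total)")
```

An excess computed this way is a lower bound in the same spirit as the abstract's "at least": it only counts papers that use the chosen keyword, so LLM-assisted papers that avoid it are missed.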

Comments:	12 pages, 6 figures
Subjects:	Digital Libraries (cs.DL)
Cite as:	arXiv:2403.16887 [cs.DL]
	(or arXiv:2403.16887v1 [cs.DL] for this version)

Submission history

From: Andrew Gray
[v1] Mon, 25 Mar 2024 15:56:37 GMT (228kb)
