Investigating the Emergent Audio Classification Ability of ASR Foundation Models

Ma, Rao; Liusie, Adian; Gales, Mark J. F.; Knill, Kate M.

(Submitted on 15 Nov 2023 (v1), last revised 28 Mar 2024 (this version, v2))

Abstract: Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings. There has been far less work, however, on the zero-shot abilities of ASR foundation models, with these systems typically fine-tuned to specific tasks or constrained to applications that match their training criterion and data annotation. In this work we investigate the ability of Whisper and MMS, ASR foundation models trained primarily for speech recognition, to perform zero-shot audio classification. We use simple template-based text prompts at the decoder and use the resulting decoding probabilities to generate zero-shot predictions. Without training the model on extra data or adding any new parameters, we demonstrate that Whisper shows promising zero-shot classification performance on a range of 8 audio-classification datasets, outperforming the accuracy of existing state-of-the-art zero-shot baselines by an average of 9%. One important step to unlock the emergent ability is debiasing, where a simple unsupervised reweighting method of the class probabilities yields consistent significant performance gains. We further show that performance increases with model size, implying that as ASR foundation models scale up, they may exhibit improved zero-shot performance.
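The abstract describes scoring each class through a text prompt at the decoder and then reweighting the class probabilities without labels. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the per-class log-likelihoods have already been obtained from the ASR decoder and are passed in as an array; the function name `zero_shot_predict` and the particular reweighting (dividing each class probability by its average over the unlabeled set) are illustrative assumptions, and the paper's exact debiasing method may differ.

```python
import numpy as np


def zero_shot_predict(log_probs, debias=True):
    """Pick a class for each audio clip from prompt log-likelihoods.

    log_probs: array of shape (n_clips, n_classes); entry [i, k] is assumed
    to be the decoder log-probability of the class-k prompt given clip i.
    (Hypothetical input; how these scores are extracted is not shown here.)
    """
    log_probs = np.asarray(log_probs, dtype=float)

    # Normalise the scores over classes (numerically stable softmax).
    shifted = log_probs - log_probs.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    if debias:
        # Assumed unsupervised reweighting: estimate how strongly the model
        # favours each class on average over the unlabeled clips, then divide
        # that preference out so no class dominates purely through bias.
        class_bias = probs.mean(axis=0)
        probs = probs / class_bias
        probs /= probs.sum(axis=1, keepdims=True)

    return probs.argmax(axis=1)
```

Calling `zero_shot_predict(scores, debias=False)` returns the plain argmax over prompt probabilities, while `debias=True` first removes the average per-class preference. The reweighting needs only the unlabeled evaluation clips themselves, which matches the abstract's description of the step as unsupervised.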

Comments:	NAACL 2024 (main conference)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2311.09363 [cs.CL]
	(or arXiv:2311.09363v2 [cs.CL] for this version)

Submission history

From: Rao Ma
[v1] Wed, 15 Nov 2023 20:52:56 GMT (9185kb,D)
[v2] Thu, 28 Mar 2024 16:31:26 GMT (9335kb,D)
