Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

Wu, Zelin; Song, Gan; Li, Christopher; Rondon, Pat; Meng, Zhong; Velez, Xavier; Wang, Weiran; Caseiro, Diamantino; Pundak, Golan; Munkhdalai, Tsendsuren; Chandorkar, Angad; Prabhavalkar, Rohit

Full-text links:

Download:

Current browse context:

cs.CL

< prev | next >

new | recent | 2404

Computer Science > Computation and Language

Title: Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

Authors: Zelin Wu, Gan Song, Christopher Li, Pat Rondon, Zhong Meng, Xavier Velez, Weiran Wang, Diamantino Caseiro, Golan Pundak, Tsendsuren Munkhdalai, Angad Chandorkar, Rohit Prabhavalkar

(Submitted on 15 Apr 2024 (v1), last revised 23 Apr 2024 (this version, v2))

Abstract: Contextual biasing enables speech recognizers to transcribe important phrases in the speaker's context, such as contact names, even if they are rare in, or absent from, the training data. Attention-based biasing is a leading approach which allows for full end-to-end cotraining of the recognizer and biasing system and requires no separate inference-time components. Such biasers typically consist of a context encoder; followed by a context filter which narrows down the context to apply, improving per-step inference time; and, finally, context application via cross attention. Though much work has gone into optimizing per-frame performance, the context encoder is at least as important: recognition cannot begin before context encoding ends. Here, we show the lightweight phrase selection pass can be moved before context encoding, resulting in a speedup of up to 16.1 times and enabling biasing to scale to 20K phrases with a maximum pre-decoding delay under 33ms. With the addition of phrase- and wordpiece-level cross-entropy losses, our technique also achieves up to a 37.5% relative WER reduction over the baseline without the losses and lightweight phrase selection pass.

Comments:	9 pages, 3 figures, accepted by NAACL 2024 - Industry Track
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
Journal reference:	2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics - Industry Track
Cite as:	arXiv:2404.10180 [cs.CL]
	(or arXiv:2404.10180v2 [cs.CL] for this version)

Submission history

From: Zhong Meng [view email]
[v1] Mon, 15 Apr 2024 23:28:13 GMT (86kb,D)
[v2] Tue, 23 Apr 2024 13:43:26 GMT (81kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2404.10180

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Computation and Language

Title: Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

Submission history