FlashSpeech: Efficient Zero-Shot Speech Synthesis

Ye, Zhen; Ju, Zeqian; Liu, Haohe; Tan, Xu; Chen, Jianyi; Lu, Yiwen; Sun, Peiwen; Pan, Jiahao; Bian, Weizhen; He, Shulin; Liu, Qifeng; Guo, Yike; Xue, Wei

(Submitted on 23 Apr 2024 (v1), last revised 25 Apr 2024 (this version, v3))

Abstract: Large-scale zero-shot speech synthesis has advanced significantly through language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis that uses a lower computing budget while achieving quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system that requires approximately 5% of the inference time of previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. FlashSpeech generates speech efficiently in one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in this https URL

Comments:	Efficient zero-shot speech synthesis
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2404.14700 [eess.AS]
	(or arXiv:2404.14700v3 [eess.AS] for this version)

Submission history

From: Zhen Ye
[v1] Tue, 23 Apr 2024 02:57:46 GMT (2418kb,D)
[v2] Wed, 24 Apr 2024 07:18:50 GMT (1921kb,D)
[v3] Thu, 25 Apr 2024 03:38:46 GMT (1921kb,D)
