PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion

Qi, Tianhua; Zheng, Wenming; Lu, Cheng; Zong, Yuan; Lian, Hailun

Electrical Engineering and Systems Science > Audio and Speech Processing

(Submitted on 3 Mar 2024)

Abstract: In this paper, we propose Prosody-aware VITS (PAVITS) for emotional voice conversion (EVC), aiming to achieve two major objectives of EVC: high content naturalness and high emotional naturalness, which are crucial for meeting the demands of human perception. To improve the content naturalness of converted audio, we have developed an end-to-end EVC architecture inspired by the high audio quality of VITS. By seamlessly integrating an acoustic converter and vocoder, we effectively address the mismatch between emotional prosody training and run-time conversion that is prevalent in existing EVC models. To further enhance emotional naturalness, we introduce an emotion descriptor to model the subtle prosody variations of different speech emotions. Additionally, we propose a prosody predictor, which predicts prosody features from text based on the provided emotion label. Notably, we introduce a prosody alignment loss to establish a connection between latent prosody features from two distinct modalities, ensuring effective training. Experimental results show that the performance of PAVITS is superior to that of state-of-the-art EVC methods. Speech samples are available at this https URL.

Comments:	Accepted to ICASSP 2024
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
Cite as:	arXiv:2403.01494 [eess.AS]
	(or arXiv:2403.01494v1 [eess.AS] for this version)

Submission history

From: Tianhua Qi
[v1] Sun, 3 Mar 2024 12:07:19 GMT (1735kb,D)