Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

Deng, Yimin; Wang, Jianzong; Zhang, Xulong; Cheng, Ning; Xiao, Jing



(Submitted on 1 May 2024)

Abstract: Voice conversion is the task of transforming the voice characteristics of source speech while preserving its content information. Self-supervised representation learning models are increasingly used for content extraction. However, these representations retain considerable speaker information, which leads to timbre leakage, while the prosodic information in the hidden units goes underused. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC", based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder that extracts content and prosody features separately. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. The prosody is then implicitly modeled on soft speech units with knowledge distillation. Experimental results show that the converted speech outperforms previous work in intelligibility and naturalness.
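
The statistic perturbation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the perturbation acts on the per-channel mean and standard deviation of a feature sequence, and the helper names (channel_stats, perturb_statistics) and the additive shift parameters are hypothetical. In an adversarial setup the shifts would presumably be chosen to maximize a training loss; that step is omitted here and the shifts are simply passed in.

```python
import math


def channel_stats(frames, eps=1e-5):
    """Per-channel mean and standard deviation over the time axis."""
    n_frames = len(frames)
    n_channels = len(frames[0])
    means = [sum(f[c] for f in frames) / n_frames for c in range(n_channels)]
    stds = [
        math.sqrt(sum((f[c] - means[c]) ** 2 for f in frames) / n_frames + eps)
        for c in range(n_channels)
    ]
    return means, stds


def perturb_statistics(frames, mean_shift, std_shift, eps=1e-5):
    """Normalize each channel, then restore it with shifted statistics.

    frames:     list of T frames, each a list of C floats (hypothetical layout)
    mean_shift: C additive offsets applied to the channel means (assumption)
    std_shift:  C additive offsets applied to the channel stds (assumption)
    """
    means, stds = channel_stats(frames, eps)
    perturbed = []
    for frame in frames:
        perturbed.append([
            (stds[c] + std_shift[c]) * (frame[c] - means[c]) / stds[c]
            + (means[c] + mean_shift[c])
            for c in range(len(frame))
        ])
    return perturbed


# Usage: three frames with two channels each.
features = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
augmented = perturb_statistics(features, mean_shift=[0.5, -1.0], std_shift=[0.2, 0.3])
```

With all shifts set to zero the function returns the input unchanged (up to floating-point error), so the perturbation alters only the channel statistics and leaves the normalized temporal pattern of each channel intact.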

Comments:	Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2405.00603 [cs.SD]
	(or arXiv:2405.00603v1 [cs.SD] for this version)

Submission history

From: Yimin Deng
[v1] Wed, 1 May 2024 16:14:22 GMT (15610kb,D)
