Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy

Wu, Wenxuan; Chen, Xueyuan; Wu, Xixin; Li, Haizhou; Meng, Helen

(Submitted on 24 Mar 2024)

Abstract: Audio-visual target speech extraction (AV-TSE) is an enabling technology for robotics and many audio-visual applications. One challenge in AV-TSE is how to effectively exploit audio-visual synchronization information. AV-HuBERT, a pre-trained model useful for lip-reading, has not yet been adopted for AV-TSE. In this paper, we explore how to integrate a pre-trained AV-HuBERT into our AV-TSE system, with good reason to expect improved performance. To benefit from inter- and intra-modality correlations, we also propose a novel Mask-And-Recover (MAR) strategy for self-supervised learning. Experimental results on the VoxCeleb2 dataset show that our proposed model outperforms the baselines in terms of both subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction. Furthermore, a comparative study confirms that the proposed Mask-And-Recover strategy is significantly effective.

Comments:	Accepted by IJCNN 2024
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2403.16078 [cs.SD]
	(or arXiv:2403.16078v1 [cs.SD] for this version)

Submission history

From: Wenxuan Wu
[v1] Sun, 24 Mar 2024 09:42:05 GMT (3234kb,D)
