Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Tao, Ruijie; Qian, Xinyuan; Jiang, Yidi; Li, Junjie; Wang, Jiadong; Li, Haizhou

(Submitted on 29 Apr 2024 (v1), last revised 8 May 2024 (this version, v2))

Abstract: Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from an audio mixture, given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of the target speech while ignoring variations in the noise characteristics, which may result in extracting noisy signals from the incorrect sound source in challenging acoustic situations. To this end, we propose a novel reverse selective auditory attention mechanism, which suppresses interfering speakers and non-speech signals to avoid incorrect speaker extraction. By estimating and utilizing the undesired noisy signal through this mechanism, we design an AV-TSE framework named Subtraction-and-ExtrAction network (SEANet) to suppress the noisy signals. We conduct extensive experiments, re-implementing three popular AV-TSE methods as baselines and using nine metrics for evaluation. The experimental results show that the proposed SEANet achieves state-of-the-art results and performs well on all five datasets. We will release the code, the models, and data logs.

Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2404.18501 [eess.AS]
	(or arXiv:2404.18501v2 [eess.AS] for this version)

Submission history

From: Ruijie Tao [view email]
[v1] Mon, 29 Apr 2024 08:43:57 GMT (4087kb,D)
[v2] Wed, 8 May 2024 08:05:22 GMT (4087kb,D)
