Audio and Speech Processing

Authors and titles for eess.AS in Mar 2024

[ total of 213 entries: 1-213 ]
[1]  arXiv:2403.00293 [pdf, other]
Title: Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification
Comments: Accepted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[2]  arXiv:2403.00379 [pdf, other]
Title: The Impact of Frequency Bands on Acoustic Anomaly Detection of Machines using Deep Learning Based Model
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[3]  arXiv:2403.00887 [pdf, other]
Title: SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[4]  arXiv:2403.01130 [pdf, other]
Title: Arbitrary Discrete Fourier Analysis and Its Application in Replayed Speech Detection
Authors: Shih-Kuang Lee
Comments: this https URL
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[5]  arXiv:2403.01355 [pdf, ps, other]
Title: a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification
Comments: 8 pages, submitted to Speaker Odyssey 2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
[6]  arXiv:2403.01369 [pdf, other]
Title: A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement
Comments: 8 pages; Shorter form accepted in ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[7]  arXiv:2403.01494 [pdf, other]
Title: PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion
Comments: Accepted to ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[8]  arXiv:2403.01670 [pdf, other]
Title: 6DoF SELD: Sound Event Localization and Detection Using Microphones and Motion Tracking Sensors on self-motioning human
Comments: ICASSP2024 accepted
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[9]  arXiv:2403.02167 [pdf, other]
Title: Speech emotion recognition from voice messages recorded in the wild
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[10]  arXiv:2403.02288 [pdf, other]
Title: PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings
Comments: submitted to Speaker Odyssey 2024
Subjects: Audio and Speech Processing (eess.AS)
[11]  arXiv:2403.02371 [pdf, other]
Title: NeuroVoz: a Castillian Spanish corpus of parkinsonian speech
Comments: Preprint version
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[12]  arXiv:2403.03100 [pdf, other]
Title: NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Comments: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot way
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[13]  arXiv:2403.03611 [pdf, ps, other]
Title: Comparison Performance of Spectrogram and Scalogram as Input of Acoustic Recognition Task
Authors: Dang Thoai Phan
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[14]  arXiv:2403.04433 [pdf, ps, other]
Title: On the Use of Autoregressive Methods for Audio Inpainting
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[15]  arXiv:2403.04743 [pdf, other]
Title: Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism
Subjects: Audio and Speech Processing (eess.AS)
[16]  arXiv:2403.04800 [pdf, other]
Title: (Un)paired signal-to-signal translation with 1D conditional GANs
Authors: Eric Easthope
Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
[17]  arXiv:2403.04804 [pdf, other]
Title: AttentionStitch: How Attention Solves the Speech Editing Problem
Comments: Accepted in Machine Learning for Audio workshop in NeurIPS 2023
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
[18]  arXiv:2403.05187 [pdf, other]
Title: Robust Semantic Communications for Speech Transmission
Subjects: Audio and Speech Processing (eess.AS)
[19]  arXiv:2403.05393 [pdf, other]
Title: Binaural Speech Enhancement Using Deep Complex Convolutional Transformer Networks
Comments: Accepted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS)
[20]  arXiv:2403.05791 [pdf, other]
Title: Asynchronous Microphone Array Calibration using Hybrid TDOA Information
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[21]  arXiv:2403.05887 [pdf, other]
Title: Aligning Speech to Languages to Enhance Code-switching Speech Recognition
Comments: Manuscript submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects: Audio and Speech Processing (eess.AS)
[22]  arXiv:2403.06847 [pdf, other]
Title: SonoTraceLab -- A Raytracing-Based Acoustic Modelling System for Simulating Echolocation Behavior of Bats
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[23]  arXiv:2403.06856 [pdf, other]
Title: Concurrent Speaker Detection: A multi-microphone Transformer-Based Approach
Comments: 5 pages, 6 tables, 2 figures
Subjects: Audio and Speech Processing (eess.AS)
[24]  arXiv:2403.07579 [pdf, other]
Title: On HRTF Notch Frequency Prediction Using Anthropometric Features and Neural Networks
Subjects: Audio and Speech Processing (eess.AS)
[25]  arXiv:2403.07661 [pdf, other]
Title: Gender-ambiguous voice generation through feminine speaking style transfer in male voices
Comments: submitted to Interspeech
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[26]  arXiv:2403.07767 [pdf, ps, other]
Title: Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Signal Processing (eess.SP)
[27]  arXiv:2403.07937 [pdf, other]
Title: Speech Robust Bench: A Robustness Benchmark For Speech Recognition
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[28]  arXiv:2403.07947 [pdf, ps, other]
Title: The evaluation of a code-switched Sepedi-English automatic speech recognition system
Comments: 13 pages, 2 figures, 2nd International Conference on NLP & AI (NLPAI 2024)
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
[29]  arXiv:2403.08654 [pdf, other]
Title: An Efficient End-to-End Approach to Noise Invariant Speech Features via Multi-Task Learning
Comments: Under review on IEEE Transactions on Audio, Speech, and Language Processing (2024)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[30]  arXiv:2403.09524 [pdf, other]
Title: Physics-Informed Neural Network for Volumetric Sound field Reconstruction of Speech Signals
Subjects: Audio and Speech Processing (eess.AS)
[31]  arXiv:2403.09527 [pdf, other]
Title: WavCraft: Audio Editing and Generation with Large Language Models
Subjects: Audio and Speech Processing (eess.AS)
[32]  arXiv:2403.09789 [pdf, other]
Title: Audiosockets: A Python socket package for Real-Time Audio Processing
Comments: 4 pages, 2 figures
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[33]  arXiv:2403.10271 [pdf, other]
Title: SuperME: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR
Authors: Zhong-Qiu Wang
Comments: in submission
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[34]  arXiv:2403.10420 [pdf, other]
Title: Neural Networks Hear You Loud And Clear: Hearing Loss Compensation Using Deep Neural Networks
Subjects: Audio and Speech Processing (eess.AS)
[35]  arXiv:2403.10428 [pdf, other]
Title: How to train your ears: Auditory-model emulation for large-dynamic-range inputs and mild-to-severe hearing losses
Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. This version is the authors' version and may vary from the final publication in details
Subjects: Audio and Speech Processing (eess.AS)
[36]  arXiv:2403.10548 [pdf, other]
Title: Two-sided Acoustic Metascreen for Broadband and Individual Reflection and Transmission Control
Authors: Ao Chen, Xin Zhang
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[37]  arXiv:2403.10565 [pdf, other]
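Title: PTSD-MDNN : Fusion tardive de réseaux de neurones profonds multimodaux pour la détection du trouble de stress post-traumatique [PTSD-MDNN: Late fusion of multimodal deep neural networks for the detection of post-traumatic stress disorder]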
Comments: in French language. GRETSI 2023
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
[38]  arXiv:2403.10756 [pdf, other]
Title: Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval
Comments: Submitted to EUSIPCO2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[39]  arXiv:2403.10937 [pdf, other]
Title: Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR
Comments: 14 pages, 7 figures, Accepted in Sadhana Journal
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
[40]  arXiv:2403.11037 [pdf, other]
Title: Fine-Grained Engine Fault Sound Event Detection Using Multimodal Signals
Comments: Accepted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[41]  arXiv:2403.11508 [pdf, other]
Title: Discriminative Neighborhood Smoothing for Generative Anomalous Sound Detection
Comments: Submitted to EUSIPCO 2024
Subjects: Audio and Speech Processing (eess.AS)
[42]  arXiv:2403.11578 [pdf, other]
Title: AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition
Subjects: Audio and Speech Processing (eess.AS)
[43]  arXiv:2403.12182 [pdf, other]
Title: Latent CLAP Loss for Better Foley Sound Synthesis
Subjects: Audio and Speech Processing (eess.AS)
[44]  arXiv:2403.12258 [pdf, other]
Title: A Multi-loudspeaker Binaural Room Impulse Response Dataset with High-Resolution Translational and Rotational Head Coordinates in a Listening Room
Comments: Submitted to Frontiers in Signal Processing
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[45]  arXiv:2403.12630 [pdf, other]
Title: Reproducing the Acoustic Velocity Vectors in a Circular Listening Area
Comments: Submitted to EUSIPCO 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[46]  arXiv:2403.13332 [pdf, other]
Title: TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer
Comments: Accepted by ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[47]  arXiv:2403.13356 [pdf, other]
Title: KunquDB: An Attempt for Speaker Verification in the Chinese Opera Scenario
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Image and Video Processing (eess.IV)
[48]  arXiv:2403.13465 [pdf, other]
Title: BanglaNum -- A Public Dataset for Bengali Digit Recognition from Speech
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[49]  arXiv:2403.13643 [pdf, ps, other]
Title: Vibration Sensitivity of one-port and two-port MEMS microphones
Comments: 8 pages, 14 figures
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[50]  arXiv:2403.14179 [pdf, ps, other]
Title: AdaProj: Adaptively Scaled Angular Margin Subspace Projections for Anomalous Sound Detection with Auxiliary Classification Tasks
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[51]  arXiv:2403.14246 [pdf, other]
Title: CATSE: A Context-Aware Framework for Causal Target Sound Extraction
Comments: Submitted to EUSIPCO 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
[52]  arXiv:2403.14268 [pdf, ps, other]
Title: Speech-Aware Neural Diarization with Encoder-Decoder Attractor Guided by Attention Constraints
Comments: Accepted to The 28th International Conference on Technologies and Applications of Artificial Intelligence (TAAI), in Chinese language
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[53]  arXiv:2403.14817 [pdf, other]
Title: Crowdsourced Multilingual Speech Intelligibility Testing
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
[54]  arXiv:2403.15336 [pdf, other]
Title: Dialogue Understandability: Why are we streaming movies with subtitles?
Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM)
[55]  arXiv:2403.15442 [pdf, other]
Title: Advanced Artificial Intelligence Algorithms in Cochlear Implants: Review of Healthcare Strategies, Challenges, and Perspectives
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[56]  arXiv:2403.16610 [pdf, ps, other]
Title: Distributed collaborative anomalous sound detection by embedding sharing
Subjects: Audio and Speech Processing (eess.AS); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Sound (cs.SD)
[57]  arXiv:2403.16973 [pdf, other]
Title: VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
Comments: Data, code, and model weights are available at this https URL
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[58]  arXiv:2403.17402 [pdf, other]
Title: Infrastructure-less Localization from Indoor Environmental Sounds Based on Spectral Decomposition and Spatial Likelihood Model
Comments: 6 pages, 6 figures, accepted to IEEE/SICE SII 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[59]  arXiv:2403.17514 [pdf, other]
Title: Speaker Distance Estimation in Enclosures from Single-Channel Audio
Comments: Accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[60]  arXiv:2403.17864 [pdf, other]
Title: Synthesizing Soundscapes: Leveraging Text-to-Audio Models for Environmental Sound Classification
Comments: Submitted to EUSIPCO 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[61]  arXiv:2403.18257 [pdf, other]
Title: Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation
Comments: work in progress
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[62]  arXiv:2403.18560 [pdf, other]
Title: Noise-Robust Keyword Spotting through Self-supervised Pretraining
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[63]  arXiv:2403.18636 [pdf, other]
Title: A Diffusion-Based Generative Equalizer for Music Restoration
Comments: Submitted to DAFx24. Historical music restoration examples are available at: this http URL
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[64]  arXiv:2403.18638 [pdf, other]
Title: Mind the Domain Gap: a Systematic Analysis on Bioacoustic Sound Event Detection
Subjects: Audio and Speech Processing (eess.AS)
[65]  arXiv:2403.19207 [pdf, other]
Title: LV-CTC: Non-autoregressive ASR with CTC and latent variable models
Subjects: Audio and Speech Processing (eess.AS)
[66]  arXiv:2403.19217 [pdf, other]
Title: Blind Identification of Binaural Room Impulse Responses from Smart Glasses
Subjects: Audio and Speech Processing (eess.AS)
[67]  arXiv:2403.19709 [pdf, other]
Title: Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models
Comments: 5 pages, 3 figures, 5 tables
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
[68]  arXiv:2403.19971 [pdf, other]
Title: 3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[69]  arXiv:2403.20090 [pdf, other]
Title: Non-Exponential Reverberation Modeling Using Dark Velvet Noise
Comments: Accepted for publication in the Journal of Audio Engineering Society
Subjects: Audio and Speech Processing (eess.AS)
[70]  arXiv:2403.20184 [pdf, other]
Title: Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context
Comments: Accepted at LREC-COLING 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[71]  arXiv:2403.03762 (cross-list from eess.SP) [pdf, ps, other]
Title: Room Impulse Response Estimation using Optimal Transport: Simulation-Informed Inference
Subjects: Signal Processing (eess.SP); Audio and Speech Processing (eess.AS)
[72]  arXiv:2403.10329 (cross-list from eess.SP) [pdf, ps, other]
Title: Multi-Source Localization and Data Association for Time-Difference of Arrival Measurements
Subjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS); Optimization and Control (math.OC)
[73]  arXiv:2403.00212 (cross-list from cs.CL) [pdf, other]
Title: Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[74]  arXiv:2403.00274 (cross-list from cs.CV) [pdf, other]
Title: CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation
Comments: Accepted by CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[75]  arXiv:2403.00370 (cross-list from cs.CL) [pdf, other]
Title: Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[76]  arXiv:2403.00529 (cross-list from cs.SD) [pdf, other]
Title: VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis
Comments: preprint
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[77]  arXiv:2403.00790 (cross-list from cs.SD) [pdf, ps, other]
Title: Structuring Concept Space with the Musical Circle of Fifths by Utilizing Music Grammar Based Activations
Authors: Tofara Moyo
Comments: 3 pages
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[78]  arXiv:2403.00854 (cross-list from q-bio.NC) [pdf, other]
Title: Speaker-Independent Dysarthria Severity Classification using Self-Supervised Transformers and Multi-Task Learning
Comments: 17 pages, 2 tables, 4 main figures, 2 supplemental figures, prepared for journal submission
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[79]  arXiv:2403.00977 (cross-list from cs.SD) [pdf, other]
Title: Scaling Up Adaptive Filter Optimizers
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[80]  arXiv:2403.01087 (cross-list from cs.MM) [pdf, other]
Title: Towards Accurate Lip-to-Speech Synthesis in-the-Wild
Comments: 8 pages of content, 1 page of references and 4 figures
Journal-ref: In Proceedings of the 31st ACM International Conference on Multimedia, 2023
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[81]  arXiv:2403.01132 (cross-list from cs.LG) [pdf, ps, other]
Title: MPIPN: A Multi Physics-Informed PointNet for solving parametric acoustic-structure systems
Comments: The number of figures is 16. The number of tables is 5. The number of words is 9717
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[82]  arXiv:2403.01255 (cross-list from cs.SD) [pdf, other]
Title: Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey
Journal-ref: Information Fusion, Elsevier, 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[83]  arXiv:2403.01278 (cross-list from cs.SD) [pdf, other]
Title: Enhancing Audio Generation Diversity with Visual Information
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[84]  arXiv:2403.01699 (cross-list from cs.CL) [pdf, other]
Title: Brilla AI: AI Contestant for the National Science and Maths Quiz
Comments: 14 pages. Accepted for the WideAIED track at the 25th International Conference on AI in Education (AIED 2024)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[85]  arXiv:2403.01700 (cross-list from cs.SD) [pdf, other]
Title: Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer
Comments: Accepted by ICASSP 2024
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[86]  arXiv:2403.01785 (cross-list from cs.SD) [pdf, other]
Title: What do neural networks listen to? Exploring the crucial bands in Speech Enhancement using Sinc-convolution
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[87]  arXiv:2403.01792 (cross-list from cs.SD) [pdf, other]
Title: ConSep: a Noise- and Reverberation-Robust Speech Separation Framework by Magnitude Conditioning
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[88]  arXiv:2403.01960 (cross-list from cs.SD) [pdf, other]
Title: A robust audio deepfake detection system via multi-view feature
Comments: 5 pages, 2 figures
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[89]  arXiv:2403.02002 (cross-list from cs.SD) [pdf, other]
Title: Fine-Grained Quantitative Emotion Editing for Speech Generation
Comments: This paper is submitted to IEEE Signal Processing Letters
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[90]  arXiv:2403.02010 (cross-list from cs.SD) [pdf, other]
Title: SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[91]  arXiv:2403.02687 (cross-list from cs.HC) [src]
Title: Enhanced DareFightingICE Competitions: Sound Design and AI Competitions
Comments: This paper describes a new competition platform using Unity for our competitions at the 2024 IEEE Conference on Games (CoG 2024). It was accepted for presentation at CoG 2024. However, we recently discovered a much more effective way to do this task without using Unity, leading to our decision to withdraw the paper from CoG 2024 and ArXiv
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[92]  arXiv:2403.02701 (cross-list from cs.SD) [pdf, other]
Title: Fighting Game Adaptive Background Music for Improved Gameplay
Comments: This is an updated version of our IEEE CoG 2023 paper (this https URL). This version has revised the description of the association between the distance between the two players (PD) and the instrument's volume on page 2. arXiv admin note: substantial text overlap with arXiv:2303.15734
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[93]  arXiv:2403.02918 (cross-list from cs.RO) [pdf, other]
Title: Single-Channel Robot Ego-Speech Filtering during Human-Robot Interaction
Comments: Accepted by ACM Technological Advances in Human-Robot Interaction. 9 pages
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[94]  arXiv:2403.02938 (cross-list from cs.CL) [pdf, other]
Title: AIx Speed: Playback Speed Optimization Using Listening Comprehension of Speech Recognition Models
Journal-ref: AHs '23: Proceedings of the Augmented Humans International Conference 2023
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[95]  arXiv:2403.03095 (cross-list from cs.CV) [pdf, other]
Title: Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization
Comments: Accepted To ICASSP2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[96]  arXiv:2403.03145 (cross-list from cs.CV) [pdf, other]
Title: Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
Comments: Accepted to NeurIPS2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[97]  arXiv:2403.03224 (cross-list from physics.soc-ph) [pdf, other]
Title: Reinforcement Learning Jazz Improvisation: When Music Meets Game Theory
Comments: 16 pages, 4 figures
Subjects: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[98]  arXiv:2403.03395 (cross-list from cs.SD) [pdf, other]
Title: Interactive Melody Generation System for Enhancing the Creativity of Musicians
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
[99]  arXiv:2403.03411 (cross-list from cs.SD) [pdf, other]
Title: CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation
Comments: 9 pages
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[100]  arXiv:2403.03510 (cross-list from cs.SD) [pdf, other]
Title: METAMAT 01: A semi-analytic Solution for Benchmarking Wave Propagation Simulations of homogeneous Absorbers in 1D/3D and 2D
Comments: 4
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Classical Physics (physics.class-ph)
[101]  arXiv:2403.03522 (cross-list from cs.SD) [pdf, other]
Title: Non-verbal information in spontaneous speech -- towards a new framework of analysis
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[102]  arXiv:2403.03538 (cross-list from cs.SD) [pdf, other]
Title: RADIA -- Radio Advertisement Detection with Intelligent Analytics
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[103]  arXiv:2403.03947 (cross-list from cs.SD) [pdf, other]
Title: Can Audio Reveal Music Performance Difficulty? Insights from the Piano Syllabus Dataset
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[104]  arXiv:2403.04111 (cross-list from cs.SD) [pdf, ps, other]
Title: Multi-Level Attention Aggregation for Language-Agnostic Speaker Replication
Comments: Accepted to EACL Main 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[105]  arXiv:2403.04178 (cross-list from cs.CL) [pdf, other]
Title: Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[106]  arXiv:2403.04245 (cross-list from cs.SD) [pdf, other]
Title: A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
Comments: the paper is accepted by CVPR2024
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[107]  arXiv:2403.04594 (cross-list from cs.SD) [pdf, other]
Title: A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[108]  arXiv:2403.04654 (cross-list from cs.CV) [pdf, other]
Title: Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention
Comments: Accepted to FG2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[109]  arXiv:2403.04661 (cross-list from cs.CV) [pdf, other]
Title: Dynamic Cross Attention for Audio-Visual Person Verification
Comments: Accepted to FG2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[110]  arXiv:2403.05010 (cross-list from cs.SD) [pdf, other]
Title: RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[111]  arXiv:2403.05380 (cross-list from cs.SD) [pdf, other]
Title: Spectrogram-Based Detection of Auto-Tuned Vocals in Music Recordings
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[112]  arXiv:2403.05583 (cross-list from cs.HC) [pdf, other]
Title: A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[113]  arXiv:2403.05772 (cross-list from cs.SD) [pdf, other]
Title: sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks
Comments: Accepted by ICASSP 2024
Subjects: Sound (cs.SD); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
[114]  arXiv:2403.05820 (cross-list from cs.SD) [pdf, other]
Title: An Audio-textual Diffusion Model For Converting Speech Signals Into Ultrasound Tongue Imaging Data
Comments: ICASSP2024 Accept
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[115]  arXiv:2403.05834 (cross-list from cs.MM) [pdf, other]
Title: Enhancing Expressiveness in Dance Generation via Integrating Frequency and Music Style Information
Subjects: Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[116]  arXiv:2403.05989 (cross-list from cs.SD) [pdf, other]
Title: HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[117]  arXiv:2403.06100 (cross-list from cs.HC) [pdf, other]
Title: Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
[118]  arXiv:2403.06260 (cross-list from cs.CL) [pdf, other]
Title: SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations
Comments: Accepted at ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[119]  arXiv:2403.06387 (cross-list from cs.SD) [pdf, other]
Title: Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing. arXiv admin note: text overlap with arXiv:2210.13318
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[120]  arXiv:2403.06404 (cross-list from cs.SD) [pdf, other]
Title: Cosine Scoring with Uncertainty for Neural Speaker Embedding
Comments: 5 pages, 4 figures
Journal-ref: IEEE Signal Processing Letters 2024
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[121]  arXiv:2403.06487 (cross-list from cs.CL) [pdf, other]
Title: Multilingual Turn-taking Prediction Using Voice Activity Projection
Comments: This paper has been accepted for presentation at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) and represents the author's version of the work
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[122]  arXiv:2403.07675 (cross-list from cs.SD) [pdf, other]
Title: Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[123]  arXiv:2403.07802 (cross-list from cs.SD) [pdf, other]
Title: Boosting keyword spotting through on-device learnable user speech characteristics
Comments: 5 pages, 3 tables, 2 figures. Accepted as a full paper by the tinyML Research Symposium 2024
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[124]  arXiv:2403.07938 (cross-list from cs.SD) [pdf, other]
Title: Text-to-Audio Generation Synchronized with Videos
Comments: arXiv admin note: text overlap with arXiv:2305.12903
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[125]  arXiv:2403.07995 (cross-list from cs.SD) [pdf, ps, other]
Title: Motifs, Phrases, and Beyond: The Modelling of Structure in Symbolic Music Generation
Comments: Accepted to 13th International Conference on Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART) 2024
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Symbolic Computation (cs.SC); Audio and Speech Processing (eess.AS)
[126]  arXiv:2403.08164 (cross-list from cs.SD) [pdf, other]
Title: EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech
Comments: Accepted by the 27th IEEE International Conference on Computer Supported Cooperative Work in Design (IEEE CSCWD 2024). arXiv admin note: substantial text overlap with arXiv:2211.01948
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[127]  arXiv:2403.08187 (cross-list from cs.CL) [pdf, other]
Title: Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children
Comments: 12 pages, 2 figures
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[128]  arXiv:2403.08196 (cross-list from cs.CL) [pdf, other]
Title: SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[129]  arXiv:2403.08525 (cross-list from cs.SD) [pdf, other]
Title: From Weak to Strong Sound Event Labels using Adaptive Change-Point Detection and Active Learning
Comments: Under review at EUSIPCO 2024
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[130]  arXiv:2403.08559 (cross-list from cs.SD) [pdf, other]
Title: End-to-End Amp Modeling: From Data to Controllable Guitar Amplifier Models
Comments: Presented at ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[131]  arXiv:2403.08738 (cross-list from cs.CL) [pdf, other]
Title: Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations
Comments: Accepted to EACL 2024 Main Conference, Long paper
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[132]  arXiv:2403.09030 (cross-list from cs.SD) [pdf, ps, other]
Title: An AI-Driven Approach to Wind Turbine Bearing Fault Diagnosis from Acoustic Signals
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[133]  arXiv:2403.09298 (cross-list from cs.SD) [pdf, ps, other]
Title: More than words: Advancements and challenges in speech recognition for singing
Authors: Anna Kruspe
Comments: Conference on Electronic Speech Signal Processing (ESSV) 2024, Keynote
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[134]  arXiv:2403.09321 (cross-list from cs.SD) [pdf, other]
Title: A Practical Guide to Spectrogram Analysis for Audio Signal Processing
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[135]  arXiv:2403.09407 (cross-list from cs.SD) [pdf, other]
Title: LM2D: Lyrics- and Music-Driven Dance Synthesis
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[136]  arXiv:2403.09451 (cross-list from cs.CV) [pdf, other]
Title: M&M: Multimodal-Multitask Model Integrating Audiovisual Cues in Cognitive Load Assessment
Journal-ref: Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2 VISAPP: VISAPP, 869-876, 2024, Rome, Italy
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[137]  arXiv:2403.09455 (cross-list from cs.SD) [pdf, other]
Title: The Neural-SRP method for positional sound source localization
Comments: Presented at Asilomar Conference on Signals, Systems, and Computers
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[138]  arXiv:2403.09579 (cross-list from cs.SD) [pdf, other]
Title: uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures
Comments: 5 pages, 6 figures, 4 tables. To appear in ICASSP'2024
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[139]  arXiv:2403.09598 (cross-list from cs.SD) [pdf, other]
Title: Mixture of Mixups for Multi-label Classification of Rare Anuran Sounds
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[140]  arXiv:2403.09753 (cross-list from cs.SD) [pdf, other]
Title: SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages
Comments: Accepted as a full paper by the tinyML Research Symposium 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[141]  arXiv:2403.10024 (cross-list from cs.SD) [pdf, other]
Title: MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[142]  arXiv:2403.10146 (cross-list from cs.SD) [pdf, other]
Title: Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval
Comments: 5 pages, accepted to ICASSP2024
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
[143]  arXiv:2403.10380 (cross-list from cs.SD) [pdf, other]
Title: BirdSet: A Multi-Task Benchmark for Classification in Computational Avian Bioacoustics
Comments: Work in progress, to be submitted @DMLR next month
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[144]  arXiv:2403.10488 (cross-list from cs.CV) [pdf, other]
Title: Joint Multimodal Transformer for Emotion Recognition in the Wild
Comments: 10 pages, 4 figures, 6 tables, CVPRw 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[145]  arXiv:2403.10493 (cross-list from cs.SD) [pdf, other]
Title: MusicHiFi: Fast High-Fidelity Stereo Vocoding
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[146]  arXiv:2403.10518 (cross-list from cs.CV) [pdf, other]
Title: Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives
Comments: Accepted by CVPR2024, Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[147]  arXiv:2403.10549 (cross-list from cs.SD) [pdf, other]
Title: On-Device Domain Learning for Keyword Spotting on Low-Power Extreme Edge Embedded Systems
Comments: 5 pages, 2 tables, 2 figures. Accepted at IEEE AICAS 2024
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[148]  arXiv:2403.10796 (cross-list from cs.SD) [pdf, other]
Title: CoPlay: Audio-agnostic Cognitive Scaling for Acoustic Sensing
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[149]  arXiv:2403.10805 (cross-list from cs.SD) [pdf, other]
Title: Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference
Comments: 12 pages
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
[150]  arXiv:2403.10904 (cross-list from cs.SD) [pdf, other]
Title: Urban Sound Propagation: a Benchmark for 1-Step Generative Modeling of Complex Physical Systems
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[151]  arXiv:2403.10961 (cross-list from cs.LG) [pdf, other]
Title: Energy-Based Models with Applications to Speech and Language Processing
Authors: Zhijian Ou
Comments: The version before publisher editing
Journal-ref: Foundations and Trends in Signal Processing: Vol. 18: No. 1-2, pp 1-199
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[152]  arXiv:2403.11074 (cross-list from cs.CV) [pdf, other]
Title: Audio-Visual Segmentation via Unlabeled Frame Exploitation
Comments: Accepted by CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[153]  arXiv:2403.11091 (cross-list from cs.SD) [pdf, other]
Title: Multitask frame-level learning for few-shot sound event detection
Comments: 6 pages, 4 figures, conference
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[154]  arXiv:2403.11626 (cross-list from cs.GR) [pdf, other]
Title: QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation
Comments: Accepted by The Visual Computer Journal
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[155]  arXiv:2403.11706 (cross-list from cs.SD) [pdf, other]
Title: Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models
Comments: Accepted at ICASSP 2024
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[156]  arXiv:2403.11732 (cross-list from cs.SD) [pdf, other]
Title: Hallucination in Perceptual Metric-Driven Speech Enhancement Networks
Comments: Submitted to EUSIPCO 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[157]  arXiv:2403.11757 (cross-list from cs.MM) [pdf, other]
Title: Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation
Subjects: Multimedia (cs.MM); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[158]  arXiv:2403.11778 (cross-list from cs.SD) [pdf, other]
Title: Towards the Development of a Real-Time Deepfake Audio Detection System in Communication Platforms
Subjects: Sound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[159]  arXiv:2403.11780 (cross-list from cs.SD) [pdf, other]
Title: Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
Comments: Accepted by NAACL 2024 (main conference)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[160]  arXiv:2403.11827 (cross-list from cs.SD) [pdf, other]
Title: Sound Event Detection and Localization with Distance Estimation
Comments: This paper has been submitted for the 32nd European Signal Processing Conference EUSIPCO 2024 in Lyon
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[161]  arXiv:2403.11879 (cross-list from cs.SD) [pdf, other]
Title: Unimodal Multi-Task Fusion for Emotional Mimicry Prediction
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[162]  arXiv:2403.12000 (cross-list from cs.SD) [pdf, other]
Title: Notochord: a Flexible Probabilistic Model for Real-Time MIDI Performance
Comments: 12 pages, 6 figures. Proceedings of the 3rd Conference on AI Music Creativity (2022, September 17)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[163]  arXiv:2403.12402 (cross-list from cs.CL) [pdf, other]
Title: An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[164]  arXiv:2403.12408 (cross-list from cs.CL) [pdf, other]
Title: MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[165]  arXiv:2403.12425 (cross-list from cs.CV) [pdf, other]
Title: Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation
Comments: 8 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[166]  arXiv:2403.12477 (cross-list from cs.SD) [pdf, other]
Title: Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation
Comments: 5 pages, 3 figures, accepted at HSCMA 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[167]  arXiv:2403.13086 (cross-list from cs.SD) [pdf, other]
Title: Listenable Maps for Audio Classifiers
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[168]  arXiv:2403.13252 (cross-list from cs.SD) [pdf, other]
Title: Frequency-aware convolution for sound event detection
Authors: Tao Song
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[169]  arXiv:2403.13253 (cross-list from cs.CL) [pdf, other]
Title: Document Author Classification Using Parsed Language Structure
Journal-ref: International Journal on Natural Language Computing (IJNLC), Feb. 24, 2024
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[170]  arXiv:2403.13254 (cross-list from cs.SD) [pdf, other]
Title: Onset and offset weighted loss function for sound event detection
Authors: Tao Song
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[171]  arXiv:2403.13353 (cross-list from cs.SD) [pdf, other]
Title: Building speech corpus with diverse voice characteristics for its prompt-based representation
Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. arXiv admin note: text overlap with arXiv:2309.13509
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[172]  arXiv:2403.13423 (cross-list from cs.SD) [pdf, other]
Title: Advanced Long-Content Speech Recognition With Factorized Neural Transducer
Comments: Accepted by TASLP 2024
Journal-ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1803-1815, 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[173]  arXiv:2403.13659 (cross-list from cs.CV) [pdf, other]
Title: Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[174]  arXiv:2403.13720 (cross-list from cs.SD) [pdf, other]
Title: UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge
Comments: 5 pages, 3 figures
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[175]  arXiv:2403.13922 (cross-list from cs.CL) [pdf, other]
Title: Visually Grounded Speech Models have a Mutual Exclusivity Bias
Comments: Accepted to TACL, pre-MIT Press publication version
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[176]  arXiv:2403.14048 (cross-list from cs.SD) [pdf, ps, other]
Title: The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[177]  arXiv:2403.14083 (cross-list from cs.SD) [pdf, other]
Title: emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition
Comments: Submitted to IEEE Transactions on Affective Computing on February 19, 2024. arXiv admin note: text overlap with arXiv:2305.14402
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[178]  arXiv:2403.14286 (cross-list from cs.SD) [pdf, other]
Title: Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization
Comments: Manuscript Under Review
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[179]  arXiv:2403.14290 (cross-list from cs.SD) [pdf, other]
Title: Exploring Green AI for Audio Deepfake Detection
Comments: This manuscript is under review in a conference
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[180]  arXiv:2403.14402 (cross-list from cs.SD) [pdf, other]
Title: XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[181]  arXiv:2403.14438 (cross-list from cs.CL) [pdf, other]
Title: A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
Comments: arXiv admin note: text overlap with arXiv:2312.03632
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[182]  arXiv:2403.15469 (cross-list from cs.CL) [pdf, other]
Title: Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning
Comments: Accepted in NAACL2024 Findings
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[183]  arXiv:2403.15510 (cross-list from cs.CR) [pdf, other]
Title: Privacy-Preserving End-to-End Spoken Language Understanding
Comments: Accepted by IJCAI
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[184]  arXiv:2403.15523 (cross-list from q-bio.NC) [pdf, other]
Title: Towards auditory attention decoding with noise-tagging: A pilot study
Comments: 6 pages, 2 figures, 9th Graz Brain-Computer Interface Conference 2024
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[185]  arXiv:2403.15569 (cross-list from cs.SD) [pdf, other]
Title: Music to Dance as Language Translation using Sequence Models
Subjects: Sound (cs.SD); Robotics (cs.RO); Audio and Speech Processing (eess.AS)
[186]  arXiv:2403.16078 (cross-list from cs.SD) [pdf, other]
Title: Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy
Comments: Accepted by IJCNN 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[187]  arXiv:2403.16331 (cross-list from cs.SD) [pdf, other]
Title: Modeling Analog Dynamic Range Compressors using Deep Learning and State-space Models
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[188]  arXiv:2403.16464 (cross-list from cs.SD) [pdf, other]
Title: Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator
Comments: Accepted to ICASSP 2024. Project page: this https URL
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[189]  arXiv:2403.16760 (cross-list from cs.HC) [pdf, ps, other]
Title: As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli
Comments: For study pre-registration, see this https URL
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[190]  arXiv:2403.16865 (cross-list from cs.CL) [pdf, other]
Title: Encoding of lexical tone in self-supervised models of spoken language
Comments: Accepted to NAACL 2024
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[191]  arXiv:2403.17327 (cross-list from cs.SD) [pdf, other]
Title: Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[192]  arXiv:2403.17376 (cross-list from cs.SD) [pdf, ps, other]
Title: Theoretical Analysis of Quality of Conventional Beamforming for Phased Microphone Arrays
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[193]  arXiv:2403.17378 (cross-list from cs.SD) [pdf, other]
Title: Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks
Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing. arXiv admin note: substantial text overlap with arXiv:2211.15974
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[194]  arXiv:2403.17379 (cross-list from cs.SD) [pdf, other]
Title: Exploring and Applying Audio-Based Sentiment Analysis in Music
Authors: Etash Jhanji
Comments: 5 pages, 7 figures, 2 tables. For source code, see this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[195]  arXiv:2403.17420 (cross-list from cs.CV) [pdf, other]
Title: Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge
Comments: Accepted at CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[196]  arXiv:2403.17508 (cross-list from cs.SD) [pdf, other]
Title: Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependant
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[197]  arXiv:2403.17529 (cross-list from cs.SD) [pdf, other]
Title: Detection of Deepfake Environmental Audio
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[198]  arXiv:2403.17562 (cross-list from cs.SD) [pdf, other]
Title: Deep functional multiple index models with an application to SER
Comments: 5 pages, 1 figure
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Applications (stat.AP)
[199]  arXiv:2403.18572 (cross-list from cs.SD) [pdf, ps, other]
Title: ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[200]  arXiv:2403.18635 (cross-list from cs.LG) [pdf, other]
Title: Fusion approaches for emotion recognition from speech using acoustic and text-based features
Comments: 5 pages. Accepted in ICASSP 2020
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[201]  arXiv:2403.18811 (cross-list from cs.CV) [pdf, other]
Title: Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment
Comments: ICLR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[202]  arXiv:2403.18821 (cross-list from cs.SD) [pdf, other]
Title: Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
Comments: Accepted to CVPR 2024. Project site: this https URL
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[203]  arXiv:2403.18843 (cross-list from cs.CV) [pdf, other]
Title: JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[204]  arXiv:2403.19002 (cross-list from cs.MM) [pdf, other]
Title: Robust Active Speaker Detection in Noisy Environments
Comments: 15 pages, 5 figures
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[205]  arXiv:2403.19224 (cross-list from cs.SD) [pdf, other]
Title: Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition
Comments: Accepted by 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[206]  arXiv:2403.19441 (cross-list from cs.SD) [pdf, other]
Title: A Novel Stochastic Transformer-based Approach for Post-Traumatic Stress Disorder Detection using Audio Recording of Clinical Interviews
Journal-ref: 2023 IEEE 36th International Symposium on Computer-Based Medical Systems (2023) 700-705
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[207]  arXiv:2403.19509 (cross-list from cs.CL) [pdf, ps, other]
Title: Phonetic Segmentation of the UCLA Phonetics Lab Archive
Comments: Accepted at LREC-COLING 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[208]  arXiv:2403.19634 (cross-list from cs.SD) [pdf, ps, other]
Title: Asymmetric and trial-dependent modeling: the contribution of LIA to SdSV Challenge Task 2
Comments: LIA system description for the Short Duration Speaker Verification (SdSv) challenge 2020 Task 2
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[209]  arXiv:2403.19638 (cross-list from cs.CV) [pdf, other]
Title: Siamese Vision Transformers are Scalable Audio-visual Learners
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[210]  arXiv:2403.19763 (cross-list from cs.SD) [pdf, other]
Title: Creating Aesthetic Sonifications on the Web with SIREN
Comments: 7 pages, 1 figure, 5 listings, submitted to the Web Audio Conference 2024
Subjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[211]  arXiv:2403.20130 (cross-list from cs.SD) [pdf, other]
Title: Sound event localization and classification using WASN in Outdoor Environment
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[212]  arXiv:2403.20202 (cross-list from cs.SD) [pdf, ps, other]
Title: Voice Signal Processing for Machine Learning. The Case of Speaker Isolation
Authors: Radan Ganchev
Comments: MSc. thesis. for associated source code, see this https URL
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[213]  arXiv:2403.20289 (cross-list from cs.CL) [pdf, other]
Title: Emotion-Anchored Contrastive Learning Framework for Emotion Recognition in Conversation
Comments: Accepted by Findings of NAACL 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
