We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.SD

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo

Computer Science > Sound

Title: Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependant

Abstract: This paper explores whether considering alternative domain-specific embeddings to calculate the Fr\'echet Audio Distance (FAD) metric can help the FAD to correlate better with perceptual ratings of environmental sounds. We used embeddings from VGGish, PANNs, MS-CLAP, L-CLAP, and MERT, which are tailored for either music or environmental sound evaluation. The FAD scores were calculated for sounds from the DCASE 2023 Task 7 dataset. Using perceptual data from the same task, we find that PANNs-WGM-LogMel produces the best correlation between FAD scores and perceptual ratings of both audio quality and perceived fit with a Spearman correlation higher than 0.5. We also find that music-specific embeddings resulted in significantly lower results. Interestingly, VGGish, the embedding used for the original Fr\'echet calculation, yielded a correlation below 0.1. These results underscore the critical importance of the choice of embedding for the FAD metric design.
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2403.17508 [cs.SD]
  (or arXiv:2403.17508v1 [cs.SD] for this version)

Submission history

From: Mathieu Lagrange [view email]
[v1] Tue, 26 Mar 2024 09:09:59 GMT (90kb,D)

Link back to: arXiv, form interface, contact.