We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.SD

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo

Computer Science > Sound

Title: ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers

Abstract: Existing neural audio codecs usually sacrifice computational complexity for audio quality. They build the feature transformation layers mainly on convolutional blocks, which are not inherently appropriate for capturing local redundancies of audio signals. As compensation, either adversarial losses from a discriminator or a large number of model parameters are required to improve the codec. To that end, we propose Efficient Speech Codec (ESC), a lightweight parameter-efficient codec laid on cross-scale residual vector quantization and transformers. Our model leverages mirrored hierarchical window-attention transformer blocks and performs step-wise decoding from coarse-to-fine feature representations. To enhance codebook utilization, we design a learning paradigm that involves a pre-training stage to assist with codec training. Extensive results show that ESC can achieve high audio quality with much lower complexity, which is a prospective alternative in place of existing codecs.
Comments: Preprint
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2404.19441 [cs.SD]
  (or arXiv:2404.19441v1 [cs.SD] for this version)

Submission history

From: Yuzhe Gu [view email]
[v1] Tue, 30 Apr 2024 10:44:33 GMT (712kb,D)

Link back to: arXiv, form interface, contact.