AudioScenic: Audio-Driven Video Scene Editing

Shen, Kaixin; Quan, Ruijie; Zhu, Linchao; Xiao, Jun; Yang, Yi

Full-text links:

Download:

Current browse context:

cs.CV

< prev | next >

new | recent | 2404

Change to browse by:

Computer Science > Computer Vision and Pattern Recognition

Title: AudioScenic: Audio-Driven Video Scene Editing

Authors: Kaixin Shen, Ruijie Quan, Linchao Zhu, Jun Xiao, Yi Yang

(Submitted on 25 Apr 2024)

Abstract: Audio-driven visual scene editing endeavors to manipulate the visual background while leaving the foreground content unchanged, according to the given audio signals. Unlike current efforts focusing primarily on image editing, audio-driven video scene editing has not been extensively addressed. In this paper, we introduce AudioScenic, an audio-driven framework designed for video scene editing. AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process. As our focus is on background editing, we further introduce a SceneMasker module, which maintains the integrity of the foreground content during the editing process. AudioScenic exploits the inherent properties of audio, namely, audio magnitude and frequency, to guide the editing process, aiming to control the temporal dynamics and enhance the temporal consistency. First, we present an audio Magnitude Modulator module that adjusts the temporal dynamics of the scene in response to changes in audio magnitude, enhancing the visual dynamics. Second, the audio Frequency Fuser module is designed to ensure temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes, thus improving the overall temporal coherence of the edited videos. These integrated features enable AudioScenic to not only enhance visual diversity but also maintain temporal consistency throughout the video. We present a new metric named temporal score for more comprehensive validation of temporal consistency. We demonstrate substantial advancements of AudioScenic over competing methods on DAVIS and Audioset datasets.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.16581 [cs.CV]
	(or arXiv:2404.16581v1 [cs.CV] for this version)

Submission history

From: Kaixin Shen [view email]
[v1] Thu, 25 Apr 2024 12:55:58 GMT (5829kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2404.16581

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Computer Vision and Pattern Recognition

Title: AudioScenic: Audio-Driven Video Scene Editing

Submission history