Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling

Wang, Rui; Wu, Zuxuan; Chen, Dongdong; Chen, Yinpeng; Dai, Xiyang; Liu, Mengchen; Zhou, Luowei; Yuan, Lu; Jiang, Yu-Gang

(Submitted on 25 Aug 2022)

Abstract: Transformer-based models have achieved top performance on major video recognition benchmarks. Benefiting from the self-attention mechanism, these models show a stronger ability to model long-range dependencies than CNN-based models. However, significant computational overhead, resulting from the quadratic complexity of self-attention over a very large number of tokens, limits the use of existing video transformers in resource-constrained applications such as mobile devices. In this paper, we extend Mobile-Former to Video Mobile-Former, which decouples the video architecture into lightweight 3D-CNNs for local context modeling and Transformer modules for global interaction modeling, arranged in parallel. To avoid the significant computational cost of computing self-attention between the large number of local patches in videos, we propose to use very few global tokens (e.g., 6) for a whole video in the Transformers, which exchange information with the 3D-CNNs through a cross-attention mechanism. Through efficient global spatial-temporal modeling, Video Mobile-Former significantly improves the video recognition performance of alternative lightweight baselines, and outperforms other efficient CNN-based models in the low-FLOP regime from 500M to 6G total FLOPs on various video recognition tasks. It is worth noting that Video Mobile-Former is the first Transformer-based video model that constrains the computational budget within 1G FLOPs.
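
The cost argument in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature-map size (8 frames of 14 x 14 positions), the single-head attention, and the helper names `cross_attention` and `pairwise_scores` are assumptions made for illustration, as is the choice to count two cross-attention directions (local to global and global to local) as one "exchange". Only the figure of 6 global tokens comes from the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (illustrative helper).

    queries: M vectors; keys and values: N vectors each.
    Returns M vectors, each a softmax-weighted average of the values.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # One dot product per (query, key) pair: M * N in total.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def pairwise_scores(num_queries, num_keys):
    """Number of query-key dot products one attention layer computes."""
    return num_queries * num_keys


# Assumed sizes: 8 frames x 14 x 14 local positions; 6 global tokens (from the abstract).
NUM_LOCAL = 8 * 14 * 14
NUM_GLOBAL = 6

# Self-attention among all local positions grows quadratically with NUM_LOCAL.
self_attn_pairs = pairwise_scores(NUM_LOCAL, NUM_LOCAL)

# Assumed two-way exchange: global tokens query local features, then
# local features query global tokens. Both are linear in NUM_LOCAL.
cross_attn_pairs = pairwise_scores(NUM_GLOBAL, NUM_LOCAL) + pairwise_scores(NUM_LOCAL, NUM_GLOBAL)

# Tiny demo: one global token attending over two local feature vectors.
demo_out = cross_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])

print(self_attn_pairs, cross_attn_pairs)
```

Under these assumed sizes, self-attention over the local positions needs 2,458,624 pairwise scores, while the two-way exchange with 6 global tokens needs 18,816. The counts cover attention scores only, not the full FLOP budget of either branch.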

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2208.12257 [cs.CV]
	(or arXiv:2208.12257v1 [cs.CV] for this version)

Submission history

From: Dongdong Chen [view email]
[v1] Thu, 25 Aug 2022 17:59:00 GMT (5547kb,D)
