D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

Pei, Wenjie; Tan, Qizhong; Lu, Guangming; Tian, Jiandong

(Submitted on 3 Dec 2023 (v1), last revised 20 Apr 2024 (this version, v3))

Abstract: Adapting large pre-trained image models to few-shot action recognition has proven to be an effective and efficient strategy for learning robust feature extractors, which is essential for few-shot learning. The typical fine-tuning-based adaptation paradigm is prone to overfitting in few-shot learning scenarios and offers little modeling flexibility for learning temporal features in video data. In this work we present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), a novel adapter tuning framework that is well suited to few-shot action recognition owing to its lightweight design and low parameter-learning overhead. It uses a dual-pathway architecture to encode spatial and temporal features in a disentangled manner. In particular, we devise the anisotropic Deformable Spatio-Temporal Attention module as the core component of D$^2$ST-Adapter. The module can be configured with anisotropic sampling densities along the spatial and temporal domains so that each pathway learns its own kind of feature, which allows D$^2$ST-Adapter to encode features with a global view of the 3D spatio-temporal space while remaining lightweight. Extensive experiments with instantiations of our method on both pre-trained ResNet and ViT demonstrate the superiority of our method over state-of-the-art methods for few-shot action recognition. Our method is particularly well suited to challenging scenarios where temporal dynamics are critical for action recognition.
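
The idea of anisotropic sampling densities can be illustrated with a small sketch. It is not the authors' implementation: the function name `reference_points`, the volume size, and the per-axis sample counts are all hypothetical, chosen only to show how a spatial pathway might sample densely over height and width but sparsely over time, and a temporal pathway the reverse. Deformable attention modules in general shift such reference points by learned offsets before attending to them; that step is omitted here.

```python
from itertools import product


def reference_points(shape, counts):
    """Return uniformly spaced reference points inside a (T, H, W) volume.

    shape  -- (T, H, W) size of the feature volume
    counts -- (n_t, n_h, n_w) number of sample points along each axis
    Each axis is split into equal cells and the cell centers are used.
    """
    axes = []
    for size, n in zip(shape, counts):
        step = size / n
        axes.append([(i + 0.5) * step for i in range(n)])
    # Cartesian product of the per-axis coordinates gives the 3D grid.
    return list(product(*axes))


# Hypothetical sizes, chosen only for illustration (not from the paper).
volume = (8, 14, 14)  # T frames, H x W tokens per frame

# Spatial pathway (assumed counts): dense over H and W, sparse over T.
spatial_points = reference_points(volume, (2, 7, 7))
# Temporal pathway (assumed counts): dense over T, sparse over H and W.
temporal_points = reference_points(volume, (8, 3, 3))

print(len(spatial_points), len(temporal_points))  # prints "98 72"
```

With these assumed settings, both pathways sample far fewer positions than the 1568 tokens in the full 8 x 14 x 14 volume, while each grid still spans the whole volume. This is one way to read the abstract's claim of a global view at low cost.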

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.01431 [cs.CV]
	(or arXiv:2312.01431v3 [cs.CV] for this version)

Submission history

From: Qizhong Tan
[v1] Sun, 3 Dec 2023 15:40:10 GMT (4308kb,D)
[v2] Wed, 17 Apr 2024 12:36:06 GMT (7706kb,D)
[v3] Sat, 20 Apr 2024 14:15:36 GMT (7702kb,D)
