SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

Yun, Seokju; Ro, Youngmin

(Submitted on 29 Jan 2024 (v1), last revised 27 Mar 2024 (this version, v2))

Abstract: Recently, efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally, they use 4x4 patch embeddings and a 4-stage structure at the macro level, while utilizing sophisticated attention with a multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using a larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore, our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions, and that several attention heads in the later stages are computationally redundant. To handle this, we introduce a single-head attention module that inherently prevents head redundancy and simultaneously boosts accuracy by combining global and local information in parallel. Building upon our solutions, we introduce SHViT, a Single-Head Vision Transformer that obtains a state-of-the-art speed-accuracy tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3x, 8.1x, and 2.4x faster than MobileViTv2 x1.0 on GPU, CPU, and an iPhone 12 mobile device, respectively, while being 1.3% more accurate. For object detection and instance segmentation on MS COCO using a Mask R-CNN head, our model achieves performance comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone latency on GPU and mobile device, respectively.
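
To make the abstract's point about patchify stride concrete, here is a minimal arithmetic sketch of how the stride changes the number of tokens entering the first stage. The 224x224 input resolution and the stride of 16 are illustrative assumptions (the abstract only says "larger-stride"), and `token_grid` is a hypothetical helper, not code from the paper.

```python
from typing import Tuple


def token_grid(image_size: int, patch_stride: int) -> Tuple[int, int]:
    """Return (tokens per side, total tokens) for a non-overlapping patchify stem."""
    if image_size % patch_stride != 0:
        raise ValueError("image_size must be divisible by patch_stride")
    side = image_size // patch_stride
    return side, side * side


IMAGE_SIZE = 224  # assumed input resolution, not stated in the abstract

# Stride 4 is the conventional 4x4 patch embedding; 16 is an assumed "larger" stride.
for stride in (4, 16):
    side, tokens = token_grid(IMAGE_SIZE, stride)
    # Global self-attention compares every token with every other token,
    # so the number of query-key pairs grows with the square of the token count.
    pairs = tokens * tokens
    print(f"stride {stride:>2}: {side}x{side} grid, {tokens} tokens, {pairs} query-key pairs")
```

Under these assumptions, moving from stride 4 to stride 16 shrinks the token grid from 56x56 to 14x14, that is, from 3136 tokens to 196. This only counts tokens; it says nothing about measured latency or accuracy, which the paper reports separately.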

Comments:	CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2401.16456 [cs.CV]
	(or arXiv:2401.16456v2 [cs.CV] for this version)

Submission history

From: Seokjoo Yun
[v1] Mon, 29 Jan 2024 09:12:23 GMT (6839kb,D)
[v2] Wed, 27 Mar 2024 04:14:59 GMT (7160kb,D)
