Improving Pixel-based MIM by Reducing Wasted Modeling Capability

Liu, Yuan; Zhang, Songyang; Chen, Jiacheng; Yu, Zhaohui; Chen, Kai; Lin, Dahua

(Submitted on 1 Aug 2023)

Abstract: There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation. Code and models are available at this https URL
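
The abstract does not spell out how shallow-layer features are combined with deeper ones, so the following is only a rough sketch of one plausible form of multi-level feature fusion: a softmax-weighted average over the token features of a few selected encoder layers. The function names (`softmax`, `fuse_layers`), the choice of layers, and the use of plain Python lists in place of tensors are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence

Tokens = List[List[float]]  # token features of one layer: [num_tokens][dim]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(
    layer_outputs: Sequence[Tokens],
    selected: Sequence[int],
    logits: Sequence[float],
) -> Tokens:
    """Hypothetical fusion: weighted average of the selected layers' tokens.

    `selected` holds indices into `layer_outputs` (e.g. a shallow layer and
    the last layer); `logits` are fusion weights before normalization, which
    would be learnable in a real model (an assumption of this sketch).
    """
    if len(selected) != len(logits):
        raise ValueError("need exactly one logit per selected layer")
    weights = softmax(logits)
    first = layer_outputs[selected[0]]
    num_tokens, dim = len(first), len(first[0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for w, idx in zip(weights, selected):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += w * layer_outputs[idx][t][d]
    return fused


# Toy input: 3 "layers", each with 2 tokens of dimension 2.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],  # shallow layer
    [[2.0, 2.0], [2.0, 2.0]],  # middle layer (not selected below)
    [[3.0, 4.0], [5.0, 6.0]],  # last layer
]
# Equal logits give equal weights, so this averages the first and last layer.
fused = fuse_layers(layers, selected=[0, 2], logits=[0.0, 0.0])
```

In a pixel-based MIM setup such as MAE, a fused representation of this kind would stand in for the last-layer output that is passed on for pixel reconstruction; whether the fusion is a weighted average or something else, and which layers are selected, are design choices that the abstract leaves open.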

Comments:	Accepted by ICCV2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2308.00261 [cs.CV]
	(or arXiv:2308.00261v1 [cs.CV] for this version)

Submission history

From: Yuan Liu
[v1] Tue, 1 Aug 2023 03:44:56 GMT (557kb,D)
