Computer Science > Computer Vision and Pattern Recognition

Title: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

Abstract: A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
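The abstract's Multiway Transformer idea (a shared self-attention module followed by modality-specific feed-forward "experts") can be sketched as follows. This is a minimal illustrative toy in numpy, not the paper's implementation: the layer sizes, weight initialization, and the `vision`/`language`/`vl` expert names are assumptions for demonstration, and real blocks would add layer norm, multi-head attention, and learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy hidden size (illustrative only)

def linear(d_in, d_out):
    # random weight matrix standing in for a learned projection
    return rng.normal(0, 0.02, (d_in, d_out))

class MultiwayBlock:
    """Toy sketch of one Multiway Transformer block: shared
    self-attention, then a modality-specific FFN expert chosen
    by the input's modality tag (vision / language / vision-language)."""

    def __init__(self, d=D):
        self.wq, self.wk, self.wv = (linear(d, d) for _ in range(3))
        # one FFN expert per modality; tokens route to their expert
        self.experts = {m: (linear(d, 4 * d), linear(4 * d, d))
                        for m in ("vision", "language", "vl")}

    def attention(self, x):
        # single-head scaled dot-product attention, shared across modalities
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        scores = q @ k.T / np.sqrt(x.shape[-1])
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        return x + weights @ v  # residual connection

    def __call__(self, x, modality):
        h = self.attention(x)
        w1, w2 = self.experts[modality]  # modality-specific FFN
        return h + np.maximum(h @ w1, 0) @ w2  # ReLU FFN + residual

block = MultiwayBlock()
image_tokens = rng.normal(size=(5, D))  # "Imglish" patch tokens
text_tokens = rng.normal(size=(7, D))   # English subword tokens
out_v = block(image_tokens, "vision")
out_l = block(text_tokens, "language")
print(out_v.shape, out_l.shape)
```

Both modalities pass through the same attention weights, while each keeps its own feed-forward parameters; the `"vl"` expert would be used in the fusion layers for image-text pairs, which is how the single backbone serves the unified masked-data pretraining the abstract describes.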
Comments: 18 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as: arXiv:2208.10442 [cs.CV]
  (or arXiv:2208.10442v2 [cs.CV] for this version)

Submission history

From: Li Dong [view email]
[v1] Mon, 22 Aug 2022 16:55:04 GMT (476kb,D)
[v2] Wed, 31 Aug 2022 02:26:45 GMT (287kb,D)
