Computer Science > Computer Vision and Pattern Recognition
Title: MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval
(Submitted on 30 Oct 2023 (v1), last revised 2 Apr 2024 (this version, v3))
Abstract: Due to the success of large-scale vision-language pre-training (VLP) models and the widespread use of image-text retrieval in industry, it is now critically necessary to reduce model size and streamline mobile-device deployment. Single- and dual-stream model structures are commonly used in image-text retrieval to close the semantic gap between the textual and visual modalities. While single-stream models use deep feature fusion to achieve more accurate cross-modal alignment, dual-stream models are better suited to offline indexing and fast inference. We propose a Multi-teacher Cross-modal Alignment Distillation (MCAD) technique that integrates the advantages of single- and dual-stream models. By incorporating the fused single-stream features into the image and text features of the dual-stream model, we formulate new modified teacher similarity distributions and features. We then perform both distribution and feature distillation to boost the capability of the student dual-stream model, achieving high retrieval performance without increasing inference complexity. Extensive experiments demonstrate the strong performance and high efficiency of MCAD on image-text retrieval tasks. Furthermore, we implement a lightweight CLIP model on Snapdragon/Dimensity chips with only $\sim$100M running memory and $\sim$8.0ms search latency, realizing the mobile-device application of VLP models.
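The abstract mentions two distillation signals: aligning the student's image-text similarity distribution with the teacher's, and directly matching features. As a rough illustration only (the function names, temperature value, and loss weighting below are illustrative assumptions, not the paper's actual formulation), both losses can be sketched in NumPy:

```python
import numpy as np

def softmax(x, tau=1.0):
    # Temperature-scaled softmax over the last axis.
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_losses(student_img, student_txt, teacher_img, teacher_txt, tau=0.05):
    """Toy sketch of MCAD-style objectives: KL divergence between teacher and
    student image-to-text similarity distributions (distribution distillation)
    plus MSE between normalized features (feature distillation).
    All names and the exact loss forms here are illustrative assumptions."""
    def norm(x):
        # Cosine-normalize so dot products become cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    si, st = norm(student_img), norm(student_txt)
    ti, tt = norm(teacher_img), norm(teacher_txt)
    # Image-to-text similarity distributions over the batch.
    p_teacher = softmax(ti @ tt.T, tau)
    p_student = softmax(si @ st.T, tau)
    # Distribution distillation: KL(teacher || student), averaged over images.
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12))) / len(si)
    # Feature distillation: MSE between normalized student/teacher features.
    mse = np.mean((si - ti) ** 2) + np.mean((st - tt) ** 2)
    return kl, mse

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
# When teacher and student coincide, both losses vanish.
kl, mse = distillation_losses(img, txt, img, txt)
```

In the real method the teacher similarities and features are first fused with the single-stream model's outputs before distillation; this sketch omits that fusion step.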
Submission history
From: Chen Chen
[v1] Mon, 30 Oct 2023 15:38:43 GMT (8247kb,D)
[v2] Thu, 28 Mar 2024 08:47:14 GMT (9499kb,D)
[v3] Tue, 2 Apr 2024 00:12:21 GMT (9499kb,D)