The Pyramid of Captions

Chen, Delong; Cahyawijaya, Samuel; Ishii, Etsuko; Chan, Ho Shu; Bang, Yejin; Fung, Pascale

(Submitted on 1 May 2024)

Abstract: We introduce a formal information-theoretic framework for image captioning by regarding it as a representation learning task. Our framework defines three key objectives: task sufficiency, minimal redundancy, and human interpretability. Building upon this foundation, we propose a novel Pyramid of Captions (PoCa) method, which constructs caption pyramids by generating localized captions for zoomed-in image patches and integrating them with global caption information using large language models. This approach leverages the intuition that closely examining local patches can reduce the risk of errors and address inaccuracies in global captions, either by correcting hallucinations or by adding missing details. Based on our theoretical framework, we formalize this intuition and provide a formal proof of PoCa's effectiveness under certain assumptions. Empirical tests with various image captioning models and large language models show that PoCa consistently yields more informative and semantically aligned captions while maintaining brevity and interpretability.
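
The local-to-global merging described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the quadrant split, the recursion depth, and the names `quadrant_boxes`, `pyramid_caption`, `caption_fn`, and `merge_fn` are all assumptions made for illustration. `caption_fn` stands in for any image captioning model applied to a cropped region, and `merge_fn` stands in for a large language model that fuses one global caption with several local ones.

```python
from typing import Callable, List, Tuple

# A region is given as (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def quadrant_boxes(box: Box) -> List[Box]:
    """Split a region into four non-overlapping quadrants.

    The 2x2 layout is an assumption; the abstract only says "zoomed-in
    image patches" and does not fix a patch layout.
    """
    left, top, right, bottom = box
    mid_x = (left + right) // 2
    mid_y = (top + bottom) // 2
    return [
        (left, top, mid_x, mid_y),      # top-left
        (mid_x, top, right, mid_y),     # top-right
        (left, mid_y, mid_x, bottom),   # bottom-left
        (mid_x, mid_y, right, bottom),  # bottom-right
    ]


def pyramid_caption(
    box: Box,
    caption_fn: Callable[[Box], str],
    merge_fn: Callable[[str, List[str]], str],
    depth: int,
) -> str:
    """Caption a region, then refine it with captions of its sub-regions.

    caption_fn: hypothetical stand-in for a captioner run on the crop `box`.
    merge_fn:   hypothetical stand-in for an LLM that merges the global
                caption with the list of local captions.
    depth:      number of pyramid levels below this region (assumed knob).
    """
    global_caption = caption_fn(box)
    if depth == 0:
        return global_caption
    local_captions = [
        pyramid_caption(patch, caption_fn, merge_fn, depth - 1)
        for patch in quadrant_boxes(box)
    ]
    return merge_fn(global_caption, local_captions)


# Toy stand-ins so the sketch runs without any model.
def toy_caption(box: Box) -> str:
    return f"region{box}"


def toy_merge(global_caption: str, local_captions: List[str]) -> str:
    return global_caption + " | " + "; ".join(local_captions)


if __name__ == "__main__":
    print(pyramid_caption((0, 0, 4, 4), toy_caption, toy_merge, depth=1))
```

In practice `caption_fn` would crop the image to `box` and run a captioning model on the crop, and `merge_fn` would prompt a language model with the global and local captions; both are left abstract here because the abstract does not specify them.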

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.00485 [cs.CV]
	(or arXiv:2405.00485v1 [cs.CV] for this version)

Submission history

From: Delong Chen
[v1] Wed, 1 May 2024 12:49:57 GMT (7095kb,D)
