Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization

Cui, Chenhao; Liang, Xinnian; Wu, Shuangzhi; Li, Zhoujun

(Submitted on 24 Aug 2022 (v1), last revised 10 May 2023 (this version, v3))

Abstract: Most current multi-modal summarization methods follow a cascaded approach: an off-the-shelf object detector first extracts visual features, which are then fused with language representations to generate the summary with an encoder-decoder model. This cascaded approach cannot capture the semantic alignments between images and paragraphs, which are crucial to a precise summary. In this paper, we propose ViL-Sum to jointly model paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. The core of ViL-Sum is a joint multi-modal encoder with two well-designed tasks, image reordering and image selection. The joint multi-modal encoder captures the interactions between modalities: the reordering task guides the model to learn paragraph-level semantic alignment, and the selection task guides the model to select summary-related images for the final summary. Experimental results show that our proposed ViL-Sum significantly outperforms current state-of-the-art methods. In further analysis, we find that the two well-designed tasks and the joint multi-modal encoder can effectively guide the model to learn reasonable paragraph-image and summary-image relations.
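
The abstract names an image reordering task but does not say how its training examples are built. As a rough illustration only, one plausible formulation is to shuffle the images of a document and ask the model to predict each image's original position. The sketch below shows just that data-construction step; the function names `make_reordering_example` and `restore_order`, and the formulation itself, are assumptions for illustration and are not taken from the paper.

```python
import random


def make_reordering_example(images, seed=None):
    """Shuffle images and return (shuffled_images, target_positions).

    ASSUMPTION: this is one possible way to frame a reordering task,
    not the authors' documented procedure.
    target_positions[i] is the original index of the image at slot i.
    """
    rng = random.Random(seed)
    order = list(range(len(images)))
    rng.shuffle(order)
    shuffled = [images[i] for i in order]
    return shuffled, order


def restore_order(shuffled, predicted_positions):
    """Place each shuffled image back at its predicted original index."""
    restored = [None] * len(shuffled)
    for slot, original_index in enumerate(predicted_positions):
        restored[original_index] = shuffled[slot]
    return restored


# Hypothetical image identifiers, listed in the order they appear in a document.
images = ["img_a", "img_b", "img_c", "img_d"]
shuffled, targets = make_reordering_example(images, seed=0)

# A model that predicts the targets perfectly recovers the original order.
assert restore_order(shuffled, targets) == images
```

Under this assumed framing, the targets would act as per-image classification labels over positions, so a model can only do well if it relates each image to the paragraph it belongs with.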

Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2208.11303 [cs.CL]
	(or arXiv:2208.11303v3 [cs.CL] for this version)

Submission history

From: Xinnian Liang
[v1] Wed, 24 Aug 2022 05:18:23 GMT (26838kb,D)
[v2] Wed, 5 Apr 2023 09:01:21 GMT (10430kb,D)
[v3] Wed, 10 May 2023 15:54:12 GMT (10430kb,D)
