
Title: LocCa: Visual Pretraining with Location-aware Captioners

Abstract: Image captioning has been shown to be an effective pretraining method, similar to contrastive pretraining. However, incorporating location-aware information into visual pretraining remains underexplored. In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa). LocCa uses a simple image captioning task interface to teach a model to read out rich information, i.e., bounding box coordinates and captions, conditioned on the image pixel input. Thanks to the multitask capabilities of an encoder-decoder architecture, we show that an image captioner can easily handle multiple tasks during pretraining. Our experiments demonstrate that LocCa significantly outperforms standard captioners on localization downstream tasks while maintaining comparable performance on holistic tasks.
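To make the idea of "reading out" bounding box coordinates through a captioning interface concrete, below is a minimal sketch of how multitask decoder targets could be serialized as text. This is not the authors' code: the task prefixes, the coordinate-token format, and the bin count are illustrative assumptions, chosen only to show how captions and box coordinates can share one autoregressive target sequence.

```python
# A minimal sketch (not the paper's implementation) of serializing
# location-aware pretraining targets for a captioning decoder.
# Task prefixes, <locN> tokens, and NUM_BINS are assumptions.

from typing import List, Tuple

NUM_BINS = 1000  # assumed number of discrete coordinate bins

def quantize_box(box: Tuple[float, float, float, float]) -> str:
    """Map a normalized (ymin, xmin, ymax, xmax) box in [0, 1] to coordinate tokens."""
    return " ".join(f"<loc{int(round(v * (NUM_BINS - 1)))}>" for v in box)

def make_target(task: str, caption: str,
                boxes: List[Tuple[float, float, float, float]]) -> str:
    """Build one decoder target string for a given pretraining task."""
    if task == "caption":                       # plain captioning
        return f"[caption] {caption}"
    if task == "grounded_caption":              # caption followed by its boxes
        locs = " ".join(quantize_box(b) for b in boxes)
        return f"[ground] {caption} {locs}"
    if task == "refer":                         # boxes decoded from a phrase
        locs = " ".join(quantize_box(b) for b in boxes)
        return f"[refer] {caption} -> {locs}"
    raise ValueError(f"unknown task: {task}")

# Example: one annotated image yields several training targets, which an
# encoder-decoder model can consume as a multitask mixture during pretraining.
if __name__ == "__main__":
    boxes = [(0.1, 0.2, 0.8, 0.9)]
    for task in ("caption", "grounded_caption", "refer"):
        print(make_target(task, "a dog catching a frisbee", boxes))
```

Because every task reduces to next-token prediction over such strings, a single image encoder plus text decoder can be trained on the mixture without task-specific heads, which is the multitask property the abstract highlights.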
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2403.19596 [cs.CV]
  (or arXiv:2403.19596v1 [cs.CV] for this version)

Submission history

From: Xiaohua Zhai [view email]
[v1] Thu, 28 Mar 2024 17:20:39 GMT (6541kb,D)
