Pixel Aligned Language Models
- URL: http://arxiv.org/abs/2312.09237v1
- Date: Thu, 14 Dec 2023 18:57:58 GMT
- Title: Pixel Aligned Language Models
- Authors: Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun,
Xiaolong Wang, Cordelia Schmid
- Abstract summary: We develop a vision-language model that can take locations as either inputs or outputs.
When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.
Our model is pre-trained on the Localized Narrative dataset, which contains pixel-word-aligned captioning from human attention.
- Score: 94.32841818609914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have achieved great success in recent years,
and so have their variants in vision. Existing vision-language models can describe images
in natural languages, answer visual-related questions, or perform complex
reasoning about the image. However, it is yet unclear how localization tasks,
such as word grounding or referring localization, can be performed using large
language models. In this work, we aim to develop a vision-language model that
can take locations, for example, a set of points or boxes, as either inputs or
outputs. When taking locations as inputs, the model performs
location-conditioned captioning, which generates captions for the indicated
object or region. When generating locations as outputs, our model regresses
pixel coordinates for each output word generated by the language model, and
thus performs dense word grounding. Our model is pre-trained on the Localized
Narrative dataset, which contains pixel-word-aligned captioning from human
attention. We show our model can be applied to various location-aware
vision-language tasks, including referring localization, location-conditioned
captioning, and dense object captioning, achieving state-of-the-art performance
on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM .
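To make the described interface concrete, here is a minimal PyTorch-style sketch of the core idea: a decoder whose per-token hidden states feed both a vocabulary head and a two-coordinate regression head, so every generated word comes with a pixel location. The module layout, feature dimensions, and the small MLP point head are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (assumed, not the paper's code): a language-model decoder whose
# per-token hidden states feed both a vocabulary head and a 2-D point-regression head,
# so every generated word is paired with an (x, y) pixel location.
import torch
import torch.nn as nn

class PixelAlignedDecoderSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.vocab_head = nn.Linear(d_model, vocab_size)   # next-word prediction
        self.point_head = nn.Sequential(                   # per-word (x, y) in [0, 1]
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 2), nn.Sigmoid(),
        )

    def forward(self, token_ids, image_features):
        # token_ids: (B, T) caption prefix; image_features: (B, N, d_model) visual tokens
        h = self.decoder(self.token_embed(token_ids), image_features)
        return self.vocab_head(h), self.point_head(h)

# Toy usage: one image with 16 visual tokens and a 5-token caption prefix.
model = PixelAlignedDecoderSketch()
logits, points = model(torch.randint(0, 32000, (1, 5)), torch.randn(1, 16, 512))
print(logits.shape, points.shape)  # (1, 5, 32000) and (1, 5, 2)
```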
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
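A rough sketch of the retrieval step described above, assuming a precomputed memory of image embeddings paired with captions; the cosine-similarity lookup and the prompt format are illustrative choices, not the paper's implementation.

```python
# Illustrative sketch (assumptions, not the paper's code): retrieve the k captions
# whose image embeddings are most similar to the query image and hand them to a
# caption generator as extra context.
import numpy as np

def knn_retrieve_captions(query_embedding, memory_embeddings, memory_captions, k=3):
    """Return the k captions whose image embeddings are closest in cosine similarity."""
    q = query_embedding / np.linalg.norm(query_embedding)
    m = memory_embeddings / np.linalg.norm(memory_embeddings, axis=1, keepdims=True)
    scores = m @ q
    top = np.argsort(-scores)[:k]
    return [memory_captions[i] for i in top]

# Toy memory of four embedded images with captions.
memory_emb = np.random.randn(4, 8)
memory_cap = ["a dog on grass", "a red car", "two cats sleeping", "a dog catching a ball"]
context = knn_retrieve_captions(np.random.randn(8), memory_emb, memory_cap, k=2)
prompt = "Similar captions: " + "; ".join(context) + ". Describe the new image:"
print(prompt)
```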
arXiv Detail & Related papers (2024-05-21T18:02:07Z) - Grounding Everything: Emerging Localization Properties in
Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
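A toy rendering of the self-self attention idea for zero-shot localization, scoring each patch token against a text embedding; the raw-feature attention and cosine scoring here are simplifying assumptions rather than the GEM module itself.

```python
# Toy sketch (assumption-laden illustration, not the GEM code): attend patch features
# to themselves, then score each patch against a text embedding to get a heatmap.
import torch
import torch.nn.functional as F

def self_self_attention(x):
    """x: (num_patches, dim). Attention where queries and keys are the same features."""
    attn = F.softmax(x @ x.t() / x.shape[-1] ** 0.5, dim=-1)
    return attn @ x  # each patch aggregates similar patches

patches = torch.randn(196, 64)   # e.g. 14x14 patch tokens from a ViT
text = torch.randn(64)           # embedding of the query phrase
grouped = self_self_attention(patches)
heat = F.cosine_similarity(grouped, text.unsqueeze(0), dim=-1)  # per-patch score
print(int(heat.argmax()))        # index of the most responsive patch
```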
arXiv Detail & Related papers (2023-12-01T19:06:12Z) - Images in Language Space: Exploring the Suitability of Large Language
Models for Vision & Language Tasks [17.97052348690598]
Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms.
However, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models.
We make visual information accessible to the language model using separate verbalisation models.
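A minimal sketch of this verbalisation setup, with placeholder functions standing in for the captioner and the text-only language model (both are assumptions, not the paper's components).

```python
# Minimal sketch (hypothetical components): a separate "verbaliser" turns the image
# into text, and only that text reaches the language-only model.
def verbalise_image(image_path):
    # Placeholder for an off-the-shelf captioner / object tagger (assumption).
    return "a brown dog jumping over a wooden fence"

def answer_with_text_only_llm(prompt):
    # Placeholder for any text-only language model call (assumption).
    return f"[LLM answer to: {prompt!r}]"

description = verbalise_image("photo.jpg")
question = "What is the animal doing?"
print(answer_with_text_only_llm(f"Image description: {description}\nQuestion: {question}"))
```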
arXiv Detail & Related papers (2023-05-23T07:50:36Z) - Towards Multimodal Vision-Language Models Generating Non-Generic Text [2.102846336724103]
Vision-language models can assess visual context in an image and generate descriptive text.
Recent work has used optical character recognition to supplement visual information with text extracted from an image.
In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models.
arXiv Detail & Related papers (2022-07-09T01:56:35Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Language Models with Image Descriptors are Strong Few-Shot
Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
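A rough sketch of that pipeline, with a placeholder frame captioner and a hand-written in-context example; none of it reflects VidIL's actual prompts or models.

```python
# Rough sketch of the pipeline described above (placeholders throughout):
# verbalise each sampled frame, then prompt a language model with a few
# in-context examples followed by the new video's frame captions.
def caption_frame(frame_id):
    return f"frame {frame_id}: a person chopping vegetables"   # stand-in captioner

def build_prompt(frame_ids, in_context_examples, task="Summarize the video:"):
    frames = "\n".join(caption_frame(f) for f in frame_ids)
    return "\n\n".join(in_context_examples) + f"\n\n{frames}\n{task}"

examples = [
    "frame 0: a dog runs\nframe 1: a dog catches a ball\n"
    "Summarize the video: A dog plays fetch."
]
print(build_prompt([0, 8, 16], examples))
```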
arXiv Detail & Related papers (2022-05-22T05:18:27Z) - CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity aspects.
arXiv Detail & Related papers (2022-04-27T14:40:31Z) - LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z) - Attention-Based Keyword Localisation in Speech using Visual Grounding [32.170748231414365]
We investigate whether visually grounded speech models can also do keyword localisation.
We show that attention provides a large gain in performance over previous visually grounded models.
As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions.
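As a toy illustration (not the paper's model), a keyword can be localised by reading off where per-frame attention peaks, assuming the speech model exposes such weights.

```python
# Toy illustration (assumption, not the paper's architecture): localise a keyword at
# the speech frame where the model's attention for that keyword is highest.
import numpy as np

def localise_keyword(attention_weights, frame_duration_s=0.01):
    """attention_weights: (num_frames,) attention over speech frames for one keyword."""
    peak = int(np.argmax(attention_weights))
    return peak * frame_duration_s  # estimated time (seconds) of the keyword

attn = np.exp(-0.5 * ((np.arange(300) - 180) / 10.0) ** 2)  # synthetic peak near frame 180
print(f"keyword localised at ~{localise_keyword(attn):.2f}s")
```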
arXiv Detail & Related papers (2021-06-16T15:29:11Z) - Cross-modal Language Generation using Pivot Stabilization for Web-scale
Language Coverage [23.71195344840051]
Cross-modal language generation tasks such as image captioning are directly hurt by the trend of data-hungry models combined with the lack of non-English annotations.
We describe an approach called Pivot-Language Generation Stabilization (PLuGS), which leverages directly at training time both existing English annotations and their machine-translated versions.
We show that PLuGS models outperform other candidate solutions in evaluations performed over 5 different target languages.
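A schematic of how such paired training examples might be assembled; the translation call and the example format are placeholders, not the PLuGS pipeline.

```python
# Schematic sketch (placeholders, not the PLuGS implementation): pair each image with
# both its English caption and a machine-translated caption, so training sees the
# English "pivot" alongside the target-language text.
def machine_translate(text, target_lang):
    return f"<{target_lang}> {text}"   # stand-in for a real MT system (assumption)

def build_training_examples(image_captions, target_lang="de"):
    examples = []
    for image_id, english_caption in image_captions:
        examples.append({
            "image": image_id,
            "pivot_en": english_caption,
            "target": machine_translate(english_caption, target_lang),
        })
    return examples

data = build_training_examples([("img_001", "a boy riding a bicycle")])
print(data[0])
```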
arXiv Detail & Related papers (2020-05-01T06:58:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.