Reasoning to Attend: Try to Understand How <SEG> Token Works
- URL: http://arxiv.org/abs/2412.17741v3
- Date: Mon, 20 Jan 2025 07:57:50 GMT
- Title: Reasoning to Attend: Try to Understand How <SEG> Token Works
- Authors: Rui Qian, Xin Yin, Dejing Dou
- Abstract summary: We show that the $\texttt{<SEG>}$ token contributes to semantic similarity within image-text pairs.
We present READ, which facilitates LMMs' resilient $\textbf{REA}$soning capability of where to atten$\textbf{D}$ under the guidance of highly activated points.
- Score: 44.33848900059659
- License:
- Abstract: Current Large Multimodal Models (LMMs) empowered visual grounding typically rely on $\texttt{<SEG>}$ token as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specified model (e.g., SAM). However, we observe that little research has looked into how it works. In this work, we first visualize the similarity maps, which are obtained by computing the semantic similarity between the $\texttt{<SEG>}$ token and the image token embeddings derived from the last hidden layer in both the LLaVA encoder and SAM decoder. Intriguingly, we have found that a striking consistency holds in terms of activation responses in the similarity map, which reveals that what $\texttt{<SEG>}$ token contributes to is the semantic similarity within image-text pairs. Specifically, $\texttt{<SEG>}$ token, a placeholder expanded in text vocabulary, extensively queries among individual tokenized image patches to match the semantics of an object from text to the paired image while the Large Language Models (LLMs) are being fine-tuned. Upon the above findings, we present READ, which facilitates LMMs' resilient $\textbf{REA}$soning capability of where to atten$\textbf{D}$ under the guidance of highly activated points borrowed from similarity maps. Remarkably, READ features an intuitive design, Similarity as Points module (SasP), which can be seamlessly applied to $\texttt{<SEG>}$-like paradigms in a plug-and-play fashion. Also, extensive experiments have been conducted on the ReasonSeg and RefCOCO(+/g) datasets. To validate whether READ suffers from catastrophic forgetting of previous skills after fine-tuning, we further assess its generation ability on an augmented FP-RefCOCO(+/g) dataset. All codes and models are publicly available at https://github.com/rui-qian/READ.
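The similarity-map idea in the abstract can be illustrated with a small sketch. It is not the READ or SasP implementation: it assumes cosine similarity as the similarity measure, a single token embedding compared against a flat array of patch embeddings, and a plain top-k selection for the "highly activated points". The function names `similarity_map` and `top_k_points` are hypothetical.

```python
import numpy as np


def similarity_map(seg_embedding, patch_embeddings, grid_hw):
    """Cosine similarity between one token embedding and every image patch.

    seg_embedding:    shape (d,), stand-in for the <SEG> token embedding
    patch_embeddings: shape (h * w, d), one row per image patch
    grid_hw:          (h, w), the patch grid used to reshape the result
    """
    seg = seg_embedding / np.linalg.norm(seg_embedding)
    norms = np.linalg.norm(patch_embeddings, axis=1, keepdims=True)
    patches = patch_embeddings / norms
    sims = patches @ seg  # one similarity score per patch
    return sims.reshape(grid_hw)


def top_k_points(sim_map, k):
    """Return (row, col) of the k most activated cells, highest first."""
    order = np.argsort(sim_map, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, sim_map.shape)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy example: a 2 x 2 patch grid with 3-dimensional embeddings.
seg = np.array([1.0, 0.0, 0.0])
patches = np.array([
    [1.0, 0.0, 0.0],   # same direction as seg
    [0.0, 1.0, 0.0],   # orthogonal to seg
    [0.5, 0.5, 0.0],   # partially aligned
    [-1.0, 0.0, 0.0],  # opposite direction
])
sim = similarity_map(seg, patches, (2, 2))
points = top_k_points(sim, 2)
```

In a pipeline like the one the abstract describes, grid coordinates such as `points` would be rescaled to pixel coordinates before serving as point prompts; that step is omitted here because the abstract does not specify it.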
Related papers
- A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens [20.37803751979975]
When feeding a text into a large language model-based embedder, the obtained text embedding will be able to be aligned with the key tokens in the input text.
We show that this phenomenon is universal and is not affected by model architecture, training strategy, and embedding method.
arXiv Detail & Related papers (2024-06-25T08:55:12Z)
- Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training [33.51524424536508]
Iterative Prompt Relabeling (IPR) is a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback.
We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations.
arXiv Detail & Related papers (2023-12-23T11:10:43Z)
- SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt [59.280491260635266]
Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map with equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
arXiv Detail & Related papers (2023-12-16T08:23:43Z)
- CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [78.0010542552784]
CoDet is a novel approach to learn object-level vision-language representations for open-vocabulary object detection.
By grouping images that mention a shared concept in their captions, objects corresponding to the shared concept shall exhibit high co-occurrence.
CoDet has superior performances and compelling scalability in open-vocabulary detection.
arXiv Detail & Related papers (2023-10-25T14:31:02Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model with $+4.9\%$ and $+4.3\%$ absolute Recall@1 improvement.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space.
We show that the resulting projections contain rich semantic information, and draw connections between them and sparse retrieval.
arXiv Detail & Related papers (2022-12-20T16:03:25Z)
- TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval [30.429340065755436]
We devise a new model-agnostic formulation for fine-grained cross-modal alignment.
Inspired by optimal transport theory, we introduce TokenFlow, an instantiation of the proposed scheme.
arXiv Detail & Related papers (2022-09-28T04:11:05Z)
- Unsupervised Semantic Segmentation by Distilling Feature Correspondences [94.73675308961944]
Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation.
We present STEGO, a novel framework that distills unsupervised features into high-quality discrete semantic labels.
STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff and Cityscapes challenges.
arXiv Detail & Related papers (2022-03-16T06:08:47Z)