ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension
- URL: http://arxiv.org/abs/2406.11327v2
- Date: Thu, 23 Jan 2025 14:50:47 GMT
- Title: ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension
- Authors: Tianren Ma, Lingxi Xie, Yunjie Tian, Boyu Yang, Qixiang Ye
- Abstract summary: We propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives: groups of visual tokens. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
- Score: 71.03445074045092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning vision and language concepts at a finer level remains an essential topic of multimodal large language models (MLLMs), particularly for tasks such as referring and grounding. Existing methods, such as proxy encoding and geometry encoding, incorporate additional syntax to encode spatial information, imposing extra burdens when communicating between language and vision modules. In this study, we propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives: groups of visual tokens that collaboratively represent higher-level semantics. A hybrid perception mechanism is also explored to perceive and understand scenes from both discrete and continuous spaces. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. By leveraging a joint vision-language vocabulary, ClawMachine further integrates referring and grounding in an auto-regressive manner, demonstrating great potential with scaled-up pre-training data. Experiments show that ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency. It also exhibits the potential to integrate multi-source information for complex visual reasoning, which is beyond the capability of many MLLMs. Our code is available at github.com/martian422/ClawMachine.
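A minimal sketch of the token-collective idea described in the abstract (illustrative only, not the released code; the vocabulary sizes, offsets, and patch indices below are invented): an entity is referred to or grounded simply by emitting the group of visual tokens that cover it, drawn from a joint vision-language vocabulary, with no box coordinates or special location syntax.

```python
# Toy sketch, not the authors' implementation: an entity is "notated" by the
# group of visual tokens that cover it, taken from a joint vision-language
# vocabulary. Vocabulary layout, token ids, and the region->patch mapping
# are invented here for illustration.

TEXT_VOCAB_SIZE = 32000          # hypothetical text vocabulary size
NUM_VISUAL_CODES = 8192          # hypothetical visual codebook size
VISUAL_OFFSET = TEXT_VOCAB_SIZE  # visual codes appended after the text vocabulary

def visual_token_ids(patch_indices):
    """Map patch codebook indices to ids in the joint vocabulary."""
    return [VISUAL_OFFSET + p for p in patch_indices]

def referring_prompt(image_tokens, entity_patches):
    """Referring: the entity is denoted by its own visual tokens in the prompt."""
    return (["<image>"] + visual_token_ids(image_tokens)
            + ["Describe", "the", "object"] + visual_token_ids(entity_patches) + ["."])

def grounding_target(entity_patches):
    """Grounding: the model auto-regressively emits the entity's token collective."""
    return visual_token_ids(entity_patches)

if __name__ == "__main__":
    image_tokens = list(range(16))   # 16 quantized patches of a tiny image
    dog_patches = [5, 6, 9, 10]      # patches covering the referred entity
    print(referring_prompt(image_tokens, dog_patches))
    print(grounding_target(dog_patches))
```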
Related papers
- Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension [21.500920290909843]
We propose a new pretraining paradigm for Large Language Models (LLMs) to enhance their visual comprehension capabilities.
Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens.
We present a new foundation model called Croc, which achieves new state-of-the-art performance on massive vision-language benchmarks.
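To make the matching step concrete, here is a hedged sketch (not Croc's implementation; the shapes, cosine-similarity cost, and replacement rule are assumptions) that pairs prompt-pool tokens with visual tokens using SciPy's linear-sum-assignment solver and swaps in the matched prompts:

```python
# Hedged sketch of the idea above, not Croc's code: match learnable prompt
# tokens to visual tokens via the assignment (Hungarian-style) problem and
# replace the matched visual tokens with their assigned prompt tokens.
import numpy as np
from scipy.optimize import linear_sum_assignment

def replace_with_prompts(visual_tokens, prompt_pool):
    """visual_tokens: (N, D), prompt_pool: (M, D) with M <= N."""
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = prompt_pool / np.linalg.norm(prompt_pool, axis=1, keepdims=True)
    cost = -p @ v.T                                   # (M, N) negative cosine similarity
    prompt_idx, visual_idx = linear_sum_assignment(cost)
    out = visual_tokens.copy()
    out[visual_idx] = prompt_pool[prompt_idx]         # swap in matched prompt tokens
    return out, visual_idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(196, 64))   # e.g. 14x14 patch embeddings
    prompts = rng.normal(size=(32, 64))   # learnable prompt token pool
    mixed, replaced = replace_with_prompts(visual, prompts)
    print(mixed.shape, sorted(replaced.tolist())[:5])
```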
arXiv Detail & Related papers (2024-10-18T09:44:25Z)
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Towards Interpreting Visual Information Processing in Vision-Language Models [24.51408101801313]
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images.
We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM.
arXiv Detail & Related papers (2024-10-09T17:55:02Z)
- Visual Prompting in Multimodal Large Language Models: A Survey [95.75225825537528]
Multimodal large language models (MLLMs) equip pre-trained large-language models (LLMs) with visual capabilities.
Visual prompting has emerged to support more fine-grained and free-form visual instructions.
This paper focuses on visual prompting, prompt generation, compositional reasoning, and prompt learning.
arXiv Detail & Related papers (2024-09-05T08:47:34Z)
- ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual referring capability into Multimodal Large Language Models (MLLMs).
We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them.
We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map.
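A toy, hedged illustration of test-time optimization in this spirit (not the paper's released method; the single-query attention, energy definition, and region mask below are simplified assumptions):

```python
# Toy sketch: optimize a learnable perturbation of the visual tokens so that
# attention from a text prompt token concentrates on the referred region.
import torch

torch.manual_seed(0)
D, N = 64, 196                                   # feature dim, number of visual tokens
text_q = torch.randn(D)                          # query vector of a text prompt token
visual = torch.randn(N, D)                       # frozen visual tokens
region = torch.zeros(N); region[90:110] = 1.0    # mask over the referred patches

delta = torch.zeros(N, D, requires_grad=True)    # learnable visual perturbation
opt = torch.optim.Adam([delta], lr=0.05)

for _ in range(50):
    attn = torch.softmax((visual + delta) @ text_q / D ** 0.5, dim=0)
    energy = -(attn * region).sum()              # push attention mass into the region
    opt.zero_grad()
    energy.backward()
    opt.step()

print(f"attention inside region: {(attn * region).sum().item():.3f}")
```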
arXiv Detail & Related papers (2024-07-31T11:40:29Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
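A minimal sketch of such a tokenizer (not LaVIT's implementation; the nearest-neighbour quantization is standard, while the norm-based keep rule is an invented stand-in for the paper's token selector):

```python
# Sketch: quantize patch features to a discrete codebook and keep a variable
# number of patches per image, giving a dynamic-length token sequence.
import numpy as np

def tokenize(patches, codebook, keep_threshold=3.0):
    """patches: (N, D), codebook: (K, D) -> variable-length list of code ids."""
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    codes = d2.argmin(axis=1)                                # nearest codebook entry per patch
    keep = np.linalg.norm(patches, axis=1) > keep_threshold  # invented keep rule
    return codes[keep].tolist()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(1024, 32))
    plain_image = rng.normal(scale=0.5, size=(196, 32))   # weak features -> few tokens
    busy_image = rng.normal(scale=2.0, size=(196, 32))    # strong features -> many tokens
    print(len(tokenize(plain_image, codebook)), len(tokenize(busy_image, codebook)))
```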
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Spoken Language Understanding for Conversational AI: Recent Advances and Future Direction [5.829344935864271]
This tutorial will discuss how the joint task is set up and introduce Spoken Language Understanding/Natural Language Understanding (SLU/NLU) with Deep Learning techniques.
We will describe how the machine uses the latest NLP and Deep Learning techniques to address the joint task.
arXiv Detail & Related papers (2022-12-21T02:47:52Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
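A toy sketch of the token-to-image retrieval step (not the authors' code; the random vectors stand in for real contextual token and image encoders):

```python
# Sketch: each token in a sentence is contextually mapped to the id of its
# most related image (its "voken") by nearest-neighbour search.
import numpy as np

def vokenize(token_embeddings, image_embeddings):
    """token_embeddings: (T, D), image_embeddings: (V, D) -> voken ids of shape (T,)."""
    t = token_embeddings / np.linalg.norm(token_embeddings, axis=1, keepdims=True)
    im = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    return (t @ im.T).argmax(axis=1)               # id of the most related image

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = ["a", "cat", "sits", "on", "the", "mat"]
    token_emb = rng.normal(size=(len(tokens), 128))   # stand-in contextual token features
    image_emb = rng.normal(size=(1000, 128))          # stand-in candidate image features
    print(list(zip(tokens, vokenize(token_emb, image_emb).tolist())))
```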
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs.
ICMLM (image-conditioned masked language modeling) consists of predicting masked words in captions by relying on visual cues.
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
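A hedged sketch of such a proxy task (the tiny architecture, mean pooling, and vocabulary are invented for illustration, not the paper's model): predict a masked caption word from the remaining words plus image features.

```python
# Sketch: masked-word prediction conditioned on visual features.
import torch
import torch.nn as nn

class TinyICMLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, image_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(image_dim, dim)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, caption_ids, image_feat):
        # caption_ids already contain a [MASK] id at the masked position
        context = self.embed(caption_ids).mean(dim=1)   # crude caption pooling
        visual = self.img_proj(image_feat)              # visual cue
        return self.classifier(context + visual)        # logits for the masked word

if __name__ == "__main__":
    model = TinyICMLM()
    captions = torch.randint(0, 1000, (2, 12))          # two masked captions
    images = torch.randn(2, 128)                        # pooled image features
    logits = model(captions, images)
    loss = nn.functional.cross_entropy(logits, torch.randint(0, 1000, (2,)))
    loss.backward()
    print(logits.shape, float(loss))
```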
arXiv Detail & Related papers (2020-08-04T08:04:16Z)