Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers
- URL: http://arxiv.org/abs/2412.00142v2
- Date: Mon, 13 Jan 2025 23:45:26 GMT
- Title: Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers
- Authors: Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig
- Abstract summary: Generative Large Multimodal Models (LMMs) excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering.
We propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks.
- Score: 79.45405711339322
- Abstract: Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs), a finetuning-free method that leverages sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only a handful of few-shot examples, SAVs achieve state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also suggest that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.
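To make the pipeline concrete, below is a minimal sketch of how sparse attention-head activations could serve as a few-shot classifier. It assumes per-head features have already been extracted from the LMM for each example; the head-scoring criterion (nearest-centroid accuracy on the support set) and the majority-vote decision rule are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of Sparse Attention Vector (SAV)-style classification.
# `support` holds per-head activations for the few-shot examples,
# with shape (n_examples, n_heads, d).
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity: (n, d) x (m, d) -> (n, m)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def class_centroids(feats, labels, n_classes):
    """Mean activation per class for one head: (n_classes, d)."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])

def select_sparse_heads(support, labels, n_classes, keep_frac=0.01):
    """Score each head by nearest-centroid accuracy on the support set
    and keep roughly the top 1% of heads."""
    n_heads = support.shape[1]
    scores = np.empty(n_heads)
    for h in range(n_heads):
        cents = class_centroids(support[:, h], labels, n_classes)
        preds = cosine(support[:, h], cents).argmax(axis=1)
        scores[h] = (preds == labels).mean()
    k = max(1, int(keep_frac * n_heads))
    return np.argsort(scores)[-k:]

def classify(query, support, labels, n_classes, heads):
    """Each selected head votes for its nearest class centroid;
    `query` has shape (n_heads, d)."""
    votes = np.zeros(n_classes)
    for h in heads:
        cents = class_centroids(support[:, h], labels, n_classes)
        votes[cosine(query[None, h], cents).argmax()] += 1
    return int(votes.argmax())
```

For a hypothetical 32-layer, 32-head model, `n_heads` is 1024, so `keep_frac=0.01` keeps about 10 heads, consistent with the "fewer than 1% of the heads" sparsity described in the abstract.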
Related papers
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks.
By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance.
Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
- Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies
This study evaluates the effectiveness of Vision Language Models (VLMs) in representing and utilizing multimodal content for fact-checking.
We show that while multimodality can enhance performance, fusing separate embeddings from the text and image encoders yields better results than using the VLM's own embeddings (a minimal probing sketch follows this entry).
arXiv Detail & Related papers (2024-12-06T16:13:19Z)
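The fusion-versus-VLM-embedding comparison above can be mimicked with a toy probing experiment. Everything below is a random stand-in (both probes will score near chance on random data); in the paper's setting the features would come from real text, image, and VLM encoders.

```python
# Toy probing-classifier comparison: a linear probe on fused text+image
# embeddings versus a probe on a single VLM's embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_txt, d_img, d_vlm = 400, 512, 512, 1024
y = rng.integers(0, 2, size=n)              # binary fact-check labels
txt = rng.normal(size=(n, d_txt))           # stand-in text-encoder features
img = rng.normal(size=(n, d_img))           # stand-in image-encoder features
vlm = rng.normal(size=(n, d_vlm))           # stand-in joint VLM features

def probe_acc(x, y, split=300):
    """Fit a linear probe on the first `split` rows, score on the rest."""
    clf = LogisticRegression(max_iter=1000).fit(x[:split], y[:split])
    return clf.score(x[split:], y[split:])

fused = np.concatenate([txt, img], axis=1)  # early fusion of modalities
print("fused probe:", probe_acc(fused, y))
print("VLM probe:  ", probe_acc(vlm, y))
```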
- Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
We introduce CCMD-8M, which overcomes the data barriers to unifying vision-centric and vision-language tasks.
We also present Griffon-G, a general large multimodal model that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm.
arXiv Detail & Related papers (2024-10-21T16:30:29Z)
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers.
Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored.
This paper introduces RAVEN, a retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning (a generic retrieval-augmentation sketch follows this entry).
arXiv Detail & Related papers (2024-06-27T13:08:35Z)
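As a rough illustration of retrieval augmentation for a VLM, the sketch below retrieves the most similar entries from a small captioned-image memory and prepends them to the prompt. The hash-based embedding and the memory contents are hypothetical stand-ins; RAVEN's actual retriever and fine-tuning recipe are not reproduced here.

```python
# Generic retrieval-augmentation sketch: fetch similar captioned images
# from a memory and prepend them as context before querying the VLM.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in embedding; a real system would use a
    text/image encoder shared between the memory and the query."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

# Tiny retrieval memory of (caption, source) pairs -- illustrative only.
memory = [
    ("a tabby cat sleeping on a sofa", "img_0412"),
    ("a red double-decker bus in traffic", "img_0977"),
    ("a bowl of ramen with a soft egg", "img_1303"),
]
mem_vecs = np.stack([embed(caption) for caption, _ in memory])

def retrieve(query: str, k: int = 2):
    """Return the k memory entries most similar to the query."""
    sims = mem_vecs @ embed(query)
    return [memory[i] for i in np.argsort(sims)[-k:][::-1]]

def build_prompt(query: str) -> str:
    """Prepend retrieved context to the question before calling the VLM."""
    ctx = "\n".join(f"Context ({src}): {cap}" for cap, src in retrieve(query))
    return f"{ctx}\nQuestion: {query}"

print(build_prompt("what animal is on the sofa?"))
```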
- Multi-modal Auto-regressive Modeling via Visual Words
We propose the concept of visual words, which map visual features to probability distributions over the Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information (a toy projection sketch follows this entry).
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
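A toy version of the visual-word idea above: project a visual feature into the language model's hidden space and reuse the LM head to obtain a probability distribution over the text vocabulary. The dimensions and the single linear projection are illustrative assumptions, not the paper's architecture.

```python
# Map a visual feature to a distribution over the LM's vocabulary.
import torch
import torch.nn.functional as F

d_vis, d_model, vocab_size = 768, 1024, 32000
proj = torch.nn.Linear(d_vis, d_model)          # visual feature -> LM hidden space
lm_head = torch.nn.Linear(d_model, vocab_size)  # stand-in for the LMM's output head

visual_feature = torch.randn(1, d_vis)          # one image patch/region feature
logits = lm_head(proj(visual_feature))
visual_word = F.softmax(logits, dim=-1)         # distribution over the vocabulary
probs, token_ids = visual_word.topk(5)          # the feature's top "visual words"
```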
- Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
We propose a novel LMM architecture named Lumen, a large multimodal model with versatile vision-centric capability enhancement.
Lumen first promotes fine-grained vision-language concept alignment.
Task-specific decoding is then carried out by flexibly routing the shared representation to lightweight task decoders (a minimal routing sketch follows this entry).
arXiv Detail & Related papers (2024-03-12T04:13:45Z)
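The routing described above can be sketched as a shared backbone whose output is dispatched to lightweight per-task decoders. The module sizes, task set, and output shapes are illustrative assumptions, not Lumen's actual architecture.

```python
# Shared task-agnostic stage followed by routing to small task decoders.
import torch
import torch.nn as nn

class SharedRepThenRoute(nn.Module):
    """Shared representation flexibly routed to lightweight decoders."""
    def __init__(self, d_in: int = 768, d_model: int = 512):
        super().__init__()
        self.backbone = nn.Linear(d_in, d_model)   # stand-in for the LMM stage
        self.decoders = nn.ModuleDict({
            "detection": nn.Linear(d_model, 4),    # per-token box offsets
            "segmentation": nn.Linear(d_model, 1), # per-token mask logits
        })

    def forward(self, feats: torch.Tensor, task: str) -> torch.Tensor:
        shared = self.backbone(feats)              # task-agnostic representation
        return self.decoders[task](shared)         # route to the chosen decoder

model = SharedRepThenRoute()
tokens = torch.randn(2, 16, 768)                   # batch of token features
boxes = model(tokens, task="detection")            # shape (2, 16, 4)
masks = model(tokens, task="segmentation")         # shape (2, 16, 1)
```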
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose FAME-ViL, a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks.
Our FAME-ViL saves 61.5% of parameters relative to alternatives while significantly outperforming conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.