Line of Sight: On Linear Representations in VLLMs
- URL: http://arxiv.org/abs/2506.04706v1
- Date: Thu, 05 Jun 2025 07:30:58 GMT
- Title: Line of Sight: On Linear Representations in VLLMs
- Authors: Achyuta Rajaram, Sarah Schwettmann, Jacob Andreas, Arthur Conmy
- Abstract summary: We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. In order to increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs). We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
- Score: 44.75626175851506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LLaVA-NeXT, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that the features are causal by performing targeted edits on the model output. In order to increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
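As a reading aid, the sketches below illustrate the techniques the abstract names (linear probes on residual-stream activations, causal steering edits, and sparse autoencoders) in generic scikit-learn/PyTorch code. Function names, hyperparameters, and the activation-extraction step are assumptions for illustration, not the authors' released implementation.

```python
# A minimal sketch (not the authors' code) of two ideas in the abstract:
# (1) fit a linear probe on residual-stream activations to decode whether an
# image belongs to a given ImageNet class, and (2) reuse the probe direction
# as a steering vector to test whether the feature is causal for the output.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_linear_probe(acts: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """acts: [n_images, d_model] residual-stream activations at one layer/position;
    labels: 1 if the image belongs to the target ImageNet class, else 0."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(acts, labels)
    return probe

def steer(acts: np.ndarray, probe: LogisticRegression, alpha: float = 5.0) -> np.ndarray:
    """Add the unit-norm probe direction to the activations; writing the edited
    activations back into the forward pass checks whether the decoded feature
    actually changes the model's output."""
    direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    return acts + alpha * direction
```

A multimodal SAE can be sketched in the standard ReLU-plus-L1 form; the dictionary size, sparsity penalty, and training data used in the paper are not reproduced here.

```python
# A minimal sparse autoencoder over residual-stream activations, sketching how
# a dictionary of text/image features could be learned. Hyperparameters are
# placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # reconstruction error plus an L1 sparsity penalty on feature activations
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```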
Related papers
- RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularities. RSUniVLM performs effectively in multi-image analysis, with instances of change detection and change captioning. We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both the RS and general domains.
arXiv Detail & Related papers (2024-12-07T15:11:21Z)
- Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis [44.008094698200026]
This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks. We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning. Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning them for specific domains while preserving their generalization capabilities remains challenging.
arXiv Detail & Related papers (2024-12-04T19:01:06Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
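A rough sketch of the region-focusing idea summarized above, under assumed helper names rather than the CoS implementation: a region of interest identified in a first pass is cropped from the original image and resized on its own, so fine detail inside the box reaches the model at its native input resolution without upscaling the whole image.

```python
# Illustrative only: crop a region of interest and resize just the crop to the
# LVLM's fixed input resolution. The ROI box is assumed to come from a
# first-pass query to the model; the 336-pixel input size is a placeholder.
from PIL import Image

def crop_region_of_interest(image_path: str, box: tuple[int, int, int, int],
                            input_size: int = 336) -> Image.Image:
    """box = (left, upper, right, lower) in original-image pixel coordinates."""
    image = Image.open(image_path)
    crop = image.crop(box)
    # Only the crop is resized, so detail inside the box is preserved at the
    # model's input resolution.
    return crop.resize((input_size, input_size))
```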
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which map visual features to probability distributions over the Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
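A hedged sketch of the visual-token idea above: each visual feature is projected into the LLM's vocabulary space and softmaxed, so an image patch is represented as a probability distribution over text tokens. The linear projection here is an untrained placeholder, not the paper's learned mapping.

```python
# Illustrative "visual token" head: visual features -> distributions over the
# language model's vocabulary.
import torch
import torch.nn as nn

class VisualWordHead(nn.Module):
    def __init__(self, d_visual: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_visual, vocab_size)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: [n_patches, d_visual] -> [n_patches, vocab_size]
        return torch.softmax(self.proj(visual_feats), dim=-1)
```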
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- ImageBind-LLM: Multi-modality Instruction Tuning [70.05191504511188]
ImageBind-LLM is a multi-modality instruction-tuning method for large language models (LLMs) via ImageBind.
It can respond to audio, 3D point clouds, video, and their embedding-space arithmetic using only image-text alignment training.
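A loose illustration of the embedding-space arithmetic mentioned above, assuming embeddings that already live in a shared ImageBind-style space: modality embeddings are L2-normalized, mixed by weighted addition, and renormalized before being handed to the LLM's projection. Function names are placeholders, not the paper's API.

```python
# Illustrative embedding-space arithmetic over a shared multimodal space,
# e.g. mixing an audio embedding with a point-cloud embedding.
import torch
import torch.nn.functional as F

def combine_embeddings(embeds: list[torch.Tensor], weights: list[float]) -> torch.Tensor:
    mixed = sum(w * F.normalize(e, dim=-1) for w, e in zip(weights, embeds))
    return F.normalize(mixed, dim=-1)
```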
arXiv Detail & Related papers (2023-09-07T17:59:45Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
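A simplified sketch of the frozen-fusion idea above: the LLM and the image encoder/decoder stay frozen, and only small linear maps between their embedding spaces are trained. Dimensions and module names are illustrative assumptions, not the paper's architecture.

```python
# Illustrative trainable "glue" between frozen models: image features are
# mapped into the LLM's input embedding space, and LLM hidden states are
# mapped into a conditioning space for a frozen image decoder.
import torch
import torch.nn as nn

class FrozenFusionMaps(nn.Module):
    def __init__(self, d_image: int, d_llm: int, d_decoder: int):
        super().__init__()
        self.img_to_llm = nn.Linear(d_image, d_llm)        # encoder -> LLM inputs
        self.llm_to_decoder = nn.Linear(d_llm, d_decoder)  # LLM states -> decoder

    def encode_image(self, image_feats: torch.Tensor) -> torch.Tensor:
        return self.img_to_llm(image_feats)

    def condition_decoder(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        return self.llm_to_decoder(llm_hidden)
```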
arXiv Detail & Related papers (2023-05-26T19:22:03Z)