LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
- URL: http://arxiv.org/abs/2602.00462v2
- Date: Mon, 09 Feb 2026 13:54:50 GMT
- Title: LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
- Authors: Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach
- Abstract summary: We introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. We evaluate this method on 10 different Vision-Language Models (VLMs). We show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans.
- Score: 40.11215282864732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing contextualized token representations for each token in that corpus. Visual token representations are then compared to these contextualized textual representations, with the top-k nearest neighbor representations providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens, in contrast, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.
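To make the pipeline concrete, here is a minimal sketch of the two ingredients the abstract describes: the shallow MLP projector that maps vision-encoder outputs into the LLM embedding space, and the LatentLens nearest-neighbor lookup against a bank of contextualized text-token representations. Everything below (function names, tensor shapes, the cosine-similarity metric, and the HuggingFace-style `output_hidden_states` interface) is our assumption for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def make_projector(d_vision: int, d_model: int) -> torch.nn.Module:
    """Shallow MLP projector from vision-encoder features to the LLM
    embedding space (the abstract notes the mapping can be this simple)."""
    return torch.nn.Sequential(
        torch.nn.Linear(d_vision, d_model),
        torch.nn.GELU(),
        torch.nn.Linear(d_model, d_model),
    )

@torch.no_grad()
def build_text_bank(llm, corpus_input_ids, layer: int):
    """Encode a text corpus once and cache the contextualized hidden
    state of every token at a given LLM layer."""
    vecs, toks = [], []
    for ids in corpus_input_ids:  # each: 1-D LongTensor of token ids
        out = llm(ids.unsqueeze(0), output_hidden_states=True)
        vecs.append(out.hidden_states[layer].squeeze(0))  # (seq_len, d_model)
        toks.extend(ids.tolist())
    return torch.cat(vecs), torch.tensor(toks)  # (n_bank, d), (n_bank,)

def latent_lens(visual_hidden, bank_vecs, bank_toks, k: int = 5):
    """Describe each visual token by its top-k nearest contextualized
    text tokens (cosine similarity is our choice of metric here)."""
    sims = F.normalize(visual_hidden, dim=-1) @ F.normalize(bank_vecs, dim=-1).T
    top = sims.topk(k, dim=-1)                 # (n_vis, k)
    return bank_toks[top.indices], top.values  # neighbor token ids + scores
```

Decoding the returned token ids with the model's tokenizer would then yield the natural-language descriptions; LogitLens, by comparison, projects a hidden state through the unembedding matrix and reads off only the top output tokens.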
Related papers
- Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding [6.612630497074871]
Large Vision-Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. We propose ReVisiT, a training-free decoding method that references vision tokens to guide text generation.
arXiv Detail & Related papers (2025-06-11T08:46:55Z)
- Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs [14.381188702947949]
Large Vision-Language Models (LVLMs) primarily align the image features of a vision encoder with Large Language Models (LLMs) to leverage their superior text generation capabilities.
This imbalance between image comprehension and language inference may result in hallucinations.
We introduce a training-free algorithm to find an equilibrium point between image comprehension and language inference.
arXiv Detail & Related papers (2024-07-31T17:46:57Z)
- Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language. This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm; a toy sketch of this grouping step follows this summary. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
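As a rough illustration of what "dynamic clustering into semantic units" can look like, the sketch below greedily merges similar patch features and pools each group into one token. The threshold-based scheme, names, and parameters are our invention for illustration, not SeTok's actual algorithm.

```python
import torch
import torch.nn.functional as F

def group_patches(feats: torch.Tensor, sim_threshold: float = 0.8) -> torch.Tensor:
    """Greedily assign each patch feature (n, d) to the first cluster whose
    centroid is similar enough, else open a new cluster. The number of
    clusters adapts to the image content rather than being fixed."""
    feats = F.normalize(feats, dim=-1)
    centroids, members = [], []
    for f in feats:
        if centroids:
            sims = torch.stack(centroids) @ f  # similarity to each centroid
            best = int(sims.argmax())
            if sims[best] >= sim_threshold:
                members[best].append(f)
                # running mean; not re-normalized, which is fine for a toy
                centroids[best] = torch.stack(members[best]).mean(0)
                continue
        centroids.append(f)
        members.append([f])
    # pool each semantic unit into a single vision token
    return torch.stack([torch.stack(m).mean(0) for m in members])
```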
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
- Auto-Encoding Morph-Tokens for Multimodal LLM [151.2618346912529]
We propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing the MLLM to generate texts; for generation, they serve as visual tokens for image reconstruction.
Experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously.
arXiv Detail & Related papers (2024-05-03T08:43:06Z)
- Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations, deepening our understanding of the roles different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously. A toy sketch of the discrete-token idea follows below.
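To illustrate what mapping an image to "a sequence of discrete tokens like a foreign language" can mean in code, here is a minimal vector-quantization step: each visual feature is snapped to its nearest codebook entry, yielding a discrete token id. The codebook, shapes, and distance metric are illustrative assumptions, not LaVIT's actual tokenizer (which also adapts the sequence length dynamically).

```python
import torch

def quantize(feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """feats: (n, d) visual features; codebook: (V, d) learned entries.
    Returns (n,) discrete token ids, one "word" per visual feature."""
    dists = torch.cdist(feats, codebook)  # (n, V) pairwise distances
    return dists.argmin(dim=-1)
```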
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We propose a learning-based vision-language pre-training approach that represents both modalities with finite discrete tokens, going beyond the patch and token embeddings used in contrastive methods such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)