A quantitative analysis of semantic information in deep representations of text and images
- URL: http://arxiv.org/abs/2505.17101v3
- Date: Sat, 04 Oct 2025 07:30:20 GMT
- Title: A quantitative analysis of semantic information in deep representations of text and images
- Authors: Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio
- Abstract summary: We present a method for measuring the relative information content of the representations of semantically related data. We probe how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. We observe significant and model-dependent information asymmetries between image and text representations.
- Score: 42.597592429757746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. We find, moreover, that on these layers a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information of English text is spread across many tokens and is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within vision transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.
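To make the layer-wise analysis concrete, the sketch below illustrates one simple way to look for language-transferable ("semantic") layers: mean-pool hidden states of paired translated sentences at every layer and check how well a ridge regression maps one language's representations onto the other's. The model name, the pooling, and the regression probe are assumptions made for this example; this is not the authors' actual information-content estimator.

```python
# Illustrative sketch (not the paper's estimator): probe which layers carry
# language-transferable information by predicting one language's layer-wise
# sentence representations from the other's.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

MODEL = "meta-llama/Llama-3.1-8B"  # any causal LM that exposes hidden states
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

def pooled_states(sentences):
    """Return a [n_layers, n_sentences, hidden_dim] array of mean-pooled hidden states."""
    per_sentence = []
    with torch.no_grad():
        for s in sentences:
            out = model(**tok(s, return_tensors="pt"))
            # out.hidden_states: tuple of [1, seq_len, d] tensors, one per layer
            per_sentence.append(
                torch.stack([h.mean(dim=1).squeeze(0) for h in out.hidden_states])
            )
    return torch.stack(per_sentence, dim=1).numpy()

def layer_transfer_scores(src_sentences, tgt_sentences):
    """Cross-validated R^2, per layer, for predicting target-language states from source."""
    X, Y = pooled_states(src_sentences), pooled_states(tgt_sentences)
    return [cross_val_score(Ridge(alpha=1.0), X[l], Y[l], cv=5).mean()
            for l in range(X.shape[0])]

# scores = layer_transfer_scores(english_sentences, french_sentences)
# Layers where the score peaks are candidate "semantic" layers in this toy setup.
```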
Related papers
- Differential syntactic and semantic encoding in LLMs [49.300174325011426]
We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs). We find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled.
arXiv Detail & Related papers (2026-01-08T09:33:29Z) - Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [56.76198904599581]
Text-to-image diffusion models excel at translating language prompts into images, implicitly grounding concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over image and text tokens, enabling richer and more scalable cross-modal alignment. We introduce Seg4Diff, a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image.
arXiv Detail & Related papers (2025-09-22T17:59:54Z) - Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment [25.209622555403527]
We propose a framework called Asymmetric Visual Semantic Embedding (AVSE) to dynamically select features from various regions of images tailored to different textual inputs for similarity calculation. AVSE calculates visual semantic similarity by finding the optimal match of meta-semantic embeddings of the two modalities. Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets.
arXiv Detail & Related papers (2025-03-10T06:38:41Z) - ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs). We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z) - Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation [7.742746565876165]
The interpretability of LVLMs remains an under-explored area. In models such as LLaVA1.5, image tokens that are semantically related to the text are more likely to show information flow convergence. We propose a new image token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs (an illustrative sketch of this kind of similarity-based token filtering appears after this list).
arXiv Detail & Related papers (2024-12-13T03:13:44Z) - Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models [18.87130615326443]
Vision-language models (VLMs) serve as foundation models for image captioning and text-to-image generation. Recent studies have highlighted limitations in VLM text encoders, particularly in areas like compositionality and semantic understanding.
arXiv Detail & Related papers (2024-12-11T05:37:04Z) - The Narrow Gate: Localized Image-Text Communication in Vision-Language Models [36.33608889682152]
This study investigates how vision-language models handle image-understanding tasks. We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream. In contrast, models trained for image and text generation tend to rely on a single token that acts as a narrow gate for visual information.
arXiv Detail & Related papers (2024-12-09T16:39:40Z) - AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008 \times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z) - Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM).
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z) - Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - Linear Alignment of Vision-language Models for Image Captioning [8.921774238325566]
We propose a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. We also propose two new learning-based image-captioning metrics built on CLIP score along with our proposed alignment.
arXiv Detail & Related papers (2023-07-10T17:59:21Z) - Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
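Sketch referenced from the Simignore entry above: a minimal, purely illustrative version of similarity-based image-token reduction, in which image tokens least cosine-similar to any text token are dropped before reasoning. The function name and the keep-ratio heuristic are assumptions for illustration; the actual Simignore method may differ.

```python
# Illustrative sketch of similarity-based image-token reduction
# (in the spirit of the Simignore summary above; not the paper's exact method).
import torch
import torch.nn.functional as F

def keep_text_relevant_image_tokens(image_tokens, text_tokens, keep_ratio=0.5):
    """image_tokens: [n_img, d], text_tokens: [n_txt, d].
    Keep the image tokens most cosine-similar to any text token."""
    sim = F.normalize(image_tokens, dim=-1) @ F.normalize(text_tokens, dim=-1).T
    relevance = sim.max(dim=1).values               # best-matching text token per image token
    k = max(1, int(keep_ratio * image_tokens.shape[0]))
    keep = relevance.topk(k).indices.sort().values  # preserve original token order
    return image_tokens[keep], keep
```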