MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
- URL: http://arxiv.org/abs/2508.07833v1
- Date: Mon, 11 Aug 2025 10:36:58 GMT
- Title: MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
- Authors: Animesh Jain, Alexandros Stergiou
- Abstract summary: We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of Vision Language Models (VLMs). MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts.
- Score: 52.66401137323065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limits transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
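The abstract outlines the core recipe: optimize a synthetic image so that the frozen VLM's encoding of it aligns with a target text, while regularizers keep the result smooth and realistic. Below is a minimal sketch of that kind of text-conditioned inversion loop in PyTorch; the encoder callables (`vlm_image_features`, `vlm_text_features`), the cosine alignment loss, the total-variation term, and all weights are illustrative assumptions, not the paper's actual objective or hyperparameters.

```python
# Minimal sketch of text-conditioned model inversion with a smoothness regularizer.
# `vlm_image_features` / `vlm_text_features` are hypothetical stand-ins for a
# frozen VLM's image and text encoders returning [1, D] feature vectors.
import torch
import torch.nn.functional as F

def total_variation(img):
    # Encourages natural-image smoothness (one of the regularizer roles).
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def invert_concept(vlm_image_features, vlm_text_features, target_text,
                   steps=500, lr=0.05, tv_weight=1e-4):
    # Start from noise and optimize pixels so the VLM's image features
    # align with the features of the target text.
    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    with torch.no_grad():
        t = vlm_text_features(target_text)        # frozen target embedding
    for _ in range(steps):
        opt.zero_grad()
        v = vlm_image_features(torch.sigmoid(x))  # keep pixels in [0, 1]
        align = 1 - F.cosine_similarity(v, t, dim=-1).mean()
        loss = align + tv_weight * total_variation(x)
        loss.backward()
        opt.step()
    return torch.sigmoid(x).detach()
```

In this sketch the alignment term plays the role of matching the target text encoding, and the total-variation term stands in for the smoothness regularizer; the paper's spatial-alignment and semantic-realism regularizers, and its handling of autoregressive decoding, are not reproduced here.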
Related papers
- Vision-Centric Activation and Coordination for Multimodal Large Language Models [42.26911585599856]
Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information. We introduce VaCo, which optimizes MLLM representations through Vision-Centric Activation and Coordination.
arXiv Detail & Related papers (2025-10-16T06:38:39Z) - From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens [18.806125841573756]
VLM-Lens is designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs). It provides a unified, YAML-configurable interface that abstracts away model-specific complexities. VLM-Lens is released as an open-source project to accelerate community efforts in understanding and improving VLMs.
arXiv Detail & Related papers (2025-10-02T17:58:41Z) - Query-Kontext: A Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z) - Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models [2.984679075401059]
This paper presents the Multi-Modal Explainable Learning (MMEL) framework, designed to enhance the interpretability of vision-language models. Our approach processes features at multiple semantic levels to capture relationships between image regions at different granularities. We show that by incorporating semantic relationship information into gradient-based attribution maps, MMEL produces more focused and contextually aware visualizations.
arXiv Detail & Related papers (2025-09-17T18:18:59Z) - VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models [82.05514464090172]
Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. We introduce VisCodex, a unified framework that seamlessly merges vision and coding language models.
arXiv Detail & Related papers (2025-08-13T17:00:44Z) - The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer [68.71557348281007]
This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM). Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs.
arXiv Detail & Related papers (2025-04-14T17:50:20Z) - A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges [6.530386181196826]
Multimodal Vision Language Models (VLMs) have emerged as a transformative topic at the intersection of computer vision and natural language processing. With their rapid advancements in research and growing popularity in various applications, we provide a comprehensive survey of VLMs.
arXiv Detail & Related papers (2025-01-04T04:59:33Z) - Optimizing Vision-Language Interactions Through Decoder-Only Models [4.219163079329444]
MUDAIF is a vision-language model that seamlessly integrates visual and textual inputs. It achieves enhanced efficiency, flexibility, and cross-modal understanding. It is trained on a large-scale dataset of 45M image-text pairs.
arXiv Detail & Related papers (2024-12-14T09:04:32Z) - MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z) - Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z) - Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy that employs a multimodal AI system to oversee itself, called Reinforcement Learning from AI Feedback (RLAIF).
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)