MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
- URL: http://arxiv.org/abs/2508.07833v1
- Date: Mon, 11 Aug 2025 10:36:58 GMT
- Title: MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
- Authors: Animesh Jain, Alexandros Stergiou
- Abstract summary: We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of Vision Language Models (VLMs). MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts.
- Score: 52.66401137323065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limits transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
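The abstract outlines the core recipe: optimize a synthetic image so that the frozen VLM's encoding of it aligns with a target text, while regularizers keep the result smooth and realistic. Below is a minimal sketch of that kind of text-conditioned inversion loop in PyTorch; the encoder callables (`vlm_image_features`, `vlm_text_features`), the cosine alignment loss, the total-variation term, and all weights are illustrative assumptions, not the paper's actual objective or hyperparameters.

```python
# Minimal sketch of text-conditioned model inversion with a smoothness regularizer.
# `vlm_image_features` / `vlm_text_features` are hypothetical stand-ins for a
# frozen VLM's image and text encoders returning [1, D] feature vectors.
import torch
import torch.nn.functional as F

def total_variation(img):
    # Encourages natural-image smoothness (one of the regularizer roles).
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def invert_concept(vlm_image_features, vlm_text_features, target_text,
                   steps=500, lr=0.05, tv_weight=1e-4):
    # Start from noise and optimize pixels so the VLM's image features
    # align with the features of the target text.
    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    with torch.no_grad():
        t = vlm_text_features(target_text)        # frozen target embedding
    for _ in range(steps):
        opt.zero_grad()
        v = vlm_image_features(torch.sigmoid(x))  # keep pixels in [0, 1]
        align = 1 - F.cosine_similarity(v, t, dim=-1).mean()
        loss = align + tv_weight * total_variation(x)
        loss.backward()
        opt.step()
    return torch.sigmoid(x).detach()
```

In this sketch the alignment term plays the role of matching the target text encoding, and the total-variation term stands in for the smoothness regularizer; the paper's spatial-alignment and semantic-realism regularizers, and its handling of autoregressive decoding, are not reproduced here.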
Related papers
- Vision-Centric Activation and Coordination for Multimodal Large Language Models [42.26911585599856]
Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information. We introduce VaCo, which optimizes MLLM representations through Vision-Centric Activation and Coordination.
arXiv Detail & Related papers (2025-10-16T06:38:39Z) - From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens [18.806125841573756]
VLM-Lens is designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs). It provides a unified, YAML-configurable interface that abstracts away model-specific complexities. VLM-Lens is released as an open-source project to accelerate community efforts in understanding and improving VLMs.
arXiv Detail & Related papers (2025-10-02T17:58:41Z) - Query-Kontext: A Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z) - Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models [2.984679075401059]
This paper presents the Multi-Modal Explainable Learning (MMEL) framework, designed to enhance the interpretability of vision-language models. Our approach processes features at multiple semantic levels to capture relationships between image regions at different granularities. We show that by incorporating semantic relationship information into gradient-based attribution maps, MMEL produces more focused and contextually aware visualizations.
arXiv Detail & Related papers (2025-09-17T18:18:59Z) - VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models [82.05514464090172]
Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. We introduce VisCodex, a unified framework that seamlessly merges vision and coding language models.
arXiv Detail & Related papers (2025-08-13T17:00:44Z) - The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer [68.71557348281007]
This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM). Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs.
arXiv Detail & Related papers (2025-04-14T17:50:20Z) - A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges [6.530386181196826]
Multimodal Vision Language Models (VLMs) have emerged as a transformative topic at the intersection of computer vision and natural language processing. With their rapid advancements in research and growing popularity in various applications, we provide a comprehensive survey of VLMs.
arXiv Detail & Related papers (2025-01-04T04:59:33Z) - Optimizing Vision-Language Interactions Through Decoder-Only Models [4.219163079329444]
MUDAIF is a vision-language model that seamlessly integrates visual and textual inputs. It achieves enhanced efficiency, flexibility, and cross-modal understanding. It is trained on a large-scale dataset of 45M image-text pairs.
arXiv Detail & Related papers (2024-12-14T09:04:32Z) - MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z) - Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z) - Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy that employs a multimodal AI system to oversee itself, called Reinforcement Learning from AI Feedback (RLAIF).
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)