IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation
- URL: http://arxiv.org/abs/2510.10969v1
- Date: Mon, 13 Oct 2025 03:19:45 GMT
- Title: IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation
- Authors: Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu, Yujun Cai, Jing Tang
- Abstract summary: IUT-Plug is a module grounded in an Image Understanding Tree (IUT). A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency.
- Score: 23.61167100602915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing vision language models (VLMs), including GPT-4 and DALL-E, often struggle to preserve logic, object identity, and style in multimodal image-text generation. This limitation significantly hinders the generalization capability of VLMs in complex image-text input-output scenarios. To address this issue, we propose IUT-Plug, a module grounded in an Image Understanding Tree (IUT), which enhances existing interleaved VLMs through explicit structured reasoning, thereby mitigating context drift in logic, entity identity, and style. The proposed framework operates in two stages. (1) A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. (2) A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency. To evaluate our approach, we construct a novel benchmark based on 3,000 real human-generated question-answer pairs over fine-tuned large models, introducing a dynamic evaluation protocol for quantifying context drift in interleaved VLMs. Experimental results demonstrate that IUT-Plug not only improves accuracy on established benchmarks but also effectively alleviates the three critical forms of context drift across diverse multimodal question answering (QA) scenarios.
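The abstract describes the Image Understanding Tree as a hierarchical symbolic structure used to check consistency across turns, but does not specify its representation. Below is a minimal, hypothetical sketch of such a structure and a drift check between two turns; all class and function names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of an Image Understanding Tree (IUT) node and the
# two-stage flow described in the abstract. Names are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IUTNode:
    """One symbolic scene element: an entity with attributes and sub-parts."""
    entity: str                                                # e.g. "dog"
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. {"color": "brown"}
    children: List["IUTNode"] = field(default_factory=list)   # hierarchical sub-structure


def extract_iut(image) -> IUTNode:
    """Stage 1 (placeholder): parse a visual scene into a hierarchical symbolic tree."""
    # A real system would call a vision parser here; this stub returns a fixed tree.
    return IUTNode("scene", children=[IUTNode("dog", {"color": "brown", "style": "watercolor"})])


def check_context_drift(previous: IUTNode, current: IUTNode) -> List[str]:
    """Stage 2 helper: flag entities whose identity or style changed between turns."""
    prev = {c.entity: c.attributes for c in previous.children}
    drift = []
    for node in current.children:
        before = prev.get(node.entity)
        if before is not None and before != node.attributes:
            drift.append(f"{node.entity}: {before} -> {node.attributes}")
    return drift


if __name__ == "__main__":
    turn1 = extract_iut(None)
    turn2 = IUTNode("scene", children=[IUTNode("dog", {"color": "white", "style": "watercolor"})])
    print(check_context_drift(turn1, turn2))  # flags the color change as identity/style drift
```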
Related papers
- Image-Text Knowledge Modeling for Unsupervised Multi-Scenario Person Re-Identification [23.27053860130735]
Unsupervised multi-scenario (UMS) person re-identification (ReID) is a new task that expands ReID across diverse scenarios. We introduce image-text knowledge modeling (ITKM), a three-stage framework that effectively exploits the representational power of vision-language models.
arXiv Detail & Related papers (2026-01-16T12:45:01Z) - Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images. To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments.
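The VC2L summary above names a snippet-level contrastive objective over rendered segments but gives no formulation. The following is an assumed sketch of such an objective, treating consecutive segments of the same document as positive pairs; the encoder here is a trivial stand-in for the paper's vision transformer.

```python
# Illustrative sketch (not the paper's code) of a snippet-level contrastive
# objective over rendered segments: consecutive segments of the same document
# are positives, all other segments in the batch are negatives.
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(seg_a: torch.Tensor, seg_b: torch.Tensor, temperature: float = 0.07):
    """seg_a[i] and seg_b[i] are embeddings of consecutive segments from document i."""
    a = F.normalize(seg_a, dim=-1)
    b = F.normalize(seg_b, dim=-1)
    logits = a @ b.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))          # the matching consecutive segment is the positive
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Usage with a stand-in encoder: every input (text or image) is first rendered
# to pixels, so a single vision encoder produces both embeddings.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
pixels_a = torch.randn(8, 3, 32, 32)   # rendered segment i
pixels_b = torch.randn(8, 3, 32, 32)   # the segment that follows it in the same document
loss = snippet_contrastive_loss(encoder(pixels_a), encoder(pixels_b))
```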
arXiv Detail & Related papers (2025-10-21T14:59:29Z) - Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding [11.244257545057508]
Prototype-Aware Multimodal Learning (PAML) is an innovative framework that addresses imperfect alignment between visual and linguistic modalities, insufficient cross-modal feature fusion, and ineffective utilization of semantic prototype information. Our framework shows competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes.
arXiv Detail & Related papers (2025-09-08T02:27:10Z) - Visual Semantic Description Generation with MLLMs for Image-Text Matching [7.246705430021142]
We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) to generate visual semantic descriptions (VSD). Our approach combines: (1) instance-level alignment by fusing visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) prototype-level alignment through VSD clustering to ensure category-level consistency.
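The two alignment levels named in the snippet above can be illustrated with a toy sketch: fuse an image embedding with the embedding of its MLLM-generated description, then cluster description embeddings into category prototypes. All names and the fusion/clustering choices are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of instance-level fusion and prototype-level clustering
# with visual semantic descriptions (VSD).
import numpy as np
from sklearn.cluster import KMeans

def fuse_instance(image_emb: np.ndarray, vsd_emb: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Instance-level alignment: a simple convex combination of the two modalities."""
    fused = alpha * image_emb + (1.0 - alpha) * vsd_emb
    return fused / np.linalg.norm(fused)

def build_prototypes(vsd_embs: np.ndarray, n_categories: int) -> np.ndarray:
    """Prototype-level alignment: cluster VSD embeddings into category centroids."""
    return KMeans(n_clusters=n_categories, n_init=10).fit(vsd_embs).cluster_centers_

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
fused = fuse_instance(rng.normal(size=256), rng.normal(size=256))
prototypes = build_prototypes(rng.normal(size=(100, 256)), n_categories=5)
```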
arXiv Detail & Related papers (2025-07-11T13:38:01Z) - QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder. We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
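One architecture-preserving way to integrate query embeddings into a vision encoder is to prepend them as extra tokens to the patch sequence. The sketch below shows that idea only; it is an assumption for illustration, with a tiny transformer standing in for the pre-trained ViT, and does not reproduce QID's actual design.

```python
# Illustrative sketch: inject query embeddings as extra input tokens so the
# vision encoder attends to the query without architectural changes.
import torch
import torch.nn as nn

class QueryInformedViT(nn.Module):
    def __init__(self, vit_dim: int = 768, num_query_tokens: int = 8, vocab: int = 30522):
        super().__init__()
        self.query_embed = nn.Embedding(vocab, vit_dim)      # maps query token ids to ViT width
        self.encoder = nn.TransformerEncoder(                # stand-in for the pre-trained ViT blocks
            nn.TransformerEncoderLayer(vit_dim, nhead=12, batch_first=True), num_layers=2)
        self.num_query_tokens = num_query_tokens

    def forward(self, patch_tokens: torch.Tensor, query_ids: torch.Tensor) -> torch.Tensor:
        q = self.query_embed(query_ids[:, : self.num_query_tokens])  # (B, Q, D)
        tokens = torch.cat([q, patch_tokens], dim=1)                 # prepend query tokens to patches
        return self.encoder(tokens)

model = QueryInformedViT()
out = model(torch.randn(2, 196, 768), torch.randint(0, 30522, (2, 8)))
```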
arXiv Detail & Related papers (2025-04-03T18:47:16Z) - Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection [18.125287697902813]
Current Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data. We present a novel framework that unlocks LVLMs' potential capabilities for deepfake detection.
arXiv Detail & Related papers (2025-03-19T03:20:03Z) - IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification [60.38841251693781]
We propose a novel framework for robust multi-modal object re-identification (ReID). Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text. Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2025-03-13T13:00:31Z) - OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses visual tokens at different levels of granularity.
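The snippet above mentions compressing visual tokens at different granularities without detail. A minimal sketch of one Matryoshka-style realization follows, using simple average pooling; the function name, granularities, and pooling choice are assumptions, not the paper's method.

```python
# Hedged sketch of Matryoshka-style token compression: the same visual token
# sequence is pooled to several granularities so a retriever can trade off
# cost against fidelity.
import torch
import torch.nn.functional as F

def matryoshka_compress(visual_tokens: torch.Tensor, granularities=(1, 4, 16, 64)):
    """visual_tokens: (batch, num_tokens, dim). Returns one pooled tensor per granularity."""
    x = visual_tokens.transpose(1, 2)                       # (batch, dim, num_tokens) for 1-D pooling
    return {g: F.adaptive_avg_pool1d(x, g).transpose(1, 2)  # (batch, g, dim)
            for g in granularities}

compressed = matryoshka_compress(torch.randn(2, 256, 1024))
print({g: tuple(t.shape) for g, t in compressed.items()})
```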
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
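The BTMAE summary describes masking and reconstructing token-level features in both modalities. The sketch below is only an assumed illustration of that bidirectional masking idea, with a small shared transformer and an MSE reconstruction target; it is not the paper's architecture.

```python
# Minimal, assumed sketch of bidirectional token masking: random tokens are
# dropped from both the image and language feature sequences, and a shared
# transformer must reconstruct them from the other modality's context.
import torch
import torch.nn as nn

class BidirectionalTokenMasker(nn.Module):
    def __init__(self, dim: int = 256, mask_ratio: float = 0.3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.mask_ratio = mask_ratio
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

    def mask(self, tokens: torch.Tensor) -> torch.Tensor:
        keep = torch.rand(tokens.shape[:2], device=tokens.device) > self.mask_ratio
        return torch.where(keep.unsqueeze(-1), tokens, self.mask_token)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        masked = torch.cat([self.mask(img_tokens), self.mask(txt_tokens)], dim=1)
        recon = self.encoder(masked)                        # cross-modal context fills in the gaps
        target = torch.cat([img_tokens, txt_tokens], dim=1)
        return nn.functional.mse_loss(recon, target)        # token-level reconstruction loss

loss = BidirectionalTokenMasker()(torch.randn(2, 49, 256), torch.randn(2, 20, 256))
```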
arXiv Detail & Related papers (2023-11-29T07:33:38Z) - Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways of specifying the classifier: via language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
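The OVOD entry above names three classifier options (language descriptions, image exemplars, or both). The toy sketch below illustrates one plausible realization, building per-class embeddings and scoring region proposals by cosine similarity; the functions and the averaging scheme are assumptions for illustration, not the paper's method.

```python
# Hedged sketch of multi-modal classifiers for open-vocabulary detection.
import torch
import torch.nn.functional as F

def build_classifier(text_emb=None, exemplar_emb=None):
    """Each argument: (num_classes, dim) or None. Returns normalized class weights."""
    parts = [e for e in (text_emb, exemplar_emb) if e is not None]
    weights = torch.stack(parts).mean(dim=0)        # combine modalities when both are given
    return F.normalize(weights, dim=-1)

def classify_regions(region_feats, class_weights):
    """region_feats: (num_regions, dim). Returns the best class index per region."""
    scores = F.normalize(region_feats, dim=-1) @ class_weights.t()
    return scores.argmax(dim=-1)

text_emb, exemplar_emb = torch.randn(10, 512), torch.randn(10, 512)
labels = classify_regions(torch.randn(5, 512), build_classifier(text_emb, exemplar_emb))
```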