OTCR: Optimal Transmission, Compression and Representation for Multimodal Information Extraction
- URL: http://arxiv.org/abs/2511.14766v1
- Date: Wed, 17 Sep 2025 07:39:46 GMT
- Title: OTCR: Optimal Transmission, Compression and Representation for Multimodal Information Extraction
- Authors: Yang Li, Yajiao Wang, Wenhao Hu, Zhixiong Zhang, Mengting Zhang
- Abstract summary: Multimodal Information Extraction (MIE) requires fusing text and visual cues from visually rich documents. This work offers an interpretable, information-theoretic paradigm for controllable multimodal fusion in document AI.
- Score: 4.245267787339966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Information Extraction (MIE) requires fusing text and visual cues from visually rich documents. While recent methods have advanced multimodal representation learning, most implicitly assume modality equivalence or treat modalities in a largely uniform manner, still relying on generic fusion paradigms. This often results in indiscriminate incorporation of multimodal signals and insufficient control over task-irrelevant redundancy, which may in turn limit generalization. We revisit MIE from a task-centric view: text should dominate, vision should selectively support. We present OTCR, a two-stage framework. First, Cross-modal Optimal Transport (OT) yields sparse, probabilistic alignments between text tokens and visual patches, with a context-aware gate controlling visual injection. Second, a Variational Information Bottleneck (VIB) compresses fused features, filtering task-irrelevant noise to produce compact, task-adaptive representations. On FUNSD, OTCR achieves 91.95% SER and 91.13% RE, while on XFUND (ZH), it reaches 91.09% SER and 94.20% RE, demonstrating competitive performance across datasets. Feature-level analyses further confirm reduced modality redundancy and strengthened task signals. This work offers an interpretable, information-theoretic paradigm for controllable multimodal fusion in document AI.
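The abstract describes a two-stage pipeline: entropic optimal transport produces sparse, probabilistic alignments between text tokens and visual patches, a context-aware gate decides how much aligned visual evidence to inject, and a variational information bottleneck then compresses the fused features. The sketch below is a minimal PyTorch illustration of those two stages as described in the abstract; the module names, Sinkhorn solver settings, gate design, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the two OTCR stages named in the abstract. All module names,
# dimensions, and hyperparameters are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def sinkhorn_alignment(text_feats, vis_feats, eps=0.1, iters=50):
    """Entropic-OT transport plan between text tokens and visual patches.

    text_feats: (T, d), vis_feats: (P, d). Returns a (T, P) plan whose rows give
    sparse, probabilistic alignments of each token to the visual patches.
    """
    cost = 1.0 - F.normalize(text_feats, dim=-1) @ F.normalize(vis_feats, dim=-1).T
    K = torch.exp(-cost / eps)  # Gibbs kernel
    T, P = cost.shape
    a = torch.full((T,), 1.0 / T, device=cost.device)  # uniform token marginal
    b = torch.full((P,), 1.0 / P, device=cost.device)  # uniform patch marginal
    u = torch.ones_like(a)
    for _ in range(iters):  # Sinkhorn-Knopp iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan, shape (T, P)


class GatedOTFusion(nn.Module):
    """Stage 1: inject OT-aligned visual evidence under a context-aware gate."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_feats, vis_feats):
        plan = sinkhorn_alignment(text_feats, vis_feats)
        aligned = plan @ vis_feats  # per-token visual summary
        aligned = aligned / plan.sum(dim=1, keepdim=True).clamp_min(1e-8)
        g = self.gate(torch.cat([text_feats, aligned], dim=-1))
        return text_feats + g * aligned  # text dominates; vision selectively supports


class VIBHead(nn.Module):
    """Stage 2: variational information bottleneck over the fused features."""

    def __init__(self, dim, z_dim):
        super().__init__()
        self.mu = nn.Linear(dim, z_dim)
        self.logvar = nn.Linear(dim, z_dim)

    def forward(self, fused):
        mu, logvar = self.mu(fused), self.logvar(fused)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return z, kl  # add beta * kl to the downstream SER/RE task loss
```

In this reading, the OT plan keeps visual injection sparse and interpretable, the sigmoid gate controls how much of it reaches each token, and the KL term of the VIB penalizes task-irrelevant information in the fused representation.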
Related papers
- VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection [24.88767599022225]
Few-Shot Anomaly Detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. This study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD.
arXiv Detail & Related papers (2026-01-23T00:30:24Z) - Dual-branch Prompting for Multimodal Machine Translation [9.903997553625253]
We propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model. Experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
arXiv Detail & Related papers (2025-07-23T15:22:51Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Diffusion Augmented Retrieval: A Training-Free Approach to Interactive Text-to-Image Retrieval [7.439049772394586]
Diffusion Augmented Retrieval (DAR) is a framework that generates multiple intermediate representations via dialogue refinements and DMs. DAR achieves results on par with fine-tuned I-TIR models, yet without incurring their tuning overhead.
arXiv Detail & Related papers (2025-01-26T03:29:18Z) - MINIMA: Modality Invariant Image Matching [52.505282811925454]
We present MINIMA, a unified image matching framework for multiple cross-modal cases. We scale up the modalities from cheap but rich RGB-only matching data, by means of generative models. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability.
arXiv Detail & Related papers (2024-12-27T02:39:50Z) - Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples. We introduce a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. We propose a simple yet effective Test-time Adaptive Cross-modal (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs [45.41083125321069]
Multimodal machine translation (MMT) systems exhibit decreased sensitivity to visual information when text inputs are complete.
A novel approach is proposed to generate parallel Visual Question-Answering (VQA) style pairs from the source text.
An MMT-VQA multitask learning framework is introduced to incorporate explicit probing signals from the dataset into the MMT training process.
arXiv Detail & Related papers (2023-10-26T04:13:49Z) - Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose an ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning [37.067605349559]
We propose a novel Progressive Fusion Transformer called ProFormer.
It integrates single-modality information into the multimodal representation for robust RGBT tracking.
ProFormer sets a new state-of-the-art performance on RGBT210, RGBT234, LasHeR, and VTUAV datasets.
arXiv Detail & Related papers (2023-03-26T16:55:58Z) - Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z) - Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with a small performance drop and almost the same number of parameters.
arXiv Detail & Related papers (2021-10-06T18:28:07Z)