ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
- URL: http://arxiv.org/abs/2603.02767v2
- Date: Wed, 04 Mar 2026 01:56:54 GMT
- Title: ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
- Authors: Hanzpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He
- Abstract summary: ITO is a framework addressing this limitation through two synergistic mechanisms. We show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer.
- Score: 14.791563751107502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
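To make the architecture concrete, here is a minimal sketch of a dual encoder with a training-time-only fusion head, in the spirit the abstract describes. All module shapes, the pair-matching fusion objective, and the temperature are illustrative assumptions, not details from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITOSketch(nn.Module):
    """Dual encoder with a training-time-only fusion head (illustrative)."""

    def __init__(self, dim=512):
        super().__init__()
        self.image_enc = nn.Linear(2048, dim)   # stand-in for a vision tower
        self.text_enc = nn.Linear(768, dim)     # stand-in for a text tower
        # Lightweight fusion module: trained jointly, discarded at inference.
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.match_head = nn.Linear(dim, 2)     # matched vs. mismatched pair

    def forward(self, img_feat, txt_feat):
        v = F.normalize(self.image_enc(img_feat), dim=-1)
        t = F.normalize(self.text_enc(txt_feat), dim=-1)
        # Symmetric InfoNCE alignment loss over in-batch pairs.
        logits = v @ t.T / 0.07
        labels = torch.arange(v.size(0))
        loss_align = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
        # Fusion loss: classify true pairs against shuffled (mismatched) pairs.
        pos = self.fusion(torch.stack([v, t], dim=1)).mean(dim=1)
        neg = self.fusion(torch.stack([v, t.roll(1, dims=0)], dim=1)).mean(dim=1)
        pair_logits = self.match_head(torch.cat([pos, neg]))
        pair_labels = torch.cat([torch.ones(len(v)), torch.zeros(len(v))]).long()
        return loss_align + F.cross_entropy(pair_logits, pair_labels)

    @torch.no_grad()
    def encode(self, img_feat, txt_feat):
        # Inference path: the fusion module is never used here.
        return (F.normalize(self.image_enc(img_feat), dim=-1),
                F.normalize(self.text_enc(txt_feat), dim=-1))
```

Because `encode` never touches `fusion`, the inference path keeps the cost profile of a standard dual-encoder, which is the efficiency property the abstract emphasizes.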
Related papers
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
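As a schematic of what such multi-round test-time scaling can look like, here is a minimal loop; `model` and its `generate`/`verify`/`refine` methods are hypothetical placeholders, not UniT's actual API:

```python
from dataclasses import dataclass

@dataclass
class Critique:
    ok: bool            # did the output pass self-verification?
    feedback: str = ""  # what to fix on the next round

def test_time_refine(model, prompt, max_rounds=3):
    """Multi-round test-time loop: generate, verify, refine (illustrative)."""
    answer = model.generate(prompt)                      # initial single pass
    for _ in range(max_rounds):
        critique = model.verify(prompt, answer)          # model checks its own output
        if critique.ok:                                  # accept once verification passes
            return answer
        answer = model.refine(prompt, answer, critique)  # revise using the feedback
    return answer
```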
arXiv Detail & Related papers (2026-02-12T18:59:49Z)
- When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning [7.598111859541752]
We identify representation geometry as a missing control axis in multimodal learning and propose regName, a lightweight geometry-aware regularization framework. regName enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. Experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
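The two constraints can be rendered as simple losses. The forms below (a squared off-diagonal similarity penalty and a margin on per-sample cross-modal distance) are assumptions for illustration, not the paper's exact formulations:

```python
import torch
import torch.nn.functional as F

def dispersive_loss(z):
    """Intra-modal dispersion: discourage embeddings within one modality from collapsing."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.T - torch.eye(z.size(0))   # pairwise cosine similarity, diagonal removed
    return sim.pow(2).mean()

def anchoring_loss(z_img, z_txt, margin=0.5):
    """Inter-modal anchoring: bound per-sample drift without forcing exact alignment."""
    dist = (F.normalize(z_img, dim=-1) - F.normalize(z_txt, dim=-1)).norm(dim=-1)
    return F.relu(dist - margin).mean()    # penalize only drift beyond the margin

z_img, z_txt = torch.randn(32, 256), torch.randn(32, 256)
reg = dispersive_loss(z_img) + dispersive_loss(z_txt) + anchoring_loss(z_img, z_txt)
```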
arXiv Detail & Related papers (2026-01-29T13:03:50Z)
- Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z)
- UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer. It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
- CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis [0.6961946145048322]
This paper introduces an end-to-end Contrastive Learning framework with Adaptive Multi-loss and Progressive Attention Fusion (CLAMP). The framework is composed of three novel modules: a Progressive Attention Fusion network, Multi-task Contrastive Learning, and Adaptive Multi-loss Aggregation. Evaluation on standard public benchmarks demonstrates that CLAMP consistently outperforms the vast majority of existing state-of-the-art methods.
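One common way to realize adaptive multi-loss aggregation is learned log-variance weighting; whether CLAMP uses this exact scheme is an assumption, but it shows the per-loss balancing idea:

```python
import torch
import torch.nn as nn

class AdaptiveLossAggregator(nn.Module):
    """Combine several task losses with learned log-variance weights (one common scheme)."""

    def __init__(self, num_losses=3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_losses))     # one learnable weight per loss

    def forward(self, losses):
        total = torch.zeros(())
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])              # learned inverse variance
            total = total + precision * loss + self.log_vars[i]   # regularizer keeps weights finite
        return total

# Example with three dummy task losses (e.g., sentiment, contrastive, fusion).
agg = AdaptiveLossAggregator(3)
total = agg([torch.tensor(1.2), torch.tensor(0.7), torch.tensor(0.3)])
```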
arXiv Detail & Related papers (2025-07-21T11:49:57Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- Learning to Fuse: Modality-Aware Adaptive Scheduling for Robust Multimodal Foundation Models [0.0]
Modality-Aware Adaptive Fusion Scheduling (MA-AFS) learns to dynamically modulate the contribution of each modality on a per-instance basis. Our work highlights the importance of adaptive fusion and opens a promising direction toward reliable and uncertainty-aware multimodal learning.
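A minimal gate that produces per-instance fusion weights makes the idea concrete; the linear scorer and softmax gate here are illustrative assumptions, not MA-AFS's actual design:

```python
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    """Per-instance fusion weights from concatenated modality features (illustrative)."""

    def __init__(self, dim=256, num_modalities=2):
        super().__init__()
        self.scorer = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, feats):  # feats: list of (B, dim) tensors, one per modality
        w = torch.softmax(self.scorer(torch.cat(feats, dim=-1)), dim=-1)  # (B, M) weights
        stacked = torch.stack(feats, dim=1)                               # (B, M, dim)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)                     # weighted fusion

gate = ModalityGate()
fused = gate([torch.randn(8, 256), torch.randn(8, 256)])  # one weight vector per instance
```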
arXiv Detail & Related papers (2025-06-15T05:57:45Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
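Schematically, such an observer-in-the-loop denoiser might look like the sketch below; every name here (`denoiser.step`, `observer.analyze`, the feedback object) is a placeholder, not PPAD's actual interface:

```python
def guided_denoise(denoiser, observer, prompt, latent, steps=50, check_every=10):
    """Observer-in-the-loop denoising (schematic; all interfaces are placeholders)."""
    guidance = None
    for t in reversed(range(1, steps + 1)):
        latent = denoiser.step(latent, t, prompt, guidance)  # one reverse-diffusion step
        if t % check_every == 0:
            preview = denoiser.decode_preview(latent)        # cheap intermediate decode
            feedback = observer.analyze(preview, prompt)     # MLLM semantic check
            if not feedback.consistent:
                guidance = feedback.to_signal()              # steer the remaining steps
    return denoiser.decode(latent)
```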
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation.
Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process.
Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
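The mean-teacher backbone this builds on is standard: the teacher is an exponential moving average (EMA) of the student and supplies pseudo labels on unlabeled scans. A minimal sketch of that backbone follows (PMT's progressive schedule itself is omitted):

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    """Teacher weights track an exponential moving average of student weights."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1 - decay)

student = torch.nn.Conv2d(1, 2, 3, padding=1)    # stand-in segmentation model
teacher = copy.deepcopy(student)                 # teacher starts as a copy

x_unlabeled = torch.randn(4, 1, 32, 32)
with torch.no_grad():
    pseudo = teacher(x_unlabeled).argmax(dim=1)  # teacher's pseudo labels
loss = torch.nn.functional.cross_entropy(student(x_unlabeled), pseudo)
loss.backward()                                  # student learns from pseudo labels
ema_update(teacher, student)                     # teacher then tracks the student
```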
arXiv Detail & Related papers (2024-09-08T15:02:25Z)
- Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning [22.54577327204281]
Multimodal sentiment analysis aims to learn representations from different modalities to identify human emotions.
Existing works often neglect the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise.
For the first time, we propose temporal-invariant learning, which constrains distributional variations across time steps to effectively capture long-term temporal dynamics.
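One plausible reading of such a constraint is a penalty on how much per-step feature statistics drift over time; this exact form is an assumption for illustration, not the paper's formulation:

```python
import torch

def temporal_invariance_penalty(seq):
    """Penalize drift of per-step feature statistics over time. seq: (B, T, D)."""
    mu = seq.mean(dim=0)        # (T, D) per-step means over the batch
    drift = mu[1:] - mu[:-1]    # change between consecutive time steps
    return drift.pow(2).mean()  # small value => temporally stable statistics

penalty = temporal_invariance_penalty(torch.randn(16, 20, 64))
```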
arXiv Detail & Related papers (2024-08-30T03:28:40Z)