Bridging Modalities via Progressive Re-alignment for Multimodal Test-Time Adaptation
- URL: http://arxiv.org/abs/2511.22862v1
- Date: Fri, 28 Nov 2025 03:33:42 GMT
- Title: Bridging Modalities via Progressive Re-alignment for Multimodal Test-Time Adaptation
- Authors: Jiacheng Li, Songhe Feng
- Abstract summary: Test-time adaptation (TTA) enables online model adaptation using only unlabeled test data. In multimodal scenarios, varying degrees of distribution shift across different modalities give rise to a complex coupling effect. We propose a novel TTA framework, termed Bridging Modalities via Progressive Re-alignment (BriMPR).
- Score: 39.02105398462778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time adaptation (TTA) enables online model adaptation using only unlabeled test data, aiming to bridge the gap between source and target distributions. However, in multimodal scenarios, varying degrees of distribution shift across different modalities give rise to a complex coupling effect of unimodal shallow feature shift and cross-modal high-level semantic misalignment, posing a major obstacle to extending existing TTA methods to the multimodal field. To address this challenge, we propose a novel multimodal test-time adaptation (MMTTA) framework, termed Bridging Modalities via Progressive Re-alignment (BriMPR). BriMPR, consisting of two progressively enhanced modules, tackles the coupling effect with a divide-and-conquer strategy. Specifically, we first decompose MMTTA into multiple unimodal feature alignment sub-problems. By leveraging the strong function approximation ability of prompt tuning, we calibrate the unimodal global feature distributions to their respective source distributions, so as to achieve the initial semantic re-alignment across modalities. Subsequently, we assign credible pseudo-labels to combinations of masked and complete modalities, and introduce inter-modal instance-wise contrastive learning to further enhance the information interaction among modalities and refine the alignment. Extensive experiments on MMTTA tasks, including both corruption-based and real-world domain shift benchmarks, demonstrate the superiority of our method. Our source code is available at [this URL](https://github.com/Luchicken/BriMPR).
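The abstract outlines two concrete stages. Below is a minimal, hypothetical PyTorch sketch of what they could look like: stage one nudges prompt parameters so that the batch statistics of each modality's global features match stored source statistics, and stage two applies an instance-wise InfoNCE loss between paired modality embeddings. All names and loss choices here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the two BriMPR-style stages (illustrative only).
import torch
import torch.nn.functional as F

def calibration_loss(feats, src_mean, src_var):
    """Stage 1: pull the test batch's unimodal feature statistics toward
    source statistics assumed to be saved before deployment; in a prompt-
    tuning setup, only the prompt parameters would receive this gradient."""
    return (F.mse_loss(feats.mean(dim=0), src_mean)
            + F.mse_loss(feats.var(dim=0), src_var))

def cross_modal_info_nce(za, zb, tau=0.07):
    """Stage 2: instance-wise contrastive loss that pulls together the two
    modality embeddings of the same test instance and pushes apart those
    of different instances."""
    za, zb = F.normalize(za, dim=1), F.normalize(zb, dim=1)
    logits = za @ zb.t() / tau                       # (B, B) similarities
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In the paper, the contrastive stage is additionally gated by credible pseudo-labels over masked and complete modality combinations; the sketch omits that filtering.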
Related papers
- A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition [17.332141776831513]
Multimodal Emotion Recognition (MER) often encounters incomplete multimodality in practical applications. We propose a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, named MCULoRA.
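The summary suggests one low-rank adapter per modality combination. A minimal, hypothetical PyTorch sketch of that idea (class and combination names are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class CombinationLoRA(nn.Module):
    """Illustrative: a frozen base projection plus one low-rank adapter per
    modality combination, chosen at run time by which modalities are present."""
    def __init__(self, dim, rank=4, combos=("a", "v", "av", "avt")):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)       # backbone stays frozen
        self.down = nn.ModuleDict({c: nn.Linear(dim, rank, bias=False) for c in combos})
        self.up = nn.ModuleDict({c: nn.Linear(rank, dim, bias=False) for c in combos})

    def forward(self, x, combo):
        # Low-rank residual specific to the observed modality combination.
        return self.base(x) + self.up[combo](self.down[combo](x))
```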
arXiv Detail & Related papers (2025-07-15T11:15:35Z)
- FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [57.577843653775]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
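The routing mechanism can be pictured as a softmax gate over experts; the sketch below is an assumption about its general shape, not the paper's code:

```python
import torch
import torch.nn as nn

class CrossModalRouter(nn.Module):
    """Illustrative gate: predict context-dependent weights over experts
    and mix their outputs, so uninformative modalities get down-weighted."""
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):                            # x: (B, dim) fused context
        w = torch.softmax(self.gate(x), dim=-1)      # (B, E) routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, dim)
        return (w.unsqueeze(-1) * outs).sum(dim=1)   # weighted expert mixture
```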
arXiv Detail & Related papers (2025-07-07T04:09:45Z)
- BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation [55.486872677160015]
We reformulate multi-modal semantic segmentation as a mask-level classification task. We propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA). Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2025-06-04T08:04:58Z)
- MAGIC++: Efficient and Resilient Modality-Agnostic Semantic Segmentation via Hierarchical Modality Selection [20.584588303521496]
We introduce the MAGIC++ framework, which comprises two key plug-and-play modules for effective multi-modal fusion and hierarchical modality selection. Our method achieves state-of-the-art performance on both real-world and synthetic benchmarks. It is also superior in the novel modality-agnostic setting, where it outperforms prior arts by a large margin.
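Hierarchical modality selection presumably scores modalities and keeps the most informative ones; a hypothetical top-k sketch (the scoring module and fusion rule are assumptions):

```python
import torch

def select_modalities(feats, scorer, k=2):
    """Illustrative: score each modality's features (feats: B x M x D),
    keep the top-k modalities per sample, and average them into one fused
    feature; `scorer` is any module mapping D -> 1."""
    scores = scorer(feats).squeeze(-1)               # (B, M) informativeness
    idx = scores.topk(k, dim=1).indices              # (B, k) chosen modalities
    picked = torch.gather(
        feats, 1, idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
    return picked.mean(dim=1)                        # (B, D) fused feature
```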
arXiv Detail & Related papers (2024-12-22T06:12:03Z)
- AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment [37.213291617683325]
Cross-modal alignment is crucial for multimodal representation fusion. We propose AlignMamba, an efficient and effective method for multimodal fusion. Experiments on complete and incomplete multimodal fusion tasks demonstrate the effectiveness and efficiency of the proposed method.
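For the global, distribution-level side of cross-modal alignment, a standard choice is Maximum Mean Discrepancy; a sketch under that assumption (not necessarily the paper's exact criterion):

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Maximum Mean Discrepancy with an RBF kernel: a differentiable measure
    of the gap between two feature distributions (x: N x D, y: M x D) that
    can be minimized to align two modalities globally."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```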
arXiv Detail & Related papers (2024-12-01T14:47:41Z)
- Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
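One plausible reading of "modality prompts" is a learnable token block per modality, prepended to that modality's tokens before a shared encoder; the sketch below is an assumption, not MAT's actual design:

```python
import torch
import torch.nn as nn

class ModalityPrompts(nn.Module):
    """Illustrative: learnable prompt tokens per modality, prepended to the
    patch tokens so a shared encoder can adapt to the incoming modality."""
    def __init__(self, dim, n_tokens=8, modalities=("rgb", "depth", "thermal")):
        super().__init__()
        self.prompts = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(n_tokens, dim)) for m in modalities})

    def forward(self, tokens, modality):             # tokens: (B, N, dim)
        p = self.prompts[modality].expand(tokens.size(0), -1, -1)
        return torch.cat([p, tokens], dim=1)         # (B, n_tokens + N, dim)
```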
arXiv Detail & Related papers (2024-05-06T11:02:02Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
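A minimal sketch of single-stream early fusion, assuming per-modality projections plus a learnable modality embedding before one shared encoder (all names and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Illustrative: project each skeleton modality to a shared width, add a
    learnable modality embedding, concatenate along the token axis, and
    encode everything in a single stream."""
    def __init__(self, dims=(150, 150, 150), width=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, width) for d in dims)
        self.mod_emb = nn.Parameter(torch.zeros(len(dims), width))
        layer = nn.TransformerEncoderLayer(width, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, streams):                      # list of (B, N_i, dims[i])
        toks = [p(s) + self.mod_emb[i]
                for i, (p, s) in enumerate(zip(self.proj, streams))]
        return self.encoder(torch.cat(toks, dim=1))  # (B, sum N_i, width)
```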
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation [104.48766162008815]
We propose and explore a new multi-modal extension of test-time adaptation for 3D semantic segmentation.
To take full advantage of multi-modality, the framework has each modality provide regularized self-supervisory signals to the other modalities.
Our regularized pseudo labels produce stable self-learning signals in numerous multi-modal test-time adaptation scenarios.
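A simplified stand-in for such cross-modal self-supervision: fuse the two branches' predictions into pseudo-labels by trusting, per point, whichever modality is more confident (the paper's actual refinement scheme is more elaborate):

```python
import torch

def fused_pseudo_labels(logits_2d, logits_3d):
    """Illustrative: per-point pseudo-labels taken from the more confident
    modality, so each branch can supervise the other during adaptation."""
    conf_2d, lab_2d = logits_2d.softmax(-1).max(-1)
    conf_3d, lab_3d = logits_3d.softmax(-1).max(-1)
    return torch.where(conf_2d >= conf_3d, lab_2d, lab_3d)
```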
arXiv Detail & Related papers (2022-04-27T02:28:12Z)
- Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generalizability of the proposed mhsf model under both pre-training+fine-tuning and fresh-training strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.