Reliable Cross-modal Alignment via Prototype Iterative Construction
- URL: http://arxiv.org/abs/2510.11175v1
- Date: Mon, 13 Oct 2025 09:08:27 GMT
- Title: Reliable Cross-modal Alignment via Prototype Iterative Construction
- Authors: Xiang Ma, Litian Xu, Lexin Fang, Caiming Zhang, Lizhen Cui
- Abstract summary: Cross-modal alignment is an important multi-modal task, aiming to bridge the semantic gap between different modalities. Conventional methods implicitly assume embeddings contain solely semantic information, ignoring the impact of non-semantic information during alignment. We propose PICO, a novel framework for suppressing style interference during embedding interaction.
- Score: 40.09297916971621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal alignment is an important multi-modal task, aiming to bridge the semantic gap between different modalities. The most reliable foundation for achieving this objective lies in the semantic consistency between matched pairs. Conventional methods implicitly assume embeddings contain solely semantic information, ignoring the impact of non-semantic information during alignment, which inevitably leads to information bias or even loss. This non-semantic information primarily manifests as stylistic variations in the data, which we formally define as style information. An intuitive approach is to separate style from semantics, aligning only the semantic information. However, most existing methods distinguish them based on feature columns, which cannot represent the complex coupling relationship between semantic and style information. In this paper, we propose PICO, a novel framework for suppressing style interference during embedding interaction. Specifically, we quantify the probability of each feature column representing semantic information, and regard it as the weight during the embedding interaction. To ensure the reliability of the semantic probability, we propose a prototype iterative construction method. The key operation of this method is a performance feedback-based weighting function, and we have theoretically proven that the function can assign higher weight to prototypes that bring higher performance improvements. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of PICO, outperforming state-of-the-art methods by 5.2%-14.1%.
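The abstract describes two ideas: weighting each feature column by its probability of being semantic during embedding interaction, and giving higher weight to prototypes that yield larger performance improvements. The following is only a minimal sketch of those two ideas, not the PICO implementation. The function names, the weighted cosine form, the softmax form of the feedback weighting, and all numeric values are assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def weighted_cosine(x, y, w):
    """Cosine similarity in which feature column i is scaled by weight w[i].

    Assumption: w[i] plays the role of a per-column "semantic probability".
    """
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    dot = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    norm_x = math.sqrt(sum(wi * xi * xi for xi, wi in zip(x, w)))
    norm_y = math.sqrt(sum(wi * yi * yi for yi, wi in zip(y, w)))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def feedback_weights(gains, temperature=1.0):
    """Softmax over performance gains (hypothetical weighting function).

    It is monotone: a prototype with a larger gain gets a larger weight.
    """
    top = max(gains)
    exps = [math.exp((g - top) / temperature) for g in gains]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings: columns 0 and 1 agree, column 2 disagrees.
image_emb = [1.0, 2.0, 5.0]
text_emb = [1.0, 2.0, -5.0]

uniform = [1.0, 1.0, 1.0]          # treats every column as semantic
semantic_prob = [0.9, 0.9, 0.1]    # assumed per-column probabilities

plain = weighted_cosine(image_emb, text_emb, uniform)
weighted = weighted_cosine(image_emb, text_emb, semantic_prob)

# Hypothetical performance gains for three candidate prototypes.
proto_weights = feedback_weights([0.5, 2.0, 1.0])

print(plain, weighted, proto_weights)
```

In this toy example, down-weighting the disagreeing column raises the similarity of the matched pair, and the prototype with the largest gain receives the largest weight.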
Related papers
- Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach [1.5749416770494704]
It proposes an interpretable bias detection method aimed at identifying hidden social biases in model outputs. The method combines nested semantic representation with a contextual contrast mechanism. The evaluation focuses on several key metrics, such as bias detection accuracy, semantic consistency, and contextual sensitivity.
arXiv Detail & Related papers (2025-08-08T09:21:10Z) - Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation [22.591512454923883]
We argue that the knowledge bias between instances and contexts affects the capability of the prototype to sufficiently understand instance semantics.
Inspired by prototype learning theory, we propose leveraging prototype awareness to capture diverse and fine-grained feature attributes of instances.
We present a Context Prototype-Aware Learning (CPAL) strategy, which leverages semantic context to enrich instance comprehension.
arXiv Detail & Related papers (2024-03-12T13:11:58Z) - Separating common from salient patterns with Contrastive Representation Learning [2.250968907999846]
Contrastive Analysis aims at separating common factors of variation between two datasets.
Current models based on Variational Auto-Encoders have shown poor performance in learning semantically-expressive representations.
We propose to leverage the ability of Contrastive Learning to learn semantically expressive representations well adapted for Contrastive Analysis.
arXiv Detail & Related papers (2024-02-19T08:17:13Z) - Any-Way Meta Learning [27.16222034423108]
We introduce the "any-way" learning paradigm, an innovative model training approach that liberates the model from fixed cardinality constraints.
Surprisingly, this model not only matches but often outperforms traditional fixed-way models in terms of performance, convergence speed, and stability.
arXiv Detail & Related papers (2024-01-10T12:00:53Z) - Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning [82.29761875805369]
One of the ultimate goals of representation learning is to achieve compactness within a class and well-separability between classes.
We propose a novel perspective that uses pre-defined class anchors as feature centroids to unidirectionally guide feature learning.
The proposed Semantic Anchor Regularization (SAR) can be used in a plug-and-play manner in the existing models.
arXiv Detail & Related papers (2023-12-19T05:52:38Z) - Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval [139.21955930418815]
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space.
However, the predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts.
We propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from the inherent data ambiguity.
arXiv Detail & Related papers (2023-09-29T09:41:19Z) - Fixing confirmation bias in feature attribution methods via semantic match [4.733072355085082]
We argue that a structured approach is required to test whether our hypotheses on the model are confirmed by the feature attributions.
This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations.
arXiv Detail & Related papers (2023-07-03T09:50:08Z) - Learning Context-aware Classifier for Semantic Segmentation [88.88198210948426]
In this paper, contextual hints are exploited via learning a context-aware classifier.
Our method is model-agnostic and can be easily applied to generic segmentation models.
With only negligible additional parameters and +2% inference time, a decent performance gain has been achieved on both small and large models.
arXiv Detail & Related papers (2023-03-21T07:00:35Z) - FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment [93.09267863425492]
We argue that understanding both high-level semantics and internal temporal structures of actions in competitive sports videos is the key to making predictions accurate and interpretable.
We construct a new fine-grained dataset, called FineDiving, developed on diverse diving events with detailed annotations on action procedures.
arXiv Detail & Related papers (2022-04-07T17:59:32Z) - Prototypical Representation Learning for Relation Extraction [56.501332067073065]
This paper aims to learn predictive, interpretable, and robust relation representations from distantly-labeled data.
We learn prototypes for each relation from contextual information to best explore the intrinsic semantics of relations.
Results on several relation learning tasks show that our model significantly outperforms the previous state-of-the-art relational models.
arXiv Detail & Related papers (2021-03-22T08:11:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.