TSVC: Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval
- URL: http://arxiv.org/abs/2501.10935v1
- Date: Sun, 19 Jan 2025 04:05:08 GMT
- Title: TSVC: Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval
- Authors: Shuai Lyu, Zijing Tian, Zhonghong Ou, Yifan Zhu, Xiao Zhang, Qiankun Ha, Haoran Luo, Meina Song
- Abstract summary: Cross-modal retrieval maps data across different modalities via semantic relevance. Existing approaches implicitly assume that data pairs are well-aligned and ignore the widely existing annotation noise. We introduce Tripartite learning with Semantic Variation Consistency (TSVC) for robust image-text retrieval.
- Score: 11.874979105806243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-modal retrieval maps data across different modalities via semantic relevance. Existing approaches implicitly assume that data pairs are well-aligned and ignore the widely existing annotation noise, i.e., noisy correspondence (NC), which inevitably degrades performance. Some attempts employ the co-teaching paradigm, using identically structured models to provide distinct views of the data; however, the differences between these models stem primarily from random initialization, so they grow increasingly homogeneous as training proceeds, and the additional information the paradigm provides is severely limited. To resolve this problem, we introduce Tripartite learning with Semantic Variation Consistency (TSVC) for robust image-text retrieval. We design a tripartite cooperative learning mechanism comprising a Coordinator, a Master, and an Assistant model: the Coordinator distributes data, and the Assistant supports the Master's noisy-label prediction with diverse data. Moreover, we introduce a soft label estimation method based on mutual information variation, which quantifies the noise in new samples and assigns corresponding soft labels. We also present a new loss function to enhance robustness and optimize training effectiveness. Extensive experiments on three widely used datasets demonstrate that, even as the noise ratio increases, TSVC retains significant advantages in retrieval accuracy and maintains stable training performance.
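The abstract describes the mechanism only at a high level. Below is a minimal, illustrative sketch of how a Coordinator might route pairs between the Master and the Assistant, and how a soft label could be derived from the variation between the two models' predictions; the Jensen-Shannon-style disagreement measure, the exponential weighting, and all function names are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Binary entropy of a match probability p in [0, 1]."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def soft_label(p_master, p_assistant):
    """Hypothetical soft-label estimate for an image-text pair.

    The paper derives soft labels from mutual information variation; here a
    Jensen-Shannon-style gap between the two models' match probabilities
    stands in as the variation measure: the larger the disagreement, the
    noisier the pair is assumed to be, and the further the soft label is
    pulled toward 0.5 (uncertain correspondence).
    """
    p_bar = 0.5 * (p_master + p_assistant)
    # JSD-like variation: entropy of the mixture minus the mean entropy.
    variation = entropy(p_bar) - 0.5 * (entropy(p_master) + entropy(p_assistant))
    weight = np.exp(-variation / np.log(2.0))  # 1.0 = full agreement
    return weight * p_bar + (1.0 - weight) * 0.5

def coordinator_split(n_pairs, rng):
    """Coordinator: route each pair to the Master or the Assistant so the
    two learners keep seeing different data (random routing is assumed)."""
    mask = rng.random(n_pairs) < 0.5
    return np.where(mask)[0], np.where(~mask)[0]

rng = np.random.default_rng(0)
p_m = rng.random(8)  # Master's match probabilities (toy values)
p_a = rng.random(8)  # Assistant's match probabilities (toy values)
master_idx, assistant_idx = coordinator_split(8, rng)
print("master sees", master_idx, "| assistant sees", assistant_idx)
print("soft labels:", soft_label(p_m, p_a))
```

Pulling disagreed-upon pairs toward 0.5 rather than discarding them reflects the soft-label idea in the abstract: suspected-noisy pairs contribute a weakened, not zeroed, training signal.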
Related papers
- A Language Anchor-Guided Method for Robust Noisy Domain Generalization [20.83580289888522]
We introduce Anchor Alignment and Adaptive Weighting (A3W).
A3W uses sample reweighting guided by natural language processing (NLP) anchors to extract more representative features (a toy sketch of this reweighting follows the entry).
It consistently outperforms state-of-the-art domain generalization methods.
arXiv Detail & Related papers (2025-03-21T15:20:28Z)
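A toy sketch of anchor-guided sample reweighting, under the assumption that anchors are fixed text embeddings (e.g., of class names) and that alignment is measured by cosine similarity with a softmax temperature; the paper's A3W may differ in detail.

```python
import numpy as np

def anchor_reweight(features, labels, anchors, tau=0.1):
    """Samples whose features align with their class's language anchor get
    larger training weights. Cosine similarity and the temperature tau are
    illustrative assumptions, not the paper's exact procedure."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    sims = np.sum(f * a[labels], axis=1)  # cosine(feature, own class anchor)
    w = np.exp(sims / tau)
    return w * len(w) / w.sum()           # normalize weights to mean 1

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))    # toy sample features
anchors = rng.normal(size=(3, 4))  # one language anchor per class
labels = np.array([0, 1, 2, 0, 1, 2])
print(anchor_reweight(feats, labels, anchors))
```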
- Few-Shot, No Problem: Descriptive Continual Relation Extraction [27.296604792388646]
Few-shot Continual Relation Extraction is a crucial challenge for enabling AI systems to identify and adapt to evolving relationships in real-world domains.
Traditional memory-based approaches often overfit to limited samples, failing to reinforce old knowledge.
We propose a novel retrieval-based solution that starts with a large language model generating a description of each relation (a minimal retrieval sketch follows this entry).
arXiv Detail & Related papers (2025-02-27T23:44:30Z)
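A minimal sketch of the retrieval step, assuming relations are matched by nearest description embedding; the random embeddings below are stand-ins for a real sentence encoder run over the LLM-generated descriptions.

```python
import numpy as np

def nearest_relation(query_emb, desc_embs, relation_names):
    """Classify a sentence by the relation whose generated-description
    embedding is closest in cosine similarity (assumed matching rule)."""
    q = query_emb / np.linalg.norm(query_emb)
    d = desc_embs / np.linalg.norm(desc_embs, axis=1, keepdims=True)
    return relation_names[int(np.argmax(d @ q))]

rng = np.random.default_rng(1)
descs = rng.normal(size=(3, 8))               # 3 relation-description embeddings
query = descs[2] + 0.1 * rng.normal(size=8)   # a sentence near relation 2
print(nearest_relation(query, descs, ["founded_by", "located_in", "works_for"]))
```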
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched, large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Continual Contrastive Spoken Language Understanding [33.09005399967931]
COCONUT is a class-incremental learning (CIL) method that relies on the combination of experience replay and contrastive learning.
We show that COCONUT can be combined with methods that operate on the decoder side of the model, yielding further metric improvements (a toy replay-plus-contrastive sketch follows this entry).
arXiv Detail & Related papers (2023-10-04T10:09:12Z)
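A toy sketch of the replay-plus-contrastive combination: a class-incremental step mixes the new task's batch with replayed samples and applies a generic supervised contrastive loss. The loss form, buffer handling, and all names are assumptions; COCONUT's actual losses operate on spoken-language features and may differ.

```python
import numpy as np

def contrastive_loss(z, labels, tau=0.5):
    """Generic supervised contrastive loss over L2-normalized embeddings:
    pull together samples sharing a label, push apart the rest."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sims = np.exp(z @ z.T / tau)
    np.fill_diagonal(sims, 0.0)          # exclude self-similarity
    loss = 0.0
    for i in range(len(z)):
        pos = (labels == labels[i])
        pos[i] = False                   # positives: same label, not self
        if pos.any():
            loss += -np.log(sims[i, pos].sum() / sims[i].sum())
    return loss / len(z)

rng = np.random.default_rng(0)
new_batch = rng.normal(size=(4, 8)); new_labels = np.array([2, 2, 3, 3])
replay = rng.normal(size=(4, 8));    replay_labels = np.array([0, 0, 1, 1])
z = np.vstack([new_batch, replay])               # experience replay: old + new
y = np.concatenate([new_labels, replay_labels])
print(contrastive_loss(z, y))
```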
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- FECANet: Boosting Few-Shot Semantic Segmentation with Feature-Enhanced Context-Aware Network [48.912196729711624]
Few-shot semantic segmentation is the task of learning to locate each pixel of a novel class in a query image with only a few annotated support images.
We propose a Feature-Enhanced Context-Aware Network (FECANet) to suppress the matching noise caused by inter-class local similarity.
In addition, we propose a novel correlation reconstruction module that encodes extra correspondence relations between foreground and background, as well as multi-scale context semantic features.
arXiv Detail & Related papers (2023-01-19T16:31:13Z)
- Syntactically Robust Training on Partially-Observed Data for Open Information Extraction [25.59133746149343]
Open Information Extraction models have shown promising results with sufficient supervision.
We propose a syntactically robust training framework that enables models to be trained on a syntactically abundant distribution.
arXiv Detail & Related papers (2023-01-17T12:39:13Z)
- Adversarial Dual-Student with Differentiable Spatial Warping for Semi-Supervised Semantic Segmentation [70.2166826794421]
We propose a differentiable geometric warping to conduct unsupervised data augmentation (a minimal warping sketch follows this entry).
We also propose a novel adversarial dual-student framework to improve upon the Mean-Teacher approach.
Our solution significantly improves performance, achieving state-of-the-art results on both datasets.
arXiv Detail & Related papers (2022-03-05T17:36:17Z)
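A minimal illustration of differentiable spatial warping using PyTorch's `grid_sample`: gradients flow through the warp, so a consistency loss between predictions on warped inputs stays end-to-end trainable. The small random-affine form of the warp is an assumption for illustration; the paper's geometric warping may be different.

```python
import torch
import torch.nn.functional as F

def random_affine_warp(images, strength=0.1):
    """Differentiably warp a batch of images (N, C, H, W) with a small
    random affine perturbation around the identity transform."""
    n = images.size(0)
    theta = torch.eye(2, 3).repeat(n, 1, 1)         # identity affine per image
    theta = theta + strength * torch.randn(n, 2, 3)  # random perturbation
    grid = F.affine_grid(theta, list(images.shape), align_corners=False)
    return F.grid_sample(images, grid, align_corners=False)

imgs = torch.rand(2, 3, 32, 32, requires_grad=True)
warped = random_affine_warp(imgs)
warped.sum().backward()    # gradients reach the input: the warp is differentiable
print(imgs.grad.shape)     # torch.Size([2, 3, 32, 32])
```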
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Contrastively Disentangled Sequential Variational Autoencoder [20.75922928324671]
We propose a novel sequence representation learning method, named Contrastively Disentangled Sequential Variational Autoencoder (C-DSVAE)
We use a novel evidence lower bound that maximizes the mutual information between the input and the latent factors, while penalizing the mutual information between the static and dynamic factors (written out schematically after this entry).
Our experiments show that C-DSVAE significantly outperforms the previous state-of-the-art methods on multiple metrics.
arXiv Detail & Related papers (2021-10-22T23:00:32Z)
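Schematically, the objective described above has roughly the following shape, with $z_s$ and $z_d$ the static and dynamic latent factors; the coefficients $\alpha$, $\beta$ and the exact mutual-information estimators are assumptions, not the paper's precise formulation:

$$\max_{\theta}\;\; \mathrm{ELBO}(x) \;+\; \alpha\, I(x;\, z_s) \;+\; \alpha\, I(x;\, z_d) \;-\; \beta\, I(z_s;\, z_d)$$

The added $I(x; z)$ terms encourage informative latents, while the $-\beta\, I(z_s; z_d)$ penalty drives the disentanglement between static and dynamic factors that the blurb describes.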
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model (the generic objective is written out after this entry).
The objective is to endow the trained model with robustness against adversarially manipulated input data.
The proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
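For reference, the generic distributionally robust training objective behind such methods is shown below, where $\hat{P}_n$ is the empirical training distribution, $D$ a divergence (e.g., a Wasserstein distance), $\rho$ the ambiguity radius, and $\ell$ the per-sample loss; the blurb does not specify which ambiguity set the paper instantiates:

$$\min_{\theta}\; \sup_{Q:\; D(Q,\, \hat{P}_n) \le \rho}\; \mathbb{E}_{(x,y)\sim Q}\big[\ell(\theta;\, x, y)\big]$$

Training against the worst-case distribution in the ball around $\hat{P}_n$ is what endows the model with robustness to adversarially manipulated or shifted input data.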
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.