BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency
- URL: http://arxiv.org/abs/2303.12419v2
- Date: Thu, 8 Jun 2023 09:36:40 GMT
- Title: BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency
- Authors: Shuo Yang, Zhaopan Xu, Kai Wang, Yang You, Hongxun Yao, Tongliang Liu,
Min Xu
- Abstract summary: BiCro aims to estimate soft labels for noisy data pairs to reflect their true correspondence degree.
Experiments on three popular cross-modal matching datasets demonstrate that BiCro significantly improves the noise-robustness of various matching models.
- Score: 66.8685113725007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As one of the most fundamental techniques in multimodal learning, cross-modal
matching aims to project various sensory modalities into a shared feature
space. To achieve this, massive and correctly aligned data pairs are required
for model training. However, unlike unimodal datasets, multimodal datasets are
far harder to collect and annotate precisely. As an alternative,
co-occurring data pairs (e.g., image-text pairs) collected from the Internet
have been widely exploited in the area. Unfortunately, the cheaply collected
dataset unavoidably contains many mismatched data pairs, which have been proven
to be harmful to the model's performance. To address this, we propose a general
framework called BiCro (Bidirectional Cross-modal similarity consistency),
which can be easily integrated into existing cross-modal matching models and
improve their robustness against noisy data. Specifically, BiCro aims to
estimate soft labels for noisy data pairs to reflect their true correspondence
degree. The basic idea of BiCro is the observation that, taking image-text
matching as an example, similar images should have similar textual descriptions
and vice versa. The consistency of these two similarities can then be recast as
estimated soft labels to train the matching model. The
experiments on three popular cross-modal matching datasets demonstrate that our
method significantly improves the noise-robustness of various matching models,
and surpasses the state-of-the-art by a clear margin.
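The core idea admits a compact sketch. The following minimal example, assuming precomputed feature vectors and a small set of trusted anchor pairs (the function names, anchor selection, and exact consistency measure are illustrative assumptions, not the paper's code), shows how bi-directional cross-modal similarity consistency can be recast as a soft label:

```python
import numpy as np

def bicro_soft_label(img, txt, anchor_imgs, anchor_txts, eps=1e-8):
    """Estimate a soft correspondence label for one (image, text) pair.

    Bi-directional similarity consistency: if an image is close to a clean
    anchor image, its text should be comparably close to that anchor's
    text, and vice versa. Agreement between the two similarity profiles
    is recast as the soft label.
    """
    def cos_to_anchors(v, mat):
        v = v / (np.linalg.norm(v) + eps)
        mat = mat / (np.linalg.norm(mat, axis=1, keepdims=True) + eps)
        return np.clip(mat @ v, 0.0, None)  # (num_anchors,) nonnegative cosines

    sim_img = cos_to_anchors(img, anchor_imgs)  # image vs. anchor images
    sim_txt = cos_to_anchors(txt, anchor_txts)  # text vs. anchor texts

    # Take the smaller of the two directional ratios so the score is
    # symmetric across modalities and bounded above by 1.
    ratio = np.minimum(sim_txt / (sim_img + eps), sim_img / (sim_txt + eps))
    return float(np.clip(ratio.mean(), 0.0, 1.0))
```

A matching model can then be trained against these soft labels instead of hard 0/1 correspondence labels, so mismatched pairs are down-weighted rather than memorized.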
Related papers
- Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching [10.709744162565274]
We propose a novel method called DIAS to bridge the modality gap from two aspects.
The method achieves 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO benchmarks.
arXiv Detail & Related papers (2024-10-22T09:37:29Z)
- A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels [22.2715520667186]
Cross-modal retrieval (CMR) aims to establish interaction between different modalities.
This work proposes UOT-RCL, a Unified framework based on Optimal Transport (OT) for Robust Cross-modal Retrieval.
Experiments on three widely-used cross-modal retrieval datasets demonstrate that our UOT-RCL surpasses the state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-20T10:34:40Z)
- Dynamic Weighted Combiner for Mixed-Modal Image Retrieval [8.683144453481328]
Mixed-Modal Image Retrieval (MMIR), a flexible search paradigm, has attracted wide attention.
Previous approaches achieve limited performance due to two critical factors.
We propose a Dynamic Weighted Combiner (DWC) to tackle the above challenges.
arXiv Detail & Related papers (2023-12-11T07:36:45Z)
- Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID [56.573905143954015]
We propose a novel bilateral cluster matching-based learning framework to reduce the modality gap by matching cross-modality clusters.
Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at the cluster level.
Experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-05-22T03:27:46Z)
- Learnable Pillar-based Re-ranking for Image-Text Retrieval [119.9979224297237]
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities.
Re-ranking, a popular post-processing practice, has shown the value of capturing neighbor relations in single-modality retrieval tasks.
We propose a novel learnable pillar-based re-ranking paradigm for image-text retrieval.
arXiv Detail & Related papers (2023-04-25T04:33:27Z)
- Noisy Correspondence Learning with Meta Similarity Correction [22.90696057856008]
Multimodal learning relies on correct correspondence among multimedia data.
Most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs.
We propose a Meta Similarity Correction Network (MSCN) to provide reliable similarity scores.
arXiv Detail & Related papers (2023-04-13T05:20:45Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval [7.459223771397159]
Cross-modal data (e.g., audio-visual) have different distributions and representations that cannot be directly compared.
To bridge the gap between audiovisual modalities, we learn a common subspace for them by utilizing the intrinsic correlation in the natural synchronization of audio-visual data with the aid of annotated labels.
We propose a new AV-CMR model that optimizes semantic features by directly predicting labels and then measuring the intrinsic correlation between audio-visual data with a complete cross-triplet loss; a generic triplet-loss sketch follows this entry.
arXiv Detail & Related papers (2022-11-07T10:37:14Z)
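The entry above names a complete cross-triplet loss in label space. As a rough illustration only, here is a generic hard-mining cross-modal triplet loss over assumed audio and visual embeddings with class labels; the paper's exact triplet construction differs:

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(audio_emb, visual_emb, labels, margin=0.2):
    """Generic cross-modal triplet loss with hard mining (illustrative).

    For each audio anchor, a visual embedding sharing its label serves as
    the positive and one with a different label as the negative; the same
    holds symmetrically for visual anchors.
    """
    def directed(anchors, others):
        sim = F.normalize(anchors, dim=1) @ F.normalize(others, dim=1).T
        same = labels.unsqueeze(1) == labels.unsqueeze(0)    # (N, N) label match
        pos = sim.masked_fill(~same, 2.0).min(dim=1).values  # hardest positive
        neg = sim.masked_fill(same, -2.0).max(dim=1).values  # hardest negative
        return F.relu(margin - pos + neg).mean()

    # Triplets are formed in both directions across the two modalities.
    return directed(audio_emb, visual_emb) + directed(visual_emb, audio_emb)
```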
- Multi-View Correlation Consistency for Semi-Supervised Semantic Segmentation [59.34619548026885]
Semi-supervised semantic segmentation needs rich and robust supervision on unlabeled data.
We propose a view-coherent data augmentation strategy that guarantees pixel-pixel correspondence between different views.
In a series of semi-supervised settings on two datasets, we report competitive accuracy compared with the state-of-the-art methods.
arXiv Detail & Related papers (2022-08-17T17:59:11Z)
- Universal Weighting Metric Learning for Cross-Modal Matching [79.32133554506122]
Cross-modal matching has been a prominent research topic in both the vision and language areas.
We propose a simple and interpretable universal weighting framework for cross-modal matching; an illustrative sketch follows this entry.
arXiv Detail & Related papers (2020-10-07T13:16:45Z)
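In this context, "universal weighting" means scoring each pair with a weight function of its similarity. Below is a minimal polynomial-weighting sketch; the weight functions, coefficients, and names are illustrative assumptions rather than the paper's exact loss:

```python
import torch

def polynomial_weighted_loss(sim, pos_mask, pos_coef=(1.0, -1.0), neg_coef=(0.0, 1.0)):
    """Illustrative polynomial weighting for cross-modal matching.

    `sim` is an (N, N) image-text similarity matrix and `pos_mask` marks
    matched pairs. Positives are penalized by a polynomial that decreases
    with similarity (hard positives count more); negatives by one that
    increases with similarity (hard negatives count more).
    """
    def poly(coef, s):
        return sum(c * s**k for k, c in enumerate(coef))

    pos_term = poly(pos_coef, sim[pos_mask]).clamp(min=0).sum()   # low-similarity positives
    neg_term = poly(neg_coef, sim[~pos_mask]).clamp(min=0).sum()  # high-similarity negatives
    return (pos_term + neg_term) / sim.numel()
```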
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.