Cross-modal Active Complementary Learning with Self-refining Correspondence
- URL: http://arxiv.org/abs/2310.17468v2
- Date: Mon, 8 Jan 2024 02:20:49 GMT
- Title: Cross-modal Active Complementary Learning with Self-refining Correspondence
- Authors: Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, Peng Hu
- Abstract summary: We propose a Cross-modal Robust Complementary Learning framework (CRCL) to improve the robustness of existing methods.
ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision.
SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences.
- Score: 54.61307946222386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text matching, which is fundamental to understanding the latent
correspondence across visual and textual modalities, has recently attracted
increasing attention from academia and industry. However, most existing
methods implicitly assume the training pairs are well-aligned while ignoring
the ubiquitous annotation noise, a.k.a. noisy correspondence (NC), thereby
inevitably leading to a performance drop. Although some methods attempt to
address such noise, they still face two challenging problems: excessive
memorizing/overfitting and unreliable correction for NC, especially under high
noise. To address the two problems, we propose a generalized Cross-modal Robust
Complementary Learning framework (CRCL), which benefits from a novel Active
Complementary Loss (ACL) and an efficient Self-refining Correspondence
Correction (SCC) to improve the robustness of existing methods. Specifically,
ACL exploits active and complementary learning losses to reduce the risk of
providing erroneous supervision, leading to theoretically and experimentally
demonstrated robustness against NC. SCC utilizes multiple self-refining
processes with momentum correction to enlarge the receptive field for
correcting correspondences, thereby alleviating error accumulation and
achieving accurate and stable corrections. We carry out extensive experiments
on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify
the superior robustness of our CRCL against synthetic and real-world noisy
correspondences.
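To make the complementary-learning idea behind ACL concrete, here is a minimal, hypothetical PyTorch sketch of a complementary-style matching loss: rather than pulling a (possibly mislabeled) annotated pair together, it penalizes the model for matching a randomly drawn non-paired candidate, a weaker but less error-prone supervision signal. The function name, the in-batch (B, B) similarity-matrix setup, and the temperature `tau` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def complementary_matching_loss(sim, tau=0.05):
    # sim: (B, B) image-text similarity matrix; diagonal entries are the
    # annotated (possibly noisy) pairs. Hypothetical sketch, not the
    # authors' exact ACL.
    B = sim.size(0)
    rows = torch.arange(B, device=sim.device)
    # Draw one complementary (off-diagonal) column per row: "this text
    # is NOT the match for this image."
    offset = torch.randint(1, B, (B,), device=sim.device)
    comp = (rows + offset) % B
    p = F.softmax(sim / tau, dim=1)   # row-wise matching probabilities
    p_comp = p[rows, comp]            # probability of the complementary pair
    # Push the complementary probability toward zero instead of pushing
    # the (possibly wrong) annotated pair toward one.
    return -torch.log(1.0 - p_comp + 1e-8).mean()
```

Because a complementary label ("this pair does not match") is almost always correct even when the annotated pair is wrong, this style of loss limits how much damage any single mislabeled pair can do.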
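The momentum-correction idea behind SCC can likewise be illustrated with a small, self-contained sketch. The class below, including its name, `num_pairs`, and the momentum value, is a hypothetical illustration rather than the authors' procedure: it keeps a per-pair soft correspondence label and refines it as an exponential moving average of the model's current match confidence, so no single noisy prediction can flip a label and errors do not accumulate.

```python
import torch

class MomentumCorrector:
    """Hypothetical momentum-based refinement of soft correspondence labels."""

    def __init__(self, num_pairs, momentum=0.9):
        # Start by fully trusting the annotations (label 1.0 = matched).
        self.labels = torch.ones(num_pairs)
        self.m = momentum

    @torch.no_grad()
    def update(self, indices, confidence):
        # indices: (N,) dataset indices of the pairs in the current batch
        # confidence: (N,) model's current probability that each pair matches
        indices = indices.cpu()
        confidence = confidence.detach().cpu()
        self.labels[indices] = (
            self.m * self.labels[indices] + (1.0 - self.m) * confidence
        )
        return self.labels[indices]
```

The refined soft labels can then down-weight suspected mismatches in the matching loss; running several such refinement passes, as the abstract describes, widens the evidence each correction is based on.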
Related papers
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Confidence-Aware Document OCR Error Detection [1.003485566379789]
We analyze the correlation between confidence scores and error rates across different OCR systems.
We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings.
arXiv Detail & Related papers (2024-09-06T08:35:28Z)
- Disentangled Noisy Correspondence Learning [56.06801962154915]
Cross-modal retrieval is crucial in understanding latent correspondences across modalities.
DisNCL is a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning.
arXiv Detail & Related papers (2024-08-10T09:49:55Z)
- Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering [8.382903851560595]
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.
Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems.
We propose a multimodal adversarial training architecture with spatial awareness capabilities.
arXiv Detail & Related papers (2024-03-14T11:22:06Z)
- Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation [21.345548821276097]
Class activation maps (CAMs) are commonly employed in weakly supervised semantic segmentation (WSSS) to produce pseudo-labels.
We propose an end-to-end WSSS model incorporating guided CAMs, wherein our segmentation model is trained while concurrently optimizing CAMs online.
CoSA is the first single-stage approach to outperform all existing multi-stage methods including those with additional supervision.
arXiv Detail & Related papers (2024-02-27T21:08:23Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z)
- When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning? [99.4914671654374]
We propose AdvCL, a novel adversarial contrastive pretraining framework.
We show that AdvCL is able to enhance cross-task robustness transferability without loss of model accuracy and finetuning efficiency.
arXiv Detail & Related papers (2021-11-01T17:59:43Z)