Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching
- URL: http://arxiv.org/abs/2404.18114v1
- Date: Sun, 28 Apr 2024 08:44:28 GMT
- Title: Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching
- Authors: Haiwen Diao, Ying Zhang, Shang Gao, Xiang Ruan, Huchuan Lu
- Abstract summary: We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
- Score: 53.05954114863596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text matching remains a challenging task due to heterogeneous semantic diversity across modalities and insufficient distance separability within triplets. Different from previous approaches focusing on enhancing multi-modal representations or exploiting cross-modal correspondence for more accurate retrieval, in this paper we aim to leverage the knowledge transfer between peer branches in a boosting manner to seek a more powerful matching model. Specifically, we propose a brand-new Deep Boosting Learning (DBL) algorithm, where an anchor branch is first trained to provide insights into the data properties, with a target branch gaining more advanced knowledge to develop optimal features and distance metrics. Concretely, an anchor branch initially learns the absolute or relative distance between positive and negative pairs, providing a foundational understanding of the particular network and data distribution. Building upon this knowledge, a target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples. Extensive experiments validate that our DBL can achieve impressive and consistent improvements based on various recent state-of-the-art models in the image-text matching field, and outperform related popular cooperative strategies, e.g., Conventional Distillation, Mutual Learning, and Contrastive Learning. Beyond the above, we confirm that DBL can be seamlessly integrated into their training scenarios and achieve superior performance under the same computational costs, demonstrating the flexibility and broad applicability of our proposed method. Our code is publicly available at: https://github.com/Paranioar/DBL.
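To make the training recipe concrete, below is a minimal PyTorch-style sketch of the idea, written as our own illustration rather than the released implementation: `anchor_branch`, `target_branch`, the margin values, and the hard-negative mining are all assumed placeholders. The anchor branch optimizes a fixed-margin triplet objective, while the target branch is trained concurrently with a margin enlarged by the separation the anchor branch has already achieved.

```python
import torch
import torch.nn.functional as F

def triplet_loss(sim_pos, sim_neg, margin):
    # Hinge-style ranking loss: push positive similarity above
    # negative similarity by at least `margin`.
    return F.relu(margin - sim_pos + sim_neg).mean()

def dbl_step(anchor_branch, target_branch, images, texts, base_margin=0.2, boost=0.1):
    """One hypothetical Deep Boosting Learning step (illustrative only).

    The anchor branch learns with a fixed margin; the target branch is
    trained concurrently with a larger, adaptive margin derived from the
    separation the anchor branch has already reached.
    """
    # Cosine similarities for matched (diagonal) and hardest unmatched pairs.
    img_a, txt_a = anchor_branch(images, texts)
    sim_a = F.normalize(img_a, dim=-1) @ F.normalize(txt_a, dim=-1).t()
    pos_a = sim_a.diag()
    neg_a = (sim_a - 2 * torch.eye(sim_a.size(0), device=sim_a.device)).max(dim=1).values

    img_t, txt_t = target_branch(images, texts)
    sim_t = F.normalize(img_t, dim=-1) @ F.normalize(txt_t, dim=-1).t()
    pos_t = sim_t.diag()
    neg_t = (sim_t - 2 * torch.eye(sim_t.size(0), device=sim_t.device)).max(dim=1).values

    # Anchor branch: plain fixed-margin objective.
    loss_anchor = triplet_loss(pos_a, neg_a, base_margin)

    # Target branch: margin grows with the gap the anchor already attains,
    # so the target must enlarge the relative distance further.
    adaptive_margin = base_margin + boost * (pos_a - neg_a).detach().clamp(min=0)
    loss_target = F.relu(adaptive_margin - pos_t + neg_t).mean()

    return loss_anchor + loss_target
```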
Related papers
- GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse Function to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric delicately encapsulates two formats of terms, diagonal and block-diagonal.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
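As a rough illustration of what "diagonal and block-diagonal terms" could look like in code (a hypothetical parameterization, not GSSF's actual one), the similarity of a pair can be scored by per-dimension weights plus per-block weights over element-wise feature interactions:

```python
import torch
import torch.nn as nn

class StructuredSparseMetric(nn.Module):
    """Illustrative pair-wise metric with diagonal and block-diagonal terms.

    A guess at the general idea, not GSSF's actual formulation: a
    per-dimension (diagonal) weight plus a per-block weight over groups of
    `block_size` dimensions score element-wise feature interactions.
    """
    def __init__(self, dim=256, block_size=32):
        super().__init__()
        assert dim % block_size == 0
        self.block_size = block_size
        self.diag_w = nn.Parameter(torch.ones(dim))                  # diagonal term
        self.block_w = nn.Parameter(torch.ones(dim // block_size))   # block-diagonal term

    def forward(self, x, y):
        # x, y: (batch, dim) image / text embeddings.
        inter = x * y                                    # element-wise interactions
        diag_score = (inter * self.diag_w).sum(dim=-1)
        blocks = inter.view(inter.size(0), -1, self.block_size).sum(dim=-1)
        block_score = (blocks * self.block_w).sum(dim=-1)
        return diag_score + block_score
```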
arXiv Detail & Related papers (2024-10-20T03:45:50Z) - CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training [17.27516384073838]
We propose CMAL, a Cross-Modal Associative Learning framework with anchor points detection and cross-modal associative learning.
CMAL achieves competitive performance against previous CMCL-based methods on four common downstream vision-and-language tasks.
arXiv Detail & Related papers (2024-10-16T14:12:26Z) - Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction [4.7220779071424985]
Few-shot Relation Extraction (FSRE) aims to extract facts from a sparse set of labeled corpora.
Recent studies have shown promising results in FSRE by employing Pre-trained Language Models.
We introduce a novel synergistic anchored contrastive pre-training framework.
arXiv Detail & Related papers (2023-12-19T10:16:24Z) - Nearest Neighbor-Based Contrastive Learning for Hyperspectral and LiDAR Data Classification [45.026868970899514]
We propose a Nearest Neighbor-based Contrastive Learning Network (NNCNet) to learn discriminative feature representations.
Specifically, we propose a nearest neighbor-based data augmentation scheme to make use of the enhanced semantic relationships among nearby regions.
In addition, we design a bilinear attention module to exploit the second-order and even high-order feature interactions between the HSI and LiDAR data.
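A simplified sketch of such a bilinear (second-order) cross-modal attention follows; the module name, shapes, and fusion layer are illustrative assumptions rather than the NNCNet implementation.

```python
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    """Sketch of bilinear (second-order) cross-modal attention.

    Hypothetical illustration: HSI features attend to LiDAR features through
    a learnable bilinear form, and the attended LiDAR context is fused back
    into the HSI representation.
    """
    def __init__(self, hsi_dim=128, lidar_dim=64):
        super().__init__()
        self.bilinear = nn.Parameter(torch.randn(hsi_dim, lidar_dim) * 0.01)
        self.fuse = nn.Linear(hsi_dim + lidar_dim, hsi_dim)

    def forward(self, hsi, lidar):
        # hsi: (batch, n_hsi, hsi_dim); lidar: (batch, n_lidar, lidar_dim)
        scores = torch.einsum("bnd,de,bme->bnm", hsi, self.bilinear, lidar)
        attn = scores.softmax(dim=-1)          # HSI tokens attend to LiDAR tokens
        context = attn @ lidar                 # (batch, n_hsi, lidar_dim)
        return self.fuse(torch.cat([hsi, context], dim=-1))
```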
arXiv Detail & Related papers (2023-01-09T13:43:54Z) - CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval [108.48540976175457]
We propose Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) for improving cross-modal representation.
We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity-sensitiveness is achieved by adaptive negative pair weighting.
Experiments conducted on two popular benchmarks, i.e., MSCOCO and Flickr30K, validate that CODER remarkably outperforms the state-of-the-art approaches.
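The dynamic-dictionary idea follows the general momentum-contrast recipe; a minimal sketch (hypothetical, not CODER's code) keeps a FIFO queue of past text embeddings as extra negatives for each image and weights harder negatives more strongly.

```python
import torch
import torch.nn.functional as F

class TextQueue:
    """Illustrative dynamic dictionary: a FIFO queue of past text embeddings."""
    def __init__(self, dim=256, size=4096):
        self.feats = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, new_feats):
        n = new_feats.size(0)
        idx = (self.ptr + torch.arange(n)) % self.feats.size(0)
        self.feats[idx] = F.normalize(new_feats, dim=-1)
        self.ptr = (self.ptr + n) % self.feats.size(0)

def weighted_contrastive_loss(img, txt, queue, temperature=0.07):
    # In-batch positive plus queued negatives; harder negatives get larger weights.
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    pos = (img * txt).sum(dim=-1, keepdim=True)        # (batch, 1)
    neg = img @ queue.feats.t()                        # (batch, queue_size)
    neg_weight = neg.softmax(dim=-1).detach()          # adaptive negative weighting
    logits = torch.cat([pos, neg * (1.0 + neg_weight)], dim=1) / temperature
    labels = torch.zeros(img.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```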
arXiv Detail & Related papers (2022-08-21T08:37:50Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Deep Relational Metric Learning [84.95793654872399]
This paper presents a deep relational metric learning framework for image clustering and retrieval.
We learn an ensemble of features that characterizes an image from different aspects to model both interclass and intraclass distributions.
Experiments on the widely-used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate that our framework improves existing deep metric learning methods and achieves very competitive results.
arXiv Detail & Related papers (2021-08-23T09:31:18Z) - An unsupervised deep learning framework via integrated optimization of representation learning and GMM-based modeling [31.334196673143257]
This paper introduces a new principle of joint learning on both deep representations and GMM-based deep modeling.
In comparison with existing work in similar areas, our objective function has two learning targets that are jointly optimized.
The compactness of clusters is significantly enhanced by reducing the intra-cluster distances, and the separability is improved by increasing the inter-cluster distances.
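An illustrative form of such a two-target objective, simplified here to hard assignments and Euclidean cluster centers rather than a full GMM, could look like the following sketch (names and margin are assumptions, not the paper's formulation):

```python
import torch

def clustering_objective(z, centers, assignments, margin=1.0):
    """Illustrative joint objective over deep features and cluster centers.

    z: (n, dim) representations; centers: (k, dim) cluster means;
    assignments: (n,) hard cluster indices. The first term pulls samples
    toward their assigned center (compactness); the second pushes distinct
    centers at least `margin` apart (separability).
    """
    intra = (z - centers[assignments]).pow(2).sum(dim=-1).mean()

    dists = torch.cdist(centers, centers)                  # pairwise center distances
    k = centers.size(0)
    off_diag = dists[~torch.eye(k, dtype=torch.bool)]      # drop self-distances
    inter = torch.relu(margin - off_diag).mean()

    return intra + inter
```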
arXiv Detail & Related papers (2020-09-11T04:57:03Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.