Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with
Adversarial Discriminative Domain Regularization
- URL: http://arxiv.org/abs/2010.12126v2
- Date: Tue, 27 Oct 2020 23:42:21 GMT
- Title: Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with
Adversarial Discriminative Domain Regularization
- Authors: Li Ren, Kai Li, Liqiang Wang, Kien Hua
- Abstract summary: We propose a novel learning framework to construct a set of discriminative data domains within each image-text pair.
Our approach generally improves the learning efficiency and the performance of existing metric learning frameworks.
- Score: 21.904563910555368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Matching information across image and text modalities is a fundamental
challenge for many applications that involve both vision and natural language
processing. The objective is to find efficient similarity metrics for comparing
visual and textual information. Existing approaches mainly match local visual
objects and sentence words in a shared space with attention mechanisms. Their
matching performance is still limited because the similarity computation relies
on simple comparisons of the matching features and ignores the characteristics
of their distribution in the data. In this paper, we address this limitation
with an efficient learning objective that considers the discriminative feature
distributions between the visual objects and sentence words. Specifically, we
propose a novel Adversarial Discriminative Domain Regularization (ADDR) learning
framework that goes beyond the standard metric learning objective by
constructing a set of discriminative data domains within each image-text pair.
Our approach can generally improve the learning efficiency and the performance
of existing metric learning frameworks by regularizing the distribution of the
hidden space between matching pairs. Experimental results show that this new
approach significantly improves the overall performance of several popular
cross-modal matching techniques (SCAN, VSRN, BFAN) on the MS-COCO and Flickr30K
benchmarks.
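For intuition, the following is a minimal PyTorch-style sketch of how an adversarial domain regularizer of this kind can be paired with the standard hinge-based triplet ranking loss used by matching models such as SCAN or VSRN. The discriminator architecture, feature dimension, and the weight lambda_addr are illustrative assumptions and are not taken from the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainDiscriminator(nn.Module):
    # Predicts whether a feature in the shared space comes from the visual
    # or the textual domain (layer sizes are assumed, not from the paper).
    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x)  # raw logits

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    # Standard hinge-based ranking loss with in-batch hard negatives
    # (the usual metric learning baseline that ADDR is meant to complement).
    scores = img_emb @ txt_emb.t()                 # (B, B) similarity matrix
    pos = scores.diag().view(-1, 1)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_txt.max(1)[0].mean() + cost_img.max(0)[0].mean()

def addr_terms(disc, img_emb, txt_emb):
    # Adversarial domain regularization: the discriminator separates visual
    # from textual features, while the encoders try to fool it so that the
    # two domains become aligned in the shared hidden space.
    logit_i, logit_t = disc(img_emb), disc(txt_emb)
    d_loss = (F.binary_cross_entropy_with_logits(logit_i, torch.ones_like(logit_i))
              + F.binary_cross_entropy_with_logits(logit_t, torch.zeros_like(logit_t)))
    g_loss = (F.binary_cross_entropy_with_logits(logit_i, torch.zeros_like(logit_i))
              + F.binary_cross_entropy_with_logits(logit_t, torch.ones_like(logit_t)))
    return d_loss, g_loss

def encoder_loss(img_emb, txt_emb, disc, lambda_addr=0.1):
    # Total objective for the matching model; lambda_addr is an assumed weight.
    _, g_loss = addr_terms(disc, img_emb, txt_emb)
    return triplet_ranking_loss(img_emb, txt_emb) + lambda_addr * g_loss

In such a setup the discriminator would typically be updated with d_loss while the image and text encoders are updated with encoder_loss, alternating the two steps as in standard adversarial training.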
Related papers
- Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy.
We first demonstrate that ICL-based segmentation models are sensitive to different contexts.
Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- DEMO: A Statistical Perspective for Efficient Image-Text Matching [32.256725860652914]
We introduce Distribution-based Structure Mining with Consistency Learning (DEMO) for efficient image-text matching.
DEMO characterizes each image using multiple augmented views, which are treated as samples drawn from its intrinsic semantic distribution.
In addition, we introduce collaborative consistency learning, which not only preserves the similarity structure in the Hamming space but also encourages consistency between retrieval distributions from different directions (a schematic sketch of this idea follows the list below).
arXiv Detail & Related papers (2024-05-19T09:38:56Z)
- Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking [0.5242869847419834]
We propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy.
To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution.
We compare its performance with existing set-based methods using four image feature encoders and two text feature encoders on three benchmark datasets.
arXiv Detail & Related papers (2023-09-15T04:39:11Z)
- VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification [3.7798600249187295]
Multimodal learning from document data has recently achieved great success, as it allows pre-training semantically meaningful features as a prior for learnable downstream tasks.
In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues.
The proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities.
arXiv Detail & Related papers (2022-05-24T12:28:12Z)
- Deep Relational Metric Learning [84.95793654872399]
This paper presents a deep relational metric learning framework for image clustering and retrieval.
We learn an ensemble of features that characterizes an image from different aspects to model both interclass and intraclass distributions.
Experiments on the widely-used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate that our framework improves existing deep metric learning methods and achieves very competitive results.
arXiv Detail & Related papers (2021-08-23T09:31:18Z)
- Dynamic Semantic Matching and Aggregation Network for Few-shot Intent Detection [69.2370349274216]
Few-shot Intent Detection is challenging due to the scarcity of available annotated utterances.
Semantic components are distilled from utterances via multi-head self-attention.
Our method provides a comprehensive matching measure to enhance representations of both labeled and unlabeled instances.
arXiv Detail & Related papers (2020-10-06T05:16:38Z)
- Region Comparison Network for Interpretable Few-shot Image Classification [97.97902360117368]
Few-shot image classification aims to train models for new classes using only a limited number of labeled examples.
We propose a metric learning based method named Region Comparison Network (RCN), which is able to reveal how few-shot learning works.
We also present a new way to generalize the interpretability from the level of tasks to categories.
arXiv Detail & Related papers (2020-09-08T07:29:05Z)
- Cross-Domain Facial Expression Recognition: A Unified Evaluation Benchmark and Adversarial Graph Learning [85.6386289476598]
We develop a novel adversarial graph representation adaptation (AGRA) framework for cross-domain holistic-local feature co-adaptation.
We conduct extensive and fair evaluations on several popular benchmarks and show that the proposed AGRA framework outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2020-08-03T15:00:31Z)
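As a side note on the DEMO entry above, the following is a minimal sketch of one common way to encourage consistency between image-to-text and text-to-image retrieval distributions, using a symmetric KL divergence over in-batch similarities. The temperature value and the symmetric-KL formulation are assumptions for illustration; DEMO's actual consistency loss additionally involves the similarity structure of hash codes in the Hamming space.

import torch
import torch.nn.functional as F

def bidirectional_consistency(img_emb, txt_emb, tau=0.05):
    # Symmetric KL between the image->text and text->image retrieval
    # distributions for a batch of aligned pairs (tau is an assumed temperature).
    sims = img_emb @ txt_emb.t() / tau           # (B, B); image i matches text i
    log_p_i2t = F.log_softmax(sims, dim=1)       # row i: image i's distribution over texts
    log_p_t2i = F.log_softmax(sims, dim=0).t()   # row i: text i's distribution over images
    p_i2t, p_t2i = log_p_i2t.exp(), log_p_t2i.exp()
    # Each retrieval direction should induce the same view of the similarity structure.
    return 0.5 * (F.kl_div(log_p_i2t, p_t2i, reduction="batchmean")
                  + F.kl_div(log_p_t2i, p_i2t, reduction="batchmean"))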
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.