Dynamic Weighted Combiner for Mixed-Modal Image Retrieval
- URL: http://arxiv.org/abs/2312.06179v1
- Date: Mon, 11 Dec 2023 07:36:45 GMT
- Title: Dynamic Weighted Combiner for Mixed-Modal Image Retrieval
- Authors: Fuxiang Huang, Lei Zhang, Xiaowei Fu, Suqi Song
- Abstract summary: Mixed-Modal Image Retrieval (MMIR), a flexible search paradigm, has attracted wide attention.
Previous approaches achieve only limited performance because two critical factors are seriously overlooked.
We propose a Dynamic Weighted Combiner (DWC) to tackle these challenges.
- Score: 8.683144453481328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixed-Modal Image Retrieval (MMIR) as a flexible search paradigm has
attracted wide attention. However, previous approaches achieve only limited
performance because two critical factors are seriously overlooked. 1) The
contributions of the image and text modalities differ, yet they are incorrectly
treated as equal. 2) Web datasets from diverse real-world scenarios carry
inherent labeling noise when text is used to describe users' intentions, which
gives rise to overfitting. We propose a Dynamic Weighted Combiner (DWC) to
tackle the above challenges, which offers three merits. First, accounting for
the contribution disparity between modalities, we propose an Editable Modality
De-equalizer (EMD) consisting of two modality feature editors and an adaptive
weighted combiner. Second, to alleviate labeling noise and data bias, we
propose a dynamic soft-similarity label generator (SSG) to implicitly improve
noisy supervision. Finally, to bridge modality gaps and facilitate similarity
learning, we propose a CLIP-based mutual enhancement module alternately trained
by a mixed-modality contrastive loss. Extensive experiments verify that our
proposed model significantly outperforms state-of-the-art methods on real-world
datasets. The source code is available at
https://github.com/fuxianghuang1/DWC.
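As a rough illustration of the first merit, the sketch below shows one way an "editable" adaptive weighted combiner could fuse image and text features with instance-dependent weights instead of a fixed equal split. The module names, layer sizes, and residual editing scheme here are assumptions for illustration only, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEditor(nn.Module):
    """Small residual editor that refines one modality's feature vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.edit = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual edit: keep the original signal, add a learned correction.
        return x + self.edit(x)


class AdaptiveWeightedCombiner(nn.Module):
    """Fuses edited image/text features with per-sample weights, so the two
    modalities are not forced to contribute equally (hypothetical EMD-style)."""

    def __init__(self, dim: int):
        super().__init__()
        self.img_editor = ModalityEditor(dim)
        self.txt_editor = ModalityEditor(dim)
        self.weight_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2)
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        img_e = self.img_editor(img_feat)
        txt_e = self.txt_editor(txt_feat)
        # Instance-dependent softmax weights decide each modality's contribution.
        w = torch.softmax(self.weight_head(torch.cat([img_e, txt_e], dim=-1)), dim=-1)
        fused = w[:, :1] * img_e + w[:, 1:] * txt_e
        return F.normalize(fused, dim=-1)


# Example usage with random features (batch of 4, 512-dim embeddings assumed).
combiner = AdaptiveWeightedCombiner(dim=512)
query = combiner(torch.randn(4, 512), torch.randn(4, 512))  # shape: (4, 512)
```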
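Likewise, a minimal sketch of a mixed-modality contrastive objective in the spirit of the third merit: an InfoNCE-style loss between fused queries and target image embeddings, optionally consuming soft-similarity labels of the kind the SSG is described to produce. The temperature, label format, and function signature are assumptions, not the paper's exact loss.

```python
from typing import Optional

import torch
import torch.nn.functional as F


def mixed_modality_contrastive_loss(
    query: torch.Tensor,                          # fused image+text queries, (B, D)
    target: torch.Tensor,                         # target image embeddings, (B, D)
    temperature: float = 0.07,                    # assumed CLIP-style temperature
    soft_targets: Optional[torch.Tensor] = None,  # optional (B, B) soft labels
) -> torch.Tensor:
    query = F.normalize(query, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = query @ target.t() / temperature     # (B, B) cosine-similarity logits

    if soft_targets is None:
        # Hard supervision: the i-th query should retrieve the i-th target.
        labels = torch.arange(query.size(0), device=query.device)
        return F.cross_entropy(logits, labels)

    # Soft supervision: a label distribution that tolerates noisy text intents.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```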
Related papers
- Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media [34.664388374279596]
We propose a Similarity-Guided Fusion Transformer (SG-MFT) for predicting the semantic locations of users from their multimodal posts.
First, we incorporate high-quality text and image representations by utilizing a pre-trained large vision-language model.
We then devise a Similarity-Guided Interaction Module (SIM) to alleviate modality heterogeneity and noise interference.
arXiv Detail & Related papers (2024-05-09T13:32:26Z) - A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels [22.2715520667186]
Cross-modal retrieval (CMR) aims to establish interaction between different modalities.
This work proposes UOT-RCL, a Unified framework based on Optimal Transport (OT) for Robust Cross-modal Retrieval.
Experiments on three widely-used cross-modal retrieval datasets demonstrate that our UOT-RCL surpasses the state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-20T10:34:40Z) - Learning Noise-Robust Joint Representation for Multimodal Emotion Recognition under Incomplete Data Scenarios [23.43319138048058]
Multimodal emotion recognition (MER) in practical scenarios is significantly challenged by the presence of missing or incomplete data.
Traditional methods often discard data or substitute missing segments with zero vectors to cope with this incompleteness.
We introduce a novel noise-robust MER model that effectively learns robust multimodal joint representations from noisy data.
arXiv Detail & Related papers (2023-09-21T10:49:02Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer, named Decoupled Mixup (DM).
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
arXiv Detail & Related papers (2022-03-21T07:12:18Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - ANIMC: A Soft Framework for Auto-weighted Noisy and Incomplete Multi-view Clustering [59.77141155608009]
We propose a novel Auto-weighted Noisy and Incomplete Multi-view Clustering framework (ANIMC) via a soft auto-weighted strategy and a doubly soft regular regression model.
ANIMC has three unique advantages: 1) it is a soft algorithm to adjust our framework in different scenarios, thereby improving its generalization ability; 2) it automatically learns a proper weight for each view, thereby reducing the influence of noises; and 3) it aligns the same instances in different views, thereby decreasing the impact of missing instances.
arXiv Detail & Related papers (2020-11-20T10:37:27Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)