ArcSin: Adaptive ranged cosine Similarity injected noise for
Language-Driven Visual Tasks
- URL: http://arxiv.org/abs/2402.17298v1
- Date: Tue, 27 Feb 2024 08:20:45 GMT
- Title: ArcSin: Adaptive ranged cosine Similarity injected noise for
Language-Driven Visual Tasks
- Authors: Yang Liu, Xiaomin Yu, Gongyu Zhang, Christos Bergeles, Prokar
Dasgupta, Alejandro Granados, Sebastien Ourselin
- Abstract summary: We address the challenging task of bridging the modality gap between learning from language and inference for visual tasks.
We propose a novel method called Adaptive ranged cosine Similarity injected noise (ArcSin).
Our empirical results demonstrate that these models closely rival those trained on images in terms of performance.
- Score: 45.23955785457727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we address the challenging task of bridging the modality gap
between learning from language and inference for visual tasks, including Visual
Question Answering (VQA), Image Captioning (IC) and Visual Entailment (VE). We
train models for these tasks in a zero-shot cross-modal transfer setting, a
domain where the previous state-of-the-art method relied on fixed-scale
noise injection, often compromising the semantic content of the original
modality embedding. To address this, we propose a novel method called Adaptive
ranged cosine Similarity injected noise (ArcSin). First, we introduce an
innovative adaptive noise scale that effectively generates the textual elements
with more variability while preserving the original text feature's integrity.
Second, a similarity pool strategy is employed, expanding the domain
generalization potential by broadening the overall noise scale. This dual
strategy effectively widens the scope of the original domain while safeguarding
content integrity. Our empirical results demonstrate that these models closely
rival those trained on images in terms of performance. Specifically, our method
exhibits substantial improvements over the previous state-of-the-art, achieving
gains of 1.9 and 1.1 CIDEr points in S-Cap and M-Cap, respectively.
Additionally, we observe increases of 1.5 percentage points (pp), 1.4 pp, and
1.4 pp in accuracy for VQA, VQA-E, and VE, respectively, pushing the boundaries
of what is achievable within the constraints of image-trained model benchmarks.
The code will be released.
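The abstract contrasts fixed-scale noise injection with a noise scale tied to cosine similarity but gives no formula. The following is a minimal sketch of that contrast only, not the authors' implementation: `fixed_scale_noise` adds Gaussian noise with a single sigma, while the hypothetical `bounded_cosine_noise` perturbs an embedding so that its cosine similarity to the original stays inside a chosen range. The function names, the `min_cos` parameter, and the uniform choice of target similarity are all assumptions made for illustration.

```python
import math
import random


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fixed_scale_noise(x, sigma, rng):
    """Baseline: add Gaussian noise with one fixed sigma, then renormalize."""
    return normalize([a + rng.gauss(0.0, sigma) for a in x])


def bounded_cosine_noise(x, min_cos, rng):
    """Hypothetical: perturb x so that cos(x, result) lies in [min_cos, 1].

    Assumes x has at least two dimensions and is non-zero.
    """
    x = normalize(x)
    # Draw a random direction and remove its component along x,
    # leaving a unit vector u orthogonal to x.
    g = [rng.gauss(0.0, 1.0) for _ in x]
    proj = sum(a * b for a, b in zip(g, x))
    u = normalize([a - proj * b for a, b in zip(g, x)])
    # Pick the target similarity c inside the allowed range (assumed uniform).
    c = rng.uniform(min_cos, 1.0)
    s = math.sqrt(max(0.0, 1.0 - c * c))
    # c*x + s*u has unit length and cosine similarity c with x.
    return [c * a + s * b for a, b in zip(x, u)]


rng = random.Random(0)
text_embedding = normalize([rng.gauss(0.0, 1.0) for _ in range(64)])
noisy = bounded_cosine_noise(text_embedding, min_cos=0.8, rng=rng)
print(round(cosine(text_embedding, noisy), 3))
```

With a fixed sigma, the resulting similarity to the original depends on the embedding dimension and on the noise draw, so it can fall arbitrarily low. Constructing the perturbed vector from a target similarity makes the lower bound explicit, which is one way to read "preserving the original text feature's integrity".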
Related papers
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations [61.132408427908175]
Zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain.
With only a single representative text feature instead of real images, the synthesized images gradually lose diversity.
We propose a novel method to find semantic variations of the target text in the CLIP space.
arXiv Detail & Related papers (2023-08-21T08:12:28Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Intra-Source Style Augmentation for Improved Domain Generalization [21.591831983223997]
We propose an intra-source style augmentation (ISSA) method to improve domain generalization in semantic segmentation.
ISSA is model-agnostic and straightforwardly applicable with CNNs and Transformers.
It is also complementary to other domain generalization techniques, e.g., it improves the recent state-of-the-art solution RobustNet by 3% mIoU in Cityscapes to Dark Zürich.
arXiv Detail & Related papers (2022-10-18T21:33:25Z) - DFM: A Performance Baseline for Deep Feature Matching [10.014010310188821]
The proposed method uses pre-trained VGG architecture as a feature extractor and does not require any additional training specific to improve matching.
Our algorithm achieves 0.57 and 0.80 overall scores in terms of Mean Matching Accuracy (MMA) for 1 pixel and 2 pixels thresholds respectively on Hpatches dataset.
arXiv Detail & Related papers (2021-06-14T22:55:06Z) - Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
arXiv Detail & Related papers (2020-11-18T08:42:32Z) - Pixel-Level Cycle Association: A New Perspective for Domain Adaptive
Semantic Segmentation [169.82760468633236]
We propose to build the pixel-level cycle association between source and target pixel pairs.
Our method can be trained end-to-end in one stage and introduces no additional parameters.
arXiv Detail & Related papers (2020-10-31T00:11:36Z)