Related papers: ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks

ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks

URL: http://arxiv.org/abs/2402.17298v1
Date: Tue, 27 Feb 2024 08:20:45 GMT
Title: ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
Authors: Yang Liu, Xiaomin Yu, Gongyu Zhang, Christos Bergeles, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin
Abstract summary: We address the challenging task of bridging the modality gap between learning from language and inference for visual tasks. We propose a novel method called Adaptive ranged cosine Similarity injected noise (ArcSin) Our empirical results demonstrate that these models closely rival those trained on images in terms of performance.
Score: 45.23955785457727
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this study, we address the challenging task of bridging the modality gap between learning from language and inference for visual tasks, including Visual Question Answering (VQA), Image Captioning (IC) and Visual Entailment (VE). We train models for these tasks in a zero-shot cross-modal transfer setting, a domain where the previous state-of-the-art method relied on the fixed scale noise injection, often compromising the semantic content of the original modality embedding. To combat it, we propose a novel method called Adaptive ranged cosine Similarity injected noise (ArcSin). First, we introduce an innovative adaptive noise scale that effectively generates the textual elements with more variability while preserving the original text feature's integrity. Second, a similarity pool strategy is employed, expanding the domain generalization potential by broadening the overall noise scale. This dual strategy effectively widens the scope of the original domain while safeguarding content integrity. Our empirical results demonstrate that these models closely rival those trained on images in terms of performance. Specifically, our method exhibits substantial improvements over the previous state-of-the-art, achieving gains of 1.9 and 1.1 CIDEr points in S-Cap and M-Cap, respectively. Additionally, we observe increases of 1.5 percentage points (pp), 1.4 pp, and 1.4 pp in accuracy for VQA, VQA-E, and VE, respectively, pushing the boundaries of what is achievable within the constraints of image-trained model benchmarks. The code will be released.

Related papers

Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation [20.009670139005085]
Existing ultrasound segmentation methods often struggle with adaptability to new tasks. We introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features. These enriched features are then decoded to produce precise and robust segmentation.
arXiv Detail & Related papers (2025-03-31T17:47:42Z)
CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI [58.35348718345307]
Current efforts to distinguish between real and AI-generated images may lack generalization. We propose a novel framework, Co-Spy, that first enhances existing semantic features. We also create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models.
arXiv Detail & Related papers (2025-03-24T01:59:29Z)
CAPEEN: Image Captioning with Early Exits and Knowledge Distillation [5.402030962296633]
Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks. EE strategies can be used to enhance their efficiency, but their adaptation presents challenges in image captioning as it requires varying levels of semantic information for accurate predictions. We introduce CAPEEN to improve the performance of EE strategies using knowledge distillation.
arXiv Detail & Related papers (2024-10-06T10:05:01Z)
Comb, Prune, Distill: Towards Unified Pruning for Vision Model Compression [24.119415458653616]
We propose a novel unified pruning framework Comb, Prune, Distill (CPD) to address both model-agnostic and task-agnostic concerns simultaneously. Our framework employs a combing step to resolve hierarchical layer-wise dependency issues, enabling architecture independence. In image classification we achieve a speedup of up to x4.3 with a accuracy loss of 1.8% and in semantic segmentation up to x1.89 with a 5.1% loss in mIoU.
arXiv Detail & Related papers (2024-08-06T09:02:31Z)
Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
GeNIe: Generative Hard Negative Images Through Diffusion [16.619150568764262]
Recent advances in generative AI have enabled more sophisticated augmentation techniques that produce data resembling natural images. We introduce GeNIe, a novel augmentation method which leverages a latent diffusion model conditioned on a text prompt to generate challenging augmentations. Our experiments demonstrate the effectiveness of our novel augmentation method and its superior performance over the prior art.
arXiv Detail & Related papers (2023-12-05T07:34:30Z)
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data. In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks. The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video. Our motivation comes from that the temporal boundary of the query-guided activity should be consistently predicted. In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
Prefix Language Models are Unified Modal Learners [30.666873206462295]
We show that a unified modal model could be learned with a prefix language modeling objective upon text and image sequences. Thanks to the simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks.
arXiv Detail & Related papers (2022-06-15T17:49:38Z)
Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimize both cross-entropy and contrastive losses. We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.