ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
- URL: http://arxiv.org/abs/2402.17298v2
- Date: Fri, 22 Nov 2024 16:05:12 GMT
- Title: ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
- Authors: Yang Liu, Xiaomin Yu, Gongyu Zhang, Zhen Zhu, Christos Bergeles, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin
- Abstract summary: In this paper, we replace the image input with text for vision-language training.
Inspired by prior noise injection methods, we introduce Adaptive ranged cosine Similarity injected noise (ArcSin).
Our empirical results demonstrate that these models closely rival image-trained counterparts in performance.
- Score: 43.42682181017004
- Abstract: "A data scientist is tasked with developing a low-cost surgical VQA system for a 2-month workshop. Due to data sensitivity, she collects 50 hours of surgical video from a hospital, requiring two months for privacy approvals. Privacy restrictions prevent uploading data to platforms like ChatGPT, so she assembles one annotator and a medical expert to manually create QA pairs. This process takes three weeks and costs over $10,000. The trained model provides accurate responses within the limited data scope but lacks broader generalizability, completing the project in 3 months." To simplify the challenges presented in the scenario above. In this paper, we replace the image input with text for Vision-language training. Inspired by prior noise injection methods to reduce modality gaps, we introduce Adaptive ranged cosine Similarity injected noise (ArcSin). First, we introduce an innovative adaptive noise scale that effectively generates the textual elements with more variability while preserving the original text feature's integrity. Second, a similarity pool strategy is employed, expanding the domain generalization potential by broadening the overall noise scale. This dual strategy effectively broadens the scope of the original domain while safeguarding content integrity. Our empirical results demonstrate that these models closely rival those trained on images in terms of performance. Specifically, our method exhibits substantial improvements over the previous state-of-the-art, achieving gains of 1.9 and 1.1 CIDEr points in S-Cap and M-Cap, respectively. Additionally, we observe increases of 0.5 percentage points (pp), 1.4 pp, and 1.4 pp in accuracy for VQA, VQA-E, and VE, respectively, pushing the boundaries of what is achievable within the constraints of image-trained model benchmarks.
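The adaptive noise scale described in the abstract can be illustrated with a minimal sketch: perturb a text feature with Gaussian noise, shrinking the scale until cosine similarity with the original feature stays above a chosen floor. This is an illustrative approximation under assumed conventions, not the paper's exact formulation; the names `inject_ranged_noise`, `min_sim`, and `init_scale` are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def inject_ranged_noise(feat, min_sim=0.9, init_scale=0.1, rng=None):
    """Add Gaussian noise to a text feature, halving the noise scale
    until cosine similarity with the original stays above `min_sim`.
    Illustrative sketch only, not ArcSin's exact update rule."""
    rng = np.random.default_rng() if rng is None else rng
    scale = init_scale
    noise = rng.normal(size=feat.shape)
    noisy = feat + scale * noise
    # Shrink the perturbation until the feature stays in the allowed range
    while cosine_similarity(feat, noisy) < min_sim:
        scale *= 0.5
        noisy = feat + scale * noise
    return noisy

# Unit-norm placeholder standing in for a CLIP-style text embedding
feat = np.ones(512) / np.sqrt(512)
noisy = inject_ranged_noise(feat, min_sim=0.95)
```

Because the scale only shrinks, the loop always terminates: as `scale` approaches zero the perturbed feature converges back to the original and the similarity approaches 1.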
Related papers
- CAPEEN: Image Captioning with Early Exits and Knowledge Distillation [5.402030962296633]
Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks.
EE strategies can be used to enhance their efficiency, but their adaptation presents challenges in image captioning as it requires varying levels of semantic information for accurate predictions.
We introduce CAPEEN to improve the performance of EE strategies using knowledge distillation.
arXiv Detail & Related papers (2024-10-06T10:05:01Z)
- Comb, Prune, Distill: Towards Unified Pruning for Vision Model Compression [24.119415458653616]
We propose a novel unified pruning framework Comb, Prune, Distill (CPD) to address both model-agnostic and task-agnostic concerns simultaneously.
Our framework employs a combing step to resolve hierarchical layer-wise dependency issues, enabling architecture independence.
In image classification we achieve a speedup of up to x4.3 with an accuracy loss of 1.8%, and in semantic segmentation up to x1.89 with a 5.1% loss in mIoU.
arXiv Detail & Related papers (2024-08-06T09:02:31Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- GeNIe: Generative Hard Negative Images Through Diffusion [16.619150568764262]
Recent advances in generative AI have enabled more sophisticated augmentation techniques that produce data resembling natural images.
We introduce GeNIe, a novel augmentation method which leverages a latent diffusion model conditioned on a text prompt to generate challenging augmentations.
Our experiments demonstrate the effectiveness of our novel augmentation method and its superior performance over the prior art.
arXiv Detail & Related papers (2023-12-05T07:34:30Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Prefix Language Models are Unified Modal Learners [30.666873206462295]
We show that a unified modal model could be learned with a prefix language modeling objective upon text and image sequences.
Thanks to the simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks.
arXiv Detail & Related papers (2022-06-15T17:49:38Z)
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
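The joint optimization described in the ConClaT entry above can be sketched as a weighted sum of the two losses. This is a generic illustration, not ConClaT's exact objective: the names `joint_loss` and `alpha` are hypothetical, and the contrastive term is a simple supervised variant that pulls same-label embeddings together.

```python
import numpy as np

def softmax_xent(logits, labels):
    # Standard cross-entropy over class logits
    shifted = logits - logits.max(axis=1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def supervised_contrastive(emb, labels, temperature=0.1):
    # Pull same-label embeddings together, push different-label ones apart
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)            # exclude self-pairs
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]
    np.fill_diagonal(pos, False)
    return -logp[pos].mean()                  # mean over positive pairs

def joint_loss(logits, emb, labels, alpha=0.5):
    # Weighted combination; alternating optimization would instead switch
    # between the two terms across training steps
    return alpha * softmax_xent(logits, labels) + (1 - alpha) * supervised_contrastive(emb, labels)

# Toy batch: two classes, two examples each
logits = np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]])
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]])
labels = np.array([0, 1, 0, 1])
loss = joint_loss(logits, emb, labels)
```

Setting `alpha` to 1 or 0 recovers pure cross-entropy or pure contrastive training, which is one simple way to realize the alternating schedule the entry mentions.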
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.