TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation
- URL: http://arxiv.org/abs/2307.14611v3
- Date: Mon, 11 Sep 2023 05:15:31 GMT
- Title: TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation
- Authors: Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, Kilho Son, Tae-Hyun Oh
- Abstract summary: We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces.
TextManiA augments visual data with intra-class semantic perturbation by exploiting easy-to-understand visually mimetic words.
Our experiments demonstrate that TextManiA is particularly powerful in scarce samples with class imbalance as well as even distribution.
- Score: 20.00366398989886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose TextManiA, a text-driven manifold augmentation method that
semantically enriches visual feature spaces, regardless of class distribution.
TextManiA augments visual data with intra-class semantic perturbation by
exploiting easy-to-understand visually mimetic words, i.e., attributes. This
work is built on an interesting hypothesis that general language models, e.g.,
BERT and GPT, encompass visual information to some extent, even without
training on visual data. Given this hypothesis, TextManiA transfers
pre-trained text representations obtained from a well-established large
language encoder to a target visual feature space being learned. Our extensive
analysis hints that the language encoder indeed encompasses visual information
that is at least useful for augmenting visual representations. Our experiments
demonstrate that TextManiA is particularly powerful in scarce-sample regimes,
under both class imbalance and even class distribution. We also show
compatibility with label mix-based approaches on evenly distributed scarce
data.
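To make the idea concrete, below is a minimal sketch of text-driven feature perturbation in the spirit of the abstract: embed a base prompt and an attribute-augmented prompt with a pre-trained language encoder, and treat the projected difference vector as an intra-class perturbation of a visual feature. The prompt templates, the attribute, and the learnable text-to-visual projection are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal sketch of text-driven feature perturbation in the spirit of
# TextManiA. The prompt templates, the attribute, and the learnable
# text-to-visual projection are illustrative assumptions, not the paper's
# exact recipe.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(sentence: str) -> torch.Tensor:
    """Mean-pooled BERT embedding of a sentence (768-d)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    hidden = text_encoder(**inputs).last_hidden_state  # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0)

def attribute_difference(class_name: str, attribute: str) -> torch.Tensor:
    """Difference vector between an attribute-augmented and a base prompt,
    treated as an intra-class semantic perturbation direction."""
    base = embed(f"a photo of a {class_name}")
    attr = embed(f"a photo of a {attribute} {class_name}")
    return attr - base

# Hypothetical projection from the 768-d text space into a 512-d visual
# feature space that is being learned.
text_to_visual = nn.Linear(768, 512)

def augment(visual_feat: torch.Tensor, class_name: str, attribute: str,
            scale: float = 0.5) -> torch.Tensor:
    """Perturb a visual feature along a text-derived attribute direction."""
    delta = text_to_visual(attribute_difference(class_name, attribute))
    return visual_feat + scale * delta

feat = torch.randn(512)                  # stand-in visual feature of class "car"
aug_feat = augment(feat, "car", "red")   # semantically perturbed feature
```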
Related papers
- The Solution for Language-Enhanced Image New Category Discovery [5.500122875523184]
We propose reversing the training process of CLIP and introducing the concept of Pseudo Visual Prompts.
One prompt is built for each object category and pre-trained on large-scale, low-cost sentence data generated by large language models.
We then employ contrastive learning to transfer the stored visual information to the textual labels, enhancing their visual representation capacity.
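The abstract suggests a contrastive objective that pulls each category's pseudo visual prompt toward text embeddings of sentences about that category. A rough sketch under that reading, with all names and shapes assumed, follows:

```python
# A rough sketch of aligning per-class "pseudo visual prompts" with text
# embeddings via an InfoNCE-style loss. Names, shapes, and the temperature
# are assumptions, not the paper's specification.
import torch
import torch.nn.functional as F

num_classes, dim, tau = 10, 512, 0.07
pseudo_prompts = torch.nn.Parameter(torch.randn(num_classes, dim))

def contrastive_transfer(text_emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """text_emb: (B, dim) sentence embeddings; labels: (B,) class ids."""
    p = F.normalize(pseudo_prompts, dim=-1)   # (C, dim) per-class prompts
    t = F.normalize(text_emb, dim=-1)         # (B, dim) sentence embeddings
    logits = t @ p.t() / tau                  # (B, C) cosine similarities
    return F.cross_entropy(logits, labels)    # pull matching pairs together
```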
arXiv Detail & Related papers (2024-07-06T08:09:29Z)
- Instructing Prompt-to-Prompt Generation for Zero-Shot Learning [116.33775552866476]
We propose a Prompt-to-Prompt generation methodology (P2P) to distill instructive visual prompts for transferable knowledge discovery.
The core of P2P is to mine semantic-related instruction from prompt-conditioned visual features and text instruction on modal-sharing semantic concepts.
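One plausible reading of this is a cross-attention step in which text-instruction tokens query prompt-conditioned visual features to distill a visual prompt. The sketch below is speculative and inferred only from the abstract:

```python
# A speculative sketch: text-instruction tokens attend over visual features
# (cross-attention) to distill an instructive prompt. The module layout is
# an assumption based only on the abstract.
import torch
import torch.nn as nn

dim = 512
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

def distill_visual_prompt(text_tokens: torch.Tensor,
                          visual_feats: torch.Tensor) -> torch.Tensor:
    """text_tokens: (B, Lt, dim); visual_feats: (B, Lv, dim) -> (B, Lt, dim)."""
    prompt, _ = cross_attn(query=text_tokens, key=visual_feats, value=visual_feats)
    return prompt
```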
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- CLIP-Count: Towards Text-Guided Zero-Shot Object Counting [32.07271723717184]
We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner.
To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction.
Our method effectively generates high-quality density maps for objects of interest.
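A simplified version of such a patch-text contrastive loss might score each patch embedding against the text embedding and supervise the scores with a binary mask of object locations; the mask source and loss form below are assumptions:

```python
# A simplified sketch of a patch-text contrastive loss: patches that fall on
# the object (per a binary mask assumed to come from annotations) are pulled
# toward the text embedding, other patches are pushed away.
import torch
import torch.nn.functional as F

def patch_text_contrastive(patches: torch.Tensor, text: torch.Tensor,
                           pos_mask: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """patches: (N, d) patch embeddings; text: (d,); pos_mask: (N,) bool."""
    sim = F.normalize(patches, dim=-1) @ F.normalize(text, dim=-1) / tau  # (N,)
    # Treat patch-text agreement as per-patch binary classification.
    return F.binary_cross_entropy_with_logits(sim, pos_mask.float())
```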
arXiv Detail & Related papers (2023-05-12T08:19:39Z)
- Semantic Prompt for Few-Shot Image Recognition [76.68959583129335]
We propose a novel Semantic Prompt (SP) approach for few-shot learning.
The proposed approach achieves promising results, improving the 1-shot learning accuracy by 3.67% on average.
arXiv Detail & Related papers (2023-03-24T16:32:19Z)
- Visual-Semantic Contrastive Alignment for Few-Shot Image Classification [1.109560166867076]
Few-Shot learning aims to train a model that can adapt to unseen visual classes with only a few labeled examples.
We introduce a contrastive alignment mechanism for visual and semantic feature vectors to learn much more generalized visual concepts.
Our method simply adds an auxiliary contrastive learning objective which captures the contextual knowledge of a visual category.
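A minimal sketch of such an auxiliary objective adds a contrastive visual-to-class-text alignment term to the usual cross-entropy loss; the loss weighting and temperature below are assumed:

```python
# A minimal sketch of adding an auxiliary visual-semantic alignment term to
# the usual classification loss. The weighting lam and temperature tau are
# assumptions for illustration.
import torch
import torch.nn.functional as F

def total_loss(logits, visual_feat, class_text_emb, labels, lam=0.5, tau=0.07):
    """visual_feat: (B, d); class_text_emb: (C, d) text embedding per class."""
    ce = F.cross_entropy(logits, labels)              # standard classifier loss
    v = F.normalize(visual_feat, dim=-1)
    t = F.normalize(class_text_emb, dim=-1)
    align = F.cross_entropy(v @ t.t() / tau, labels)  # contrastive alignment
    return ce + lam * align
```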
arXiv Detail & Related papers (2022-10-20T03:59:40Z)
- Brief Introduction to Contrastive Learning Pretext Tasks for Visual Representation [0.0]
We introduce contrastive learning, a subset of unsupervised learning methods.
The purpose of contrastive learning is to embed augmented views of the same sample close to each other while pushing apart views of different samples.
We offer some strategies from contrastive learning that have recently been published and are focused on pretext tasks for visual representation.
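The stated objective is captured by the widely used NT-Xent (SimCLR-style) loss, sketched below: two augmented views of the same image attract, while all other pairs in the batch repel.

```python
# A compact NT-Xent (SimCLR-style) loss: for each view, the positive is the
# other augmented view of the same image; all remaining batch entries are
# negatives.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (B, d) projections of two augmented views of the same batch."""
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)  # (2B, d)
    sim = z @ z.t() / tau                                # (2B, 2B) similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)                 # positive = other view
```

The temperature tau controls how sharply hard negatives are weighted; 0.5 here is just a common default.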
arXiv Detail & Related papers (2022-10-06T18:54:10Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
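A schematic sketch of such a visual knowledge fusion layer: text hidden states cross-attend over retrieved image embeddings, with a learned per-token gate on the residual. The layout is an assumption inferred from the abstract, not VaLM's actual module:

```python
# A schematic visual knowledge fusion layer: text hidden states attend over
# retrieved image embeddings; a learned gate controls how much visual signal
# enters each token. This layout is an assumption, not VaLM's code.
import torch
import torch.nn as nn

class VisualFusionLayer(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)

    def forward(self, text_h: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        """text_h: (B, Lt, dim); img_emb: (B, K, dim) retrieved image features."""
        visual, _ = self.attn(text_h, img_emb, img_emb)  # cross-attention
        g = torch.sigmoid(self.gate(text_h))             # per-token fusion gate
        return text_h + g * visual
```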
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that the pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
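A toy sketch of this dual-encoder setup, with placeholder architectures: a character-aware text encoder embeds character sequences, and a symmetric contrastive loss matches image and text features over a batch of aligned pairs.

```python
# A toy dual-encoder sketch: a character-aware text encoder paired with an
# image encoder, trained with symmetric contrastive matching. Architectures
# here are placeholders, not the paper's networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharAwareTextEncoder(nn.Module):
    def __init__(self, n_chars: int = 128, dim: int = 256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)    # per-character embeddings
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        """char_ids: (B, L) -> (B, dim) sequence summary."""
        out, _ = self.rnn(self.char_emb(char_ids))
        return out[:, -1]                             # last hidden state

def matching_loss(img_feat, txt_feat, tau=0.07):
    """Symmetric contrastive matching over a batch of aligned pairs."""
    i = F.normalize(img_feat, dim=-1)
    t = F.normalize(txt_feat, dim=-1)
    logits = i @ t.t() / tau
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```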
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
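As a small illustration of the first ingredient, a constituency parse can be linearized into a flat token sequence (structure tokens plus words) that a standard sequence encoder can consume; the bracketed parse below is a hand-written example, not from the paper:

```python
# Turning a constituency parse into a flat token sequence with bracket and
# label tokens, so a standard text encoder can read the structure.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBZ chases) (NP (DT a) (NN ball))))")

def linearize(tree) -> list[str]:
    """Depth-first traversal emitting structure tokens alongside words."""
    if isinstance(tree, str):            # leaf: a plain word
        return [tree]
    tokens = [f"({tree.label()}"]        # opening token carries the label
    for child in tree:
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens

print(" ".join(linearize(parse)))
# (S (NP (DT the ) (NN dog ) ) (VP (VBZ chases ) (NP (DT a ) (NN ball ) ) ) )
```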
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)