TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation
- URL: http://arxiv.org/abs/2307.14611v3
- Date: Mon, 11 Sep 2023 05:15:31 GMT
- Title: TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation
- Authors: Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, Kilho Son, Tae-Hyun Oh
- Abstract summary: We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces.
TextManiA augments visual data with intra-class semantic perturbation by exploiting easy-to-understand visually mimetic words.
Our experiments demonstrate that TextManiA is particularly powerful in scarce samples with class imbalance as well as even distribution.
- Score: 20.00366398989886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose TextManiA, a text-driven manifold augmentation method that
semantically enriches visual feature spaces, regardless of class distribution.
TextManiA augments visual data with intra-class semantic perturbation by
exploiting easy-to-understand visually mimetic words, i.e., attributes. This
work is built on an interesting hypothesis that general language models, e.g.,
BERT and GPT, encompass visual information to some extent, even without
training on visual data. Given this hypothesis, TextManiA transfers
pre-trained text representations obtained from a well-established large
language encoder to a target visual feature space being learned. Our extensive
analysis hints that the language encoder indeed encompasses visual information
that is at least useful for augmenting visual representations. Our experiments
demonstrate that TextManiA is particularly powerful in scarce-sample regimes,
under both class imbalance and even class distribution. We also show
compatibility with label mix-based approaches on evenly distributed scarce
data.
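To make the idea concrete, below is a minimal sketch of text-driven feature perturbation in the spirit of the abstract: embed a base prompt and an attribute-augmented prompt with a pre-trained language encoder, and treat the projected difference vector as an intra-class perturbation of a visual feature. The prompt templates, the attribute, and the learnable text-to-visual projection are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal sketch of text-driven feature perturbation in the spirit of
# TextManiA. The prompt templates, the attribute, and the learnable
# text-to-visual projection are illustrative assumptions, not the paper's
# exact recipe.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(sentence: str) -> torch.Tensor:
    """Mean-pooled BERT embedding of a sentence (768-d)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    hidden = text_encoder(**inputs).last_hidden_state  # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0)

def attribute_difference(class_name: str, attribute: str) -> torch.Tensor:
    """Difference vector between an attribute-augmented and a base prompt,
    treated as an intra-class semantic perturbation direction."""
    base = embed(f"a photo of a {class_name}")
    attr = embed(f"a photo of a {attribute} {class_name}")
    return attr - base

# Hypothetical projection from the 768-d text space into a 512-d visual
# feature space that is being learned.
text_to_visual = nn.Linear(768, 512)

def augment(visual_feat: torch.Tensor, class_name: str, attribute: str,
            scale: float = 0.5) -> torch.Tensor:
    """Perturb a visual feature along a text-derived attribute direction."""
    delta = text_to_visual(attribute_difference(class_name, attribute))
    return visual_feat + scale * delta

feat = torch.randn(512)                  # stand-in visual feature of class "car"
aug_feat = augment(feat, "car", "red")   # semantically perturbed feature
```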
Related papers
- The Solution for Language-Enhanced Image New Category Discovery [5.500122875523184]
We propose reversing the training process of CLIP and introducing the concept of Pseudo Visual Prompts.
One prompt is built for each object category and pre-trained on large-scale, low-cost sentence data generated by large language models.
We then employ contrastive learning to transfer the stored visual information to the textual labels, enhancing their visual representation capacity.
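The abstract suggests a contrastive objective that pulls each category's pseudo visual prompt toward text embeddings of sentences about that category. A rough sketch under that reading, with all names and shapes assumed, follows:

```python
# A rough sketch of aligning per-class "pseudo visual prompts" with text
# embeddings via an InfoNCE-style loss. Names, shapes, and the temperature
# are assumptions, not the paper's specification.
import torch
import torch.nn.functional as F

num_classes, dim, tau = 10, 512, 0.07
pseudo_prompts = torch.nn.Parameter(torch.randn(num_classes, dim))

def contrastive_transfer(text_emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """text_emb: (B, dim) sentence embeddings; labels: (B,) class ids."""
    p = F.normalize(pseudo_prompts, dim=-1)   # (C, dim) per-class prompts
    t = F.normalize(text_emb, dim=-1)         # (B, dim) sentence embeddings
    logits = t @ p.t() / tau                  # (B, C) cosine similarities
    return F.cross_entropy(logits, labels)    # pull matching pairs together
```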
arXiv Detail & Related papers (2024-07-06T08:09:29Z)
- Instructing Prompt-to-Prompt Generation for Zero-Shot Learning [116.33775552866476]
We propose a Prompt-to-Prompt generation methodology (P2P) to distill instructive visual prompts for transferable knowledge discovery.
The core of P2P is to mine semantic-related instruction from prompt-conditioned visual features and text instruction on modal-sharing semantic concepts.
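One plausible reading of this is a cross-attention step in which text-instruction tokens query prompt-conditioned visual features to distill a visual prompt. The sketch below is speculative and inferred only from the abstract:

```python
# A speculative sketch: text-instruction tokens attend over visual features
# (cross-attention) to distill an instructive prompt. The module layout is
# an assumption based only on the abstract.
import torch
import torch.nn as nn

dim = 512
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

def distill_visual_prompt(text_tokens: torch.Tensor,
                          visual_feats: torch.Tensor) -> torch.Tensor:
    """text_tokens: (B, Lt, dim); visual_feats: (B, Lv, dim) -> (B, Lt, dim)."""
    prompt, _ = cross_attn(query=text_tokens, key=visual_feats, value=visual_feats)
    return prompt
```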
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- CLIP-Count: Towards Text-Guided Zero-Shot Object Counting [32.07271723717184]
We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner.
To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction.
Our method effectively generates high-quality density maps for objects of interest.
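A simplified version of such a patch-text contrastive loss might score each patch embedding against the text embedding and supervise the scores with a binary mask of object locations; the mask source and loss form below are assumptions:

```python
# A simplified sketch of a patch-text contrastive loss: patches that fall on
# the object (per a binary mask assumed to come from annotations) are pulled
# toward the text embedding, other patches are pushed away.
import torch
import torch.nn.functional as F

def patch_text_contrastive(patches: torch.Tensor, text: torch.Tensor,
                           pos_mask: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """patches: (N, d) patch embeddings; text: (d,); pos_mask: (N,) bool."""
    sim = F.normalize(patches, dim=-1) @ F.normalize(text, dim=-1) / tau  # (N,)
    # Treat patch-text agreement as per-patch binary classification.
    return F.binary_cross_entropy_with_logits(sim, pos_mask.float())
```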
arXiv Detail & Related papers (2023-05-12T08:19:39Z)
- Semantic Prompt for Few-Shot Image Recognition [76.68959583129335]
We propose a novel Semantic Prompt (SP) approach for few-shot learning.
The proposed approach achieves promising results, improving the 1-shot learning accuracy by 3.67% on average.
arXiv Detail & Related papers (2023-03-24T16:32:19Z)
- Visual-Semantic Contrastive Alignment for Few-Shot Image Classification [1.109560166867076]
Few-Shot learning aims to train a model that can adapt to unseen visual classes with only a few labeled examples.
We introduce a contrastive alignment mechanism for visual and semantic feature vectors to learn much more generalized visual concepts.
Our method simply adds an auxiliary contrastive learning objective which captures the contextual knowledge of a visual category.
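A minimal sketch of such an auxiliary objective adds a contrastive visual-to-class-text alignment term to the usual cross-entropy loss; the loss weighting and temperature below are assumed:

```python
# A minimal sketch of adding an auxiliary visual-semantic alignment term to
# the usual classification loss. The weighting lam and temperature tau are
# assumptions for illustration.
import torch
import torch.nn.functional as F

def total_loss(logits, visual_feat, class_text_emb, labels, lam=0.5, tau=0.07):
    """visual_feat: (B, d); class_text_emb: (C, d) text embedding per class."""
    ce = F.cross_entropy(logits, labels)              # standard classifier loss
    v = F.normalize(visual_feat, dim=-1)
    t = F.normalize(class_text_emb, dim=-1)
    align = F.cross_entropy(v @ t.t() / tau, labels)  # contrastive alignment
    return ce + lam * align
```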
arXiv Detail & Related papers (2022-10-20T03:59:40Z)
- Brief Introduction to Contrastive Learning Pretext Tasks for Visual Representation [0.0]
We introduce contrastive learning, a subset of unsupervised learning methods.
The purpose of contrastive learning is to embed augmented views of the same sample close to each other while pushing apart views of different samples.
We offer some strategies from contrastive learning that have recently been published and are focused on pretext tasks for visual representation.
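The stated objective is captured by the widely used NT-Xent (SimCLR-style) loss, sketched below: two augmented views of the same image attract, while all other pairs in the batch repel.

```python
# A compact NT-Xent (SimCLR-style) loss: for each view, the positive is the
# other augmented view of the same image; all remaining batch entries are
# negatives.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (B, d) projections of two augmented views of the same batch."""
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)  # (2B, d)
    sim = z @ z.t() / tau                                # (2B, 2B) similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)                 # positive = other view
```

The temperature tau controls how sharply hard negatives are weighted; 0.5 here is just a common default.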
arXiv Detail & Related papers (2022-10-06T18:54:10Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
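A schematic sketch of such a visual knowledge fusion layer: text hidden states cross-attend over retrieved image embeddings, with a learned per-token gate on the residual. The layout is an assumption inferred from the abstract, not VaLM's actual module:

```python
# A schematic visual knowledge fusion layer: text hidden states attend over
# retrieved image embeddings; a learned gate controls how much visual signal
# enters each token. This layout is an assumption, not VaLM's code.
import torch
import torch.nn as nn

class VisualFusionLayer(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)

    def forward(self, text_h: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        """text_h: (B, Lt, dim); img_emb: (B, K, dim) retrieved image features."""
        visual, _ = self.attn(text_h, img_emb, img_emb)  # cross-attention
        g = torch.sigmoid(self.gate(text_h))             # per-token fusion gate
        return text_h + g * visual
```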
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that the pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
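A toy sketch of this dual-encoder setup, with placeholder architectures: a character-aware text encoder embeds character sequences, and a symmetric contrastive loss matches image and text features over a batch of aligned pairs.

```python
# A toy dual-encoder sketch: a character-aware text encoder paired with an
# image encoder, trained with symmetric contrastive matching. Architectures
# here are placeholders, not the paper's networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharAwareTextEncoder(nn.Module):
    def __init__(self, n_chars: int = 128, dim: int = 256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)    # per-character embeddings
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        """char_ids: (B, L) -> (B, dim) sequence summary."""
        out, _ = self.rnn(self.char_emb(char_ids))
        return out[:, -1]                             # last hidden state

def matching_loss(img_feat, txt_feat, tau=0.07):
    """Symmetric contrastive matching over a batch of aligned pairs."""
    i = F.normalize(img_feat, dim=-1)
    t = F.normalize(txt_feat, dim=-1)
    logits = i @ t.t() / tau
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```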
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
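As a small illustration of the first ingredient, a constituency parse can be linearized into a flat token sequence (structure tokens plus words) that a standard sequence encoder can consume; the bracketed parse below is a hand-written example, not from the paper:

```python
# Turning a constituency parse into a flat token sequence with bracket and
# label tokens, so a standard text encoder can read the structure.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBZ chases) (NP (DT a) (NN ball))))")

def linearize(tree) -> list[str]:
    """Depth-first traversal emitting structure tokens alongside words."""
    if isinstance(tree, str):            # leaf: a plain word
        return [tree]
    tokens = [f"({tree.label()}"]        # opening token carries the label
    for child in tree:
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens

print(" ".join(linearize(parse)))
# (S (NP (DT the ) (NN dog ) ) (VP (VBZ chases ) (NP (DT a ) (NN ball ) ) ) )
```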
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)