LDCA: Local Descriptors with Contextual Augmentation for Few-Shot
Learning
- URL: http://arxiv.org/abs/2401.13499v1
- Date: Wed, 24 Jan 2024 14:44:48 GMT
- Title: LDCA: Local Descriptors with Contextual Augmentation for Few-Shot
Learning
- Authors: Maofa Wang and Bingchen Yan
- Abstract summary: We introduce a novel approach termed "Local Descriptor with Contextual Augmentation (LDCA)"
LDCA bridges the gap between local and global understanding by leveraging an adaptive global contextual enhancement module.
Experiments underscore the efficacy of our method, showing a maximal absolute improvement of 20% over the next-best on fine-grained classification datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot image classification has emerged as a key challenge in the field of
computer vision, highlighting the capability to rapidly adapt to new tasks with
minimal labeled data. Existing methods predominantly rely on image-level
features or local descriptors, often overlooking the holistic context
surrounding these descriptors. In this work, we introduce a novel approach
termed "Local Descriptor with Contextual Augmentation (LDCA)". Specifically,
this method uniquely bridges the gap between local and global understanding by
leveraging an adaptive global contextual enhancement module. This module
incorporates a visual transformer, endowing local descriptors with contextual
awareness capabilities, ranging from broad global perspectives to intricate
surrounding nuances. By doing so, LDCA transcends traditional descriptor-based
approaches, ensuring each local feature is interpreted within its larger visual
narrative. Extensive experiments underscore the efficacy of our method, showing
a maximal absolute improvement of 20% over the next-best on fine-grained
classification datasets, thus demonstrating significant advancements in
few-shot classification tasks.
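The paper's code is not reproduced here, so the following is a minimal, hypothetical sketch of the pipeline the abstract describes: local descriptors are extracted from a CNN feature map, contextually augmented by a small transformer encoder (standing in for the adaptive global contextual enhancement module), and then matched to support-class descriptors with a DN4-style cosine k-NN rule. All names, dimensions, the shallow backbone, and the k-NN matching choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LDCASketch(nn.Module):
    """Hypothetical LDCA-style model: local descriptors + global context."""

    def __init__(self, dim=64, n_heads=4, n_layers=1):
        super().__init__()
        # Shallow CNN backbone producing an HxW grid of local descriptors.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Assumed stand-in for the "adaptive global contextual enhancement"
        # module: a transformer encoder that lets each local descriptor
        # attend to every other descriptor in the image.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=n_layers)

    def descriptors(self, images):
        feats = self.backbone(images)              # (B, C, H, W)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H*W, C) local descriptors
        tokens = self.context(tokens)              # contextually augmented descriptors
        return F.normalize(tokens, dim=-1)         # unit-norm for cosine similarity


def knn_score(query_desc, support_desc, k=3):
    """DN4-style image-to-class score: for each query descriptor, sum its
    top-k cosine similarities to a class's support descriptors."""
    sim = query_desc @ support_desc.t()            # (Nq, Ns)
    return sim.topk(k, dim=1).values.sum()


# Toy 2-way 1-shot episode with random tensors, just to show the data flow.
model = LDCASketch()
support = torch.randn(2, 3, 32, 32)                # one support image per class
query = torch.randn(1, 3, 32, 32)
s_desc = model.descriptors(support)                # (2, L, C)
q_desc = model.descriptors(query)[0]               # (L, C)
scores = torch.stack([knn_score(q_desc, s_desc[c]) for c in range(2)])
pred = scores.argmax().item()                      # predicted class index
```

In this reading, the contextual augmentation happens before matching, so each local descriptor carries image-level context into the descriptor-to-class comparison; the few-shot classifier itself stays non-parametric.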
Related papers
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
In particular, our approach allows for a more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z)
- Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model [61.389233691596004]
We introduce the DiffPNG framework, which capitalizes on the diffusion's architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps.
Our experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting.
arXiv Detail & Related papers (2024-07-07T13:06:34Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Simple Image-level Classification Improves Open-vocabulary Object Detection [27.131298903486474]
Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set of base categories on which the detection model is trained.
Recent OVOD methods focus on adapting image-level pre-trained vision-language models (VLMs), such as CLIP, to the region-level object detection task via, e.g., region-level knowledge distillation, regional prompt learning, or region-text pre-training.
These methods have demonstrated remarkable performance in recognizing regional visual concepts, but they are weak at exploiting the VLMs' powerful global scene understanding ability learned from billion-scale image-text data.
arXiv Detail & Related papers (2023-12-16T13:06:15Z)
- VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment [52.489874804051304]
VoLTA is a new vision-language pre-training paradigm that utilizes only image-caption data yet achieves fine-grained, region-level image understanding.
VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training.
Experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA.
arXiv Detail & Related papers (2022-10-09T01:49:58Z)
- Region-level Active Learning for Cluttered Scenes [60.93811392293329]
We introduce a new strategy that subsumes previous Image-level and Object-level approaches into a generalized, Region-level approach.
We show that this approach significantly decreases labeling effort and improves rare object search on realistic data with inherent class-imbalance and cluttered scenes.
arXiv Detail & Related papers (2021-08-20T14:02:38Z)
- Gait Recognition via Effective Global-Local Feature Representation and Local Temporal Aggregation [28.721376937882958]
Gait recognition is one of the most important biometric technologies and has been applied in many fields.
Recent gait recognition frameworks represent each gait frame by descriptors extracted from either global appearances or local regions of humans.
We propose a novel feature extraction and fusion framework to achieve discriminative feature representations for gait recognition.
arXiv Detail & Related papers (2020-11-03T04:07:13Z)
- Fine-Grained Image Captioning with Global-Local Discriminative Objective [80.73827423555655]
We propose a novel global-local discriminative objective to facilitate generating fine-grained descriptive captions.
We evaluate the proposed method on the widely used MS-COCO dataset.
arXiv Detail & Related papers (2020-07-21T08:46:02Z)
- Weakly-supervised Object Localization for Few-shot Learning and Fine-grained Few-shot Learning [0.5156484100374058]
Few-shot learning aims to learn novel visual categories from very few samples.
We propose a Self-Attention Based Complementary Module (SAC Module) to perform weakly-supervised object localization.
We also produce the activated masks for selecting discriminative deep descriptors for few-shot classification.
arXiv Detail & Related papers (2020-03-02T14:07:05Z)
- Global Context-Aware Progressive Aggregation Network for Salient Object Detection [117.943116761278]
We propose a novel network named GCPANet to integrate low-level appearance features, high-level semantic features, and global context features.
We show that the proposed approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2020-03-02T04:26:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.