HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones
- URL: http://arxiv.org/abs/2508.21539v1
- Date: Fri, 29 Aug 2025 11:50:24 GMT
- Title: HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones
- Authors: Hao Ruan, Jinliang Lin, Yingxin Lai, Zhiming Luo, Shaozi Li
- Abstract summary: Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. The wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. We propose the Hierarchical Cross-Granularity Contrastive and Matching learning framework with two components.
- Score: 29.663691563826095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
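The contrastive objective and the Recall@K metric named in the abstract can be illustrated with a minimal sketch. This is a generic CLIP-style symmetric InfoNCE loss, not the authors' RG-ITC implementation; under that assumption, a region-global variant would pass pooled region features in place of `image_feats`. The function names, the toy vectors, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]
    a_n = [unit(v) for v in a]
    b_n = [unit(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b_n] for u in a_n]

def log_softmax_row(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def symmetric_info_nce(image_feats, text_feats, temperature=0.07):
    """Mean of image-to-text and text-to-image cross-entropy.

    Assumes pair i in image_feats matches pair i in text_feats.
    The temperature value is a hypothetical default, not taken from the paper.
    """
    sim = cosine_sim_matrix(image_feats, text_feats)
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax_row(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)

def recall_at_k(sim, k=1):
    """Fraction of queries (rows) whose same-index item ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top
    return hits / len(sim)

# Toy example: two image vectors and two roughly aligned text vectors.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.2, 0.8]]
loss = symmetric_info_nce(images, texts)
r1 = recall_at_k(cosine_sim_matrix(images, texts), k=1)
```

In this toy setup each image is closest to its own text, so the loss is small and falls below the chance level of log 2 for a batch of two; shuffling the texts raises it.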
Related papers
- Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning [56.6025512458557]
Motion-language retrieval aims to bridge the semantic gap between natural language and human motion. Existing approaches predominantly focus on aligning entire motion sequences with global textual representations. We propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval.
arXiv Detail & Related papers (2026-01-29T16:00:12Z)
- HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models [63.87966115136411]
HarmoCLIP is a novel framework designed to harmonize global and region representations within Contrastive Language-Image Pre-training. To strengthen the representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy.
arXiv Detail & Related papers (2025-11-27T16:24:53Z)
- Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments [0.0]
This work proposes a vision-language integration framework that unifies pre-trained visual encoders and large language models. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics.
arXiv Detail & Related papers (2025-10-29T01:16:21Z)
- GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition [72.29071664964633]
We propose GLip, a Global-Local Integrated Progressive framework designed for robust visual speech recognition (VSR). GLip learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data. In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context.
arXiv Detail & Related papers (2025-09-19T14:36:01Z)
- Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z)
- HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation [27.770224730465237]
We propose a Hierarchical Cross-Modal Alignment (HCMA) framework for grounded text-to-image generation. HCMA integrates two alignment modules into each diffusion sampling step. Experiments on the MS-COCO 2014 validation set show that HCMA surpasses state-of-the-art baselines.
arXiv Detail & Related papers (2025-05-10T05:02:58Z)
- Universal Scene Graph Generation [77.53076485727414]
We present Universal SG (USG), a novel representation capable of characterizing comprehensive semantic scenes. We also introduce USG-Par, which effectively addresses two key bottlenecks of cross-modal object alignment and out-of-domain challenges.
arXiv Detail & Related papers (2025-03-19T08:55:06Z)
- Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z)
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
Our approach, in particular, allows for more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z)
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a working strategy of trilogy: distributing language feature, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
- Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching [60.645802236700035]
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets.
We introduce GeoText-1652, a new natural language-guided geo-localization benchmark.
This dataset is systematically constructed through an interactive human-computer process.
arXiv Detail & Related papers (2023-11-21T17:52:30Z)
- Region-level Contrastive and Consistency Learning for Semi-Supervised Semantic Segmentation [30.1884540364192]
We propose a novel region-level contrastive and consistency learning framework (RC2L) for semi-supervised semantic segmentation.
Specifically, we first propose a Region Mask Contrastive (RMC) loss and a Region Feature Contrastive (RFC) loss to accomplish region-level contrastive property.
Based on the proposed region-level contrastive and consistency regularization, we develop a region-level contrastive and consistency learning framework (RC2L) for semi-supervised semantic segmentation.
arXiv Detail & Related papers (2022-04-28T07:22:47Z)
- Similarity Reasoning and Filtration for Image-Text Matching [85.68854427456249]
We propose a novel Similarity Graph Reasoning and Attention Filtration network for image-text matching.
A Similarity Graph Reasoning (SGR) module relying on one graph convolutional neural network is introduced to infer relation-aware similarities with both the local and global alignments.
We demonstrate the superiority of the proposed method by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets.
arXiv Detail & Related papers (2021-01-05T06:29:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.