TCMA: Text-Conditioned Multi-granularity Alignment for Drone Cross-Modal Text-Video Retrieval
- URL: http://arxiv.org/abs/2510.10180v1
- Date: Sat, 11 Oct 2025 11:38:01 GMT
- Title: TCMA: Text-Conditioned Multi-granularity Alignment for Drone Cross-Modal Text-Video Retrieval
- Authors: Zixu Zhao, Yang Zhan
- Abstract summary: Unmanned aerial vehicles (UAVs) have become powerful platforms for real-time, high-resolution data collection. Efficient retrieval of relevant content from these videos is crucial for applications in urban management, emergency response, security, and disaster relief. We construct the Drone Video-Text Match Dataset (DVTMD), which contains 2,864 videos and 14,320 fine-grained, semantically diverse captions.
- Score: 5.527227553079524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unmanned aerial vehicles (UAVs) have become powerful platforms for real-time, high-resolution data collection, producing massive volumes of aerial videos. Efficient retrieval of relevant content from these videos is crucial for applications in urban management, emergency response, security, and disaster relief. While text-video retrieval has advanced in natural video domains, the UAV domain remains underexplored due to limitations in existing datasets, such as coarse and redundant captions. Thus, in this work, we construct the Drone Video-Text Match Dataset (DVTMD), which contains 2,864 videos and 14,320 fine-grained, semantically diverse captions. The annotations capture multiple complementary aspects, including human actions, objects, background settings, environmental conditions, and visual style, thereby enhancing text-video correspondence and reducing redundancy. Building on this dataset, we propose the Text-Conditioned Multi-granularity Alignment (TCMA) framework, which integrates global video-sentence alignment, sentence-guided frame aggregation, and word-guided patch alignment. To further refine local alignment, we design a Word and Patch Selection module that filters irrelevant content, as well as a Text-Adaptive Dynamic Temperature Mechanism that adapts attention sharpness to text type. Extensive experiments on DVTMD and CapERA establish the first complete benchmark for drone text-video retrieval. Our TCMA achieves state-of-the-art performance, including 45.5% R@1 in text-to-video and 42.8% R@1 in video-to-text retrieval, demonstrating the effectiveness of our dataset and method. The code and dataset will be released.
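The abstract pairs sentence-guided frame aggregation with a Text-Adaptive Dynamic Temperature Mechanism that adapts attention sharpness to the text. Below is a minimal PyTorch sketch of how such a module could look; the module name, the temperature head, and all dimensions are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of sentence-guided frame aggregation with a
# text-adaptive temperature. Names and parameterization are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceGuidedFrameAggregation(nn.Module):
    """Pools per-frame features into one video vector, weighted by their
    relevance to the sentence embedding; the attention sharpness is
    predicted from the text itself (hypothetical parameterization)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Predicts a positive per-query temperature from the sentence embedding.
        self.temp_head = nn.Sequential(nn.Linear(dim, 1), nn.Softplus())

    def forward(self, frame_feats: torch.Tensor, sent_feat: torch.Tensor):
        # frame_feats: (B, T, D) per-frame embeddings; sent_feat: (B, D).
        frames = F.normalize(frame_feats, dim=-1)
        sent = F.normalize(sent_feat, dim=-1)
        # Frame-sentence relevance scores: (B, T).
        scores = torch.einsum("btd,bd->bt", frames, sent)
        # Text-adaptive temperature: small tau -> sharp attention.
        tau = self.temp_head(sent_feat) + 1e-4        # (B, 1), strictly > 0
        weights = F.softmax(scores / tau, dim=-1)     # (B, T)
        # Aggregate frames into one sentence-conditioned video vector.
        video_feat = torch.einsum("bt,btd->bd", weights, frames)
        return video_feat, weights

agg = SentenceGuidedFrameAggregation(dim=512)
frames = torch.randn(2, 12, 512)   # 2 videos, 12 frames each
sents = torch.randn(2, 512)        # 2 query sentences
video_vecs, attn = agg(frames, sents)
sim = F.cosine_similarity(video_vecs, sents)  # retrieval scores, (2,)
```

Retrieval similarity would then be the cosine between the sentence-conditioned video vector and the sentence embedding; a learned, text-dependent temperature lets concrete queries attend sharply to a few frames while vague queries spread attention broadly.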
Related papers
- Beyond Simple Edits: Composed Video Retrieval with Dense Modifications [96.46069692338645]
We introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments. Dense-WebVid-CoVR consists of 1.6 million samples with dense modification text, around seven times more than its existing counterpart. We develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion.
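The entry names Cross-Attention (CA) fusion as the integration mechanism. A minimal sketch, assuming visual tokens query the modification text; the module name and dimensions are hypothetical, not from the paper:

```python
# A minimal sketch of cross-attention (CA) fusion between visual and
# textual tokens. Module and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        # visual: (B, Nv, D) patch/frame tokens; text: (B, Nt, D) word tokens.
        # Visual tokens attend to the text and absorb its fine-grained details.
        fused, _ = self.attn(query=visual, key=text, value=text)
        return self.norm(visual + fused)  # residual connection + layer norm

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 16, 512), torch.randn(2, 10, 512))  # (2, 16, 512)
```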
arXiv Detail & Related papers (2025-08-19T17:59:39Z)
- VidText: Towards Comprehensive Evaluation for Video Text Understanding [56.121054697977115]
VidText is a benchmark for comprehensive and in-depth evaluation of video text understanding. It covers a wide range of real-world scenarios and supports multilingual content. It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks.
arXiv Detail & Related papers (2025-05-28T19:39:35Z)
- VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement [63.4357918830628]
VideoRepair is a model-agnostic, training-free video refinement framework. It identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback. VideoRepair substantially outperforms recent baselines across various text-video alignment metrics.
arXiv Detail & Related papers (2024-11-22T18:31:47Z)
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net). First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
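A minimal sketch of the second idea, syntax-hierarchy-guided similarity, under the assumption of a two-level hierarchy (whole sentence plus parsed phrases) and a fixed mixing weight; neither detail is from the paper:

```python
# A minimal sketch of hierarchy-guided similarity. The two-level hierarchy
# and the fixed weight alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def hierarchy_guided_similarity(video_feat, sent_feat, phrase_feats, alpha=0.5):
    """video_feat: (D,), sent_feat: (D,), phrase_feats: (P, D) embeddings
    of syntactic constituents (e.g., noun/verb phrases from a parser)."""
    # Coarse score: whole sentence vs. video.
    sent_sim = F.cosine_similarity(video_feat, sent_feat, dim=0)
    # Fine score: let the best-matching constituent dominate.
    phrase_sims = F.cosine_similarity(phrase_feats, video_feat.unsqueeze(0), dim=1)
    return alpha * sent_sim + (1 - alpha) * phrase_sims.max()

v, s, p = torch.randn(512), torch.randn(512), torch.randn(3, 512)
score = hierarchy_guided_similarity(v, s, p)  # scalar retrieval score
```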
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, which uses only text queries together with uncurated web videos during training.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- TVPR: Text-to-Video Person Retrieval and a New Benchmark [10.960048626531993]
We propose a novel Text-to-Video Person Retrieval (TVPR) task. Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset. We introduce a Multielement Feature Guided Fragments Learning (MFGF) strategy, which leverages the cross-modal text-video representations to provide strong text-visual and text-motion matching information.
arXiv Detail & Related papers (2023-07-14T06:34:00Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach that learns better video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence (a minimal sketch follows this entry).
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
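As referenced in the FineCo entry above, here is a minimal sketch of fine-grained frame selection feeding a contrastive objective; the top-k selection and the symmetric InfoNCE form are assumptions, not the paper's exact loss:

```python
# A minimal sketch of text-guided frame selection for contrastive learning.
# Top-k selection, mean pooling, and the loss form are assumptions.
import torch
import torch.nn.functional as F

def select_frames_and_contrast(frame_feats, text_feats, k=4, tau=0.07):
    """frame_feats: (B, T, D) per-frame embeddings, text_feats: (B, D).
    Keeps the k frames most similar to the paired text, pools them, and
    applies a symmetric InfoNCE loss over the batch."""
    frames = F.normalize(frame_feats, dim=-1)
    texts = F.normalize(text_feats, dim=-1)
    # Per-frame relevance to the paired text: (B, T).
    rel = torch.einsum("btd,bd->bt", frames, texts)
    topk = rel.topk(k, dim=1).indices                          # (B, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, frames.size(-1))   # (B, k, D)
    video = F.normalize(frames.gather(1, idx).mean(dim=1), dim=-1)  # (B, D)
    logits = video @ texts.t() / tau                           # (B, B)
    labels = torch.arange(logits.size(0))                      # matched pairs
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = select_frames_and_contrast(torch.randn(8, 12, 512), torch.randn(8, 512))
```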