Text-based Aerial-Ground Person Retrieval
- URL: http://arxiv.org/abs/2511.08369v1
- Date: Wed, 12 Nov 2025 01:55:55 GMT
- Title: Text-based Aerial-Ground Person Retrieval
- Authors: Xinyu Zhou, Yu Wu, Jiayao Ma, Wenhao Wang, Min Cao, Mang Ye
- Abstract summary: This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR). It aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions.
- Score: 55.31140361809554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR introduces greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions, enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture of experts module to learn view-specific and view-agnostic features and a viewpoint decoupling strategy to decouple view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES dataset and existing T-PR benchmarks. The dataset and code are available at https://github.com/Flame-Chasers/TAG-PR.
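To make the abstract's architecture concrete, below is a minimal, hypothetical PyTorch sketch of how a hierarchically-routed mixture of experts could combine a view-agnostic expert with view-specific (aerial vs. ground) experts over CLIP-style image features. This is not the authors' TAG-CLIP implementation; the module, expert, and router names and all dimensions are invented for illustration, and the viewpoint decoupling strategy and text branch are omitted.

```python
# Hypothetical sketch (not the authors' code): one way a hierarchically-routed
# mixture of experts might separate view-agnostic from view-specific person
# features, as described in the TAG-PR abstract. All names are invented.
import torch
import torch.nn as nn


class HierarchicalViewMoE(nn.Module):
    """Applies a shared (view-agnostic) expert to every sample, then softly
    routes each sample to one of two view-specific experts (aerial vs. ground)."""

    def __init__(self, dim: int = 512, hidden: int = 2048):
        super().__init__()
        # View-agnostic expert: applied to all samples regardless of view.
        self.shared_expert = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        # View-specific experts: index 0 = aerial, index 1 = ground.
        self.view_experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(2)]
        )
        # Soft router predicting per-view weights from the feature itself.
        self.router = nn.Linear(dim, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) global image features, e.g. from a CLIP image encoder.
        agnostic = self.shared_expert(x)
        gate = self.router(x).softmax(dim=-1)                               # (batch, 2)
        specific = torch.stack([e(x) for e in self.view_experts], dim=1)    # (batch, 2, dim)
        specific = (gate.unsqueeze(-1) * specific).sum(dim=1)               # (batch, dim)
        return agnostic + specific


if __name__ == "__main__":
    feats = torch.randn(4, 512)   # dummy CLIP-style image features
    moe = HierarchicalViewMoE()
    print(moe(feats).shape)       # torch.Size([4, 512])
```

In the actual framework the routing would presumably be conditioned on richer viewpoint cues and trained jointly with the cross-modal alignment objective; see the released code at https://github.com/Flame-Chasers/TAG-PR for the real design.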
Related papers
- GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval [12.483996028288407]
Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions.
To address the limitations of existing methods, we propose Generation-Enhanced Alignment (GEA) from a generative perspective.
We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA.
arXiv Detail & Related papers (2025-11-13T10:06:41Z)
- Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval [66.61856014573742]
Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description.
Previous methods have attempted to align text and image samples in a modal-shared space.
We propose an effective bi-directional one-to-many embedding paradigm that offers a clear optimization direction for each sample.
arXiv Detail & Related papers (2024-06-09T03:06:55Z)
- TAGA: Text-Attributed Graph Self-Supervised Learning by Synergizing Graph and Text Mutual Transformations [15.873944819608434]
Text-Attributed Graphs (TAGs) enhance graph structures with natural language descriptions.
This paper introduces a new self-supervised learning framework, Text-And-Graph Multi-View Alignment (TAGA), which integrates TAGs' structural and semantic dimensions.
Our framework demonstrates strong performance in zero-shot and few-shot scenarios across eight real-world datasets.
arXiv Detail & Related papers (2024-05-27T03:40:16Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [12.057465578064345]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z)
- Learning Semantic-Aligned Feature Representation for Text-based Person Search [8.56017285139081]
We propose a semantic-aligned embedding method for text-based person search.
The feature alignment across modalities is achieved by automatically learning the semantic-aligned visual features and textual features.
Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performances.
arXiv Detail & Related papers (2021-12-13T14:54:38Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-Driven Referring Image Segmentation framework.
CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Text-based Person Search in Full Images via Semantic-Driven Proposal Generation [42.25611020956918]
We propose a new end-to-end learning framework that jointly optimizes the pedestrian detection, identification, and visual-semantic feature embedding tasks.
To take full advantage of the query text, the semantic features are leveraged to instruct the Region Proposal Network to pay more attention to the text-described proposals.
arXiv Detail & Related papers (2021-09-27T11:42:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.