Related papers: GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval

URL: http://arxiv.org/abs/2511.10154v1
Date: Fri, 14 Nov 2025 01:35:34 GMT
Title: GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval
Authors: Hao Zou, Runqing Zhang, Xue Zhou, Jianxiao Zou
Abstract summary: Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. To address these limitations, we propose the Generation-Enhanced Alignment (GEA) from a generative perspective. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA.
Score: 12.483996028288407
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, sometimes textual queries cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose the Generation-Enhanced Alignment (GEA) from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and facilitate cross-modal alignment. (2) Generative Intermediate Fusion (GIF), which combines cross-attention between generated images, original images, and text features to generate a unified representation optimized by triplet alignment loss. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA. The results justify the effectiveness of our method. More implementation details and extended results are available at https://github.com/sugelamyd123/Sup-for-GEA.
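
The GIF module described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the attention layout or the exact form of the "triplet alignment loss", so everything below is an assumption made for illustration: single-head scaled dot-product attention, a hypothetical `fuse` helper that averages two attention outputs, and a standard margin-based triplet loss on cosine similarity. It is not the authors' implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """Single-head scaled dot-product attention.

    One query vector attends over `tokens`, which serve as both keys
    and values (an assumption; no learned projections are used here).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    dim = len(tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]


def fuse(text_vec, generated_tokens, image_tokens):
    """Hypothetical fusion: the text queries the generated-image tokens,
    that result queries the original-image tokens, and the two outputs
    are averaged into one representation."""
    from_generated = cross_attend(text_vec, generated_tokens)
    from_image = cross_attend(from_generated, image_tokens)
    return [(a + b) / 2 for a, b in zip(from_generated, from_image)]


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on cosine similarity (assumed form)."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))
```

In this sketch, a fused representation would act as the anchor, with a matching and a non-matching embedding as the positive and negative. The loss is zero once the anchor is closer to the positive than to the negative by at least the margin.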

Related papers

Unified Text-Image Generation with Weakness-Targeted Post-Training [57.956648078400775]
Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis.
arXiv Detail & Related papers (2026-01-07T19:19:44Z)
Text-based Aerial-Ground Person Retrieval [55.31140361809554]
This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR). It aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions.
arXiv Detail & Related papers (2025-11-11T15:49:04Z)
Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective [23.51937497342985]
Multimodal Misinformation Detection (MMD) refers to the task of detecting social media posts involving misinformation, where the post often contains text and image modalities. We propose a new MMD method named RETSIMD. Specifically, we suppose that each text can be divided into several segments, and each text segment describes a partial scene that can be presented by an image. We further incorporate two auxiliary objectives concerning text-image and image-label mutual information, and further post-train the generator over an auxiliary text-to-image generation benchmark dataset.
arXiv Detail & Related papers (2025-11-09T08:37:46Z)
TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion [55.34830989105704]
Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities. We introduce textual semantics at two levels: the mask semantic level and the text semantic level. We propose Textual Semantic Guidance for infrared and visible image fusion, which guides the image synthesis process.
arXiv Detail & Related papers (2025-06-20T03:53:07Z)
Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild. The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [12.057465578064345]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval. First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively. Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module. Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
Towards Better Text-Image Consistency in Text-to-Image Generation [15.735515302139335]
We develop a novel CLIP-based metric termed as Semantic Similarity Distance (SSD). We further design the Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which can fuse semantic information at different granularities. Our PDF-GAN can lead to significantly better text-image consistency while maintaining decent image quality on the CUB and COCO datasets.
arXiv Detail & Related papers (2022-10-27T07:47:47Z)
Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval [142.047662926209]
We propose a novel framework for paired data augmentation by uncovering the hidden semantic information of StyleGAN2 model. We generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module. We evaluate the efficacy of our augmented data approach on two public cross-modal retrieval datasets.
arXiv Detail & Related papers (2022-07-29T01:21:54Z)
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators. We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output. Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.