VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search
- URL: http://arxiv.org/abs/2311.07514v1
- Date: Mon, 13 Nov 2023 17:56:54 GMT
- Title: VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search
- Authors: Shuting He, Hao Luo, Wei Jiang, Xudong Jiang, Henghui Ding
- Abstract summary: We propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search.
In VGSG, a vision-guided attention is employed to extract visual-related textual features.
With the help of relational knowledge transfer, VGKT is capable of aligning semantic-group textual features with corresponding visual features.
- Score: 51.9899504535878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based Person Search (TBPS) aims to retrieve images of target pedestrian
indicated by textual descriptions. It is essential for TBPS to extract
fine-grained local features and align them across modalities. Existing methods
utilize external tools or heavy cross-modal interaction to achieve explicit
alignment of cross-modal fine-grained features, which is inefficient and
time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network
(VGSG) for text-based person search to extract well-aligned fine-grained visual
and textual features. In the proposed VGSG, we develop a Semantic-Group Textual
Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to
extract textual local features under the guidance of visual local clues. In
SGTL, in order to obtain the local textual representation, we group textual
features from the channel dimension based on the semantic cues of language
expression, which encourages similar semantic patterns to be grouped implicitly
without external tools. In VGKT, a vision-guided attention is employed to
extract visual-related textual features, which are inherently aligned with
visual cues and termed vision-guided textual features. Furthermore, we design a
relational knowledge transfer, including a vision-language similarity transfer
and a class probability transfer, to adaptively propagate information of the
vision-guided textual features to semantic-group textual features. With the
help of relational knowledge transfer, VGKT is capable of aligning
semantic-group textual features with corresponding visual features without
external tools and complex pairwise interaction. Experimental results on two
challenging benchmarks demonstrate its superiority over state-of-the-art
methods.
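As a reading aid, the sketch below shows one way the two modules described above could be wired together in PyTorch. It is a minimal sketch assuming CLIP-style token features; the group count, pooling choice, temperature, projection layers, and the exact form of the similarity and class-probability transfers are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SGTL(nn.Module):
    """Semantic-Group Textual Learning (sketch): split the textual feature
    along the channel dimension into K groups so that similar semantic
    patterns are grouped implicitly, without external NLP tools."""

    def __init__(self, dim=512, num_groups=4):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        # one lightweight projection per channel group (illustrative choice)
        self.group_proj = nn.ModuleList(
            [nn.Linear(dim // num_groups, dim) for _ in range(num_groups)]
        )

    def forward(self, txt_tokens):                 # txt_tokens: (B, L, D)
        sentence = txt_tokens.max(dim=1).values    # (B, D) pooled sentence feature
        chunks = sentence.chunk(self.num_groups, dim=-1)
        # (B, K, D) semantic-group textual features
        return torch.stack([p(c) for p, c in zip(self.group_proj, chunks)], dim=1)


class VGKT(nn.Module):
    """Vision-Guided Knowledge Transfer (sketch): build vision-guided textual
    features with cross-attention, then transfer their relational knowledge
    (vision-language similarity and class probabilities) to the
    semantic-group textual features."""

    def __init__(self, dim=512, num_ids=1000, tau=0.02):
        super().__init__()
        self.scale = dim ** -0.5
        self.tau = tau
        self.classifier = nn.Linear(dim, num_ids)  # shared identity classifier

    def forward(self, vis_parts, txt_tokens, sg_txt):
        # vis_parts: (B, K, D) local visual features, txt_tokens: (B, L, D)
        attn = torch.softmax(vis_parts @ txt_tokens.transpose(1, 2) * self.scale, dim=-1)
        vg_txt = attn @ txt_tokens                 # (B, K, D) vision-guided textual features

        def part_sim(txt):                         # cosine similarity to each visual part
            return (F.normalize(txt, dim=-1) * F.normalize(vis_parts, dim=-1)).sum(-1) / self.tau

        # vision-language similarity transfer: teacher = vg_txt, student = sg_txt
        loss_sim = F.kl_div(F.log_softmax(part_sim(sg_txt), dim=-1),
                            F.softmax(part_sim(vg_txt), dim=-1).detach(),
                            reduction="batchmean")
        # class-probability transfer on identity logits
        loss_cls = F.kl_div(F.log_softmax(self.classifier(sg_txt), dim=-1),
                            F.softmax(self.classifier(vg_txt), dim=-1).detach(),
                            reduction="batchmean")
        return vg_txt, loss_sim + loss_cls
```

The key idea this sketch tries to capture is that the vision-guided textual features act as an online teacher: the semantic-group features are trained to match their vision-language similarity distribution and class probabilities, so no external parser or pairwise cross-modal interaction is needed at retrieval time.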
Related papers
- Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features.
Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with much fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
arXiv Detail & Related papers (2024-04-19T07:24:32Z) - Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval [7.118271398274512]
We propose a novel Direction-Oriented Visual-semantic Embedding Model (DOVE) to mine the relationship between vision and language.
Our highlight is to direct the visual and textual representations in latent space as close as possible to a redundancy-free regional visual representation.
We exploit a global visual-semantic constraint to reduce single visual dependency and serve as an external constraint for the final visual and textual representations.
arXiv Detail & Related papers (2023-10-12T12:28:47Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
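For readers unfamiliar with query-based spotters, the following is a minimal sketch of how a shared Transformer decoder can feed classification, segmentation, and recognition branches from the same queries; the dimensions, query count, vocabulary size, and head designs are assumptions for illustration, not TextFormer's actual configuration.

```python
import torch
import torch.nn as nn


class QueryTextSpotter(nn.Module):
    """Sketch of a query-based text spotter: learnable queries attend to
    flattened image features through a Transformer decoder, and the shared
    query embeddings feed classification, segmentation, and recognition heads."""

    def __init__(self, dim=256, num_queries=100, vocab_size=97, max_len=25):
        super().__init__()
        self.max_len = max_len
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.cls_head = nn.Linear(dim, 2)                     # text / background
        self.seg_head = nn.Linear(dim, dim)                   # mask embedding
        self.rec_head = nn.Linear(dim, max_len * vocab_size)  # character logits

    def forward(self, img_feats):                  # img_feats: (B, HW, D)
        B = img_feats.size(0)
        tgt = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(tgt, img_feats)           # (B, Q, D) shared query features
        scores = self.cls_head(q)                  # (B, Q, 2)
        masks = torch.einsum("bqd,bpd->bqp", self.seg_head(q), img_feats)  # (B, Q, HW)
        chars = self.rec_head(q).view(B, q.size(1), self.max_len, -1)      # (B, Q, T, V)
        return scores, masks, chars
```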
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aligned vision-language pre-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task by a novel Bottom-up crOss-modal Semantic compoSition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - Learning Semantic-Aligned Feature Representation for Text-based Person
Search [8.56017285139081]
We propose a semantic-aligned embedding method for text-based person search.
The feature alignment across modalities is achieved by automatically learning the semantic-aligned visual features and textual features.
Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-13T14:54:38Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
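The text-to-pixel alignment mentioned here can be illustrated with a generic contrastive objective between a sentence embedding and per-pixel embeddings; the sketch below follows that generic form and is not CRIS's exact loss (the shapes, temperature, and BCE formulation are assumptions).

```python
import torch
import torch.nn.functional as F


def text_to_pixel_contrastive(pixel_feats, text_feat, gt_mask, tau=0.07):
    """pixel_feats: (B, D, H, W); text_feat: (B, D); gt_mask: (B, H, W) in {0, 1}.
    Pull pixels inside the referred region toward the sentence feature and push
    background pixels away, via binary cross-entropy on cosine-similarity logits."""
    pix = F.normalize(pixel_feats.flatten(2), dim=1)      # (B, D, HW)
    txt = F.normalize(text_feat, dim=1).unsqueeze(1)      # (B, 1, D)
    logits = torch.bmm(txt, pix).squeeze(1) / tau         # (B, HW) text-to-pixel scores
    return F.binary_cross_entropy_with_logits(logits, gt_mask.flatten(1).float())
```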
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation [5.064384692591668]
This paper proposes LAViTeR, a novel architecture for visual and textual representation learning.
The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks: GAN-based image synthesis and image captioning.
The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment.
arXiv Detail & Related papers (2021-09-04T22:48:46Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z) - MUTATT: Visual-Textual Mutual Guidance for Referring Expression
Comprehension [16.66775734538439]
Referring expression comprehension aims to localize a text-related region in a given image by a referring expression in natural language.
We argue that for REC the referring expression and the target region are semantically correlated.
We propose a novel approach called MutAtt to construct mutual guidance between vision and language.
arXiv Detail & Related papers (2020-03-18T03:14:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.