ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural
Language
- URL: http://arxiv.org/abs/2005.07327v2
- Date: Thu, 30 Jul 2020 07:05:00 GMT
- Title: ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural
Language
- Authors: Zhe Wang, Zhiyuan Fang, Jun Wang, Yezhou Yang
- Abstract summary: Person search by natural language aims at retrieving a specific person in a large-scale image pool that matches the given textual descriptions.
We propose an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions.
We achieve success, along with a performance boost, through robust feature learning.
- Score: 36.319953919737245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Person search by natural language aims at retrieving a specific person in a
large-scale image pool that matches the given textual descriptions. While most
of the current methods treat the task as a holistic visual and textual feature
matching one, we approach it from an attribute-aligning perspective that allows
grounding specific attribute phrases to the corresponding visual regions. We
achieve success, along with a performance boost, through robust feature
learning in which the referred identity can be accurately bundled by multiple
attribute visual cues. To be concrete, our Visual-Textual Attribute Alignment
model (dubbed as ViTAA) learns to disentangle the feature space of a person
into subspaces corresponding to attributes using a light auxiliary attribute
segmentation computing branch. It then aligns these visual features with the
textual attributes parsed from the sentences by using a novel contrastive
learning loss. Upon that, we validate our ViTAA framework through extensive
experiments on tasks of person search by natural language and by
attribute-phrase queries, on which our system achieves state-of-the-art
performances. Code will be publicly available upon publication.
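To make the alignment idea concrete, below is a minimal PyTorch-style sketch of a per-attribute contrastive alignment between visual and textual features. The attribute names, the symmetric InfoNCE form, and the temperature value are illustrative assumptions for this sketch only, not the exact loss formulation used in the ViTAA paper.

```python
# Illustrative sketch: align visual and textual features attribute-by-attribute
# with a generic contrastive objective. This is NOT the exact ViTAA loss.
import torch
import torch.nn.functional as F


def attribute_alignment_loss(visual_feats, textual_feats, temperature=0.07):
    """visual_feats, textual_feats: dicts mapping an attribute name
    (e.g. 'hair', 'upper_clothes') to [batch, dim] tensors, where row i of
    both tensors describes the same person identity."""
    total = 0.0
    for attr, v in visual_feats.items():
        t = textual_feats[attr]
        v = F.normalize(v, dim=-1)
        t = F.normalize(t, dim=-1)
        # Cosine similarity between every visual/textual pair in the batch.
        logits = v @ t.T / temperature
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: matched (visual, textual) pairs are positives,
        # all other pairings in the batch act as negatives.
        loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))
        total = total + loss
    return total / len(visual_feats)


# Hypothetical usage with random features for two attribute subspaces.
B, D = 8, 256
vis = {'hair': torch.randn(B, D), 'upper_clothes': torch.randn(B, D)}
txt = {'hair': torch.randn(B, D), 'upper_clothes': torch.randn(B, D)}
print(attribute_alignment_loss(vis, txt))
```

In this sketch, averaging the per-attribute losses treats each attribute subspace equally; a weighting scheme per attribute would be an equally plausible design choice.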
Related papers
- MARS: Paying more attention to visual attributes for text-based person search [6.438244172631555]
This paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive).
It enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss.
Experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements.
arXiv Detail & Related papers (2024-07-05T06:44:43Z)
- Multi-modal Attribute Prompting for Vision-Language Models [40.39559705414497]
Pre-trained Vision-Language Models (VLMs) exhibit strong generalization ability to downstream tasks but struggle in few-shot scenarios.
Existing prompting techniques primarily focus on global text and image representations, while overlooking multi-modal attribute characteristics.
We propose a Multi-modal Attribute Prompting method (MAP) by jointly exploring textual attribute prompting, visual attribute prompting, and attribute-level alignment.
arXiv Detail & Related papers (2024-03-01T01:28:10Z)
- Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection [51.66174565170112]
We introduce a novel approach to utilize the strengths of large language models in understanding contextual appearance variations.
We propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection.
arXiv Detail & Related papers (2023-11-02T06:38:19Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark [24.366997699462075]
We introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS.
Considering privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset.
To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning framework.
arXiv Detail & Related papers (2023-06-05T14:06:24Z)
- Disentangling Visual Embeddings for Attributes and Objects [38.27308243429424]
We study the problem of compositional zero-shot learning for object-attribute recognition.
Prior works use visual features extracted with a backbone network, pre-trained for object classification.
We propose a novel architecture that can disentangle attribute and object features in the visual space.
arXiv Detail & Related papers (2022-05-17T17:59:36Z)
- Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z)
- Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
arXiv Detail & Related papers (2022-04-04T02:25:40Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)