PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery
- URL: http://arxiv.org/abs/2409.13475v1
- Date: Fri, 20 Sep 2024 13:05:55 GMT
- Title: PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery
- Authors: Jicheol Park, Dongwon Kim, Boseung Jeong, Suha Kwak,
- Abstract summary: We propose a novel framework that leverages a part discovery module based on slot attention to autonomously identify and align distinctive parts across modalities.
Our method is evaluated on three public benchmarks, significantly outperforming existing methods.
- Score: 29.301950609839796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based person search, employing free-form text queries to identify individuals within a vast image collection, presents a unique challenge in aligning visual and textual representations, particularly at the human part level. Existing methods often struggle with part feature extraction and alignment due to the lack of direct part-level supervision and reliance on heuristic features. We propose a novel framework that leverages a part discovery module based on slot attention to autonomously identify and align distinctive parts across modalities, enhancing interpretability and retrieval accuracy without explicit part-level correspondence supervision. Additionally, text-based dynamic part attention adjusts the importance of each part, further improving retrieval outcomes. Our method is evaluated on three public benchmarks, significantly outperforming existing methods.
Related papers
- From Attributes to Natural Language: A Survey and Foresight on Text-based Person Re-identification [4.400729890122927]
The aim of text-based person Re-ID is to recognize specific pedestrians by scrutinizing attributes/natural language descriptions.
There is a notable absence of comprehensive reviews dedicated to summarizing the text-based person Re-ID from a technical perspective.
We introduce a taxonomy spanning Evaluation, Strategy, Architecture, and Optimization dimensions, providing a comprehensive survey of the text-based person Re-ID task.
arXiv Detail & Related papers (2024-07-31T18:16:18Z) - Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model.
A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances.
Tests demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z) - A Context-Contrastive Inference Approach To Partial Diacritization [0.5575959989491791]
Diacritization plays a pivotal role in improving readability and disambiguating the meaning of Arabic texts.
Partial Diacritzation (PD) is the selection of a subset of characters to be marked to aid comprehension where needed.
We introduce Context-Contrastive Partial Diacritization (CCPD) -- a novel approach to PD which integrates seamlessly with existing Arabic diacritization systems.
arXiv Detail & Related papers (2024-01-17T02:04:59Z) - Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection [51.66174565170112]
We introduce a novel approach to utilize the strengths of large language models in understanding contextual appearance variations.
We propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection.
arXiv Detail & Related papers (2023-11-02T06:38:19Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - TRIE++: Towards End-to-End Information Extraction from Visually Rich
Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained in an end-to-end trainable manner, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z) - Learning Semantic-Aligned Feature Representation for Text-based Person
Search [8.56017285139081]
We propose a semantic-aligned embedding method for text-based person search.
The feature alignment across modalities is achieved by automatically learning the semantic-aligned visual features and textual features.
Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performances.
arXiv Detail & Related papers (2021-12-13T14:54:38Z) - Attention-based Feature Decomposition-Reconstruction Network for Scene
Text Detection [20.85468268945721]
We propose attention-based feature decomposition-reconstruction network for scene text detection.
We use contextual information and low-level feature to enhance the performance of segmentation-based text detector.
Experiments have been conducted on two public benchmark datasets and results show that our proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-11-29T06:15:25Z) - TRIE: End-to-End Text Reading and Information Extraction for Document
Understanding [56.1416883796342]
We propose a unified end-to-end text reading and information extraction network.
multimodal visual and textual features of text reading are fused for information extraction.
Our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
arXiv Detail & Related papers (2020-05-27T01:47:26Z) - Salience Estimation with Multi-Attention Learning for Abstractive Text
Summarization [86.45110800123216]
In the task of text summarization, salience estimation for words, phrases or sentences is a critical component.
We propose a Multi-Attention Learning framework which contains two new attention learning components for salience estimation.
arXiv Detail & Related papers (2020-04-07T02:38:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.