BiLMa: Bidirectional Local-Matching for Text-based Person
Re-identification
- URL: http://arxiv.org/abs/2309.04675v1
- Date: Sat, 9 Sep 2023 04:01:24 GMT
- Title: BiLMa: Bidirectional Local-Matching for Text-based Person
Re-identification
- Authors: Takuro Fujii and Shuhei Tarashima
- Abstract summary: Text-based person re-identification (TBPReID) aims to retrieve person images represented by a given textual query.
How to effectively align images and texts globally and locally is a crucial challenge.
We introduce a Bidirectional Local-Matching (BiLMa) framework that jointly optimizes Masked Language Modeling (MLM) and Masked Image Modeling (MIM) in TBPReID model training.
- Score: 2.3931689873603603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based person re-identification (TBPReID) aims to retrieve person images
represented by a given textual query. In this task, how to effectively align
images and texts globally and locally is a crucial challenge. Recent works have
obtained high performances by solving Masked Language Modeling (MLM) to align
image/text parts. However, they only performed uni-directional (i.e., from
image to text) local-matching, leaving room for improvement by introducing
opposite-directional (i.e., from text to image) local-matching. In this work,
we introduce a Bidirectional Local-Matching (BiLMa) framework that jointly
optimizes MLM and Masked Image Modeling (MIM) in TBPReID model training. With
this framework, our model is trained so that the labels of randomly masked
image and text tokens are predicted from the unmasked tokens. In addition, to narrow
the semantic gap between image and text in MIM, we propose Semantic MIM
(SemMIM), in which the labels of masked image tokens are automatically given by
a state-of-the-art human parser. Experimental results demonstrate that our
BiLMa framework with SemMIM achieves state-of-the-art Rank@1 and mAP scores on
three benchmarks.
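To make the joint objective concrete, the following is a minimal PyTorch-style sketch of how an MLM loss over masked text tokens and a SemMIM loss over masked image patches could be combined, assuming a shared cross-modal encoder produces token features and a human parser supplies per-patch semantic labels. All module names, layer sizes, and the masking ratio are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative placeholder sizes (not from the paper).
TEXT_VOCAB = 30522      # e.g., a BERT-style word-piece vocabulary
PARSER_CLASSES = 20     # semantic classes emitted by a human parser
HIDDEN = 256            # feature dimension of the cross-modal encoder
MASK_RATIO = 0.15       # fraction of tokens/patches masked per sample

def random_mask(batch: int, length: int, ratio: float = MASK_RATIO) -> torch.Tensor:
    """Boolean mask selecting roughly `ratio` of positions in each sequence."""
    return torch.rand(batch, length) < ratio

class BiLMaHeadsSketch(nn.Module):
    """Joint MLM + SemMIM heads on top of encoder features (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.mlm_head = nn.Linear(HIDDEN, TEXT_VOCAB)      # predicts masked word ids
        self.mim_head = nn.Linear(HIDDEN, PARSER_CLASSES)  # predicts parser labels of masked patches

    def forward(self, text_feats, image_feats, text_ids, parser_labels,
                text_mask, image_mask):
        # MLM: recover the original word ids at masked text positions,
        # using features that (in a real encoder) depend on the unmasked context.
        mlm_loss = F.cross_entropy(self.mlm_head(text_feats[text_mask]),
                                   text_ids[text_mask])
        # SemMIM: predict parser-derived semantic labels at masked image patches.
        mim_loss = F.cross_entropy(self.mim_head(image_feats[image_mask]),
                                   parser_labels[image_mask])
        return mlm_loss + mim_loss

# Usage with random tensors standing in for encoder outputs and parser labels.
B, L, P = 2, 24, 96  # batch size, text length, number of image patches
heads = BiLMaHeadsSketch()
loss = heads(
    text_feats=torch.randn(B, L, HIDDEN),
    image_feats=torch.randn(B, P, HIDDEN),
    text_ids=torch.randint(0, TEXT_VOCAB, (B, L)),
    parser_labels=torch.randint(0, PARSER_CLASSES, (B, P)),
    text_mask=random_mask(B, L),
    image_mask=random_mask(B, P),
)
loss.backward()
```

In a full model, the features at masked positions would come from a cross-modal encoder that sees only the unmasked tokens of both modalities, so the predictions in both directions are driven by the visible context, as the abstract describes.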
Related papers
- Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multimodal Large Language Models (SAM).
By involving bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z) - LAIP: Learning Local Alignment from Image-Phrase Modeling for Text-based Person Search [16.7500024682162]
This paper proposes the Local Alignment from Image-Phrase modeling (LAIP) framework, with a Bidirectional Attention-weighted local alignment (BidirAtt) module and a Mask Phrase Modeling (MPM) module.
Experiments conducted on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets demonstrate the superiority of the LAIP framework over existing methods.
arXiv Detail & Related papers (2024-06-16T08:37:24Z) - SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining [2.9010546489056415]
Vision-language models (VLMs) have made significant strides in cross-modal understanding through paired datasets.
In the fashion domain, datasets often exhibit a disparity between the information conveyed in images and text.
We propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where information co-occurs in both image and text.
arXiv Detail & Related papers (2024-04-01T15:01:38Z) - Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language
Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z) - Synchronizing Vision and Language: Bidirectional Token-Masking
AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z) - Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z) - Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z) - Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z) - LayoutLMv3: Pre-training for Document AI with Unified Text and Image
Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z) - Data Efficient Masked Language Modeling for Vision and Language [16.95631509102115]
Masked language modeling (MLM) is one of the key sub-tasks in vision-language training.
In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text.
We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings.
arXiv Detail & Related papers (2021-09-05T11:27:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.