Learning Comprehensive Representations with Richer Self for
Text-to-Image Person Re-Identification
- URL: http://arxiv.org/abs/2310.11210v1
- Date: Tue, 17 Oct 2023 12:39:16 GMT
- Title: Learning Comprehensive Representations with Richer Self for
Text-to-Image Person Re-Identification
- Authors: Shuanglin Yan, Neng Dong, Jun Liu, Liyan Zhang, Jinhui Tang
- Abstract summary: Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text.
Existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view.
We propose a framework, called LCR$^2$S, for modeling many-to-many correspondences of the same identity by learning representations for both modalities from a novel perspective.
- Score: 34.289949134802086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image person re-identification (TIReID) retrieves pedestrian images
of the same identity based on a query text. However, existing methods for
TIReID typically treat it as a one-to-one image-text matching problem, only
focusing on the relationship between image-text pairs within a view. The
many-to-many matching between image-text pairs across views under the same
identity is not taken into account, which is one of the main reasons for the
poor performance of existing methods. To this end, we propose a simple yet
effective framework, called LCR$^2$S, for modeling many-to-many correspondences
of the same identity by learning comprehensive representations for both
modalities from a novel perspective. We construct a support set for each image
(text) by using other images (texts) under the same identity and design a
multi-head attentional fusion module to fuse the image (text) and its support
set. The resulting enriched image and text features fuse information from
multiple views, which are aligned to train a "richer" TIReID model with
many-to-many correspondences. Since the support set is unavailable during
inference, we propose to distill the knowledge learned by the "richer" model
into a lightweight model for inference with a single image/text as input. The
lightweight model focuses on semantic association and reasoning of multi-view
information, which can generate a comprehensive representation containing
multi-view information with only a single-view input to perform accurate
text-to-image retrieval during inference. In particular, we use the intra-modal
features and inter-modal semantic relations of the "richer" model to supervise
the lightweight model to inherit its powerful capability. Extensive experiments
demonstrate the effectiveness of LCR$^2$S, and it also achieves new
state-of-the-art performance on three popular TIReID datasets.
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
It extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Image-Text Matching with Multi-View Attention [1.92360022393132]
Existing two-stream models for image-text matching show good performance while ensuring retrieval speed.
We propose a multi-view attention approach for two-stream image-text matching, MVAM (Multi-View Attention Model).
Experiment results on MSCOCO and Flickr30K show that our proposed model brings improvements over existing models.
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- A Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking [17.847936914174543]
Multimodal Entity Linking (MEL) aims to link ambiguous mentions with multimodal information to entities in a Knowledge Graph (KG) such as Wikipedia.
We formulate multimodal entity linking as a neural text matching problem where each piece of multimodal information (text and image) is treated as a query.
This paper introduces a dual-way enhanced (DWE) framework for MEL.
arXiv Detail & Related papers (2023-12-19T03:15:50Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval-based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge for the input text and image from the knowledge corpus, respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- Multi-Granularity Cross-Modality Representation Learning for Named Entity Recognition on Social Media [11.235498285650142]
Named Entity Recognition (NER) on social media refers to discovering and classifying entities from unstructured free-form content.
This work introduces multi-granularity cross-modality representation learning.
Experiments show that our proposed approach achieves SOTA or near-SOTA performance on two tweet benchmark datasets.
arXiv Detail & Related papers (2022-10-19T15:14:55Z)
- ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity [16.550790981646276]
Current approaches combine the features of each of the two elements of the query into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
arXiv Detail & Related papers (2022-03-15T17:29:20Z)
- RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER [4.510210055307459]
Multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets.
We introduce a method of text-image relation propagation into the multimodal BERT model.
We propose a multitask algorithm to train on the MNER datasets.
arXiv Detail & Related papers (2021-02-05T02:45:30Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)