LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification
- URL: http://arxiv.org/abs/2504.10174v2
- Date: Tue, 15 Apr 2025 07:41:21 GMT
- Title: LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification
- Authors: Yiding Lu, Mouxing Yang, Dezhong Peng, Peng Hu, Yijie Lin, Xi Peng
- Abstract summary: We introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. We propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts.
- Score: 23.629373698103212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.
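The abstract describes a loop in which retrieval and questioning alternate: the system ranks a gallery from the running description, asks the witness about a missing attribute, and folds the answer back into the query. A minimal toy sketch of that loop is below; the retriever, questioner, and attribute list are all illustrative placeholders, not the LLaVA-ReID models or dataset.

```python
# Toy sketch of the Inter-ReID interaction loop: rank, ask, refine, re-rank.
# All components are hypothetical stand-ins for the paper's learned models.

def retrieve(description, gallery):
    """Rank gallery identities by word overlap with the description."""
    words = set(description.lower().split())
    return sorted(gallery, key=lambda pid: -len(words & gallery[pid]))

def select_question(description, attributes):
    """Ask about the first fine-grained attribute not yet mentioned."""
    for attr in attributes:
        if attr not in description.lower():
            return f"What is the person's {attr}?"
    return None  # description already covers all attributes

def inter_reid(initial, gallery, witness_answers, attributes, rounds=3):
    """Iteratively refine the description, then return the final ranking."""
    description = initial
    for _ in range(rounds):
        question = select_question(description, attributes)
        if question is None:
            break
        attr = question.split("person's ")[1].rstrip("?")
        # The witness's answer is appended to refine the query.
        description += f" {attr} {witness_answers[attr]}"
    return retrieve(description, gallery)
```

With two candidates differing only in footwear, one question round is enough to disambiguate them, which is the effect the paper's looking-forward question selection is designed to maximize.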
Related papers
- ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models [49.09606704563898]
Person re-identification is a crucial task in computer vision, aiming to recognize individuals across non-overlapping camera views. We propose a novel framework, ChatReID, that shifts the focus towards a text-side-dominated retrieval paradigm, enabling flexible and interactive re-identification. We introduce a hierarchical progressive tuning strategy, which endows Re-ID ability through three stages of tuning, i.e., from person attribute understanding to fine-grained image retrieval and to multi-modal task reasoning.
arXiv Detail & Related papers (2025-02-27T10:34:14Z)
- Exploring Rewriting Approaches for Different Conversational Tasks [63.56404271441824]
The exact rewriting approach may often depend on the use case and application-specific tasks supported by the conversational assistant.
We systematically investigate two different approaches, denoted as rewriting and fusion, on two fundamentally different generation tasks.
Our results indicate that the specific rewriting or fusion approach highly depends on the underlying use case and generative task.
arXiv Detail & Related papers (2025-02-26T06:05:29Z)
- Enhancing Answer Attribution for Faithful Text Generation with Large Language Models [5.065947993017158]
We propose new methods for producing more independent and contextualized claims for better retrieval and attribution.
New methods are evaluated and shown to improve the performance of answer attribution components.
arXiv Detail & Related papers (2024-10-22T15:37:46Z)
- Venn Diagram Prompting: Accelerating Comprehension with Scaffolding Effect [0.0]
We introduce Venn Diagram (VD) Prompting, an innovative prompting technique which allows Large Language Models (LLMs) to combine and synthesize information across documents.
Our proposed technique also aims to eliminate the inherent position bias in the LLMs, enhancing consistency in answers by removing sensitivity to the sequence of input information.
In experiments on four public benchmark question-answering datasets, VD prompting consistently matches or surpasses the performance of a meticulously crafted instruction prompt.
arXiv Detail & Related papers (2024-06-08T06:27:26Z)
- Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification [62.894790379098005]
We propose a novel instruct-ReID task that requires the model to retrieve images according to the given image or language instructions.
Instruct-ReID is the first exploration of a general ReID setting, where 6 existing ReID tasks can be viewed as special cases by assigning different instructions.
We propose a novel baseline model, IRM, with an adaptive triplet loss to handle various retrieval tasks within a unified framework.
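The IRM baseline above is trained with an adaptive triplet loss. The paper's exact adaptive formulation is not given in this summary, so the sketch below shows only the standard margin-based triplet loss on embedding vectors that such a variant builds on.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard margin-based triplet loss: push the anchor-positive
    distance below the anchor-negative distance by at least `margin`.
    A generic sketch; IRM's adaptive variant is not specified here."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return max(dist(anchor, positive) - dist(anchor, negative) + margin, 0.0)
```

When the negative is already far enough away, the loss is zero and the triplet contributes no gradient; an adaptive variant typically adjusts the margin per triplet rather than fixing it.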
arXiv Detail & Related papers (2024-05-28T03:35:46Z)
- Narrative Action Evaluation with Prompt-Guided Multimodal Interaction [60.281405999483]
Narrative action evaluation (NAE) aims to generate professional commentary that evaluates the execution of an action.
NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor.
We propose a prompt-guided multimodal interaction framework to facilitate the interaction between different modalities of information.
arXiv Detail & Related papers (2024-04-22T17:55:07Z)
- Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification [18.01407937934588]
We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
arXiv Detail & Related papers (2023-12-28T03:00:19Z)
- Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering [4.114444605090133]
We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities.
KVQAE is a recently introduced task that consists of answering questions about named entities grounded in a visual context using a Knowledge Base.
Our method is applicable to different neural network architectures and leads to a 9% relative MRR gain for retrieval and a 15% relative F1 gain for reading comprehension.
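The retrieval gain above is reported as relative MRR, i.e. the improvement divided by the baseline score. A short sketch of both quantities, on toy ranked lists rather than the paper's data:

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """MRR: average over queries of 1 / (rank of the first correct item)."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        rank = ranked.index(target) + 1  # ranks are 1-based
        total += 1.0 / rank
    return total / len(ranked_lists)

def relative_gain(baseline, improved):
    """Relative improvement: a 9% relative-MRR gain means this is 0.09."""
    return (improved - baseline) / baseline
```

So a baseline MRR of 0.50 improved to 0.545 would correspond to the quoted 9% relative gain.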
arXiv Detail & Related papers (2023-01-11T09:16:34Z)
- Part2Whole: Iteratively Enrich Detail for Cross-Modal Retrieval with Partial Query [25.398090300086302]
We propose an interactive retrieval framework called Part2Whole to tackle cross-modal retrieval with partial queries.
An Interactive Retrieval Agent is trained to build an optimal policy to refine the initial query.
We present a weakly-supervised reinforcement learning method that needs no human-annotated data other than the text-image dataset.
arXiv Detail & Related papers (2021-03-02T11:27:05Z)
- Reasoning in Dialog: Improving Response Generation by Context Reading Comprehension [49.92173751203827]
In multi-turn dialog, utterances do not always take the full form of sentences.
We propose to improve the response generation performance by examining the model's ability to answer a reading comprehension question.
arXiv Detail & Related papers (2020-12-14T10:58:01Z)
- Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues [88.73739515457116]
We introduce four self-supervised tasks including next session prediction, utterance restoration, incoherence detection and consistency discrimination.
We jointly train the PLM-based response selection model with these auxiliary tasks in a multi-task manner.
Experiment results indicate that the proposed auxiliary self-supervised tasks bring significant improvement for multi-turn response selection.
arXiv Detail & Related papers (2020-09-14T08:44:46Z)
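The entry above trains response selection jointly with four auxiliary self-supervised tasks. A common way to combine such objectives is a weighted sum of losses; the sketch below shows that generic multi-task combination, with weights as illustrative hyperparameters rather than values from the paper.

```python
def joint_loss(main_loss, aux_losses, weights=None):
    """Multi-task objective: the main response-selection loss plus weighted
    auxiliary self-supervised losses (e.g. next-session prediction,
    utterance restoration, incoherence detection, consistency
    discrimination). Weights are illustrative, not from the paper."""
    if weights is None:
        weights = [1.0] * len(aux_losses)  # equal weighting by default
    return main_loss + sum(w * l for w, l in zip(weights, aux_losses))
```

In practice each loss comes from its own head over a shared PLM encoder, and the combined scalar is what gets backpropagated.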
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.