Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
- URL: http://arxiv.org/abs/2503.09962v1
- Date: Thu, 13 Mar 2025 02:08:27 GMT
- Title: Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
- Authors: Jiayu Jiang, Changxing Ding, Wentao Tan, Junhong Wang, Jin Tao, Xiangmin Xu
- Abstract summary: We propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. HAM allows us to group textual descriptions with similar styles into the same cluster and apply prompt learning to mimic the description styles of different human annotators. HAM significantly improves the generalization ability of ReID models.
- Score: 20.748856943104375
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-to-image person re-identification (ReID) aims to retrieve images of a person of interest based on textual descriptions. One main challenge for this task is the high cost of manually annotating large-scale databases, which limits the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization ability of ReID models.
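A minimal sketch of the pipeline described in the abstract (all names, feature dimensions, and the cluster count are illustrative assumptions, not the authors' released code): style features from human captions are clustered, each caption receives a style ID, and that ID selects a learnable style prompt; uniform sampling in the style space yields extra prototypes.

```python
# Minimal sketch of the HAM-style pipeline; all names, dimensions, and the
# cluster count below are illustrative assumptions, not the authors' code.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for style features extracted from human-written captions.
num_captions, style_dim = 10_000, 64
style_feats = rng.normal(size=(num_captions, style_dim)).astype(np.float32)

# 1) Cluster captions whose description styles are similar.
num_styles = 1000  # "thousands of annotators" approximated by style clusters
style_ids = KMeans(n_clusters=num_styles, n_init=10, random_state=0).fit_predict(style_feats)

# 2) One learnable prompt per style cluster; prompt learning would tune these
#    so that the MLLM imitates that cluster's description style.
prompt_len, embed_dim = 8, 512
style_prompts = rng.normal(size=(num_styles, prompt_len, embed_dim)).astype(np.float32)

def prompt_for(caption_idx: int) -> np.ndarray:
    """Return the style prompt that would condition the MLLM for this caption."""
    return style_prompts[style_ids[caption_idx]]

# 3) Uniform sampling in the style feature space gives extra prototypes that
#    further diversify the styles the MLLM is asked to imitate.
lo, hi = style_feats.min(axis=0), style_feats.max(axis=0)
extra_prototypes = rng.uniform(lo, hi, size=(500, style_dim))

print(prompt_for(0).shape, extra_prototypes.shape)
```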
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
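A toy illustration of pooling visual tokens to several granularities, loosely in the spirit of the MME described above; the pooling choice and token counts are assumptions, not the paper's architecture.

```python
# Toy illustration of compressing visual tokens at several granularities.
# Average pooling and the token counts are assumptions, not the paper's MME.
import torch

def multi_granularity_tokens(visual_tokens: torch.Tensor,
                             granularities=(256, 64, 16, 4)) -> dict[int, torch.Tensor]:
    """Average-pool a (batch, num_tokens, dim) tensor down to each target token count."""
    b, n, d = visual_tokens.shape
    out = {}
    for g in granularities:
        group = n // g                      # tokens merged per output token
        pooled = visual_tokens[:, : g * group].reshape(b, g, group, d).mean(dim=2)
        out[g] = pooled
    return out

tokens = torch.randn(2, 256, 768)           # e.g. ViT patch tokens
compressed = multi_granularity_tokens(tokens)
print({g: t.shape for g, t in compressed.items()})
```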
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID [44.372336186832584]
We study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database.
We obtain substantial training data via Multi-modal Large Language Models (MLLMs).
We introduce a novel method that automatically identifies words in a description that do not correspond with the image.
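A generic, hypothetical way to flag caption words that an image-text model does not support: drop one word at a time and see whether image-text similarity rises. This is an illustrative stand-in, not necessarily the method proposed in that paper.

```python
# Illustrative word-image correspondence check with CLIP; not the paper's method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def suspicious_words(image: Image.Image, caption: str, margin: float = 0.0) -> list[str]:
    words = caption.split()
    variants = [caption] + [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]
    inputs = processor(text=variants, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image[0]  # similarity of the image to each text
    full = sims[0]
    # A word is suspicious if deleting it makes the caption match the image better.
    return [w for w, s in zip(words, sims[1:]) if s > full + margin]

# Example usage (hypothetical file path):
# img = Image.open("pedestrian.jpg")
# print(suspicious_words(img, "a man in a red jacket carrying a blue umbrella"))
```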
arXiv Detail & Related papers (2024-05-08T10:15:04Z) - Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification [18.01407937934588]
We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
arXiv Detail & Related papers (2023-12-28T03:00:19Z) - User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods emphasize the user context fusion process by memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to consider personality factors.
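A brief sketch of the prefix-tuning idea behind user-aware personalization: each user owns a small set of learnable prefix embeddings prepended to the caption decoder's input. Shapes and module names are illustrative assumptions.

```python
# Sketch of user-specific prefix-tuning for personalized captioning; the
# embedding table and shapes are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class UserPrefix(nn.Module):
    def __init__(self, num_users: int, prefix_len: int = 10, dim: int = 768):
        super().__init__()
        self.prefix = nn.Embedding(num_users, prefix_len * dim)
        self.prefix_len, self.dim = prefix_len, dim

    def forward(self, user_ids: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, dim) embeddings of the caption tokens
        p = self.prefix(user_ids).view(-1, self.prefix_len, self.dim)
        return torch.cat([p, token_embeds], dim=1)  # decoder attends to the user prefix

prefixer = UserPrefix(num_users=1000)
out = prefixer(torch.tensor([3, 7]), torch.randn(2, 20, 768))
print(out.shape)  # (2, 30, 768)
```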
arXiv Detail & Related papers (2023-12-08T02:08:00Z) - Generating Illustrated Instructions [41.613203340244155]
We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs.
We combine the power of large language models (LLMs) with strong text-to-image diffusion models to propose a simple approach called StackedDiffusion.
arXiv Detail & Related papers (2023-12-07T18:59:20Z) - MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z) - Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions to a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
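A sketch of that recipe with synthetic placeholders standing in for the LLM-generated descriptions and the VLM similarity features; the hyperparameters are illustrative assumptions.

```python
# Sketch of the SLR-AVD recipe as summarized above: image-vs-description
# similarities become features, and an L1-penalized logistic regression keeps
# only the useful descriptions. The data below is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

num_images, num_descriptions, num_classes = 500, 120, 10
# Stand-in for CLIP-style cosine similarities between each image and each
# LLM-generated class description (in practice: image_emb @ text_emb.T).
sim_features = rng.normal(size=(num_images, num_descriptions))
labels = rng.integers(0, num_classes, size=num_images)

clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=2000)
clf.fit(sim_features, labels)

kept = np.unique(np.nonzero(clf.coef_)[1])
print(f"{kept.size}/{num_descriptions} descriptions kept after sparse selection")
```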
arXiv Detail & Related papers (2023-07-10T03:06:45Z) - I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification [108.83932812826521]
Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks.
Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification from multiple LLM-generated class-level text views.
I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
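A simplified zero-shot classification sketch using several text views per class; the mean aggregation here is a simplifying assumption in place of the transformer-based aggregation that I2MVFormer learns.

```python
# Generic zero-shot classification from multiple text "views" per class:
# the views are embedded, averaged into one class embedding, and images are
# assigned to the nearest class by cosine similarity. Averaging replaces the
# learned aggregation used in the actual model.
import numpy as np

rng = np.random.default_rng(0)
dim, views_per_class, num_classes, num_images = 512, 3, 5, 8

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for embeddings of LLM-generated class documents and image features.
view_embeds = l2norm(rng.normal(size=(num_classes, views_per_class, dim)))
image_embeds = l2norm(rng.normal(size=(num_images, dim)))

class_embeds = l2norm(view_embeds.mean(axis=1))          # aggregate the views
preds = (image_embeds @ class_embeds.T).argmax(axis=1)   # nearest class by cosine
print(preds)
```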
arXiv Detail & Related papers (2022-12-05T14:11:36Z) - Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to separate semantics from descriptive style by incorporating a style token and keywords extracted through a retrieval component.
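A small sketch of how such conditioning might look, with the style token names, the separator, and the keyword list all being illustrative assumptions rather than the paper's format.

```python
# Sketch of conditioning a captioner on a style token plus retrieved keywords:
# style and semantics enter the decoder as separate parts of the input.
# Token names and the retrieval step are illustrative assumptions.
def build_decoder_input(style: str, retrieved_keywords: list[str]) -> str:
    # e.g. style in {"<web>", "<coco>"} marks which data source's style to imitate
    return f"{style} " + " ".join(retrieved_keywords) + " <sep>"

keywords = ["dog", "frisbee", "park"]        # would come from a retrieval component
prompt = build_decoder_input("<coco>", keywords)
print(prompt)  # "<coco> dog frisbee park <sep>"
```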
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.