Personalized Image Descriptions from Attention Sequences
- URL: http://arxiv.org/abs/2512.06662v1
- Date: Sun, 07 Dec 2025 05:23:18 GMT
- Title: Personalized Image Descriptions from Attention Sequences
- Authors: Ruoyu Xue, Hieu Le, Jingyi Xu, Sounak Mondal, Abe Leite, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
- Abstract summary: People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. Existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER, learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision-language model, enabling few-shot personalization without retraining.
- Score: 55.65023709100682
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER (DEscription-PERception persona encoder), learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision-language model, enabling few-shot personalization without retraining. Across four datasets spanning diverse viewing tasks and both short and detailed descriptions, DEPER achieves a 24% average improvement, showing that modeling personalized attention produces more human-aligned and high-quality descriptions. We posit that understanding how people see helps predict what they say; modeling human diversity in perception can improve both performance and human alignment in multimodal systems.
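The abstract describes the architecture only at a high level: a learned subject embedding supervised by an auxiliary attention-prediction task, plus a lightweight adapter that feeds the embedding into a frozen vision-language model. Below is a minimal PyTorch sketch of that idea; all module names, dimensions, the 14x14 region grid, the loss weight, and the `caption_nll` interface are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the approach described in the abstract: a per-subject
# embedding trained jointly on (1) description generation through a frozen
# VLM, reached via a lightweight adapter, and (2) an auxiliary task that
# predicts the subject's fixation sequence. Module names, sizes, the 14x14
# region grid, the loss weight, and `caption_nll` are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersonaEncoder(nn.Module):
    def __init__(self, num_subjects, dim=256, vlm_dim=4096, regions=196):
        super().__init__()
        # One embedding per subject: linguistic style + viewing behavior.
        self.subject_embed = nn.Embedding(num_subjects, dim)
        # Auxiliary head: rolls the embedding forward to score which image
        # region (on a 14x14 grid) the subject is likely to fixate next.
        self.attn_rnn = nn.GRUCell(dim, dim)
        self.region_head = nn.Linear(dim, regions)
        # Lightweight adapter into the frozen VLM's token space.
        self.adapter = nn.Sequential(
            nn.Linear(dim, vlm_dim), nn.GELU(), nn.Linear(vlm_dim, vlm_dim))

    def forward(self, subject_ids):
        z = self.subject_embed(subject_ids)      # (B, dim)
        return z, self.adapter(z)                # persona token for the VLM

def training_loss(model, frozen_vlm, subject_ids, images, captions,
                  fixations, lam=0.5):
    z, persona_token = model(subject_ids)
    # Description loss: the VLM's weights stay frozen; gradients flow only
    # into the persona token (`caption_nll` is a stand-in, not a real API).
    lm_loss = frozen_vlm.caption_nll(images, captions, prefix=persona_token)
    # Auxiliary loss: predict the recorded fixation sequence, given as a
    # (B, T) tensor of region indices.
    h, attn_loss = z, 0.0
    for t in range(fixations.size(1) - 1):
        h = model.attn_rnn(z, h)
        attn_loss = attn_loss + F.cross_entropy(
            model.region_head(h), fixations[:, t + 1])
    return lm_loss + lam * attn_loss
```

The key design point stated in the abstract is that the VLM stays frozen, so personalizing to a new subject only requires fitting the small embedding and adapter from a few examples.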
Related papers
- Deciphering Personalization: Towards Fine-Grained Explainability in Natural Language for Personalized Image Generation Models [9.722829662835233]
This paper presents FineXL, a technique for Fine-grained eXplainability in natural Language for personalized image generation models. FineXL improves the accuracy of explainability by 56% when different personalization scenarios are applied to multiple types of image generation models.
arXiv Detail & Related papers (2025-11-02T16:08:24Z)
- Improving Personalized Search with Regularized Low-Rank Parameter Updates [52.29168893900888]
We show how to adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaptation of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion. Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries.
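As a rough illustration of a regularized low-rank parameter update applied to one linear layer, one could wrap the layer as sketched below; the rank, penalty weight, and choice of layer are assumptions for illustration, not the paper's code.

```python
# Hedged sketch of a regularized low-rank parameter update applied to a
# single linear layer (e.g. the language encoder's final projection). The
# rank, penalty weight, and layer choice are illustrative assumptions.
import torch
import torch.nn as nn

class LowRankUpdate(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        # Output of the frozen layer plus the learned low-rank correction.
        return self.base(x) + x @ (self.B @ self.A).T

    def reg_loss(self):
        # Regularizer keeping the adapted layer close to the pretrained one.
        return (self.B @ self.A).pow(2).sum()
```

Training would then minimize something like `task_loss + lam * layer.reg_loss()`, so only a handful of parameters move, in the spirit of the paper's alternative to textual inversion.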
arXiv Detail & Related papers (2025-06-11T21:15:21Z)
- Visual Persona: Foundation Model for Full-Body Human Customization [36.135949939650786]
We introduce Visual Persona, a model for text-to-image full-body human customization. Our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs.
arXiv Detail & Related papers (2025-03-19T16:45:47Z)
- Nested Attention: Semantic-aware Attention Values for Concept Personalization [78.90196530697897]
We introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
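A minimal sketch of the query-dependent value idea as stated in the summary: every spatial query attends over subject-image features to build its own value vector. Shapes, weights, and the injection point are assumptions, not the paper's implementation.

```python
# Hedged sketch of "query-dependent subject values": each spatial query in
# the generator attends over subject-image features to assemble its own
# value vector, which can then be injected at the subject token's slot in
# an existing cross-attention layer. Shapes and weights are assumptions.
import torch

def nested_subject_values(queries, subject_feats, Wk, Wv):
    """queries: (B, N, d) spatial queries; subject_feats: (B, M, d)
    subject-image features; Wk, Wv: (d, d) projections. Returns (B, N, d)."""
    k = subject_feats @ Wk                               # (B, M, d)
    v = subject_feats @ Wv                               # (B, M, d)
    scores = queries @ k.transpose(1, 2) / k.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=-1)                 # (B, N, M)
    return attn @ v                                      # per-query values
```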
arXiv Detail & Related papers (2025-01-02T18:52:11Z)
- Imagine yourself: Tuning-Free Personalized Image Generation [39.63411174712078]
We introduce Imagine yourself, a state-of-the-art model designed for personalized image generation.
It operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments.
Our study demonstrates that Imagine yourself surpasses state-of-the-art personalization models, exhibiting superior capabilities in identity preservation, visual quality, and text alignment.
arXiv Detail & Related papers (2024-09-20T09:21:49Z)
- U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation [18.841473623776153]
State-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space.
A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples to learn user-specified visual attributes.
Experiments on various kinds of visual attributes with SOTA personalization methods show the ability of the proposed method to mimic target visual appearance in novel contexts.
arXiv Detail & Related papers (2024-03-29T15:20:34Z)
- Semantic and Expressive Variation in Image Captions Across Languages [26.766596770616655]
We study how people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli. By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression. Our work points towards the need to account for and embrace the diversity of human perception in the computer vision community.
arXiv Detail & Related papers (2023-10-22T16:51:42Z)
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions to a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
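This three-step pipeline is concrete enough to sketch. The outline below uses VLM image-text similarity scores as the visual descriptors and scikit-learn's L1-penalized logistic regression for the sparse selection step; the example descriptors, the `vlm.similarity` call, and the hyperparameters are assumptions, not the paper's exact setup.

```python
# Hedged sketch of the described pipeline: (1) an LLM proposes visual
# descriptors per class, (2) a VLM scores each image against each
# descriptor, (3) L1-penalized logistic regression keeps a sparse subset.
# Prompts, the `vlm.similarity` call, and hyperparameters are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Step 1 (assumed done offline by an LLM): candidate descriptors, e.g.
descriptors = ["whiskers", "pointed ears", "floppy ears", "wagging tail"]

def descriptor_features(images, descriptors, vlm):
    # Step 2: each image becomes a vector of image-text similarity scores,
    # one entry per descriptor.
    return np.array([[vlm.similarity(img, d) for d in descriptors]
                     for img in images])

def fit_sparse_classifier(X_train, y_train):
    # Step 3: the L1 penalty zeroes out uninformative descriptors, leaving
    # a small, interpretable subset per class.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    return clf.fit(X_train, y_train)
```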
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
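One way to picture this conditioning is a decoder input that prepends a style token and retrieved keywords to the caption tokens. The sketch below guesses at that interface; the tokenizer calls follow Hugging Face conventions, and the special-token naming is an assumption, not the paper's code.

```python
# Hedged sketch: supply style and semantics to the caption decoder through
# separate channels, a dataset-specific style token plus keywords from a
# retrieval component. Tokenizer interface and token naming are assumptions.
def build_decoder_input(style, keywords, caption_ids, tokenizer):
    style_id = tokenizer.convert_tokens_to_ids(f"<style:{style}>")
    kw_ids = tokenizer(" ".join(keywords), add_special_tokens=False)["input_ids"]
    # Decoder sees: [style] [keywords] [caption so far]
    return [style_id] + kw_ids + caption_ids
```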
arXiv Detail & Related papers (2021-11-24T19:00:05Z)