Improving Personalized Search with Regularized Low-Rank Parameter Updates
- URL: http://arxiv.org/abs/2506.10182v1
- Date: Wed, 11 Jun 2025 21:15:21 GMT
- Title: Improving Personalized Search with Regularized Low-Rank Parameter Updates
- Authors: Fiona Ryan, Josef Sivic, Fabian Caba Heilbron, Judy Hoffman, James M. Rehg, Bryan Russell
- Abstract summary: We show how to adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaptation of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion. Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries.
- Score: 52.29168893900888
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Personalized vision-language retrieval seeks to recognize new concepts (e.g. "my dog Fido") from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaptation of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision-language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries - DeepFashion2 and ConCon-Chi - outperforming the prior art by 4%-22% on personal retrievals.
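The abstract describes two concrete mechanisms: a regularized low-rank update to a small set of weights in the language encoder's final layer, and parameter addition to combine several learned personal concepts. Below is a minimal PyTorch sketch of that idea; the class and function names, the choice of rank, and the L2 penalty on the update are illustrative assumptions rather than the paper's exact implementation.

```python
# Hedged sketch: a LoRA-style low-rank update on one linear layer of the text
# encoder's final block, with a simple regularizer that keeps the update small,
# plus parameter addition for composing several personal concepts.
# All hyperparameters and the exact loss are assumptions, not the paper's.
import torch
import torch.nn as nn

class LowRankUpdate(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank residual W + B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # keep pretrained weights frozen
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))  # zero init: starts as a no-op update

    def delta(self) -> torch.Tensor:
        return self.B @ self.A                      # low-rank parameter update, shape (out_f, in_f)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.delta().T

def regularized_loss(task_loss: torch.Tensor, update: LowRankUpdate,
                     reg_weight: float = 1e-2) -> torch.Tensor:
    """Add an L2 penalty on the update so general knowledge is not overwritten."""
    return task_loss + reg_weight * update.delta().pow(2).sum()

def compose_concepts(updates: list[LowRankUpdate]) -> torch.Tensor:
    """Parameter addition: sum per-concept deltas into one combined update."""
    return torch.stack([u.delta() for u in updates]).sum(dim=0)
```

In a dual-encoder setting such as CLIP, one would wrap a projection in the final layer of the text encoder with `LowRankUpdate`, fit it on the few examples of the personal concept with a retrieval loss plus the regularizer, and then add the deltas of separately trained concepts to serve multiple personal concepts with a single set of weights, which is the parameter-addition strategy the abstract reports as effective.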
Related papers
- Discriminative-Generative Custom Tokens for Vision-Language Models [101.40245125955306]
This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs).
Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries.
arXiv Detail & Related papers (2025-02-17T18:13:42Z) - Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction [27.00018283430169]
This paper presents VisCE$2$, a vision language model-based caption evaluation method.
Our method focuses on visual context, which refers to the detailed content of images, including objects, attributes, and relationships.
arXiv Detail & Related papers (2024-02-28T01:29:36Z) - LaViP: Language-Grounded Visual Prompts [27.57227844809257]
We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks.
By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder.
Our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained.
arXiv Detail & Related papers (2023-12-18T05:50:10Z) - User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods emphasize fusing user context via memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to account for personality factors.
arXiv Detail & Related papers (2023-12-08T02:08:00Z) - Visual Analytics for Efficient Image Exploration and User-Guided Image
Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z) - Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image
Models [59.094601993993535]
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts.
Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts.
We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
arXiv Detail & Related papers (2023-07-13T17:46:42Z) - Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z) - Designing an Encoder for Fast Personalization of Text-to-Image Models [57.62449900121022]
We propose an encoder-based domain-tuning approach for text-to-image personalization.
We employ two components: First, an encoder that takes as input a single image of a target concept from a given domain.
Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts.
arXiv Detail & Related papers (2023-02-23T18:46:41Z) - K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (a minimal enrichment sketch appears after this list).
arXiv Detail & Related papers (2022-04-20T04:47:01Z) - "This is my unicorn, Fluffy": Personalizing frozen vision-language representations [31.618829097336047]
We introduce a new learning setup called Personalized Vision & Language (PerVL).
In PerVL, one should learn personalized concepts independently of the downstream task.
We demonstrate that our approach learns personalized visual concepts from a few examples and can effectively apply them in image retrieval and semantic segmentation.
arXiv Detail & Related papers (2022-04-04T17:58:11Z)
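The K-LITE entry above describes enriching entities in natural language with external knowledge from WordNet and Wiktionary, both when training and when evaluating the visual model. The snippet below is a minimal sketch of that kind of enrichment using NLTK's WordNet glosses; the prompt template, the fallback behavior, and the choice of the first synset are assumptions for illustration, not K-LITE's actual pipeline.

```python
# Hedged sketch of knowledge-augmented text in the spirit of K-LITE.
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def enrich_with_wordnet(entity: str) -> str:
    """Append a WordNet gloss to an entity name, e.g. before encoding a retrieval query."""
    synsets = wn.synsets(entity.replace(" ", "_"))
    if not synsets:
        return entity                        # fall back to the raw entity when no knowledge is found
    gloss = synsets[0].definition()          # gloss of the most common sense (an assumption)
    return f"{entity}, which is {gloss}"

print(enrich_with_wordnet("unicorn"))
# -> "unicorn, which is an imaginary creature ..." (exact gloss depends on the WordNet version)
```

The enriched string would then be fed to the text encoder in place of the bare entity name, so that rare or unfamiliar concepts are grounded in a dictionary-style definition.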
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.