TAMEing Long Contexts in Personalization: Towards Training-Free and State-Aware MLLM Personalized Assistant
- URL: http://arxiv.org/abs/2512.21616v1
- Date: Thu, 25 Dec 2025 10:23:56 GMT
- Title: TAMEing Long Contexts in Personalization: Towards Training-Free and State-Aware MLLM Personalized Assistant
- Authors: Rongpei Hong, Jian Lang, Ting Zhong, Yong Wang, Fan Zhou,
- Abstract summary: The first Long-Context MLLM Personalization evaluation benchmark (LCMP) is proposed, together with a new framework, TAME, that endows MLLMs with dual memories to manage the temporal and persistent variations of each personalized concept.
- Score: 32.497044980186544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Model (MLLM) Personalization is a critical research problem that facilitates personalized dialogues with MLLMs targeting specific entities (known as personalized concepts). However, existing methods and benchmarks focus on simple, context-agnostic visual identification and textual replacement of the personalized concept (e.g., "A yellow puppy" -> "Your puppy Mochi"), overlooking the ability to support long-context conversations. An ideal personalized MLLM assistant can engage in long-context dialogues with humans and continually improve the interaction experience by learning from past dialogue histories. To bridge this gap, we propose LCMP, the first Long-Context MLLM Personalization evaluation benchmark. LCMP assesses the capability of MLLMs to perceive variations of personalized concepts and to generate contextually appropriate personalized responses that reflect these variations. As a strong baseline for LCMP, we introduce TAME, a novel training-free and state-aware framework. TAME endows MLLMs with dual memories that manage the temporal and persistent variations of each personalized concept in a differentiated manner. In addition, TAME incorporates a new training-free Retrieve-then-Align Augmented Generation (RA2G) paradigm. RA2G introduces an alignment step that extracts, from the knowledge retrieved across both memories, the information that contextually fits the current question, enabling better interactions for complex real-world user queries. Experiments on LCMP demonstrate that TAME achieves the best performance, showcasing remarkable and evolving interaction experiences in long-context scenarios.
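The abstract describes TAME only at a high level, so the following is a minimal, purely illustrative Python sketch of the dual-memory, retrieve-then-align flow. All names here (ConceptMemory, retrieve, align, build_prompt) are assumptions, and a simple keyword-overlap heuristic stands in for whatever retriever and aligner TAME actually uses; this is not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptMemory:
    """Two memory stores for one personalized concept (hypothetical layout).

    persistent: long-lived facts that rarely change (e.g. "Mochi is a yellow puppy").
    temporal:   recent, session-level observations (e.g. "Mochi got a new blue leash").
    """
    persistent: list = field(default_factory=list)
    temporal: list = field(default_factory=list)


def retrieve(memory: ConceptMemory, question: str, top_k: int = 2) -> list:
    """Naive keyword-overlap retrieval over both memories (stand-in for a real retriever)."""
    q_tokens = set(question.lower().split())
    candidates = memory.persistent + memory.temporal
    ranked = sorted(candidates,
                    key=lambda entry: len(q_tokens & set(entry.lower().split())),
                    reverse=True)
    return ranked[:top_k]


def align(retrieved: list, question: str) -> list:
    """Alignment step: keep only retrieved entries that actually share content with the question."""
    q_tokens = set(question.lower().split())
    return [entry for entry in retrieved if q_tokens & set(entry.lower().split())]


def build_prompt(question: str, aligned: list) -> str:
    """Compose the prompt handed to a frozen MLLM (training-free: no weights are updated)."""
    context = "\n".join(f"- {entry}" for entry in aligned)
    return f"Known about this concept:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    mochi = ConceptMemory(
        persistent=["Mochi is your yellow puppy", "Mochi dislikes loud noises"],
        temporal=["Mochi got a new blue leash yesterday"],
    )
    question = "Which leash should I pack for Mochi's walk?"
    print(build_prompt(question, align(retrieve(mochi, question), question)))
```

In this toy run, retrieval surfaces both a relevant and an irrelevant memory entry, and the alignment step filters the prompt down to the entry that fits the question; this is, at a much smaller scale, the role RA2G's alignment step is described as playing.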
Related papers
- Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models [53.06230963851451]
We introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs.
arXiv Detail & Related papers (2025-12-17T19:01:34Z) - A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [112.81207927088117]
PersonaConvBench is a benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements.
arXiv Detail & Related papers (2025-05-20T09:13:22Z) - RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models [53.304699445700926]
We introduce the Retrieval Augmented Personalization (RAP) framework for MLLM personalization. Starting from a general MLLM, we turn it into a personalized assistant in three steps. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning.
arXiv Detail & Related papers (2024-10-17T09:10:26Z) - Personalized Visual Instruction Tuning [30.677058613937067]
Multimodal large language models (MLLMs) can engage in general conversations but fail to conduct personalized dialogues targeting specific individuals.
This deficiency hinders the application of MLLMs in personalized settings, such as tailored visual assistants on mobile devices.
We introduce Personalized Visual Instruction Tuning (PVIT), a novel data curation and training framework designed to enable MLLMs to identify target individuals within an image.
arXiv Detail & Related papers (2024-10-09T17:46:53Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - StyleChat: Learning Recitation-Augmented Memory in LLMs for Stylized Dialogue Generation [43.29667566560533]
We introduce a stylized dialogue dataset, StyleEval, with 38 styles by leveraging the generative power of Large Language Models (LLMs).
We propose the stylized dialogue framework StyleChat via recitation-augmented memory strategy and multi-task style learning strategy to promote generalization ability.
arXiv Detail & Related papers (2024-03-18T03:26:18Z) - Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond [99.73306923465424]
We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images.
By memorizing images in MLLMs, we introduce a new paradigm for cross-modal retrieval, distinct from previous discriminative approaches (a toy identifier-lookup sketch appears after this list).
arXiv Detail & Related papers (2024-02-16T16:31:46Z) - TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z) - Generating Personalized Dialogue via Multi-Task Meta-Learning [28.721411816698563]
We propose a novel multi-task meta-learning approach to personalized dialogue generation.
The model is tasked with generating personalized responses based on only the dialogue context.
We introduce two frameworks that adopt the proposed multi-task meta-learning approach.
arXiv Detail & Related papers (2021-08-07T06:38:35Z)
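As referenced in the Generative Cross-Modal Retrieval entry above, here is a toy sketch of the identifier-string idea: images are registered under unique identifier strings, and retrieval amounts to generating the right identifier for a query. The catalogues (IMAGE_IDS, CAPTIONS) and the caption-overlap heuristic below are invented for illustration; in the actual paradigm, an MLLM that has memorized the images generates the identifier itself.

```python
# Hypothetical catalogue mapping identifier strings to stored images.
IMAGE_IDS = {
    "img-0001": "photos/mochi_park.jpg",
    "img-0002": "photos/mochi_birthday.jpg",
}

# Hypothetical captions used only by the toy heuristic below.
CAPTIONS = {
    "img-0001": "mochi playing fetch in the park",
    "img-0002": "mochi wearing a party hat at a birthday",
}


def generate_identifier(query: str) -> str:
    """Stand-in for the model: emit the identifier whose caption best overlaps the query."""
    q_tokens = set(query.lower().split())
    return max(CAPTIONS, key=lambda img_id: len(q_tokens & set(CAPTIONS[img_id].split())))


def retrieve_image(query: str) -> str:
    """Generative retrieval: produce an identifier string, then resolve it to the stored image."""
    return IMAGE_IDS[generate_identifier(query)]


if __name__ == "__main__":
    print(retrieve_image("show me mochi at the park"))  # -> photos/mochi_park.jpg
```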
This list is automatically generated from the titles and abstracts of the papers in this site.