Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation
- URL: http://arxiv.org/abs/2511.08971v1
- Date: Thu, 13 Nov 2025 01:22:31 GMT
- Title: Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation
- Authors: Sicheng Yang, Yukai Huang, Weitong Cai, Shitong Sun, You He, Jiankang Deng, Hang Zhang, Jifei Song, Zhensong Zhang
- Abstract summary: The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity. We introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks. Our framework improves the intent clarification performance of small language models by approximately 30%, making them competitive with significantly larger counterparts.
- Score: 60.63465682731118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity. This challenge arises from a combination of underspecified language, imperfect visual data, and deictic gestures, which frequently leads to task failure. Existing monolithic Vision-Language Models (VLMs) struggle to resolve these ambiguous multimodal inputs, often failing silently or hallucinating responses. To address these ambiguities, we introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks. Specifically, our framework consists of three synergistic modules: (1) a text clarifier that uses dialogue-driven reasoning to interactively disambiguate linguistic intent, (2) a vision clarifier that delivers real-time guidance feedback, instructing users to adjust their positioning for improved capture quality, and (3) a cross-modal clarifier with a grounding mechanism that robustly interprets 3D pointing gestures and identifies the specific objects users are pointing to. Extensive experiments demonstrate that our framework improves the intent clarification performance of small language models (4-8B) by approximately 30%, making them competitive with significantly larger counterparts. We also observe consistent gains when applying our framework to these larger models. Furthermore, our vision clarifier increases corrective guidance accuracy by over 20%, and our cross-modal clarifier improves semantic answer accuracy for referential grounding by 5%. Overall, our method provides a plug-and-play framework that effectively resolves multimodal ambiguity and significantly enhances user experience in egocentric interaction.
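The abstract describes the three clarifier modules only at a high level. As a minimal illustrative sketch of how such a modular, plug-and-play pipeline could be wired around an off-the-shelf language model, the Python below is offered for orientation only: the class names, prompt, dispatch order (vision check, then text clarification, then cross-modal grounding), and the 0.5 quality threshold are all assumptions introduced here, not the authors' implementation.

```python
"""Illustrative sketch only: none of these class names, prompts, or thresholds
come from the paper; they stand in for the three modules named in the abstract."""

from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Observation:
    """One egocentric turn: the user's utterance plus raw visual/gesture input."""
    utterance: str
    frame_quality: float                                 # 0.0-1.0 capture-quality score (assumed)
    pointing_ray: Optional[Tuple[float, float, float]]   # hypothetical 3D pointing direction
    candidate_objects: List[str] = field(default_factory=list)


class TextClarifier:
    """Dialogue-driven reasoning: ask a clarifying question when language is underspecified."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm  # any chat-style model callable: prompt -> reply string

    def clarify(self, utterance: str) -> dict:
        prompt = (
            "Decide whether the request below is ambiguous. "
            "If ambiguous, ask ONE clarifying question; otherwise restate the intent.\n"
            f"Request: {utterance}"
        )
        reply = self.llm(prompt).strip()
        # Crude stand-in for the paper's dialogue-driven disambiguation logic.
        return {"ambiguous": reply.endswith("?"), "message": reply}


class VisionClarifier:
    """Capture-quality feedback: tell the user how to adjust their positioning."""

    def clarify(self, obs: Observation) -> Optional[str]:
        if obs.frame_quality < 0.5:  # threshold is an assumption, not from the paper
            return "Please move closer and keep the object centered in view."
        return None


class CrossModalClarifier:
    """Grounding: map a 3D pointing gesture onto a candidate object."""

    def clarify(self, obs: Observation) -> Optional[str]:
        if obs.pointing_ray is None or not obs.candidate_objects:
            return None
        # Placeholder for ray-object intersection / referential grounding.
        return obs.candidate_objects[0]


class PlugAndPlayClarifier:
    """Composes the three modules in front of any downstream agent or VLM."""

    def __init__(self, llm: Callable[[str], str]):
        self.text = TextClarifier(llm)
        self.vision = VisionClarifier()
        self.cross_modal = CrossModalClarifier()

    def run(self, obs: Observation) -> dict:
        guidance = self.vision.clarify(obs)
        if guidance:  # fix capture quality before reasoning about intent
            return {"action": "guide_user", "message": guidance}
        text_result = self.text.clarify(obs.utterance)
        if text_result["ambiguous"]:
            return {"action": "ask_user", "message": text_result["message"]}
        target = self.cross_modal.clarify(obs)
        return {"action": "execute", "intent": text_result["message"], "target": target}


if __name__ == "__main__":
    def toy_llm(prompt: str) -> str:
        # Toy backend standing in for a real chat model.
        return "Which cup do you mean, the red one or the blue one?"

    obs = Observation(
        utterance="pick up the cup",
        frame_quality=0.9,
        pointing_ray=(0.1, 0.2, 1.0),
        candidate_objects=["red cup", "blue cup"],
    )
    print(PlugAndPlayClarifier(toy_llm).run(obs))
```

In use, `PlugAndPlayClarifier(llm).run(obs)` either returns corrective guidance, a clarifying question, or a disambiguated intent plus a grounded target for the downstream agent, mirroring the division of labor described in the abstract.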
Related papers
- Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction [0.0]
Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations. We conduct a systematic evaluation of conversational reliability through three representative tasks. We observe substantial declines in reliability, particularly for smaller models.
arXiv Detail & Related papers (2026-03-02T03:59:40Z)
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
- BARE: Towards Bias-Aware and Reasoning-Enhanced One-Tower Visual Grounding [30.694164546429928]
We propose BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding. We show that BARE achieves state-of-the-art performance and delivers superior computational efficiency compared to existing approaches.
arXiv Detail & Related papers (2026-01-04T13:30:06Z)
- Reasoning-Aware Multimodal Fusion for Hateful Video Detection [28.9889316637547]
Hate speech in online videos is posing an increasingly serious threat to digital platforms. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities. We propose an innovative Reasoning-Aware Multimodal Fusion framework.
arXiv Detail & Related papers (2025-12-02T13:24:17Z)
- When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models [75.16145284285456]
We introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. We develop the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark reveal that even minor multimodal perturbations can cause significant behavioral deviations.
arXiv Detail & Related papers (2025-11-20T10:14:32Z)
- Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling [3.5408685781175016]
Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. We propose a lightweight decoder-based architecture with token-wise dynamic gating for adaptive fusion of linguistic and visual cues.
arXiv Detail & Related papers (2025-10-09T17:10:36Z)
- Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models [65.23999399834638]
We introduce DeceptionDecoded, a benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles. The dataset captures both misleading and non-misleading cases, spanning manipulations across visual and textual modalities. It supports three intent-centric tasks: misleading intent detection, misleading source attribution, and creator desire inference.
arXiv Detail & Related papers (2025-05-21T13:14:32Z)
- Intent Representation Learning with Large Language Model for Recommendation [11.118517297006894]
We propose a model-agnostic framework, Intent Representation Learning with Large Language Model (IRLLRec), to construct multimodal intents and enhance recommendations. Specifically, IRLLRec employs a dual-tower architecture to learn multimodal intent representations. To better match textual and interaction-based intents, we employ momentum distillation to perform teacher-student learning on fused intent representations.
arXiv Detail & Related papers (2025-02-05T16:08:05Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning [50.1035273069458]
Spoken language understanding (SLU) is a core task in task-oriented dialogue systems.
We propose a multi-level MMCL framework to apply contrastive learning at three levels, including utterance level, slot level, and word level.
Our framework achieves new state-of-the-art results on two public multi-intent SLU datasets.
arXiv Detail & Related papers (2024-05-31T14:34:23Z)
- Can Your Model Tell a Negation from an Implicature? Unravelling Challenges With Intent Encoders [24.42199777529863]
Large Language Models (LLMs) enable embeddings allowing one to adjust semantics over the embedding space using prompts.
Traditional evaluation benchmarks rely solely on task metrics that don't particularly measure gaps related to semantic understanding.
We propose an intent semantic toolkit that gives a more holistic view of intent embedding models.
arXiv Detail & Related papers (2024-03-07T08:32:17Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)