Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU
- URL: http://arxiv.org/abs/2602.15707v1
- Date: Tue, 17 Feb 2026 16:41:51 GMT
- Title: Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU
- Authors: Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili
- Abstract summary: We propose a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight privacy-preserving modalities. This assistant proactively communicates step-by-step instructions to a user performing a furniture assembly task, and answers user questions.
- Score: 7.116403133334644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. This assistant proactively communicates step-by-step instructions to a user performing a furniture assembly task, and answers user questions. We construct a dataset containing conversations where the assistant guides the user in performing the task. On observing that an off-the-shelf language model is a very talkative assistant, we design a novel User Whim Agnostic (UWA) LoRA finetuning method which improves the model's ability to suppress less informative dialogues, while maintaining its tendency to communicate important instructions. This leads to >30% improvement in the F-score. Finetuning the model also results in a 16x speedup by eliminating the need to provide in-context examples in the prompt. We further describe how such an assistant is implemented on edge devices with no dependence on the cloud.
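The UWA finetuning method itself is the paper's contribution and its details are not reproduced here, but the general recipe it builds on, LoRA finetuning of a causal language model on context-to-instruction pairs, can be sketched. Below is a minimal illustration assuming the Hugging Face `transformers` and `peft` libraries; the model name, the `<silent>` target marker, and the event-tag prompt format are placeholder assumptions, not the paper's actual setup.

```python
# A generic LoRA finetuning sketch: train the assistant to emit either a step
# instruction or an explicit silence marker, so it learns when not to speak.
# Model name, prompt format, and the <silent> marker are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-350m"  # placeholder; any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights train

# Each training example maps sensed context (audio/IMU event tags and task
# state) to either a spoken instruction or a suppressed turn.
speak_example = {
    "prompt": "Events: [drilling sound, wrist rotation] Step 3/12\nAssistant:",
    "target": " Keep turning the cam lock until it clicks.",
}
silent_example = {
    "prompt": "Events: [ambient noise, no motion] Step 3/12\nAssistant:",
    "target": " <silent>",  # assumed marker for turns the assistant suppresses
}
```

Training on explicit silent targets is one plausible way to curb a "very talkative" model; it is also consistent with the abstract's note that finetuning removes the need for in-context examples in the prompt, which is where the 16x speedup comes from.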
Related papers
- Do LLMs Benefit From Their Own Words? [56.73014497206615]
We find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side context can reduce cumulative context lengths by up to 10x. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.
arXiv Detail & Related papers (2026-02-27T18:58:26Z)
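As a rough picture of what omitting assistant-side context looks like mechanically, here is a minimal sketch; the `prune_assistant_history` helper and the message format are hypothetical, and the paper's policy for selecting which turns to omit is more involved than keeping the last N.

```python
# Hypothetical pruning of assistant turns from a chat history to shrink the
# cumulative context; not the paper's actual selection policy.
from typing import Dict, List

Message = Dict[str, str]  # e.g., {"role": "user", "content": "..."}

def prune_assistant_history(messages: List[Message], keep_last: int = 1) -> List[Message]:
    """Drop all but the most recent `keep_last` assistant turns."""
    assistant_idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    drop = set(assistant_idx[:-keep_last] if keep_last > 0 else assistant_idx)
    return [m for i, m in enumerate(messages) if i not in drop]

history = [
    {"role": "user", "content": "Attach the side panel."},
    {"role": "assistant", "content": "Use the four short screws."},
    {"role": "user", "content": "Done. What next?"},
    {"role": "assistant", "content": "Now mount the back board."},
]
print(prune_assistant_history(history))  # oldest assistant reply is dropped
```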
- SpeechLess: Micro-utterance with Personalized Spatial Memory-aware Assistant in Everyday Augmented Reality [6.523396381538382]
SpeechLess is a wearable AR assistant that introduces a speech-based intent control paradigm grounded in personalized spatial memory. Our results indicate that SpeechLess can improve everyday information access, reduce articulation effort, and support socially acceptable use without substantially degrading perceived usability or intent resolution accuracy across diverse everyday environments.
arXiv Detail & Related papers (2026-01-31T16:01:32Z)
- All You Need is One: Capsule Prompt Tuning with a Single Vector [86.68105855537762]
Current prompt-based learning methods rely on laborious grid searching for the optimal prompt length and typically require a considerable number of prompts. We introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that injects off-the-shelf, informative instance semantics into prompt-based learning. Our approach integrates both instance-aware and task-aware information in a nearly parameter-free manner.
arXiv Detail & Related papers (2025-10-19T00:02:59Z)
- StepWrite: Adaptive Planning for Speech-Driven Text Generation [18.286742472385633]
StepWrite is a large language model-driven, voice-based interaction system. It enables structured, hands-free and eyes-free composition of longer-form texts while on the move. It reduces cognitive load by offloading the context-tracking and adaptive planning tasks to the models.
arXiv Detail & Related papers (2025-08-06T01:50:17Z)
- Creating General User Models from Computer Use [53.59999173952482]
This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer. The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence-weighted propositions that capture user knowledge and preferences.
arXiv Detail & Related papers (2025-05-16T04:00:31Z)
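A confidence-weighted proposition is straightforward to picture as a small record type; the sketch below is purely illustrative, and the class and field names are assumptions rather than the GUM architecture's actual schema.

```python
# Illustrative record type for confidence-weighted propositions about a user;
# names and fields are assumptions, not the GUM paper's schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Proposition:
    text: str          # natural-language statement about the user
    confidence: float  # model-assigned weight in [0, 1]
    evidence: List[str] = field(default_factory=list)  # supporting observations

def filter_confident(props: List[Proposition], threshold: float = 0.8) -> List[Proposition]:
    """Keep only propositions the model is reasonably sure about."""
    return [p for p in props if p.confidence >= threshold]

props = [
    Proposition("Prefers dark mode", 0.95, ["screenshot_0412"]),
    Proposition("Is learning Rust", 0.55, ["screenshot_0907"]),
]
print(filter_confident(props))  # only the dark-mode proposition survives
```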
- LLAMAPIE: Proactive In-Ear Conversation Assistants [9.312108526830665]
We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations.
arXiv Detail & Related papers (2025-05-07T02:08:56Z)
- Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues [54.81155589931697]
Collaborative Instance object Navigation (CoIN) is a new task setting where the agent actively resolves uncertainties about the target instance. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate observation description. An Interaction Trigger module then determines whether to ask the human a question, continue navigating, or halt.
arXiv Detail & Related papers (2024-12-02T08:16:38Z)
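The described flow, self-dialogue to refine an observation followed by a trigger that decides whether talking to the human is worthwhile, can be sketched as a simple control loop. Everything below is an assumed placeholder for illustration, not AIUTA's actual implementation.

```python
# Hypothetical control loop in the spirit of a Self-Questioner plus an
# Interaction Trigger; all functions and thresholds here are assumptions.
from enum import Enum

class Action(Enum):
    ASK_HUMAN = "ask_human"
    CONTINUE = "continue"
    HALT = "halt"

def self_questioner(detection: str) -> tuple[str, float]:
    """Placeholder: self-dialogue refines the observation and scores certainty."""
    description = f"refined description of {detection}"
    uncertainty = 0.3  # assumed score in [0, 1]; lower means more certain
    return description, uncertainty

def interaction_trigger(uncertainty: float) -> Action:
    """Placeholder policy: interrupt the human only when genuinely unsure."""
    if uncertainty > 0.5:
        return Action.ASK_HUMAN  # a question is worth the interruption
    if uncertainty > 0.1:
        return Action.CONTINUE   # keep navigating and observing
    return Action.HALT           # confident enough to stop at the target

description, uncertainty = self_questioner("a red armchair")
print(description, "->", interaction_trigger(uncertainty).value)
```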
- Distilling an End-to-End Voice Assistant Without Instruction Training Data [53.524071162124464]
Distilled Voice Assistant (DiVA) generalizes to Question Answering, Classification, and Translation.
We show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio.
arXiv Detail & Related papers (2024-10-03T17:04:48Z)
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300 ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
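A cross-attention fusion of acoustic and linguistic features with two prediction heads can be sketched in a few lines of PyTorch. The dimensions, head counts, and layout below are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch: linguistic tokens attend to acoustic frames, then separate
# heads predict forthcoming words and time remaining until end of utterance.
import torch
import torch.nn as nn

class EouPredictor(nn.Module):
    def __init__(self, d_model: int = 256, vocab_size: int = 1000, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.next_word_head = nn.Linear(d_model, vocab_size)  # forthcoming-word logits
        self.eou_head = nn.Linear(d_model, 1)                 # seconds until EOU

    def forward(self, text_feats: torch.Tensor, audio_feats: torch.Tensor):
        # Queries come from text; keys/values come from audio.
        fused, _ = self.cross_attn(text_feats, audio_feats, audio_feats)
        return self.next_word_head(fused), self.eou_head(fused).squeeze(-1)

model = EouPredictor()
text = torch.randn(2, 10, 256)   # (batch, text steps, feature dim)
audio = torch.randn(2, 50, 256)  # (batch, audio frames, feature dim)
word_logits, time_to_eou = model(text, audio)
print(word_logits.shape, time_to_eou.shape)  # (2, 10, 1000) and (2, 10)
```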
arXiv Detail & Related papers (2024-09-30T06:29:58Z) - Context-dependent Instruction Tuning for Dialogue Response Generation [61.21790201307179]
Recent language models have achieved impressive performance in natural language tasks by incorporating instructions with the task input during fine-tuning.
We introduce a context-based instruction fine-tuning framework for multi-turn dialogue.
During evaluation, the model generates instructions based on the previous context to self-guide the response.
arXiv Detail & Related papers (2023-11-13T01:25:30Z)
- Rewriting the Script: Adapting Text Instructions for Voice Interaction [39.54213483588498]
We study the limitations of the dominant approach voice assistants take to complex task guidance.
We propose eight ways in which voice assistants can transform written sources into forms that are readily communicated through spoken conversation.
arXiv Detail & Related papers (2023-06-16T17:43:00Z)
- NaRLE: Natural Language Models using Reinforcement Learning with Emotion Feedback [0.37277730514654556]
"NARLE" is a framework for improving the natural language understanding of dialogue systems online without the need to collect human labels for customer data.
For two intent classification problems, we empirically show that using reinforcement learning to fine-tune pre-trained supervised models improves performance by up to 43%.
arXiv Detail & Related papers (2021-10-05T16:24:19Z)
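Reward-driven fine-tuning of a pretrained classifier, with the reward standing in for implicit emotion feedback, can be sketched with a toy REINFORCE loop. The reward function, model, and data below are stand-ins, not NaRLE's actual pipeline.

```python
# Toy REINFORCE sketch: sample intents, reward the ones that (in this stand-in)
# correlate with positive user emotion, and reinforce their log-probabilities.
import torch
import torch.nn as nn

torch.manual_seed(0)
classifier = nn.Linear(16, 4)  # stand-in for a pretrained intent classifier
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def emotion_reward(intents: torch.Tensor) -> torch.Tensor:
    """Placeholder reward: +1 when (assumed) intent 2 leaves users satisfied,
    -1 otherwise. In practice this comes from an emotion recognizer."""
    return (intents == 2).float() * 2.0 - 1.0

for step in range(100):
    features = torch.randn(8, 16)  # a batch of utterance embeddings
    dist = torch.distributions.Categorical(logits=classifier(features))
    intents = dist.sample()        # sample intents to allow exploration
    loss = -(emotion_reward(intents) * dist.log_prob(intents)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```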