StepWrite: Adaptive Planning for Speech-Driven Text Generation
- URL: http://arxiv.org/abs/2508.04011v1
- Date: Wed, 06 Aug 2025 01:50:17 GMT
- Title: StepWrite: Adaptive Planning for Speech-Driven Text Generation
- Authors: Hamza El Alaoui, Atieh Taheri, Yi-Hao Peng, Jeffrey P. Bigham
- Abstract summary: StepWrite is a large language model-driven voice-based interaction system. It enables structured, hands-free and eyes-free composition of longer-form texts while on the move. It reduces cognitive load by offloading the context-tracking and adaptive planning tasks to the models.
- Score: 18.286742472385633
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: People frequently use speech-to-text systems to compose short texts with voice. However, current voice-based interfaces struggle to support composing more detailed, contextually complex texts, especially in scenarios where users are on the move and cannot visually track progress. Longer-form communication, such as composing structured emails or thoughtful responses, requires persistent context tracking, structured guidance, and adaptability to evolving user intentions; conventional dictation tools and voice assistants do not support these capabilities. We introduce StepWrite, a large language model-driven voice-based interaction system that augments human writing ability by enabling structured, hands-free and eyes-free composition of longer-form texts while on the move. StepWrite decomposes the writing process into manageable subtasks and sequentially guides users with contextually-aware non-visual audio prompts. StepWrite reduces cognitive load by offloading the context-tracking and adaptive planning tasks to the models. Unlike baseline methods such as standard dictation features (e.g., Microsoft Word) and conversational voice assistants (e.g., ChatGPT Advanced Voice Mode), StepWrite dynamically adapts its prompts based on the evolving context and user intent, and provides coherent guidance without compromising user autonomy. An empirical evaluation with 25 participants engaging in mobile or stationary hands-occupied activities demonstrated that StepWrite significantly reduces cognitive load and improves usability and user satisfaction compared to baseline methods. Technical evaluations further confirmed StepWrite's capability in dynamic contextual prompt generation, accurate tone alignment, and effective fact checking. This work highlights the potential of structured, context-aware voice interactions for enhancing hands-free and eyes-free communication in everyday multitasking scenarios.
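The abstract describes a loop in which an LLM plans the next writing subtask, voices it as an audio prompt, folds the user's spoken answer back into the context, and repeats until a draft can be produced. Below is a minimal sketch of such a loop, assuming a generic LLM completion function; all names (WritingContext, plan_next_step, speak, listen) are hypothetical illustrations, not StepWrite's published API.

```python
# Minimal sketch of an adaptive, subtask-driven dictation loop in the spirit
# of the abstract. All names are hypothetical; the paper does not publish
# this API. llm_complete, speak, and listen are injected callables.
from dataclasses import dataclass, field

@dataclass
class WritingContext:
    goal: str                      # e.g., "decline a meeting politely"
    completed_steps: list = field(default_factory=list)  # (prompt, answer) pairs

def plan_next_step(ctx: WritingContext, llm_complete) -> str | None:
    """Ask the LLM for the next audio prompt, conditioned on everything so far."""
    history = "\n".join(f"Q: {q}\nA: {a}" for q, a in ctx.completed_steps)
    planning_prompt = (
        f"The user is composing a text with this goal: {ctx.goal}\n"
        f"Progress so far:\n{history or '(none)'}\n"
        "Reply with the single most useful next question to ask the user, "
        "or DONE if enough information has been gathered."
    )
    step = llm_complete(planning_prompt).strip()
    return None if step == "DONE" else step

def compose(ctx: WritingContext, llm_complete, speak, listen) -> str:
    """Run the prompt/answer loop, then draft the final text from the answers."""
    while (question := plan_next_step(ctx, llm_complete)) is not None:
        speak(question)                                   # non-visual audio prompt
        ctx.completed_steps.append((question, listen()))  # user's spoken answer
    return llm_complete(
        f"Goal: {ctx.goal}\nAnswers:\n"
        + "\n".join(f"{q} -> {a}" for q, a in ctx.completed_steps)
        + "\nDraft the final text in the user's voice."
    )
```

Carrying the full question-answer history into each planning prompt is what would let the next question adapt to evolving context and intent, which is the adaptivity the abstract emphasizes.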
Related papers
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
- Semantics-Aware Human Motion Generation from Audio Instructions [25.565742045932236]
This paper explores a new task, where audio signals are used as conditioning inputs to generate motions that align with the semantics of the audio. We propose an end-to-end framework using a masked generative transformer, enhanced by a memory-retrieval attention module to handle sparse and lengthy audio inputs.
arXiv Detail & Related papers (2025-05-29T14:16:27Z)
- InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training [23.330297074014315]
In this paper, we introduce a simple and scalable training method called InSerter, which stands for Interleaved Speech-Text Representation Pre-training. InSerter is designed to pre-train on large-scale unsupervised speech-text sequences, where the speech is synthesized from randomly selected segments of an extensive text corpus using text-to-speech conversion. Our proposed InSerter achieves SOTA performance on SpeechInstructBench and demonstrates superior or competitive results across diverse speech processing tasks.
arXiv Detail & Related papers (2025-03-04T16:34:14Z)
- Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis [3.8251125989631674]
We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system.
It derives the conveyed emotion from text input and synthesises audio that focuses on emotions and speaker features for natural and expressive speech.
Our system showcases competitive inference time performance when benchmarked against state-of-the-art TTS models.
arXiv Detail & Related papers (2024-10-24T23:18:02Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [98.34889301515412]
We develop the Qwen-Audio model and address the limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types.
Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning.
We further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enabling multi-turn dialogues and supporting various audio-centric scenarios.
arXiv Detail & Related papers (2023-11-14T05:34:50Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Rewriting the Script: Adapting Text Instructions for Voice Interaction [39.54213483588498]
We study the limitations of the dominant approach voice assistants take to complex task guidance.
We propose eight ways in which voice assistants can transform written sources into forms that are readily communicated through spoken conversation.
arXiv Detail & Related papers (2023-06-16T17:43:00Z)
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z) - Hierarchical Summarization for Longform Spoken Dialog [1.995792341399967]
Despite the pervasiveness of spoken dialog, automated speech understanding and quality information extraction remain markedly poor.
Compared to understanding text, auditory communication poses many additional challenges such as speaker disfluencies, informal prose styles, and lack of structure.
We propose a two-stage ASR and text summarization pipeline, together with a set of semantic segmentation and merging algorithms, to resolve these speech modeling challenges.
arXiv Detail & Related papers (2021-08-21T23:31:31Z)
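As a rough illustration of the last entry's two-stage design, the sketch below transcribes audio, greedily segments the transcript by semantic similarity, merges same-topic utterances, and summarizes hierarchically. The helper callables (transcribe, embed, summarize) and the cosine-threshold heuristic are assumptions for illustration, not the paper's actual algorithms.

```python
# Illustrative two-stage pipeline: ASR first, then semantic segmentation,
# merging, and per-segment summarization. Function arguments transcribe,
# embed, and summarize are hypothetical injected callables.
def summarize_dialog(audio, transcribe, embed, summarize, sim_threshold=0.7):
    """Stage 1: ASR; Stage 2: segment, merge, and summarize the transcript."""
    utterances = transcribe(audio)            # list of utterance strings

    # Greedy semantic segmentation: start a new segment when the next
    # utterance drifts away from the current segment's running topic.
    segments, current = [], [utterances[0]]
    for utt in utterances[1:]:
        if cosine(embed(" ".join(current)), embed(utt)) >= sim_threshold:
            current.append(utt)               # same topic: merge into segment
        else:
            segments.append(current)          # topic shift: close the segment
            current = [utt]
    segments.append(current)

    # Hierarchical summarization: summarize each segment, then the summaries.
    segment_summaries = [summarize(" ".join(seg)) for seg in segments]
    return summarize(" ".join(segment_summaries))

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain Python)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / ((norm(a) * norm(b)) or 1.0)
```

Summarizing segments before summarizing the summaries is one simple way to cope with the disfluencies and lack of structure the entry cites, since each stage sees a shorter, more topically coherent input.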