Rewriting the Script: Adapting Text Instructions for Voice Interaction
- URL: http://arxiv.org/abs/2306.09992v1
- Date: Fri, 16 Jun 2023 17:43:00 GMT
- Title: Rewriting the Script: Adapting Text Instructions for Voice Interaction
- Authors: Alyssa Hwang, Natasha Oza, Chris Callison-Burch, Andrew Head
- Abstract summary: We study the limitations of the dominant approach voice assistants take to complex task guidance.
We propose eight ways in which voice assistants can transform written sources into forms that are readily communicated through spoken conversation.
- Score: 39.54213483588498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice assistants have sharply risen in popularity in recent years, but their
use has been limited mostly to simple applications like music, hands-free
search, or control of internet-of-things devices. What would it take for voice
assistants to guide people through more complex tasks? In our work, we study
the limitations of the dominant approach voice assistants take to complex task
guidance: reading aloud written instructions. Using recipes as an example, we
observe twelve participants cook at home with a state-of-the-art voice
assistant. We learn that the current approach leads to nine challenges,
including obscuring the bigger picture, overwhelming users with too much
information, and failing to communicate affordances. Instructions delivered by
a voice assistant are especially difficult because they cannot be skimmed as
easily as written instructions. Alexa in particular did not surface crucial
details to the user or answer questions well. We draw on our observations to
propose eight ways in which voice assistants can "rewrite the script" --
summarizing, signposting, splitting, elaborating, volunteering, reordering,
redistributing, and visualizing -- to transform written sources into forms that
are readily communicated through spoken conversation. We conclude with a vision
of how modern advancements in natural language processing can be leveraged for
intelligent agents to guide users effectively through complex tasks.
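As a rough illustration (ours, not the authors'; every name below is hypothetical), the following minimal Python sketch shows how two of the eight strategies, splitting and signposting, might transform a written recipe step before a voice assistant reads it aloud:

```python
# Hypothetical sketch of two of the paper's eight strategies.
# "Splitting" breaks a dense written step into sentence-sized spoken turns;
# "signposting" orients the listener within the overall task.
# The paper describes these strategies conceptually and does not prescribe
# an implementation; all names here are illustrative.
import re

def split_step(step: str) -> list[str]:
    """Splitting: divide a written step into one spoken turn per sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", step.strip())
    return [s for s in sentences if s]

def signpost(turns: list[str], step_num: int, total_steps: int) -> list[str]:
    """Signposting: prepend a cue that situates the step in the whole recipe."""
    return [f"Step {step_num} of {total_steps}."] + turns

written = ("Whisk the eggs with the sugar until pale. "
           "Fold in the flour gently, then pour the batter into the pan.")
for turn in signpost(split_step(written), step_num=2, total_steps=5):
    print(turn)
```

Strategies like summarizing, elaborating, and volunteering depend on understanding the content of the instructions, so in practice they would plausibly be handled by a language model rather than by simple string processing like this.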
Related papers
- Distilling an End-to-End Voice Assistant Without Instruction Training Data [53.524071162124464]
Distilled Voice Assistant (DiVA) generalizes to Question Answering, Classification, and Translation.
We show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio.
arXiv Detail & Related papers (2024-10-03T17:04:48Z)
- Follow-on Question Suggestion via Voice Hints for Voice Assistants [29.531005346608215]
We tackle the novel task of suggesting questions with compact and natural voice hints to allow users to ask follow-up questions.
We propose baselines and an approach using sequence-to-sequence Transformers to generate spoken hints from a list of questions.
Results show that a naive approach of concatenating suggested questions creates poor voice hints.
arXiv Detail & Related papers (2023-10-25T22:22:18Z)
- Referring to Screen Texts with Voice Assistants [5.62305568174015]
Our work dives into a new experience for users to refer to phone numbers, addresses, email addresses, URLs, and dates on their phone screens.
Our focus lies in reference understanding, which becomes particularly interesting when multiple similar texts are present on screen.
Because consuming pixels directly is costly, our system is designed to rely on text extracted from the UI.
arXiv Detail & Related papers (2023-06-10T22:43:16Z)
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.
We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
arXiv Detail & Related papers (2023-04-25T17:05:38Z)
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z)
- DualVoice: Speech Interaction that Discriminates between Normal and Whispered Voice Input [16.82591185507251]
In speech input, there is no easy way to distinguish between commands being issued and text to be entered.
Inputting symbols and commands is also challenging because they may be misrecognized as ordinary text.
This study proposes a speech interaction method called DualVoice, by which commands can be input in a whispered voice and letters in a normal voice.
arXiv Detail & Related papers (2022-08-22T13:01:28Z)
- A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is the task of labeling audio or video recordings with classes corresponding to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z)
- Challenges in Supporting Exploratory Search through Voice Assistants [9.861790101853863]
As people get more familiar with voice assistants, they may increase their expectations for more complex tasks.
We outline four challenges in designing voice assistants that can better support exploratory search.
arXiv Detail & Related papers (2020-03-06T01:10:39Z)
- VoiceCoach: Interactive Evidence-based Training for Voice Modulation Skills in Public Speaking [55.366941476863644]
The modulation of voice properties, such as pitch, volume, and speed, is crucial for delivering a successful public speech.
We present VoiceCoach, an interactive evidence-based approach to facilitate the effective training of voice modulation skills.
arXiv Detail & Related papers (2020-01-22T04:52:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.