Turn-taking and Backchannel Prediction with Acoustic and Large Language
Model Fusion
- URL: http://arxiv.org/abs/2401.14717v1
- Date: Fri, 26 Jan 2024 08:59:07 GMT
- Title: Turn-taking and Backchannel Prediction with Acoustic and Large Language
Model Fusion
- Authors: Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di
He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran
- Abstract summary: We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM).
Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models.
- Score: 38.78341787348164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an approach for continuous prediction of turn-taking and
backchanneling locations in spoken dialogue by fusing a neural acoustic model
with a large language model (LLM). Experiments on the Switchboard human-human
conversation dataset demonstrate that our approach consistently outperforms
single-modality baseline models. We also develop a novel multi-task
instruction fine-tuning strategy to further benefit from LLM-encoded knowledge
for understanding the tasks and conversational contexts, leading to additional
improvements. Our approach demonstrates the potential of combining LLMs and
acoustic models for a more natural and conversational interaction between
humans and speech-enabled AI agents.
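As a concrete illustration of the fusion idea, below is a minimal sketch assuming a simple late-fusion design: per-frame acoustic embeddings and a pooled LLM embedding of the dialogue context are projected into a shared space, concatenated, and classified frame by frame into continue / turn-taking / backchannel. The module names, dimensions, and GRU-over-concatenation fusion are illustrative assumptions, not the authors' released architecture.

```python
# Minimal late-fusion sketch (illustrative; not the paper's released code).
import torch
import torch.nn as nn

class FusionTurnTakingModel(nn.Module):
    """Fuse frame-level acoustic embeddings with a pooled LLM context
    embedding and emit per-frame logits over three hypothetical events:
    0 = continue, 1 = turn-taking, 2 = backchannel."""

    def __init__(self, acoustic_dim=512, llm_dim=4096, hidden_dim=256, num_events=3):
        super().__init__()
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
        self.llm_proj = nn.Linear(llm_dim, hidden_dim)
        self.encoder = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_events)

    def forward(self, acoustic_frames, llm_context):
        # acoustic_frames: (batch, frames, acoustic_dim) from an acoustic encoder
        # llm_context:     (batch, llm_dim) pooled LLM embedding of the dialogue so far
        a = self.acoustic_proj(acoustic_frames)                   # (B, T, H)
        l = self.llm_proj(llm_context).unsqueeze(1).expand_as(a)  # broadcast over frames
        fused, _ = self.encoder(torch.cat([a, l], dim=-1))        # (B, T, H)
        return self.classifier(fused)                             # (B, T, num_events)

# Toy usage with random tensors standing in for real encoder outputs.
model = FusionTurnTakingModel()
logits = model(torch.randn(2, 100, 512), torch.randn(2, 4096))
print(logits.shape)  # torch.Size([2, 100, 3])
```

Training such a model would presumably use a frame-level cross-entropy loss against labeled turn-taking and backchannel onsets; the paper's multi-task instruction fine-tuning of the LLM is beyond the scope of this sketch.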
Related papers
- Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection [24.71649541757314]
Short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue.
This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection model.
arXiv Detail & Related papers (2024-10-21T11:57:56Z)
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Language Model Can Listen While Speaking [17.584201137311286]
The listen-while-speaking language model (LSLM) is an end-to-end system equipped with both listening and speaking channels.
Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems.
arXiv Detail & Related papers (2024-08-05T16:47:22Z)
- Towards Understanding Counseling Conversations: Domain Knowledge and Large Language Models [22.588557390720236]
This paper proposes a systematic approach to examine the efficacy of domain knowledge and large language models (LLMs) in better representing counseling conversations.
We empirically show that state-of-the-art language models such as Transformer-based models and GPT models fail to predict the conversation outcome.
arXiv Detail & Related papers (2024-02-22T01:02:37Z)
- Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation [46.93969003104427]
This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM).
USDM is designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech.
Our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines.
arXiv Detail & Related papers (2024-02-08T14:35:09Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances, going beyond the sequential contextualization provided by pre-trained language models (PrLMs).
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Conversational speech recognition leveraging effective fusion methods for cross-utterance language modeling [12.153618111267514]
We put forward disparate conversation history fusion methods for language modeling in automatic speech recognition.
A novel audio-fusion mechanism is introduced, which manages to fuse and utilize the acoustic embeddings of a current utterance and the semantic content of its corresponding conversation history.
To make these ideas concrete, we frame the ASR N-best hypothesis rescoring task as a prediction problem, leveraging BERT, a widely used pre-trained LM (see the sketch after this entry).
arXiv Detail & Related papers (2021-11-05T09:07:23Z)
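A rough sketch of that rescoring formulation, assuming an off-the-shelf BERT checkpoint and an untrained linear scoring head (both stand-ins for illustration, not the authors' code): each N-best hypothesis is paired with the conversation history, and the highest-scoring pair is kept.

```python
# Hypothetical BERT-based N-best rescoring sketch (illustrative only).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
score_head = torch.nn.Linear(bert.config.hidden_size, 1)  # untrained stand-in;
                                                          # would be trained on labeled rescoring data

def rescore(history, hypotheses):
    """Return the hypothesis whose [CLS] embedding scores highest when
    paired with the preceding conversation history."""
    enc = tokenizer([history] * len(hypotheses), hypotheses,
                    padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = bert(**enc).last_hidden_state[:, 0]  # (N, hidden) [CLS] vectors
        scores = score_head(cls).squeeze(-1)       # (N,) one score per hypothesis
    return hypotheses[scores.argmax().item()]

print(rescore("how was the trip", ["it was great", "it was grate"]))
```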
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework (see the sketch after this entry).
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
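As a rough picture of what unifying the two models in one trainable graph could look like, here is a minimal sketch that bridges wav2vec 2.0 frame embeddings into BERT through a single linear projection; the bridge and the specific checkpoints are simplifying assumptions and omit the paper's cooperative learning details.

```python
# Simplified Wav-BERT-style coupling (illustrative assumption, not the
# paper's actual architecture).
import torch
from transformers import BertModel, Wav2Vec2Model

acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
linguistic = BertModel.from_pretrained("bert-base-uncased")
bridge = torch.nn.Linear(acoustic.config.hidden_size,
                         linguistic.config.hidden_size)

waveform = torch.randn(1, 16000)               # 1 second of 16 kHz audio
frames = acoustic(waveform).last_hidden_state  # (1, T, 768) acoustic frames
joint = linguistic(inputs_embeds=bridge(frames)).last_hidden_state
print(joint.shape)                             # e.g. torch.Size([1, 49, 768])
```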