Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction
- URL: http://arxiv.org/abs/2412.18061v1
- Date: Tue, 24 Dec 2024 00:20:38 GMT
- Title: Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction
- Authors: Hyunbae Jeon, Frederic Guintu, Rayvant Sahni
- Abstract summary: This project expands on existing strategies for turn-taking prediction by employing a multi-modal ensemble approach.
We aim to improve the accuracy and efficiency of identifying TRPs in both scripted and unscripted conversational scenarios.
- Abstract: Turn-taking prediction is the task of anticipating when the current speaker in a conversation will yield the turn to another speaker. This project expands on existing strategies for turn-taking prediction by employing a multi-modal ensemble approach that integrates large language models (LLMs) and voice activity projection (VAP) models. By combining the linguistic capabilities of LLMs with the temporal precision of VAP models, we aim to improve the accuracy and efficiency of identifying transition-relevance places (TRPs) in both scripted and unscripted conversational scenarios. Our methods are evaluated on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets, highlighting the strengths and limitations of current models while proposing a potentially more robust framework for enhanced prediction.
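The fusion idea in the abstract can be illustrated with a minimal sketch: per-frame turn-taking scores from an LLM and a VAP model are concatenated and passed through a small LSTM that emits a per-frame TRP probability. All class names, dimensions, and weights below are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FusionLSTM:
    """Hypothetical one-layer LSTM over [llm_score, vap_score] frame pairs."""

    def __init__(self, input_size=2, hidden_size=8, seed=0):
        rng = np.random.default_rng(seed)
        z = input_size + hidden_size
        # One weight matrix per gate: input, forget, output, candidate.
        self.Wi, self.Wf, self.Wo, self.Wc = (
            rng.normal(0, 0.1, (hidden_size, z)) for _ in range(4))
        self.bi = self.bf = self.bo = self.bc = np.zeros(hidden_size)
        # Linear readout from the hidden state to a scalar TRP logit.
        self.Wout = rng.normal(0, 0.1, hidden_size)
        self.hidden_size = hidden_size

    def forward(self, frames):
        """frames: (T, 2) array of [llm_score, vap_score] per frame."""
        h = np.zeros(self.hidden_size)
        c = np.zeros(self.hidden_size)
        probs = []
        for x in frames:
            z = np.concatenate([x, h])
            i = sigmoid(self.Wi @ z + self.bi)   # input gate
            f = sigmoid(self.Wf @ z + self.bf)   # forget gate
            o = sigmoid(self.Wo @ z + self.bo)   # output gate
            g = np.tanh(self.Wc @ z + self.bc)   # candidate cell state
            c = f * c + i * g
            h = o * np.tanh(c)
            probs.append(sigmoid(self.Wout @ h))  # per-frame TRP probability
        return np.array(probs)

# Toy input: five frames of (LLM score, VAP score).
scores = np.array([[0.1, 0.2], [0.3, 0.4], [0.8, 0.7], [0.9, 0.9], [0.2, 0.1]])
trp_probs = FusionLSTM().forward(scores)
print(trp_probs.shape)  # (5,)
```

In a real system the two input streams would come from time-aligned model outputs (e.g. token-level probabilities from Llama and frame-level VAP activations), and the LSTM would be trained against annotated TRP labels rather than initialized randomly.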
Related papers
- Northeastern Uni at Multilingual Counterspeech Generation: Enhancing Counter Speech Generation with LLM Alignment through Direct Preference Optimization [1.1368382184602488]
The automatic generation of counter-speech (CS) is a critical strategy for addressing hate speech by providing constructive and informed responses.
Existing methods often fail to generate high-quality, impactful, and scalable CS, particularly across diverse linguistic contexts.
We propose a novel methodology to enhance CS generation by aligning Large Language Models (LLMs) using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
arXiv Detail & Related papers (2024-12-19T23:22:11Z) - Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs).
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO).
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT-4o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z) - Beam Prediction based on Large Language Models [51.45077318268427]
We formulate the millimeter wave (mmWave) beam prediction problem as a time series forecasting task.
We transform historical observations into text-based representations using a trainable tokenizer.
Our method harnesses the power of LLMs to predict future optimal beams.
arXiv Detail & Related papers (2024-08-16T12:40:01Z) - Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment [8.91053640932991]
Quantization-aware direct preference optimization (QDPO) improves the conversational abilities of quantized large language models (LLMs).
Evaluated on two instruction-tuned LLMs in various languages, QDPO demonstrated superior performance in improving conversational abilities compared to established PTQ and knowledge-distillation fine-tuning techniques.
arXiv Detail & Related papers (2024-07-03T12:19:06Z) - Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters [21.19251212483406]
Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications.
This paper explores a training recipe for an assistant model in speculative decoding, which drafts future tokens that are then verified by the target LLM.
We show that language-specific draft models, optimized through a targeted pretrain-and-finetune strategy, substantially speed up inference compared to previous methods.
arXiv Detail & Related papers (2024-06-24T16:06:50Z) - Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion [38.78341787348164]
We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM).
Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms the baseline models with single modality.
arXiv Detail & Related papers (2024-01-26T08:59:07Z) - SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z) - Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.