Speechworthy Instruction-tuned Language Models
- URL: http://arxiv.org/abs/2409.14672v1
- Date: Mon, 23 Sep 2024 02:34:42 GMT
- Title: Speechworthy Instruction-tuned Language Models
- Authors: Hyundong Cho, Nicolaas Jedema, Leonardo F. R. Ribeiro, Karishma Sharma, Pedro Szekely, Alessandro Moschitti, Ruben Janssen, Jonathan May
- Abstract summary: We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
- Score: 71.8586707840169
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current instruction-tuned language models are exclusively trained with textual preference data and thus are often not aligned with the unique requirements of other modalities, such as speech. To better align language models with the speech domain, we explore (i) prompting strategies grounded in radio-industry best practices and (ii) preference learning using a novel speech-based preference dataset of 20K samples, generated with a wide spectrum of prompts that induce varying dimensions of speech-suitability and labeled by annotators who listen to response pairs. Both human and automatic evaluation show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs. Interestingly, we find that prompting and preference learning can be additive; combining them achieves the best win rates in head-to-head comparisons, resulting in responses that are preferred over or tied with the base model in 76.2% of comparisons on average. Lastly, we share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
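A minimal sketch of what the preference-learning step could look like, assuming a DPO-style objective over the 20K annotated (chosen, rejected) response pairs; the abstract does not specify the exact optimization recipe, so the function below is standard DPO with an illustrative `beta`, not the paper's confirmed implementation:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of (chosen,
    rejected) responses. Each argument is a tensor of per-response
    log-probabilities (summed over tokens) under the policy being
    tuned or a frozen reference model. Here, "chosen" would be the
    response annotators preferred after listening to both candidates."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # maximize the margin between preferred and dispreferred responses
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Prompting can then be layered on top simply by prepending a radio-style system instruction (e.g., asking for short, easily spoken sentences) at inference time, which is consistent with the additive effect the abstract reports.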
Related papers
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs).
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO).
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
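A sketch of the preference-data construction step described above, under the assumption that continuations are ranked by a single scalar semantic metric; `semantic_score` is a hypothetical placeholder for whatever metric Align-SLM actually uses:

```python
def build_dpo_pair(prompt, continuations, semantic_score):
    """Turn several sampled speech continuations for one prompt into a
    (chosen, rejected) pair for DPO by ranking them with a semantic
    metric (placeholder; not Align-SLM's exact procedure)."""
    ranked = sorted(continuations, key=semantic_score, reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```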
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
- Do Audio-Language Models Understand Linguistic Variations? [42.17718387132912]
Open-vocabulary audio language models (ALMs) represent a promising new paradigm for audio-text retrieval using natural language queries.
We propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations that are robust to linguistic variations.
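One plausible reading of this is a CLAP-style contrastive objective averaged over paraphrased captions of the same audio; the sketch below illustrates that idea and is not RobustCLAP's published loss:

```python
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(audio_emb, caption_embs, temp=0.07):
    """Symmetric InfoNCE between audio embeddings (B, D) and V
    paraphrase views of their captions (V, B, D), rewarding the model
    for matching audio to every linguistic variant of a caption."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    losses = []
    for view in caption_embs:                  # one paraphrase view (B, D)
        view = F.normalize(view, dim=-1)
        logits = audio_emb @ view.T / temp     # (B, B) similarity matrix
        targets = torch.arange(logits.size(0))
        losses.append((F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.T, targets)) / 2)
    return torch.stack(losses).mean()
```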
arXiv Detail & Related papers (2024-10-21T20:55:33Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
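Expected information gain is one standard instance of such a policy; the sketch below scores a candidate query by the expected entropy reduction over grammar hypotheses, assuming a discrete posterior and a boolean `accepts(hypothesis, string)` predicate (both hypothetical simplifications of the paper's setup):

```python
import math

def entropy(posterior):
    """Shannon entropy of a dict mapping hypothesis -> probability."""
    return -sum(p * math.log(p) for p in posterior.values() if p > 0)

def expected_info_gain(candidate, posterior, accepts):
    """Expected entropy reduction from asking the informant whether
    `candidate` is acceptable, marginalizing over both answers."""
    prior = entropy(posterior)
    gain = 0.0
    for answer in (True, False):
        p_answer = sum(p for h, p in posterior.items()
                       if accepts(h, candidate) == answer)
        if p_answer == 0:
            continue
        # Bayesian update of the posterior given this answer
        updated = {h: p / p_answer for h, p in posterior.items()
                   if accepts(h, candidate) == answer}
        gain += p_answer * (prior - entropy(updated))
    return gain
```

The model would then select or synthesize the candidate with the highest score, which is the sense in which querying is information-theoretic.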
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) is proposed.
The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
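As an illustration only, serializing a turn for such a framework might look like the following; the tag format and field names are hypothetical, not ParalinGPT's actual scheme:

```python
def serialize_turn(history_text, sentiment, speech_embedding):
    """Pack one dialogue turn into a serialized multitask prompt: the
    paralinguistic attribute (here, a Switchboard sentiment label) is
    rendered as a text tag, while speech embeddings are passed
    alongside the text (format is illustrative)."""
    prompt = f"[sentiment={sentiment}] {history_text}"
    return {"prompt": prompt, "speech_emb": speech_embedding}
```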
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- Device Directedness with Contextual Cues for Spoken Dialog Systems [15.96415881820669]
We define barge-in verification as a supervised learning task where audio-only information is used to classify user spoken dialogue into true and false barge-ins.
We use low-level speech representations from a self-supervised representation learning model for our downstream classification task.
We propose a novel technique to infuse lexical information directly into speech representations to improve the domain-specific language information implicitly learned during pre-training.
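A minimal sketch of the audio-only classification setup, assuming frame-level features from a self-supervised encoder such as wav2vec 2.0 are mean-pooled into a clip vector; dimensions and head architecture are illustrative:

```python
import torch
import torch.nn as nn

class BargeInVerifier(nn.Module):
    """Binary true/false barge-in classifier over frame-level
    self-supervised speech representations (hypothetical head; the
    lexical-infusion step would modify the features upstream)."""
    def __init__(self, feat_dim=768, hidden=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, hidden),
                                  nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, frames):                 # frames: (batch, T, feat_dim)
        pooled = frames.mean(dim=1)            # mean-pool over time
        return self.head(pooled).squeeze(-1)   # logits for true barge-in
```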
arXiv Detail & Related papers (2022-11-23T19:49:11Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
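The abstract gives few architectural details; the sketch below illustrates the general listener-dependent idea (condition the score on a listener embedding, then average over listeners at inference for stability) and should not be read as LDNet's actual design:

```python
import torch
import torch.nn as nn

class ListenerDependentMOS(nn.Module):
    """MOS predictor conditioned on a listener embedding (sketch)."""
    def __init__(self, feat_dim=256, n_listeners=100):
        super().__init__()
        self.listener = nn.Embedding(n_listeners, feat_dim)
        self.head = nn.Linear(feat_dim * 2, 1)

    def forward(self, speech_feat, listener_id):
        lis = self.listener(listener_id)
        return self.head(torch.cat([speech_feat, lis], dim=-1))

    def infer_mean_listener(self, speech_feat):  # speech_feat: (feat_dim,)
        # average the prediction over all known listener embeddings
        n = self.listener.num_embeddings
        feats = speech_feat.unsqueeze(0).expand(n, -1)
        return self.forward(feats, torch.arange(n)).mean()
```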
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
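A sketch of the pseudo-labeling idea, assuming a text sentiment model that returns class probabilities for an ASR transcript; the threshold and data interface are hypothetical:

```python
def pseudo_label(unlabeled_utterances, text_sentiment_model, threshold=0.9):
    """Label speech by running a pre-trained text sentiment model on
    each utterance's ASR transcript, keeping only confident
    predictions as targets for the end-to-end speech model."""
    labeled = []
    for utt in unlabeled_utterances:
        probs = text_sentiment_model(utt.transcript)   # {label: prob}
        conf, label = max((p, c) for c, p in probs.items())
        if conf >= threshold:
            labeled.append((utt.audio, label))
    return labeled
```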
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots [47.40380290055558]
A new model, named Speaker-Aware BERT (SA-BERT), is proposed to make the model aware of speaker change information.
A speaker-aware disentanglement strategy is proposed to tackle the entangled dialogues.
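One way to make a BERT-style encoder speaker-aware is to add a learned speaker(-change) embedding to the token embeddings; the sketch below shows that mechanism with hypothetical dimensions and is not necessarily SA-BERT's exact formulation:

```python
import torch.nn as nn

class SpeakerAwareEmbeddings(nn.Module):
    """Token embeddings plus a learned per-token speaker embedding,
    letting the encoder see where the speaker changes (sketch)."""
    def __init__(self, vocab_size=30522, dim=768, n_speakers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.spk = nn.Embedding(n_speakers, dim)

    def forward(self, token_ids, speaker_ids):  # both: (batch, seq_len)
        return self.tok(token_ids) + self.spk(speaker_ids)
```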
arXiv Detail & Related papers (2020-04-07T02:08:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.