What BERT Based Language Models Learn in Spoken Transcripts: An
Empirical Study
- URL: http://arxiv.org/abs/2109.09105v2
- Date: Tue, 21 Sep 2021 05:24:51 GMT
- Title: What BERT Based Language Models Learn in Spoken Transcripts: An
Empirical Study
- Authors: Ayush Kumar, Mukuntha Narayanan Sundararaman, Jithendra Vepa
- Abstract summary: Language Models (LMs) have been ubiquitously leveraged in various tasks including spoken language understanding (SLU).
In this work, we propose to dissect SLU into three representative properties: conversational (disfluency, pause, overtalk), channel (speaker-type, turn-tasks) and ASR (insertion, deletion, substitution).
We probe BERT based language models (BERT, RoBERTa) trained on spoken transcripts to investigate their ability to understand multifarious properties in the absence of any speech cues.
- Score: 6.696983725360809
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language Models (LMs) have been ubiquitously leveraged in various tasks
including spoken language understanding (SLU). Spoken language requires careful
understanding of speaker interactions, dialog states and speech induced
multimodal behaviors to generate a meaningful representation of the
conversation. In this work, we propose to dissect SLU into three representative
properties: conversational (disfluency, pause, overtalk), channel (speaker-type,
turn-tasks) and ASR (insertion, deletion, substitution). We probe BERT based
language models (BERT, RoBERTa) trained on spoken transcripts to investigate
their ability to understand multifarious properties in the absence of any
speech cues. Empirical results indicate that the LMs are surprisingly good at
capturing conversational properties such as pause prediction and overtalk
detection from lexical tokens. On the downside, the LMs score low on turn-tasks
and ASR error prediction. Additionally, pre-training the LMs on spoken
transcripts restrains their linguistic understanding. Finally, we establish the
efficacy and transferability of the aforementioned properties on two benchmark
datasets: Switchboard Dialog Act and Disfluency datasets.
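The probing setup described in the abstract can be pictured with a standard diagnostic-classifier recipe: freeze the pre-trained encoder, embed each transcript utterance, and fit a lightweight classifier to predict one property. Below is a minimal sketch in that spirit; the `bert-base-uncased` checkpoint, the toy pause labels, and the logistic-regression probe are illustrative assumptions, not the authors' exact setup.
```python
# Minimal probing sketch (illustrative, not the paper's released code):
# freeze a pre-trained BERT encoder, embed utterances from spoken
# transcripts, and train a linear probe for one property (here, a
# hypothetical binary "followed by a long pause" label).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()  # the encoder stays frozen; only the probe is trained

def embed(utterances):
    """Return one [CLS] vector per utterance from the final encoder layer."""
    with torch.no_grad():
        batch = tokenizer(utterances, padding=True, truncation=True,
                          return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state  # (batch, tokens, 768)
    return hidden[:, 0, :].numpy()  # [CLS] embedding per utterance

# Hypothetical labeled transcripts: 1 = utterance followed by a long pause.
train_texts = ["um so I was thinking", "yes exactly", "well I I don't know"]
train_labels = [1, 0, 1]

probe = LogisticRegression(max_iter=1000)
probe.fit(embed(train_texts), train_labels)

test_texts = ["right okay", "hmm let me think about that"]
print(probe.predict(embed(test_texts)))  # probe's pause predictions
```
Repeating this recipe per property (disfluency, overtalk, turn-tasks, ASR errors) and per encoder layer gives a layer-wise picture of what the lexical representations have captured without any speech cues.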
Related papers
- ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody.
We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
arXiv Detail & Related papers (2025-07-27T00:59:01Z)
- Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z)
- Recent Advances in Speech Language Models: A Survey [45.968078636811356]
Speech Language Models (SpeechLMs) are end-to-end models that generate speech without converting from text.
This paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs.
arXiv Detail & Related papers (2024-10-01T21:48:12Z)
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models reach higher performance over baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation [15.225080891662675]
Speech comprehension can benefit from inference with massive pre-trained language models.
We experimentally verify our hypothesis that the knowledge could be shared from the top layer of the LM to a fully speech-based module.
arXiv Detail & Related papers (2020-05-17T10:50:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.