Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation
- URL: http://arxiv.org/abs/2005.08213v2
- Date: Sat, 8 Aug 2020 07:43:49 GMT
- Title: Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation
- Authors: Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim
- Abstract summary: Speech comprehension can benefit from inference of massive pre-trained language models.
We experimentally verify our hypothesis that the knowledge could be shared from the top layer of the LM to a fully speech-based module.
- Score: 15.225080891662675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech is one of the most effective means of communication and is full of
information that helps convey the speaker's thoughts. However, mainly
due to the cumbersome processing of acoustic features, phoneme or word
posterior probabilities have frequently been discarded in understanding the
natural language. Thus, some recent spoken language understanding (SLU) modules
have utilized end-to-end structures that preserve this uncertainty information.
This further reduces the propagation of speech recognition errors and improves
computational efficiency. We claim that in this process, speech
comprehension can benefit from the inference of massive pre-trained language
models (LMs). We transfer the knowledge from a concrete Transformer-based text
LM to an SLU module that may face a data shortage, based on recent cross-modal
distillation methodologies. We demonstrate the validity of our proposal through
performance on Fluent Speech Commands, an English SLU benchmark. Thereby, we
experimentally verify our hypothesis that knowledge can be shared from
the top layer of the LM to a fully speech-based module, in which the abstracted
speech is expected to meet the semantic representation.
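The cross-modal distillation described above can be sketched as a training objective that combines the SLU task loss with a term pulling the student's pooled speech representation toward the teacher LM's top-layer utterance embedding. This is a minimal illustration with hypothetical names and shapes (`mse`, `distillation_loss`, a fixed `alpha` weight); the paper's actual loss formulation and architecture may differ.

```python
def mse(a, b):
    """Mean squared error between two equal-length embedding vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_repr, teacher_repr, task_loss, alpha=0.5):
    """Combine the SLU task loss with a representation-matching term.

    student_repr: pooled utterance embedding from the speech-based module.
    teacher_repr: top-layer utterance embedding from the frozen text LM.
    alpha: hypothetical weight on the distillation term.
    """
    return (1 - alpha) * task_loss + alpha * mse(student_repr, teacher_repr)

# Toy example: the student's embedding is nudged toward the teacher's.
teacher = [0.2, -0.1, 0.4]
student = [0.0, 0.0, 0.5]
loss = distillation_loss(student, teacher, task_loss=1.2, alpha=0.5)
```

In practice both representations would come from neural encoders and the task loss from intent classification on Fluent Speech Commands; the key idea illustrated here is that the teacher contributes only a target representation, so no text input is needed at inference time.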
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs)
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target [58.59044226658916]
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances.
We propose to use discrete units as intermediate guidance to improve textless SLU performance.
arXiv Detail & Related papers (2023-05-29T14:00:24Z)
- The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models reach higher performance over baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z)
- Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Improving Spoken Language Understanding By Exploiting ASR N-best Hypotheses [22.332683746361294]
The natural language understanding (NLU) module takes interpretations of speech from the automatic speech recognition (ASR) module as input.
The ASR module might misrecognize some speech, so the first-best interpretation could be erroneous and noisy.
We introduce a series of simple yet efficient models for improving the understanding of the semantics of the input speech.
arXiv Detail & Related papers (2020-01-11T05:48:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.