Multiple Representation Transfer from Large Language Models to
End-to-End ASR Systems
- URL: http://arxiv.org/abs/2309.04031v2
- Date: Mon, 25 Dec 2023 07:28:07 GMT
- Title: Multiple Representation Transfer from Large Language Models to
End-to-End ASR Systems
- Authors: Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Masayasu Muraoka,
George Saon
- Abstract summary: Transferring the knowledge of large language models (LLMs) is a promising technique to incorporate linguistic knowledge into end-to-end automatic speech recognition (ASR) systems.
We show that transferring multiple representations of LLMs can be an effective alternative to transferring only a single representation.
- Score: 27.39706173574983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transferring the knowledge of large language models (LLMs) is a promising
technique to incorporate linguistic knowledge into end-to-end automatic speech
recognition (ASR) systems. However, existing works only transfer a single
representation of the LLM (e.g. the last layer of pretrained BERT), while the
representation of a text is inherently non-unique and can be obtained in various ways
from different layers, contexts and models. In this work, we explore a wide
range of techniques to obtain and transfer multiple representations of LLMs
into a transducer-based ASR system. We show that, while conceptually simple,
transferring multiple representations of LLMs can be an effective
alternative to transferring only a single representation.
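The abstract stays at the level of the idea; below is a minimal, hypothetical PyTorch sketch of one way several LLM representations could be transferred, using hidden states from a few BERT layers as targets for an auxiliary loss on projected ASR encoder states. The layer choice, the crude length alignment, and the MSE loss are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch: distill several BERT layers into ASR encoder states.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

layers_to_transfer = [4, 8, 12]                     # assumed choice of layers
asr_dim, bert_dim = 512, bert.config.hidden_size
projections = nn.ModuleList([nn.Linear(asr_dim, bert_dim) for _ in layers_to_transfer])

def multi_representation_loss(asr_states, transcript):
    """asr_states: (tokens, asr_dim) encoder states, assumed aligned to subwords."""
    enc = tokenizer(transcript, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).hidden_states          # tuple: embeddings + 12 layers
    loss = 0.0
    for proj, layer in zip(projections, layers_to_transfer):
        target = hidden[layer][0]                   # (tokens, bert_dim)
        n = min(target.size(0), asr_states.size(0)) # crude length alignment (assumption)
        loss = loss + nn.functional.mse_loss(proj(asr_states[:n]), target[:n])
    return loss / len(layers_to_transfer)

# toy usage with random stand-in encoder states
print(multi_representation_loss(torch.randn(6, asr_dim), "hello world this is a test"))
```

During training, such an auxiliary loss would simply be added to the transducer loss with some weight.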
Related papers
- Augmenting Multi-Agent Communication with State Delta Trajectory [31.127137626348098]
We propose a new communication protocol that transfers both natural language tokens and a token-wise state transition trajectory from one agent to another. We find that the sequence of state changes in LLMs after generating each token can better reflect the information hidden behind the inference process. Experimental results show that multi-agent systems with SDE achieve state-of-the-art performance compared to other communication protocols.
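As an illustration of the idea rather than the paper's exact protocol, the sketch below greedily decodes with a small causal LM and records the difference between successive last-layer hidden states as a token-wise delta trajectory; gpt2 and greedy decoding are stand-in assumptions.

```python
# Hypothetical sketch: generate text and record token-wise hidden-state deltas
# that could be sent to another agent alongside the tokens.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"                                       # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True).eval()

def generate_with_state_deltas(prompt, max_new_tokens=5):
    ids = tok(prompt, return_tensors="pt").input_ids
    deltas, prev = [], None
    for _ in range(max_new_tokens):
        with torch.no_grad():
            out = model(ids)
        state = out.hidden_states[-1][0, -1]        # last-layer state at the last position
        if prev is not None:
            deltas.append(state - prev)             # token-wise state transition
        prev = state
        next_id = out.logits[0, -1].argmax()
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    return tok.decode(ids[0]), deltas               # message = text + delta trajectory

text, deltas = generate_with_state_deltas("The capital of France is")
print(text, len(deltas))
```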
arXiv Detail & Related papers (2025-06-24T00:38:25Z) - Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens [66.02261367232256]
Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation.
Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order.
In this paper, we build a proper visual language by reconstructing diffusion timesteps to learn discrete visual tokens.
arXiv Detail & Related papers (2025-04-20T16:14:28Z) - MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems [8.971049629873185]
MTLM is a novel training paradigm that unifies unidirectional and bidirectional language modeling through three training objectives. It supports multiple decoding strategies, including shallow fusion and unidirectional/bidirectional n-best rescoring. Experiments on the LibriSpeech dataset show that MTLM consistently outperforms unidirectional training across multiple decoding strategies.
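Shallow fusion itself is standard: at each decoding step the ASR score is interpolated with the LM score. The toy sketch below shows only that fusion rule, with an assumed interpolation weight; it is not specific to MTLM.

```python
# Toy shallow-fusion step: interpolate ASR and LM log-probabilities for the next subword.
import torch

def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Both arguments: (vocab,) log-probabilities over a shared subword vocabulary."""
    fused = asr_log_probs + lm_weight * lm_log_probs
    return int(fused.argmax())

# toy usage with a 5-token vocabulary
asr = torch.log_softmax(torch.tensor([1.2, 1.0, 0.2, 0.1, 0.1]), dim=-1)
lm  = torch.log_softmax(torch.tensor([0.1, 2.5, 0.2, 0.1, 0.1]), dim=-1)
print(shallow_fusion_step(asr, lm))   # the LM flips the choice from token 0 to token 1
```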
arXiv Detail & Related papers (2025-02-14T10:21:10Z) - Speech Recognition Rescoring with Large Speech-Text Foundation Models [20.145389016219106]
Large language models (LLMs) have demonstrated the ability to understand human language by leveraging large amounts of text data.
Automatic speech recognition (ASR) systems are often limited by available transcribed speech data.
Recent multi-modal large language models have demonstrated strong spoken language understanding.
arXiv Detail & Related papers (2024-09-25T06:17:23Z) - Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes WavLM and the Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
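The abstract names the two encoders but not the fusion. The sketch below is one hypothetical arrangement: WavLM and Whisper encoder features are length-aligned, concatenated, and projected into an assumed LLM embedding size; the checkpoints, the interpolation step, and the projection are illustrative assumptions rather than the MT-LLM design.

```python
# Hypothetical sketch: fuse WavLM and Whisper encoder features into an LLM-sized prompt.
import torch
import torch.nn as nn
from transformers import (AutoFeatureExtractor, WavLMModel,
                          WhisperFeatureExtractor, WhisperModel)

wavlm_fe = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base").eval()
whisper_fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
whisper = WhisperModel.from_pretrained("openai/whisper-tiny").eval()

llm_dim = 2048                                      # assumed LLM embedding size
proj = nn.Linear(wavlm.config.hidden_size + whisper.config.d_model, llm_dim)

def speech_prompt(waveform_16k):
    """Return a (frames, llm_dim) sequence to prepend to the LLM's text embeddings."""
    with torch.no_grad():
        w_in = wavlm_fe(waveform_16k.numpy(), sampling_rate=16000, return_tensors="pt")
        w_feat = wavlm(**w_in).last_hidden_state                         # (1, Tw, 768)
        s_in = whisper_fe(waveform_16k.numpy(), sampling_rate=16000, return_tensors="pt")
        s_feat = whisper.encoder(s_in.input_features).last_hidden_state  # (1, 1500, d)
    # align Whisper frames to the WavLM frame count (a simplifying assumption)
    s_feat = nn.functional.interpolate(s_feat.transpose(1, 2), size=w_feat.size(1),
                                       mode="linear").transpose(1, 2)
    return proj(torch.cat([w_feat, s_feat], dim=-1))[0]

print(speech_prompt(torch.randn(16000)).shape)      # one second of toy audio
```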
arXiv Detail & Related papers (2024-09-13T07:28:28Z) - Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing [17.92378239787507]
We present a decoder-only Discrete Multimodal Language Model (DMLM).
DMLM can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision).
Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training.
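One common way a decoder-only model can host several modalities is a shared discrete vocabulary. The toy sketch below offsets assumed speech-unit ids and text subword ids into disjoint ranges and concatenates them with a task tag; the vocabulary sizes and task tokens are invented for illustration and are not the DMLM specifics.

```python
# Toy shared vocabulary for a decoder-only multimodal LM (illustrative numbers).
TEXT_VOCAB = 32_000        # assumed subword vocabulary size
SPEECH_UNITS = 1_000       # assumed number of discrete acoustic units
TASK_TOKENS = {"<asr>": TEXT_VOCAB + SPEECH_UNITS,
               "<t2s>": TEXT_VOCAB + SPEECH_UNITS + 1}

def build_asr_sequence(speech_units, text_ids):
    """Sequence a causal LM would be trained on for ASR: speech units, task tag, transcript."""
    speech_part = [TEXT_VOCAB + u for u in speech_units]   # offset into the speech range
    return speech_part + [TASK_TOKENS["<asr>"]] + list(text_ids)

print(build_asr_sequence(speech_units=[12, 507, 998], text_ids=[101, 2054, 102]))
```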
arXiv Detail & Related papers (2024-06-04T20:08:25Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - ModaVerse: Efficiently Transforming Modalities with LLMs [25.49713745405194]
We introduce ModaVerse, a Multi-modal Large Language Model capable of comprehending and transforming content across various modalities.
We propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language.
arXiv Detail & Related papers (2024-01-12T06:28:54Z) - Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue.
By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
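The core trick, per the abstract, is to exchange embeddings rather than committed tokens. The sketch below computes one such "soft token" as the probability-weighted average of input embeddings under the next-token distribution, using gpt2 as a stand-in; real debating agents would need a shared tokenizer and embedding space.

```python
# Sketch of embedding-based communication: pass the expected token embedding,
# not a sampled token (gpt2 is a stand-in model for illustration).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()
emb = model.get_input_embeddings().weight            # (vocab, dim)

def cipher_message(prompt):
    """Return one 'soft token': the expected embedding under the next-token distribution."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = model(ids).logits[0, -1].softmax(dim=-1)    # (vocab,)
    return probs @ emb                                       # (dim,) weighted embedding

soft = cipher_message("The answer to 2 + 2 is")
print(soft.shape)   # a dense vector another LLM could consume via inputs_embeds
```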
arXiv Detail & Related papers (2023-10-10T03:06:38Z) - Making LLaMA SEE and Draw with SEED Tokenizer [69.1083058794092]
We introduce SEED, an elaborate image tokenizer that empowers Large Language Models with the ability to SEE and Draw.
With SEED tokens, the LLM is able to perform scalable multimodal autoregression under its original training recipe.
SEED-LLaMA has exhibited compositional emergent abilities such as multi-turn in-context multimodal generation.
arXiv Detail & Related papers (2023-10-02T14:03:02Z) - SLM: Bridge the thin gap between speech and text foundation models [45.319071954143325]
Speech and Language Model (SLM) is a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models.
We show that SLM is not only efficient to train, but also inherits strong capabilities already acquired in foundation models of different modalities.
arXiv Detail & Related papers (2023-09-30T02:27:45Z) - Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular
Vision-Language Pre-training [120.91411454661741]
We present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception and generation.
Uni-EDEN is a two-stream Transformer-based structure, consisting of three modules: object and sentence encoders that separately learn the representations of each modality.
arXiv Detail & Related papers (2022-01-11T16:15:07Z) - Universal Sentence Representation Learning with Conditional Masked
Language Model [7.334766841801749]
We present Conditional Masked Language Modeling (CMLM) to effectively learn sentence representations.
Our English CMLM model achieves state-of-the-art performance on SentEval.
As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains.
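As a rough illustration of conditioning a masked LM on neighbouring text (not the authors' exact architecture), the sketch below mean-pools the context's token embeddings, projects them, and adds the result to every input embedding of the masked sentence; the pooling, the projection, and the loss handling are assumptions.

```python
# Hypothetical conditional masked LM step: predict masked tokens of a sentence
# while conditioned on a vector summarizing its surrounding context.
import torch.nn as nn
from transformers import AutoTokenizer, BertForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
ctx_proj = nn.Linear(mlm.config.hidden_size, mlm.config.hidden_size)  # assumed conditioning

def cmlm_loss(context, masked_sentence, target_sentence):
    # encode the surrounding context into one vector (mean-pooled embeddings here)
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ctx_vec = mlm.bert.embeddings.word_embeddings(ctx_ids).mean(dim=1)   # (1, hidden)
    # inject the projected context vector into every input token embedding
    ids = tok(masked_sentence, return_tensors="pt").input_ids
    labels = tok(target_sentence, return_tensors="pt").input_ids
    inputs_embeds = mlm.bert.embeddings.word_embeddings(ids) + ctx_proj(ctx_vec).unsqueeze(1)
    # a real setup would set non-[MASK] label positions to -100 so they are ignored
    return mlm(inputs_embeds=inputs_embeds, labels=labels).loss

print(cmlm_loss("I could not unlock the door.",
                "I forgot my [MASK] at home.",
                "I forgot my keys at home."))
```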
arXiv Detail & Related papers (2020-12-28T18:06:37Z) - How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)