Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR
- URL: http://arxiv.org/abs/2409.02239v2
- Date: Thu, 5 Sep 2024 11:34:00 GMT
- Title: Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR
- Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
- Abstract summary: Transferring linguistic knowledge from a pretrained language model to an acoustic model has been shown to greatly improve the performance of automatic speech recognition.
We propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (CAKT) method (TOT-CAKT) for ASR.
- Score: 36.250914527327005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets in alignment and neglects temporal order information during OT coupling estimation. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (CAKT) method (TOT-CAKT) for ASR. In the TOT-CAKT, local neighboring frames of acoustic sequences are smoothly mapped to neighboring regions of linguistic sequences, preserving their temporal order relationship in feature alignment and matching. With the TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR.
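The core mechanism lends itself to a short illustration. Below is a minimal sketch, assuming entropy-regularized (Sinkhorn) OT with a cosine feature cost, a quadratic prior on normalized sequence positions that discourages order-violating couplings, and a barycentric-projection transfer loss; the names `sinkhorn`, `tot_coupling`, `lam`, and `reg` are hypothetical, and the paper's exact cost and training objective may differ.

```python
# Minimal sketch (not the authors' code): temporal-order-preserved OT between
# an acoustic frame sequence and a linguistic token-embedding sequence.
# Assumptions: cosine feature cost, a quadratic position prior weighted by
# `lam`, entropic regularization `reg`, and a barycentric transfer loss.
import numpy as np

def sinkhorn(C, reg=0.05, n_iters=200):
    """Entropy-regularized OT with uniform marginals; returns the coupling."""
    T, L = C.shape
    a, b = np.full(T, 1.0 / T), np.full(L, 1.0 / L)
    K = np.exp(-C / reg)                      # Gibbs kernel
    u = np.ones(T)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]        # coupling gamma, shape (T, L)

def tot_coupling(acoustic, linguistic, lam=1.0):
    """Feature cost plus a temporal-order prior on normalized positions."""
    T, L = len(acoustic), len(linguistic)
    A = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
    B = linguistic / np.linalg.norm(linguistic, axis=1, keepdims=True)
    feat_cost = 1.0 - A @ B.T                 # cosine distance, (T, L)
    pos_a = np.arange(T)[:, None] / max(T - 1, 1)
    pos_l = np.arange(L)[None, :] / max(L - 1, 1)
    order_cost = (pos_a - pos_l) ** 2         # penalizes order-violating couplings
    return sinkhorn(feat_cost + lam * order_cost)

def transfer_loss(acoustic, linguistic, gamma):
    """Pull each frame toward its barycentric projection of token embeddings."""
    proj = (gamma @ linguistic) / gamma.sum(axis=1, keepdims=True)
    return np.mean(np.sum((acoustic - proj) ** 2, axis=1))

# toy usage: 50 acoustic encoder frames, 12 PLM token embeddings, dim 256
rng = np.random.default_rng(0)
acoustic = rng.standard_normal((50, 256))
linguistic = rng.standard_normal((12, 256))
gamma = tot_coupling(acoustic, linguistic)
print(gamma.shape, transfer_loss(acoustic, linguistic, gamma))
```

Because the position prior is folded directly into the transport cost, neighboring acoustic frames are steered toward neighboring linguistic tokens without a separate alignment pretraining stage, which is the weakness of plain OT that the abstract highlights.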
Related papers
- LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization [8.365515332927444]
Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. We propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. We show that LM-SPT achieves superior reconstruction fidelity compared to baselines.
arXiv Detail & Related papers (2025-06-20T04:15:14Z) - CC-Tuning: A Cross-Lingual Connection Mechanism for Improving Joint Multilingual Supervised Fine-Tuning [48.69343479132896]
CC-Tuning is a novel multilingual fine-tuning paradigm that explicitly establishes a cross-lingual connection mechanism at the latent level. During training, CC-Tuning fuses the feed-forward activations from both English and non-English inputs, enabling the model to benefit from both linguistic resources. Experiments on six benchmarks covering 22 languages show that CC-Tuning outperforms vanilla SFT and offers a strong latent-level alternative to data-level augmentation methods.
arXiv Detail & Related papers (2025-06-01T07:20:55Z) - Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR [36.250914527327005]
Transferring linguistic knowledge from a pretrained language model to acoustic feature learning has proven effective in enhancing end-to-end automatic speech recognition (E2E-ASR). However, aligning representations between linguistic and acoustic modalities remains a challenge due to inherent modality gaps. We propose Graph Matching Optimal Transport (GM-OT), which models linguistic and acoustic sequences as structured graphs; a minimal fused Gromov-Wasserstein sketch of this idea appears after this list.
arXiv Detail & Related papers (2025-05-19T13:13:18Z) - Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network [17.91342898415867]
Existing ATFL methods rely on training efficient networks using fine-grained annotations. We propose a progressive audio-language co-learning network (LOCO) that adopts co-learning and self-supervision to improve localization performance. The proposed LOCO achieves SOTA performance on three public benchmarks.
arXiv Detail & Related papers (2025-05-03T17:57:57Z) - GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture [12.303324248639266]
We propose GOAT-TTS, a text-to-speech generation approach optimized via a novel dual-branch architecture.
GOAT-TTS combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency.
Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models.
arXiv Detail & Related papers (2025-04-15T01:44:56Z) - LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition [43.19328760778868]
Test-Time Adaptation (TTA) has emerged as a crucial solution to the domain shift challenge.
We propose Language Informed Test-Time Adaptation (LI-TTA), which incorporates linguistic insights during TTA for ASR.
arXiv Detail & Related papers (2024-08-11T13:19:27Z) - Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z) - Curriculum Learning for Goal-Oriented Semantic Communications with a Common Language [60.85719227557608]
A holistic goal-oriented semantic communication framework is proposed to enable a speaker and a listener to cooperatively execute a set of sequential tasks.
A common language based on a hierarchical belief set is proposed to enable semantic communications between speaker and listener.
An optimization problem is defined to determine the perfect and abstract description of the events.
arXiv Detail & Related papers (2022-04-21T22:36:06Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - A Novel Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning [12.913738983870621]
We present a canonical correlation based short-time objective intelligibility (CC-STOI) cost function to train a fully convolutional neural network (FCN) model.
We show that our CC-STOI based speech enhancement framework outperforms state-of-the-art DL models trained with conventional distance-based and STOI-based loss functions.
arXiv Detail & Related papers (2022-02-11T16:48:41Z) - Common Language for Goal-Oriented Semantic Communications: A Curriculum Learning Framework [66.81698651016444]
A comprehensive semantic communications framework is proposed for enabling goal-oriented task execution.
A novel top-down framework that combines curriculum learning (CL) and reinforcement learning (RL) is proposed to solve this problem.
Simulation results show that the proposed CL method outperforms traditional RL in terms of convergence time, task execution time, and transmission cost during training.
arXiv Detail & Related papers (2021-11-15T19:13:55Z) - Conversational speech recognition leveraging effective fusion methods for cross-utterance language modeling [12.153618111267514]
We put forward several distinct conversation-history fusion methods for language modeling in automatic speech recognition.
A novel audio-fusion mechanism is introduced, which manages to fuse and utilize the acoustic embeddings of a current utterance and the semantic content of its corresponding conversation history.
To flesh out our ideas, we frame the ASR N-best hypothesis rescoring task as a prediction problem, leveraging BERT, an iconic pre-trained LM.
arXiv Detail & Related papers (2021-11-05T09:07:23Z) - Speech recognition for air traffic control via feature learning and end-to-end training [8.755785876395363]
We propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems.
The proposed model integrates the feature learning block, recurrent neural network (RNN), and connectionist temporal classification loss.
Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner.
arXiv Detail & Related papers (2021-11-04T06:38:21Z) - Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z) - Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)
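As referenced in the GM-OT entry above, here is a minimal fused Gromov-Wasserstein sketch using the POT library. The Euclidean intra-sequence distance matrices standing in for graph structure and the alpha trade-off are illustrative assumptions, not the paper's exact graph construction.

```python
# Minimal sketch (not the GM-OT paper's code): fused Gromov-Wasserstein
# alignment of acoustic and linguistic sequences viewed as graphs, via the
# POT library. Euclidean intra-sequence distances as graph structure and
# alpha=0.5 are illustrative assumptions.
import numpy as np
import ot  # pip install pot

rng = np.random.default_rng(0)
acoustic = rng.standard_normal((50, 64))      # nodes of the acoustic graph
linguistic = rng.standard_normal((12, 64))    # nodes of the linguistic graph

M = ot.dist(acoustic, linguistic, metric='euclidean')     # cross-modal node cost
C1 = ot.dist(acoustic, acoustic, metric='euclidean')      # acoustic graph structure
C2 = ot.dist(linguistic, linguistic, metric='euclidean')  # linguistic graph structure

p, q = ot.unif(len(acoustic)), ot.unif(len(linguistic))
# alpha trades off the node-feature (Wasserstein) term against the
# edge-structure (Gromov-Wasserstein) term
coupling = ot.gromov.fused_gromov_wasserstein(
    M, C1, C2, p, q, loss_fun='square_loss', alpha=0.5)
print(coupling.shape)  # (50, 12)
```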