A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport
- URL: http://arxiv.org/abs/2502.01588v1
- Date: Mon, 03 Feb 2025 18:20:29 GMT
- Title: A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport
- Authors: Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi
- Abstract summary: We propose a novel differentiable alignment framework based on one-dimensional optimal transport.
We show that our method considerably improves alignment performance, though with a trade-off in ASR performance when compared to CTC.
- Score: 12.835774667953187
- Abstract: Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance, though with a trade-off in ASR performance when compared to CTC. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community.
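As a rough illustration of why one-dimensional optimal transport lends itself to alignment, the sketch below computes the monotone (north-west corner) coupling between two histograms on ordered supports. For transport costs that are convex in the distance between positions, this coupling is the optimal plan in one dimension, and because it never crosses, it can be read as a single frame-to-label alignment. This is a generic sketch, not the paper's SOTD or OTTC formulation: the function name `monotone_coupling` and the choice of uniform frame and label weights are assumptions made here for illustration.

```python
def monotone_coupling(a, b, tol=1e-12):
    """Monotone (north-west corner) coupling between two histograms
    `a` and `b` defined on ordered 1-D supports.

    Returns a list of (i, j, mass) entries of the transport plan.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if abs(sum(a) - sum(b)) > 1e-9:
        raise ValueError("a and b must carry the same total mass")
    a, b = list(a), list(b)  # work on copies; remaining mass is updated in place
    i = j = 0
    plan = []
    while i < len(a) and j < len(b):
        m = min(a[i], b[j])  # move as much mass as both bins allow
        if m > tol:
            plan.append((i, j, m))
        a[i] -= m
        b[j] -= m
        if a[i] <= tol:  # source bin exhausted: go to the next frame
            i += 1
        if b[j] <= tol:  # target bin exhausted: go to the next label
            j += 1
    return plan


# Assumed toy setup: 4 frames with uniform weight, 2 labels with uniform weight.
frames = [0.25, 0.25, 0.25, 0.25]
labels = [0.5, 0.5]
print(monotone_coupling(frames, labels))
# [(0, 0, 0.25), (1, 0, 0.25), (2, 1, 0.25), (3, 1, 0.25)]
```

In this toy case frames 0 and 1 map to label 0 and frames 2 and 3 map to label 1. Changing the weights shifts the label boundaries while the plan stays monotone.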
Related papers
- Preference Alignment Improves Language Model-Based TTS [76.70693823683091]
Preference alignment algorithms adjust LMs to align with the preferences of reward models, enhancing the desirability of the generated content.
With a 1.15B parameter LM-based TTS model, we demonstrate that preference alignment consistently improves intelligibility, speaker similarity, and proxy subjective evaluation scores.
arXiv Detail & Related papers (2024-09-19T01:58:19Z) - An End-to-End Reinforcement Learning Based Approach for Micro-View Order-Dispatching in Ride-Hailing [8.892147201091726]
We propose an end-to-end reinforcement learning based order-dispatching approach in Didi.
We employ a two-layer Decision Process framework to model this problem, and present Deep Double Scalable Network (DSN2), an encoder-decoder structure network to generate order assignments.
By leveraging contextual dynamics, our approach can adapt to the behavioral patterns for better performance.
arXiv Detail & Related papers (2024-08-20T01:30:53Z) - Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z) - Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z) - Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework [8.228892600588765]
Connectionist Temporal Classification (CTC) is a widely used criterion for training sequence-to-sequence (seq2seq) models.
We propose Align With Purpose, a general Plug-and-Play framework for enhancing a desired property in models trained with the CTC criterion.
We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset.
arXiv Detail & Related papers (2023-07-04T13:34:47Z) - A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition [26.79184118279807]
We present a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR.
Word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs.
We find that CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a 24x inference speedup.
arXiv Detail & Related papers (2023-04-15T18:34:29Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with CTC-like lattice achieves better results compared to standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z) - Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions [14.376418789524783]
We train a CTC-based ASR model with auxiliary CTC losses in intermediate layers in addition to the original CTC loss in the last layer.
Our method is easy to implement and retains the merits of CTC-based ASR: a simple model architecture and fast decoding speed.
arXiv Detail & Related papers (2021-04-06T18:00:03Z) - Exploring Dynamic Context for Multi-path Trajectory Prediction [33.66335553588001]
We propose a novel framework, named Dynamic Context Network (DCENet).
In our framework, the spatial context between agents is explored by using self-attention architectures.
A set of future trajectories for each agent is predicted conditioned on the learned spatial-temporal context.
arXiv Detail & Related papers (2020-10-30T13:39:20Z) - Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC-based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)