Align With Purpose: Optimize Desired Properties in CTC Models with a
General Plug-and-Play Framework
- URL: http://arxiv.org/abs/2307.01715v3
- Date: Thu, 7 Mar 2024 17:59:25 GMT
- Title: Align With Purpose: Optimize Desired Properties in CTC Models with a
General Plug-and-Play Framework
- Authors: Eliya Segev, Maya Alroy, Ronen Katsir, Noam Wies, Ayana Shenhav, Yael
Ben-Oren, David Zar, Oren Tadmor, Jacob Bitterman, Amnon Shashua and Tal
Rosenwein
- Abstract summary: Connectionist Temporal Classification ( CTC) is a widely used criterion for training sequence-to-sequence (seq2seq) models.
We propose $textitAlign With Purpose, a $textbfgeneral Plug-and-Play framework for enhancing a desired property in models trained with the CTC criterion.
We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset.
- Score: 8.228892600588765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Connectionist Temporal Classification (CTC) is a widely used criterion for
training supervised sequence-to-sequence (seq2seq) models. It enables learning
the relations between input and output sequences, termed alignments, by
marginalizing over perfect alignments (that yield the ground truth), at the
expense of imperfect alignments. This binary differentiation of perfect and
imperfect alignments falls short of capturing other essential alignment
properties that hold significance in other real-world applications. Here we
propose $\textit{Align With Purpose}$, a $\textbf{general Plug-and-Play
framework}$ for enhancing a desired property in models trained with the CTC
criterion. We do that by complementing the CTC with an additional loss term
that prioritizes alignments according to a desired property. Our method does
not require any intervention in the CTC loss function, enables easy
optimization of a variety of properties, and allows differentiation between
both perfect and imperfect alignments. We apply our framework in the domain of
Automatic Speech Recognition (ASR) and show its generality in terms of property
selection, architectural choice, and scale of training dataset (up to 280,000
hours). To demonstrate the effectiveness of our framework, we apply it to two
unrelated properties: emission time and word error rate (WER). For the former,
we report an improvement of up to 570ms in latency optimization with a minor
reduction in WER, and for the latter, we report a relative improvement of 4.5%
WER over the baseline models. To the best of our knowledge, these applications
have never been demonstrated to work on a scale of data as large as ours.
Notably, our method can be implemented using only a few lines of code, and can
be extended to other alignment-free loss functions and to domains other than
ASR.
Related papers
- Physically Feasible Semantic Segmentation [58.17907376475596]
State-of-the-art semantic segmentation models are typically optimized in a data-driven fashion.
Our method, Physically Feasible Semantic (PhyFea), extracts explicit physical constraints that govern spatial class relations.
PhyFea yields significant performance improvements in mIoU over each state-of-the-art network we use.
arXiv Detail & Related papers (2024-08-26T22:39:08Z) - Revisiting Cascaded Ensembles for Efficient Inference [32.914852531806]
A common approach to make machine learning inference more efficient is to use example-specific adaptive schemes.
In this work we study a simple scheme for adaptive inference.
We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models.
arXiv Detail & Related papers (2024-07-02T15:14:12Z) - Indirectly Parameterized Concrete Autoencoders [40.35109085799772]
Recent developments in neural network-based embedded feature selection show promising results across a wide range of applications.
Recent developments in neural network-based embedded feature selection show promising results across a wide range of applications.
arXiv Detail & Related papers (2024-03-01T14:41:51Z) - Domain Aligned CLIP for Few-shot Classification [3.5326413171911555]
Domain Aligned CLIP (DAC) improves both intra-modal (image-image) and inter-modal alignment on target distributions without fine-tuning the main model.
We study the effectiveness of DAC by benchmarking on 11 widely used image classification tasks with consistent improvements in 16-shot classification upon strong baselines by about 2.3%.
arXiv Detail & Related papers (2023-11-15T18:34:26Z) - Adaptive Neural Ranking Framework: Toward Maximized Business Goal for
Cascade Ranking Systems [33.46891569350896]
Cascade ranking is widely used for large-scale top-k selection problems in online advertising and recommendation systems.
Previous works on learning-to-rank usually focus on letting the model learn the complete order or top-k order.
We name this method as Adaptive Neural Ranking Framework (abbreviated as ARF)
arXiv Detail & Related papers (2023-10-16T14:43:02Z) - Self-distillation Regularized Connectionist Temporal Classification Loss
for Text Recognition: A Simple Yet Effective Approach [14.69981874614434]
We show how to better optimize a text recognition model from the perspective of loss functions.
CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with degradation accuracy.
We propose a self-distillation scheme for CTC-based model to address this issue.
arXiv Detail & Related papers (2023-08-17T06:32:57Z) - Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time
Mobile Acceleration [71.80326738527734]
We propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations.
We show that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework.
arXiv Detail & Related papers (2021-11-22T23:53:14Z) - Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with CTC-like lattice achieves better results compared to standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z) - Higher Performance Visual Tracking with Dual-Modal Localization [106.91097443275035]
Visual Object Tracking (VOT) has synchronous needs for both robustness and accuracy.
We propose a dual-modal framework for target localization, consisting of robust localization suppressingors via ONR and the accurate localization attending to the target center precisely via OFC.
arXiv Detail & Related papers (2021-03-18T08:47:56Z) - Robust Optimal Transport with Applications in Generative Modeling and
Domain Adaptation [120.69747175899421]
Optimal Transport (OT) distances such as Wasserstein have been used in several areas such as GANs and domain adaptation.
We propose a computationally-efficient dual form of the robust OT optimization that is amenable to modern deep learning applications.
Our approach can train state-of-the-art GAN models on noisy datasets corrupted with outlier distributions.
arXiv Detail & Related papers (2020-10-12T17:13:40Z) - Boosting Continuous Sign Language Recognition via Cross Modality
Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pair.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.