Related papers: Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework

Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework

URL: http://arxiv.org/abs/2307.01715v3
Date: Thu, 7 Mar 2024 17:59:25 GMT
Title: Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework
Authors: Eliya Segev, Maya Alroy, Ronen Katsir, Noam Wies, Ayana Shenhav, Yael Ben-Oren, David Zar, Oren Tadmor, Jacob Bitterman, Amnon Shashua and Tal Rosenwein
Abstract summary: Connectionist Temporal Classification ( CTC) is a widely used criterion for training sequence-to-sequence (seq2seq) models. We propose $textitAlign With Purpose, a $textbfgeneral Plug-and-Play framework for enhancing a desired property in models trained with the CTC criterion. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset.
Score: 8.228892600588765
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (that yield the ground truth), at the expense of imperfect alignments. This binary differentiation of perfect and imperfect alignments falls short of capturing other essential alignment properties that hold significance in other real-world applications. Here we propose $\textit{Align With Purpose}$, a $\textbf{general Plug-and-Play framework}$ for enhancing a desired property in models trained with the CTC criterion. We do that by complementing the CTC with an additional loss term that prioritizes alignments according to a desired property. Our method does not require any intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation between both perfect and imperfect alignments. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report an improvement of up to 570ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on a scale of data as large as ours. Notably, our method can be implemented using only a few lines of code, and can be extended to other alignment-free loss functions and to domains other than ASR.

Related papers

FACT: Multinomial Misalignment Classification for Point Cloud Registration [1.256245863497516]
We present FACT, a method for predicting alignment quality (i.e., registration error) of registered lidar point cloud pairs. FACT extracts local features from a registered pair and processes them with a point transformer-based network to predict a misalignment class.
arXiv Detail & Related papers (2025-04-09T07:01:57Z)
PIPA: Preference Alignment as Prior-Informed Statistical Estimation [57.24096291517857]
We introduce Pior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework. PIPA accommodates both paired and unpaired data, as well as answer and step-level annotations. By integrating different types of prior information, we developed two variations of PIPA: PIPA-M and PIPA-N.
arXiv Detail & Related papers (2025-02-09T04:31:30Z)
A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport [12.835774667953187]
We propose a novel differentiable alignment framework based on one-dimensional optimal transport. We show that our method considerably improves alignment performance, though with a trade-off in ASR performance when compared to CTC.
arXiv Detail & Related papers (2025-02-03T18:20:29Z)
Towards Generalizable Trajectory Prediction Using Dual-Level Representation Learning And Adaptive Prompting [107.4034346788744]
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions. We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation.
arXiv Detail & Related papers (2025-01-08T20:11:09Z)
Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models [79.84205827056907]
We present Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference data.<n>$SSO$ employs a specialized optimization objective to build a data generator from the policy model itself, which is used to produce accurate and on-policy data.<n>Our evaluation shows that $SSO$ consistently outperforms baselines in human preference alignment and reward optimization.
arXiv Detail & Related papers (2024-10-22T16:04:03Z)
Physically Feasible Semantic Segmentation [58.17907376475596]
State-of-the-art semantic segmentation models are typically optimized in a data-driven fashion. Our method, Physically Feasible Semantic (PhyFea), extracts explicit physical constraints that govern spatial class relations. PhyFea yields significant performance improvements in mIoU over each state-of-the-art network we use.
arXiv Detail & Related papers (2024-08-26T22:39:08Z)
Revisiting Cascaded Ensembles for Efficient Inference [32.914852531806]
A common approach to make machine learning inference more efficient is to use example-specific adaptive schemes. In this work we study a simple scheme for adaptive inference. We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models.
arXiv Detail & Related papers (2024-07-02T15:14:12Z)
Indirectly Parameterized Concrete Autoencoders [40.35109085799772]
Recent developments in neural network-based embedded feature selection show promising results across a wide range of applications. Recent developments in neural network-based embedded feature selection show promising results across a wide range of applications.
arXiv Detail & Related papers (2024-03-01T14:41:51Z)
Domain Aligned CLIP for Few-shot Classification [3.5326413171911555]
Domain Aligned CLIP (DAC) improves both intra-modal (image-image) and inter-modal alignment on target distributions without fine-tuning the main model. We study the effectiveness of DAC by benchmarking on 11 widely used image classification tasks with consistent improvements in 16-shot classification upon strong baselines by about 2.3%.
arXiv Detail & Related papers (2023-11-15T18:34:26Z)
Adaptive Neural Ranking Framework: Toward Maximized Business Goal for Cascade Ranking Systems [33.46891569350896]
Cascade ranking is widely used for large-scale top-k selection problems in online advertising and recommendation systems. Previous works on learning-to-rank usually focus on letting the model learn the complete order or top-k order. We name this method as Adaptive Neural Ranking Framework (abbreviated as ARF)
arXiv Detail & Related papers (2023-10-16T14:43:02Z)
Self-distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach [14.69981874614434]
We show how to better optimize a text recognition model from the perspective of loss functions. CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with degradation accuracy. We propose a self-distillation scheme for CTC-based model to address this issue.
arXiv Detail & Related papers (2023-08-17T06:32:57Z)
Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration [71.80326738527734]
We propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations. We show that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework.
arXiv Detail & Related papers (2021-11-22T23:53:14Z)
Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels. We demonstrate that transducer-based ASR with CTC-like lattice achieves better results compared to standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
Higher Performance Visual Tracking with Dual-Modal Localization [106.91097443275035]
Visual Object Tracking (VOT) has synchronous needs for both robustness and accuracy. We propose a dual-modal framework for target localization, consisting of robust localization suppressingors via ONR and the accurate localization attending to the target center precisely via OFC.
arXiv Detail & Related papers (2021-03-18T08:47:56Z)
Robust Optimal Transport with Applications in Generative Modeling and Domain Adaptation [120.69747175899421]
Optimal Transport (OT) distances such as Wasserstein have been used in several areas such as GANs and domain adaptation. We propose a computationally-efficient dual form of the robust OT optimization that is amenable to modern deep learning applications. Our approach can train state-of-the-art GAN models on noisy datasets corrupted with outlier distributions.
arXiv Detail & Related papers (2020-10-12T17:13:40Z)
Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pair. We propose a novel architecture with cross modality augmentation. The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.