Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection
- URL: http://arxiv.org/abs/2003.01309v1
- Date: Tue, 3 Mar 2020 03:17:29 GMT
- Title: Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection
- Authors: Qian Chen, Mengzhe Chen, Bo Li, Wen Wang
- Abstract summary: We propose a Controllable Time-delay Transformer (CT-Transformer) model that jointly completes the punctuation prediction and disfluency detection tasks in real time.
The proposed approach outperforms the previous state-of-the-art models on F-scores and achieves a competitive inference speed.
- Score: 10.265607222257263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increased applications of automatic speech recognition (ASR) in
recent years, it is essential to automatically insert punctuation marks and
remove disfluencies in transcripts, to improve the readability of the
transcripts as well as the performance of subsequent applications, such as
machine translation, dialogue systems, and so forth. In this paper, we propose
a Controllable Time-delay Transformer (CT-Transformer) model that jointly
completes the punctuation prediction and disfluency detection tasks in real
time. The CT-Transformer model facilitates freezing partial outputs with
controllable time delay to fulfill the real-time constraints in partial
decoding required by subsequent applications. We further propose a fast
decoding strategy to minimize latency while maintaining competitive
performance. Experimental results on the IWSLT2011 benchmark dataset and an
in-house Chinese annotated dataset demonstrate that the proposed approach
outperforms the previous state-of-the-art models on F-scores and achieves a
competitive inference speed.
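To make the freezing idea concrete, here is a minimal sketch (not the authors' code) of decoding with a controllable time delay: predictions older than `delay` tokens are frozen and can be committed by downstream applications, while the most recent ones may still be revised. `predict_tags` is a hypothetical stand-in for the CT-Transformer forward pass.

```python
from typing import Iterable, Iterator, List, Tuple

Tag = Tuple[str, bool]  # (punctuation to insert after the token, is_disfluent)

def predict_tags(tokens: List[str]) -> List[Tag]:
    """Hypothetical stand-in for the CT-Transformer forward pass."""
    return [("", False) for _ in tokens]  # placeholder predictions

def stream_decode(token_stream: Iterable[str],
                  delay: int = 5) -> Iterator[Tuple[List[Tag], List[Tag]]]:
    tokens: List[str] = []
    frozen: List[Tag] = []  # outputs guaranteed never to change again
    for tok in token_stream:
        tokens.append(tok)
        preds = predict_tags(tokens)
        # Freeze every prediction at least `delay` tokens old; the boundary
        # only moves forward, which is what lets consumers commit the prefix.
        boundary = max(len(preds) - delay, len(frozen))
        frozen.extend(preds[len(frozen):boundary])
        yield frozen, preds[boundary:]  # (stable prefix, tentative suffix)

# e.g.: for stable, tentative in stream_decode("well uh i mean".split(), delay=2): ...
```

The paper's fast decoding strategy additionally bounds latency by truncating history; in this sketch that would amount to calling `predict_tags` on only a bounded suffix of `tokens`.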
Related papers
- Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models [4.655740975414312]
This paper introduces Test-Time Low-rank adaptation (TTL) as an alternative to prompt tuning for zero-shot generalization of large-scale vision-language models (VLMs).
TTL offers a test-time-efficient adaptation approach that updates the attention weights of the transformer by maximizing prediction confidence.
arXiv Detail & Related papers (2024-07-22T17:59:19Z)
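As a rough illustration of TTL's confidence-maximization idea (the paper's details, such as augmented views and the exact placement of the low-rank adapters, differ), a single test-time update might look like the following sketch, where `model` and `lora_params` are assumed to exist and require gradients:

```python
import torch

def confidence_step(model, lora_params, x, lr=1e-3):
    """One schematic test-time step: minimize prediction entropy (i.e.,
    maximize confidence), updating only the low-rank adapter parameters."""
    logits = model(x)  # assumed callable returning class logits
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    grads = torch.autograd.grad(entropy, lora_params)
    with torch.no_grad():
        for p, g in zip(lora_params, grads):
            p -= lr * g  # plain SGD on the adapters; the backbone stays frozen
    return logits.detach()
```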
- LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models [58.790604613878216]
We introduce LibriSpeech-PC, a benchmark designed to assess the punctuation and capitalization prediction capabilities of end-to-end ASR models.
The benchmark includes a LibriSpeech-PC dataset with restored punctuation and capitalization, a novel evaluation metric called Punctuation Error Rate (PER) that focuses on punctuation marks, and initial baseline models.
arXiv Detail & Related papers (2023-10-04T16:23:37Z)
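The paper defines PER precisely; as a loose approximation of the idea, one can compare only the punctuation marks of reference and hypothesis and normalize an edit distance by the number of reference marks (the mark set and alignment details below are illustrative assumptions):

```python
PUNCT = {",", ".", "?", "!", ";", ":"}

def punct_seq(tokens):
    """Keep only the punctuation marks, in order of appearance."""
    return [t for t in tokens if t in PUNCT]

def punctuation_error_rate(ref_tokens, hyp_tokens):
    """Levenshtein distance between the punctuation-only sequences,
    normalized by the number of reference punctuation marks."""
    r, h = punct_seq(ref_tokens), punct_seq(hyp_tokens)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)
```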
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity, at the cost of a computationally expensive self-attention mechanism.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side.
Through gradient-based optimization, DecT can be trained within several seconds and requires only one PLM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a 200× speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
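DecT's speed claim follows from the training setup: gradients never flow through the pre-trained language model (PLM). A schematic version (DecT's actual prompt-based scoring and decoder are more elaborate) caches one PLM representation per sample and fits only a small head; `plm` is a hypothetical callable returning a feature vector per text:

```python
import torch
import torch.nn as nn

def cache_features(plm, texts):
    """Query the frozen PLM exactly once per sample and cache the outputs."""
    with torch.no_grad():
        return torch.stack([plm(t) for t in texts])  # (N, hidden_dim)

def train_decoder(features, labels, num_classes, epochs=20, lr=1e-2):
    """Optimize only a lightweight decoder head on the cached features,
    which is why training finishes in seconds."""
    decoder = nn.Linear(features.size(-1), num_classes)
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(decoder(features), labels)
        loss.backward()
        opt.step()
    return decoder
```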
- Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection [3.884530687475798]
A streaming BERT-based sequence tagging model is capable of detecting disfluencies in real time.
The model attains state-of-the-art latency and stability scores when compared with recent work on incremental disfluency detection.
arXiv Detail & Related papers (2022-05-02T02:13:24Z)
- FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization [72.9385528828306]
A typical transducer model decodes the output sequence conditioned on the current acoustic state.
The number of blank tokens in the prediction results accounts for nearly 90% of all tokens.
We propose a method named fast-skip regularization, which tries to align the blank position predicted by a transducer with that predicted by a CTC model.
arXiv Detail & Related papers (2021-04-07T03:15:10Z)
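FSR itself is a training-time regularizer that aligns the transducer's blank positions with a CTC model's; the payoff is that blank frames become reliably skippable at inference. A simplified greedy decoder with such skipping might look like the following sketch, where `joint`, `pred_net`, and the cheap per-frame `blank_prob` estimate (e.g., from a CTC head) are all assumed objects:

```python
def greedy_decode(encoder_frames, blank_prob, joint, pred_net,
                  blank_id=0, skip_thresh=0.99):
    """Simplified greedy transducer decoding (at most one token per frame).
    Near-certain blank frames skip the joint/prediction-network computation
    entirely, which pays off when ~90% of predictions are blank."""
    tokens, state = [], pred_net.initial_state()
    for frame, p_blank in zip(encoder_frames, blank_prob):
        if p_blank >= skip_thresh:
            continue                     # fast skip: no decoder work at all
        probs = joint(frame, state)      # distribution over vocab + blank
        token = int(probs.argmax())
        if token != blank_id:
            tokens.append(token)
            state = pred_net.update(state, token)
    return tokens
```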
- Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z)
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
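In gradient terms, FastEmit amounts to reweighting the two directions each transducer lattice node can take; roughly (a paraphrase of the method, with hypothetical tensor names), the label-emission gradient is scaled up by (1 + λ) while the blank gradient is left unchanged:

```python
def fastemit_node_grads(grad_label, grad_blank, lam=0.01):
    """Schematic FastEmit reweighting at one lattice node (t, u): favor
    emitting the next label over blank, sequence-wide, without any
    per-token or per-frame latency supervision."""
    return (1.0 + lam) * grad_label, grad_blank
```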
- Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that, with limited data far less than needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z)
- AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment [38.85714892799518]
AlignTTS is based on a Feed-Forward Transformer which generates mel-spectrum from a sequence of characters, and the duration of each character is determined by a duration predictor.
Our model not only achieves state-of-the-art performance, outperforming Transformer TTS by 0.03 in mean opinion score (MOS), but also runs more than 50 times faster than real time.
arXiv Detail & Related papers (2020-03-04T08:44:32Z)
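The duration predictor is what removes the need for an external aligner: each character's hidden state is repeated for as many mel frames as it is predicted to span. A minimal expansion step (FastSpeech-style length regulation; variable names are illustrative, not the authors' code) could be:

```python
import torch

def length_regulate(char_hidden: torch.Tensor,
                    durations: torch.Tensor) -> torch.Tensor:
    """Expand per-character hidden states to frame rate.
    char_hidden: (num_chars, hidden_dim); durations: (num_chars,) int frames."""
    return torch.repeat_interleave(char_hidden, durations, dim=0)
```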
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.