Word Level Timestamp Generation for Automatic Speech Recognition and Translation
- URL: http://arxiv.org/abs/2505.15646v1
- Date: Wed, 21 May 2025 15:24:29 GMT
- Title: Word Level Timestamp Generation for Automatic Speech Recognition and Translation
- Authors: Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, Boris Ginsburg
- Abstract summary: We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages.
- Score: 28.176210372699618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new <|timestamp|> token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages, with minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors around 200 milliseconds.
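Interpreting the abstract, the training targets presumably interleave words with discrete time tokens derived from the NFA teacher alignments. Below is a minimal sketch of such target construction; the 80 ms quantization step, the token spelling, and the helper names are illustrative assumptions, not the paper's documented format.
```python
# Illustrative target construction for word-level timestamp training
# (a sketch, not the paper's documented format): each word from the
# NFA teacher alignment is bracketed by quantized start/end time tokens.
# The 80 ms step and the token spelling are assumptions.

FRAME_MS = 80  # assumed quantization step

def time_token(ms: float) -> str:
    """Map a time in milliseconds to a discrete timestamp token."""
    return f"<|{round(ms / FRAME_MS)}|>"

def build_target(aligned_words):
    """aligned_words: list of (word, start_ms, end_ms) from the teacher aligner."""
    pieces = []
    for word, start, end in aligned_words:
        pieces += [time_token(start), word, time_token(end)]
    return " ".join(pieces)

# Example with a teacher alignment like one NFA might produce:
alignment = [("hello", 0.0, 400.0), ("world", 480.0, 960.0)]
print(build_target(alignment))
# -> <|0|> hello <|5|> <|6|> world <|12|>
```
Trained on targets of this shape, the decoder emits time tokens alongside words at inference, which is consistent with the abstract's claim that no separate alignment module is needed.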
Related papers
- TimeRefine: Temporal Grounding with Time Refining Video LLM [75.99665302872901]
Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. We reformulate the temporal grounding task as a temporal refining task. We incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth (see the sketch after the citation below).
arXiv Detail & Related papers (2024-12-12T18:59:11Z)
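As a generic illustration of "penalizing the model more the further a predicted segment deviates," here is a toy L1-style boundary penalty; TimeRefine's actual auxiliary head and loss are not described here, so treat this purely as a sketch.
```python
# Toy boundary penalty that grows with deviation from the ground-truth
# segment (a generic guess at the idea; not TimeRefine's actual loss).
def segment_penalty(pred, gt):
    """pred, gt: (start_s, end_s) for one predicted/reference segment."""
    start_err = abs(pred[0] - gt[0])
    end_err = abs(pred[1] - gt[1])
    return start_err + end_err  # larger deviation -> larger penalty

print(segment_penalty((10.0, 15.0), (9.0, 16.5)))  # -> 2.5
```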
- Handling Numeric Expressions in Automatic Speech Recognition [56.972851337263755]
We compare cascaded and end-to-end approaches to recognize and format numeric expressions.
Results show that adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
arXiv Detail & Related papers (2024-07-18T09:46:19Z)
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- AutoTimes: Autoregressive Time Series Forecasters via Large Language Models [67.83502953961505]
AutoTimes projects time series into the embedding space of language tokens and autoregressively generates future predictions of arbitrary length.
We formulate time series as prompts, extending the context for prediction beyond the lookback window.
AutoTimes achieves state-of-the-art results with 0.1% trainable parameters and over $5\times$ training/inference speedup (a projection sketch follows the citation below).
arXiv Detail & Related papers (2024-02-04T06:59:21Z)
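A minimal sketch of the projection idea: fixed-length time-series patches are mapped into a language model's token-embedding space, after which they can be decoded autoregressively. The patch size, embedding width, and random weights below are placeholder assumptions, not AutoTimes' actual configuration.
```python
import numpy as np

# Sketch: project fixed-length time-series patches into a language
# model's token-embedding space. Patch size, width, and the random
# weights stand in for a learned projection (all assumed).
PATCH, D_MODEL = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(PATCH, D_MODEL))  # placeholder for a trained map

def embed_patches(series):
    """Split a 1-D series into patches and project each to d_model."""
    n = len(series) // PATCH
    patches = np.asarray(series[: n * PATCH]).reshape(n, PATCH)
    return patches @ W  # (n_patches, d_model) pseudo-token embeddings

tokens = embed_patches(np.sin(np.linspace(0.0, 6.28, 32)))
print(tokens.shape)  # -> (4, 16)
```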
- Pre-trained Language Model with Prompts for Temporal Knowledge Graph Completion [30.50032335014021]
We propose a novel TKGC model, namely Pre-trained Language Model with Prompts for TKGC (PPT).
We convert a series of sampled quadruples into pre-trained language model inputs and convert intervals between timestamps into different prompts to make coherent sentences with implicit semantic information.
Our model can effectively incorporate information from temporal knowledge graphs into language models (a toy prompt-construction sketch follows the citation below).
arXiv Detail & Related papers (2023-05-13T12:53:11Z)
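A toy illustration of converting consecutive quadruples and the interval between their timestamps into a coherent sentence; the template wording and gap buckets are invented for illustration and are not PPT's actual prompts.
```python
# Toy conversion of two consecutive quadruples plus their time gap into
# a sentence (template and gap buckets are invented, not PPT's prompts).
def gap_prompt(days: int) -> str:
    if days == 0:
        return "on the same day"
    return "shortly after" if days <= 30 else "long after"

def verbalize(prev_quad, quad):
    (s0, r0, o0, t0), (s1, r1, o1, t1) = prev_quad, quad
    return f"{s0} {r0} {o0}. {gap_prompt(t1 - t0)}, {s1} {r1} {o1}."

q1 = ("Alice", "visited", "Paris", 0)   # (subject, relation, object, day)
q2 = ("Alice", "met", "Bob", 7)
print(verbalize(q1, q2))
# -> Alice visited Paris. shortly after, Alice met Bob.
```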
- FullStop: Punctuation and Segmentation Prediction for Dutch with Transformers [1.2246649738388389]
The model we present is an extension of the models of Guhr et al. (2021) for Dutch and is made publicly available.
For every word in the input sequence, the model predicts the punctuation marker that follows the word. Results are substantially better than a machine translation baseline approach (see the sketch after the citation below).
arXiv Detail & Related papers (2023-01-09T13:12:05Z)
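The task shape is one punctuation marker per input word; the following sketch shows that input/output contract with a trivial rule-based stand-in for the transformer classifier (the marker inventory and heuristic are assumptions).
```python
# Shape of the per-word punctuation task: one marker per input word.
# The heuristic below is a trivial stand-in for the transformer model.
def predict_markers(words):
    """Return one punctuation marker per word (dummy rule: end with '.')."""
    return ["." if i == len(words) - 1 else "" for i in range(len(words))]

words = "dit is een zin zonder interpunctie".split()
restored = " ".join(w + m for w, m in zip(words, predict_markers(words)))
print(restored)  # -> dit is een zin zonder interpunctie.
```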
- PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting [11.670324826998968]
In existing time series forecasting methods, the models take a sequence of numerical values as input and yield numerical values as output.
Inspired by the successes of pre-trained language foundation models, we propose a new forecasting paradigm: prompt-based time series forecasting.
In this novel task, the numerical input and output are transformed into prompts, and the forecasting task is framed in a sentence-to-sentence manner (see the sketch after the citation below).
arXiv Detail & Related papers (2022-09-20T10:15:35Z)
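A sketch of the sentence-to-sentence framing: the numeric history is rendered as a textual prompt, and the forecast is parsed back out of generated text. The template wording and parsing are assumptions, not PromptCast's published prompts.
```python
import re

# Sketch of sentence-to-sentence forecasting: numbers -> textual prompt,
# generated text -> number. The template is invented, not PromptCast's.
def to_prompt(city, temps):
    history = ", ".join(f"{t} degrees" for t in temps)
    return (f"In {city} the temperature was {history} over the last "
            f"{len(temps)} days. What is the temperature tomorrow?")

def parse_forecast(generated: str) -> float:
    """Pull the first number out of the generated answer sentence."""
    return float(re.search(r"-?\d+(?:\.\d+)?", generated).group())

print(to_prompt("Utrecht", [18, 21, 19]))
print(parse_forecast("The temperature will be 20 degrees."))  # -> 20.0
```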
- Thutmose Tagger: Single-pass neural model for Inverse Text Normalization [76.87664008338317]
Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition.
We present a dataset preparation method based on the granular alignment of ITN examples.
One-to-one correspondence between tags and input words improves the interpretability of the model's predictions (see the sketch after the citation below).
arXiv Detail & Related papers (2022-07-29T20:39:02Z)
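The one-tag-per-word idea can be pictured as follows; the tag inventory and the digit-merging rule are simplified assumptions rather than Thutmose Tagger's exact scheme.
```python
# Sketch of tag-based ITN: one tag per spoken-domain word, either a
# replacement fragment, <SELF> (keep), or <DELETE>. The tag inventory
# and the digit-merging rule are simplified assumptions.
def apply_tags(words, tags):
    out = []
    for word, tag in zip(words, tags):
        if tag == "<DELETE>":
            continue
        token = word if tag == "<SELF>" else tag
        # merge adjacent digit fragments ("2" + "3" -> "23")
        if out and out[-1][-1].isdigit() and token[0].isdigit():
            out[-1] += token
        else:
            out.append(token)
    return " ".join(out)

words = "it costs twenty three euros".split()
tags = ["<SELF>", "<SELF>", "2", "3", "<SELF>"]
print(apply_tags(words, tags))  # -> it costs 23 euros
```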
- NLP-CIC @ DIACR-Ita: POS and Neighbor Based Distributional Models for Lexical Semantic Change in Diachronic Italian Corpora [62.997667081978825]
We present our systems and findings on unsupervised lexical semantic change for the Italian language.
The task is to determine whether a target word has evolved its meaning with time, only relying on raw-text from two time-specific datasets.
We propose two models that represent the target words across the two periods and predict changed words using threshold and voting schemes (see the sketch after the citation below).
arXiv Detail & Related papers (2020-11-07T11:27:18Z)
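The threshold-and-voting scheme might look like the following: compute a distance between a target word's representations in the two periods for each model, then take a majority vote. The cosine distance, threshold value, and random vectors are placeholder assumptions.
```python
import numpy as np

# Sketch of change detection: threshold the cosine distance between a
# word's vectors from two periods, then majority-vote across models.
def cosine_distance(u, v):
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def has_changed(vecs_t1, vecs_t2, threshold=0.4):
    """vecs_t1/vecs_t2: one vector per model for the same target word."""
    votes = [cosine_distance(u, v) > threshold
             for u, v in zip(vecs_t1, vecs_t2)]
    return sum(votes) > len(votes) / 2  # majority vote

rng = np.random.default_rng(1)
t1 = [rng.normal(size=8) for _ in range(3)]
t2 = [v + rng.normal(scale=2.0, size=8) for v in t1]  # heavy perturbation
print(has_changed(t1, t2))  # likely True for strongly perturbed vectors
```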
This list is automatically generated from the titles and abstracts of the papers on this site.