Related papers: Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization

Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization

URL: http://arxiv.org/abs/2505.24229v1
Date: Fri, 30 May 2025 05:41:03 GMT
Title: Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization
Authors: Luong Ho, Khanh Le, Vinh Pham, Bao Nguyen, Tan Tran, Duc Chau,
Abstract summary: Inverse Text Normalization (ITN) is crucial for converting spoken Automatic Speech Recognition (ASR) outputs into well-formatted written text.<n>We introduce a streaming pretrained language model for ITN, leveraging pretrained linguistic representations for improved robustness.<n>Our method achieves accuracy comparable to non-streaming ITN and surpasses existing streaming ITN models on a Vietnamese dataset.
Score: 0.19791587637442667
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Inverse Text Normalization (ITN) is crucial for converting spoken Automatic Speech Recognition (ASR) outputs into well-formatted written text, enhancing both readability and usability. Despite its importance, the integration of streaming ITN within streaming ASR remains largely unexplored due to challenges in accuracy, efficiency, and adaptability, particularly in low-resource and limited-context scenarios. In this paper, we introduce a streaming pretrained language model for ITN, leveraging pretrained linguistic representations for improved robustness. To address streaming constraints, we propose Dynamic Context-Aware during training and inference, enabling adaptive chunk size adjustments and the integration of right-context information. Experimental results demonstrate that our method achieves accuracy comparable to non-streaming ITN and surpasses existing streaming ITN models on a Vietnamese dataset, all while maintaining low latency, ensuring seamless integration into ASR systems.

Related papers

HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation [19.997594859651233]
HENT-SRT is a novel framework that factorizes ASR and translation tasks to better handle reordering.<n>We improve computational efficiency by incorporating best practices from ASR transducers.<n>Our approach is evaluated on three conversational datasets Arabic, Spanish, and Mandarin.
arXiv Detail & Related papers (2025-06-02T18:37:50Z)
Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities.<n>We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details.<n>We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high-noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation [29.76274107159478]
Non-autoregressive Transformers (NATs) are applied in direct speech-to-speech translation systems. We introduce DiffNorm, a diffusion-based normalization strategy that simplifies data distributions for training NAT models. Our strategies result in a notable improvement of about +7 ASR-BLEU for English-Spanish (En-Es) and +2 ASR-BLEU for English-French (En-Fr) on the CVSS benchmark.
arXiv Detail & Related papers (2024-05-22T01:10:39Z)
Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL) GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval. Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
Improving Robustness of Neural Inverse Text Normalization via Data-Augmentation, Semi-Supervised Learning, and Post-Aligning Method [4.343606621506086]
Inverse text normalization (ITN) is crucial for converting spoken-form into written-form, especially in the context of automatic speech recognition (ASR) We propose a direct training approach that utilizes ASR-generated written or spoken text, with pairs augmented through ASR linguistic context emulation and a semi-supervised learning method enhanced by a large language model. Our proposed methods remarkably improved ITN performance in various ASR scenarios.
arXiv Detail & Related papers (2023-09-12T06:05:57Z)
Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR [20.42366884075422]
We propose the integration of a novel dynamic contextual carry-over mechanism in a state-of-the-art unified ASR system. Our proposed dynamic context Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over mechanism. We outperform the SOTA by a relative 25.0% word error rate, with a negligible latency impact due to the additional context embeddings.
arXiv Detail & Related papers (2023-06-13T23:42:53Z)
Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT) CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned. We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR) We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.