End-to-End Rich Transcription-Style Automatic Speech Recognition with
Semi-Supervised Learning
- URL: http://arxiv.org/abs/2107.05382v1
- Date: Wed, 7 Jul 2021 12:52:49 GMT
- Title: End-to-End Rich Transcription-Style Automatic Speech Recognition with
Semi-Supervised Learning
- Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota
Orihashi, Naoki Makishima
- Abstract summary: We propose a semi-supervised learning method for building end-to-end rich transcription-style automatic speech recognition (RT-ASR) systems.
The key process in our learning is to convert the common transcription-style dataset into a pseudo-rich transcription-style dataset.
Our experiments on spontaneous ASR tasks showed the effectiveness of the proposed method.
- Score: 28.516240952627076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a semi-supervised learning method for building end-to-end rich
transcription-style automatic speech recognition (RT-ASR) systems from
small-scale rich transcription-style and large-scale common transcription-style
datasets. In spontaneous speech tasks, various speech phenomena such as
fillers, word fragments, laughter, and coughs are often included. While
common transcriptions do not give special awareness to these phenomena, rich
transcriptions explicitly convert them into special phenomenon tokens as well
as textual tokens. In previous studies, the textual and phenomenon tokens were
simultaneously estimated in an end-to-end manner. However, it is difficult to
build accurate RT-ASR systems because large-scale rich transcription-style
datasets are often unavailable. To solve this problem, our training method uses
a limited rich transcription-style dataset and common transcription-style
dataset simultaneously. The key process in our semi-supervised learning is to
convert the common transcription-style dataset into a pseudo-rich
transcription-style dataset. To this end, we introduce style tokens, which
control whether or not phenomenon tokens are generated, into transformer-based
autoregressive modeling. We use this modeling both to generate the pseudo-rich
transcription-style dataset and to build the RT-ASR system from the pseudo and
original datasets. Our experiments on spontaneous ASR tasks showed the
effectiveness of the proposed method.
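The style-token mechanism described in the abstract can be illustrated with a minimal sketch. All names here are hypothetical (the paper does not specify its token inventory, and the toy `pseudo_rich_decode` stands in for the trained transformer); the sketch only shows the data flow: a style token is prepended to the decoder input to condition generation, and pseudo-rich labels are produced from common transcriptions.

```python
# Hypothetical style tokens: <rich> asks the decoder to emit phenomenon
# tokens, <common> asks for textual tokens only.
RICH = "<rich>"
COMMON = "<common>"

def build_decoder_input(style_token, tokens):
    """Prepend the style token so the autoregressive decoder is
    conditioned on the desired transcription style."""
    return [style_token] + tokens

def strip_phenomena(tokens, phenomenon_tokens):
    """Reduce a rich transcription to a common one by dropping
    phenomenon tokens."""
    return [t for t in tokens if t not in phenomenon_tokens]

def pseudo_rich_decode(common_tokens):
    """Toy stand-in for the trained model: in the real system a
    transformer decodes pseudo-rich tokens conditioned on <rich>;
    here we just tag one known filler word to show the data flow."""
    out = []
    for t in common_tokens:
        if t == "uh":
            out.append("<filler>")  # hypothetical phenomenon token
        out.append(t)
    return out

# A common-style transcription is converted into a pseudo-rich one,
# which then joins the small original rich-style dataset for RT-ASR
# training, each example marked with its style token.
common = ["uh", "hello", "world"]
pseudo_rich = pseudo_rich_decode(common)
training_example = build_decoder_input(RICH, pseudo_rich)
```

At training time, examples from the common-style dataset would carry the `<common>` token and examples from the (pseudo-)rich dataset the `<rich>` token, so a single model learns both styles.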
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a pure visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z)
- FLIP: Towards Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for click-through rate (CTR) prediction.
Specifically, the masked data of one modality (i.e., tokens or features) has to be recovered with the help of the other modality, which establishes the feature-level interaction and alignment.
Experiments on three real-world datasets demonstrate that FLIP outperforms SOTA baselines, and is highly compatible for various ID-based models and PLMs.
arXiv Detail & Related papers (2023-10-30T11:25:03Z)
- Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation [51.399695200838586]
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on it,es,de->en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
arXiv Detail & Related papers (2023-10-23T11:00:27Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Improving Data Driven Inverse Text Normalization using Data Augmentation [14.820077884045645]
Inverse text normalization (ITN) is used to convert the spoken form output of an automatic speech recognition (ASR) system to a written form.
We present a data augmentation technique that effectively generates rich spoken-written numeric pairs from out-of-domain textual data.
We empirically demonstrate that an ITN model trained using our data augmentation technique consistently outperforms an ITN model trained using only in-domain data.
arXiv Detail & Related papers (2022-07-20T06:07:26Z)
- Context-Aware Transformer Transducer for Speech Recognition [21.916660252023707]
We present a novel context-aware transformer transducer (CATT) network that improves the state-of-the-art transformer-based ASR system by taking advantage of such contextual signals.
We show that CATT, using a BERT-based context encoder, improves the word error rate of the baseline transformer transducer and outperforms an existing deep contextual model by 24.2% and 19.4%, respectively.
arXiv Detail & Related papers (2021-11-05T04:14:35Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- TS-Net: OCR Trained to Switch Between Text Transcription Styles [0.0]
We propose to extend existing text recognition networks with a Transcription Style Block (TSB).
TSB can learn from data to switch between multiple transcription styles without any explicit knowledge of transcription rules.
We show that TSB is able to learn completely different transcription styles in controlled experiments on artificial data.
arXiv Detail & Related papers (2021-03-09T15:21:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.