End-to-End Rich Transcription-Style Automatic Speech Recognition with
Semi-Supervised Learning
- URL: http://arxiv.org/abs/2107.05382v1
- Date: Wed, 7 Jul 2021 12:52:49 GMT
- Title: End-to-End Rich Transcription-Style Automatic Speech Recognition with
Semi-Supervised Learning
- Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota
Orihashi, Naoki Makishima
- Abstract summary: We propose a semi-supervised learning method for building end-to-end rich transcription-style automatic speech recognition (RT-ASR) systems.
The key process in our learning is to convert the common transcription-style dataset into a pseudo-rich transcription-style dataset.
Our experiments on spontaneous ASR tasks showed the effectiveness of the proposed method.
- Score: 28.516240952627076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a semi-supervised learning method for building end-to-end rich
transcription-style automatic speech recognition (RT-ASR) systems from
small-scale rich transcription-style and large-scale common transcription-style
datasets. In spontaneous speech tasks, various speech phenomena such as
fillers, word fragments, laughter, and coughs are often included. While
common transcriptions do not give special awareness to these phenomena, rich
transcriptions explicitly convert them into special phenomenon tokens as well
as textual tokens. In previous studies, the textual and phenomenon tokens were
simultaneously estimated in an end-to-end manner. However, it is difficult to
build accurate RT-ASR systems because large-scale rich transcription-style
datasets are often unavailable. To solve this problem, our training method uses
a limited rich transcription-style dataset and common transcription-style
dataset simultaneously. The key process in our semi-supervised learning is to
convert the common transcription-style dataset into a pseudo-rich
transcription-style dataset. To this end, we introduce style tokens, which
control whether or not phenomenon tokens are generated, into transformer-based
autoregressive modeling. We use this modeling both to generate the pseudo-rich
transcription-style dataset and to build the RT-ASR system from the pseudo and
original datasets. Our experiments on spontaneous ASR tasks showed the
effectiveness of the proposed method.
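The style-token mechanism described in the abstract can be illustrated with a minimal sketch. All names here are hypothetical (the paper does not specify its token inventory, and the toy `pseudo_rich_decode` stands in for the trained transformer); the sketch only shows the data flow: a style token is prepended to the decoder input to condition generation, and pseudo-rich labels are produced from common transcriptions.

```python
# Hypothetical style tokens: <rich> asks the decoder to emit phenomenon
# tokens, <common> asks for textual tokens only.
RICH = "<rich>"
COMMON = "<common>"

def build_decoder_input(style_token, tokens):
    """Prepend the style token so the autoregressive decoder is
    conditioned on the desired transcription style."""
    return [style_token] + tokens

def strip_phenomena(tokens, phenomenon_tokens):
    """Reduce a rich transcription to a common one by dropping
    phenomenon tokens."""
    return [t for t in tokens if t not in phenomenon_tokens]

def pseudo_rich_decode(common_tokens):
    """Toy stand-in for the trained model: in the real system a
    transformer decodes pseudo-rich tokens conditioned on <rich>;
    here we just tag one known filler word to show the data flow."""
    out = []
    for t in common_tokens:
        if t == "uh":
            out.append("<filler>")  # hypothetical phenomenon token
        out.append(t)
    return out

# A common-style transcription is converted into a pseudo-rich one,
# which then joins the small original rich-style dataset for RT-ASR
# training, each example marked with its style token.
common = ["uh", "hello", "world"]
pseudo_rich = pseudo_rich_decode(common)
training_example = build_decoder_input(RICH, pseudo_rich)
```

At training time, examples from the common-style dataset would carry the `<common>` token and examples from the (pseudo-)rich dataset the `<rich>` token, so a single model learns both styles.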
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a pure visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z)
- FLIP: Towards Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for click-through rate (CTR) prediction.
Specifically, the masked data of one modality (i.e., tokens or features) has to be recovered with the help of the other modality, which establishes the feature-level interaction and alignment.
Experiments on three real-world datasets demonstrate that FLIP outperforms SOTA baselines, and is highly compatible for various ID-based models and PLMs.
arXiv Detail & Related papers (2023-10-30T11:25:03Z)
- Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation [51.399695200838586]
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on it,es,de->en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
arXiv Detail & Related papers (2023-10-23T11:00:27Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Improving Data Driven Inverse Text Normalization using Data Augmentation [14.820077884045645]
Inverse text normalization (ITN) is used to convert the spoken form output of an automatic speech recognition (ASR) system to a written form.
We present a data augmentation technique that effectively generates rich spoken-written numeric pairs from out-of-domain textual data.
We empirically demonstrate that an ITN model trained using our data augmentation technique consistently outperforms an ITN model trained using only in-domain data.
arXiv Detail & Related papers (2022-07-20T06:07:26Z)
- Context-Aware Transformer Transducer for Speech Recognition [21.916660252023707]
We present a novel context-aware transformer transducer (CATT) network that improves the state-of-the-art transformer-based ASR system by taking advantage of such contextual signals.
We show that CATT, using a BERT-based context encoder, improves the word error rate of the baseline transformer transducer and outperforms an existing deep contextual model by 24.2% and 19.4%, respectively.
arXiv Detail & Related papers (2021-11-05T04:14:35Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- TS-Net: OCR Trained to Switch Between Text Transcription Styles [0.0]
We propose to extend existing text recognition networks with a Transcription Style Block (TSB).
TSB can learn from data to switch between multiple transcription styles without any explicit knowledge of transcription rules.
We show that TSB is able to learn completely different transcription styles in controlled experiments on artificial data.
arXiv Detail & Related papers (2021-03-09T15:21:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.