Learning Phone Recognition from Unpaired Audio and Phone Sequences Based
on Generative Adversarial Network
- URL: http://arxiv.org/abs/2207.14568v1
- Date: Fri, 29 Jul 2022 09:29:28 GMT
- Authors: Da-rong Liu, Po-chun Hsu, Yi-chen Chen, Sung-feng Huang, Shun-po
Chuang, Da-yi Wu, and Hung-yi Lee
- Abstract summary: This paper investigates how to learn directly from unpaired phone sequences and speech utterances.
GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequences.
In the second stage, an HMM is trained on the generator's output, which boosts the performance.
- Score: 58.82343017711883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) has recently achieved strong performance. However, most ASR systems rely on massive amounts of paired data, which is not feasible for low-resource languages worldwide. This paper investigates how to learn directly from unpaired phone sequences and speech utterances. We design a two-stage iterative framework. GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequences. In the second stage, an HMM is trained on the generator's output, which boosts the performance and provides a better segmentation for the next iteration. In the experiments, we first investigate different choices of model design. We then compare the framework against three types of baselines: (i) supervised methods, (ii) acoustic unit discovery based methods, and (iii) methods learning from unpaired data. On the TIMIT dataset, our framework consistently outperforms all acoustic unit discovery methods and previous methods that learn from unpaired data.
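To make the two-stage loop concrete, below is a minimal PyTorch sketch, not the authors' code: the model sizes, the helper names gan_step and refine_segmentation, and the toy data are all illustrative assumptions, and the paper's HMM training is only stubbed out in comments.

```python
import torch
import torch.nn as nn

NUM_PHONES, FEAT_DIM = 48, 39  # toy sizes, not the paper's configuration

class Generator(nn.Module):
    """Maps per-segment acoustic features to phone posteriors."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 128), nn.ReLU(),
            nn.Linear(128, NUM_PHONES),
        )
    def forward(self, x):                    # x: (B, T, FEAT_DIM)
        return self.net(x).softmax(dim=-1)   # (B, T, NUM_PHONES)

class Discriminator(nn.Module):
    """Scores a sequence of phone distributions as real text or generated."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(NUM_PHONES, 64, batch_first=True)
        self.out = nn.Linear(64, 1)
    def forward(self, p):                    # p: (B, T, NUM_PHONES)
        h, _ = self.rnn(p)
        return self.out(h[:, -1])            # (B, 1) logit

G, D = Generator(), Discriminator()
g_opt = torch.optim.Adam(G.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(segment_features, real_phone_ids):
    """Stage 1: one adversarial update. Real phone sequences are one-hot
    encoded so the discriminator sees the same representation as the
    generator's output distributions."""
    real = nn.functional.one_hot(real_phone_ids, NUM_PHONES).float()
    fake = G(segment_features)
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(fake.size(0), 1)
    d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    g_loss = bce(D(fake), ones)              # generator tries to fool D
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

def refine_segmentation(segment_features):
    """Stage 2 (stub): decode pseudo-labels from the generator; the paper
    instead trains an HMM on this output and uses its alignment to
    re-segment the speech for the next GAN iteration."""
    with torch.no_grad():
        return G(segment_features).argmax(dim=-1)   # (B, T) phone ids

# Toy usage with random "features" and an unpaired "phone corpus".
feats = torch.randn(16, 20, FEAT_DIM)
phones = torch.randint(0, NUM_PHONES, (16, 20))
gan_step(feats, phones)
pseudo_labels = refine_segmentation(feats)
```

In the paper's framework, the refined segmentation produced in the second stage feeds the next GAN iteration, closing the loop.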
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks (auditory, visual, and audiovisual speech recognition) enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- From Modular to End-to-End Speaker Diarization [3.079020586262228]
We describe a system based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), known as VBx.
We describe an approach for generating synthetic data which resembles real conversations in terms of speaker turns and overlaps.
We show how this method of generating "simulated conversations" allows for better performance than a previously proposed method for creating "simulated mixtures" when training the popular EEND.
arXiv Detail & Related papers (2024-06-27T15:09:39Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models [14.538853403226751]
Building artificial intelligence systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research.
We propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM.
Our method only requires a quick training of the V2A-Mapper to produce high-fidelity and visually-aligned sound.
arXiv Detail & Related papers (2023-08-18T04:49:38Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- Data Augmentation based Consistency Contrastive Pre-training for Automatic Speech Recognition [18.303072203996347]
Self-supervised acoustic pre-training has achieved amazing results on the automatic speech recognition (ASR) task.
Most of the successful acoustic pre-training methods use contrastive learning to learn the acoustic representations.
In this letter, we design a novel consistency contrastive learning (CCL) method by utilizing data augmentation for acoustic pre-training (a generic sketch of this style of contrastive objective appears after this list).
arXiv Detail & Related papers (2021-12-23T13:23:17Z)
- Active Restoration of Lost Audio Signals Using Machine Learning and Latent Information [0.7252027234425334]
This paper proposes the combination of steganography, halftoning (dithering), and state-of-the-art shallow and deep learning methods.
We show improvement in inpainting performance in terms of signal-to-noise ratio (SNR), objective difference grade (ODG), and Hansen's audio quality metric.
arXiv Detail & Related papers (2021-11-21T20:11:33Z)
- Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding [69.40915115518523]
Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages.
Various data augmentation approaches have been proposed to synthesize training data in low-resource target languages.
In this paper we focus on mitigating noise in augmented data.
arXiv Detail & Related papers (2021-09-03T15:44:15Z)
- Hybrid Model and Data Driven Algorithm for Online Learning of Any-to-Any Path Loss Maps [19.963385352536616]
Learning any-to-any path loss maps might be a key enabler for applications that rely on device-to-device (D2D) communication.
Model-based methods have the advantage that they can generate reliable estimations with low computational complexity.
Pure data-driven methods can achieve good performance without assuming any physical model.
We propose a novel hybrid model- and data-driven approach that learns from datasets obtained in an online fashion.
arXiv Detail & Related papers (2021-07-14T13:08:25Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
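As referenced in the CCL entry above, here is a generic sketch of contrastive pre-training over augmented views of acoustic features. It is not the CCL paper's method: the names AcousticEncoder and mask_augment, the NT-Xent objective, and all sizes are illustrative assumptions standing in for the paper's actual augmentation and loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_augment(x, p=0.1):
    """Toy augmentation: randomly zero out time-feature entries
    (a crude stand-in for SpecAugment-style masking)."""
    return x * (torch.rand_like(x) > p)

class AcousticEncoder(nn.Module):
    """Illustrative encoder: mean-pooled GRU over a feature sequence."""
    def __init__(self, feat_dim=39, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
    def forward(self, x):                          # x: (B, T, feat_dim)
        h, _ = self.rnn(x)
        return F.normalize(h.mean(dim=1), dim=-1)  # (B, hidden), unit norm

def nt_xent(z1, z2, tau=0.1):
    """Standard NT-Xent loss: two augmented views of the same utterance
    are positives; all other utterances in the batch are negatives."""
    z = torch.cat([z1, z2], dim=0)                 # (2B, H)
    sim = z @ z.t() / tau                          # cosine sims (unit norm)
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))     # exclude self-pairs
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)

# Toy usage: two independently augmented views of the same batch.
encoder = AcousticEncoder()
feats = torch.randn(8, 50, 39)
loss = nt_xent(encoder(mask_augment(feats)), encoder(mask_augment(feats)))
loss.backward()
```

The consistency idea is that the encoder is pushed to give matching representations to differently augmented views of the same utterance, which is the property contrastive pre-training exploits before fine-tuning on ASR.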