Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue
State Tracking
- URL: http://arxiv.org/abs/2312.01842v1
- Date: Mon, 4 Dec 2023 12:25:46 GMT
- Title: Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue
State Tracking
- Authors: Jihyun Lee, Yejin Jeon, Wonjun Lee, Yunsu Kim, Gary Geunbae Lee
- Abstract summary: We develop cascading and end-to-end models, train them with our synthetic audio dataset, and test them on actual human speech data.
Experimental results showed that models trained solely on synthetic datasets can generalize their performance to human voice data.
- Score: 19.754211231250544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dialogue state tracking plays a crucial role in extracting information in
task-oriented dialogue systems. However, preceding research is limited to
textual modalities, primarily due to the shortage of authentic human audio
datasets. We address this by investigating synthetic audio data for audio-based
DST. To this end, we develop cascading and end-to-end models, train them with
our synthetic audio dataset, and test them on actual human speech data. To
facilitate evaluation tailored to audio modalities, we introduce PhonemeF1, a
novel metric that captures pronunciation similarity. Experimental results showed that
models trained solely on synthetic datasets can generalize their performance to
human voice data. By eliminating the dependency on human speech data
collection, these insights pave the way for significant practical advancements
in audio-based DST. Data and code are available at
https://github.com/JihyunLee1/E2E-DST.
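The abstract does not spell out how PhonemeF1 is computed, so the following is only a minimal sketch of one plausible formulation: it assumes grapheme-to-phoneme conversion with the g2p_en package and a token-level F1 over the phoneme sequences of predicted and gold slot values. The function names and this exact formulation are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
# Minimal sketch of a phoneme-level F1 score (assumed formulation, not the
# authors' released implementation; see https://github.com/JihyunLee1/E2E-DST).
from collections import Counter

from g2p_en import G2p  # grapheme-to-phoneme converter for English


_g2p = G2p()


def to_phonemes(text: str) -> list[str]:
    """Convert a slot value to its phoneme sequence, dropping space tokens."""
    return [p for p in _g2p(text) if p.strip()]


def phoneme_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over phoneme sequences, analogous to word-level answer F1."""
    pred, ref = to_phonemes(prediction), to_phonemes(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Example: an ASR-style misspelling ("xero" vs. "zero") still shares most
# phonemes, so a pronunciation-aware score gives partial credit where an
# exact-match or word-level F1 metric would give none.
print(phoneme_f1("xero hotel", "zero hotel"))
```

The intent of such a metric is that a predicted slot value whose spelling differs from the gold value but whose pronunciation is close, as is typical of speech recognition errors, still receives credit proportional to its phonetic overlap.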
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a framework and benchmark for detecting synthetic, AI-generated audio.
It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content from genuine audio.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the text-to-audio (T2A) model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- Pre-training with Synthetic Patterns for Audio [18.769951782213973]
We propose to pre-train audio encoders using synthetic patterns instead of real audio data.
Our framework achieves performance comparable to models pre-trained on AudioSet-2M.
arXiv Detail & Related papers (2024-10-01T08:52:35Z)
- A Framework for Synthetic Audio Conversations Generation using Large Language Models [0.0]
ConversaSynth is a framework designed to generate synthetic conversation audio using large language models (LLMs) with multiple persona settings.
The framework first creates diverse and coherent text-based dialogues across various topics, which are then converted into audio using text-to-speech (TTS) systems.
arXiv Detail & Related papers (2024-09-02T05:09:46Z)
- Dissecting Temporal Understanding in Text-to-Audio Retrieval [22.17493527005141]
We analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval.
In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets.
We present a loss function that encourages text-audio models to focus on the temporal ordering of events.
arXiv Detail & Related papers (2024-09-01T22:01:21Z)
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
- Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method.
Our proposed system achieves significant improvements on bidirectional audio-text retrieval across all metrics, including recall, median rank, and mean rank.
arXiv Detail & Related papers (2022-03-25T13:41:17Z)
- Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement [31.33429812278942]
The proposed end-to-end speech synthesis model uses both speaker embedding and noise representation as conditional inputs to model speaker and noise information respectively.
Experimental results show that the speech generated by the proposed approach obtains better subjective evaluation results than directly fine-tuning a multi-speaker speech synthesis model.
arXiv Detail & Related papers (2020-05-26T06:14:06Z)