Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
- URL: http://arxiv.org/abs/2406.06251v1
- Date: Mon, 10 Jun 2024 13:31:18 GMT
- Title: Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
- Authors: Chung-Ming Chien, Andros Tjandra, Apoorv Vyas, Matt Le, Bowen Shi, Wei-Ning Hsu
- Abstract summary: Voicebox Adapter is a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model.
Our experiment shows that the LoRA with bias-tuning configuration yields the best performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained ones, we explore various efficient fine-tuning approaches. Our experiment shows that the LoRA with bias-tuning configuration yields the best performance, enhancing controllability without compromising speech quality. Across three fine-grained conditional generation tasks, we demonstrate the effectiveness and resource efficiency of Voicebox Adapter. Follow-up experiments further highlight the robustness of Voicebox Adapter across diverse data setups.
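The "LoRA with bias-tuning" configuration named in the abstract can be sketched in a few lines. The following is a minimal, hypothetical illustration of the general recipe (a frozen pre-trained weight, a trainable low-rank update `W + (alpha/r)·B·A`, and biases left trainable), not the authors' actual Voicebox Adapter implementation; all names, shapes, and hyperparameters here are assumptions.

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen pre-trained weight, a trainable
    low-rank LoRA update, and a trainable bias (the 'bias-tuning' part)."""

    def __init__(self, weight, bias, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = weight          # frozen pre-trained weight, shape (out, in)
        self.bias = bias              # trainable bias, shape (out,)
        out_dim, in_dim = weight.shape
        # LoRA factors: A is small random, B starts at zero, so the
        # adapted layer initially reproduces the pre-trained one exactly.
        self.A = rng.normal(0.0, 0.02, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x (W + scale * B A)^T + b
        delta = self.scale * (self.B @ self.A)
        return x @ (self.weight + delta).T + self.bias


# Tiny demo: since B is zero at init, the LoRA branch contributes nothing
# and the output matches the frozen base layer.
W = np.arange(6, dtype=float).reshape(2, 3)
b = np.zeros(2)
layer = LoRALinear(W, b)
x = np.ones((1, 3))
print(np.allclose(layer(x), x @ W.T))  # True
```

During fine-tuning, only `A`, `B`, and `bias` would receive gradients; the frozen `weight` is shared with the pre-trained model, which is what keeps the approach resource-efficient.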
Related papers
- Pre-training Feature Guided Diffusion Model for Speech Enhancement [37.88469730135598]
Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments.
We introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement.
arXiv Detail & Related papers (2024-06-11T18:22:59Z) - Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification [38.20393847192532]
Self-supervised speech models have shown impressive performance on various downstream speech tasks.
However, fine-tuning becomes practically infeasible due to heavy computation and storage overhead.
We propose an effective adapter framework designed for adapting self-supervised speech models to the speaker verification task.
arXiv Detail & Related papers (2024-03-01T05:32:14Z) - DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control.
arXiv Detail & Related papers (2024-01-22T18:10:10Z) - Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z) - Audio Contrastive based Fine-tuning [21.145936249583446]
We introduce Audio Contrastive-based Fine-tuning (AudioConFit) as an efficient approach characterised by robust generalisability.
Empirical experiments on a variety of audio classification tasks demonstrate the effectiveness and robustness of our approach.
arXiv Detail & Related papers (2023-09-21T08:59:13Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We present the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE).
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z) - Representative Subset Selection for Efficient Fine-Tuning in Self-Supervised Speech Recognition [6.450618373898492]
We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR.
We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
arXiv Detail & Related papers (2022-03-18T10:12:24Z) - Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection [25.05285328404576]
Optimizing a speech enhancement model toward a particular test-time speaker can improve performance and reduce run-time complexity.
We propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers.
Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined.
arXiv Detail & Related papers (2021-05-08T00:15:57Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model, trained for automatic speech recognition, with melody-derived features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.