Speech-FT: A Fine-tuning Strategy for Enhancing Speech Representation Models Without Compromising Generalization Ability
- URL: http://arxiv.org/abs/2502.12672v1
- Date: Tue, 18 Feb 2025 09:23:42 GMT
- Title: Speech-FT: A Fine-tuning Strategy for Enhancing Speech Representation Models Without Compromising Generalization Ability
- Authors: Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-yi Lee
- Abstract summary: Speech-FT is a strategy for speech representation models that leverages model merging to preserve generalization ability while still benefiting from fine-tuning. Speech-FT is effective across different fine-tuning scenarios and is compatible with various types of speech representation models.
- Score: 51.56024241398741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech representation models are highly effective at extracting general features for various tasks. While fine-tuning can enhance these representations for specific applications, it often compromises their generalization ability. To address this challenge, we propose Speech-FT, a fine-tuning strategy for speech representation models that leverages model merging to preserve generalization ability while still benefiting from fine-tuning. Speech-FT is effective across different fine-tuning scenarios and is compatible with various types of speech representation models, providing a versatile solution. Speech-FT offers an efficient and practical approach to further improving general speech representations after pre-training.
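The abstract describes Speech-FT as combining fine-tuning with model merging so that the fine-tuned model keeps its general-purpose features. As a rough illustration only, the sketch below interpolates a pre-trained and a fine-tuned checkpoint in weight space; the exact merging rule, mixing coefficient, and checkpoint names used by Speech-FT are not given in this abstract and are assumed here.

```python
# Minimal sketch, assuming the merging step is a simple weight-space
# interpolation between pre-trained and fine-tuned checkpoints.
# File names and the coefficient `alpha` are hypothetical.
import torch


def merge_state_dicts(pretrained: dict, finetuned: dict, alpha: float = 0.5) -> dict:
    """Return (1 - alpha) * pretrained + alpha * finetuned for floating-point weights."""
    merged = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if torch.is_floating_point(w_pre):
            merged[name] = (1.0 - alpha) * w_pre + alpha * w_ft
        else:
            # Integer buffers (e.g. step counters) cannot be interpolated; keep the fine-tuned copy.
            merged[name] = w_ft
    return merged


if __name__ == "__main__":
    pre = torch.load("speech_model_pretrained.pt", map_location="cpu")
    ft = torch.load("speech_model_finetuned.pt", map_location="cpu")
    torch.save(merge_state_dicts(pre, ft, alpha=0.5), "speech_model_merged.pt")
```

In this kind of interpolation, a smaller alpha stays closer to the pre-trained weights and trades task-specific gains for generalization, which is the tension the paper targets.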
Related papers
- Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models [24.659722730219134]
This paper introduces Selective Self-to-Supervised Fine-Tuning (S3FT). S3FT achieves better performance than standard supervised fine-tuning (SFT) while improving generalization. Its effectiveness is demonstrated through experiments on mathematical reasoning, Python programming, and reading comprehension tasks.
arXiv Detail & Related papers (2025-02-12T05:24:21Z)
- EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [152.41217651729738]
GPT-4o is an omni-modal model that enables vocal conversations with diverse emotions and tones.
We propose EMOVA to enable Large Language Models with end-to-end speech capabilities.
For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks.
arXiv Detail & Related papers (2024-09-26T16:44:02Z)
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models [3.1511847280063696]
Speech-enabled foundation models can perform tasks other than automatic speech recognition when given an appropriate prompt.
With the development of audio-prompted large language models, there is the potential for even greater control options.
We demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks.
arXiv Detail & Related papers (2024-07-05T13:04:31Z)
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models [57.71199494492223]
We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner.
Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
arXiv Detail & Related papers (2023-10-02T04:36:39Z)
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities [39.07096632751864]
SpeechGPT is a large language model with intrinsic cross-modal conversational abilities.
We employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.
arXiv Detail & Related papers (2023-05-18T14:23:25Z)
- Robust Speech Recognition via Large-Scale Weak Supervision [69.63329359286419]
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.
When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks.
We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
arXiv Detail & Related papers (2022-12-06T18:46:04Z)
- Self-supervised Rewiring of Pre-trained Speech Encoders: Towards Faster Fine-tuning with Less Labels in Speech Processing [66.92823764664206]
We take a sober look into pre-trained speech encoders and rewire their representation space without requiring task-specific labels.
Our experiments on 6 speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement.
arXiv Detail & Related papers (2022-10-24T08:27:09Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech significantly improves inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition [32.61769580342906]
We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
arXiv Detail & Related papers (2021-09-14T17:58:09Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.