ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer
- URL: http://arxiv.org/abs/2503.20992v1
- Date: Wed, 26 Mar 2025 21:11:17 GMT
- Title: ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer
- Authors: Michael Brown, Sofia Martinez, Priya Singh,
- Abstract summary: We present emphReverBERT, an efficient framework for text-driven speech style transfer.<n>Unlike image domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features.<n>Experiments on benchmark speech corpora demonstrate that emphReverBERT significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-driven speech style transfer aims to mold the intonation, pace, and timbre of a spoken utterance to match stylistic cues from text descriptions. While existing methods leverage large-scale neural architectures or pre-trained language models, the computational costs often remain high. In this paper, we present \emph{ReverBERT}, an efficient framework for text-driven speech style transfer that draws inspiration from a state space model (SSM) paradigm, loosely motivated by the image-based method of Wang and Liu~\cite{wang2024stylemamba}. Unlike image domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features to enable smooth and continuous style modulation. We also propose a novel \emph{Transformer-based SSM} layer for bridging textual style descriptors with acoustic attributes, dramatically reducing inference time while preserving high-quality speech characteristics. Extensive experiments on benchmark speech corpora demonstrate that \emph{ReverBERT} significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency. We release our model and code publicly to foster further research in text-driven speech style transfer.
Related papers
- Towards Visual Text Design Transfer Across Languages [49.78504488452978]
We introduce a novel task of Multimodal Style Translation (MuST-Bench)
MuST-Bench is a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems.
In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions.
arXiv Detail & Related papers (2024-10-24T15:15:01Z) - StyleMamba : State Space Model for Efficient Text-driven Image Style Transfer [9.010012117838725]
StyleMamba is an efficient image style transfer framework that translates text prompts into corresponding visual styles.
Existing text-guided stylization requires hundreds of training iterations and takes a lot of computing resources.
arXiv Detail & Related papers (2024-05-08T12:57:53Z) - StoryTrans: Non-Parallel Story Author-Style Transfer with Discourse
Representations and Content Enhancing [73.81778485157234]
Long texts usually involve more complicated author linguistic preferences such as discourse structures than sentences.
We formulate the task of non-parallel story author-style transfer, which requires transferring an input story into a specified author style.
We use an additional training objective to disentangle stylistic features from the learned discourse representation to prevent the model from degenerating to an auto-encoder.
arXiv Detail & Related papers (2022-08-29T08:47:49Z) - Text-driven Emotional Style Control and Cross-speaker Style Transfer in
Neural TTS [7.384726530165295]
Style control of synthetic speech is often restricted to discrete emotion categories.
We propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS.
arXiv Detail & Related papers (2022-07-13T07:05:44Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain
Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z) - Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training time, the multi-scale style model could be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z) - STYLER: Style Modeling with Rapidity and Robustness via
SpeechDecomposition for Expressive and Controllable Neural Text to Speech [2.622482339911829]
STYLER is a novel expressive text-to-speech model with parallelized architecture.
Our novel noise modeling approach from audio using domain adversarial training and Residual Decoding enabled style transfer without transferring noise.
arXiv Detail & Related papers (2021-03-17T07:11:09Z) - GTAE: Graph-Transformer based Auto-Encoders for Linguistic-Constrained
Text Style Transfer [119.70961704127157]
Non-parallel text style transfer has attracted increasing research interests in recent years.
Current approaches still lack the ability to preserve the content and even logic of original sentences.
We propose a method called Graph Transformer based Auto-GTAE, which models a sentence as a linguistic graph and performs feature extraction and style transfer at the graph level.
arXiv Detail & Related papers (2021-02-01T11:08:45Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.