Audio Texture Manipulation by Exemplar-Based Analogy
- URL: http://arxiv.org/abs/2501.12385v1
- Date: Tue, 21 Jan 2025 18:58:38 GMT
- Title: Audio Texture Manipulation by Exemplar-Based Analogy
- Authors: Kan Jen Cheng, Tingle Li, Gopala Anumanchipalli,
- Abstract summary: We propose an exemplar-based analogy model for audio texture manipulation.
Instead of conditioning on text-based instructions, our method uses paired speech examples.
We show through evaluations and perceptual studies that our model outperforms text-conditioned baselines.
- Score: 3.7209456282942734
- License:
- Abstract: Audio texture manipulation involves modifying the perceptual characteristics of a sound to achieve specific transformations, such as adding, removing, or replacing auditory elements. In this paper, we propose an exemplar-based analogy model for audio texture manipulation. Instead of conditioning on text-based instructions, our method uses paired speech examples, where one clip represents the original sound and another illustrates the desired transformation. The model learns to apply the same transformation to new input, allowing for the manipulation of sound textures. We construct a quadruplet dataset representing various editing tasks, and train a latent diffusion model in a self-supervised manner. We show through quantitative evaluations and perceptual studies that our model outperforms text-conditioned baselines and generalizes to real-world, out-of-distribution, and non-speech scenarios. Project page: https://berkeley-speech-group.github.io/audio-texture-analogy/
Related papers
- TokenSplit: Using Discrete Speech Representations for Direct, Refined,
and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent performance in terms of separation, both with or without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z) - Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained
Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled
Videos [44.14061539284888]
We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting.
arXiv Detail & Related papers (2022-12-14T07:21:45Z) - Learning Visual Styles from Audio-Visual Associations [21.022027778790978]
We present a method for learning visual styles from unlabeled audio-visual data.
Our model learns to manipulate the texture of a scene to match a sound.
We show that audio can be an intuitive representation for manipulating images.
arXiv Detail & Related papers (2022-05-10T17:57:07Z) - Sound-Guided Semantic Image Manipulation [19.01823634838526]
We propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space.
Our method can mix different modalities, i.e., text and audio, which enrich the variety of the image modification.
The experiments on zero-shot audio classification and semantic-level image classification show that our proposed model outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2021-11-30T13:30:12Z) - Using multiple reference audios and style embedding constraints for
speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - What all do audio transformer models hear? Probing Acoustic
Representations for Language Delivery and its Structure [64.54208910952651]
We compare audio transformer models Mockingjay and wave2vec2.0.
We probe the audio models' understanding of textual surface, syntax, and semantic features.
We do this over exhaustive settings for native, non-native, synthetic, read and spontaneous speech datasets.
arXiv Detail & Related papers (2021-01-02T06:29:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.