MTCRNN: A multi-scale RNN for directed audio texture synthesis
- URL: http://arxiv.org/abs/2011.12596v1
- Date: Wed, 25 Nov 2020 09:13:53 GMT
- Title: MTCRNN: A multi-scale RNN for directed audio texture synthesis
- Authors: M. Huzaifah, L. Wyse
- Abstract summary: We introduce a novel modelling approach for textures, combining recurrent neural networks trained at different levels of abstraction with a conditioning strategy that allows for user-directed synthesis.
We demonstrate the model's performance on a variety of datasets, evaluate it against various metrics, and discuss some potential applications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Audio textures are a subset of environmental sounds, often defined as having
stable statistical characteristics within an adequately large window of time
while possibly being unstructured locally. They include common everyday sounds
such as rain, wind, and engines. Given that these complex sounds contain patterns
on multiple timescales, they are a challenge to model with traditional methods.
We introduce a novel modelling approach for textures, combining recurrent
neural networks trained at different levels of abstraction with a conditioning
strategy that allows for user-directed synthesis. We demonstrate the model's
performance on a variety of datasets, evaluate it against various
metrics, and discuss some potential applications.
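To make the multi-tier idea concrete, here is a minimal sketch of the pattern the abstract describes: a coarse RNN maps a low-rate, user-controllable conditioning sequence to intermediate features, which are upsampled and fed to a fine, sample-rate RNN. The tier count, GRU cells, mu-law class outputs, and 80x upsampling factor are illustrative assumptions, not the paper's exact MTCRNN configuration.

```python
# Minimal two-tier sketch of a conditional multi-scale RNN for texture
# synthesis. Tier sizes, GRU cells, mu-law class outputs, and the 80x
# upsampling factor are illustrative assumptions, not MTCRNN's settings.
import torch
import torch.nn as nn

class CoarseTier(nn.Module):
    """Low-rate RNN over a user-controllable conditioning sequence."""
    def __init__(self, cond_dim=4, hidden=128, feat_dim=16):
        super().__init__()
        self.rnn = nn.GRU(cond_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, cond):                    # (B, T_coarse, cond_dim)
        h, _ = self.rnn(cond)
        return self.proj(h)                     # (B, T_coarse, feat_dim)

class FineTier(nn.Module):
    """Sample-rate RNN conditioned on upsampled coarse features."""
    def __init__(self, feat_dim=16, hidden=256, n_classes=256):
        super().__init__()
        self.rnn = nn.GRU(1 + feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)  # e.g. mu-law class logits

    def forward(self, prev, feats):             # (B, T, 1), (B, T, feat_dim)
        h, _ = self.rnn(torch.cat([prev, feats], dim=-1))
        return self.out(h)                      # (B, T, n_classes)

B, T_coarse, up = 2, 10, 80                     # one coarse frame -> 80 samples
coarse, fine = CoarseTier(), FineTier()
cond = torch.rand(B, T_coarse, 4)               # user-directed control sequence
feats = coarse(cond).repeat_interleave(up, dim=1)
prev = torch.zeros(B, T_coarse * up, 1)         # teacher-forced previous samples
print(fine(prev, feats).shape)                  # torch.Size([2, 800, 256])
```

At synthesis time the fine tier would run autoregressively, feeding each sampled output back in as the next `prev` value; the teacher-forced shapes above are training-style.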
Related papers
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality.
Experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines.
arXiv Detail & Related papers (2023-05-31T20:09:50Z)
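The iterative, weight-shared design summarized above reuses one block per modality across refinement steps, so effective depth grows without adding parameters. A minimal sketch of that pattern follows; the plain MLP block is a stand-in for the paper's A-FRCNN block, and all sizes are illustrative assumptions.

```python
# Minimal iterative refinement with weight sharing: one block per
# modality is applied repeatedly, so extra iterations add no parameters.
# The MLP block is a stand-in for the paper's A-FRCNN block.
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x, estimate):
        return self.net(x + estimate)        # refine given the last estimate

audio_block, video_block = SharedBlock(), SharedBlock()
a_in, v_in = torch.randn(2, 64), torch.randn(2, 64)
a, v = torch.zeros_like(a_in), torch.zeros_like(v_in)
for _ in range(4):                           # 4 iterations, same weights
    a = audio_block(a_in, a)                 # audio branch refinement
    v = video_block(v_in, v)                 # video branch refinement
print(a.shape, v.shape)                      # torch.Size([2, 64]) x2
```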
- Learning in a Single Domain for Non-Stationary Multi-Texture Synthesis [9.213030142986417]
Non-stationary textures exhibit large-scale variation and can hardly be synthesized by a single model.
We propose a multi-scale generator to capture structural patterns at various scales and effectively synthesize textures at minor cost.
We present a category-specific training strategy to focus on learning the texture patterns of a specific domain.
arXiv Detail & Related papers (2023-05-10T14:32:21Z)
- Rigid-Body Sound Synthesis with Differentiable Modal Resonators [6.680437329908454]
We present a novel end-to-end framework for training a deep neural network to generate modal resonators for a given 2D shape and material.
We demonstrate our method on a dataset of synthetic objects, but train our model using an audio-domain objective.
arXiv Detail & Related papers (2022-10-27T10:34:38Z)
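A modal resonator, as in the entry above, models a struck object as a bank of exponentially decaying sinusoids; a network then predicts the per-mode gains, frequencies, and decay rates. The sketch below shows only the resonator itself, with hand-picked modal parameters that are illustrative assumptions rather than values from the paper.

```python
# Minimal modal resonator: a sum of exponentially damped sinusoids.
# The modal frequencies, decay rates, and gains below are illustrative;
# in the paper a network predicts them from shape and material.
import numpy as np

def modal_impulse_response(freqs_hz, decays, gains, sr=16000, dur=1.0):
    t = np.arange(int(sr * dur)) / sr
    y = np.zeros_like(t)
    for f, d, g in zip(freqs_hz, decays, gains):
        # each mode: g * exp(-d * t) * sin(2*pi*f*t)
        y += g * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    return y / np.max(np.abs(y))             # normalize to [-1, 1]

# three hypothetical modes of a small struck plate
ir = modal_impulse_response(freqs_hz=[440.0, 1130.0, 2250.0],
                            decays=[6.0, 9.0, 14.0],
                            gains=[1.0, 0.6, 0.3])
print(ir.shape)                               # (16000,)
```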
- Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS [0.0]
We propose a new text-to-speech system based on deep convolutional neural networks that does not employ any RNN components (recurrent units).
At the same time, we improve the generality and robustness of our model through a series of data augmentation methods such as Time Warping, Frequency Mask, and Time Mask.
The final experimental results show that a TTS model using only CNN components can reduce training time compared to classic TTS models such as Tacotron.
arXiv Detail & Related papers (2022-10-24T14:18:43Z)
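The augmentations named above (time warping, frequency masking, time masking) follow the SpecAugment family. Below is a minimal sketch of the two masking operations on a mel spectrogram; the mask widths are arbitrary assumptions, not the paper's settings.

```python
# Minimal SpecAugment-style masking on a (mel_bins, frames) spectrogram.
# Mask sizes are illustrative assumptions, not the paper's settings.
import numpy as np

rng = np.random.default_rng()

def freq_mask(spec, max_bins=8):
    f = rng.integers(0, max_bins + 1)         # mask height in mel bins
    f0 = rng.integers(0, spec.shape[0] - f + 1)
    out = spec.copy()
    out[f0:f0 + f, :] = 0.0                   # zero a horizontal band
    return out

def time_mask(spec, max_frames=20):
    t = rng.integers(0, max_frames + 1)       # mask width in frames
    t0 = rng.integers(0, spec.shape[1] - t + 1)
    out = spec.copy()
    out[:, t0:t0 + t] = 0.0                   # zero a vertical band
    return out

spec = np.random.rand(80, 400)                # fake 80-mel, 400-frame input
aug = time_mask(freq_mask(spec))
print(aug.shape)                              # (80, 400)
```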
- Adversarial Audio Synthesis with Complex-valued Polynomial Networks [60.231877895663956]
Time-frequency (TF) representations in audio have been increasingly modeled with real-valued networks.
We introduce complex-valued networks, called APOLLO, that integrate such complex-valued representations in a natural way.
APOLLO results in a 17.5% improvement over adversarial methods and 8.2% over state-of-the-art diffusion models on SC09 audio generation.
arXiv Detail & Related papers (2022-06-14T12:58:59Z)
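Complex-valued modeling, as in APOLLO above, keeps the phase of a TF representation rather than discarding it. A minimal sketch of a complex-valued linear layer follows, using the standard split into real and imaginary weight matrices; the layer sizes are illustrative assumptions.

```python
# Minimal complex-valued linear layer: (Wr + i*Wi)(xr + i*xi)
# = (Wr@xr - Wi@xi) + i*(Wr@xi + Wi@xr). Sizes are illustrative.
import torch
import torch.nn as nn

class ComplexLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.re = nn.Linear(d_in, d_out, bias=False)  # Wr
        self.im = nn.Linear(d_in, d_out, bias=False)  # Wi

    def forward(self, xr, xi):
        yr = self.re(xr) - self.im(xi)       # real part
        yi = self.re(xi) + self.im(xr)       # imaginary part
        return yr, yi

# complex STFT frame with 257 bins -> 128-dim complex features
layer = ComplexLinear(257, 128)
xr, xi = torch.randn(4, 257), torch.randn(4, 257)
yr, yi = layer(xr, xi)
print(yr.shape, yi.shape)                    # torch.Size([4, 128]) x2
```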
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting solely of neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic alignments between input and output sequences.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
arXiv Detail & Related papers (2022-03-21T15:14:44Z)
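Duration-based parallel TTS expands each input token's encoding by its (predicted) number of output frames, which is what the monotonic alignment above amounts to at inference time. The sketch below shows the hard upsampling step with made-up durations; the paper's contribution is a differentiable relaxation of this alignment, which the sketch does not reproduce.

```python
# Hard duration-based upsampling: repeat each token encoding for its
# predicted number of frames. Durations here are made up; the paper
# learns a differentiable relaxation of this step.
import torch

tokens = torch.randn(5, 16)                  # 5 phoneme encodings, dim 16
durations = torch.tensor([3, 1, 4, 2, 5])    # frames per phoneme (assumed)
frames = torch.repeat_interleave(tokens, durations, dim=0)
print(frames.shape)                          # torch.Size([15, 16])
```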
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of lower inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
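Token-level serialized output training, as summarized above, flattens overlapping speakers into a single token stream ordered by emission time, with a special channel-change token marking speaker switches. A toy serialization is sketched below; the `<cc>` token name and the word timings are illustrative assumptions.

```python
# Toy t-SOT-style serialization: merge two speakers' word streams by
# start time and insert a channel-change token <cc> at each switch.
# Timings and the token name are illustrative assumptions.
words = [("hello", 0.0, "A"), ("how", 0.4, "B"), ("there", 0.5, "A"),
         ("are", 0.8, "B"), ("you", 1.1, "B")]

stream, prev = [], None
for w, t, spk in sorted(words, key=lambda x: x[1]):
    if prev is not None and spk != prev:
        stream.append("<cc>")                # speaker channel changed
    stream.append(w)
    prev = spk

print(" ".join(stream))
# hello <cc> how <cc> there <cc> are you
```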
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
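The bottleneck idea above restricts cross-modal attention to a small set of shared tokens: each modality attends over its own tokens plus the bottlenecks, and information crosses modalities only through them. A minimal single-layer sketch follows; the layer sizes and number of bottleneck tokens are illustrative assumptions.

```python
# Minimal attention-bottleneck fusion: each modality self-attends over
# its own tokens plus shared bottleneck tokens; bottleneck updates from
# the two modalities are averaged. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

d, n_btl = 64, 4
attn_a = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
attn_v = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

audio = torch.randn(2, 50, d)                # 50 audio tokens
video = torch.randn(2, 30, d)                # 30 video tokens
btl = torch.randn(2, n_btl, d)               # shared bottleneck tokens

xa = torch.cat([audio, btl], dim=1)
xv = torch.cat([video, btl], dim=1)
ya, _ = attn_a(xa, xa, xa)                   # audio + bottleneck attention
yv, _ = attn_v(xv, xv, xv)                   # video + bottleneck attention

audio, btl_a = ya[:, :50], ya[:, 50:]
video, btl_v = yv[:, :30], yv[:, 30:]
btl = 0.5 * (btl_a + btl_v)                  # merge bottleneck updates
print(audio.shape, video.shape, btl.shape)
```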
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.