Cross-modal variational inference for bijective signal-symbol
translation
- URL: http://arxiv.org/abs/2002.03862v1
- Date: Mon, 10 Feb 2020 15:25:48 GMT
- Title: Cross-modal variational inference for bijective signal-symbol
translation
- Authors: Axel Chemla--Romeu-Santos, Stavros Ntalampiras, Philippe Esling,
Goffredo Haus, Gérard Assayag
- Abstract summary: In this paper, we propose an approach for signal/symbol translation by turning this problem into a density estimation task.
We estimate this joint distribution with two different variational auto-encoders, one for each domain, whose inner representations are forced to match with an additive constraint.
In this article, we test our models on pitch, octave and dynamics symbols, which comprise a fundamental step towards music transcription and label-constrained audio generation.
- Score: 11.444576186559486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extraction of symbolic information from signals is an active field of research enabling numerous applications, especially in the Music Information Retrieval domain. This complex task, which is also related to topics such as pitch extraction and instrument recognition, has given birth to numerous approaches, mostly based on advanced signal-processing algorithms. However, these techniques are often non-generic: they can extract specific physical properties of the signal (pitch, octave), but do not allow arbitrary vocabularies or more general annotations. On top of that, they are one-sided, meaning that they can extract symbolic data from an audio signal but cannot perform the reverse process of symbol-to-signal generation. In this paper, we propose a bijective approach for signal/symbol translation by turning this problem into a density estimation task over the signal and symbolic domains, both considered as related random variables. We estimate this joint distribution with two different variational auto-encoders, one for each domain, whose inner representations are forced to match with an additive constraint; this lets both models learn and generate separately while enabling signal-to-symbol and symbol-to-signal inference. In this article, we test our models on pitch, octave and dynamics symbols, which comprise a fundamental step towards music transcription and label-constrained audio generation. In addition to its versatility, this system is rather light during training and generation, and it allows several interesting creative uses that we outline at the end of the article.
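The core idea lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendition (not the authors' released code) of the scheme: one VAE per domain, trained jointly with an additive penalty pulling the two latent posteriors together; the layer sizes, the 16-D latent, and the L2 form of the matching term are illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch implementation, NOT the authors' code) of two
# domain VAEs whose latent codes are tied by an additive matching penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim, latent_dim, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(z), mu, logvar

def elbo(x, x_hat, mu, logvar):
    rec = F.mse_loss(x_hat, x, reduction="sum")                    # reconstruction term
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    return rec + kld

# Hypothetical sizes: 512-bin spectral frames, 140-dim one-hot symbol vectors.
signal_vae, symbol_vae = VAE(512, 16), VAE(140, 16)

def joint_loss(x_sig, x_sym, gamma=1.0):
    sig_hat, mu_s, lv_s = signal_vae(x_sig)
    sym_hat, mu_y, lv_y = symbol_vae(x_sym)
    match = F.mse_loss(mu_s, mu_y, reduction="sum")  # additive latent-matching constraint
    return (elbo(x_sig, sig_hat, mu_s, lv_s)
            + elbo(x_sym, sym_hat, mu_y, lv_y)
            + gamma * match)

# Once the latents coincide, transcription is encode-with-signal / decode-with-symbol,
# and label-constrained generation is the reverse path -- hence the bijective framing.
```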
Related papers
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
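As a rough illustration only (the paper's actual architecture is not specified in this summary), one common way to set up such a disentanglement objective is to split an audio encoding into a transcription-relevant part trained with an ASR-style loss and a residual part that only participates in reconstruction; all names and shapes below are assumptions.

```python
# Hypothetical sketch in the spirit of joint ASR/TTS disentanglement; not the paper's
# model. The latent is split into a transcription part (fed to a CTC-style ASR head)
# and a residual part that is only used to reconstruct the input features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitEncoder(nn.Module):
    def __init__(self, n_mels=80, d=256, vocab=32):
        super().__init__()
        self.rnn = nn.GRU(n_mels, d, batch_first=True)
        self.asr_head = nn.Linear(d // 2, vocab)   # reads only the "transcript" half
        self.dec = nn.Linear(d, n_mels)            # reconstructs from both halves

    def forward(self, mels):                       # mels: (B, T, n_mels)
        h, _ = self.rnn(mels)
        z_txt, z_res = h.chunk(2, dim=-1)          # transcription vs. residual factors
        logits = self.asr_head(z_txt).log_softmax(-1)  # CTC expects log-probs
        recon = self.dec(torch.cat([z_txt, z_res], -1))
        return logits, recon

def loss(model, mels, targets, in_lens, tgt_lens):
    logits, recon = model(mels)
    ctc = F.ctc_loss(logits.transpose(0, 1), targets, in_lens, tgt_lens)
    return ctc + F.mse_loss(recon, mels)           # ASR term + reconstruction term
```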
- Learning and controlling the source-filter representation of speech with a variational autoencoder [23.05989605017053]
In speech processing, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors.
We propose a method to accurately and independently control the source-filter speech factors within the latent subspaces.
This yields a deep generative model of speech spectrograms without requiring additional information such as text or human-labeled data.
arXiv Detail & Related papers (2022-04-14T16:13:06Z)
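A generic way to obtain such control (a simplification of latent-subspace control, not the paper's exact procedure) is sketched below under the assumption of an already-trained encoder/decoder pair: estimate a latent direction for one factor from labeled examples, then shift codes along it.

```python
# Generic latent-direction control sketch. Given an encoder/decoder and two groups of
# examples differing mainly in one factor (say, pitch), shift codes along the factor
# direction to edit that factor alone. Illustrative only; not the paper's method.
import numpy as np

def factor_direction(z_high, z_low):
    """Unit direction in latent space from examples with high vs. low factor value."""
    d = z_high.mean(axis=0) - z_low.mean(axis=0)
    return d / np.linalg.norm(d)

def edit_factor(z, direction, amount):
    """Move a latent code along the factor direction, leaving the rest untouched."""
    return z + amount * direction

# Usage (encode/decode are stand-ins for a trained VAE):
#   d_f0 = factor_direction(encode(high_pitch_frames), encode(low_pitch_frames))
#   edited = decode(edit_factor(encode(frame), d_f0, amount=1.5))
```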
- Meta-Learning Sparse Implicit Neural Representations [69.15490627853629]
Implicit neural representations are a promising new avenue for representing general signals.
However, current approaches are difficult to scale to a large number of signals or to large datasets.
We show that meta-learned sparse neural representations achieve a much smaller loss than dense meta-learned models.
arXiv Detail & Related papers (2021-10-27T18:02:53Z)
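For readers unfamiliar with the term, an implicit neural representation stores a signal as the weights of a network mapping coordinates to values. Below is a minimal generic example (the paper's meta-learning and sparsification machinery is deliberately omitted): fitting a small MLP to a 1-D signal.

```python
# Minimal implicit neural representation: an MLP f(t) -> x(t) overfit to one signal,
# so the network weights become the stored representation. Generic idea only.
import torch
import torch.nn as nn

t = torch.linspace(0, 1, 1000).unsqueeze(-1)   # coordinates in [0, 1]
x = torch.sin(2 * torch.pi * 8 * t)            # toy signal to be represented

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = ((net(t) - x) ** 2).mean()          # fit the signal pointwise
    loss.backward()
    opt.step()

# Query the representation at any resolution, e.g. 4x denser than the training grid.
x_hat = net(torch.linspace(0, 1, 4000).unsqueeze(-1))
```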
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
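The contrastive ingredient here is the standard InfoNCE objective over audio-visual pairs drawn from the same video; a hedged sketch follows (the iterative pseudo-labeling loop is omitted, and all shapes are assumptions).

```python
# Generic InfoNCE loss for audio-visual correspondence: clips from the same video are
# positives, all other pairs in the batch are negatives. The paper's iterative
# pseudo-label refinement is not reproduced here.
import torch
import torch.nn.functional as F

def infonce(audio_emb, video_emb, tau=0.07):
    """audio_emb, video_emb: (B, D) embeddings of clips from the same B videos."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / tau                 # (B, B) similarity matrix
    labels = torch.arange(a.size(0))         # matching pairs lie on the diagonal
    # Symmetric loss: match audio -> video and video -> audio.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```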
- A Signal-Centric Perspective on the Evolution of Symbolic Communication [4.447467536572625]
We show how organisms can evolve to define a shared set of symbols with unique interpretable meaning.
We characterize signal decoding as either regression or classification, with limited and unlimited signal amplitude.
In various settings, we observe agents evolving to share a dictionary of symbols, with each symbol spontaneously associated with a unique 1-D signal.
arXiv Detail & Related papers (2021-03-31T08:05:01Z)
- Discriminative Singular Spectrum Classifier with Applications on Bioacoustic Signal Recognition [67.4171845020675]
We present a bioacoustic signal classifier equipped with a discriminative mechanism to extract useful features for analysis and classification efficiently.
Unlike current bioacoustic recognition methods, which are task-oriented, the proposed model relies on transforming the input signals into vector subspaces.
The validity of the proposed method is verified using three challenging bioacoustic datasets containing anuran, bee, and mosquito species.
arXiv Detail & Related papers (2021-03-18T11:01:21Z)
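The subspace transform at the heart of singular spectrum analysis is easy to reproduce generically; a sketch is given below (illustrative of SSA itself, not the paper's full discriminative classifier).

```python
# Singular spectrum analysis in brief: embed a 1-D signal into a Hankel (trajectory)
# matrix and keep its leading left singular vectors as a subspace descriptor.
# Generic transform only; the discriminative mechanism around it is not shown.
import numpy as np

def ssa_subspace(x, window=64, rank=8):
    """Return an orthonormal basis (window x rank) spanning the dominant dynamics."""
    n = len(x) - window + 1
    traj = np.stack([x[i:i + window] for i in range(n)], axis=1)  # trajectory matrix
    u, s, _ = np.linalg.svd(traj, full_matrices=False)
    return u[:, :rank]

def subspace_distance(basis_a, basis_b):
    """Chordal distance between two subspaces, usable as a classification metric."""
    sv = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.sqrt(np.maximum(0.0, basis_a.shape[1] - np.sum(sv ** 2)))
```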
- Variable-rate discrete representation learning [20.81400194698063]
We propose slow autoencoders for unsupervised learning of high-level variable-rate discrete representations of sequences.
We show that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals.
We develop run-length Transformers for event-based representation modelling and use them to construct language models in the speech domain.
arXiv Detail & Related papers (2021-03-10T14:42:31Z)
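The "event-based" view amounts to run-length coding of a frame-wise discrete code sequence: each code is kept once with its duration, so stretches of low salience collapse. A minimal illustration follows (generic; the paper's slow autoencoder and run-length Transformer are not reproduced).

```python
# Run-length view of a frame-wise discrete code sequence: store (code, duration)
# events so the representation grows only where the signal changes.
from itertools import groupby

def to_events(codes):
    """[3,3,3,7,7,3] -> [(3,3),(7,2),(3,1)] : variable-rate event representation."""
    return [(c, sum(1 for _ in run)) for c, run in groupby(codes)]

def to_frames(events):
    """Inverse mapping back to the fixed-rate frame sequence."""
    return [c for c, n in events for _ in range(n)]

codes = [3, 3, 3, 7, 7, 3]
assert to_frames(to_events(codes)) == codes   # lossless round trip
```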
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many, location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
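The discrete-latent mechanism such models rely on is vector quantization: each encoder output is snapped to its nearest codebook entry. A generic sketch of the standard VQ-VAE-style quantization step is given below (the paper's loudness disentanglement and decoder are not shown; codebook size and dimensions are assumptions).

```python
# Nearest-codebook quantization as used in VQ-VAE-style models: snap each latent
# vector to its closest codebook entry, with a straight-through gradient estimator.
import torch

def vector_quantize(z, codebook):
    """z: (B, D) latents; codebook: (K, D). Returns quantized latents and indices."""
    dists = torch.cdist(z, codebook)     # (B, K) pairwise distances
    idx = dists.argmin(dim=-1)           # index of the nearest code per latent
    z_q = codebook[idx]                  # quantized vectors
    z_q = z + (z_q - z).detach()         # straight-through: gradients flow to z
    return z_q, idx

codebook = torch.randn(64, 16)           # hypothetical 64-entry, 16-D codebook
z = torch.randn(8, 16)
z_q, idx = vector_quantize(z, codebook)
```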