A Language Model With Million Sample Context For Raw Audio Using
Transformer Architectures
- URL: http://arxiv.org/abs/2206.08297v2
- Date: Tue, 16 May 2023 20:50:56 GMT
- Title: A Language Model With Million Sample Context For Raw Audio Using
Transformer Architectures
- Authors: Prateek Verma
- Abstract summary: We propose a generative auto-regressive architecture that can model audio waveforms over a large context.
Our work learns time dependencies by first learning a latent representation with a CNN front-end and then modeling dependencies over these representations using Transformer encoders.
We achieve state-of-the-art performance compared to other approaches such as WaveNet, SaShiMi, and SampleRNN.
- Score: 2.8935588665357077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling long-term dependencies for audio signals is a particularly
challenging problem, as even small time scales yield on the order of a hundred
thousand samples. With the recent advent of Transformers, neural architectures
became good at modeling dependencies over longer time scales, but the quadratic
cost of attention limited how far they could scale. We propose a generative
auto-regressive architecture that can model audio waveforms over quite a large
context, greater than 500,000 samples. Our work learns time dependencies by
first learning a latent representation with a CNN front-end and then modeling
dependencies over these representations using Transformer encoders, trained
fully end-to-end, thereby allowing the model to learn whatever representations
it deems fit for predicting the next sample. Unlike previous works that compared
different time scales to show improvement, we use a standard dataset, with the
same number of parameters and the same context, to show improvements. We achieve
state-of-the-art performance compared to other approaches such as WaveNet,
SaShiMi, and SampleRNN on a standard dataset for modeling long-term structure.
This work points to an exciting direction for the field, given improvements in
context modeling that can be scaled with more data, as well as potentially
better results from using billions or trillions of parameters.
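A minimal PyTorch sketch of the design the abstract describes: a strided CNN front-end compresses raw waveform samples into latent frames, a causally masked Transformer encoder models dependencies over those frames, and a linear head outputs a distribution over the next quantized sample. The layer sizes, strides, and 256-way (8-bit) output below are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class RawAudioLM(nn.Module):
    """Illustrative CNN front-end + Transformer encoder over raw audio."""
    def __init__(self, latent_dim=512, n_layers=6, n_heads=8):
        super().__init__()
        # CNN front-end: strided 1-D convolutions, ~256x downsampling overall.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=16, stride=4, padding=6), nn.GELU(),
            nn.Conv1d(128, 256, kernel_size=16, stride=4, padding=6), nn.GELU(),
            nn.Conv1d(256, latent_dim, kernel_size=16, stride=16),
        )
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads,
                                           dim_feedforward=4 * latent_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(latent_dim, 256)    # logits over 256 mu-law levels

    def forward(self, wav):                       # wav: (batch, n_samples)
        z = self.frontend(wav.unsqueeze(1))       # (batch, latent_dim, n_frames)
        z = z.transpose(1, 2)                     # (batch, n_frames, latent_dim)
        n = z.size(1)                             # causal mask: attend to past only
        mask = torch.triu(torch.full((n, n), float("-inf"), device=z.device), 1)
        h = self.encoder(z, mask=mask)
        return self.head(h)                       # per-frame next-step logits

# Small usage example: 65,536 samples -> 256 latent frames -> next-step logits.
# The paper's point is that this kind of model scales to >500,000-sample contexts.
logits = RawAudioLM()(torch.randn(1, 65536))
```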
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network).
After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
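As a rough, hedged illustration of the idea (not the paper's implementation), a small value network can predict per-token logit adjustments that are added to a frozen pre-trained model's logits at inference time; the class and function names below are hypothetical.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Hypothetical small network mapping token ids to logit-level deltas."""
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids):                  # (batch, seq)
        return self.proj(self.embed(token_ids))    # (batch, seq, vocab)

@torch.no_grad()
def combined_logits(base_model, value_net, token_ids):
    # The value network, trained against a small base model, is reused here
    # with a (possibly larger) frozen pre-trained model at inference time.
    return base_model(token_ids) + value_net(token_ids)
```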
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
- Exploring the design space of deep-learning-based weather forecasting systems [56.129148006412855]
This paper systematically analyzes the impact of different design choices on deep-learning-based weather forecasting systems.
We study fixed-grid architectures such as UNet, fully convolutional architectures, and transformer-based models.
We propose a hybrid system that combines the strong performance of fixed-grid models with the flexibility of grid-invariant architectures.
arXiv Detail & Related papers (2024-10-09T22:25:50Z)
- Timer: Generative Pre-trained Transformers Are Large Time Series Models [83.03091523806668]
This paper aims at the early development of large time series models (LTSM).
During pre-training, we curate large-scale datasets with up to 1 billion time points.
To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task.
arXiv Detail & Related papers (2024-02-04T06:55:55Z)
- Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experimental results show that the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
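A minimal sketch of a conditional flow-matching training step with masked conditioning, in the spirit of the summary above; the feature shapes, mask ratio, and model interface are assumptions rather than the SpeechFlow recipe.

```python
import torch

def flow_matching_step(model, x1, mask_ratio=0.7):
    """x1: (batch, frames, dim) clean speech features, e.g. mel frames."""
    x0 = torch.randn_like(x1)                        # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                       # linear interpolation path
    target_velocity = x1 - x0                        # d(xt)/dt along that path
    # Masked conditioning: the model sees the clean features with most frames hidden.
    keep = (torch.rand(x1.shape[:2], device=x1.device) > mask_ratio).float()
    cond = x1 * keep.unsqueeze(-1)
    pred = model(xt, t.view(-1), cond)               # assumed model signature
    return ((pred - target_velocity) ** 2).mean()    # regress onto the velocity
```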
arXiv Detail & Related papers (2023-10-25T03:40:50Z)
- A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies [0.0]
We compare existing solutions to long-sequence modeling in terms of their pure mathematical formulation.
We then demonstrate that long context length does yield better performance, albeit application-dependent.
Inspired by emerging sparse models of huge capacity, we propose a machine learning system for handling million-scale dependencies.
arXiv Detail & Related papers (2023-02-13T09:47:31Z)
- Generative time series models using Neural ODE in Variational Autoencoders [0.0]
We implement Neural Ordinary Differential Equations in a Variational Autoencoder setting for generative time series modeling.
An object-oriented approach to the code was taken to allow for easier development and research.
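A compact sketch (not the paper's code) of the core idea: a VAE encoder yields an initial latent state, a learned vector field is integrated over time, and each latent state along the trajectory is decoded into an observation. Plain Euler integration keeps the sketch self-contained; a real implementation would typically use an adaptive ODE solver.

```python
import torch
import torch.nn as nn

class LatentODEDecoder(nn.Module):
    """Decodes a VAE latent z0 into a time series by integrating a learned ODE."""
    def __init__(self, latent_dim=8, obs_dim=1, hidden=64):
        super().__init__()
        self.field = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, latent_dim))
        self.readout = nn.Linear(latent_dim, obs_dim)

    def forward(self, z0, n_steps, dt=0.1):
        z, outputs = z0, []
        for _ in range(n_steps):                 # Euler steps: dz/dt = field(z)
            z = z + dt * self.field(z)
            outputs.append(self.readout(z))
        return torch.stack(outputs, dim=1)       # (batch, n_steps, obs_dim)

# z0 would come from the VAE encoder as mu + sigma * eps (reparameterization).
series = LatentODEDecoder()(torch.randn(4, 8), n_steps=50)
```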
arXiv Detail & Related papers (2022-01-12T14:38:11Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Long-Span Dependencies in Transformer-based Summarization Systems [38.672160430296536]
Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization.
One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows.
In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization.
arXiv Detail & Related papers (2021-05-08T23:53:03Z)
- Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions [6.370905925442655]
We propose applying Transformer-based architectures without convolutional layers to raw audio signals.
Our model outperforms convolutional models, producing state-of-the-art results.
We further improve the performance of Transformer architectures by using techniques such as pooling inspired by convolutional networks.
arXiv Detail & Related papers (2021-05-01T19:38:30Z)
- TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representation extraction or fine-tuning with downstream models.
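A rough sketch of what "alteration along three axes" can look like on a log-mel spectrogram: time masking, frequency (channel) masking, and magnitude noise. The mask widths and noise scale below are assumptions; the model would be trained to reconstruct the original frames from the altered input.

```python
import torch

def alter(spec, time_width=7, freq_width=8, noise_std=0.2):
    """spec: (batch, frames, mel_bins); returns an altered copy for pre-training."""
    x = spec.clone()
    batch, frames, bins = x.shape
    t0 = torch.randint(0, max(frames - time_width, 1), (batch,))
    f0 = torch.randint(0, max(bins - freq_width, 1), (batch,))
    for i in range(batch):
        x[i, t0[i]:t0[i] + time_width, :] = 0.0     # time-axis masking
        x[i, :, f0[i]:f0[i] + freq_width] = 0.0     # channel (frequency) masking
    return x + noise_std * torch.randn_like(x)      # magnitude alteration
```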
arXiv Detail & Related papers (2020-07-12T16:19:00Z)