From Audio to Symbolic Encoding
- URL: http://arxiv.org/abs/2302.13401v1
- Date: Sun, 26 Feb 2023 20:15:00 GMT
- Title: From Audio to Symbolic Encoding
- Authors: Shenli Yuan, Lingjie Kong, and Jiushuang Guo
- Abstract summary: We introduce a new neural network architecture built on top of the current state-of-the-art Onsets and Frames model.
For AMT, our models produced better results than a model trained with the state-of-the-art architecture.
Although a similar architecture could be trained on the speech recognition task, its results fell short of task-specific models.
- Score: 2.064612766965483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic music transcription (AMT) aims to convert raw audio to a
symbolic music representation. As a fundamental problem of music information
retrieval (MIR), AMT is considered a difficult task even for trained human
experts, due to the overlap of multiple harmonics in the acoustic signal.
Speech recognition, one of the most popular tasks in natural language
processing, aims to transcribe human spoken language to text. Given the
similar nature of AMT and speech recognition (both translate an audio signal
to a symbolic encoding), this paper investigates whether a generic neural
network architecture could work on both tasks. We introduce a new neural
network architecture built on top of the current state-of-the-art Onsets and
Frames model and compare the performance of several of its variants on the
AMT task. We also test the architecture on speech recognition. For AMT, our
models produced better results than a model trained with the state-of-the-art
architecture; however, although a similar architecture could be trained on
the speech recognition task, its results fell short of task-specific models.
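For context, the Onsets and Frames baseline this work extends pairs an onset-detection head with a frame-activity head whose output is conditioned on the detected onsets. Below is a minimal PyTorch sketch of that dual-head pattern; the layer sizes, the GRU standing in for the original BiLSTM, and all names are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

N_KEYS = 88  # piano keys, as in standard AMT benchmarks

class AcousticStack(nn.Module):
    """Conv stack over log-mel frames, then a recurrent layer (sketch)."""
    def __init__(self, n_mels=229, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),            # pool frequency, keep time
        )
        self.rnn = nn.GRU(32 * (n_mels // 2), hidden,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, N_KEYS)

    def forward(self, mel):                   # mel: (batch, time, n_mels)
        x = self.conv(mel.unsqueeze(1))       # (batch, 32, time, n_mels//2)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, time, features)
        x, _ = self.rnn(x)
        return torch.sigmoid(self.out(x))     # per-key probabilities

class OnsetsAndFramesSketch(nn.Module):
    """Onset head plus a frame head conditioned on the onset predictions."""
    def __init__(self):
        super().__init__()
        self.onset_stack = AcousticStack()
        self.frame_stack = AcousticStack()
        self.combine = nn.Linear(2 * N_KEYS, N_KEYS)

    def forward(self, mel):
        onsets = self.onset_stack(mel)
        frames = self.frame_stack(mel)
        # Detach so the frame loss does not backpropagate into the onset
        # head, mirroring the original training recipe.
        fused = torch.cat([onsets.detach(), frames], dim=-1)
        return onsets, torch.sigmoid(self.combine(fused))
```

Both heads are trained with binary cross-entropy against onset and frame piano-roll labels; the variants compared in the paper modify this kind of backbone.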
Related papers
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis of building ASR systems with discrete codes.
We investigate different training methods, such as quantization schemes and time-domain versus spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
- SALMONN: Towards Generic Hearing Abilities for Large Language Models [24.73033723114979]
We propose SALMONN, a speech audio language music open neural network.
It is built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model.
It is the first model of its type and can be regarded as a step towards AI with generic hearing abilities.
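In broad strokes, the integration pattern described above maps features from a pre-trained audio encoder through a small trainable module into the LLM's embedding space, so audio frames can be consumed alongside text token embeddings. The sketch below uses a plain linear projection as that bridge; SALMONN itself uses dual speech/audio encoders and a window-level Q-Former, so every module and size here is a simplifying assumption.

```python
import torch
import torch.nn as nn

audio_dim, llm_dim = 512, 4096              # assumed encoder / LLM widths
projector = nn.Linear(audio_dim, llm_dim)   # the trainable bridge

def build_llm_inputs(audio_feats, text_embeds):
    """Project audio features and prepend them to text token embeddings."""
    audio_embeds = projector(audio_feats)   # (batch, frames, llm_dim)
    return torch.cat([audio_embeds, text_embeds], dim=1)

inputs = build_llm_inputs(torch.randn(1, 50, audio_dim),
                          torch.randn(1, 12, llm_dim))
# `inputs` would then be fed to the LLM through its embedding-level input.
```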
arXiv Detail & Related papers (2023-10-20T05:41:57Z)
- EnCodecMAE: Leveraging neural codecs for universal audio representation learning [16.590638305972632]
We propose masking representations of the audio signal, and training a MAE to reconstruct the masked segments.
We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds.
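A hedged sketch of that masked-prediction recipe: time steps of a precomputed audio representation (e.g., codec embeddings) are replaced with a learned mask token, a Transformer re-encodes the corrupted sequence, and the loss is computed only on the masked positions. The mask ratio, dimensions, and plain MSE objective are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MaskedAudioMAE(nn.Module):
    def __init__(self, dim=128, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, dim)  # predict the original embedding

    def forward(self, feats):            # feats: (batch, time, dim)
        mask = torch.rand(feats.shape[:2], device=feats.device) < self.mask_ratio
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token, feats)
        pred = self.head(self.encoder(corrupted))
        # Reconstruction loss only where the input was masked.
        return ((pred - feats) ** 2)[mask].mean()
```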
arXiv Detail & Related papers (2023-09-14T02:21:53Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text.
We first convert all speech utterances to discrete tokens using an offline neural codec encoder.
We further integrate task IDs (TID) and language IDs (LID) into the model to enhance its ability to handle different languages and tasks.
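A minimal sketch of how such conditioning can be wired into a decoder-only model: the TID and LID tokens are simply prepended to the discrete speech tokens before autoregressive decoding. Token names and layout here are illustrative assumptions, not VioLA's exact vocabulary.

```python
ASR, S2T_TRANSLATION = "<asr>", "<s2t>"   # task-ID (TID) tokens
EN, ZH = "<en>", "<zh>"                   # language-ID (LID) tokens

def build_input(task_id, lang_id, speech_tokens):
    """Assemble one utterance: TID, LID, then the discrete speech tokens
    produced by an offline neural codec encoder."""
    return [task_id, lang_id] + [f"<a{t}>" for t in speech_tokens]

prompt = build_input(ASR, EN, [17, 3, 42])
# ['<asr>', '<en>', '<a17>', '<a3>', '<a42>'] -> the decoder then
# generates the target text tokens autoregressively.
```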
arXiv Detail & Related papers (2023-05-25T14:39:47Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
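Generatively modelling a sequence of linguistic units reduces to next-token prediction over a small vocabulary. A minimal sketch, with vocabulary size and dimensions as illustrative assumptions:

```python
import torch
import torch.nn as nn

class UnitLSTMLM(nn.Module):
    def __init__(self, n_units=64, dim=256):  # e.g. a few dozen phonemes
        super().__init__()
        self.embed = nn.Embedding(n_units, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(dim, n_units)

    def forward(self, units):                 # units: (batch, time) int IDs
        h, _ = self.lstm(self.embed(units))
        return self.out(h)                    # next-unit logits per step

model = UnitLSTMLM()
ids = torch.randint(0, 64, (2, 10))           # a toy batch of unit IDs
# Standard next-token objective: predict units[t+1] from units[:t+1].
loss = nn.functional.cross_entropy(
    model(ids)[:, :-1].reshape(-1, 64), ids[:, 1:].reshape(-1))
```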
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
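The co-alignment idea in the COALA entry above can be pictured as two encoders, one for audio and one for tags, trained so that matching pairs land close together in a shared embedding space. The sketch below reduces each encoder to a single linear layer and uses an InfoNCE-style loss; all sizes and the temperature are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_enc = nn.Linear(128, 64)  # stand-in for the audio autoencoder's encoder
tag_enc = nn.Linear(1000, 64)   # stand-in for the tag autoencoder's encoder

def alignment_loss(spec_feats, tag_vecs):
    """InfoNCE-style loss: each audio clip should match its own tags."""
    a = F.normalize(audio_enc(spec_feats), dim=-1)  # (batch, 64)
    t = F.normalize(tag_enc(tag_vecs), dim=-1)      # (batch, 64)
    logits = a @ t.T / 0.07                         # pairwise similarities
    targets = torch.arange(len(a))                  # i-th clip <-> i-th tags
    return F.cross_entropy(logits, targets)

loss = alignment_loss(torch.randn(8, 128), torch.rand(8, 1000))
```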