Stable Audio Open
- URL: http://arxiv.org/abs/2407.14358v2
- Date: Wed, 31 Jul 2024 16:22:42 GMT
- Title: Stable Audio Open
- Authors: Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
- Abstract summary: We describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data.
Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics.
- Score: 8.799402694043955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FD_openl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.
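For readers who want to try the released weights, a minimal inference sketch follows. It assumes the public Hugging Face diffusers integration (StableAudioPipeline and the stabilityai/stable-audio-open-1.0 checkpoint); exact argument names may differ across library versions.

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load the open weights (assumes the diffusers integration of the release).
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

result = pipe(
    prompt="warm arpeggiated synth, 120 BPM",
    negative_prompt="low quality",
    num_inference_steps=100,
    audio_end_in_s=10.0,        # length of the generated clip in seconds
)
audio = result.audios[0]        # (channels, samples) stereo tensor

# The model generates stereo audio at 44.1 kHz.
sf.write("output.wav", audio.T.float().cpu().numpy(), pipe.vae.sampling_rate)
```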
Related papers
- Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling? [40.3708221702947]
We aim to evaluate the quality of generated audio by examining its effectiveness as training data.
Specifically, we conduct studies to explore the use of synthetic audio for audio recognition.
We also investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling.
arXiv Detail & Related papers (2024-06-13T04:33:05Z)
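A minimal sketch of the augmentation setup this paper studies: synthetic clips are mixed into the real training set. The dataset objects below are placeholders, not the paper's data.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder datasets of (waveform, label) pairs; real and generated audio.
real = TensorDataset(torch.randn(1000, 16000), torch.randint(0, 10, (1000,)))
synthetic = TensorDataset(torch.randn(500, 16000), torch.randint(0, 10, (500,)))

# Train the recognizer on the union of real and synthetic examples.
train_set = ConcatDataset([real, synthetic])
loader = DataLoader(train_set, batch_size=32, shuffle=True)
```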
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
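A hedged sketch of the hybrid objective named in the title: a shared encoder trained with both a CTC head and an RNN-T head. Shapes and the 0.3 auxiliary weight are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F
import torchaudio

B, T, U, V = 4, 200, 30, 512                # batch, frames, target length, vocab (blank = 0)
enc_logits = torch.randn(B, T, V)           # CTC head output on the shared encoder
joint_logits = torch.randn(B, T, U + 1, V)  # RNN-T joiner output
targets = torch.randint(1, V, (B, U), dtype=torch.int32)
in_lens = torch.full((B,), T, dtype=torch.int32)
tgt_lens = torch.full((B,), U, dtype=torch.int32)

ctc = F.ctc_loss(
    enc_logits.transpose(0, 1).log_softmax(-1),  # CTC expects (T, B, V) log-probs
    targets.long(), in_lens.long(), tgt_lens.long(),
    blank=0, zero_infinity=True,
)
rnnt = torchaudio.functional.rnnt_loss(joint_logits, targets, in_lens, tgt_lens, blank=0)
loss = rnnt + 0.3 * ctc                     # weighted combination (assumed weight)
```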
- Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
The models can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z)
- Audiobox: Unified Audio Generation with Natural Language Prompts [37.39834044113061]
This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
arXiv Detail & Related papers (2023-12-25T22:24:49Z)
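As background for the flow-matching objective Audiobox builds on, here is a minimal conditional flow-matching training loss. The model signature and linear probability path are generic choices, not the paper's exact design.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """model(x_t, t, cond) predicts a velocity field (assumed signature)."""
    x0 = torch.randn_like(x1)                    # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)
    tb = t.view(-1, *([1] * (x1.dim() - 1)))     # broadcast t over feature dims
    x_t = (1 - tb) * x0 + tb * x1                # linear path from noise to data
    v_target = x1 - x0                           # constant target velocity
    return ((model(x_t, t, cond) - v_target) ** 2).mean()
```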
- StemGen: A music generation model that listens [9.489938613869864]
We present an alternative paradigm for producing music generation models that can listen and respond to musical context.
We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture.
The resulting model reaches the audio quality of state-of-the-art text-conditioned models, as well as exhibiting strong musical coherence with its context.
arXiv Detail & Related papers (2023-12-14T08:09:20Z)
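Non-autoregressive token models of this kind typically decode by iteratively committing the most confident tokens and re-masking the rest (MaskGIT-style). The loop below sketches that idea under an assumed model signature and cosine schedule; it is not claimed to be StemGen's exact procedure.

```python
import math
import torch

def iterative_decode(model, context, seq_len, mask_id, steps=8):
    """Start fully masked; each step commits confident tokens, re-masks the rest."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for s in range(steps):
        logits = model(tokens, context)             # (1, seq_len, vocab), assumed signature
        conf, sampled = logits.softmax(-1).max(-1)  # greedy choice per position
        conf = torch.where(tokens == mask_id, conf, torch.ones_like(conf))
        tokens = torch.where(tokens == mask_id, sampled, tokens)
        # cosine schedule: fraction of positions left masked after this step
        n_mask = int(seq_len * math.cos(math.pi / 2 * (s + 1) / steps))
        if n_mask > 0:
            lowest = conf.topk(n_mask, largest=False).indices
            tokens[0, lowest[0]] = mask_id          # re-mask least confident positions
    return tokens
```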
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
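The structural idea is to decompose the waveform into frequency bands, model each band with its own diffusion process, and sum the bands back together. The naive FFT masks below are a stand-in for the paper's proper filterbanks; treat this only as a sketch of the decomposition.

```python
import torch

def split_bands(x, sr, edges=(0.0, 1000.0, 4000.0, 8000.0, float("inf"))):
    """Partition x into disjoint frequency bands that sum back to x."""
    X = torch.fft.rfft(x, dim=-1)
    freqs = torch.fft.rfftfreq(x.shape[-1], d=1.0 / sr)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = ((freqs >= lo) & (freqs < hi)).to(x.dtype)
        bands.append(torch.fft.irfft(X * mask, n=x.shape[-1], dim=-1))
    return bands

x = torch.randn(2, 44100)               # one second of stereo noise at 44.1 kHz
assert torch.allclose(sum(split_bands(x, 44100)), x, atol=1e-4)
```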
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which uses teacher models to provide pseudo labels for masked language modelling (MLM)-style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
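A minimal sketch of MLM-style acoustic pre-training with teacher pseudo-labels: mask feature frames, then classify each masked frame into a teacher's codebook. The sizes, masking rate, and teacher here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

B, T, D, K = 8, 400, 768, 1024             # batch, frames, feature dim, codebook size
feats = torch.randn(B, T, D)               # acoustic features from a frontend
teacher_ids = torch.randint(0, K, (B, T))  # pseudo labels from frozen teachers

mask = torch.rand(B, T) < 0.3              # mask ~30% of the frames
mask_emb = torch.zeros(D)                  # a learned [MASK] embedding in practice
inputs = torch.where(mask.unsqueeze(-1), mask_emb, feats)

encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True),
    num_layers=2,
)
head = torch.nn.Linear(D, K)

logits = head(encoder(inputs))             # (B, T, K)
loss = F.cross_entropy(logits[mask], teacher_ids[mask])  # loss on masked frames only
```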
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
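Prompt-conditioned diffusion samplers of this kind commonly rely on classifier-free guidance at each denoising step. A generic guidance step is sketched below; the function names are assumptions, and this is not claimed to be Make-An-Audio's exact sampler.

```python
import torch

def guided_eps(model, x_t, t, text_emb, null_emb, scale=3.0):
    """Classifier-free guidance: push the prediction toward the text condition."""
    eps_cond = model(x_t, t, text_emb)    # conditional noise prediction
    eps_uncond = model(x_t, t, null_emb)  # unconditional (empty-prompt) prediction
    return eps_uncond + scale * (eps_cond - eps_uncond)
```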
- A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer [31.028408352051684]
We present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech.
Our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
arXiv Detail & Related papers (2022-07-14T16:21:33Z)
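A common way to make one model accept audio-visual, audio-only, or visual-only input is modality dropout during pre-training. The sketch below illustrates the idea with assumed dropout rates and a simple concatenation fusion.

```python
import torch

def fuse_with_modality_dropout(audio_feat, video_feat, p_audio=0.25, p_video=0.25, training=True):
    """Concatenate modality features, randomly zeroing one stream during training."""
    if training:
        drop_audio = torch.rand(()).item() < p_audio
        drop_video = (not drop_audio) and torch.rand(()).item() < p_video  # keep >= 1 stream
        if drop_audio:
            audio_feat = torch.zeros_like(audio_feat)
        if drop_video:
            video_feat = torch.zeros_like(video_feat)
    return torch.cat([audio_feat, video_feat], dim=-1)  # input to the shared encoder
```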
- Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data [9.072124914105325]
We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings.
Experiments on the large scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model.
arXiv Detail & Related papers (2020-05-29T01:30:14Z)
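With weakly labeled video, only clip-level labels are available, so frame-level predictions must be pooled into one clip-level prediction. Below is a standard attention-pooling sketch; the sizes are illustrative (527 is the AudioSet class count).

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=512, n_classes=527):
        super().__init__()
        self.cls = nn.Linear(dim, n_classes)   # per-frame class scores
        self.att = nn.Linear(dim, n_classes)   # per-frame attention logits

    def forward(self, frames):                 # frames: (B, T, dim)
        w = torch.softmax(self.att(frames), dim=1)  # attention weights over time
        p = torch.sigmoid(self.cls(frames))         # frame-level probabilities
        return (w * p).sum(dim=1)                   # clip-level (B, n_classes)
```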
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks.
In probing experiments, we find that the intermediate latent representations encode richer phoneme and speaker information than the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
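A minimal sketch of such a probing setup: freeze the upstream encoder and train a small linear classifier on one intermediate layer's representations. The encoder interface, layer index, and class count below are assumptions.

```python
import torch
import torch.nn.functional as F

# Assume `encoder(wavs)` returns a list of per-layer hidden states of shape
# (B, T, D); this interface is a stand-in for the real upstream model.
@torch.no_grad()
def layer_features(encoder, wavs, layer=6):
    return encoder(wavs)[layer]

probe = torch.nn.Linear(768, 40)        # e.g. 40 phoneme classes (illustrative)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(encoder, wavs, frame_labels):
    feats = layer_features(encoder, wavs)            # frozen upstream features
    logits = probe(feats)                            # (B, T, 40)
    loss = F.cross_entropy(logits.flatten(0, 1), frame_labels.flatten())
    loss.backward(); opt.step(); opt.zero_grad()
    return loss.item()
```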