Automatic Embedding of Stories Into Collections of Independent Media
- URL: http://arxiv.org/abs/2111.02216v1
- Date: Wed, 3 Nov 2021 13:36:47 GMT
- Title: Automatic Embedding of Stories Into Collections of Independent Media
- Authors: Dylan R. Ashley and Vincent Herrmann and Zachary Friggstad and Kory W.
Mathewson and Jürgen Schmidhuber
- Abstract summary: We look at how machine learning techniques can be used to automatically embed stories into collections of independent media.
We use models that extract the tempo of songs to make a music playlist follow a narrative arc.
- Score: 5.188557858279645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We look at how machine learning techniques that derive properties of items in
a collection of independent media can be used to automatically embed stories
into such collections. To do so, we use models that extract the tempo of songs
to make a music playlist follow a narrative arc. Our work specifies an
open-source tool that uses pre-trained neural network models to extract the
global tempo of a set of raw audio files and applies these measures to create a
narrative-following playlist. This tool is available at
https://github.com/dylanashley/playlist-story-builder/releases/tag/v1.0.0
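The abstract describes two steps: estimate each track's global tempo, then order the tracks so their tempos trace a narrative arc. The released tool uses pre-trained neural network models for tempo extraction; the sketch below is only an illustration of the ordering step, assuming a simple triangular rise-and-fall tension curve and taking tempos as given. The helper names `narrative_arc` and `order_by_arc` are hypothetical, not from the tool.

```python
# Illustrative sketch of narrative-arc playlist ordering (not the released
# playlist-story-builder implementation, which uses pretrained neural tempo
# models). Tempos are assumed to be precomputed, one value per track.
import numpy as np

def narrative_arc(n, climax=0.75):
    """Target tension curve over n playlist slots: rise to a climax, then fall."""
    x = np.linspace(0.0, 1.0, n)
    return np.where(x <= climax, x / climax, (1.0 - x) / (1.0 - climax))

def order_by_arc(tempos):
    """Return track indices ordered so tempos follow the arc.

    Slots with a higher target tension receive the faster tracks: sort slots
    by target value, sort tracks by tempo, and pair them off in order.
    """
    n = len(tempos)
    arc = narrative_arc(n)
    slot_order = np.argsort(arc, kind="stable")      # slots, lowest target first
    track_order = np.argsort(tempos, kind="stable")  # tracks, slowest first
    playlist = np.empty(n, dtype=int)
    playlist[slot_order] = track_order
    return playlist.tolist()

# Placeholder tempo values (BPM) for four hypothetical tracks.
tempos = [128.0, 90.0, 150.0, 110.0]
print(order_by_arc(tempos))  # → [1, 0, 2, 3]: 90 → 128 → 150 (climax) → 110
```

With four tracks the arc peaks at the third slot, so the fastest track lands there while the slowest tracks open and close the playlist.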
Related papers
- Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling [62.25533750469467]
We propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified.
We evaluate the method over a variety of TV sitcoms, including Seinfeld, Frasier and Scrubs.
We envision this system being useful for the automatic generation of subtitles to improve the accessibility of videos available on modern streaming services.
arXiv Detail & Related papers (2024-01-22T15:26:01Z)
- WikiMuTe: A web-sourced dataset of semantic descriptions for music audio [7.4327407361824935]
We present WikiMuTe, a new and open dataset containing rich semantic descriptions of music.
The data is sourced from Wikipedia's rich catalogue of articles covering musical works.
We train a model that jointly learns text and audio representations and performs cross-modal retrieval.
arXiv Detail & Related papers (2023-12-14T18:38:02Z)
- MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models [54.55063772090821]
MusicAgent integrates numerous music-related tools and an autonomous workflow to address user requirements.
The primary goal of this system is to free users from the intricacies of AI-music tools, enabling them to concentrate on the creative aspect.
arXiv Detail & Related papers (2023-10-18T13:31:10Z)
- Follow Anything: Open-set detection, tracking, and following in real-time [89.83421771766682]
We present a robotic system to detect, track, and follow any object in real-time.
Our approach, dubbed "follow anything" (FAn), is an open-vocabulary and multimodal model.
FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second.
arXiv Detail & Related papers (2023-08-10T17:57:06Z)
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt, such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
- Music Playlist Title Generation Using Artist Information [4.201869316472344]
We present an encoder-decoder model that generates a playlist title from a sequence of music tracks.
Comparing the track IDs and artist IDs as input sequences, we show that the artist-based approach significantly enhances the performance in terms of word overlap, semantic relevance, and diversity.
arXiv Detail & Related papers (2023-01-14T00:19:39Z)
- Spectrograms Are Sequences of Patches [5.253100011321437]
We design a self-supervised model that captures a spectrogram of music as a series of patches: Patchifier.
We do not use labeled data for the pre-training process, only a subset of the MTAT dataset containing 16k music clips.
Our model achieves results comparable to those of other audio representation models.
arXiv Detail & Related papers (2022-10-28T08:39:36Z)
- Malakai: Music That Adapts to the Shape of Emotions [0.0]
Malakai is a tool that helps users create, listen to, remix, and share dynamic songs.
Using Malakai, a Composer can create a dynamic song that can be interacted with by a Listener.
arXiv Detail & Related papers (2021-12-03T18:34:54Z)
- Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging [8.658926288789164]
We present a public dataset of mel-spectrograms for 649,091 tracks and 148,826 associated playlists annotated by 30,652 different tags.
All the data is gathered from Melon, a popular Korean streaming service.
The dataset is suitable for music information retrieval tasks, in particular, auto-tagging and automatic playlist continuation.
arXiv Detail & Related papers (2021-01-30T10:13:10Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.