Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
- URL: http://arxiv.org/abs/2309.11500v4
- Date: Mon, 9 Sep 2024 14:52:15 GMT
- Title: Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
- Authors: Luoyi Sun, Xuenan Xu, Mengyue Wu, Weidi Xie,
- Abstract summary: We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs.
We employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
- Score: 50.28566759231076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from limitations in the following aspects: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames, audio streams. Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs. We exploit a series of pre-trained models or APIs, to determine audio-visual synchronisation, generate image captions, object detection, or audio tags for specific videos. Subsequently, we employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues. To demonstrate the effectiveness of the proposed dataset, we train widely used models on our dataset and show performance improvement on various downstream tasks, for example, audio-language retrieval, audio captioning, zero-shot classification. In addition, we establish a novel benchmark with environmental information and provide a benchmark for audio-text tasks.
Related papers
- AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations [1.2101820447447276]
Multi-modal learning in the audio-language domain has seen significant advancements in recent years.
However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks.
Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations.
This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio related models.
arXiv Detail & Related papers (2024-05-17T21:08:58Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - AudioFormer: Audio Transformer learns audio feature representations from
discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z) - WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z) - Learnable Irrelevant Modality Dropout for Multimodal Action Recognition
on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its most K-relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z) - Evaluating Off-the-Shelf Machine Listening and Natural Language Models
for Automated Audio Captioning [16.977616651315234]
A captioning system has to identify various information from the input signal and express it with natural language.
We evaluate the performance of off-the-shelf models with a Transformer-based captioning approach.
arXiv Detail & Related papers (2021-10-14T14:42:38Z) - Automatic Curation of Large-Scale Datasets for Audio-Visual
Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales.
arXiv Detail & Related papers (2021-01-26T14:27:47Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.