Automated Audio Captioning via Fusion of Low- and High- Dimensional
Features
- URL: http://arxiv.org/abs/2210.05037v1
- Date: Mon, 10 Oct 2022 22:39:41 GMT
- Title: Automated Audio Captioning via Fusion of Low- and High- Dimensional
Features
- Authors: Jianyuan Sun and Xubo Liu and Xinhao Mei and Mark D. Plumbley and
Volkan Kilic and Wenwu Wang
- Abstract summary: Existing AAC methods only use the high-dimensional representation of the PANNs as the input of the decoder.
A new encoder-decoder framework is proposed called the Low- and High-Dimensional Feature Fusion (LHDFF) model for AAC.
LHDFF achieves the best performance on the Clotho and AudioCaps datasets compared with other existing models.
- Score: 48.62190893209622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated audio captioning (AAC) aims to describe the content of an audio
clip using simple sentences. Existing AAC methods are built on an encoder-decoder
architecture whose success is largely attributed to the use of a pre-trained CNN10,
called PANNs, as the encoder to learn rich audio representations. AAC remains a highly
challenging task because its high-dimensional latent space must cover audio from diverse
scenarios. Existing methods feed only the high-dimensional representation from PANNs to
the decoder. However, the low-dimensional representation, which may retain as much audio
information as the high-dimensional one, is neglected. In addition, an approach that
relies only on the high-dimensional feature to predict captions learned from existing
audio captions lacks robustness and efficiency. To address these challenges, this paper
proposes a new encoder-decoder framework that fuses low- and high-dimensional features,
called the Low- and High-Dimensional Feature Fusion (LHDFF) model for AAC. In LHDFF, a
new PANNs encoder called Residual PANNs (RPANNs) is proposed, which fuses the
low-dimensional feature from an intermediate convolution layer output with the
high-dimensional feature from the final layer output of PANNs. To fully exploit the
fused low- and high-dimensional feature and the high-dimensional feature respectively,
we propose dual transformer decoders that generate captions in parallel. In particular,
a probabilistic fusion approach is proposed that improves the overall performance of the
system by drawing on the respective advantages of the two transformer decoders.
Experimental results show that LHDFF achieves the best performance on the Clotho and
AudioCaps datasets compared with other existing models.
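The following PyTorch code is a minimal, illustrative sketch of the three ingredients named in the abstract: an RPANNs-style encoder that fuses a low-dimensional intermediate convolution feature with the high-dimensional final feature, dual transformer decoders run in parallel over the fused and high-dimensional features, and a probabilistic fusion of the two decoders' word distributions. The layer sizes, the residual-add fusion, the two-layer decoders, and the fixed 0.5 fusion weight are assumptions made for illustration, not the configuration reported in the paper.

```python
# Minimal sketch of the LHDFF idea (assumed dimensions, not the paper's exact setup).
import torch
import torch.nn as nn


class RPANNsEncoder(nn.Module):
    """CNN encoder returning a fused (low + high) feature and the high-dimensional feature."""

    def __init__(self, low_dim=128, high_dim=512):
        super().__init__()
        self.low_block = nn.Sequential(   # intermediate convolution stage (low-dimensional)
            nn.Conv2d(1, low_dim, 3, padding=1), nn.BatchNorm2d(low_dim),
            nn.ReLU(), nn.AvgPool2d(2),
        )
        self.high_block = nn.Sequential(  # final convolution stage (high-dimensional)
            nn.Conv2d(low_dim, high_dim, 3, padding=1), nn.BatchNorm2d(high_dim),
            nn.ReLU(), nn.AvgPool2d(2),
        )
        self.low_proj = nn.Linear(low_dim, high_dim)  # project low-dim feature before fusion

    def forward(self, mel):                       # mel: (batch, frames, mel_bins)
        x = mel.unsqueeze(1)                      # (batch, 1, frames, mel_bins)
        low = self.low_block(x)                   # (batch, low_dim, t1, f1)
        high = self.high_block(low)               # (batch, high_dim, t2, f2)
        low = low.mean(dim=3).transpose(1, 2)     # pool frequency -> (batch, t1, low_dim)
        high = high.mean(dim=3).transpose(1, 2)   # pool frequency -> (batch, t2, high_dim)
        low = self.low_proj(low)[:, : high.size(1)]  # align time length with the high feature
        fused = high + low                        # residual-style low/high fusion
        return fused, high


class LHDFF(nn.Module):
    """Dual transformer decoders whose output distributions are fused probabilistically."""

    def __init__(self, vocab_size, d_model=512, fusion_weight=0.5):
        super().__init__()
        self.encoder = RPANNsEncoder(high_dim=d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.dec_fused = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.dec_high = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.out_fused = nn.Linear(d_model, vocab_size)
        self.out_high = nn.Linear(d_model, vocab_size)
        self.fusion_weight = fusion_weight        # assumed fixed mixing weight

    def forward(self, mel, tokens):               # tokens: (batch, seq) caption prefix ids
        fused_feat, high_feat = self.encoder(mel)
        tgt = self.embed(tokens)
        seq = tokens.size(1)
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)  # causal mask
        h_fused = self.dec_fused(tgt, fused_feat, tgt_mask=mask)
        h_high = self.dec_high(tgt, high_feat, tgt_mask=mask)
        p_fused = self.out_fused(h_fused).softmax(-1)
        p_high = self.out_high(h_high).softmax(-1)
        # probabilistic fusion: convex combination of the two decoders' distributions
        return self.fusion_weight * p_fused + (1 - self.fusion_weight) * p_high


if __name__ == "__main__":
    model = LHDFF(vocab_size=5000)
    mel = torch.randn(2, 400, 64)                 # batch of log-mel spectrograms
    tokens = torch.randint(0, 5000, (2, 12))      # caption token prefixes
    print(model(mel, tokens).shape)               # torch.Size([2, 12, 5000])
```

Here a fixed convex combination stands in for the paper's probabilistic fusion, which is designed to lean on whichever decoder is more reliable for a given caption.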
Related papers
- DGFNet: End-to-End Audio-Visual Source Separation Based on Dynamic Gating Fusion [1.292190360867547]
Current Audio-Visual Source Separation methods primarily adopt two design strategies.
The first strategy involves fusing audio and visual features at the bottleneck layer of the encoder, followed by processing the fused features through the decoder.
The second strategy avoids direct fusion and instead relies on the decoder to handle the interaction between audio and visual features.
This paper proposes a fusion method based on a gating mechanism that dynamically adjusts the degree of modality fusion.
arXiv Detail & Related papers (2025-04-30T06:55:24Z) - Plug-and-Play Versatile Compressed Video Enhancement [57.62582951699999]
Video compression effectively reduces file sizes, making real-time cloud computing possible.
However, it comes at the cost of visual quality, which challenges the robustness of downstream vision models.
We present a versatile-aware enhancement framework that adaptively enhances videos under different compression settings.
arXiv Detail & Related papers (2025-04-21T18:39:31Z) - Breaking the Encoder Barrier for Seamless Video-Language Understanding [22.749949819082484]
We propose ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions without relying on a vision encoder.
With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%.
arXiv Detail & Related papers (2025-03-24T08:06:39Z) - REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling.
Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions.
We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z) - High-Efficiency Neural Video Compression via Hierarchical Predictive Learning [27.41398149573729]
Enhanced Deep Hierarchical Video Compression (DHVC 2.0) introduces superior compression performance and impressive complexity efficiency.
Uses hierarchical predictive coding to transform each video frame into multiscale representations.
Supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss.
arXiv Detail & Related papers (2024-10-03T15:40:58Z) - When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [112.44822009714461]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z) - Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.
To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder.
Results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z) - Compression-Realized Deep Structural Network for Video Quality Enhancement [78.13020206633524]
This paper focuses on the task of quality enhancement for compressed videos.
Most of the existing methods lack a structured design to optimally leverage the priors within compression codecs.
A new paradigm is urgently needed for a more "conscious" process of quality enhancement.
arXiv Detail & Related papers (2024-05-10T09:18:17Z) - Parameter Efficient Audio Captioning With Faithful Guidance Using
Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter-efficient, inference-time faithful decoding algorithm that enables smaller audio captioning models to match the performance of larger models trained with more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive
Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - Text-Driven Foley Sound Generation With Latent Diffusion Model [33.4636070590045]
Foley sound generation aims to synthesise the background sound for multimedia content.
We propose a diffusion model based system for Foley sound generation with text conditions.
arXiv Detail & Related papers (2023-06-17T14:16:24Z) - Efficient VVC Intra Prediction Based on Deep Feature Fusion and
Probability Estimation [57.66773945887832]
We propose to optimize Versatile Video Coding (VVC) complexity at intra-frame prediction, with a two-stage framework of deep feature fusion and probability estimation.
Experimental results on a standard database demonstrate the superiority of the proposed method, especially for High Definition (HD) and Ultra-HD (UHD) video sequences.
arXiv Detail & Related papers (2022-05-07T08:01:32Z) - Automatic Audio Captioning using Attention weighted Event based
Embeddings [25.258177951665594]
We propose an encoder-decoder architecture with lightweight (i.e., fewer learnable parameters) Bi-LSTM recurrent layers for AAC.
Our results show that an efficient AED based embedding extractor combined with temporal attention and augmentation techniques is able to surpass existing literature.
arXiv Detail & Related papers (2022-01-28T05:54:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.