Related papers: Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

URL: http://arxiv.org/abs/2310.05863v2
Date: Tue, 10 Oct 2023 05:30:49 GMT
Title: Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models
Authors: Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Abstract summary: This paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal large language models (LLM) FAVOR simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.
Score: 25.660343393359565
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio-visual large language models (LLM) have drawn significant attention, yet the fine-grained combination of both input streams is rather under-explored, which is challenging but necessary for LLMs to understand general video inputs. To this end, a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs is proposed in this paper, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module to enhance the capture of causal relations of the audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed which comprises six representative single-modal tasks with five cross-modal tasks reflecting audio-visual co-reasoning abilities. While achieving competitive single-modal performance on audio, speech and image tasks in AVEB, FAVOR achieved over 20% accuracy improvements on the video question-answering task when fine-grained information or temporal causal reasoning is required. FAVOR, in addition, demonstrated remarkable video comprehension and reasoning abilities on tasks that are unprecedented by other multimodal LLMs. An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.

Related papers

AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation [62.682428307810525]
We introduce AVS-Mamba, a selective state space model to address the audio-visual segmentation task. Our framework incorporates two key components for video understanding and cross-modal learning. Our approach achieves new state-of-the-art results on the AVSBench-object and AVS-semantic datasets.
arXiv Detail & Related papers (2025-01-14T03:20:20Z)
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time [73.7845280328535]
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio. Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
arXiv Detail & Related papers (2024-07-01T23:32:25Z)
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models [27.54879344983513]
Video-SALMONN can understand not only visual frame sequences, audio events and music, but speech as well. Video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are unprecedented by other av-LLMs.
arXiv Detail & Related papers (2024-06-22T01:36:11Z)
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding [33.85362137961572]
We introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarse-annotated audio-visual dataset VALOR, through a subtle method involving event-based video clustering. We develop AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens.
arXiv Detail & Related papers (2024-03-24T19:50:49Z)
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing [56.71450690166821]
We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM) VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation. We show that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements.
arXiv Detail & Related papers (2024-02-23T07:21:32Z)
Audio-Visual LLM for Video Understanding [25.963166809113005]
This paper presents Audio-Visual LLM, a Multimodal Large Language Model that takes both visual and auditory inputs for holistic video understanding. We introduce a high-quality video instruction dataset, derived from GPT-4. Experiments demonstrate that Audio-Visual LLM impressively achieves strong zero-shot results across a range of video understanding tasks.
arXiv Detail & Related papers (2023-12-11T02:50:46Z)
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding [61.80870130860662]
Video-LLaMA is a framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video. Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual and audio encoders and the frozen LLMs. We found Video-LLaMA shows the ability to perceive and comprehend video content and generate meaningful responses.
arXiv Detail & Related papers (2023-06-05T13:17:27Z)
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model) The proposed VATLM employs a unified backbone network to model the modality-independent information. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches just exploit the visual information while neglecting the audio information. We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.