CAT: Enhancing Multimodal Large Language Model to Answer Questions in
Dynamic Audio-Visual Scenarios
- URL: http://arxiv.org/abs/2403.04640v1
- Date: Thu, 7 Mar 2024 16:31:02 GMT
- Title: CAT: Enhancing Multimodal Large Language Model to Answer Questions in
Dynamic Audio-Visual Scenarios
- Authors: Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, Xiaochun Cao
- Abstract summary: This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components.
We introduce CAT, which enhances Multimodal Large Language Models (MLLMs) in three ways.
CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios.
- Score: 69.94398424864595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper focuses on the challenge of answering questions in scenarios that
are composed of rich and complex dynamic audio-visual components. Although
existing Multimodal Large Language Models (MLLMs) can respond to audio-visual
content, these responses are sometimes ambiguous and fail to describe specific
audio-visual events. To overcome this limitation, we introduce CAT, which
enhances MLLMs in three ways: 1) Besides straightforwardly bridging audio and
video, we design a clue aggregator that aggregates question-related clues in
dynamic audio-visual scenarios to enrich the detailed knowledge required for
large language models. 2) CAT is trained on a mixed multimodal dataset,
allowing direct application in audio-visual scenarios. Notably, we collect an
audio-visual joint instruction dataset named AVinstruct, to further enhance the
capacity of CAT to model cross-semantic correlations. 3) We propose AI-assisted
ambiguity-aware direct preference optimization, a strategy specialized in
retraining the model to favor non-ambiguous responses and improve its
ability to localize specific audio-visual objects. Extensive experimental
results demonstrate that CAT outperforms existing methods on multimodal tasks,
especially in Audio-Visual Question Answering (AVQA) tasks. The code and the
collected instructions are released at https://github.com/rikeilong/Bay-CAT.
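
The abstract names direct preference optimization (DPO) but does not spell out the ambiguity-aware variant, so the following is only a minimal sketch of the standard DPO objective for a single preference pair. It assumes the "chosen" response is the non-ambiguous one and the "rejected" response is the ambiguous one; the function names, the argument layout, and the default `beta` are illustrative assumptions, not taken from the CAT code release.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response
    under either the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy prefers the chosen response more
    # strongly than the reference model does.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Hypothetical log-probabilities for a non-ambiguous vs. ambiguous answer.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

When the policy and the reference assign identical log-probabilities to both responses, the loss equals log 2; it drops below that value once the policy favors the chosen (non-ambiguous) response more than the reference does.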
Related papers
- SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA).
SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question.
Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
arXiv Detail & Related papers (2024-11-07T18:12:49Z)
- Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time [73.7845280328535]
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio.
Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking.
We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
arXiv Detail & Related papers (2024-07-01T23:32:25Z)
- Answering Diverse Questions via Text Attached with Key Audio-Visual Clues [24.347420432207283]
We propose a framework for performing mutual correlation distillation (MCD) to aid question inference.
We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs.
arXiv Detail & Related papers (2024-03-11T12:51:37Z)
- Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models [25.660343393359565]
This paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal large language models (LLMs).
FAVOR simultaneously perceives speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level.
An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.
arXiv Detail & Related papers (2023-10-09T17:00:20Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities.
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z)
- Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.