Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model
- URL: http://arxiv.org/abs/2107.01571v1
- Date: Sun, 4 Jul 2021 08:35:20 GMT
- Title: Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model
- Authors: Zhiqi Huang, Fenglin Liu, Xian Wu, Shen Ge, Helin Wang, Wei Fan,
Yuexian Zou
- Abstract summary: We propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to fuse the two modalities (audio and textual)
We further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio.
- Score: 51.42415340921237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Machine Comprehension (MC) has attracted extensive research interest in recent years, existing approaches mainly belong to the category of Machine Reading Comprehension, which mines textual inputs (paragraphs and questions) to predict answers (choices or text spans). However, many MC tasks accept audio input in addition to textual input, e.g., English listening comprehension tests. In this paper, we target the problem of Audio-Oriented Multimodal Machine Comprehension, whose goal is to answer questions based on the given audio and textual information. To solve this problem, we propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to effectively fuse the two modalities (audio and textual). DIIA can work as an independent component and thus be easily integrated into existing MC models. Moreover, we further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict answers based only on either the text or the audio. As a result, the proposed approach can handle various tasks, including Audio-Oriented Multimodal Machine Comprehension, Machine Reading Comprehension, and Machine Listening Comprehension, in a single model, making fair comparisons possible between our model and existing unimodal MC models. Experimental results and analysis demonstrate the effectiveness of the proposed approaches. First, the proposed DIIA boosts the baseline models by up to 21.08% in accuracy; second, under unimodal scenarios, the MKD module allows our multimodal MC model to outperform, by up to 18.87%, the unimodal models that are trained and tested with only audio or textual data.
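The abstract describes DIIA and MKD only at a high level, so the sketch below is one plausible reading rather than the paper's implementation: a fusion layer that combines intra-modality self-attention with inter-modality cross-attention through a learned gate, plus a distillation loss that pushes a unimodal (text-only or audio-only) student toward the fused multimodal predictions. All module names, dimensions, the gating scheme, and the loss weighting are illustrative assumptions.

```python
# Hypothetical sketch of DIIA-style fusion and MKD-style distillation.
# Module names, dimensions, and the gating scheme are assumptions for
# illustration; they are not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicAttentionFusion(nn.Module):
    """Fuses text and audio features with intra- and inter-modality attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.intra_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_t2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Dynamic gate: decides, per position, how much cross-modal context to mix in.
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Intra-modality attention (self-attention within each modality).
        t, _ = self.intra_text(text, text, text)
        a, _ = self.intra_audio(audio, audio, audio)
        # Inter-modality attention (text attends to audio and vice versa).
        t_cross, _ = self.inter_t2a(t, a, a)
        a_cross, _ = self.inter_a2t(a, t, t)
        # Dynamically weight intra- vs. inter-modality features for the text stream.
        g = torch.sigmoid(self.gate(torch.cat([t, t_cross], dim=-1)))
        fused_text = g * t_cross + (1 - g) * t
        # Pool both streams into a single joint representation for the answer head.
        return fused_text.mean(dim=1) + a_cross.mean(dim=1)


def mkd_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             labels: torch.Tensor,
             temperature: float = 2.0,
             alpha: float = 0.5) -> torch.Tensor:
    """Distillation loss: a unimodal student matches the soft answer
    distribution of the fused multimodal teacher, plus the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

Under these assumptions, the same answer head can be trained on the fused representation while a text-only or audio-only student reuses `mkd_loss` against the multimodal teacher's logits, which is consistent with the paper's claim that a single model covers the multimodal, reading-only, and listening-only settings.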
Related papers
- Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities [0.0]
Mini-Omni2 is a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries.
We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset.
arXiv Detail & Related papers (2024-10-15T02:10:45Z)
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- Explore the Limits of Omni-modal Pretraining at Scale [21.82148059125346]
We propose a scalable pretraining paradigm named Multimodal Context (MiCo).
MiCo can scale up the number of modalities and the amount of data, together with the model parameters, in the pretraining process.
Our models establish 37 new records for state-of-the-art performance.
arXiv Detail & Related papers (2024-06-13T17:59:53Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- MMToM-QA: Multimodal Theory of Mind Question Answering [80.87550820953236]
Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence.
Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding.
Human ToM, on the other hand, is more than video or text understanding.
People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
arXiv Detail & Related papers (2024-01-16T18:59:24Z)
- Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze the MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z)
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
- Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data [15.658471125219224]
Multimodal pre-training for audio-and-text has been proven to be effective and has significantly improved the performance of many downstream speech understanding tasks.
However, these state-of-the-art pre-trained audio-text models work well only when provided with a large amount of parallel audio-and-text data.
In this paper, we investigate whether it is possible to pre-train an audio-text model with low-resource parallel data.
arXiv Detail & Related papers (2022-04-10T10:25:37Z)
- Leveraging Uni-Modal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition [23.239078852797817]
We leverage uni-modal self-supervised learning to promote multimodal audio-visual speech recognition (AVSR).
In particular, we first train audio and visual encoders on a large-scale uni-modal dataset, then we integrate components of both encoders into a larger multimodal framework.
Our model is experimentally validated on both word-level and sentence-level AVSR tasks.
arXiv Detail & Related papers (2022-02-24T15:12:17Z)