TMac: Temporal Multi-Modal Graph Learning for Acoustic Event
Classification
- URL: http://arxiv.org/abs/2309.11845v2
- Date: Tue, 26 Sep 2023 08:03:48 GMT
- Title: TMac: Temporal Multi-Modal Graph Learning for Acoustic Event
Classification
- Authors: Meng Liu, Ke Liang, Dayu Hu, Hao Yu, Yue Liu, Lingyuan Meng, Wenxuan
Tu, Sihang Zhou, Xinwang Liu
- Abstract summary: We propose a Temporal Multi-modal graph learning method for Acoustic event Classification, called TMac.
In particular, we construct a temporal graph for each acoustic event, dividing its audio data and video data into multiple segments.
Several experiments are conducted to demonstrate that TMac outperforms other SOTA models.
- Score: 60.038979555455775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audiovisual data is everywhere in this digital age, which raises
higher requirements for the deep learning models developed on it. Handling the
information in such multi-modal data well is the key to a better audiovisual model.
We observe that these audiovisual data naturally have temporal attributes, such
as the time information for each frame in the video. More concretely, such data
is inherently multi-modal according to both audio and visual cues, which
proceed in a strict chronological order. This indicates that temporal information
is important in multi-modal acoustic event modeling, both within and across
modalities. However, existing methods handle each modality's features
independently and simply fuse them together, which neglects temporal relations
and thus leads to sub-optimal performance. With this
motivation, we propose a Temporal Multi-modal graph learning method for
Acoustic event Classification, called TMac, by modeling such temporal
information via graph learning techniques. In particular, we construct a
temporal graph for each acoustic event, dividing its audio data and video data
into multiple segments. Each segment is treated as a node, and the temporal
relationships between nodes are encoded as timestamps on their edges. In this
way, we can smoothly capture dynamic information both within and across
modalities. Several experiments demonstrate that TMac outperforms other SOTA
models. Our code is available at
https://github.com/MGitHubL/TMac.
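As a rough illustration of the graph construction described in the abstract, the sketch below builds a temporal multi-modal graph from time-aligned audio and video segment embeddings: one node per segment, intra-modal edges between consecutive segments, and inter-modal edges between co-occurring segments, with timestamps on the edges. This is a minimal sketch under stated assumptions, not the authors' implementation; the feature shapes, edge rule, and helper names are illustrative only.

# Minimal sketch of the temporal multi-modal graph described in the abstract.
# Node features, the edge rule, and all names here are illustrative assumptions,
# not the authors' actual code (see https://github.com/MGitHubL/TMac).
import numpy as np


def build_temporal_graph(audio_feats: np.ndarray, video_feats: np.ndarray):
    """audio_feats, video_feats: (T, d) arrays, one row per time segment.

    Returns stacked node features plus (src, dst, timestamp) edge triples
    covering intra-modal (consecutive segments) and inter-modal
    (same time segment) links.
    """
    T = audio_feats.shape[0]
    assert video_feats.shape[0] == T, "segments are assumed to be time-aligned"

    # Nodes 0..T-1 are audio segments, nodes T..2T-1 are video segments.
    nodes = np.concatenate([audio_feats, video_feats], axis=0)

    edges = []  # (src, dst, timestamp)
    for t in range(T):
        if t + 1 < T:
            edges.append((t, t + 1, t))          # audio segment -> next audio segment
            edges.append((T + t, T + t + 1, t))  # video segment -> next video segment
        edges.append((t, T + t, t))              # audio <-> video, same time segment
        edges.append((T + t, t, t))
    return nodes, edges


# Toy usage: 8 segments with 128-d audio and 512-d video embeddings,
# padded to a common width before stacking (purely for illustration).
audio = np.random.randn(8, 128)
video = np.random.randn(8, 512)
audio = np.pad(audio, ((0, 0), (0, 512 - 128)))  # align feature widths
nodes, edges = build_temporal_graph(audio, video)
print(nodes.shape, len(edges))  # (16, 512) 30

A graph learned over such nodes and timestamped edges is what lets a model reason jointly about intra- and inter-modal temporal structure, which is the core idea the abstract attributes to TMac.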
Related papers
- OMCAT: Omni Context Aware Transformer [27.674943980306423]
OCTAV is a novel dataset designed to capture event transitions across audio and video.
OMCAT is a powerful model that leverages RoTE to enhance temporal grounding and computational efficiency in time-anchored tasks.
Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment.
arXiv Detail & Related papers (2024-10-15T23:16:28Z)
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance.
arXiv Detail & Related papers (2021-12-02T18:29:07Z)
- Multi-level Attention Fusion Network for Audio-visual Event Recognition [6.767885381740951]
Event classification is inherently sequential and multimodal.
Deep neural models need to dynamically focus on the most relevant time window and/or modality of a video.
We propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition.
arXiv Detail & Related papers (2021-06-12T10:24:52Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information forward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)