Multimodal Action Quality Assessment
- URL: http://arxiv.org/abs/2402.09444v2
- Date: Tue, 20 Feb 2024 06:05:37 GMT
- Title: Multimodal Action Quality Assessment
- Authors: Ling-An Zeng and Wei-Shi Zheng
- Abstract summary: Action quality assessment (AQA) aims to assess how well an action is performed.
We argue that although AQA is highly dependent on visual information, audio provides useful complementary information for improving score regression accuracy.
We propose a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models modality-specific information and mixed-modality information.
- Score: 40.10252351858076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action quality assessment (AQA) aims to assess how well an action is performed.
Previous works model actions using only visual information, ignoring audio
information. We argue that although AQA is highly dependent on visual
information, audio provides useful complementary information for improving
score regression accuracy, especially for sports with background music,
such as figure skating and rhythmic gymnastics. To leverage multimodal
information for AQA, i.e., RGB, optical flow and audio information, we propose
a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models
modality-specific information and mixed-modality information. Our model
consists of three modality-specific branches that independently explore
modality-specific information and a mixed-modality branch that progressively
aggregates the modality-specific information from the modality-specific
branches. To build the bridge between modality-specific branches and the
mixed-modality branch, three novel modules are proposed. First, a
Modality-specific Feature Decoder module is designed to selectively transfer
modality-specific information to the mixed-modality branch. Second, when
exploring the interaction between modality-specific information, we argue that
using an invariant multimodal fusion policy may lead to suboptimal results,
since it fails to account for the potential diversity across different parts of
an action. Therefore, an Adaptive Fusion Module is proposed to learn
adaptive multimodal fusion policies in different parts of an action. This
module consists of several FusionNets for exploring different multimodal fusion
strategies and a PolicyNet for deciding which FusionNets are enabled. Third, a
module called Cross-modal Feature Decoder is designed to transfer cross-modal
features generated by Adaptive Fusion Module to the mixed-modality branch.
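
As a rough illustration of the Adaptive Fusion Module described above, the following PyTorch-style sketch combines several candidate FusionNets under a PolicyNet that predicts per-segment fusion weights over three modalities (RGB, optical flow, audio). All class names, dimensions, and the soft-weighting choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an Adaptive Fusion Module with FusionNets + PolicyNet.
# Shapes, layer choices, and soft gating are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionNet(nn.Module):
    """One candidate fusion strategy: concatenate per-modality features and project."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim * num_modalities, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )

    def forward(self, feats):  # feats: list of (B, T, D) tensors, one per modality
        return self.proj(torch.cat(feats, dim=-1))


class AdaptiveFusionModule(nn.Module):
    """Learns which fusion strategies to emphasize in each temporal segment."""

    def __init__(self, dim: int, num_fusion_nets: int = 4, num_modalities: int = 3):
        super().__init__()
        self.fusion_nets = nn.ModuleList(
            [FusionNet(dim, num_modalities) for _ in range(num_fusion_nets)]
        )
        # PolicyNet: predicts per-segment weights over the candidate FusionNets.
        self.policy_net = nn.Sequential(
            nn.Linear(dim * num_modalities, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, num_fusion_nets),
        )

    def forward(self, feats):  # feats: list of (B, T, D) per-modality features
        stacked = torch.cat(feats, dim=-1)                 # (B, T, M*D)
        policy = F.softmax(self.policy_net(stacked), -1)   # (B, T, K), soft policy
        # A hard (e.g. Gumbel-softmax) selection of FusionNets is another option.
        candidates = torch.stack(
            [f(feats) for f in self.fusion_nets], dim=-2   # (B, T, K, D)
        )
        return (policy.unsqueeze(-1) * candidates).sum(-2)  # (B, T, D)


if __name__ == "__main__":
    B, T, D = 2, 8, 256
    rgb, flow, audio = (torch.randn(B, T, D) for _ in range(3))
    fused = AdaptiveFusionModule(dim=D)([rgb, flow, audio])
    print(fused.shape)  # torch.Size([2, 8, 256])
```

In the full model described in the abstract, Modality-specific Feature Decoders and a Cross-modal Feature Decoder would further control how per-branch and fused features are transferred into the mixed-modality branch; those components are omitted from this sketch.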
Related papers
- StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers.
We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding.
Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
arXiv Detail & Related papers (2024-08-02T15:41:16Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z) - Leveraging Intra-modal and Inter-modal Interaction for Multi-Modal Entity Alignment [27.28214706269035]
Multi-modal entity alignment (MMEA) aims to identify equivalent entity pairs across different multi-modal knowledge graphs (MMKGs).
In this paper, we propose a Multi-Grained Interaction framework for Multi-Modal Entity alignment.
arXiv Detail & Related papers (2024-04-19T08:43:11Z)
- NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose a comprehensive framework NativE to achieve MMKGC in the wild.
NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
- MMSFormer: Multimodal Transformer for Material and Semantic Segmentation [16.17270247327955]
We propose a novel fusion strategy that can effectively fuse information from different modality combinations.
We also propose a new model named Multi-Modal TransFormer (MMSFormer) that incorporates the proposed fusion strategy.
MMSFormer outperforms current state-of-the-art models on three different datasets.
arXiv Detail & Related papers (2023-09-07T20:07:57Z)
- A Self-Adjusting Fusion Representation Learning Model for Unaligned Text-Audio Sequences [16.38826799727453]
How to integrate relevant information of each modality to learn fusion representations has been one of the central challenges in multimodal learning.
In this paper, a Self-Adjusting Fusion Representation Learning Model is proposed to learn robust crossmodal fusion representations directly from the unaligned text and audio sequences.
Experimental results show that our model significantly improves performance on all metrics for the unaligned text-audio sequences.
arXiv Detail & Related papers (2022-11-12T13:05:28Z)
- Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generality of the proposed mhsf model with pre-training+fine-tuning and fresh training strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)