Anticipative Feature Fusion Transformer for Multi-Modal Action
Anticipation
- URL: http://arxiv.org/abs/2210.12649v1
- Date: Sun, 23 Oct 2022 08:11:03 GMT
- Title: Anticipative Feature Fusion Transformer for Multi-Modal Action
Anticipation
- Authors: Zeyun Zhong, David Schneider, Michael Voit, Rainer Stiefelhagen,
J\"urgen Beyerer
- Abstract summary: We introduce transformer-based modality fusion techniques, which unify multi-modal data at an early stage.
Our Anticipative Feature Fusion Transformer (AFFT) proves to be superior to popular score fusion approaches.
We extract audio features on EpicKitchens-100 which we add to the set of commonly used features in the community.
- Score: 19.461279313483683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although human action anticipation is a task which is inherently multi-modal,
state-of-the-art methods on well-known action anticipation datasets leverage
this data by applying ensemble methods and averaging the scores of unimodal
anticipation networks. In this work, we introduce transformer-based modality
fusion techniques, which unify multi-modal data at an early stage. Our
Anticipative Feature Fusion Transformer (AFFT) proves to be superior to popular
score fusion approaches and presents state-of-the-art results outperforming
previous methods on EpicKitchens-100 and EGTEA Gaze+. Our model is easily
extensible and allows for adding new modalities without architectural changes.
Consequently, we extracted audio features on EpicKitchens-100 which we add to
the set of commonly used features in the community.
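To make the contrast between the score fusion baseline and the early feature fusion described in the abstract concrete, here is a minimal, hedged sketch in PyTorch. It is not the authors' implementation: the names `score_fusion` and `FeatureFusionTransformer`, the feature dimensions, layer counts, class count and pooling choice are all illustrative assumptions; AFFT operates on pre-extracted per-modality features (e.g. RGB, optical flow, objects, audio).

```python
# Hedged sketch (not the authors' code): late score fusion vs. early
# transformer-based feature fusion of pre-extracted modality features.
# All names, dimensions and layer counts below are illustrative assumptions.
import torch
import torch.nn as nn


def score_fusion(logits_per_modality):
    """Late fusion baseline: average the class scores of unimodal networks."""
    return torch.stack(logits_per_modality, dim=0).mean(dim=0)


class FeatureFusionTransformer(nn.Module):
    """Early fusion: project each modality to a shared width, concatenate the
    resulting tokens and let self-attention mix information across modalities."""

    def __init__(self, input_dims, d_model=512, num_layers=4, num_classes=1000):
        super().__init__()
        # One linear projection per modality; num_classes is a placeholder.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in input_dims])
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, features_per_modality):
        # features_per_modality: list of (batch, time, dim_m) tensors, e.g.
        # pre-extracted RGB, optical flow, object and audio features.
        tokens = torch.cat(
            [proj(f) for proj, f in zip(self.proj, features_per_modality)], dim=1)
        fused = self.encoder(tokens)           # cross-modal self-attention
        return self.head(fused.mean(dim=1))    # pooled anticipation logits
```

Score fusion requires one trained classifier per modality and only combines their outputs, whereas the early-fusion module lets modalities interact before the classification head; adding a new modality (such as the audio features mentioned above) only means appending another projection rather than changing the architecture.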
Related papers
- Appformer: A Novel Framework for Mobile App Usage Prediction Leveraging Progressive Multi-Modal Data Fusion and Feature Extraction [9.53224378857976]
Appformer is a novel mobile application prediction framework inspired by the efficiency of Transformer-like architectures.
The framework employs Points of Interest (POIs) associated with base stations, optimizing them through comparative experiments to identify the most effective clustering method.
The Feature Extraction Module, employing Transformer-like architectures specialized for time series analysis, adeptly distils comprehensive features.
arXiv Detail & Related papers (2024-07-28T06:41:31Z)
- Fine-Grained Scene Image Classification with Modality-Agnostic Adapter [8.801601759337006]
We present a new multi-modal feature fusion approach named MAA (Modality-Agnostic Adapter)
We eliminate the modal differences in distribution and then use a modality-agnostic Transformer encoder for a semantic-level feature fusion.
Our experiments demonstrate that MAA achieves state-of-the-art results on benchmarks while using the same modalities as previous methods.
arXiv Detail & Related papers (2024-07-03T02:57:14Z)
- Multimodal Fusion with Pre-Trained Model Features in Affective Behaviour Analysis In-the-wild [37.32217405723552]
We present an approach for addressing the task of Expression (Expr) Recognition and Valence-Arousal (VA) Estimation.
We run pre-trained models on the Aff-Wild2 database, then extract the final hidden layers of the models as features.
Following preprocessing and/or convolution to align the extracted features, different models are employed for modality fusion.
arXiv Detail & Related papers (2024-03-22T09:00:24Z)
- Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a pure visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition [13.104967563769533]
We introduce a new multimodal fusion architecture, referred to as Unified Contrastive Fusion Transformer (UCFFormer).
UCFFormer integrates data with diverse distributions to enhance human action recognition (HAR) performance.
We present the Factorized Time-Modality Attention to perform self-attention efficiently for the Unified Transformer.
arXiv Detail & Related papers (2023-09-10T14:10:56Z)
- Equivariant Multi-Modality Image Fusion [124.11300001864579]
We propose the Equivariant Multi-Modality imAge fusion paradigm for end-to-end self-supervised learning.
Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations.
Experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images.
arXiv Detail & Related papers (2023-05-19T05:50:24Z)
- An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z)
- Multimodal E-Commerce Product Classification Using Hierarchical Fusion [0.0]
The proposed method significantly outperformed the unimodal models and the reported performance of similar models on our specific task.
We experimented with multiple fusion techniques and found that the best-performing way to combine the individual embeddings of the unimodal networks is a combination of concatenation and averaging of the feature vectors.
arXiv Detail & Related papers (2022-07-07T14:04:42Z)
- MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis [84.7287684402508]
Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high and mid-level latent modality representations.
Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived.
We propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training.
arXiv Detail & Related papers (2022-01-24T17:48:04Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers (a rough sketch of this idea follows after this list).
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
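As a companion to the Attention Bottlenecks entry above, the following sketch is a rough illustration of the bottleneck-token idea as described in that summary: each modality attends only to its own tokens plus a small set of shared bottleneck tokens, so cross-modal information has to pass through that narrow channel. The layer types, class name `BottleneckFusionLayer`, token counts and averaging of the bottleneck updates are assumptions, not the paper's exact implementation.

```python
# Rough sketch of bottleneck-token fusion (illustrative assumptions, not the
# original implementation): modalities exchange information only through a
# small set of shared bottleneck tokens appended to each modality's sequence.
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.audio_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, audio, video, bottleneck):
        # audio: (B, Ta, D), video: (B, Tv, D), bottleneck: (B, Nb, D), Nb small.
        na, nv = audio.size(1), video.size(1)
        a_out = self.audio_layer(torch.cat([audio, bottleneck], dim=1))
        v_out = self.video_layer(torch.cat([video, bottleneck], dim=1))
        audio, bn_from_audio = a_out[:, :na], a_out[:, na:]
        video, bn_from_video = v_out[:, :nv], v_out[:, nv:]
        # Merge the two modality-specific updates of the shared bottleneck tokens.
        bottleneck = (bn_from_audio + bn_from_video) / 2
        return audio, video, bottleneck
```

Stacking several such layers gives fusion at multiple depths, which mirrors the "at multiple layers" phrasing in the summary above.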
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.