Interactive Fusion of Multi-level Features for Compositional Activity Recognition
- URL: http://arxiv.org/abs/2012.05689v1
- Date: Thu, 10 Dec 2020 14:17:18 GMT
- Title: Interactive Fusion of Multi-level Features for Compositional Activity Recognition
- Authors: Rui Yan, Lingxi Xie, Xiangbo Shu, and Jinhui Tang
- Abstract summary: We present a novel framework that integrates appearance, positional, and semantic features by interactive fusion.
We implement the framework in three steps, namely, positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction.
We evaluate our approach on two action recognition datasets, Something-Something and Charades.
- Score: 100.75045558068874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To understand a complex action, multiple sources of information, including appearance, positional, and semantic features, need to be integrated. However, these features are difficult to fuse since they often differ significantly in modality and dimensionality. In this paper, we present a novel framework that accomplishes this goal by interactive fusion, namely, projecting features across different spaces and guiding the projection with an auxiliary prediction task. Specifically, we implement the framework in three steps: positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction. We evaluate our approach on two action recognition datasets, Something-Something and Charades. Interactive fusion achieves consistent accuracy gains beyond off-the-shelf action recognition algorithms. In particular, on Something-Else, the compositional setting of Something-Something, interactive fusion reports a remarkable gain of 2.9% in terms of top-1 accuracy.
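The three-step pipeline described in the abstract can be illustrated with a minimal sketch. The module names, tensor shapes, and hyper-parameters below (the attention-based interaction, the box-regression head, the category vocabulary size) are assumptions made for illustration, not the authors' released implementation; the sketch only shows the overall data flow of interactive fusion.

```python
# Minimal sketch of the three-step interactive fusion pipeline described in the
# abstract. All module names, shapes, and hyper-parameters are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class InteractiveFusion(nn.Module):
    def __init__(self, app_dim=512, sem_dim=128, num_classes=174, vocab_size=1000):
        super().__init__()
        # Step 1: positional-to-appearance -- inject box coordinates into
        # per-object appearance features.
        self.pos_proj = nn.Linear(4, app_dim)            # (x, y, w, h) -> appearance space
        # Step 2: semantic feature interaction -- let appearance-positional
        # features attend to semantic (object-category) embeddings.
        self.sem_embed = nn.Embedding(vocab_size, sem_dim)
        self.sem_proj = nn.Linear(sem_dim, app_dim)
        self.interaction = nn.MultiheadAttention(app_dim, num_heads=4, batch_first=True)
        # Step 3: semantic-to-positional prediction -- auxiliary head that
        # regresses box coordinates back from the fused features.
        self.aux_pos_head = nn.Linear(app_dim, 4)
        self.classifier = nn.Linear(app_dim, num_classes)

    def forward(self, app_feat, box_coords, obj_categories):
        # app_feat:        (B, N, app_dim)  per-object appearance features
        # box_coords:      (B, N, 4)        normalized object boxes
        # obj_categories:  (B, N)           object category ids
        fused = app_feat + self.pos_proj(box_coords)            # step 1
        sem = self.sem_proj(self.sem_embed(obj_categories))     # (B, N, app_dim)
        fused, _ = self.interaction(fused, sem, sem)            # step 2
        aux_boxes = self.aux_pos_head(fused)                    # step 3 (auxiliary target)
        logits = self.classifier(fused.mean(dim=1))             # video-level prediction
        return logits, aux_boxes
```

During training, the auxiliary box-regression output would be supervised against the input coordinates alongside the classification loss, which is one way the auxiliary prediction task can guide the fusion.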
Related papers
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence [51.54175067684008]
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks.
We first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes.
Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
arXiv Detail & Related papers (2024-03-17T07:02:55Z)
- Semantic-aware Video Representation for Few-shot Action Recognition [1.6486717871944268]
We propose a simple yet effective Semantic-Aware Few-Shot Action Recognition (SAFSAR) model to address these issues.
We show that directly leveraging a 3D feature extractor, an effective feature-fusion scheme, and a simple cosine similarity for classification can yield better performance (see the cosine-similarity sketch after this list).
Experiments on five challenging few-shot action recognition benchmarks under various settings demonstrate that the proposed SAFSAR model significantly improves the state-of-the-art performance.
arXiv Detail & Related papers (2023-11-10T18:13:24Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach that mines cross-modal semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- A Hierarchical Interactive Network for Joint Span-based Aspect-Sentiment Analysis [34.1489054082536]
We propose a hierarchical interactive network (HI-ASA) to model two-way interactions between two tasks appropriately.
We use a cross-stitch mechanism to selectively combine the different task-specific features as input, ensuring proper two-way interaction (see the cross-stitch sketch after this list).
Experiments on three real-world datasets demonstrate HI-ASA's superiority over baselines.
arXiv Detail & Related papers (2022-08-24T03:03:49Z)
- FINet: Dual Branches Feature Interaction for Partial-to-Partial Point Cloud Registration [31.014309817116175]
We present FINet, a feature interaction-based structure that enables and strengthens information association between the inputs at multiple stages.
Experiments demonstrate that our method achieves higher precision and robustness than state-of-the-art traditional and learning-based methods.
arXiv Detail & Related papers (2021-06-07T10:15:02Z)
- DCR-Net: A Deep Co-Interactive Relation Network for Joint Dialog Act Recognition and Sentiment Classification [77.59549450705384]
In dialog systems, dialog act recognition and sentiment classification are two correlated tasks.
Most existing systems either treat them as separate tasks or simply model the two tasks jointly.
We propose a Deep Co-Interactive Relation Network (DCR-Net) to explicitly consider the cross-impact and model the interaction between the two tasks.
arXiv Detail & Related papers (2020-08-16T14:13:32Z)
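The SAFSAR entry above mentions a simple cosine similarity for few-shot classification. The following is a minimal sketch of that general idea (class prototypes by averaging support embeddings, then cosine matching); the function and variable names are illustrative assumptions, not SAFSAR's actual code.

```python
# Minimal sketch of few-shot classification by cosine similarity, as mentioned
# in the SAFSAR summary above: each query video embedding is assigned the class
# of its most similar support prototype. Illustrative only, not SAFSAR's code.
import torch
import torch.nn.functional as F


def cosine_few_shot_classify(query_emb, support_embs, support_labels, num_classes):
    # query_emb:      (Q, D)  embeddings of query videos
    # support_embs:   (S, D)  embeddings of support videos
    # support_labels: (S,)    class ids of support videos
    # Build per-class prototypes by averaging support embeddings.
    prototypes = torch.stack([
        support_embs[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])                                                    # (num_classes, D)
    # Cosine similarity between each query and each prototype.
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return sims.argmax(dim=-1)                            # predicted class per query


# Usage: 5-way classification with random embeddings as stand-ins.
q = torch.randn(3, 512)
s = torch.randn(25, 512)
y = torch.arange(5).repeat_interleave(5)                  # 5 support videos per class
print(cosine_few_shot_classify(q, s, y, num_classes=5))
```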
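The HI-ASA entry above mentions a cross-stitch mechanism for combining task-specific features. Below is a minimal sketch of a generic cross-stitch unit (a learned 2x2 mixing of two task-specific features), offered as an illustration of the mechanism rather than HI-ASA's implementation.

```python
# Minimal sketch of a generic cross-stitch unit, as referenced in the HI-ASA
# summary above: two task-specific features are selectively mixed through a
# learned 2x2 combination matrix. Illustrative only, not HI-ASA's code.
import torch
import torch.nn as nn


class CrossStitchUnit(nn.Module):
    def __init__(self):
        super().__init__()
        # 2x2 mixing weights, initialized close to identity so each task
        # starts out mostly using its own features.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: task-specific features of identical shape (B, D)
        mixed_a = self.alpha[0, 0] * feat_a + self.alpha[0, 1] * feat_b
        mixed_b = self.alpha[1, 0] * feat_a + self.alpha[1, 1] * feat_b
        return mixed_a, mixed_b


# Usage: mix aspect-extraction and sentiment-classification features.
unit = CrossStitchUnit()
aspect_feat, sentiment_feat = torch.randn(8, 256), torch.randn(8, 256)
aspect_mixed, sentiment_mixed = unit(aspect_feat, sentiment_feat)
```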