VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by
Visual-Semantic Fusion for Egocentric Action Anticipation
- URL: http://arxiv.org/abs/2307.03918v1
- Date: Sat, 8 Jul 2023 06:49:54 GMT
- Title: VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by
Visual-Semantic Fusion for Egocentric Action Anticipation
- Authors: Congqi Cao and Ze Sun and Qinyi Lv and Lingtong Min and Yanning Zhang
- Abstract summary: Egocentric action anticipation is a challenging task that aims to predict future actions in advance from the first-person view.
Most existing methods focus on improving the model architecture and loss function based on visual input and recurrent neural networks.
We propose a novel visual-semantic fusion enhanced, Transformer-GRU-based action anticipation framework.
- Score: 33.41226268323332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric action anticipation is a challenging task that aims to
predict future actions in advance from current and historical observations in
the first-person view. Most existing methods focus on improving the model
architecture and loss function based on the visual input and recurrent neural
networks to boost anticipation performance. However, these methods, which
merely consider visual information and rely on a single network architecture,
gradually reach a performance plateau. In order to fully understand what has
been observed and to capture the dependencies between current observations and
future actions well enough, we propose a novel visual-semantic fusion enhanced,
Transformer-GRU-based action anticipation framework in this paper. First,
high-level semantic information is introduced, for the first time, to improve
the performance of action anticipation. We propose to use semantic features
generated from the class labels, or inferred directly from the visual
observations, to augment the original visual features. Second, an effective
visual-semantic fusion module is proposed to bridge the semantic gap and fully
exploit the complementarity of the two modalities. Third, to take advantage of
both parallel and autoregressive models, we design a Transformer-based encoder
for long-term sequential modeling and a GRU-based decoder for flexible
iterative decoding. Extensive experiments on two large-scale first-person view
datasets, i.e., EPIC-Kitchens and EGTEA Gaze+, validate the effectiveness of
the proposed method, which achieves new state-of-the-art performance,
outperforming previous approaches by a large margin.
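
To make the encoder-decoder design described in the abstract concrete, below is a minimal PyTorch sketch of a visual-semantic fusion module feeding a Transformer encoder and an autoregressive GRU decoder. All module names, feature dimensions, the gated-sum fusion strategy, and the class count are illustrative assumptions, not the authors' implementation.

# Minimal sketch (assumed details): fuse visual and semantic features,
# encode the observed sequence with a Transformer, and decode future
# action logits step by step with a GRU.
import torch
import torch.nn as nn


class VisualSemanticFusion(nn.Module):
    """Projects both modalities to a shared space and combines them with a learned gate."""
    def __init__(self, visual_dim, semantic_dim, hidden_dim):
        super().__init__()
        self.vis_proj = nn.Linear(visual_dim, hidden_dim)
        self.sem_proj = nn.Linear(semantic_dim, hidden_dim)
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())

    def forward(self, visual, semantic):
        v, s = self.vis_proj(visual), self.sem_proj(semantic)
        g = self.gate(torch.cat([v, s], dim=-1))
        return g * v + (1 - g) * s  # gated combination of the two modalities


class TransGRUAnticipator(nn.Module):
    """Transformer encoder over observed steps + GRU decoder for future steps."""
    def __init__(self, visual_dim=2048, semantic_dim=300, hidden_dim=512,
                 num_classes=2513, num_layers=2, num_heads=8):
        super().__init__()
        self.fusion = VisualSemanticFusion(visual_dim, semantic_dim, hidden_dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, visual, semantic, num_future_steps=4):
        # visual: (B, T, visual_dim), semantic: (B, T, semantic_dim)
        fused = self.fusion(visual, semantic)              # (B, T, hidden_dim)
        memory = self.encoder(fused)                       # long-term sequential context
        h = memory[:, -1:].transpose(0, 1).contiguous()    # last state initializes the GRU
        inp = memory[:, -1:]                               # seed for autoregressive decoding
        logits = []
        for _ in range(num_future_steps):                  # flexible iterative decoding
            out, h = self.decoder(inp, h)
            logits.append(self.classifier(out))
            inp = out                                      # feed the decoded state back in
        return torch.cat(logits, dim=1)                    # (B, num_future_steps, num_classes)


# Example: anticipate 4 future action steps from 8 observed snippets.
model = TransGRUAnticipator()
vis = torch.randn(2, 8, 2048)   # e.g., CNN features per observed snippet (assumed dim)
sem = torch.randn(2, 8, 300)    # e.g., label-embedding-based semantic features (assumed dim)
print(model(vis, sem).shape)    # torch.Size([2, 4, 2513])

The autoregressive GRU loop is what allows the anticipation horizon to be varied at inference time, while the Transformer encoder processes the whole observed sequence in parallel; that division of labor reflects the "parallel plus autoregressive" motivation stated in the abstract.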
Related papers
- A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks [43.98557963966335]
Model Inversion (MI) attacks aim to reconstruct privacy-sensitive training data from released models by utilizing output information.
Recent advances in generative adversarial networks (GANs) have contributed significantly to the improved performance of MI attacks.
We propose a novel method, Intermediate Features enhanced Generative Model Inversion (IF-GMI), which disassembles the GAN structure and exploits features between intermediate blocks.
arXiv Detail & Related papers (2024-07-18T19:16:22Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes such as pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Rethinking Visual Prompt Learning as Masked Visual Token Modeling [106.71983630652323]
We propose Visual Prompt learning as masked visual Token Modeling (VPTM) to transform the downstream visual classification into the pre-trained masked visual token prediction.
VPTM is the first visual prompt method on the generative pre-trained visual model, which achieves consistency between pre-training and downstream visual classification by task reformulation.
arXiv Detail & Related papers (2023-03-09T02:43:10Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Inducing Semantic Grouping of Latent Concepts for Explanations: An Ante-Hoc Approach [18.170504027784183]
We show that exploiting latent concepts and properly modifying different parts of the model can yield better explanations as well as superior predictive performance.
We also propose a technique that uses two different self-supervision methods to extract meaningful concepts related to the type of self-supervision considered.
arXiv Detail & Related papers (2021-08-25T07:09:57Z)
- TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning [87.38675639186405]
We propose a novel graph neural network approach, called TCL, which deals with the dynamically-evolving graph in a continuous-time fashion.
To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs.
arXiv Detail & Related papers (2021-05-17T15:33:25Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)