Entity-aware and Motion-aware Transformers for Language-driven Action
Localization in Videos
- URL: http://arxiv.org/abs/2205.05854v1
- Date: Thu, 12 May 2022 03:00:40 GMT
- Title: Entity-aware and Motion-aware Transformers for Language-driven Action
Localization in Videos
- Authors: Shuo Yang and Xinxiao Wu
- Abstract summary: We propose entity-aware and motion-aware Transformers that progressively localize actions in videos.
The entity-aware Transformer incorporates the textual entities into visual representation learning.
The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales.
- Score: 29.81187528951681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-driven action localization in videos is a challenging task that
involves not only visual-linguistic matching but also action boundary
prediction. Recent progress has been achieved through aligning language query
to video segments, but estimating precise boundaries is still under-explored.
In this paper, we propose entity-aware and motion-aware Transformers that
progressively localize actions in videos by first coarsely locating clips with
entity queries and then finely predicting exact boundaries in a shrunken
temporal region with motion queries. The entity-aware Transformer incorporates
the textual entities into visual representation learning via cross-modal and
cross-frame attentions to facilitate attending action-related video clips. The
motion-aware Transformer captures fine-grained motion changes at multiple
temporal scales via integrating long short-term memory into the self-attention
module to further improve the precision of action boundary prediction.
Extensive experiments on the Charades-STA and TACoS datasets demonstrate that
our method achieves better performance than existing methods.
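
The core architectural idea above, coarse-to-fine localization in which an LSTM is folded into the self-attention module so that boundary prediction sees fine-grained motion changes, can be sketched as follows. This is a minimal PyTorch illustration rather than the authors' code; the module name MotionAwareAttention, the single temporal scale, and the residual placement of the LSTM are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class MotionAwareAttention(nn.Module):
    # Illustrative sketch: self-attention over clip features whose output is
    # refined by an LSTM to model frame-to-frame motion changes (an assumed
    # design, not the paper's exact formulation).
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, dim) features of video clips
        attn_out, _ = self.attn(clip_feats, clip_feats, clip_feats)
        motion_out, _ = self.lstm(attn_out)        # recurrent motion context
        return self.norm(clip_feats + motion_out)  # residual + normalization

# Toy usage: 2 videos, 64 clips, 256-d features at one temporal scale.
feats = torch.randn(2, 64, 256)
refined = MotionAwareAttention(256)(feats)
print(refined.shape)  # torch.Size([2, 64, 256])
```

In the multi-scale setting described in the abstract, such a block would presumably be applied to clip features pooled at several temporal resolutions, with the refined features feeding the final boundary prediction inside the shrunken temporal region.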
Related papers
- Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment [33.74853437611066]
Weakly-supervised action segmentation is a task of learning to partition a long video into several action segments, where training videos are only accompanied by transcripts.
Most existing methods need to infer pseudo segmentation for training by serially aligning all frames with the transcript.
We propose a novel Action-Transition-Aware Boundary Alignment framework to efficiently and effectively filter out noisy boundaries and detect transitions.
arXiv Detail & Related papers (2024-03-28T08:39:44Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Transformers in Action: Weakly Supervised Action Segmentation [81.18941007536468]
We show how to apply transformers to improve action alignment accuracy over the equivalent RNN-based models.
We also propose a supplemental transcript embedding approach to select transcripts more quickly at inference time.
We evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers.
arXiv Detail & Related papers (2022-01-14T21:15:58Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end Transformer-based framework for TAD, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z)