Multi-Modal Temporal Convolutional Network for Anticipating Actions in
Egocentric Videos
- URL: http://arxiv.org/abs/2107.09504v1
- Date: Sun, 18 Jul 2021 16:21:35 GMT
- Title: Multi-Modal Temporal Convolutional Network for Anticipating Actions in
Egocentric Videos
- Authors: Olga Zatsarynna, Yazan Abu Farha and Juergen Gall
- Abstract summary: Methods that are accurate but not sufficiently fast would introduce a high latency into the decision process.
This poses a problem for domains such as autonomous driving, where the reaction time is crucial.
We propose a simple and effective multi-modal architecture based on temporal convolutions.
- Score: 22.90184887794109
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Anticipating human actions is an important task that needs to be addressed
for the development of reliable intelligent agents, such as self-driving cars
or robot assistants. While the ability to make future predictions with high
accuracy is crucial for designing the anticipation approaches, the speed at
which the inference is performed is no less important. Methods that are
accurate but not sufficiently fast introduce high latency into the decision
process and thereby increase the reaction time of the underlying system. This
poses a problem for domains such as autonomous driving, where the
reaction time is crucial. In this work, we propose a simple and effective
multi-modal architecture based on temporal convolutions. Our approach stacks a
hierarchy of temporal convolutional layers and does not rely on recurrent
layers to ensure a fast prediction. We further introduce a multi-modal fusion
mechanism that captures the pairwise interactions between RGB, flow, and object
modalities. Results on two large-scale datasets of egocentric videos,
EPIC-Kitchens-55 and EPIC-Kitchens-100, show that our approach achieves
comparable performance to the state-of-the-art approaches while being
significantly faster.
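Since the abstract describes the architecture only at a high level, here is a minimal PyTorch sketch of its two named ingredients: a recurrence-free stack of dilated temporal convolutions, and a fusion step over the pairwise interactions of RGB, flow, and object features. The layer sizes, the dilation schedule, and the concatenation-based fusion are assumptions for illustration, not the authors' exact design.

```python
# Minimal sketch: dilated temporal convolutions (no recurrence) plus pairwise
# fusion of three modality streams. Sizes and fusion form are assumptions.
import torch
import torch.nn as nn


class TemporalConvStack(nn.Module):
    """Hierarchy of 1D convolutions with doubling dilation (no recurrence)."""

    def __init__(self, dim: int, num_layers: int = 4):
        super().__init__()
        layers = []
        for i in range(num_layers):
            d = 2 ** i  # receptive field grows exponentially with depth
            layers += [nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, dim, time)
        return self.net(x)


class PairwiseFusion(nn.Module):
    """Fuses RGB, flow, and object streams via their pairwise interactions."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.pair = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(3))
        self.head = nn.Linear(3 * dim, num_classes)

    def forward(self, rgb, flow, obj):  # each: (batch, dim)
        pairs = [(rgb, flow), (rgb, obj), (flow, obj)]
        fused = [torch.relu(f(torch.cat(p, dim=-1)))
                 for f, p in zip(self.pair, pairs)]
        return self.head(torch.cat(fused, dim=-1))


# Toy usage: 8 observed frames per modality, 512-d features; 2513 is the
# EPIC-Kitchens-55 action-class count, used here purely as an example.
if __name__ == "__main__":
    B, D, T = 2, 512, 8
    streams = {m: TemporalConvStack(D) for m in ("rgb", "flow", "obj")}
    fusion = PairwiseFusion(D, num_classes=2513)
    feats = {m: streams[m](torch.randn(B, D, T)).mean(dim=-1) for m in streams}
    print(fusion(feats["rgb"], feats["flow"], feats["obj"]).shape)  # (2, 2513)
```

Because every layer is a convolution, the whole observed window is processed in one parallel pass; this is what lets such a model avoid the sequential state updates that make recurrent anticipation models slow at inference time.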
Related papers
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- Multi-agent Traffic Prediction via Denoised Endpoint Distribution [23.767783008524678]
Trajectory prediction at high speeds requires historical features and interactions with surrounding entities.
We present the Denoised Endpoint Distribution model for trajectory prediction.
Our approach significantly reduces model complexity and improves performance by exploiting endpoint information.
arXiv Detail & Related papers (2024-05-11T15:41:32Z)
- AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving [59.94343412438211]
We introduce GPT-style next-token prediction into motion prediction.
Unlike language data, which is composed of homogeneous units (words), the elements of a driving scene can have complex spatial-temporal and semantic relations.
We therefore adopt three factorized attention modules with different neighbor sets for information aggregation and different position-encoding styles to capture these relations.
arXiv Detail & Related papers (2024-03-20T06:22:37Z)
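As a companion to the AMP entry above, the following toy sketch shows the generic GPT-style next-token formulation it refers to: past motion is discretized into tokens and a causally masked transformer predicts the next one. The tokenizer, vocabulary size, and model configuration are invented for illustration and are not the paper's design.

```python
# Toy sketch of GPT-style next-token motion prediction. The discretization,
# vocabulary size, and model config are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB = 256    # assumed number of discrete motion tokens
CONTEXT = 32   # assumed history length in tokens


class NextTokenMotionModel(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.pos = nn.Embedding(CONTEXT, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):  # tokens: (batch, time) of motion-token ids
        t = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        x = self.encoder(x, mask=mask)  # causal mask: attend to the past only
        return self.head(x)             # logits over the next motion token


tokens = torch.randint(0, VOCAB, (2, CONTEXT))
print(NextTokenMotionModel()(tokens).shape)  # torch.Size([2, 32, 256])
```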
- Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation [58.21683603243387]
We propose three auxiliary tasks with relational-temporal reasoning and integrate them into the standard deep learning framework.
These auxiliary tasks provide additional supervision signals to infer the behavior patterns of other interactive agents.
Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics.
arXiv Detail & Related papers (2023-11-27T18:57:42Z)
- PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving [57.89801036693292]
PPAD (Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving) considers timestep-wise interaction to better integrate prediction and planning.
We design ego-to-agent, ego-to-map, and ego-to-BEV interaction mechanisms with hierarchical dynamic key objects attention to better model the interactions.
arXiv Detail & Related papers (2023-11-14T11:53:24Z)
- Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding [121.08841110022607]
Existing agent-centric methods have demonstrated outstanding performance on public benchmarks.
We introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers.
By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods.
arXiv Detail & Related papers (2023-10-19T17:59:01Z)
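The KNARPE entry above names a concrete mechanism, so a compact sketch may help: each agent attends only to its K nearest neighbors, with a pairwise-relative pose embedding added to the keys and values. The feature sizes and the MLP used to embed relative poses are illustrative assumptions, not the paper's exact formulation.

```python
# Compact sketch of K-nearest-neighbor attention with relative pose encoding.
# Feature sizes and the relative-pose embedding MLP are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KNNRelPoseAttention(nn.Module):
    def __init__(self, dim: int = 64, k: int = 8):
        super().__init__()
        self.k = k
        self.q, self.kv = nn.Linear(dim, dim), nn.Linear(dim, 2 * dim)
        self.rel = nn.Sequential(nn.Linear(3, dim), nn.ReLU())  # (dx, dy, dheading)

    def forward(self, feats, poses):
        # feats: (N, dim) per-agent features; poses: (N, 3) as (x, y, heading)
        dist = torch.cdist(poses[:, :2], poses[:, :2])    # (N, N) planar distances
        idx = dist.topk(self.k, largest=False).indices    # K nearest (includes self)
        rel = poses[idx] - poses[:, None, :]              # pairwise-relative poses
        kv = self.kv(feats)[idx] + self.rel(rel).repeat(1, 1, 2)  # inject rel pose
        k, v = kv.chunk(2, dim=-1)                        # (N, K, dim) each
        attn = F.softmax((self.q(feats)[:, None] * k).sum(-1) / k.size(-1) ** 0.5, -1)
        return (attn[..., None] * v).sum(1)               # (N, dim) updated features


feats, poses = torch.randn(16, 64), torch.randn(16, 3)
print(KNNRelPoseAttention()(feats, poses).shape)  # torch.Size([16, 64])
```

Because all geometry enters through pairwise-relative poses, the representation is invariant to the global frame, which is what lets contexts be shared and reused across agents as the summary describes.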
- ProphNet: Efficient Agent-Centric Motion Forecasting with Anchor-Informed Proposals [6.927103549481412]
Motion forecasting is a key module in an autonomous driving system.
Due to the heterogeneous nature of multi-sourced input, multimodality in agent behavior, and low latency required by onboard deployment, this task is notoriously challenging.
This paper proposes a novel agent-centric model with anchor-informed proposals for efficient multimodal motion prediction.
arXiv Detail & Related papers (2023-03-21T17:58:28Z)
- SPOTR: Spatio-temporal Pose Transformers for Human Motion Prediction [12.248428883804763]
3D human motion prediction is a research area of high significance and a challenging problem in computer vision.
Traditionally, autoregressive models have been used to predict human motion.
We present a non-autoregressive model for human motion prediction.
arXiv Detail & Related papers (2023-03-11T01:44:29Z)
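To make the autoregressive vs. non-autoregressive distinction in the SPOTR entry concrete, here is a minimal sketch of a non-autoregressive decoder: all future poses are produced in one parallel pass from learned per-frame queries, rather than frame by frame. The horizon, joint count, and layer sizes are assumptions, not the paper's configuration.

```python
# Minimal sketch of non-autoregressive motion prediction: decode all future
# poses in parallel from learned queries. Sizes are illustrative assumptions.
import torch
import torch.nn as nn


class NonAutoregressiveDecoder(nn.Module):
    def __init__(self, dim: int = 128, horizon: int = 25, joints: int = 22):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(horizon, dim))  # one per future frame
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(dim, joints * 3)  # 3D coordinates per joint

    def forward(self, history):  # history: (batch, t_obs, dim) encoded past poses
        q = self.queries.expand(history.size(0), -1, -1)
        x = self.decoder(q, history)  # single parallel pass, no causal mask
        return self.out(x)            # (batch, horizon, joints * 3)


history = torch.randn(2, 10, 128)
print(NonAutoregressiveDecoder()(history).shape)  # torch.Size([2, 25, 66])
```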
- SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric Videos [2.6572330982240935]
We build upon the RULSTM architecture, which is specifically designed for anticipating human actions.
We propose a novel attention-based technique to evaluate, simultaneously, slow and fast features extracted from three different modalities.
Two branches process information at different time scales, i.e., frame-rates, and several fusion schemes are considered to improve prediction accuracy.
arXiv Detail & Related papers (2021-09-02T10:20:18Z)
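The SlowFast entry above describes two branches operating at different frame rates; the following minimal sketch illustrates that idea with a subsampled slow stream and a full-rate fast stream fused before classification. The strides, sizes, and late-fusion choice are assumptions, not the paper's exact scheme.

```python
# Minimal sketch of two-branch processing at different frame rates with late
# fusion. Strides, sizes, and the fusion form are illustrative assumptions.
import torch
import torch.nn as nn


class SlowFastAnticipation(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 100, slow_stride: int = 4):
        super().__init__()
        self.slow_stride = slow_stride
        self.slow = nn.LSTM(dim, dim, batch_first=True)  # subsampled, low frame rate
        self.fast = nn.LSTM(dim, dim, batch_first=True)  # all frames, full rate
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, frames):  # frames: (batch, time, dim) per-frame features
        _, (h_slow, _) = self.slow(frames[:, :: self.slow_stride])  # slow stream
        _, (h_fast, _) = self.fast(frames)                          # fast stream
        return self.head(torch.cat([h_slow[-1], h_fast[-1]], dim=-1))


frames = torch.randn(2, 16, 256)
print(SlowFastAnticipation()(frames).shape)  # torch.Size([2, 100])
```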
- Temporal Pyramid Network for Pedestrian Trajectory Prediction with Multi-Supervision [27.468166556263256]
We propose a temporal pyramid network for pedestrian trajectory prediction through a squeeze modulation and a dilation modulation.
Our hierarchical framework builds a feature pyramid with increasingly richer temporal information from top to bottom, which can better capture the motion behavior at various tempos.
By progressively merging the top coarse features of global context to the bottom fine features of rich local context, our method can fully exploit both the long-range and short-range information of the trajectory.
arXiv Detail & Related papers (2020-12-03T13:02:59Z)
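The temporal-pyramid entry describes a coarse-to-fine feature hierarchy merged top-down; below is a minimal sketch of that structure, with average pooling standing in for the paper's squeeze modulation and upsample-and-add standing in for its top-down merging. Both substitutions are illustrative assumptions.

```python
# Minimal sketch of a temporal feature pyramid with top-down merging. Pooling
# and upsample-and-add are stand-ins for the paper's modulation mechanisms.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalPyramid(nn.Module):
    def __init__(self, dim: int = 64, levels: int = 3):
        super().__init__()
        self.levels = levels
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1) for _ in range(levels))

    def forward(self, x):  # x: (batch, dim, time) trajectory features
        # Bottom-up: halve temporal resolution per level (coarser context on top).
        feats = [x]
        for _ in range(self.levels - 1):
            feats.append(F.avg_pool1d(feats[-1], kernel_size=2))
        # Top-down: upsample coarse features and add them into finer levels.
        out = self.convs[-1](feats[-1])
        for lvl in range(self.levels - 2, -1, -1):
            up = F.interpolate(out, size=feats[lvl].size(-1), mode="linear",
                               align_corners=False)
            out = self.convs[lvl](feats[lvl]) + up
        return out  # (batch, dim, time): fine features enriched with global context


x = torch.randn(2, 64, 20)
print(TemporalPyramid()(x).shape)  # torch.Size([2, 64, 20])
```

The top-down additions give the finest level access to long-range context while preserving its local detail, matching the entry's claim of exploiting both long-range and short-range trajectory information.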
- A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction [74.00750936752418]
We propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC).
First, a spatial-temporal attention mechanism is presented to explore the most useful and important information.
Second, we build a joint feature sequence from the sequence and instant state information so that the generated trajectories maintain spatial continuity.
arXiv Detail & Related papers (2020-03-13T04:35:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.