Related papers: Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos

URL: http://arxiv.org/abs/2107.09504v1
Date: Sun, 18 Jul 2021 16:21:35 GMT
Title: Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos
Authors: Olga Zatsarynna, Yazan Abu Farha and Juergen Gall
Abstract summary: Methods that are accurate but not sufficiently fast would introduce a high latency into the decision process. This poses a problem for domains such as autonomous driving, where the reaction time is crucial. We propose a simple and effective multi-modal architecture based on temporal convolutions.
Score: 22.90184887794109
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Anticipating human actions is an important task that needs to be addressed for the development of reliable intelligent agents, such as self-driving cars or robot assistants. While the ability to make future predictions with high accuracy is crucial for designing the anticipation approaches, the speed at which the inference is performed is no less important. Methods that are accurate but not sufficiently fast would introduce a high latency into the decision process. Thus, this will increase the reaction time of the underlying system. This poses a problem for domains such as autonomous driving, where the reaction time is crucial. In this work, we propose a simple and effective multi-modal architecture based on temporal convolutions. Our approach stacks a hierarchy of temporal convolutional layers and does not rely on recurrent layers to ensure a fast prediction. We further introduce a multi-modal fusion mechanism that captures the pairwise interactions between RGB, flow, and object modalities. Results on two large-scale datasets of egocentric videos, EPIC-Kitchens-55 and EPIC-Kitchens-100, show that our approach achieves comparable performance to the state-of-the-art approaches while being significantly faster.

Related papers

ILNet: Trajectory Prediction with Inverse Learning Attention for Enhancing Intention Capture [4.190790144182306]
It is acknowledged that human drivers dynamically adjust initial driving decisions based on assumptions about the intentions of surrounding vehicles. Motivated by human driving behaviors, this paper proposes ILNet, a multi-agent trajectory prediction method with Inverse Learning (IL) attention and a Dynamic Anchor Selection (DAS) module. Experimental results show that ILNet achieves state-of-the-art performance on the INTERACTION and Argoverse motion forecasting datasets.
arXiv Detail & Related papers (2025-07-09T04:18:01Z)
ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models [21.645510959114326]
A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. ETA advances state-of-the-art performance by 8% with a driving score of 69.53 while maintaining a near-real
arXiv Detail & Related papers (2025-06-09T13:11:02Z)
DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving [62.62464518137153]
DriveTransformer is a simplified E2E-AD framework for the ease of scaling up. It is composed of three unified operations: task self-attention, sensor cross-attention, temporal cross-attention. It achieves state-of-the-art performance in both simulated closed-loop benchmark Bench2Drive and real world open-loop benchmark nuScenes with high FPS.
arXiv Detail & Related papers (2025-03-07T11:41:18Z)
DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
Multi-agent Traffic Prediction via Denoised Endpoint Distribution [23.767783008524678]
Trajectory prediction at high speeds requires historical features and interactions with surrounding entities. We present the Denoised Distribution model for trajectory prediction. Our approach significantly reduces model complexity and performance through endpoint information.
arXiv Detail & Related papers (2024-05-11T15:41:32Z)
AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving [59.94343412438211]
We introduce GPT-style next-token prediction into motion prediction. Unlike language data, which is composed of homogeneous units (words), the elements in a driving scene can have complex spatial-temporal and semantic relations. We propose to adopt three factorized attention modules with different neighbors for information aggregation and different position encoding styles to capture their relations.
arXiv Detail & Related papers (2024-03-20T06:22:37Z)
Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation [58.21683603243387]
We propose three auxiliary tasks with relational-temporal reasoning and integrate them into the standard Deep Learning framework. These auxiliary tasks provide additional supervision signals to infer the behavior patterns of other interactive agents. Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics.
arXiv Detail & Related papers (2023-11-27T18:57:42Z)
PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving [57.89801036693292]
PPAD (Iterative Interaction of Prediction and Planning Autonomous Driving) considers the timestep-wise interaction to better integrate prediction and planning. We design ego-to-agent, ego-to-map, and ego-to-BEV interaction mechanisms with hierarchical dynamic key objects attention to better model the interactions.
arXiv Detail & Related papers (2023-11-14T11:53:24Z)
Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding [121.08841110022607]
Existing agent-centric methods have demonstrated outstanding performance on public benchmarks. We introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers. By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods.
arXiv Detail & Related papers (2023-10-19T17:59:01Z)
ProphNet: Efficient Agent-Centric Motion Forecasting with Anchor-Informed Proposals [6.927103549481412]
Motion forecasting is a key module in an autonomous driving system. Due to the heterogeneous nature of multi-sourced input, multimodality in agent behavior, and low latency required by onboard deployment, this task is notoriously challenging. This paper proposes a novel agent-centric model with anchor-informed proposals for efficient multimodal motion prediction.
arXiv Detail & Related papers (2023-03-21T17:58:28Z)
SPOTR: Spatio-temporal Pose Transformers for Human Motion Prediction [12.248428883804763]
3D human motion prediction is a research area of high significance and a challenge in computer vision. Traditionally, autoregressive models have been used to predict human motion. We present a non-autoregressive model for human motion prediction.
arXiv Detail & Related papers (2023-03-11T01:44:29Z)
SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric Videos [2.6572330982240935]
We build upon RULSTM architecture, which is specifically designed for anticipating human actions. We propose a novel attention-based technique to evaluate, simultaneously, slow and fast features extracted from three different modalities. Two branches process information at different time scales, i.e., frame-rates, and several fusion schemes are considered to improve prediction accuracy.
arXiv Detail & Related papers (2021-09-02T10:20:18Z)
Temporal Pyramid Network for Pedestrian Trajectory Prediction with Multi-Supervision [27.468166556263256]
We propose a temporal pyramid network for pedestrian trajectory prediction through a squeeze modulation and a dilation modulation. Our hierarchical framework builds a feature pyramid with increasingly richer temporal information from top to bottom, which can better capture the motion behavior at various tempos. By progressively merging the top coarse features of global context to the bottom fine features of rich local context, our method can fully exploit both the long-range and short-range information of the trajectory.
arXiv Detail & Related papers (2020-12-03T13:02:59Z)
A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction [74.00750936752418]
We propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC). First, a spatial-temporal attention mechanism is presented to explore the most useful and important information. Second, we conduct a joint feature sequence based on the sequence and instant state information to make the generative trajectories keep spatial continuity.
arXiv Detail & Related papers (2020-03-13T04:35:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.