How Much Temporal Long-Term Context is Needed for Action Segmentation?
- URL: http://arxiv.org/abs/2308.11358v2
- Date: Mon, 25 Sep 2023 14:58:59 GMT
- Title: How Much Temporal Long-Term Context is Needed for Action Segmentation?
- Authors: Emad Bahrami, Gianpiero Francesca, Juergen Gall
- Abstract summary: We introduce a transformer-based model that leverages sparse attention to capture the full context of a video.
Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
- Score: 16.89998201009075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modeling long-term context in videos is crucial for many fine-grained tasks
including temporal action segmentation. An interesting question that is still
open is how much long-term temporal context is needed for optimal performance.
While transformers can model the long-term context of a video, this becomes
computationally prohibitive for long videos. Recent works on temporal action
segmentation thus combine temporal convolutional networks with self-attentions
that are computed only for a local temporal window. While these approaches show
good results, their performance is limited by their inability to capture the
full context of a video. In this work, we try to answer how much long-term
temporal context is required for temporal action segmentation by introducing a
transformer-based model that leverages sparse attention to capture the full
context of a video. We compare our model with the current state of the art on
three datasets for temporal action segmentation, namely 50Salads, Breakfast,
and Assembly101. Our experiments show that modeling the full context of a video
is necessary to obtain the best performance for temporal action segmentation.
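To make the core idea concrete, below is a minimal sketch of sparse self-attention over per-frame features, combining a local window with a strided set of "global" positions so that every frame can attend across the full video. This is not the paper's actual architecture; the function names, window size, and stride are illustrative assumptions.

```python
# Minimal sketch of sparse self-attention over per-frame video features.
# Window size and stride are illustrative; this is not the paper's exact model.
import torch
import torch.nn.functional as F

def sparse_attention_mask(num_frames: int, window: int = 64, stride: int = 64) -> torch.Tensor:
    """Boolean mask: each frame attends to a local window plus strided global frames."""
    idx = torch.arange(num_frames)
    local = (idx[None, :] - idx[:, None]).abs() <= window // 2     # local neighborhood
    strided = (idx[None, :] % stride == 0).expand(num_frames, -1)  # sparse global positions
    return local | strided

def sparse_attention(x: torch.Tensor, window: int = 64, stride: int = 64) -> torch.Tensor:
    """x: (num_frames, dim) per-frame features; returns attended features of the same shape."""
    t, d = x.shape
    q, k, v = x, x, x                              # single head, no projections, for brevity
    scores = q @ k.t() / d ** 0.5                  # dense scores, then masked to the sparse pattern;
    mask = sparse_attention_mask(t, window, stride)  # an efficient implementation would compute
    scores = scores.masked_fill(~mask, float("-inf"))  # only the unmasked entries
    return F.softmax(scores, dim=-1) @ v

# Example: 2,000 frames of 64-d features from a long, untrimmed video.
frames = torch.randn(2_000, 64)
out = sparse_attention(frames)   # (2000, 64)
```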
Related papers
- Top-down Activity Representation Learning for Video Question Answering [4.236280446793381]
Capturing complex hierarchical human activities is crucial for achieving high-performance video question answering (VideoQA).
We convert long-term video sequences into a spatial image domain and finetune the multimodal model LLaVA for the VideoQA task.
Our approach achieves competitive performance on the STAR task, in particular a 78.4% accuracy score, and exceeds the current state-of-the-art score by 2.8 points on the NExTQA task.
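The entry above describes converting a long video into the spatial image domain; one plausible way to do this is tiling uniformly sampled frames into a single mosaic image that an image-domain multimodal model can consume. The sketch below illustrates only that idea; the function name, 4x4 grid, and uniform sampling are assumptions, not details from the paper.

```python
# Sketch: convert a long video into one spatial image (a grid of sampled frames)
# as input for an image-domain multimodal model. Grid size and sampling are assumptions.
import numpy as np

def frames_to_grid(frames: np.ndarray, rows: int = 4, cols: int = 4) -> np.ndarray:
    """frames: (num_frames, H, W, 3) uint8; returns one (rows*H, cols*W, 3) mosaic image."""
    num_frames, h, w, c = frames.shape
    idx = np.linspace(0, num_frames - 1, rows * cols).astype(int)  # uniform temporal sampling
    sampled = frames[idx].reshape(rows, cols, h, w, c)
    # Stitch the sampled frames into a rows x cols mosaic.
    return sampled.transpose(0, 2, 1, 3, 4).reshape(rows * h, cols * w, c)

video = np.random.randint(0, 256, size=(300, 224, 224, 3), dtype=np.uint8)
grid = frames_to_grid(video)   # (896, 896, 3), ready for an image encoder
```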
arXiv Detail & Related papers (2024-09-12T04:43:27Z)
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
- Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
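To contrast with the fixed-length clips criticized above, here is a toy sketch of adaptive, content-based segmentation: boundaries are placed where consecutive frame features change the most. This greedy heuristic is only a stand-in for kernel temporal segmentation, which solves boundary placement with a kernel matrix and dynamic programming; the function names and segment count are assumptions.

```python
# Toy sketch: uniform fixed-length clips vs. adaptive boundaries at feature change points.
# The greedy heuristic is a stand-in, not the actual kernel temporal segmentation algorithm.
import numpy as np

def uniform_clips(num_frames: int, clip_len: int = 64) -> list[tuple[int, int]]:
    """Fixed-length clips, regardless of content."""
    return [(s, min(s + clip_len, num_frames)) for s in range(0, num_frames, clip_len)]

def adaptive_clips(features: np.ndarray, num_segments: int = 10) -> list[tuple[int, int]]:
    """Place boundaries at the largest feature changes between consecutive frames."""
    diffs = np.linalg.norm(np.diff(features, axis=0), axis=1)     # (num_frames - 1,)
    cuts = np.sort(np.argsort(diffs)[-(num_segments - 1):]) + 1   # top change points
    bounds = [0, *cuts.tolist(), len(features)]
    return list(zip(bounds[:-1], bounds[1:]))

feats = np.random.randn(5000, 512)    # per-frame features of a long video
print(uniform_clips(5000)[:3], adaptive_clips(feats)[:3])
```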
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
- TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization [52.234877003211814]
We introduce TemporalMaxer, which minimizes long-term temporal context modeling while maximizing information from the extracted video clip features.
We demonstrate that TemporalMaxer outperforms other state-of-the-art methods that utilize long-term temporal context modeling.
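As a rough illustration of replacing long-term context modeling with pooling, the following sketch max-pools a sequence of extracted clip features along time; the kernel size, stride, and function name are assumptions, not details from the TemporalMaxer paper.

```python
# Rough illustration: max pooling over extracted clip features instead of
# long-term temporal context modeling. Kernel and stride values are assumptions.
import torch

def max_pool_clip_features(features: torch.Tensor, kernel: int = 3, stride: int = 2) -> torch.Tensor:
    """features: (num_clips, dim) -> temporally downsampled (num_clips', dim)."""
    x = features.t().unsqueeze(0)                   # (1, dim, num_clips) for 1-D pooling
    pooled = torch.nn.functional.max_pool1d(x, kernel_size=kernel, stride=stride)
    return pooled.squeeze(0).t()                    # back to (num_clips', dim)

clips = torch.randn(1000, 256)                      # features extracted by a clip encoder
print(max_pool_clip_features(clips).shape)          # torch.Size([499, 256])
```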
arXiv Detail & Related papers (2023-03-16T03:11:26Z)
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos [60.86880787242561]
Video temporal grounding aims to pinpoint a video segment that matches the query description.
We propose an end-to-end framework for fast temporal grounding, which is able to model an hours-long video with one-time network execution.
Our method significantly outperforms the state of the art and achieves 14.6× / 102.8× higher efficiency, respectively.
arXiv Detail & Related papers (2023-03-15T03:54:43Z)
- TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
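As a loose sketch of the "cluster, then decode into coherent segments" step described above: frame embeddings are clustered with k-means and consecutive frames sharing a cluster label are merged into segments. The embedding network itself is omitted, and k-means plus run-length merging are stand-in choices, not the paper's exact decoding procedure.

```python
# Loose sketch: cluster frame embeddings, then decode the video into contiguous
# segments by merging consecutive frames that share a cluster label.
# KMeans and run-length merging are stand-ins, not the paper's exact decoding.
import numpy as np
from sklearn.cluster import KMeans

def decode_segments(embeddings: np.ndarray, num_actions: int = 8) -> list[tuple[int, int, int]]:
    """embeddings: (num_frames, dim) -> list of (start, end, cluster_id) segments."""
    labels = KMeans(n_clusters=num_actions, n_init=10).fit_predict(embeddings)
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t, int(labels[start])))
            start = t
    return segments

emb = np.random.randn(2000, 128)   # per-frame embeddings from a temporal embedding network
print(decode_segments(emb)[:5])
```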
arXiv Detail & Related papers (2023-03-09T10:46:23Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves state-of-the-art results.
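A compact sketch of a two-level contrastive objective in this spirit follows: an InfoNCE-style loss is applied both to clip/text pairs and to pooled video/summary pairs, then summed. The mean pooling, temperature, and equal weighting are assumptions, not HierVL's exact training objective.

```python
# Compact sketch of a hierarchical (clip-level + video-level) contrastive objective.
# Mean pooling, temperature, and equal weighting are assumptions, not HierVL's exact loss.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between matched rows of a and b, each (batch, dim)."""
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Clip level: (videos, clips_per_video, dim) visual features and their narration embeddings.
clip_vis = torch.randn(4, 16, 256)
clip_txt = torch.randn(4, 16, 256)
# Video level: mean-pool the clips and pair them with a video-level summary embedding.
video_vis, video_txt = clip_vis.mean(dim=1), torch.randn(4, 256)

loss = info_nce(clip_vis.flatten(0, 1), clip_txt.flatten(0, 1)) + info_nce(video_vis, video_txt)
```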
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- Streaming Video Temporal Action Segmentation In Real Time [2.8728707559692475]
We propose a real-time, end-to-end, multi-modality model for the streaming temporal action segmentation task.
Our model segments human actions in real time with less than 40% of the computation of the state-of-the-art model, while achieving 90% of the accuracy of the full-video state-of-the-art model.
arXiv Detail & Related papers (2022-09-28T03:27:37Z)
- Long Short-Term Relation Networks for Video Action Detection [155.13392337831166]
Long Short-Term Relation Networks (LSTR) are presented in this paper.
LSTR aggregates and propagates relations to augment features for video action detection.
Extensive experiments are conducted on four benchmark datasets.
arXiv Detail & Related papers (2020-03-31T10:02:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.