Streaming Video Temporal Action Segmentation In Real Time
- URL: http://arxiv.org/abs/2209.13808v3
- Date: Tue, 10 Oct 2023 03:00:48 GMT
- Title: Streaming Video Temporal Action Segmentation In Real Time
- Authors: Wujun Wen, Yunheng Li, Zhuben Dong, Lin Feng, Wanxiao Yang, Shenlan Liu
- Abstract summary: We propose a real-time, end-to-end, multi-modality model for the streaming video real-time temporal action segmentation (SVTAS) task.
Our model segments human actions in real time with less than 40% of the computation of the state-of-the-art model and achieves 90% of the accuracy of the full-video state-of-the-art model.
- Score: 2.8728707559692475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action segmentation (TAS) is a critical step toward long-term video understanding. Recent studies follow a pattern of building models on pre-extracted features rather than on raw video frames. However, we argue that such models are complicated to train and limited in their application scenarios: they cannot segment human actions in video in real time, because they can only run after the features of the full video have been extracted. Since real-time action segmentation differs from the TAS task, we define it as the streaming video real-time temporal action segmentation (SVTAS) task. In this paper, we propose a real-time, end-to-end, multi-modality model for the SVTAS task. More specifically, without access to any future information, we segment the human action of the current streaming video chunk in real time. Furthermore, the proposed model combines the feature of the last streaming video chunk, extracted by a language model, with the feature of the current frames, extracted by an image model, to improve the quality of real-time temporal action segmentation. To the best of our knowledge, it is the first multi-modality real-time temporal action segmentation model. Under the same evaluation criteria as full-video temporal action segmentation, our model segments human actions in real time with less than 40% of the computation of the state-of-the-art model while achieving 90% of the accuracy of the full-video state-of-the-art model.
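The abstract describes chunk-wise, causal inference that fuses a language-model feature of the previous streaming chunk with an image-model feature of the current chunk. The PyTorch sketch below illustrates that fusion loop under stated assumptions: the encoders, the linear fusion head, and the use of text describing the previous chunk are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn
from typing import Optional


class StreamingFusionSegmenter(nn.Module):
    """Illustrative SVTAS-style chunk segmenter (a sketch, not the paper's code).

    Fuses the image feature of the current streaming chunk with a
    language-model feature derived from the previous chunk, so no
    future frames are ever accessed.
    """

    def __init__(self, img_encoder: nn.Module, txt_encoder: nn.Module,
                 feat_dim: int, num_classes: int):
        super().__init__()
        self.img_encoder = img_encoder            # e.g. a frame-level image backbone (assumed)
        self.txt_encoder = txt_encoder            # e.g. the text branch of a vision-language model (assumed)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.cls = nn.Linear(feat_dim, num_classes)

    @torch.no_grad()
    def forward_chunk(self, frames: torch.Tensor,
                      prev_chunk_tokens: Optional[torch.Tensor] = None) -> torch.Tensor:
        """frames: (T, C, H, W) frames of the current chunk only.
        prev_chunk_tokens: tokenized text for the previous chunk (an assumed
        input, e.g. its predicted action name); None for the first chunk.
        """
        img_feat = self.img_encoder(frames).mean(dim=0)       # (feat_dim,) pooled over the chunk
        if prev_chunk_tokens is None:
            txt_feat = torch.zeros_like(img_feat)             # no history at the first chunk
        else:
            txt_feat = self.txt_encoder(prev_chunk_tokens)    # (feat_dim,) feature of the last chunk
        fused = self.fuse(torch.cat([img_feat, txt_feat], dim=-1))
        return self.cls(torch.relu(fused))                    # action logits for the current chunk
```

Processing the video one chunk at a time in this way keeps the per-step computation bounded, which is consistent with the reported sub-40% compute relative to full-video models.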
Related papers
- Top-down Activity Representation Learning for Video Question Answering [4.236280446793381]
Capturing complex hierarchical human activities is crucial for achieving high-performance video question answering (VideoQA).
We convert long-term video sequences into a spatial image domain and finetune the multimodal model LLaVA for the VideoQA task.
Our approach achieves competitive performance on the STAR task; in particular, it attains a 78.4% accuracy score, exceeding the current state-of-the-art on the NExTQA task by 2.8 points.
arXiv Detail & Related papers (2024-09-12T04:43:27Z)
- SAM 2: Segment Anything in Images and Videos [63.44869623822368]
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos.
We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Our model is a simple transformer architecture with streaming memory for real-time video processing.
arXiv Detail & Related papers (2024-08-01T17:00:08Z)
- Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
- How Much Temporal Long-Term Context is Needed for Action Segmentation? [16.89998201009075]
We introduce a transformer-based model that leverages sparse attention to capture the full context of a video.
Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
arXiv Detail & Related papers (2023-08-22T11:20:40Z)
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization [52.234877003211814]
We introduce TemporalMaxer, which minimizes long-term temporal context modeling while maximizing information from the extracted video clip features.
We demonstrate that TemporalMaxer outperforms other state-of-the-art methods that utilize long-term temporal context modeling; a minimal sketch of the max-pooling idea appears after this list.
arXiv Detail & Related papers (2023-03-16T03:11:26Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)
- Scene Consistency Representation Learning for Video Scene Segmentation [26.790491577584366]
We propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from long-term videos.
We present an SSL scheme to achieve scene consistency, while exploring considerable data augmentation and shuffling methods to boost the model generalizability.
Our method achieves state-of-the-art performance on the task of Video Scene Segmentation.
arXiv Detail & Related papers (2022-05-11T13:31:15Z)
- Activity Graph Transformer for Temporal Action Localization [41.69734359113706]
We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization.
In this work, we capture this non-linear temporal structure by reasoning over the videos as non-sequential entities in the form of graphs.
Our results show that our proposed model outperforms the state-of-the-art by a considerable margin.
arXiv Detail & Related papers (2021-01-21T10:42:48Z)
- Long Short-Term Relation Networks for Video Action Detection [155.13392337831166]
Long Short-Term Relation Networks (LSTR) are presented in this paper.
LSTR aggregates and propagates relations to augment features for video action detection.
Extensive experiments are conducted on four benchmark datasets.
arXiv Detail & Related papers (2020-03-31T10:02:51Z)
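Returning to the TemporalMaxer entry above, the sketch below illustrates the "only max pooling" idea on pre-extracted clip features: a local temporal max pool stands in for long-range context modeling before a per-clip classification head. The layer sizes, the 1x1 projection, and the head are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn


class MaxPoolTemporalBlock(nn.Module):
    """Sketch of a TemporalMaxer-style block: local temporal max pooling
    over clip features instead of long-term attention (sizes are assumed)."""

    def __init__(self, feat_dim: int, num_classes: int, kernel_size: int = 3):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)     # per-clip feature projection
        self.pool = nn.MaxPool1d(kernel_size, stride=1,
                                 padding=kernel_size // 2)           # keeps the strongest local response
        self.head = nn.Conv1d(feat_dim, num_classes, kernel_size=1)  # per-clip classifier

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        """clip_feats: (B, feat_dim, T) features from a frozen clip backbone."""
        x = torch.relu(self.proj(clip_feats))
        x = self.pool(x)       # temporal length T is preserved for odd kernel sizes
        return self.head(x)    # (B, num_classes, T) per-clip action scores
```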