Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation
- URL: http://arxiv.org/abs/2205.13425v1
- Date: Thu, 26 May 2022 15:30:34 GMT
- Title: Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation
- Authors: Dazhao Du, Bing Su, Yu Li, Zhongang Qi, Lingyu Si, Ying Shan
- Abstract summary: We design a pure Transformer-based model without temporal convolutions by incorporating the U-Net architecture.
We propose a boundary-aware loss based on the distribution of similarity scores between frames from attention modules to enhance the ability to recognize boundaries.
- Score: 34.502472072265164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action classification has made great progress, but segmenting and recognizing
actions from long untrimmed videos remains a challenging problem. Most
state-of-the-art methods focus on designing temporal convolution-based models,
but the limited ability of temporal convolutions to model long-term dependencies
and their inflexibility constrain the potential of these models.
Recently, Transformer-based models with flexible and strong sequence modeling
ability have been applied in various tasks. However, the lack of inductive bias
and the inefficiency of handling long video sequences limit the application of
Transformer in action segmentation. In this paper, we design a pure
Transformer-based model without temporal convolutions by incorporating the
U-Net architecture. The U-Transformer architecture reduces complexity while
introducing an inductive bias that adjacent frames are more likely to belong to
the same class, but the introduction of coarse resolutions results in the
misclassification of boundaries. We observe that the similarity distribution
between a boundary frame and its neighboring frames depends on whether the
boundary frame is the start or end of an action segment. Therefore, we further
propose a boundary-aware loss based on the distribution of similarity scores
between frames from attention modules to enhance the ability to recognize
boundaries. Extensive experiments show the effectiveness of our model.
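The abstract's key observation is that a boundary frame's similarity to its neighbors is asymmetric: a segment-start frame resembles the frames after it, while a segment-end frame resembles the frames before it. As a minimal illustrative sketch (not the authors' actual loss; the feature extraction, window size, and scoring function here are assumptions), one can measure this asymmetry from frame-to-frame cosine similarities:

```python
import numpy as np

def neighbor_similarity(features, window=2):
    """Cosine similarity between each frame and its temporal neighbors.

    features: (T, D) array of per-frame embeddings (e.g. taken from an
    attention module). Returns a (T, 2*window) matrix whose columns hold
    the similarity to frames at offsets -window..-1, +1..+window
    (indices clipped at the sequence ends).
    """
    T, _ = features.shape
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    offsets = [o for o in range(-window, window + 1) if o != 0]
    sims = np.zeros((T, len(offsets)))
    for j, o in enumerate(offsets):
        idx = np.clip(np.arange(T) + o, 0, T - 1)
        sims[:, j] = np.sum(f * f[idx], axis=1)
    return sims

def boundary_asymmetry(sims, window=2):
    """Mean similarity to future neighbors minus mean similarity to past ones.

    A segment-start frame resembles its future more than its past
    (positive score); a segment-end frame shows the opposite sign;
    frames deep inside a segment score near zero.
    """
    past = sims[:, :window].mean(axis=1)
    future = sims[:, window:].mean(axis=1)
    return future - past
```

On a toy sequence of two constant segments, the first frame of the second segment gets a strongly positive score and the last frame of the first segment a strongly negative one, which is the kind of signal a boundary-aware loss could supervise.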
Related papers
- Error-Propagation-Free Learned Video Compression With Dual-Domain Progressive Temporal Alignment [92.57576987521107]
We propose a novel unified transform framework with dual-domain progressive temporal alignment and a quality-conditioned mixture-of-experts (QCMoE). QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves competitive R-D performance compared with the state of the art.
arXiv Detail & Related papers (2025-12-11T09:14:51Z) - TBT-Former: Learning Temporal Boundary Distributions for Action Localization [1.2461503242570642]
Temporal Boundary Transformer (TBT-Former) is a new architecture for temporal action localization. Inspired by the principles of Generalized Focal Loss (GFL), this new head recasts the challenging task of boundary regression as a more flexible probability-distribution learning problem. TBT-Former sets a new level of performance on the highly competitive THUMOS14 and EPIC-Kitchens 100 datasets.
arXiv Detail & Related papers (2025-12-01T05:38:13Z) - Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models [34.131515004434846]
We introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient approach for adapting pretrained video diffusion models to conditional generation tasks. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B.
arXiv Detail & Related papers (2025-06-01T12:57:43Z) - Faster Diffusion Action Segmentation [9.868244939496678]
Temporal Action Segmentation (TAS) is an essential task in video analysis, aiming to segment and classify continuous frames into distinct action segments.
Recent advances in diffusion models have demonstrated substantial success in TAS tasks due to their stable training process and high-quality generation capabilities.
We propose EffiDiffAct, an efficient and high-performance TAS algorithm.
arXiv Detail & Related papers (2024-08-04T13:23:18Z) - HumMUSS: Human Motion Understanding using State Space Models [6.821961232645209]
We propose a novel attention-free model for human motion understanding building upon recent advancements in state space models.
Our model supports both offline and real-time applications.
For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches.
arXiv Detail & Related papers (2024-04-16T19:59:21Z) - Generative Hierarchical Temporal Transformer for Hand Pose and Action Modeling [67.94143911629143]
We propose a generative Transformer VAE architecture to model hand pose and action.
To faithfully model the semantic dependency and different temporal granularity of hand pose and action, we decompose the framework into two cascaded VAE blocks.
Results show that our joint modeling of recognition and prediction improves over isolated solutions.
arXiv Detail & Related papers (2023-11-29T05:28:39Z) - Look Back and Forth: Video Super-Resolution with Explicit Temporal
Difference Modeling [105.69197687940505]
We propose to explore the role of explicit temporal difference modeling in both LR and HR space.
To further enhance the super-resolution result, not only spatial residual features are extracted, but the difference between consecutive frames in high-frequency domain is also computed.
arXiv Detail & Related papers (2022-04-14T17:07:33Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Transformer for Video Super-Resolution (TTVSR)
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - Exploring Motion Ambiguity and Alignment for High-Quality Video Frame
Interpolation [46.02120172459727]
We propose to relax the requirement of reconstructing an intermediate frame as close to the ground-truth (GT) as possible.
We develop a texture consistency loss (TCL) upon the assumption that the interpolated content should maintain similar structures with their counterparts in the given frames.
arXiv Detail & Related papers (2022-03-19T10:37:06Z) - Temporal Transformer Networks with Self-Supervision for Action
Recognition [13.00827959393591]
We introduce a novel Temporal Transformer Network with Self-supervision (TTSN).
TTSN consists of a temporal transformer module and a temporal sequence self-supervision module.
Our proposed TTSN is promising as it successfully achieves state-of-the-art performance for action recognition.
arXiv Detail & Related papers (2021-12-14T12:53:53Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z) - Haar Wavelet based Block Autoregressive Flows for Trajectories [129.37479472754083]
Prediction of trajectories such as that of pedestrians is crucial to the performance of autonomous agents.
We introduce a novel Haar wavelet based block autoregressive model leveraging split couplings.
We illustrate the advantages of our approach for generating diverse and accurate trajectories on two real-world datasets.
arXiv Detail & Related papers (2020-09-21T13:57:10Z) - Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we process efficient semantic video segmentation in a per-frame fashion during the inference process.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.