Temporal Perceiver: A General Architecture for Arbitrary Boundary
Detection
- URL: http://arxiv.org/abs/2203.00307v1
- Date: Tue, 1 Mar 2022 09:31:30 GMT
- Title: Temporal Perceiver: A General Architecture for Arbitrary Boundary
Detection
- Authors: Jing Tan, Yuhong Wang, Gangshan Wu, Limin Wang
- Abstract summary: Generic Boundary Detection (GBD) aims at locating general boundaries that divide videos into semantically coherent and taxonomy-free units.
Previous research handles these different levels of generic boundaries separately, with specifically designed, complicated deep networks ranging from simple CNNs to LSTMs.
We present Temporal Perceiver, a general architecture with Transformers, offering a unified solution to the detection of arbitrary generic boundaries.
- Score: 48.33132632418303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generic Boundary Detection (GBD) aims at locating general boundaries that
divide videos into semantically coherent and taxonomy-free units, and could
serve as an important pre-processing step for long-form video understanding.
Previous research handles these different levels of generic boundaries
separately, with specifically designed, complicated deep networks ranging from
simple CNNs to LSTMs.
Instead, in this paper, our objective is to develop a general yet simple
architecture for arbitrary boundary detection in videos. To this end, we
present Temporal Perceiver, a general architecture with Transformers, offering
a unified solution to the detection of arbitrary generic boundaries. The core
design is to introduce a small set of latent feature queries as anchors to
compress the redundant input into a fixed dimension via cross-attention blocks.
Thanks to this fixed number of latent units, the quadratic complexity of the
attention operation is reduced to a cost that is linear in the number of input
frames. Specifically, to leverage the coherence structure of videos, we
construct two types of latent feature queries: boundary queries and context
queries, which handle semantically incoherent and coherent regions,
respectively. Moreover, to guide the learning of latent feature queries, we
propose an alignment loss on cross-attention to explicitly encourage the
boundary queries to attend to the most likely boundaries. Finally, we present a
sparse detection head on the
compressed representations and directly output the final boundary detection
results without any post-processing module. We test our Temporal Perceiver on a
variety of detection benchmarks, ranging from shot-level, event-level, to
scene-level GBD. Our method surpasses the previous state-of-the-art methods on
all benchmarks, demonstrating the generalization ability of our Temporal
Perceiver.
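As a concrete illustration of the compression idea described above, below is a
minimal PyTorch sketch of latent-query cross-attention with a sparse detection
head. All names (LatentCompressor, num_boundary_queries, etc.) are placeholders
chosen for this sketch, and it relies on the standard nn.MultiheadAttention
module; it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): latent feature queries compress a
# variable-length frame sequence into a fixed number of latent units via
# cross-attention, and a sparse head predicts boundaries from boundary queries.
import torch
import torch.nn as nn


class LatentCompressor(nn.Module):
    def __init__(self, dim=256, num_boundary_queries=16,
                 num_context_queries=16, num_heads=8):
        super().__init__()
        # Learnable anchors: boundary queries target semantically incoherent
        # (boundary) regions, context queries target coherent segments.
        self.boundary_queries = nn.Parameter(torch.randn(num_boundary_queries, dim))
        self.context_queries = nn.Parameter(torch.randn(num_context_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Sparse detection head: per boundary query, a confidence score and a
        # normalized temporal location, with no post-processing step.
        self.head = nn.Linear(dim, 2)

    def forward(self, frame_feats):
        # frame_feats: (B, T, dim) per-frame features from some backbone.
        B = frame_feats.size(0)
        queries = torch.cat([self.boundary_queries, self.context_queries], 0)
        queries = queries.unsqueeze(0).expand(B, -1, -1)        # (B, K, dim)
        latents, attn = self.cross_attn(queries, frame_feats, frame_feats)
        k_b = self.boundary_queries.size(0)
        preds = self.head(latents[:, :k_b])                     # (B, K_b, 2)
        confidence = preds[..., 0].sigmoid()                    # boundary score
        location = preds[..., 1].sigmoid()                      # in [0, 1]
        # attn[:, :k_b] is the cross-attention map of the boundary queries;
        # an alignment loss could push it toward likely boundary frames.
        return confidence, location, attn


# Example: 2 clips of 300 frames each, 256-d features.
feats = torch.randn(2, 300, 256)
conf, loc, attn = LatentCompressor()(feats)
print(conf.shape, loc.shape, attn.shape)  # (2, 16), (2, 16), (2, 32, 300)
```

Because the number of latent queries K is fixed, the cross-attention cost grows
as O(K * T) in the number of frames T, which is the linear scaling described in
the abstract; the alignment loss would be applied to the returned attention map
so that boundary queries concentrate on likely boundary frames.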
Related papers
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS)
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z) - Local Compressed Video Stream Learning for Generic Event Boundary
Detection [25.37983456118522]
Event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
Existing methods typically require video frames to be decoded before feeding into the network.
We propose a novel event boundary detection method that is fully end-to-end leveraging rich information in the compressed domain.
arXiv Detail & Related papers (2023-09-27T06:49:40Z) - Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework named TriDet to resolve imprecise predictions of action boundaries by existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
arXiv Detail & Related papers (2023-09-11T16:17:50Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - Push-the-Boundary: Boundary-aware Feature Propagation for Semantic
Segmentation of 3D Point Clouds [0.5249805590164901]
We propose a boundary-aware feature propagation mechanism to improve semantic segmentation near object boundaries.
With one shared encoder, our network outputs (i) boundary localization, (ii) prediction of directions pointing to the object's interior, and (iii) semantic segmentation, in three parallel streams.
Our proposed approach yields consistent improvements by reducing boundary errors.
arXiv Detail & Related papers (2022-12-23T15:42:01Z) - End-to-End Compressed Video Representation Learning for Generic Event
Boundary Detection [31.31508043234419]
We propose a new end-to-end compressed video representation learning for event boundary detection.
We first use ConvNets to extract features of the I-frames in the GOPs.
After that, a light-weight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames.
A temporal contrastive module is proposed to determine the event boundaries of video sequences.
arXiv Detail & Related papers (2022-03-29T08:27:48Z) - Boundary Guided Context Aggregation for Semantic Segmentation [23.709865471981313]
We exploit boundary as a significant guidance for context aggregation to promote the overall semantic understanding of an image.
We conduct extensive experiments on the Cityscapes and ADE20K databases, achieving results comparable to state-of-the-art methods.
arXiv Detail & Related papers (2021-10-27T17:04:38Z) - Video Is Graph: Structured Graph Module for Video Action Recognition [34.918667614077805]
We transform a video sequence into a graph to obtain direct long-term dependencies among temporal frames.
In particular, SGM divides the neighbors of each node into several temporal regions so as to extract global structural information.
The reported performance and analysis demonstrate that SGM can achieve outstanding precision with less computational complexity.
arXiv Detail & Related papers (2021-10-12T11:27:29Z) - The Devil is in the Boundary: Exploiting Boundary Representation for
Basis-based Instance Segmentation [85.153426159438]
We propose Basis-based Instance (B2Inst) to learn a global boundary representation that can complement existing global-mask-based methods.
Our B2Inst leads to consistent improvements and accurately parses out the instance boundaries in a scene.
arXiv Detail & Related papers (2020-11-26T11:26:06Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)