SegTAD: Precise Temporal Action Detection via Semantic Segmentation
- URL: http://arxiv.org/abs/2203.01542v1
- Date: Thu, 3 Mar 2022 06:52:13 GMT
- Title: SegTAD: Precise Temporal Action Detection via Semantic Segmentation
- Authors: Chen Zhao, Merey Ramazanova, Mengmeng Xu, Bernard Ghanem
- Abstract summary: We formulate the task of temporal action detection in a novel perspective of semantic segmentation.
Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free.
We propose an end-to-end framework SegTAD composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN).
- Score: 65.01826091117746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action detection (TAD) is an important yet challenging task in video
analysis. Most existing works draw inspiration from image object detection and
tend to reformulate it as a proposal generation and classification problem.
However, this paradigm has two caveats. First, proposals are not
equipped with annotated labels; these must be compiled empirically, so the
information in the annotations is not necessarily employed precisely during
model training. Second, there are large variations in the temporal
scale of actions, and neglecting this fact may lead to deficient representation
in the video features. To address these issues and precisely model temporal
action detection, we formulate the task of temporal action detection in a novel
perspective of semantic segmentation. Owing to the 1-dimensional property of
TAD, we are able to convert the coarse-grained detection annotations to
fine-grained semantic segmentation annotations for free. We take advantage of
them to provide precise supervision so as to mitigate the impact induced by the
imprecise proposal labels. We propose an end-to-end framework SegTAD composed
of a 1D semantic segmentation network (1D-SSN) and a proposal detection network
(PDN).
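The conversion the abstract describes can be illustrated with a small sketch: because TAD annotations are 1-dimensional intervals, each (start, end, class) segment can be rasterized into per-snippet labels without any extra labeling cost. The function name, the snippet-grid convention, and the center-based assignment rule below are illustrative assumptions, not the paper's exact implementation.

```python
def detection_to_segmentation(annotations, video_duration, num_snippets, background=0):
    """Convert interval-level detection annotations (start, end, class)
    into a per-snippet segmentation label array of length num_snippets.

    Note: the snippet grid and center-based assignment are assumptions
    made for illustration, not SegTAD's exact procedure.
    """
    labels = [background] * num_snippets
    snippet_len = video_duration / num_snippets
    for start, end, cls in annotations:
        # Label every snippet whose temporal center falls inside [start, end).
        for i in range(num_snippets):
            center = (i + 0.5) * snippet_len
            if start <= center < end:
                labels[i] = cls
    return labels

# One action of class 3 spanning seconds 2-5 in a 10-second video:
seg = detection_to_segmentation([(2.0, 5.0, 3)], video_duration=10.0, num_snippets=10)
print(seg)  # → [0, 0, 3, 3, 3, 0, 0, 0, 0, 0]
```

These fine-grained labels can then supervise the 1D-SSN directly, which is how the segmentation view sidesteps the empirically compiled proposal labels criticized above.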
Related papers
- DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state of the art on the Argoverse 2 Sensor and Open datasets.
arXiv Detail & Related papers (2024-06-06T18:12:04Z) - Object-Centric Multiple Object Tracking [124.30650395969126]
This paper proposes a video object-centric model for multiple-object tracking pipelines.
It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module.
Benefiting from object-centric learning, we require only sparse detection labels for object localization and feature binding.
arXiv Detail & Related papers (2023-09-01T03:34:12Z) - Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps.
We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z) - Plug-and-Play Few-shot Object Detection with Meta Strategy and Explicit
Localization Inference [78.41932738265345]
This paper proposes a plug detector that can accurately detect the objects of novel categories without fine-tuning process.
We introduce two explicit inferences into the localization process to reduce its dependence on annotated data.
It shows a significant lead in efficiency, precision, and recall under varied evaluation protocols.
arXiv Detail & Related papers (2021-10-26T03:09:57Z) - Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS$_17$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z) - Relaxed Transformer Decoders for Direct Action Proposal Generation [30.516462193231888]
This paper presents a simple and end-to-end learnable framework (RTD-Net) for direct action proposal generation.
To tackle the essential visual difference between time and space, we make three important improvements over the original transformer detection framework (DETR).
Experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net.
arXiv Detail & Related papers (2021-02-03T06:29:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.