Video Region Annotation with Sparse Bounding Boxes
- URL: http://arxiv.org/abs/2008.07049v1
- Date: Mon, 17 Aug 2020 01:27:20 GMT
- Title: Video Region Annotation with Sparse Bounding Boxes
- Authors: Yuzheng Xu, Yang Wu, Nur Sabrina binti Zuraimi, Shohei Nobuhara and Ko Nishino
- Abstract summary: We learn to automatically generate region boundaries for all frames of a video from sparsely annotated bounding boxes of target regions.
We achieve this with a Volumetric Graph Convolutional Network (VGCN), which learns to iteratively find keypoints on the region boundaries.
- Score: 29.323784279321337
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video analysis has been moving towards more detailed interpretation (e.g.
segmentation) with encouraging progress. These tasks, however, increasingly
rely on densely annotated training data both in space and time. Since such
annotation is labour-intensive, few densely annotated video data with detailed
region boundaries exist. This work aims to resolve this dilemma by learning to
automatically generate region boundaries for all frames of a video from
sparsely annotated bounding boxes of target regions. We achieve this with a
Volumetric Graph Convolutional Network (VGCN), which learns to iteratively find
keypoints on the region boundaries using the spatio-temporal volume of
surrounding appearance and motion. The global optimization of VGCN makes it
significantly stronger and able to generalize better than existing solutions.
Experimental results on two recent datasets (one real and one synthetic),
including ablation studies, demonstrate the effectiveness and superiority of
our method.
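
The abstract describes the VGCN only in prose. As a rough, non-authoritative illustration (not the authors' released code), the Python sketch below builds a spatio-temporal graph over boundary keypoints of neighbouring frames and iteratively refines their positions with a simple graph convolution. The adjacency pattern, the two weight matrices, and the feature-sampling callable are assumptions made for the example; in the paper the network is trained so that the keypoints converge to the region boundary.

```python
import numpy as np

def build_volume_adjacency(num_frames, num_kpts):
    """Adjacency over a spatio-temporal volume of keypoints: each keypoint is
    linked to its two neighbours along the boundary in the same frame and to
    the same index in adjacent frames (an assumed connectivity pattern)."""
    n = num_frames * num_kpts
    A = np.zeros((n, n))
    idx = lambda t, k: t * num_kpts + k
    for t in range(num_frames):
        for k in range(num_kpts):
            A[idx(t, k), idx(t, (k + 1) % num_kpts)] = 1  # next point on boundary
            A[idx(t, k), idx(t, (k - 1) % num_kpts)] = 1  # previous point
            if t > 0:
                A[idx(t, k), idx(t - 1, k)] = 1           # same point, previous frame
            if t < num_frames - 1:
                A[idx(t, k), idx(t + 1, k)] = 1           # same point, next frame
    A += np.eye(n)                                        # self-loops
    return A

def gcn_layer(A, X, W):
    """One graph-convolution layer with symmetric normalisation and ReLU."""
    d = A.sum(1)
    A_hat = A / np.sqrt(np.outer(d, d))
    return np.maximum(A_hat @ X @ W, 0.0)

def refine_keypoints(kpts, sample_features, W1, W2, n_iters=3):
    """Iteratively predict per-keypoint offsets from local appearance/motion
    features and move the keypoints. kpts: [T, K, 2]; sample_features is a
    placeholder for a learned feature extractor around each keypoint."""
    T, K, _ = kpts.shape
    A = build_volume_adjacency(T, K)
    for _ in range(n_iters):
        X = sample_features(kpts).reshape(T * K, -1)      # features per keypoint
        H = gcn_layer(A, X, W1)
        offsets = (H @ W2).reshape(T, K, 2)               # predicted (dx, dy)
        kpts = kpts + offsets
    return kpts
```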
Related papers
- EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model [63.93372634950661]
We propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries.
Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within initial pseudo boundaries.
arXiv Detail & Related papers (2023-12-05T04:15:56Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Learning Temporal Distribution and Spatial Correlation Towards Universal Moving Object Segmentation [8.807766029291901]
We propose a method called Learning Temporal Distribution and Spatial Correlation (LTS) that has the potential to be a general solution for universal moving object segmentation.
In the proposed approach, the distribution from temporal pixels is first learned by our Defect Iterative Distribution Learning (DIDL) network for scene-independent segmentation.
The proposed approach performs well for almost all videos from diverse and complex natural scenes with fixed parameters.
arXiv Detail & Related papers (2023-04-19T20:03:09Z)
- Group Contextualization for Video Recognition [80.3842253625557]
Group contextualization (GC) can boost the performance of 2D-CNNs (e.g., TSN) and TSM.
GC embeds features with four different kinds of contexts in parallel.
This lifts 2D-CNNs to a level comparable to state-of-the-art video networks.
arXiv Detail & Related papers (2022-03-18T01:49:40Z)
- BI-GCN: Boundary-Aware Input-Dependent Graph Convolution Network for Biomedical Image Segmentation [21.912509900254364]
We apply graph convolution to the segmentation task and propose an improved Laplacian.
Our method outperforms the state-of-the-art approaches on the segmentation of polyps in colonoscopy images and of the optic disc and optic cup in colour fundus images.
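
The BI-GCN summary mentions an improved Laplacian; for context only, the snippet below computes the standard symmetric normalized Laplacian that plain graph convolutions build on. It is a sketch of the baseline operator, not the paper's modified variant.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2} of an
    undirected graph with adjacency matrix A (BI-GCN modifies this operator)."""
    d = A.sum(1)
    d_inv_sqrt = np.zeros_like(d, dtype=float)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5                        # skip isolated nodes
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
```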
arXiv Detail & Related papers (2021-10-27T21:12:27Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Reducing the Annotation Effort for Video Object Segmentation Datasets [50.893073670389164]
Densely labeling every frame with pixel masks does not scale to large datasets.
We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations.
We obtain the new TAO-VOS benchmark, which we make publicly available at www.vision.rwth-aachen.de/page/taovos.
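
This paper generates pixel-level pseudo-labels from box annotations with a trained network; the sketch below only illustrates the surrounding bookkeeping (crop each box, segment the crop, paste the mask back). The per-crop segmentation model is left as an injected callable rather than the authors' network, and the box format is an assumption for the example.

```python
import numpy as np

def pseudo_labels_from_boxes(image, boxes, segment_crop):
    """Turn box annotations into a pixel-level pseudo-label map.
    `segment_crop` stands in for a learned model that returns a boolean
    foreground mask for an image crop (not the paper's network).
    boxes: list of (x1, y1, x2, y2, class_id) in pixel coordinates."""
    h, w = image.shape[:2]
    label_map = np.zeros((h, w), dtype=np.int32)          # 0 = background
    for (x1, y1, x2, y2, cls) in boxes:
        crop = image[y1:y2, x1:x2]
        mask = segment_crop(crop)                         # crop-sized boolean mask
        region = label_map[y1:y2, x1:x2]
        region[mask] = cls                                # paste predicted pixels back
    return label_map
```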
arXiv Detail & Related papers (2020-11-02T17:34:45Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
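
To make such pretext labels concrete, the sketch below derives two targets from a dense optical-flow field: which cell of a coarse grid moves the most, and the dominant quantised direction inside that cell. The grid layout and number of angle bins are illustrative assumptions, not the paper's exact statistics.

```python
import numpy as np

def largest_motion_summary(flow, grid=4, n_angle_bins=8):
    """flow: (H, W, 2) optical-flow field. Returns (cell_index, direction_bin),
    two classification-style pretext targets of the kind the blurb describes."""
    h, w, _ = flow.shape
    mag = np.linalg.norm(flow, axis=-1)
    cell_h, cell_w = h // grid, w // grid
    # average motion magnitude per grid cell
    cell_mag = mag[:grid * cell_h, :grid * cell_w] \
        .reshape(grid, cell_h, grid, cell_w).mean(axis=(1, 3))
    cell_idx = int(cell_mag.argmax())                     # location label in [0, grid^2)
    gy, gx = divmod(cell_idx, grid)
    cell_flow = flow[gy * cell_h:(gy + 1) * cell_h, gx * cell_w:(gx + 1) * cell_w]
    angles = np.arctan2(cell_flow[..., 1], cell_flow[..., 0])
    bins = ((angles + np.pi) / (2 * np.pi) * n_angle_bins).astype(int) % n_angle_bins
    direction = int(np.bincount(bins.ravel(), minlength=n_angle_bins).argmax())
    return cell_idx, direction
```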
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation [22.887397951846353]
Weakly supervised approaches aim at learning temporal action segmentation from videos that are only weakly labeled.
We propose an approach that can be trained end-to-end on such data.
We evaluate our approach on three datasets where the approach achieves state-of-the-art results.
arXiv Detail & Related papers (2020-03-31T14:51:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.