Learning a Weakly-Supervised Video Actor-Action Segmentation Model with a Wise Selection
- URL: http://arxiv.org/abs/2003.13141v1
- Date: Sun, 29 Mar 2020 21:15:18 GMT
- Title: Learning a Weakly-Supervised Video Actor-Action Segmentation Model with a Wise Selection
- Authors: Jie Chen, Zhiheng Li, Jiebo Luo, and Chenliang Xu
- Abstract summary: We address weakly-supervised video actor-action segmentation (VAAS).
We propose a general Weakly-Supervised framework with a Wise Selection of training samples and model evaluation criterion (WS^2).
WS^2 achieves state-of-the-art performance on both weakly-supervised VOS and VAAS tasks.
- Score: 97.98805233539633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address weakly-supervised video actor-action segmentation (VAAS), which
extends general video object segmentation (VOS) to additionally consider action
labels of the actors. The most successful methods on VOS synthesize a pool of
pseudo-annotations (PAs) and then refine them iteratively. However, they face
challenges in how to select high-quality PAs from a massive pool, how to set
an appropriate stopping condition for weakly-supervised training, and how to
initialize PAs pertaining to VAAS. To overcome these challenges, we
propose a general Weakly-Supervised framework with a Wise Selection of training
samples and model evaluation criterion (WS^2). Instead of blindly trusting
quality-inconsistent PAs, WS^2 employs a learning-based selection to select
effective PAs and a novel region integrity criterion as a stopping condition
for weakly-supervised training. In addition, a 3D-Conv GCAM is devised to adapt
to the VAAS task. Extensive experiments show that WS^2 achieves
state-of-the-art performance on both weakly-supervised VOS and VAAS tasks and
is on par with the best fully-supervised method on VAAS.
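To make the training loop in the abstract concrete, here is a minimal sketch of the synthesize-select-refine cycle it describes. The connected-component integrity score, the plateau test, and every helper name below are illustrative assumptions standing in for the paper's region integrity criterion and learning-based selection, not the authors' released implementation.

```python
# Minimal sketch of the iterative PA-refinement loop, assuming simple
# stand-ins for the selection model and the region integrity criterion.
import numpy as np
from scipy.ndimage import label

def region_integrity(mask: np.ndarray) -> float:
    """Illustrative integrity score: the fraction of foreground pixels
    that fall in the largest connected region, so a whole, unfragmented
    mask scores close to 1.0."""
    labeled, n_regions = label(mask)
    if n_regions == 0:
        return 0.0
    sizes = np.bincount(labeled.ravel())[1:]  # drop the background bin
    return float(sizes.max() / sizes.sum())

def train_ws2_style(videos, synthesize_pas, select_pas, train_step, predict,
                    max_rounds=10, tol=1e-3):
    """Iterate: synthesize PAs -> keep the high-quality ones -> retrain.
    Stops once the mean integrity of the predicted masks plateaus, a
    hypothetical stand-in for the paper's stopping condition."""
    model, prev = None, -np.inf
    for _ in range(max_rounds):
        pas = synthesize_pas(model, videos)   # pool of pseudo-annotations
        good = select_pas(pas)                # learning-based selection
        model = train_step(model, good)       # refine the segmentation net
        score = np.mean([region_integrity(m) for m in predict(model, videos)])
        if score - prev < tol:                # integrity has plateaued
            break
        prev = score
    return model
```

The callables (`synthesize_pas`, `select_pas`, `train_step`, `predict`) are placeholders for the model-specific pieces the abstract names but does not detail.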
Related papers
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods tend to be domain-specific, making them costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z)
- Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning [0.0]
Video Annotator (VA) is a framework for annotating, managing, and iterating on video classification datasets.
VA allows for a continuous annotation process, seamlessly integrating data collection and model training.
VA achieves a median 6.8 point improvement in Average Precision relative to the most competitive baseline.
arXiv Detail & Related papers (2024-02-09T17:19:05Z)
- Skill Disentanglement for Imitation Learning from Suboptimal Demonstrations [60.241144377865716]
We consider the imitation of sub-optimal demonstrations, with both a small clean demonstration set and a large noisy set.
We propose a method that evaluates and imitates at the sub-demonstration level, encoding action primitives of varying quality into different skills.
arXiv Detail & Related papers (2023-06-13T17:24:37Z)
- REST: REtrieve & Self-Train for generative action recognition [54.90704746573636]
We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition.
We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting.
We introduce REST, a training framework consisting of two key components.
arXiv Detail & Related papers (2022-09-29T17:57:01Z)
- Active Learning with Effective Scoring Functions for Semi-Supervised Temporal Action Localization [15.031156121516211]
This paper focuses on a rarely investigated yet practical task named semi-supervised TAL.
We propose an effective active learning method, named AL-STAL.
Experiment results show that AL-STAL outperforms the existing competitors and achieves satisfactory performance compared with fully-supervised learning.
arXiv Detail & Related papers (2022-08-31T13:39:38Z)
- W2N: Switching From Weak Supervision to Noisy Supervision for Object Detection [64.10643170523414]
We propose a novel WSOD framework with a new paradigm that switches from weak supervision to noisy supervision (W2N).
In the localization adaptation module, we propose a regularization loss to reduce the proportion of discriminative parts in original pseudo ground-truths.
Our W2N outperforms all existing pure WSOD methods and transfer learning methods.
arXiv Detail & Related papers (2022-07-25T12:13:48Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
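The paste-and-align procedure in the UP-TAL entry above is concrete enough to sketch. Below is a hedged PyTorch illustration of the PAL pretext task; the tensor shapes, the `encode_region` interface, the clip length, and the cosine-similarity agreement are all assumptions rather than the released code.

```python
# Illustrative sketch of the PAL pretext task: cut a clip region from
# one video, paste it into two other videos at random temporal
# positions, and maximize agreement between the two pasted copies.
# Shapes, the encoder interface, and the loss form are assumptions.
import torch
import torch.nn.functional as F

def paste(host: torch.Tensor, clip: torch.Tensor, start: int) -> torch.Tensor:
    """Overwrite host[:, start:start+T'] with the pseudo-action clip.
    host and clip are (C, T, H, W) video tensors with T' <= T."""
    out = host.clone()
    out[:, start:start + clip.shape[1]] = clip
    return out

def pal_loss(encode_region, source, host_a, host_b, clip_len=16):
    """encode_region(video, start, length) -> feature vector for the
    pasted span; an assumed interface for a 3D-CNN feature encoder.
    All videos are assumed to be longer than clip_len frames."""
    t0 = torch.randint(0, source.shape[1] - clip_len + 1, (1,)).item()
    pseudo = source[:, t0:t0 + clip_len]          # pseudo action from source
    sa = torch.randint(0, host_a.shape[1] - clip_len + 1, (1,)).item()
    sb = torch.randint(0, host_b.shape[1] - clip_len + 1, (1,)).item()
    za = encode_region(paste(host_a, pseudo, sa), sa, clip_len)
    zb = encode_region(paste(host_b, pseudo, sb), sb, clip_len)
    # Agreement = cosine similarity between the two pasted-region features.
    return 1.0 - F.cosine_similarity(za, zb, dim=-1).mean()
```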
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.