DEVIAS: Learning Disentangled Video Representations of Action and Scene for Holistic Video Understanding
- URL: http://arxiv.org/abs/2312.00826v2
- Date: Thu, 11 Jul 2024 06:32:32 GMT
- Title: DEVIAS: Learning Disentangled Video Representations of Action and Scene for Holistic Video Understanding
- Authors: Kyungho Bae, Geo Ahn, Youngrae Kim, Jinwoo Choi
- Abstract summary: We propose Disentangled VIdeo representations of Action and Scene (DEVIAS) to achieve holistic video understanding.
Our proposed method shows favorable performance across different datasets compared to the baselines.
- Score: 3.336126457178601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When watching a video, humans can naturally extract human actions from the surrounding scene context, even when action-scene combinations are unusual. However, unlike humans, video action recognition models often learn scene-biased action representations from the spurious correlation in training data, leading to poor performance in out-of-context scenarios. While scene-debiased models achieve improved performance in out-of-context scenarios, they often overlook valuable scene information in the data. Addressing this challenge, we propose Disentangled VIdeo representations of Action and Scene (DEVIAS), which aims to achieve holistic video understanding. Disentangled action and scene representations with our method could provide flexibility to adjust the emphasis on action or scene information depending on downstream task and dataset characteristics. Disentangled action and scene representations could be beneficial for both in-context and out-of-context video understanding. To this end, we employ slot attention to learn disentangled action and scene representations with a single model, along with auxiliary tasks that further guide slot attention. We validate the proposed method on both in-context datasets: UCF-101 and Kinetics-400, and out-of-context datasets: SCUBA and HAT. Our proposed method shows favorable performance across different datasets compared to the baselines, demonstrating its effectiveness in diverse video understanding scenarios.
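The slot attention mechanism named in the abstract can be illustrated with a minimal sketch. This is not the DEVIAS implementation: it omits the learned projections, GRU, and MLP of full slot attention, and the function name `slot_attention`, the toy features, and the two initial slots (standing in for one "action" slot and one "scene" slot) are assumptions made purely for illustration. It shows only the core idea that slots compete for input features through a softmax taken over the slot axis.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def slot_attention(inputs, slots, num_iters=3, eps=1e-8):
    """Simplified slot attention (no learned projections, GRU, or MLP).

    inputs: list of N feature vectors, each a list of D floats
    slots:  list of K initial slot vectors, each a list of D floats
    Returns (slots, attn), where attn[n][k] is the share of input n
    assigned to slot k.
    """
    dim = len(inputs[0])
    scale = 1.0 / math.sqrt(dim)
    attn = []
    for _ in range(num_iters):
        # Softmax over slots for each input, so slots compete for inputs.
        attn = [
            softmax([scale * sum(s * v for s, v in zip(slot, x)) for slot in slots])
            for x in inputs
        ]
        new_slots = []
        for k in range(len(slots)):
            weights = [row[k] for row in attn]
            total = sum(weights) + eps
            # Each slot becomes the weighted mean of the inputs it attends to.
            new_slots.append([
                sum(w * x[d] for w, x in zip(weights, inputs)) / total
                for d in range(dim)
            ])
        slots = new_slots
    return slots, attn


# Toy features (hypothetical): two inputs lean toward dimension 0,
# two lean toward dimension 1.
features = [[4.0, 0.0], [3.6, 0.4], [0.0, 4.0], [0.4, 3.6]]
initial_slots = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical "action" and "scene" slots
final_slots, attention = slot_attention(features, initial_slots)
```

In this toy setting each slot ends up attending mostly to one group of inputs. In a trained model the slots would be learned and, per the abstract, further guided by auxiliary tasks.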
Related papers
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- SOAR: Scene-debiasing Open-set Action Recognition [81.8198917049666]
We propose Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module.
The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning.
The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information.
arXiv Detail & Related papers (2023-09-03T20:20:48Z)
- Learning Action-Effect Dynamics from Pairs of Scene-graphs [50.72283841720014]
We propose a novel method that leverages scene-graph representation of images to reason about the effects of actions described in natural language.
Our proposed approach is effective in terms of performance, data efficiency, and generalization capability compared to existing models.
arXiv Detail & Related papers (2022-12-07T03:36:37Z)
- Scene Consistency Representation Learning for Video Scene Segmentation [26.790491577584366]
We propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from long-term videos.
We present an SSL scheme to achieve scene consistency, while exploring considerable data augmentation and shuffling methods to boost the model generalizability.
Our method achieves state-of-the-art performance on the task of Video Scene Segmentation.
arXiv Detail & Related papers (2022-05-11T13:31:15Z)
- Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes [131.9067467127761]
We focus on improving action recognition by fully utilizing scene information and collecting new data.
Specifically, we adopt a strong human detector to detect the spatial location of each frame.
We then apply action recognition models to learn the temporal information from video frames on both the HIE dataset and new data with diverse scenes from the internet.
arXiv Detail & Related papers (2020-10-16T13:08:50Z)
- Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion [86.56202610716504]
Action categories are highly related to the scene where the action happens, so the model tends to degrade to a solution where only the scene information is encoded.
We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information.
arXiv Detail & Related papers (2020-09-12T09:54:11Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)