Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition
- URL: http://arxiv.org/abs/2511.03725v1
- Date: Wed, 05 Nov 2025 18:59:35 GMT
- Title: Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition
- Authors: Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi
- Abstract summary: We propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition. DANCE predicts actions through disentangled concept types: motion dynamics, objects, and scenes. Experiments on four datasets demonstrate that DANCE significantly improves explanation clarity with competitive performance.
- Score: 22.38060746037401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature -- intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets -- KTH, Penn Action, HAA500, and UCF-101 -- demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.
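The abstract describes DANCE's core design: an ante-hoc concept bottleneck that first scores each disentangled concept type (motion dynamics represented as pose sequences, plus LLM-extracted object and scene concepts) and then predicts the action from those concept scores alone. The sketch below illustrates that bottleneck pattern; the class name, per-type heads, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an ante-hoc concept bottleneck head for action recognition,
# in the spirit of DANCE. All names and dimensions are assumed for illustration.
import torch
import torch.nn as nn

class ConceptBottleneckActionHead(nn.Module):
    def __init__(self, feat_dim, n_motion, n_object, n_scene, n_actions):
        super().__init__()
        # One predictor per disentangled concept type.
        self.motion_head = nn.Linear(feat_dim, n_motion)  # pose-sequence (motion dynamics) concepts
        self.object_head = nn.Linear(feat_dim, n_object)  # LLM-derived object concepts
        self.scene_head = nn.Linear(feat_dim, n_scene)    # LLM-derived scene concepts
        # The action classifier sees only concept scores, never raw video features,
        # so each prediction decomposes into contributions from named concepts.
        self.classifier = nn.Linear(n_motion + n_object + n_scene, n_actions)

    def forward(self, video_feat):
        motion = torch.sigmoid(self.motion_head(video_feat))
        objects = torch.sigmoid(self.object_head(video_feat))
        scenes = torch.sigmoid(self.scene_head(video_feat))
        concepts = torch.cat([motion, objects, scenes], dim=-1)
        return self.classifier(concepts), {"motion": motion, "object": objects, "scene": scenes}
```

Because the classifier is forced to work through concept activations, a prediction can be explained by inspecting which motion, object, or scene concepts fired, which is also what enables the debugging, editing, and failure-analysis uses mentioned above.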
Related papers
- When Thinking Drifts: Evidential Grounding for Robust Video Reasoning [68.75730050161219]
The Chain-of-Thought (CoT) mechanism has enhanced reasoning in text-based tasks. However, CoT often degrades performance in video reasoning, generating verbose but misleading internal monologues. Visual Evidence Reward (VER) is a reinforcement learning framework that explicitly rewards the generation of reasoning traces that are verifiably grounded in visual evidence.
arXiv Detail & Related papers (2025-10-07T16:03:33Z) - PCBEAR: Pose Concept Bottleneck for Explainable Action Recognition [9.179016800487506]
We propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR). PCBEAR introduces human pose sequences as motion-aware, structured concepts for video action recognition. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process.
arXiv Detail & Related papers (2025-04-17T17:50:07Z) - Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories. We translate high-level user requests into detailed, semi-dense motion prompts. We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z) - Motion Dreamer: Boundary Conditional Motion Reasoning for Physically Coherent Video Generation [27.690736225683825]
We introduce Motion Dreamer, a two-stage framework that explicitly separates motion reasoning from visual synthesis. Our approach introduces instance flow, a sparse-to-dense motion representation enabling effective integration of partial user-defined motions. Experiments demonstrate that Motion Dreamer significantly outperforms existing methods, achieving superior motion plausibility and visual realism.
arXiv Detail & Related papers (2024-11-30T17:40:49Z) - Human Motion Instruction Tuning [37.3026760535819]
This paper presents LLaMo, a framework for human motion instruction tuning. LLaMo retains motion in its native form for instruction tuning. By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis.
arXiv Detail & Related papers (2024-11-25T14:38:43Z) - HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model [9.762722976833581]
Current models rely extensively on instance-level alignment between video and language modalities.
We take inspiration from human perception and explore a compositional approach to egocentric video representation.
arXiv Detail & Related papers (2024-06-01T05:41:12Z) - MotionLLM: Understanding Human Behaviors from Human Motions and Videos [40.132643319573205]
This study addresses multi-modal human behavior understanding across video and motion modalities.
We present MotionLLM, a framework for human motion understanding, captioning, and reasoning.
arXiv Detail & Related papers (2024-05-30T17:59:50Z) - Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases [59.32509533292653]
Motion understanding aims to establish a reliable mapping between motion and action semantics.
We propose Kinematic Phrases (KP), which capture the objective kinematic facts of human motion with proper abstraction, interpretability, and generality.
Based on KP, we can unify a motion knowledge base and build a motion understanding system.
arXiv Detail & Related papers (2023-10-06T12:08:15Z) - Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z) - Static and Dynamic Concepts for Self-supervised Video Representation
Learning [70.15341866794303]
We propose a novel learning scheme for self-supervised video representation learning.
Motivated by how humans understand videos, we propose to first learn general visual concepts and then attend to discriminative local areas for video understanding.
arXiv Detail & Related papers (2022-07-26T10:28:44Z) - Dynamic Visual Reasoning by Learning Differentiable Physics Models from
Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z)