PCBEAR: Pose Concept Bottleneck for Explainable Action Recognition
- URL: http://arxiv.org/abs/2504.13140v1
- Date: Thu, 17 Apr 2025 17:50:07 GMT
- Title: PCBEAR: Pose Concept Bottleneck for Explainable Action Recognition
- Authors: Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi
- Abstract summary: We propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR). PCBEAR introduces human pose sequences as motion-aware, structured concepts for video action recognition. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process.
- Score: 9.179016800487506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human action recognition (HAR) has achieved impressive results with deep learning models, but their decision-making process remains opaque due to their black-box nature. Ensuring interpretability is crucial, especially for real-world applications requiring transparency and accountability. Existing video XAI methods primarily rely on feature attribution or static textual concepts, both of which struggle to capture motion dynamics and temporal dependencies essential for action understanding. To address these challenges, we propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR leverages human skeleton poses, which focus solely on body movements, providing robust and interpretable explanations of motion dynamics. We define two types of pose-based concepts: static pose concepts for spatial configurations at individual frames, and dynamic pose concepts for motion patterns across multiple frames. To construct these concepts, PCBEAR applies clustering to video pose sequences, allowing for automatic discovery of meaningful concepts without manual annotation. We validate PCBEAR on KTH, Penn-Action, and HAA500, showing that it achieves high classification performance while offering interpretable, motion-driven explanations. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process, enabling test-time interventions for debugging and improving model behavior.
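For a concrete picture of the pose-concept bottleneck idea described in the abstract, the sketch below shows one plausible way to cluster skeleton poses into static concepts (single-frame configurations) and dynamic concepts (short pose sequences), and then classify actions from the resulting concept activations. The data shapes, window length, cluster counts, and the choice of k-means plus a logistic-regression head are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of a pose-concept bottleneck in the spirit of PCBEAR.
# All shapes, hyperparameters, and model choices below are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for skeleton data: V videos, T frames, J joints, 2D coordinates.
V, T, J = 200, 32, 17
poses = rng.normal(size=(V, T, J, 2)).astype(np.float32)
labels = rng.integers(0, 5, size=V)            # 5 hypothetical action classes

# --- Static pose concepts: cluster individual frame poses ------------------
frames = poses.reshape(V * T, J * 2)           # one flattened pose per frame
k_static = 16
static_km = KMeans(n_clusters=k_static, n_init=10, random_state=0).fit(frames)

# --- Dynamic pose concepts: cluster short non-overlapping pose windows -----
W = 8                                          # window length (assumed)
windows = poses.reshape(V, T // W, W * J * 2)
k_dynamic = 16
dynamic_km = KMeans(n_clusters=k_dynamic, n_init=10, random_state=0).fit(
    windows.reshape(-1, W * J * 2))

def concept_activations(video_poses):
    """Histogram of static and dynamic concept assignments for one video."""
    f = video_poses.reshape(T, J * 2)
    w = video_poses.reshape(T // W, W * J * 2)
    static_hist = np.bincount(static_km.predict(f), minlength=k_static) / T
    dynamic_hist = np.bincount(dynamic_km.predict(w), minlength=k_dynamic) / (T // W)
    return np.concatenate([static_hist, dynamic_hist])

X = np.stack([concept_activations(p) for p in poses])

# --- Bottleneck classifier: actions are predicted only from concept scores -
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy on synthetic data:", clf.score(X, labels))
```

In this framing, the test-time interventions mentioned in the abstract would amount to editing entries of the concept-activation vector before the final classifier and observing how the predicted action changes.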
Related papers
- Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions [70.87254264798341]
PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. It reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type.
arXiv Detail & Related papers (2025-12-09T11:05:08Z) - Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition [22.38060746037401]
We propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition. DANCE predicts actions through disentangled concept types: motion dynamics, objects, and scenes. Experiments on four datasets demonstrate that DANCE significantly improves explanation clarity with competitive performance.
arXiv Detail & Related papers (2025-11-05T18:59:35Z) - Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification [10.376843346305112]
MoTIF is an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time.
arXiv Detail & Related papers (2025-09-25T08:35:03Z) - SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-em semantic comprehension mechanism which disentangles subject and motion representations. At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z) - HuMoCon: Concept Discovery for Human Motion Understanding [14.987145689605084]
HuMoCon is a motion-video understanding framework for advanced human behavior analysis. HuMoCon trains multi-modal encoders to extract semantically meaningful and generalizable features.
arXiv Detail & Related papers (2025-05-27T09:10:59Z) - Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter [52.08332620725473]
We propose a tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.
arXiv Detail & Related papers (2025-05-24T09:21:32Z) - Show and Tell: Visually Explainable Deep Neural Nets via Spatially-Aware Concept Bottleneck Models [5.985204759362746]
We present a unified framework for transforming any vision neural network into a spatially and conceptually interpretable model. We name this method "Spatially-Aware and Label-Free Concept Bottleneck Model" (SALF-CBM).
arXiv Detail & Related papers (2025-02-27T14:27:55Z) - Dynamic Concepts Personalization from Single Videos [92.62863918003575]
We introduce Set-and-Sequence, a novel framework for personalizing generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. Our framework embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality.
arXiv Detail & Related papers (2025-02-20T18:53:39Z) - Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space [14.188708813577456]
We analyze a model's learning dynamics via a framework we call the concept space.
We observe moments of sudden turns in the direction of a model's learning dynamics in concept space.
Surprisingly, these points precisely correspond to the emergence of hidden capabilities.
arXiv Detail & Related papers (2024-06-27T17:50:05Z) - Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models [57.86303579812877]
Concept Bottleneck Models (CBMs) ground image classification on human-understandable concepts to allow for interpretable model decisions.
Existing approaches often require numerous human interventions per image to achieve strong performance.
We introduce a trainable concept realignment intervention module, which leverages concept relations to realign concept assignments post-intervention.
arXiv Detail & Related papers (2024-05-02T17:59:01Z) - Exploring Explainability in Video Action Recognition [5.7782784592048575]
Video Action Recognition and Image Classification are foundational tasks in computer vision.
Video-TCAV aims to quantify the importance of specific concepts in the decision-making process of Video Action Recognition models.
We propose a machine-assisted approach to generate spatial and temporal concepts relevant to Video Action Recognition for testing Video-TCAV.
arXiv Detail & Related papers (2024-04-13T19:34:14Z) - Collaboratively Self-supervised Video Representation Learning for Action Recognition [54.92120002380786]
We design a Collaboratively Self-supervised Video Representation learning framework specific to action recognition. Our method achieves state-of-the-art performance on multiple popular video datasets.
arXiv Detail & Related papers (2024-01-15T10:42:04Z) - Advancing Ante-Hoc Explainable Models through Generative Adversarial Networks [24.45212348373868]
This paper presents a novel concept learning framework for enhancing model interpretability and performance in visual classification tasks.
Our approach appends an unsupervised explanation generator to the primary classifier network and makes use of adversarial training.
This work presents a significant step towards building inherently interpretable deep vision models with task-aligned concept representations.
arXiv Detail & Related papers (2024-01-09T16:16:16Z) - Static and Dynamic Concepts for Self-supervised Video Representation Learning [70.15341866794303]
We propose a novel learning scheme for self-supervised video representation learning.
Motivated by how humans understand videos, we propose to first learn general visual concepts then attend to discriminative local areas for video understanding.
arXiv Detail & Related papers (2022-07-26T10:28:44Z) - Automatic Concept Extraction for Concept Bottleneck-based Video Classification [58.11884357803544]
We present an automatic Concept Discovery and Extraction module that rigorously composes a necessary and sufficient set of concept abstractions for concept-based video classification.
Our method elicits inherent complex concept abstractions in natural language to generalize concept-bottleneck methods to complex tasks.
arXiv Detail & Related papers (2022-06-21T06:22:35Z) - Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z) - Interpretable Visual Reasoning via Induced Symbolic Space [75.95241948390472]
We study the problem of concept induction in visual reasoning, i.e., identifying concepts and their hierarchical relationships from question-answer pairs associated with images.
We first design a new framework named object-centric compositional attention model (OCCAM) to perform the visual reasoning task with object-level visual features.
We then come up with a method to induce concepts of objects and relations using clues from the attention patterns between objects' visual features and question words.
arXiv Detail & Related papers (2020-11-23T18:21:49Z) - Explaining Motion Relevance for Activity Recognition in Video Deep Learning Models [12.807049446839507]
A small subset of explainability techniques has been applied for interpretability of 3D Convolutional Neural Network models in activity recognition tasks.
We propose a selective relevance method for adapting the 2D explanation techniques to provide motion-specific explanations.
Our results show that the selective relevance method not only provides insight into the role played by motion in the model's decision (in effect, revealing and quantifying the model's spatial bias) but also simplifies the resulting explanations for human consumption.
arXiv Detail & Related papers (2020-03-31T15:19:04Z)