Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification
- URL: http://arxiv.org/abs/2509.20899v1
- Date: Thu, 25 Sep 2025 08:35:03 GMT
- Title: Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification
- Authors: Patrick Knab, Sascha Marton, Philipp J. Schubert, Drago Guggiana, Christian Bartelt,
- Abstract summary: MoTIF is an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time.
- Score: 10.376843346305112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., 'bow', 'mount', 'shoot') that reoccur across time - forming motifs collectively describing and explaining actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance. Code available at github.com/patrick-knab/MoTIF.
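The abstract describes projecting video into concept activations and modeling their temporal dependencies with a transformer, yielding global, local (per-window), and temporal views of concept contributions. The following is a minimal sketch of that idea, not the authors' implementation: all class and parameter names (`VideoConceptBottleneck`, `feat_dim`, `n_concepts`, mean-pooling for global importance) are assumptions for illustration; see the linked repository for the actual MoTIF code.

```python
import torch
import torch.nn as nn

class VideoConceptBottleneck(nn.Module):
    """Hypothetical concept-bottleneck head for video (illustrative only).

    Per-frame features are projected to concept scores, a transformer
    encoder models temporal dependencies between concept activations,
    and an interpretable linear layer maps pooled concepts to classes.
    """

    def __init__(self, feat_dim=512, n_concepts=32, n_classes=10, n_heads=4):
        super().__init__()
        # frame features -> per-frame concept scores (the "bottleneck")
        self.to_concepts = nn.Linear(feat_dim, n_concepts)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=n_concepts, nhead=n_heads, batch_first=True)
        # self-attention over time handles sequences of arbitrary length
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=2)
        # linear head keeps the concept-to-class mapping inspectable
        self.classifier = nn.Linear(n_concepts, n_classes)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim)
        local = self.to_concepts(frame_feats)   # local concept relevance per frame/window
        temporal = self.temporal(local)         # temporal dependencies of concepts
        global_imp = temporal.mean(dim=1)       # global concept importance via pooling
        return self.classifier(global_imp), local, global_imp
```

Because the bottleneck is a plain linear projection and the classifier is linear over pooled concept activations, `local` and `global_imp` can be read off directly as the three explanation views the abstract names.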
Related papers
- Insight: Interpretable Semantic Hierarchies in Vision-Language Encoders [52.94006363830628]
Language-aligned vision foundation models perform strongly across diverse downstream tasks. Recent works decompose these representations into human-interpretable concepts, but provide poor spatial grounding and are limited to image classification tasks. We propose Insight, a language-aligned concept foundation model that provides fine-grained concepts, which are human-interpretable and spatially grounded in the input image.
arXiv Detail & Related papers (2026-01-20T09:57:26Z)
- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction [65.15449703659772]
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. We propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC achieves an 11.8-point improvement over SAM on SeCVOS, establishing a new state of the art in concept-aware video object segmentation.
arXiv Detail & Related papers (2025-07-21T17:59:02Z)
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce a dual-embedding semantic comprehension mechanism that disentangles subject and motion representations. At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z)
- PCBEAR: Pose Concept Bottleneck for Explainable Action Recognition [9.179016800487506]
We propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR). PCBEAR introduces human pose sequences as motion-aware, structured concepts for video action recognition. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process.
arXiv Detail & Related papers (2025-04-17T17:50:07Z)
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- Understanding Video Transformers via Universal Concept Discovery [44.869479587300525]
We seek to explain the decision-making process of transformers based on high-level, temporal concepts that are automatically discovered.
We introduce the first Video Transformer Concept Discovery (VTCD) algorithm.
The resulting concepts are highly interpretable, revealing temporal reasoning mechanisms and object-centric representations in unstructured video models.
arXiv Detail & Related papers (2024-01-19T17:27:21Z)
- Automatic Concept Extraction for Concept Bottleneck-based Video Classification [58.11884357803544]
We present an automatic Concept Discovery and Extraction module that rigorously composes a necessary and sufficient set of concept abstractions for concept-based video classification.
Our method elicits inherent complex concept abstractions in natural language to generalize concept-bottleneck methods to complex tasks.
arXiv Detail & Related papers (2022-06-21T06:22:35Z)
- Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis [105.06166692486674]
We study temporal concept receptive field of concept-based event representation.
We introduce temporal dynamic convolution (TDC) to give stronger flexibility to concept-based event analytics.
Different coefficients can generate appropriate and accurate temporal concept receptive field size according to input videos.
arXiv Detail & Related papers (2021-11-23T04:59:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.