Related papers: ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More

ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More

URL: http://arxiv.org/abs/2403.12534v1
Date: Tue, 19 Mar 2024 08:15:53 GMT
Title: ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More
Authors: Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, Lin Wang,
Abstract summary: We propose ExACT, a novel approach that tackles event-based action recognition from a cross-modal conceptualizing perspective. Experiments show that our ExACT achieves superior recognition accuracy of 94.83%(+2.23%), 90.10%(+37.47%) and 67.24% on PAF, HARDVS and our SeAct datasets respectively.
Score: 7.797154022794006
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Event cameras have recently been shown beneficial for practical vision tasks, such as action recognition, thanks to their high temporal resolution, power efficiency, and reduced privacy concerns. However, current research is hindered by 1) the difficulty in processing events because of their prolonged duration and dynamic actions with complex and ambiguous semantics and 2) the redundant action depiction of the event frame representation with fixed stacks. We find language naturally conveys abundant semantic information, rendering it stunningly superior in reducing semantic uncertainty. In light of this, we propose ExACT, a novel approach that, for the first time, tackles event-based action recognition from a cross-modal conceptualizing perspective. Our ExACT brings two technical contributions. Firstly, we propose an adaptive fine-grained event (AFE) representation to adaptively filter out the repeated events for the stationary objects while preserving dynamic ones. This subtly enhances the performance of ExACT without extra computational cost. Then, we propose a conceptual reasoning-based uncertainty estimation module, which simulates the recognition process to enrich the semantic representation. In particular, conceptual reasoning builds the temporal relation based on the action semantics, and uncertainty estimation tackles the semantic uncertainty of actions based on the distributional representation. Experiments show that our ExACT achieves superior recognition accuracy of 94.83%(+2.23%), 90.10%(+37.47%) and 67.24% on PAF, HARDVS and our SeAct datasets respectively.

Related papers

Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models [7.802379200026965]
We propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state.<n>Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators.
arXiv Detail & Related papers (2026-03-05T13:14:41Z)
SUG-Occ: An Explicit Semantics and Uncertainty Guided Sparse Learning Framework for Real-Time 3D Occupancy Prediction [5.730573889498275]
We propose SUG-Occ, an explicit Semantics and Uncertainty Guided Sparse Learning Enabled 3D Occupancy Prediction Framework.<n>We first utilize semantic and uncertainty priors to suppress projections from free space during view transformation.<n>We then employ an explicit unsigned distance encoding to enhance geometric consistency, producing a structurally consistent sparse 3D representation.
arXiv Detail & Related papers (2026-01-16T16:07:38Z)
Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation [52.6091162517921]
INSIGHT is a two-stage framework for egocentric action anticipation.<n>In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions.<n>In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning.
arXiv Detail & Related papers (2025-08-03T12:52:27Z)
Rethinking Occlusion in FER: A Semantic-Aware Perspective and Go Beyond [10.015531203047598]
We present ORSANet, which introduces auxiliary multi-modal semantic guidance to disambiguate facial occlusion.<n>We also introduce facial landmarks as sparse geometric prior to mitigate intrinsic noises in FER, such as identity and gender biases.<n>Our proposed ORSANet achieves SOTA recognition performance.
arXiv Detail & Related papers (2025-07-21T09:04:29Z)
Uncertainty-boosted Robust Video Activity Anticipation [72.14155465769201]
Video activity anticipation aims to predict what will happen in the future, embracing a broad application prospect ranging from robot vision to autonomous driving. Despite the recent progress, the data uncertainty issue, reflected as the content evolution process and dynamic correlation in event labels, has been somehow ignored. We propose an uncertainty-boosted robust video activity anticipation framework, which generates uncertainty values to indicate the credibility of the anticipation results.
arXiv Detail & Related papers (2024-04-29T12:31:38Z)
Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations [9.444540281544715]
We introduce a novel agent for active open-vocabulary recognition. The proposed method leverages inter-frame and inter-concept similarities to navigate agent movements and to fuse features, without relying on class-specific knowledge.
arXiv Detail & Related papers (2023-11-28T19:24:07Z)
Recovering Continuous Scene Dynamics from A Single Blurry Image with Events [58.7185835546638]
An Implicit Video Function (IVF) is learned to represent a single motion blurred image with concurrent events. A dual attention transformer is proposed to efficiently leverage merits from both modalities. The proposed network is trained only with the supervision of ground-truth images of limited referenced timestamps.
arXiv Detail & Related papers (2023-04-05T18:44:17Z)
Leveraging Self-Supervised Training for Unintentional Action Recognition [82.19777933440143]
We seek to identify the points in videos where the actions transition from intentional to unintentional. We propose a multi-stage framework that exploits inherent biases such as motion speed, motion direction, and order to recognize unintentional actions.
arXiv Detail & Related papers (2022-09-23T21:36:36Z)
Action Quality Assessment with Temporal Parsing Transformer [84.1272079121699]
Action Quality Assessment (AQA) is important for action understanding and resolving the task poses unique challenges due to subtle visual differences. We propose a temporal parsing transformer to decompose the holistic feature into temporal part-level representations. Our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
arXiv Detail & Related papers (2022-07-19T13:29:05Z)
Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce representation that emphasizes the novel information in the frame of the current time-stamp. SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
Affective Processes: stochastic modelling of temporal context for emotion and facial expression recognition [38.47712256338113]
We build upon the framework of Neural Processes to propose a method for apparent emotion recognition with three key components. We validate our approach on four databases, two for Valence and Arousal estimation and two for Action Unit intensity estimation. Results show a consistent improvement over a series of strong baselines as well as over state-of-the-art methods.
arXiv Detail & Related papers (2021-03-24T17:48:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.