Related papers: HieroAction: Hierarchically Guided VLM for Fine-Grained Action Analysis

URL: http://arxiv.org/abs/2508.16942v1
Date: Sat, 23 Aug 2025 08:19:27 GMT
Title: HieroAction: Hierarchically Guided VLM for Fine-Grained Action Analysis
Authors: Junhao Wu, Xiuer Gu, Zhiying Li, Yeying Jin, Yunfeng Diao, Zhiyu Li, Zhenbo Song, Xiaomei Zhang, Zhaoxin Fan
Abstract summary: HieroAction is a vision-language model that delivers accurate and structured assessments of human actions. The reasoning pathway structures the evaluation process, while policy learning refines each stage through reward-based optimization. Their integration ensures accurate and interpretable assessments, as demonstrated by superior performance across multiple benchmark datasets.
Score: 33.807258169748465
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating human actions with clear and detailed feedback is important in areas such as sports, healthcare, and robotics, where decisions rely not only on final outcomes but also on interpretable reasoning. However, most existing methods provide only a final score without explanation or detailed analysis, limiting their practical applicability. To address this, we introduce HieroAction, a vision-language model that delivers accurate and structured assessments of human actions. HieroAction builds on two key ideas: (1) Stepwise Action Reasoning, a tailored chain-of-thought process designed specifically for action assessment, which guides the model to evaluate actions step by step, from overall recognition through sub-action analysis to final scoring, thus enhancing interpretability and structured understanding; and (2) Hierarchical Policy Learning, a reinforcement learning strategy that enables the model to learn fine-grained sub-action dynamics and align them with high-level action quality, thereby improving scoring precision. The reasoning pathway structures the evaluation process, while policy learning refines each stage through reward-based optimization. Their integration ensures accurate and interpretable assessments, as demonstrated by superior performance across multiple benchmark datasets. Code will be released upon acceptance.
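
Illustrative sketch (not from the paper): the abstract says that policy learning aligns sub-action assessments with the final action score through rewards, but it does not give the reward's form. The Python snippet below shows one way such a two-level reward might be blended. Every name here (`SubActionAssessment`, `closeness`, `hierarchical_reward`, `sub_weight`), the assumption that scores are normalized to [0, 1], and the linear weighting are hypothetical choices made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubActionAssessment:
    """One sub-action step; all fields are hypothetical illustration data."""
    name: str          # e.g. "take-off"
    predicted: float   # model's quality estimate, assumed to lie in [0, 1]
    reference: float   # annotated quality, assumed to lie in [0, 1]


def closeness(predicted: float, reference: float) -> float:
    """Reward in [0, 1]: 1 for an exact match, 0 when off by 1 or more."""
    return max(0.0, 1.0 - abs(predicted - reference))


def hierarchical_reward(
    sub_actions: List[SubActionAssessment],
    predicted_score: float,
    reference_score: float,
    sub_weight: float = 0.5,
) -> float:
    """Blend the mean sub-action reward with the final-score reward.

    The linear blend is an assumption; the paper's actual reward may differ.
    """
    if not 0.0 <= sub_weight <= 1.0:
        raise ValueError("sub_weight must lie in [0, 1]")
    if sub_actions:
        sub_reward = sum(
            closeness(s.predicted, s.reference) for s in sub_actions
        ) / len(sub_actions)
    else:
        sub_reward = 0.0
    final_reward = closeness(predicted_score, reference_score)
    return sub_weight * sub_reward + (1.0 - sub_weight) * final_reward


if __name__ == "__main__":
    steps = [
        SubActionAssessment("take-off", predicted=0.8, reference=0.8),
        SubActionAssessment("entry", predicted=0.5, reference=1.0),
    ]
    print(hierarchical_reward(steps, predicted_score=0.9, reference_score=0.9))
```

In this sketch, a response that gets the final score right but misjudges a sub-action still earns less than one that is right at both levels, which is the alignment idea the abstract describes in general terms.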

Related papers

Action-Sufficient Goal Representations [18.88691169447082]
We introduce an information-theoretic framework that defines action sufficiency, a condition on goal representations necessary for optimal action selection. We prove that value sufficiency does not imply action sufficiency and empirically verify that the latter is more strongly associated with control success in a discrete environment.
arXiv Detail & Related papers (2026-01-30T03:08:37Z)
Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning [45.80546806373221]
We define a new Human Action Form Assessment task and introduce a new diverse dataset CoT-AFA. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. We propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution.
arXiv Detail & Related papers (2025-12-17T07:35:03Z)
CARL: Critical Action Focused Reinforcement Learning for Multi-Step Agent [53.56274149236814]
We propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents. CARL achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.
arXiv Detail & Related papers (2025-12-04T16:15:46Z)
What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation [67.47463575774388]
We decompose reasoning quality into two dimensions: relevance and coherence. To measure these aspects reliably, we introduce causal stepwise evaluation (CaSE). We show that curating training data with CaSE-evaluated relevance and coherence directly improves final task performance.
arXiv Detail & Related papers (2025-10-23T14:30:37Z)
StepWiser: Stepwise Generative Judges for Wiser Reasoning [52.32416311990343]
Process reward models address this by providing step-by-step feedback. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We show that it (i) provides better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
arXiv Detail & Related papers (2025-08-26T17:45:05Z)
Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning [63.531262595858]
A divide-and-conquer approach breaks the comprehensive evaluation task into localized scoring tasks, followed by a final global assessment. We introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation.
arXiv Detail & Related papers (2025-05-26T16:39:41Z)
From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results. Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
PanguIR Technical Report for NTCIR-18 AEOLLM Task [12.061652026366591]
Large language models (LLMs) are increasingly critical and challenging to evaluate. Manual evaluation, while comprehensive, is often costly and resource-intensive. Automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria.
arXiv Detail & Related papers (2025-03-04T07:40:02Z)
Training an LLM-as-a-Judge Model: Pipeline, Insights, and Practical Lessons [9.954960702259918]
This paper introduces Themis, a fine-tuned large language model (LLM) judge that delivers context-aware evaluations. We provide a comprehensive overview of the development pipeline for Themis, highlighting its scenario-dependent evaluation prompts. We introduce two human-labeled benchmarks for meta-evaluation, demonstrating that Themis can achieve high alignment with human preferences in an economical manner.
arXiv Detail & Related papers (2025-02-05T08:35:55Z)
A Unified Understanding and Evaluation of Steering Methods [17.420727709895736]
Steering methods provide a practical approach to controlling large language models by applying steering vectors to intermediate activations. Despite their growing importance, the field lacks a unified understanding and consistent evaluation across tasks and datasets. This paper introduces a unified framework for analyzing and evaluating steering methods, formalizing their core principles and offering theoretical insights into their effectiveness.
arXiv Detail & Related papers (2025-02-04T20:55:24Z)
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge [78.28188747489769]
We propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench.
arXiv Detail & Related papers (2025-01-30T02:21:59Z)
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback [94.25162866972077]
Step-KTO is a training framework that combines process-level and outcome-level binary feedback. Our experiments show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps.
arXiv Detail & Related papers (2025-01-18T15:38:03Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment [93.09267863425492]
We argue that understanding both high-level semantics and internal temporal structures of actions in competitive sports videos is the key to making predictions accurate and interpretable. We construct a new fine-grained dataset, called FineDiving, developed on diverse diving events with detailed annotations on action procedures.
arXiv Detail & Related papers (2022-04-07T17:59:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.