Related papers: ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos

ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos

URL: http://arxiv.org/abs/2603.04265v1
Date: Wed, 04 Mar 2026 16:50:07 GMT
Title: ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
Authors: Luigi Seminara, Davide Moltisanti, Antonino Furnari,
Abstract summary: Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal.<n>Existing approaches rely on large-scale models that learn procedural structures implicitly.<n>We introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process.
Score: 15.697653554425045
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal, a fundamental ability for intelligent agents operating in complex environments. Existing approaches typically rely on large-scale models that learn procedural structures implicitly, resulting in limited sample-efficiency and high computational cost. In this work we introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph (PKG) directly with the Viterbi decoding algorithm, replacing non-differentiable operations with smooth relaxations that enable end-to-end optimization. This design allows the model to learn through graph-based decoding. Experiments on CrossTask, COIN, and NIV demonstrate that ViterbiPlanNet achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners. Extensive ablations show that performance gains arise from our differentiable structure-aware training rather than post-hoc refinement, resulting in improved sample efficiency and robustness to shorter unseen horizons. We also address testing inconsistencies establishing a unified testing protocol with consistent splits and evaluation metrics. With this new protocol, we run experiments multiple times and report results using bootstrapping to assess statistical significance.

Related papers

Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback [59.287761696290865]
We propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback.<n>We demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.
arXiv Detail & Related papers (2026-02-09T06:29:54Z)
Dynamic Learning Rate Scheduling based on Loss Changes Leads to Faster Convergence [2.1665689529884697]
emphGreedyLR is a novel scheduler that adaptively adjusts the learning rate during training based on the current loss.<n>Our approach outperforms several state-of-the-art schedulers in terms of accuracy, speed, and convergence.
arXiv Detail & Related papers (2025-12-16T16:03:52Z)
Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose textbfTACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks.<n>Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it incurs significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z)
Improving Large Language Model Planning with Action Sequence Similarity [50.52049888490524]
In this work, we explore how to improve the model planning capability through in-context learning (ICL)<n>We propose GRASE-DC: a two-stage pipeline that first re-samples high AS exemplars and then curates the selected exemplars.<n>Our experimental result confirms that GRASE-DC achieves significant performance improvement on various planning tasks.
arXiv Detail & Related papers (2025-05-02T05:16:17Z)
An experimental approach on Few Shot Class Incremental Learning [0.0]
Few-Shot Class-Incremental Learning (FSCIL) represents a cutting-edge paradigm within the broader scope of machine learning.<n>The paper will present different solutions which contain extensive experiments across large-scale datasets.<n>We highlight their advantages and then present an experimental approach with the purpose of improving the most promising one.
arXiv Detail & Related papers (2025-03-14T12:36:15Z)
Boosting Vision-Language Models with Transduction [12.281505126587048]
We present TransCLIP, a novel and computationally efficient transductive approach for vision-language models. TransCLIP is applicable as a plug-and-play module on top of popular inductive zero- and few-shot models.
arXiv Detail & Related papers (2024-06-03T23:09:30Z)
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos [16.333295670635557]
We explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. We coin our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data.
arXiv Detail & Related papers (2024-03-05T08:55:51Z)
Uncovering the Hidden Cost of Model Compression [43.62624133952414]
Visual Prompting has emerged as a pivotal method for transfer learning in computer vision. Model compression detrimentally impacts the performance of visual prompting-based transfer. However, negative effects on calibration are not present when models are compressed via quantization.
arXiv Detail & Related papers (2023-08-29T01:47:49Z)
Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed textbfMMTrack, which casts VL tracking as a token generation task. Our proposed framework serializes language description and bounding box into a sequence of discrete tokens. In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks. We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
Prototypical Contrastive Learning of Unsupervised Representations [171.3046900127166]
Prototypical Contrastive Learning (PCL) is an unsupervised representation learning method. PCL implicitly encodes semantic structures of the data into the learned embedding space. PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks.
arXiv Detail & Related papers (2020-05-11T09:53:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.