VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation
- URL: http://arxiv.org/abs/2602.07399v1
- Date: Sat, 07 Feb 2026 06:31:53 GMT
- Title: VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation
- Authors: Changhua Xu, Jie Lu, Junyu Xuan, En Yu,
- Abstract summary: Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework, VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-N selection to identify action chunks that are both semantically faithful and geometrically precise.
- Score: 22.508129824741555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures frequently arise from unresolved geometric ambiguities, where near-miss action candidates lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework, VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-N selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a fine-tuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic that resolves fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which explicitly shapes a discriminative value landscape to preserve action-ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.
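The generation-selection recipe the abstract describes can be sketched generically: a policy proposes N candidate action chunks, a critic scores each whole chunk, and the best-scoring chunk is executed. The sketch below is a minimal toy illustration, not the paper's implementation: the random-sampling policy, the distance-to-goal critic, and all function names (`propose_action_chunks`, `chunk_value`, `best_of_n`) are hypothetical stand-ins for the fine-tuned VLA generator and the Q-Chunk-Former critic.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_action_chunks(observation, n_candidates=8, chunk_len=4, action_dim=2):
    """Stand-in for a fine-tuned VLA policy: samples N candidate action chunks.

    A real policy would condition on images and language instructions; here we
    simply sample random chunks around a nominal direction for illustration.
    """
    return rng.normal(loc=0.5, scale=0.2, size=(n_candidates, chunk_len, action_dim))

def chunk_value(observation, chunk, goal):
    """Stand-in critic that scores a whole action chunk.

    VGAS uses a learned Transformer critic; this toy version scores a chunk by
    the negative distance between its integrated endpoint and a goal position.
    """
    endpoint = observation + chunk.sum(axis=0)  # integrate the action sequence
    return -np.linalg.norm(endpoint - goal)

def best_of_n(observation, goal, n_candidates=8):
    """Best-of-N selection: generate candidates, score each, keep the argmax."""
    chunks = propose_action_chunks(observation, n_candidates)
    scores = [chunk_value(observation, c, goal) for c in chunks]
    return chunks[int(np.argmax(scores))]

obs = np.zeros(2)
goal = np.array([2.0, 2.0])
chosen = best_of_n(obs, goal)  # the chunk the robot would execute next
```

The key design point carried over from the abstract is that the critic ranks entire chunks rather than single actions, so near-miss candidates that only diverge late in the chunk can still be told apart.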
Related papers
- OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper develops a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z) - Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation [22.973340187143616]
We propose Entropy-Guided k-Guard sampling (ENkG), a strategy that adapts the sampling candidate set to token-wise entropy dispersion. In low-entropy regions, ENkG employs fewer candidates to suppress redundant noise and preserve structural integrity. Experiments demonstrate consistent improvements in perceptual quality and structural stability compared to static top-k/top-p strategies.
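The idea of scaling the candidate-set size with token entropy can be illustrated with a short sketch. This is a generic interpretation of the summary above, not the paper's exact rule: the linear entropy-to-k mapping and the names `token_entropy` and `adaptive_top_k` are assumptions made for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_top_k(probs, k_min=2, k_max=8):
    """Choose a candidate-set size from normalized entropy, then keep top-k.

    Low entropy (confident prediction) -> few candidates, suppressing noise;
    high entropy (uncertain prediction) -> more candidates, keeping diversity.
    """
    h = token_entropy(probs)
    h_max = math.log(len(probs))  # uniform distribution maximizes entropy
    frac = h / h_max if h_max > 0 else 0.0
    k = k_min + round(frac * (k_max - k_min))
    # return the indices of the k most probable tokens
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    return ranked[:k]

peaked = [0.90, 0.05, 0.03, 0.01, 0.01]  # low entropy -> small candidate set
flat = [0.20, 0.20, 0.20, 0.20, 0.20]    # high entropy -> large candidate set
```

For example, the peaked distribution above yields a smaller candidate set than the flat one, which is the adaptive behavior contrasted with static top-k/top-p sampling.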
arXiv Detail & Related papers (2026-01-27T11:19:53Z) - PROMISE: Process Reward Models Unlock Test-Time Scaling Laws in Generative Recommendations [52.67948063133533]
Generative Recommendation has emerged as a promising paradigm, reformulating recommendation as a sequence-to-sequence generation task over hierarchical Semantic IDs. Existing methods suffer from a critical issue we term Semantic Drift, where errors in early, high-level tokens irreversibly divert the generation trajectory into irrelevant semantic subspaces. We propose PROMISE, a novel framework that integrates dense, step-by-step verification into generative models.
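Dense step-by-step verification, as described in the summary above, can be sketched as a beam search where a process reward scores every partial sequence rather than only finished ones. The sketch below is a minimal toy version under stated assumptions: the monotonicity-based `step_reward` and the function `prm_guided_decode` are hypothetical stand-ins for a learned process reward model and the paper's decoder.

```python
def step_reward(prefix, token):
    """Stand-in process reward model scoring one decoding step.

    A real PRM is learned; this toy version rewards tokens that keep the
    sequence monotonically increasing, penalizing early 'drift'.
    """
    if prefix and token < prefix[-1]:
        return -1.0
    return 1.0

def prm_guided_decode(candidates_per_step, beam_width=2):
    """Beam search with per-step reward scoring.

    Because every step is verified, prefixes whose early high-level tokens
    drift are pruned immediately instead of surviving to the final scoring.
    """
    beams = [([], 0.0)]  # (prefix, cumulative reward)
    for candidates in candidates_per_step:
        expanded = []
        for prefix, score in beams:
            for tok in candidates:
                expanded.append((prefix + [tok], score + step_reward(prefix, tok)))
        expanded.sort(key=lambda b: -b[1])  # keep the best-scoring prefixes
        beams = expanded[:beam_width]
    return beams[0][0]

steps = [[3, 1], [2, 5], [4, 6]]  # token candidates at each decoding step
out = prm_guided_decode(steps)
```

Here the decoder settles on an increasing sequence because the per-step reward discards decreasing prefixes before they can dominate, which is the drift-correction behavior the summary attributes to dense verification.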
arXiv Detail & Related papers (2026-01-08T07:38:46Z) - Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning [32.249022698727856]
Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). We propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset.
arXiv Detail & Related papers (2025-12-31T03:28:17Z) - Geometrically-Constrained Agent for Spatial Reasoning [53.93718394870856]
Vision Language Models exhibit a fundamental semantic-to-geometric gap in spatial reasoning. Current paradigms fail to bridge this gap. We propose a training-free agentic paradigm that resolves this gap by introducing a formal task constraint.
arXiv Detail & Related papers (2025-11-27T17:50:37Z) - On Geometric Structures for Policy Parameterization in Continuous Control [7.056222499095849]
We propose a novel, computationally efficient action generation paradigm that preserves the structural benefits of operating on a unit manifold. Our method decomposes the action into a deterministic directional vector and a learnable concentration, enabling efficient interpolation between the target direction and uniform noise. Empirically, our method matches or exceeds state-of-the-art methods on standard continuous control benchmarks.
arXiv Detail & Related papers (2025-11-11T13:32:38Z) - Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction [51.50282796099369]
This paper develops a multi-dimensional instruction uncertainty reduction framework to generate semantically constrained adversarial examples. By predicting the language-guided sampling process, the optimization is stabilized by the designed ResAdv-DDIM sampler. We realize reference-free generation of semantically constrained 3D adversarial examples for the first time.
arXiv Detail & Related papers (2025-10-27T04:02:52Z) - TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation [9.906359339999039]
We introduce a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. We propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces. Our approach outperforms previous methods, setting the new state of the art on classical (DomainNet) and complex (GeoNet) domain shifts.
arXiv Detail & Related papers (2025-08-08T16:51:44Z) - VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought [51.43082554363725]
We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to decide when additional visual evidence is needed. Experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art.
arXiv Detail & Related papers (2025-05-22T03:50:13Z) - Advancing Generalized Transfer Attack with Initialization Derived Bilevel Optimization and Dynamic Sequence Truncation [49.480978190805125]
Transfer attacks attract significant interest for black-box applications.
Existing works essentially directly optimize the single-level objective w.r.t. the surrogate model.
We propose a bilevel optimization paradigm, which explicitly reformulates the nested relationship between the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL) surrogate attacker.
arXiv Detail & Related papers (2024-06-04T07:45:27Z) - ADDMU: Detection of Far-Boundary Adversarial Examples with Data and
Model Uncertainty Estimation [125.52743832477404]
Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks.
We propose a new technique, ADDMU, which combines two types of uncertainty estimation for both regular and far-boundary (FB) adversarial example detection.
Our new method outperforms previous methods by 3.6 and 6.0 AUC points under each scenario.
arXiv Detail & Related papers (2022-10-22T09:11:12Z) - Semi-Supervised Temporal Action Detection with Proposal-Free Masking [134.26292288193298]
We propose a novel Semi-supervised Temporal action detection model based on a PropOsal-free Temporal mask (SPOT).
SPOT outperforms state-of-the-art alternatives, often by a large margin.
arXiv Detail & Related papers (2022-07-14T16:58:47Z) - Nearly Dimension-Independent Sparse Linear Bandit over Small Action
Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model.
This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine.
We propose doubly growing epochs and estimating the parameter using the best subset selection method.
arXiv Detail & Related papers (2020-09-04T04:10:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.