Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge
- URL: http://arxiv.org/abs/2512.06951v1
- Date: Sun, 07 Dec 2025 18:08:45 GMT
- Title: Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge
- Authors: Ilia Larchenko, Gleb Zarin, Akash Karnatak
- Abstract summary: We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge. The BEHAVIOR Challenge is a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation. Our approach achieves 26% q-score across all 50 tasks on both public and private leaderboards.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge - a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation, requiring bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves 26% q-score across all 50 tasks on both public and private leaderboards.
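The headline technique lends itself to a small illustration. The following is a minimal sketch, not the authors' code: the AR(1) noise process, the correlation coefficient `rho`, and all function names are assumptions about how temporally correlated noise could replace i.i.d. Gaussian noise as the flow-matching source distribution for an action chunk.

```python
import numpy as np

def correlated_noise(horizon, action_dim, rho=0.9, rng=None):
    """Temporally correlated Gaussian noise via an AR(1) process.

    rho sets the correlation between consecutive timesteps; rho=0
    recovers i.i.d. noise. The sqrt(1 - rho^2) scaling keeps unit
    marginal variance, so this is a drop-in replacement for
    standard Gaussian noise.
    """
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((horizon, action_dim))
    z = np.empty_like(eps)
    z[0] = eps[0]
    for t in range(1, horizon):
        z[t] = rho * z[t - 1] + np.sqrt(1.0 - rho**2) * eps[t]
    return z

def flow_matching_pair(actions, t, rho=0.9, rng=None):
    """Build one flow-matching training pair on a linear path.

    actions: (horizon, action_dim) chunk; t in [0, 1].
    Returns the interpolated sample x_t and the velocity target.
    """
    z = correlated_noise(*actions.shape, rho=rho, rng=rng)
    x_t = (1.0 - t) * z + t * actions
    velocity = actions - z  # constant velocity along the linear path
    return x_t, velocity
```

Because adjacent timesteps of the source noise move together, conditioning on already-executed actions during inpainting also constrains the noise at neighboring timesteps, which is the intuition behind the smooth action sequences mentioned in the abstract.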
Related papers
- Shared Multi-modal Embedding Space for Face-Voice Association [21.92195248206171]
The FAME 2026 challenge comprises two demanding tasks: training face-voice associations and testing on languages on which the model was not trained. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
arXiv Detail & Related papers (2025-12-04T14:04:15Z)
- Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict. DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z)
- A Technical Report on the Second Place Solution for the CIKM 2025 AnalytiCup Competition [11.41948435879935]
This work addresses the challenge of multilingual category relevance judgment in e-commerce search. We propose a framework that leverages prompt engineering with Chain-of-Thought task decomposition. Experimental results show that our single-model framework achieves competitive accuracy and high inference efficiency.
arXiv Detail & Related papers (2025-10-25T16:31:21Z)
- Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis [7.392659193819963]
Traffic safety analysis requires complex video understanding to capture behavioral patterns and generate descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization.
arXiv Detail & Related papers (2025-10-13T20:18:23Z)
- NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results [159.15538432295656]
The NTIRE 2025 image super-resolution ($\times$4) challenge is one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. A total of 286 participants registered for the competition, with 25 teams submitting valid entries.
arXiv Detail & Related papers (2025-04-20T12:08:22Z)
- NTIRE 2025 Challenge on Event-Based Image Deblurring: Methods and Results [162.7095344078484]
We present an overview of NTIRE 2025 the First Challenge on Event-Based Image Deblurring. The primary goal of the challenge is to design an event-based method that achieves high-quality image deblurring. We anticipate that this challenge will drive further advancements in event-based vision research.
arXiv Detail & Related papers (2025-04-16T18:06:16Z)
- The Tenth NTIRE 2025 Image Denoising Challenge Report [145.50639422469158]
The primary objective is to develop a network architecture capable of achieving high-quality denoising performance. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results.
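The degradation model stated here is simple enough to sketch. A minimal example, assuming images are floats on a 0-255 intensity scale and that values are clipped back to the valid range (a common convention, not specified in the summary):

```python
import numpy as np

def add_awgn(image, sigma=50.0, rng=None):
    """Add independent additive white Gaussian noise at a fixed level.

    sigma=50 matches the challenge's fixed noise level on a 0-255
    intensity scale. Clipping keeps outputs in the valid range.
    """
    rng = rng or np.random.default_rng()
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 255.0)
```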
arXiv Detail & Related papers (2025-04-16T17:35:09Z)
- CoMP: Continual Multimodal Pre-training for Vision Foundation Models [72.3323674291719]
We continually pre-train prevailing Vision Foundation Models (VFMs) in a multimodal manner. We introduce CoMP, a carefully designed multimodal pre-training pipeline. Leading VFMs like DINOv2, SigLIP and AIMv2 achieve remarkable improvements in multimodal understanding tasks.
arXiv Detail & Related papers (2025-03-24T17:52:47Z)
- 1st Place in ICCV 2023 Workshop Challenge Track 1 on Resource Efficient Deep Learning for Computer Vision: Budgeted Model Training Challenge [15.213786895534225]
We describe a resource-aware backbone search framework composed of profile and instantiation phases.
We employ multi-resolution ensembles to boost inference accuracy on limited resources.
Based on our approach, we won first place in the International Conference on Computer Vision (ICCV) 2023 Workshop Challenge Track 1 on Resource Efficient Deep Learning for Computer Vision (RCV).
arXiv Detail & Related papers (2023-08-09T05:38:18Z)
- 2nd Place Solution for SODA10M Challenge 2021 -- Continual Detection Track [35.06282647572304]
We adapt ResNet50-FPN as the baseline and try several improvements for the final submission model.
We find that a task-specific replay scheme, learning-rate scheduling, model calibration, and using the original image scale help to improve performance for both large and small objects in images.
arXiv Detail & Related papers (2021-10-25T15:58:19Z)
- Two-Stream Consensus Network: Submission to HACS Challenge 2021 Weakly-Supervised Learning Track [78.64815984927425]
The goal of weakly-supervised temporal action localization is to temporally locate and classify actions of interest in untrimmed videos.
We adopt the two-stream consensus network (TSCN) as the main framework in this challenge.
Our solution ranked 2nd in this challenge, and we hope our method can serve as a baseline for future academic research.
arXiv Detail & Related papers (2021-06-21T03:36:36Z)
- CVPR 2020 Continual Learning in Computer Vision Competition: Approaches, Results, Current Challenges and Future Directions [25.791936837340877]
The first Continual Learning in Computer Vision challenge, held at CVPR in 2020, was one of the first opportunities to evaluate different continual learning algorithms.
We report the main results of the competition, which drew more than 79 registered teams, 11 finalists, and $2,300 in prizes.
arXiv Detail & Related papers (2020-09-14T08:53:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.