Fugu-MT Paper Translation (Abstract): CRAFT: Adapting VLA Models to Contact-rich Manipulation via Force-aware Curriculum Fine-tuning

Paper overview: CRAFT: Adapting VLA Models to Contact-rich Manipulation via Force-aware Curriculum Fine-tuning

arxiv url: http://arxiv.org/abs/2602.12532v1
Date: Fri, 13 Feb 2026 02:28:21 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.419075
Title: CRAFT: Adapting VLA Models to Contact-rich Manipulation via Force-aware Curriculum Fine-tuning
Title (reference translation): CRAFT: Applying VLA Models to Contact-Rich Manipulation through Force-Aware Curriculum Fine-Tuning
Authors: Yike Zhang, Yaonan Wang, Xinxin Sun, Kaizhen Huang, Zhiyuan Xu, Junjie Ji, Zhengping Che, Jian Tang, Jingtao Sun
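Abstract summary: Vision-Language-Action models can execute general instructions but struggle with contact-rich manipulation tasks. CRAFT is a force-aware curriculum fine-tuning framework that regulates vision and language embeddings during early training. The authors show that CRAFT consistently improves task success, generalizes to unseen objects and novel task variations, and adapts effectively across diverse VLA architectures.
Reference score (independently calculated attention score): 46.57805525532354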
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have shown a strong capability in enabling robots to execute general instructions, yet they struggle with contact-rich manipulation tasks, where success requires precise alignment, stable contact maintenance, and effective handling of deformable objects. A fundamental challenge arises from the imbalance between high-entropy vision and language inputs and low-entropy but critical force signals, which often leads to over-reliance on perception and unstable control. To address this, we introduce CRAFT, a force-aware curriculum fine-tuning framework that integrates a variational information bottleneck module to regulate vision and language embeddings during early training. This curriculum strategy encourages the model to prioritize force signals initially, before progressively restoring access to the full multimodal information. To enable force-aware learning, we further design a homologous leader-follower teleoperation system that collects synchronized vision, language, and force data across diverse contact-rich tasks. Real-world experiments demonstrate that CRAFT consistently improves task success, generalizes to unseen objects and novel task variations, and adapts effectively across diverse VLA architectures, enabling robust and generalizable contact-rich manipulation.
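Abstract (reference translation): Vision-Language-Action (VLA) models show a strong ability to let robots execute general instructions, but they struggle with contact-rich manipulation tasks, where success requires precise alignment, stable contact maintenance, and effective handling of deformable objects. The fundamental challenge stems from the imbalance between high-entropy vision and language inputs and low-entropy but critical force signals, which often leads to over-reliance on perception and to unstable control. To address this, the paper introduces CRAFT, a force-aware curriculum fine-tuning framework that incorporates a variational information bottleneck module to regulate vision and language embeddings during early training. This curriculum strategy encourages the model to prioritize force signals at first, before progressively restoring access to the full multimodal information. To enable force-aware learning, the authors further design a homologous leader-follower teleoperation system that collects synchronized vision, language, and force data across diverse contact-rich tasks. Real-world experiments show that CRAFT consistently improves task success, generalizes to unseen objects and novel task variations, and adapts effectively across diverse VLA architectures, enabling robust and generalizable contact-rich manipulation.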
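
As a rough illustration of the curriculum idea described in the abstract, the following Python sketch shows one way a variational information bottleneck (VIB) penalty on vision and language embeddings could be weighted heavily early in training and then relaxed. It is not the authors' implementation: the linear decay schedule, the function names (`kl_to_standard_normal`, `reparameterize`, `beta_schedule`, `total_loss`), and the parameters `warmup_steps` and `beta_max` are assumptions made for illustration, and the abstract does not state how the bottleneck is scheduled or where it is attached in the model.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def beta_schedule(step, warmup_steps, beta_max):
    """Hypothetical linear decay: beta_max at step 0, zero from warmup_steps onward.

    A large beta compresses the vision/language code strongly, which is one way
    to push a policy toward the force input early on (an assumption, not a
    detail taken from the paper).
    """
    if warmup_steps <= 0:
        return 0.0
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    return beta_max * (1.0 - frac)


def total_loss(action_loss, mu, logvar, step, warmup_steps=1000, beta_max=1.0):
    """Action loss plus the scheduled VIB penalty on the vision/language code."""
    beta = beta_schedule(step, warmup_steps, beta_max)
    return action_loss + beta * kl_to_standard_normal(mu, logvar)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, logvar = [1.0, 0.0], [0.0, 0.0]  # toy 2-D vision/language code
    z = reparameterize(mu, logvar, rng)  # stochastic code passed downstream
    for step in (0, 500, 1000):
        print(step, total_loss(2.0, mu, logvar, step))
```

In this sketch the penalty vanishes once `step` reaches `warmup_steps`, so the model then sees the vision and language embeddings without any compression pressure; other schedules (stepwise, cosine) would fit the same description equally well.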

Related papers