Fugu-MT Paper Translation (Abstract): SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25+ Sign Languages

Paper overview: SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25+ Sign Languages

arxiv url: http://arxiv.org/abs/2605.01720v1
Date: Sun, 03 May 2026 05:26:26 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.903311
Title: SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25+ Sign Languages
Title (reference translation): SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25+ Sign Languages
Authors: Sen Fang, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas
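Abstract summary: We present SignVerse-2M, a large-scale multilingual pose-native dataset for sign language pose modeling and evaluation. It applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real-world videos.
Reference score (independently computed attention measure): 28.65355856480869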
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing large-scale sign language resources typically provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks: 1. RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open-world settings than style-agnostic pose-processing models. 2. Recent pose-guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose-native paradigm while also targeting real-world open scenarios. We present SignVerse-2M, a large-scale multilingual pose-native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, resulting in a consolidated corpus of about two million clips covering more than 25 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real-world videos while reducing appearance variation through a unified pose representation. Toward this goal, we further provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline, demonstrating the feasibility of this resource for multilingual pose-space modeling and its compatibility with modern pose-driven pipelines, while discussing the evaluation claims it can support as well as its current limitations.
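Abstract (reference translation): Existing large-scale sign language resources usually provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. Such resources matter for semantic understanding, but they do not directly offer a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks: 1. RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open-world settings than style-agnostic pose-processing models. 2. Recent pose-guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. The sign language field currently lacks a data resource that can interface directly with this modern pose-native paradigm while also targeting real-world open scenarios. We present SignVerse-2M, a large-scale multilingual pose-native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, yielding a consolidated corpus of about two million clips covering more than 25 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real-world videos while reducing appearance variation through a unified pose representation. Toward this goal, we further provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline, demonstrating the feasibility of this resource for multilingual pose-space modeling and its compatibility with modern pose-driven pipelines, and we discuss the evaluation claims it can support as well as its current limitations.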
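
To make the phrase "2D pose sequences that can be used directly for modeling" concrete, here is a minimal Python sketch of how one clip might be held in memory and prepared for a model. Everything in it is an assumption for illustration, not the dataset's actual schema: the `PoseClip` container, its field names, the `(x, y, confidence)` keypoint layout, the `normalize_clip` helper, and the `min_conf` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: each keypoint is (x, y, confidence) in pixel coordinates.
Keypoint = Tuple[float, float, float]


@dataclass
class PoseClip:
    """Hypothetical container for one clip of per-frame 2D keypoints."""
    language: str                  # illustrative sign language label
    width: int                     # source video width in pixels
    height: int                    # source video height in pixels
    frames: List[List[Keypoint]]   # frames[t][k] is keypoint k at frame t


def normalize_clip(clip: PoseClip, min_conf: float = 0.3) -> List[List[Keypoint]]:
    """Scale x and y by the frame size and zero out low-confidence keypoints."""
    normalized = []
    for frame in clip.frames:
        points = []
        for x, y, conf in frame:
            if conf < min_conf:
                # Treat unreliable detections as missing.
                points.append((0.0, 0.0, 0.0))
            else:
                points.append((x / clip.width, y / clip.height, conf))
        normalized.append(points)
    return normalized


# Usage: one frame with one confident and one unreliable keypoint.
clip = PoseClip("example", 640, 480, [[(320.0, 240.0, 0.9), (10.0, 10.0, 0.1)]])
result = normalize_clip(clip)
```

Dividing by the frame size is one common way to make clips from videos of different resolutions comparable; the abstract does not say whether the released sequences are stored in pixel or normalized coordinates.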
