Fugu-MT paper translation (abstract): DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

Paper overview: DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

arxiv url: http://arxiv.org/abs/2604.26565v1
Date: Wed, 29 Apr 2026 11:51:35 GMT
Status: Translation complete
Last updated in system: 2026-04-30 15:59:36.378577
Title: DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
Title (reference translation): DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
Authors: Mingji Ge, Qirui Chen, Zeqian Li, Weidi Xie
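Abstract summary: We introduce an automated, training-free pipeline for extracting high-quality procedural annotations from in-the-wild instructional videos. The method segments videos into coherent shots, filters poorly aligned content, and generates temporally grounded procedural steps. The pipeline yields DenseStep2M, a large-scale dataset of approximately 100K videos and 2M detailed instructional steps.
Reference score (independently computed attention score): 48.62186630490615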
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support comprehensive long-form video understanding. To rigorously evaluate our pipeline, we curate DenseCaption100, a benchmark of high-quality, human-written captions. Evaluations demonstrate strong alignment between our auto-generated steps and human annotations. Furthermore, we validate the utility of DenseStep2M across three core downstream tasks: dense video captioning, procedural step grounding, and cross-modal retrieval. Models fine-tuned on DenseStep2M achieve substantial gains in captioning quality and temporal localization, while exhibiting robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains. These results underscore the effectiveness of DenseStep2M in facilitating advanced multimodal alignment and long-term activity reasoning. Our dataset is available at https://huggingface.co/datasets/mingjige/DenseStep2M.
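Abstract (reference translation): Long-term video understanding requires interpreting complex temporal events and reasoning about procedural activities. Instructional video corpora such as HowTo100M offer rich resources for model training, but they pose significant challenges, including noisy ASR transcripts and inconsistent temporal alignment between narration and visual content. In this work, we propose an automated, training-free pipeline that extracts high-quality procedural annotations from in-the-wild instructional videos. The method segments videos into coherent shots, filters poorly aligned content, and uses state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. The pipeline produces DenseStep2M, a large-scale dataset of approximately 100K videos and 2M detailed instructional steps, designed to support comprehensive long-form video understanding. To evaluate the pipeline rigorously, we curated DenseCaption100, a benchmark of high-quality, human-written captions. The evaluation shows strong agreement between the automatically generated steps and the human annotations. We further validate the usefulness of DenseStep2M on three core downstream tasks: dense video captioning, procedural step grounding, and cross-modal retrieval. Models fine-tuned on DenseStep2M show substantial gains in captioning quality and temporal localization, along with robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains. These results support the effectiveness of DenseStep2M for advanced multimodal alignment and long-term activity reasoning. Our dataset is available at https://huggingface.co/datasets/mingjige/DenseStep2M.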
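The abstract names three stages: shot segmentation, filtering of poorly aligned content, and generation of temporally grounded steps. The sketch below is only an illustration of how the last two stages might fit together, not the authors' implementation. Every name here (Shot, Step, filter_shots, generate_steps, annotate), the 0.5 threshold, and the toy scoring and describing functions are assumptions; in the paper these roles are played by Qwen2.5-VL and DeepSeek-R1, which are replaced here by plain callables, and shots are assumed to be already segmented.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    """A coherent video shot (hypothetical structure)."""
    start: float      # shot start time in seconds
    end: float        # shot end time in seconds
    transcript: str   # ASR text overlapping the shot


@dataclass
class Step:
    """A temporally grounded procedural step (hypothetical structure)."""
    start: float
    end: float
    text: str


def filter_shots(shots: List[Shot],
                 align_score: Callable[[Shot], float],
                 threshold: float = 0.5) -> List[Shot]:
    """Keep shots whose narration/visual alignment score meets the threshold.

    The threshold value is an assumption, not a figure from the paper.
    """
    return [s for s in shots if align_score(s) >= threshold]


def generate_steps(shots: List[Shot],
                   describe: Callable[[Shot], str]) -> List[Step]:
    """Produce one step per kept shot, inheriting the shot's time span."""
    return [Step(s.start, s.end, describe(s)) for s in shots]


def annotate(shots: List[Shot],
             align_score: Callable[[Shot], float],
             describe: Callable[[Shot], str],
             threshold: float = 0.5) -> List[Step]:
    """Filter shots, then generate steps from the survivors."""
    return generate_steps(filter_shots(shots, align_score, threshold), describe)


# Toy usage: stand-in callables replace the multimodal and language models.
shots = [
    Shot(0.0, 4.0, "hi welcome back"),
    Shot(4.0, 12.5, "crack two eggs into the bowl"),
]
toy_score = lambda s: 0.9 if "eggs" in s.transcript else 0.1
toy_describe = lambda s: s.transcript.capitalize()
steps = annotate(shots, toy_score, toy_describe)
```

Passing the scorer and describer as callables keeps the sketch independent of any particular model API; a real system would substitute model-backed functions for the two toy lambdas.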

Related papers