Fugu-MT Paper Translation (Abstract): Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

Paper summary: Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

arxiv url: http://arxiv.org/abs/2606.02120v1
Date: Mon, 01 Jun 2026 11:50:16 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:31.902213
Title: Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection
Title (reference translation): Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection
Authors: Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Qingming Huang
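Abstract summary: We propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.
Reference score (attention score computed independently by this site): 85.49213290363834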
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.
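Abstract (reference translation): This report addresses the problem of determining, from egocentric video data, whether a user performs an action incorrectly. We propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch takes both the coarse-grained video and the fine-grained segment as input to identify actions that are locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch and large-branch predictions are adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, the classifiers are optimized with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.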
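The abstract describes two ideas that a small example can make concrete: fusing the two branch predictions through a gate, and reweighting cross-entropy for rare mistake labels. The following is a minimal sketch, not the authors' implementation. It assumes a scalar sigmoid gate that forms a convex combination of the two branch probabilities, and inverse-frequency class weights for a binary mistake label. The function names (`gated_fusion`, `class_weights`, `reweighted_bce`) and the `gate_logit` input are hypothetical; the abstract does not specify how the gate is parameterized or which weighting scheme is used.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(p_small: float, p_large: float, gate_logit: float) -> float:
    """Convex combination of two branch probabilities.

    Assumption: the gate is a single scalar logit. In a real system it
    would be produced per sample by a small learned module.
    """
    g = sigmoid(gate_logit)
    return g * p_small + (1.0 - g) * p_large


def class_weights(labels: list[int]) -> dict[int, float]:
    """Inverse-frequency weights for binary labels (1 = mistake)."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return {1: n / (2.0 * n_pos), 0: n / (2.0 * n_neg)}


def reweighted_bce(probs: list[float], labels: list[int],
                   weights: dict[int, float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with a per-class weight on each sample."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += weights[y] * loss
    return total / len(labels)


if __name__ == "__main__":
    labels = [1, 0, 0, 0]  # one rare mistake among four segments
    weights = class_weights(labels)
    fused = [gated_fusion(ps, pl, gate_logit=0.0)
             for ps, pl in [(0.7, 0.9), (0.2, 0.1), (0.3, 0.1), (0.1, 0.2)]]
    print(weights)
    print(reweighted_bce(fused, labels, weights))
```

With inverse-frequency weights, a missed rare mistake contributes more to the loss than a false alarm on a frequent correct action, which is the general motivation for reweighting under a long-tailed label distribution.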

Related papers

Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models [12.459927405623624]
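Divide-and-Conquer Inference (DCI) is a new test-time scaling strategy for visual recognition with MLLMs. DCI decomposes a complex global classification task into simpler, localized subproblems and uses a dynamic pruning mechanism to compress the search space. As a model-agnostic, plug-and-play paradigm, DCI offers an efficient way to scale the inference accuracy of MLLMs in large-scale scenarios.
Paper reference translation (metadata) (2026-05-24T01:07:05Z)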
Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs [63.82840470917859]
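This paper shows that the decoding mechanism of dLLMs can serve as a powerful tool for model attribution. It proposes a new information extraction scheme, the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behavior.
Paper reference translation (metadata) (2025-10-02T06:25:10Z)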
Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking [17.095655627061934]
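This paper proposes Decom-Renorm-Merge (DRM), a simple yet effective method that uses singular value decomposition to decompose and coordinate weight matrices into an aligned joint space. Experiments show that DRM outperforms state-of-the-art merging techniques in both full fine-tuning and low-rank adaptation settings.
Paper reference translation (metadata) (2025-05-29T05:37:53Z)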
DA-Flow: Dual Attention Normalizing Flow for Skeleton-based Video Anomaly Detection [52.74152717667157]
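This paper proposes a lightweight module called the Dual Attention Module (DAM). It uses a frame attention mechanism to identify the most significant frames and a skeleton attention mechanism to capture broader relationships across fixed partitions with minimal parameters and FLOPs.
Paper reference translation (metadata) (2024-06-05T06:18:03Z)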
SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
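Vision Transformers have achieved impressive performance in many computational tasks. The paper shows that the dense connections can be replaced with a sparse block-diagonal structure that supports larger expansion ratios. It also proposes a lightweight, parameter-free channel covariance attention mechanism used as a parallel branch during training.
Paper reference translation (metadata) (2023-12-01T08:22:34Z)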
Dense Unsupervised Learning for Video Segmentation [49.46930315961636]
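This paper proposes a new approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, the formulation allows dense feature representations to be learned directly in a fully convolutional regime. The method exceeds the segmentation accuracy of previous work despite using significantly less training data and compute.
Paper reference translation (metadata) (2021-11-11T15:15:11Z)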
Cauchy-Schwarz Regularized Autoencoder [68.80569889599434]
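Variational autoencoders (VAEs) are a powerful and widely used class of generative models. The paper introduces a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs. The aim is to improve variational autoencoding models for density estimation, unsupervised clustering, semi-supervised learning, and face analysis.
Paper reference translation (metadata) (2021-01-06T17:36:26Z)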

The related papers list is generated automatically from the titles and abstracts of papers on this site.

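The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from its use (including all information and translations).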