FuguReport

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

Authors: Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Qingming Huang
Affiliations: Institute of Computing Technology, CAS / University of the Chinese Academy of Sciences / Beijing Academy of Artificial Intelligence / Institute of Information Engineering, CAS
Categories: Method / Model Collaboration / Enhanced cooperation for video understanding, Application / Video Analysis / Egocentric action mistake detection, Evaluation / Performance Trade-offs / Speed and accuracy balance assessment
License: CC BY 4.0

Abstract Overview

This paper studies egocentric mistake detection in instructional videos, where the goal is to decide whether a target action is performed incorrectly. The proposed Understanding-Enhanced Model Collaboration Method (UE-MCM) uses two complementary branches: a small branch for efficient workflow-level reasoning from both the full coarse video and the fine action segment, and a large branch for fine-grained action-level judgment from the target segment. The small branch is built with a DCR-enhanced CLIP4CLIP encoder, while the large branch uses Qwen3-VL Embedding features, and their predictions are combined through an adaptive collaboration gate. To address the rarity of mistake samples, the training objective combines reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment for long-tailed optimization.
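
The summary does not spell out how the adaptive collaboration gate is computed. As a minimal sketch only, assuming the gate is a sigmoid over a linear function of the two branch probabilities, the fusion step could look like the following. The function name `gated_fusion`, the gate inputs, and the parameters `gate_weights` and `gate_bias` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(p_small, p_large, gate_weights, gate_bias):
    """Blend two branch mistake probabilities with an input-dependent gate.

    Assumption (not from the paper): the gate is sigmoid(w . [p_small, p_large] + b).
    Returns the fused probability and the gate value.
    """
    z = gate_weights[0] * p_small + gate_weights[1] * p_large + gate_bias
    gate = sigmoid(z)
    # gate close to 1 trusts the small (workflow-level) branch,
    # gate close to 0 trusts the large (action-level) branch.
    fused = gate * p_small + (1.0 - gate) * p_large
    return fused, gate


# With zero weights and bias the gate is 0.5, so the fusion is a plain average.
fused, gate = gated_fusion(0.2, 0.8, (0.0, 0.0), 0.0)
print(round(fused, 3), round(gate, 3))
```

Because the fused value is a convex combination, it always lies between the two branch probabilities. In a trained system the gate parameters would be learned jointly with the branches rather than set by hand.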

Novelty

The main novelty is the explicit collaboration between two models with different roles: one reasons about whether an action is appropriate within the broader workflow, while the other judges whether the action execution itself is wrong. The method also combines this branch-specialized design with adaptive prediction fusion and a multi-objective long-tail training strategy tailored to rare and ambiguous egocentric mistakes.

Results

On the reported test set, the method achieves an F-score of 0.60 using only RGB input. In the paper's comparison table, this exceeds the listed TimeSformer baselines (at most 0.40), the 2024 top solution (0.51), and the 2025 top solution (0.57). The per-class breakdown also shows higher recall on correct actions than the 2025 top solution (0.72 vs. 0.60) and far higher recall on mistakes than the 2024 top solution (0.62 vs. 0.09).
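
The exact definition of the reported F-score is not given in this summary. For orientation only, the sketch below shows how per-class recall and an F1 score for the mistake class are conventionally computed from binary predictions. The helper `mistake_metrics` and the toy labels are hypothetical and do not reproduce any number from the paper.

```python
def mistake_metrics(y_true, y_pred):
    """Per-class recall and mistake-class F1 (label 1 = mistake, 0 = correct)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    mistake_recall = tp / (tp + fn) if (tp + fn) else 0.0
    correct_recall = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + mistake_recall
    f1 = 2.0 * precision * mistake_recall / denom if denom else 0.0
    return {
        "mistake_recall": mistake_recall,
        "correct_recall": correct_recall,
        "mistake_f1": f1,
    }


# Toy long-tailed example: 2 mistakes among 8 actions (made-up labels).
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]
print(mistake_metrics(y_true, y_pred))
```

With so few positive samples, a model can score high recall on correct actions while missing most mistakes, which is why the breakdown above reports the two recalls separately.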

Key Points

  1. UE-MCM separates workflow-level inconsistency reasoning and action-level execution reasoning into a small branch and a large branch, then fuses them with an adaptive collaboration gate.
  2. The small branch jointly encodes the full coarse video and the fine action segment using a DCR-enhanced CLIP4CLIP encoder, while the large branch uses frozen Qwen3-VL Embedding features from the fine segment.
  3. The training setup targets class imbalance by combining reweighted cross-entropy, AUC-oriented loss, and label-aware adjustment, and the final system reports a 0.60 F-score on the test set with RGB only.
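
The three loss components named in point 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' formulation: "label-aware adjustment" is interpreted here as a prior-based logit adjustment, the AUC-oriented term is a pairwise squared-hinge surrogate, and all function names and the hyperparameters `pos_weight`, `tau`, `lam`, and `margin` are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def adjust_logit(logit, pos_prior, tau=1.0):
    """Assumed label-aware adjustment: shift the logit by the log prior odds.

    For a rare positive class the shift is negative during training, so the
    model must produce a larger raw score to classify a mistake correctly.
    """
    return logit + tau * math.log(pos_prior / (1.0 - pos_prior))


def weighted_bce(logits, labels, pos_weight):
    """Binary cross-entropy with the positive (mistake) class up-weighted."""
    total = 0.0
    for z, y in zip(logits, labels):
        if y == 1:
            total += -pos_weight * log_sigmoid(z)
        else:
            total += -log_sigmoid(-z)
    return total / len(logits)


def pairwise_auc_loss(logits, labels, margin=1.0):
    """Squared-hinge AUC surrogate: each positive should outscore each negative."""
    pos = [z for z, y in zip(logits, labels) if y == 1]
    neg = [z for z, y in zip(logits, labels) if y == 0]
    if not pos or not neg:
        return 0.0
    total = sum(max(0.0, margin - (p - n)) ** 2 for p in pos for n in neg)
    return total / (len(pos) * len(neg))


def long_tail_loss(logits, labels, pos_prior, pos_weight, lam=1.0, tau=1.0):
    """Reweighted cross-entropy on adjusted logits plus the AUC surrogate."""
    adjusted = [adjust_logit(z, pos_prior, tau) for z in logits]
    return weighted_bce(adjusted, labels, pos_weight) + lam * pairwise_auc_loss(logits, labels)


# Toy batch with one mistake among four actions (made-up values).
print(long_tail_loss([1.5, -0.5, -1.0, 0.2], [1, 0, 0, 0], pos_prior=0.25, pos_weight=3.0))
```

The ranking term is applied to the raw logits because a constant shift does not change pairwise score differences. How the paper actually weights and schedules the three terms is not stated in this summary.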

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.