Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection
Abstract Overview
This paper studies egocentric mistake detection in instructional videos, where the goal is to decide whether a target action is performed incorrectly. The proposed Understanding-Enhanced Model Collaboration Method (UE-MCM) uses two complementary branches: a small branch for efficient workflow-level reasoning from both the full coarse video and the fine action segment, and a large branch for fine-grained action-level judgment from the target segment. The small branch is built with a DCR-enhanced CLIP4CLIP encoder, while the large branch uses Qwen3-VL Embedding features, and their predictions are combined through an adaptive collaboration gate. To address the rarity of mistake samples, the training objective combines reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment for long-tailed optimization.
Novelty
The main novelty is the explicit collaboration between two models with different roles: one reasons about whether an action is appropriate within the broader workflow, while the other judges whether the action execution itself is wrong. The method also combines this branch-specialized design with adaptive prediction fusion and a multi-objective long-tail training strategy tailored to rare and ambiguous egocentric mistakes.
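The summary does not give the exact form of the adaptive collaboration gate, so the following is only a minimal sketch of one plausible reading: a per-sample sigmoid gate that blends the two branch probabilities. The function name `fuse_predictions`, the linear gate over the two probabilities, and the default `gate_weights` and `gate_bias` values are all assumptions made for illustration, not the authors' design.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_predictions(p_small, p_large, gate_weights=(1.0, -1.0), gate_bias=0.0):
    """Blend two mistake probabilities with a per-sample gate.

    p_small: mistake probability from the workflow-level (small) branch.
    p_large: mistake probability from the action-level (large) branch.
    gate_weights, gate_bias: hypothetical gate parameters; in a real
    system these would be learned, and the gate would likely read
    richer features than the two probabilities.
    """
    w_small, w_large = gate_weights
    # The gate value lies in (0, 1) and depends on the sample itself.
    gate = sigmoid(w_small * p_small + w_large * p_large + gate_bias)
    # Convex combination: the fused score stays between the two inputs.
    fused = gate * p_small + (1.0 - gate) * p_large
    return fused, gate


if __name__ == "__main__":
    fused, gate = fuse_predictions(0.2, 0.8)
    print(f"gate={gate:.3f} fused={fused:.3f}")
```

Because the fused score is a convex combination, it can never fall outside the range spanned by the two branch predictions; the gate only decides which branch to trust more for a given sample.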
Results
On the reported test set, the method achieves an F-score of 0.60 using only RGB input. In the reported table, this exceeds the listed TimeSformer baselines (at most 0.40), the 2024 top solution (0.51), and the 2025 top solution (0.57). The per-class breakdown shows higher recall on correct actions than the 2025 top solution (0.72 vs. 0.60) and far higher recall on mistakes than the 2024 top solution (0.62 vs. 0.09).
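For readers unfamiliar with the per-class figures quoted above, the sketch below shows how recall on each class and an F-score for the mistake class can be computed from binary predictions. The summary does not state which F-score variant the benchmark uses, so the choice of F1 on the mistake class is an assumption, and the labels in the usage example are toy values, not data from the paper.

```python
def mistake_metrics(y_true, y_pred):
    """Per-class recall plus precision and F1 for the mistake class.

    Labels: 1 = mistake, 0 = correct. Using F1 on the mistake class
    is an assumption; the benchmark's exact definition may differ.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    mistake_recall = tp / (tp + fn) if tp + fn else 0.0
    correct_recall = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + mistake_recall
    f1 = 2 * precision * mistake_recall / denom if denom else 0.0
    return {
        "mistake_recall": mistake_recall,
        "correct_recall": correct_recall,
        "mistake_precision": precision,
        "mistake_f1": f1,
    }


if __name__ == "__main__":
    # Toy labels for illustration only.
    print(mistake_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

Reporting both recalls matters under class imbalance: a model that predicts "correct" for nearly every sample can score high correct recall while missing almost all mistakes.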
Key Points
- UE-MCM separates workflow-level inconsistency reasoning and action-level execution reasoning into a small branch and a large branch, then fuses them with an adaptive collaboration gate.
- The small branch jointly encodes the full coarse video and the fine action segment using a DCR-enhanced CLIP4CLIP encoder, while the large branch uses frozen Qwen3-VL Embedding features from the fine segment.
- The training setup targets class imbalance by combining reweighted cross-entropy, AUC-oriented loss, and label-aware adjustment, and the final system reports a 0.60 F-score on the test set with RGB only.
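The three loss components named in the key points can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the reweighting is taken to be inverse-prior weighting of the mistake class, the AUC-oriented term is a pairwise squared-hinge surrogate, and the label-aware adjustment is a logit shift by the class-prior log-odds. The names `long_tail_loss`, `pos_prior`, `lambda_auc`, `tau`, and `margin` are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce(logits, labels, pos_weight):
    """Binary cross-entropy with extra weight on the mistake class (label 1)."""
    total = 0.0
    for z, y in zip(logits, labels):
        log_p = -softplus(-z)    # log sigmoid(z)
        log_not_p = -softplus(z)  # log (1 - sigmoid(z))
        total -= pos_weight * y * log_p + (1 - y) * log_not_p
    return total / len(logits)


def pairwise_auc_loss(logits, labels, margin=1.0):
    """Squared-hinge AUC surrogate: each mistake should outscore each
    correct sample by at least `margin`."""
    pos = [z for z, y in zip(logits, labels) if y == 1]
    neg = [z for z, y in zip(logits, labels) if y == 0]
    if not pos or not neg:
        return 0.0
    total = sum(max(0.0, margin - (p - n)) ** 2 for p in pos for n in neg)
    return total / (len(pos) * len(neg))


def long_tail_loss(logits, labels, pos_prior, lambda_auc=1.0, tau=1.0):
    """Sum of the three terms; weights and forms are assumptions."""
    # Label-aware adjustment (assumed): shift logits by the prior log-odds
    # during training so the rare class must earn a larger raw margin.
    shift = tau * math.log(pos_prior / (1.0 - pos_prior))
    adjusted = [z + shift for z in logits]
    # Reweighting (assumed): inverse-prior weight on the mistake class.
    pos_weight = (1.0 - pos_prior) / pos_prior
    ce = weighted_bce(adjusted, labels, pos_weight)
    auc = pairwise_auc_loss(logits, labels)
    return ce + lambda_auc * auc


if __name__ == "__main__":
    labels = [1, 0, 0, 0]
    print(long_tail_loss([3.0, -3.0, -3.0, -3.0], labels, pos_prior=0.25))
    print(long_tail_loss([-3.0, 3.0, 3.0, 3.0], labels, pos_prior=0.25))
```

The cross-entropy term drives per-sample calibration, while the pairwise term directly rewards ranking mistakes above correct actions, which is insensitive to the class ratio; that complementarity is one common reason to combine the two under heavy imbalance.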