Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
- URL: http://arxiv.org/abs/2511.20525v1
- Date: Tue, 25 Nov 2025 17:29:12 GMT
- Title: Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
- Authors: Yayuan Li, Aadit Jain, Filippos Bellos, Jason J. Corso
- Abstract summary: We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. MATT attributes mistakes to the input instruction text or the attempt video. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets.
- Score: 11.138754178370514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.
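The abstract describes three attribution dimensions: what part of the instruction is violated (a semantic role), when the deviation becomes irreversible (the PNR timestamp), and where the mistake appears in the PNR frame. A minimal sketch of such an output schema is below; the class and field names are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MistakeAttribution:
    """Hypothetical MATT-style prediction covering what/when/where."""
    violated_role: str                   # "what": instruction semantic role, e.g. "object"
    pnr_time_s: float                    # "when": Point-of-No-Return timestamp, in seconds
    pnr_bbox: Tuple[int, int, int, int]  # "where": region in the PNR frame as (x, y, w, h)

    def within_clip(self, clip_len_s: float) -> bool:
        """Sanity check: the PNR must fall inside the attempt video."""
        return 0.0 <= self.pnr_time_s <= clip_len_s

attr = MistakeAttribution(violated_role="object", pnr_time_s=4.2,
                          pnr_bbox=(120, 80, 64, 64))
print(attr.within_clip(10.0))  # → True
```

A unified model like MisFormer would emit one such record per detected mistake; a real implementation would likely predict distributions over roles, timestamps, and boxes rather than point estimates.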
Related papers
- Correcting and Quantifying Systematic Errors in 3D Box Annotations for Autonomous Driving [34.44189129139084]
3D box annotation based on data from active sensors is challenging in dynamic scenarios. Our work is the first to discover such annotation errors in widely used, publicly available datasets. Our approach increases the quality of box annotations by more than 17% in these datasets.
arXiv Detail & Related papers (2026-01-20T14:57:48Z) - Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models [25.564425023762045]
Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty.
arXiv Detail & Related papers (2025-10-02T18:29:07Z) - Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection [85.0189917888094]
We propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework to handle the challenges posed by subtle and infrequent mistakes. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances.
arXiv Detail & Related papers (2025-09-16T12:00:42Z) - Is this chart lying to me? Automating the detection of misleading visualizations [74.26574031329689]
Misleading visualizations are a potent driver of misinformation on social media and the web. We introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. We also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated using Matplotlib and based on real-world data tables.
arXiv Detail & Related papers (2025-08-29T14:36:45Z) - NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text, and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - Parameter-tuning-free data entry error unlearning with adaptive selective synaptic dampening [51.34904967046097]
We introduce an extension to the selective synaptic dampening unlearning method that removes the need for parameter tuning.
We demonstrate the performance of this extension, adaptive selective synaptic dampening (ASSD) on various ResNet18 and Vision Transformer unlearning tasks.
The application of this approach is particularly compelling in industrial settings, such as supply chain management.
arXiv Detail & Related papers (2024-02-06T14:04:31Z) - The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - Annotating and Detecting Fine-grained Factual Errors for Dialogue Summarization [34.85353544844499]
We present the first dataset with fine-grained factual error annotations named DIASUMFACT.
We define fine-grained factual error detection as a sentence-level multi-label classification problem.
We propose an unsupervised model ENDERANKER via candidate ranking using pretrained encoder-decoder models.
arXiv Detail & Related papers (2023-05-26T00:18:33Z) - Multi-level Memory-augmented Appearance-Motion Correspondence Framework for Video Anomaly Detection [1.9511777443446219]
We propose a multi-level memory-augmented appearance-motion correspondence framework.
The latent correspondence between appearance and motion is explored via appearance-motion semantics alignment and semantics replacement training.
Our framework outperforms the state-of-the-art methods, achieving AUCs of 99.6%, 93.8%, and 76.3% on UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets.
arXiv Detail & Related papers (2023-03-09T08:43:06Z) - Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors [80.22825549235556]
Existing approaches cannot consider error position and type simultaneously.
We build an FG-TED model to predict addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z) - Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.