Video Action Differencing
- URL: http://arxiv.org/abs/2503.07860v1
- Date: Mon, 10 Mar 2025 21:18:32 GMT
- Title: Video Action Differencing
- Authors: James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
- Abstract summary: We introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action. We first create VidDiffBench, a benchmark dataset containing 549 video pairs. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models.
- Score: 92.3218782696305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at https://huggingface.co/datasets/jmhb/VidDiffBench and code at http://jmhb0.github.io/viddiff.
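The abstract describes VidDiff as a three-stage agentic workflow (action difference proposal, keyframe localization, frame differencing), with each stage backed by a specialized foundation model. The Python sketch below is a minimal illustration of how such a pipeline could be wired together under that description; the class, function, and method names (`propose_differences`, `localize`, `compare_frames`) are hypothetical placeholders, not the authors' released code.

```python
# Hypothetical sketch of a three-stage VidDiff-style workflow.
# All components are injected as abstract interfaces; names are placeholders.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DifferenceResult:
    description: str         # proposed fine-grained difference, e.g. "elbow angle at release"
    span_a: Tuple[int, int]  # localized keyframe range in video A
    span_b: Tuple[int, int]  # localized keyframe range in video B
    verdict: str             # e.g. "A", "B", or "similar"


def viddiff_pipeline(video_a, video_b, action_name: str,
                     proposer, localizer, comparator) -> List[DifferenceResult]:
    """Proposal -> localization -> frame differencing, per the abstract's description."""
    # Stage 1: a language model proposes candidate differences for this action.
    candidate_differences = proposer.propose_differences(action_name)

    results = []
    for difference in candidate_differences:
        # Stage 2: a localization model finds the relevant sub-action in each video.
        span_a = localizer.localize(video_a, difference)
        span_b = localizer.localize(video_b, difference)

        # Stage 3: a vision-language model compares the localized frames.
        verdict = comparator.compare_frames(
            video_a[span_a[0]:span_a[1]],
            video_b[span_b[0]:span_b[1]],
            difference,
        )
        results.append(DifferenceResult(difference, span_a, span_b, verdict))
    return results
```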
Related papers
- DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning [2.4121373594852846]
Multi-Task Learning enables a visual task to acquire rich, shareable knowledge from other tasks during joint training.
A heterogeneous-data video multi-task prompt learning (VMTL) method is proposed to address this problem.
A Double-Layer Mapper (DLM) is proposed to extract the shareable knowledge into visual prompts and align it with the representation of the primary task.
arXiv Detail & Related papers (2024-08-29T01:25:36Z)
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- TarViS: A Unified Approach for Target-based Video Segmentation [115.5770357189209]
TarViS is a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video.
Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks.
To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS), and Point Exemplar-guided Tracking (PET).
arXiv Detail & Related papers (2023-01-06T18:59:52Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)