ViDiC: Video Difference Captioning
- URL: http://arxiv.org/abs/2512.03405v2
- Date: Thu, 04 Dec 2025 06:21:11 GMT
- Title: ViDiC: Video Difference Captioning
- Authors: Jiangtao Wu, Shihao Li, Zhaozhou Bian, Jialu Chen, Runzhe Wen, An Ping, Yiwen He, Jiakai Wang, Yuanxing Zhang, Jiaheng Liu,
- Abstract summary: We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities.
- Score: 33.77620135109391
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes--a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can be a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.
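To make the dual-checklist protocol concrete, below is a minimal sketch of how similarity and difference accuracy could be scored separately with an LLM judge. The checklist schema, the judge() interface, and the prompt wording are illustrative assumptions, not the paper's released evaluation code.

```python
# Minimal sketch of a dual-checklist LLM-as-a-Judge evaluation.
# The checklist format and judge() interface are assumptions for
# illustration; they are not the authors' released implementation.
from typing import Callable

def score_caption(caption: str,
                  checklist: list[dict],
                  judge: Callable[[str], str]) -> dict:
    """Score one generated caption against annotated checklist items.

    Each item is assumed to look like:
      {"kind": "similarity" | "difference",
       "statement": "Both videos show a red car."}
    judge() is any LLM call that answers "yes" or "no".
    """
    hits = {"similarity": [], "difference": []}
    for item in checklist:
        prompt = (
            f"Caption:\n{caption}\n\n"
            f"Does the caption correctly express the following "
            f"{item['kind']} between the two videos? Answer yes or no.\n"
            f"{item['statement']}"
        )
        hits[item["kind"]].append(judge(prompt).strip().lower().startswith("yes"))
    # Similarity accuracy and difference accuracy are reported separately.
    return {
        kind: sum(flags) / len(flags) if flags else 0.0
        for kind, flags in hits.items()
    }
```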
Related papers
- ConViS-Bench: Estimating Video Similarity Through Semantic Concepts [57.40476559895395]
We introduce Concept-based Video Similarity estimation (ConViS). ConViS compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. We also introduce ConViS-Bench, a new benchmark comprising carefully annotated video pairs spanning multiple domains.
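An interpretable per-concept similarity profile of this kind could be aggregated as follows; the concept list and the per-concept embeddings are placeholders, not ConViS's actual implementation.

```python
# Sketch of concept-wise video similarity, assuming some model yields
# one embedding per video per concept. All names are illustrative.
import numpy as np

CONCEPTS = ["subject", "action", "background", "camera"]  # placeholder set

def concept_similarities(emb_a: dict[str, np.ndarray],
                         emb_b: dict[str, np.ndarray]) -> dict[str, float]:
    """Cosine similarity between two videos, one score per concept."""
    scores = {}
    for c in CONCEPTS:
        a, b = emb_a[c], emb_b[c]
        scores[c] = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return scores  # interpretable profile, e.g. {"subject": 0.91, ...}
```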
arXiv Detail & Related papers (2025-09-23T17:06:11Z)
- Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening [54.66784646111214]
We introduce a new task: chiral action recognition, where one needs to distinguish between a pair of temporally opposite actions. Our goal is to build time-aware video representations which offer linear separability between these chiral pairs. We show that this results in a compact but time-sensitive video representation for the proposed task across three datasets.
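Linear separability can be checked with a simple probe on frozen clip embeddings. The probe below is a generic logistic-regression sketch, not the paper's latent-straightening method.

```python
# Generic linear probe: can a hyperplane separate embeddings of an
# action from its temporal reverse? Purely illustrative; this is not
# the paper's latent-straightening objective.
import numpy as np

def train_linear_probe(x: np.ndarray, y: np.ndarray,
                       lr: float = 0.1, steps: int = 500) -> np.ndarray:
    """Logistic regression on clip embeddings x (N, D), labels y in {0, 1}
    (0 = forward action, 1 = its chiral opposite)."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))   # predicted probabilities
        w -= lr * x.T @ (p - y) / len(y)     # gradient step
    return w

def probe_accuracy(w: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    return float((((x @ w) > 0).astype(int) == y).mean())
```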
arXiv Detail & Related papers (2025-09-10T11:23:10Z)
- DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception [0.846600473226587]
We introduce remote sensing image change analysis (RSICA) as a new paradigm that combines the strengths of change detection and visual question answering. We propose DeltaVLM, an end-to-end architecture tailored for interactive RSICA. DeltaVLM features three innovations: (1) a fine-tuned bi-temporal vision encoder to capture temporal differences; (2) a visual difference perception module with a cross-semantic relation measuring mechanism to interpret changes; and (3) an instruction-guided Q-former to effectively extract query-relevant difference information.
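The three components could be wired together roughly as below; module internals and dimensions are guesses for illustration, not DeltaVLM's released architecture.

```python
# Rough skeleton of a bi-temporal change-analysis pipeline in the
# spirit of the three components listed above. All module internals
# are placeholders, not DeltaVLM's actual architecture.
import torch
import torch.nn as nn

class BiTemporalPipeline(nn.Module):
    def __init__(self, dim: int = 768, n_query: int = 32):
        super().__init__()
        self.encoder = nn.Linear(3 * 224 * 224, dim)        # stand-in vision encoder
        self.diff_perception = nn.Sequential(                # stand-in difference module
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.queries = nn.Parameter(torch.randn(n_query, dim))
        self.qformer = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, img_t1: torch.Tensor, img_t2: torch.Tensor) -> torch.Tensor:
        f1 = self.encoder(img_t1.flatten(1))                 # (B, dim) per timestamp
        f2 = self.encoder(img_t2.flatten(1))
        diff = self.diff_perception(torch.cat([f1, f2], -1)) # fused change features
        q = self.queries.expand(diff.size(0), -1, -1)        # instruction queries
        out, _ = self.qformer(q, diff.unsqueeze(1), diff.unsqueeze(1))
        return out                                           # (B, n_query, dim) for an LLM
```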
arXiv Detail & Related papers (2025-07-30T03:14:27Z)
- Towards Understanding Camera Motions in Any Video [89.97247162415158]
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of 3,000 diverse internet videos annotated by experts through a rigorous quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers.
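A taxonomy of this kind might be encoded as a simple label schema for annotation tooling. The primitives below are standard cinematography terms used only for illustration; they are not necessarily CameraBench's actual label set.

```python
# Illustrative label schema for camera-motion annotation. The
# primitive names are common cinematography terms, not necessarily
# the taxonomy defined in CameraBench.
from dataclasses import dataclass
from enum import Enum

class MotionPrimitive(Enum):
    STATIC = "static"
    PAN = "pan"            # rotate left/right
    TILT = "tilt"          # rotate up/down
    ROLL = "roll"          # rotate around the optical axis
    DOLLY = "dolly"        # translate forward/backward
    TRUCK = "truck"        # translate left/right
    PEDESTAL = "pedestal"  # translate up/down
    ZOOM = "zoom"          # change focal length

@dataclass
class MotionAnnotation:
    video_id: str
    primitives: list[MotionPrimitive]  # a shot may combine several
```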
arXiv Detail & Related papers (2025-04-21T18:34:57Z)
- Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval [26.40393400497247]
Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR). We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts.
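Latent-space alignment of this sort is commonly trained with a symmetric contrastive loss. The sketch below is a generic InfoNCE over paired video and text embeddings, with MAC-VR's auxiliary-concept terms omitted; it is not the paper's exact objective.

```python
# Generic symmetric InfoNCE alignment between video and text
# embeddings. MAC-VR adds auxiliary latent concepts on top of such
# an alignment; those terms are omitted here for brevity.
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (B, D), row i of each is a matched pair."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(len(v), device=v.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```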
arXiv Detail & Related papers (2025-04-02T10:56:01Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
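A progressive multi-objective recipe amounts to scheduling and weighting several losses. The sketch below shows only the weighted combination; the stage weights are assumptions and the individual losses are stubs, not InternVideo2's training code.

```python
# Schematic combination of the three training objectives named above.
# The stage weights are assumptions; the per-objective losses are
# stubs standing in for full implementations.
import torch

def combined_loss(masked_modeling_loss: torch.Tensor,
                  contrastive_loss: torch.Tensor,
                  next_token_loss: torch.Tensor,
                  stage: int) -> torch.Tensor:
    """Progressive training: later stages shift weight between objectives."""
    stage_weights = {
        1: (1.0, 0.0, 0.0),  # stage 1: masked video modeling only
        2: (0.5, 1.0, 0.0),  # stage 2: add cross-modal contrastive learning
        3: (0.0, 0.5, 1.0),  # stage 3: emphasize next-token prediction
    }
    w_mask, w_con, w_tok = stage_weights[stage]
    return (w_mask * masked_modeling_loss +
            w_con * contrastive_loss +
            w_tok * next_token_loss)
```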
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
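One straightforward way to fuse the three streams is cross-attention from the visual features onto the linguistic features. The block below is a generic fusion sketch, not the paper's specific architecture.

```python
# Generic tri-modal fusion: appearance and motion features attend to
# linguistic features before being merged. Layer shapes and the
# fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.app_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.mot_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor,
                language: torch.Tensor) -> torch.Tensor:
        """appearance, motion: (B, N, D) visual tokens; language: (B, L, D)."""
        a, _ = self.app_attn(appearance, language, language)  # text-aligned appearance
        m, _ = self.mot_attn(motion, language, language)      # text-aligned motion
        return self.merge(torch.cat([a, m], dim=-1))          # fused (B, N, D)
```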
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
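Consistency between positive samples can be encouraged with a simple cosine objective between two views of the same clip; the sketch below pairs a normal-speed clip with a speed-altered one, as an illustration rather than ASCNet's exact losses.

```python
# Illustrative positive-pair consistency loss: embeddings of the same
# clip, sampled at different playback speeds, are pulled together.
# This is a generic stand-in, not ASCNet's appearance/speed losses.
import torch
import torch.nn.functional as F

def consistency_loss(emb_normal: torch.Tensor,
                     emb_speeded: torch.Tensor) -> torch.Tensor:
    """emb_*: (B, D) embeddings of the same clips at 1x and, say, 2x speed."""
    z1 = F.normalize(emb_normal, dim=-1)
    z2 = F.normalize(emb_speeded, dim=-1)
    return (1 - (z1 * z2).sum(dim=-1)).mean()  # 1 - cosine similarity
```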
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
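The intra-/inter-video combination can be read as two contrastive terms sharing one embedding space. The toy loss below makes that split explicit; it is a schematic reading of the idea, not the paper's training objective.

```python
# Toy decomposition into intra-video (frames of the same video stay
# close) and inter-video (different videos repel) contrastive terms.
# A schematic reading of the idea, not the paper's objective.
import torch
import torch.nn.functional as F

def intra_inter_loss(frames_a: torch.Tensor, frames_b: torch.Tensor,
                     margin: float = 0.5) -> torch.Tensor:
    """frames_a, frames_b: (N, D) frame embeddings from two different videos."""
    za, zb = F.normalize(frames_a, dim=-1), F.normalize(frames_b, dim=-1)
    intra = (1 - za @ za.T).mean() + (1 - zb @ zb.T).mean()  # pull within videos
    inter = F.relu((za @ zb.T) - margin).mean()              # push across videos
    return intra + inter
```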
arXiv Detail & Related papers (2020-12-09T14:05:06Z)