ConViS-Bench: Estimating Video Similarity Through Semantic Concepts
- URL: http://arxiv.org/abs/2509.19245v1
- Date: Tue, 23 Sep 2025 17:06:11 GMT
- Authors: Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, Yiming Wang, Elisa Ricci, Paolo Rota
- Abstract summary: We introduce Concept-based Video Similarity estimation (ConViS). ConViS compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. We also introduce ConViS-Bench, a new benchmark comprising carefully annotated video pairs spanning multiple domains.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: What does it mean for two videos to be similar? Videos may appear similar when judged by the actions they depict, yet entirely different if evaluated based on the locations where they were filmed. While humans naturally compare videos by taking different aspects into account, this ability has not been thoroughly studied and presents a challenge for models that often depend on broad global similarity scores. Large Multimodal Models (LMMs) with video understanding capabilities open new opportunities for leveraging natural language in comparative video tasks. We introduce Concept-based Video Similarity estimation (ConViS), a novel task that compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. ConViS allows for human-like reasoning about video similarity and enables new applications such as concept-conditioned video retrieval. To support this task, we also introduce ConViS-Bench, a new benchmark comprising carefully annotated video pairs spanning multiple domains. Each pair comes with concept-level similarity scores and textual descriptions of both differences and similarities. Additionally, we benchmark several state-of-the-art models on ConViS, providing insights into their alignment with human judgments. Our results reveal significant performance differences on ConViS, indicating that some concepts present greater challenges for estimating video similarity. We believe that ConViS-Bench will serve as a valuable resource for advancing research in language-driven video understanding.
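To make the task concrete, here is a minimal sketch of concept-level similarity scoring. It is an illustration under stated assumptions, not the authors' implementation: the concept list is invented for the example, and `embed` is a placeholder standing in for any real concept-conditioned video encoder (e.g., a video-language model prompted with the concept).

```python
# Minimal sketch of ConViS-style scoring: one interpretable similarity
# score per semantic concept, rather than a single global score.
import numpy as np

# Hypothetical concept set; the actual ConViS concepts are defined
# by the benchmark, not here.
CONCEPTS = ["action", "location", "objects", "actors"]

def embed(video_path: str, concept: str) -> np.ndarray:
    """Placeholder for a concept-conditioned video encoder.

    A real system would pool features from a video-language model
    prompted with `concept`; here we return a deterministic random
    vector so the sketch runs end to end.
    """
    rng = np.random.default_rng(abs(hash((video_path, concept))) % 2**32)
    return rng.standard_normal(512)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def convis_scores(video_a: str, video_b: str) -> dict[str, float]:
    """Compare two videos concept by concept."""
    return {c: cosine(embed(video_a, c), embed(video_b, c)) for c in CONCEPTS}

scores = convis_scores("clip_a.mp4", "clip_b.mp4")
print(scores)  # e.g. {'action': 0.03, 'location': -0.01, ...}
```

Concept-conditioned retrieval, one of the applications the paper mentions, then reduces to ranking a gallery of videos by a single concept's score against the query video.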
Related papers
- ViDiC: Video Difference Captioning [33.77620135109391]
We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities.
arXiv Detail & Related papers (2025-12-03T03:23:24Z) - VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations [65.0648741395158]
VADB is the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions. VADB-Net is a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks.
arXiv Detail & Related papers (2025-10-29T07:37:08Z) - Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval [26.40393400497247]
Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR). We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts.
arXiv Detail & Related papers (2025-04-02T10:56:01Z) - Can Text-to-Video Generation help Video-Language Alignment? [53.0276936367765]
Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others.
arXiv Detail & Related papers (2025-03-24T10:02:22Z) - Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs). We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation. We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, crossmodal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of video keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z) - Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency [60.756222188023635]
We propose to learn representations by leveraging more abundant information in untrimmed videos.
HiCo (Hierarchical Consistency) generates stronger representations on untrimmed videos, and it also improves representation quality when applied to trimmed videos.
arXiv Detail & Related papers (2022-04-06T18:04:54Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective [13.90183404059193]
We propose to learn correspondence using Video Frame-level Similarity (VFS) learning.
Our work is inspired by the recent success in image-level contrastive learning and similarity learning for visual recognition.
Our experiments show surprising results that VFS surpasses state-of-the-art self-supervised approaches for both OTB visual object tracking and DAVIS video object segmentation.
arXiv Detail & Related papers (2021-03-31T17:56:35Z) - On Semantic Similarity in Video Retrieval [31.61611168620582]
We propose a move to semantic similarity video retrieval, where multiple videos/captions can be deemed equally relevant.
Our analysis is performed on three commonly used video retrieval datasets (MSR-VTT, YouCook2, and EPIC-KITCHENS).
arXiv Detail & Related papers (2021-03-18T09:12:40Z)