GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
- URL: http://arxiv.org/abs/2204.00486v5
- Date: Sat, 01 Feb 2025 16:16:51 GMT
- Title: GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
- Authors: Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou
- Abstract summary: We introduce a new dataset called Kinetic-GEB+. The dataset consists of over 170k boundaries associated with captions describing status changes in 12K videos. We propose three tasks supporting the development of a more fine-grained, robust, and human-like understanding of videos through status changes.
- Score: 40.399017565653196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cognitive science has shown that humans perceive videos in terms of events separated by the state changes of dominant subjects. State changes trigger new events and are among the most useful signals within the large amount of redundant information perceived. However, previous research focuses on the overall understanding of segments without evaluating the fine-grained status changes inside. In this paper, we introduce a new dataset called Kinetic-GEB+. The dataset consists of over 170k boundaries associated with captions describing status changes in the generic events of 12K videos. Building on this new dataset, we propose three tasks supporting the development of a more fine-grained, robust, and human-like understanding of videos through status changes. We evaluate many representative baselines on our dataset, where we also design a new TPD (Temporal-based Pairwise Difference) Modeling method for visual difference and achieve significant performance improvements. Moreover, the results show there are still formidable challenges for current methods in the utilization of different granularities, the representation of visual difference, and the accurate localization of status changes. Further analysis shows that our dataset can drive the development of more powerful methods for understanding status changes and thus improve video-level comprehension. The dataset including both videos and boundaries is available at https://yuxuan-w.github.io/GEB-plus/
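The abstract only names the TPD (Temporal-based Pairwise Difference) module without detailing it. The snippet below is a minimal sketch of the general pairwise-difference idea, assuming per-frame visual features and a candidate boundary index; the function and parameter names (tpd_features, frame_feats, boundary_idx, window) are hypothetical, and the paper's actual TPD design may differ.

```python
import numpy as np

def tpd_features(frame_feats: np.ndarray, boundary_idx: int, window: int = 4) -> np.ndarray:
    """Illustrative pairwise temporal differences around a candidate boundary.

    frame_feats: (T, D) array of per-frame visual features.
    boundary_idx: frame index of the candidate event boundary.
    window: number of frames sampled on each side of the boundary.

    Returns an (n_before * n_after, D) array holding the difference between every
    (before, after) frame pair -- a rough stand-in for how a pairwise-difference
    module could represent the visual change across the boundary.
    """
    T, _ = frame_feats.shape
    before = frame_feats[max(0, boundary_idx - window):boundary_idx]  # frames preceding the boundary
    after = frame_feats[boundary_idx:min(T, boundary_idx + window)]   # frames following the boundary
    # All (before, after) pairwise differences via broadcasting: (n_before, n_after, D).
    diffs = after[None, :, :] - before[:, None, :]
    return diffs.reshape(-1, diffs.shape[-1])

# Toy usage: 32 frames of 512-d features, boundary at frame 16.
feats = np.random.randn(32, 512).astype(np.float32)
print(tpd_features(feats, boundary_idx=16).shape)  # (16, 512) with window=4
```

In this toy setup, every (before, after) frame pair contributes one difference vector, which is the kind of explicit change representation a boundary captioning, grounding, or retrieval head could consume.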
Related papers
- SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning [78.44705665291741]
We present a comprehensive evaluation of modern video self-supervised models.
We focus on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity.
Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions.
arXiv Detail & Related papers (2025-04-08T06:00:28Z) - SPOC: Spatially-Progressing Object State Change Segmentation in Video [52.65373395382122]
We introduce the spatially-progressing object state change segmentation task.
The goal is to segment at the pixel-level those regions of an object that are actionable and those that are transformed.
We demonstrate useful implications for tracking activity progress to benefit robotic agents.
arXiv Detail & Related papers (2025-03-15T01:48:54Z) - Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z) - Anticipating Object State Changes [0.8428703116072809]
The proposed framework predicts object state changes that will occur in the near future due to yet unseen human actions.
It integrates learned visual features that represent recent visual information with natural language processing (NLP) features that represent past object state changes and actions.
The proposed approach also underscores the potential of integrating video and linguistic cues to enhance the predictive performance of video understanding systems.
arXiv Detail & Related papers (2024-05-21T13:40:30Z) - OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z) - MS-Former: Memory-Supported Transformer for Weakly Supervised Change Detection with Patch-Level Annotations [50.79913333804232]
We propose a memory-supported transformer (MS-Former) for weakly supervised change detection.
MS-Former consists of a bi-directional attention block (BAB) and a patch-level supervision scheme (PSS).
Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method in the change detection task.
arXiv Detail & Related papers (2023-11-16T09:57:29Z) - Visual Reasoning: from State to Transformation [80.32402545546209]
Existing visual reasoning tasks ignore an important factor, i.e., transformation.
We propose a novel transformation-driven visual reasoning (TVR) task.
We show that state-of-the-art visual reasoning models perform well on Basic, but are far from human-level intelligence on Event, View, and TRANCO.
arXiv Detail & Related papers (2023-05-02T14:24:12Z) - Self-supervised learning of Split Invariant Equivariant representations [0.0]
We introduce 3DIEBench, consisting of renderings from 3D models over 55 classes and more than 2.5 million images, where we have full control over the transformations applied to the objects.
We introduce a predictor architecture based on hypernetworks to learn equivariant representations with no possible collapse to invariance.
We introduce SIE (Split Invariant-Equivariant) which combines the hypernetwork-based predictor with representations split in two parts, one invariant, the other equivariant, to learn richer representations.
arXiv Detail & Related papers (2023-02-14T07:53:18Z) - Video Event Extraction via Tracking Visual States of Arguments [72.54932474653444]
We propose a novel framework to detect video events by tracking the changes in the visual states of all involved arguments.
In order to capture the visual state changes of arguments, we decompose them into changes in pixels within objects, displacements of objects, and interactions among multiple arguments.
arXiv Detail & Related papers (2022-11-03T13:12:49Z) - What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics [14.624063829492764]
We find that caption diversity is a major driving factor behind the generation of generic and uninformative captions.
We show that state-of-the-art models even outperform held-out ground truth captions on modern metrics.
arXiv Detail & Related papers (2022-05-12T17:55:08Z) - Human Instance Segmentation and Tracking via Data Association and Single-stage Detector [17.46922710432633]
Human video instance segmentation plays an important role in computer understanding of human activities.
Most current VIS methods are based on the Mask-RCNN framework.
We develop a new method for human video instance segmentation based on a single-stage detector.
arXiv Detail & Related papers (2022-03-31T11:36:09Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better discriminate between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z) - Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)