Generic Event Boundary Captioning: A Benchmark for Status Changes Understanding
- URL: http://arxiv.org/abs/2204.00486v1
- Date: Fri, 1 Apr 2022 14:45:30 GMT
- Title: Generic Event Boundary Captioning: A Benchmark for Status Changes Understanding
- Authors: Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli,
Mike Zheng Shou
- Abstract summary: We introduce a new dataset called Kinetic-GEBC (Generic Event Boundary Captioning).
The dataset consists of over 170k boundaries associated with captions describing status changes in 12K videos.
We propose three tasks supporting the development of a more fine-grained, robust, and human-like understanding of videos through status changes.
- Score: 22.618840285782127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cognitive science has shown that humans perceive videos in terms of events
separated by the state changes of dominant subjects. State changes trigger new
events and are among the most useful cues in the large amount of otherwise
redundant information perceived. However, previous research has focused on the
overall understanding of segments without evaluating the fine-grained status
changes inside them. In this paper, we introduce a new dataset called
Kinetic-GEBC (Generic Event Boundary Captioning). The dataset consists of over
170k boundaries associated with captions describing status changes in the
generic events of 12K videos. Building on this new dataset, we propose three
tasks supporting the development of a more fine-grained, robust, and human-like
understanding of videos through status changes. We evaluate many representative
baselines on our dataset and design a new TPD (Temporal-based Pairwise
Difference) modeling method for current state-of-the-art backbones, achieving
significant performance improvements. The results also show that current
methods still face formidable challenges in exploiting different granularities,
representing visual differences, and accurately localizing status changes.
Further analysis shows that our dataset can drive the development of more
powerful methods for understanding status changes and thus improve video-level
comprehension.
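The abstract names TPD (Temporal-based Pairwise Difference) modeling only at a high level. As a rough, hypothetical sketch of the general idea (not the authors' implementation), the PyTorch snippet below represents a candidate boundary by the pairwise differences between frame features sampled before and after it, fused with a summary of the surrounding context; the module name, dimensions, and window size are all assumptions.

```python
# Hypothetical sketch of a Temporal-based Pairwise Difference (TPD) style module.
# This is NOT the paper's implementation; it only illustrates modeling a boundary
# by the pairwise differences between frame features before and after it.
import torch
import torch.nn as nn


class TPDModule(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        # Project raw features and pairwise differences into a shared space.
        self.diff_proj = nn.Linear(feat_dim, hidden_dim)
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.fuse = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden_dim, hidden_dim))

    def forward(self, frame_feats: torch.Tensor, boundary_idx: int, window: int = 4):
        # frame_feats: (T, D) per-frame features from any video backbone.
        # Assumes window <= boundary_idx <= T - window so both sides are non-empty.
        before = frame_feats[boundary_idx - window:boundary_idx]      # (w, D)
        after = frame_feats[boundary_idx:boundary_idx + window]       # (w, D)

        # Pairwise differences between every (after, before) frame pair capture
        # what changed across the boundary.
        diffs = after.unsqueeze(1) - before.unsqueeze(0)              # (w, w, D)
        diff_repr = self.diff_proj(diffs).mean(dim=(0, 1))            # (hidden,)

        # Keep an appearance summary of the local context as well.
        context = self.feat_proj(torch.cat([before, after])).mean(0)  # (hidden,)

        # Fuse change and context cues into a single boundary representation.
        return self.fuse(torch.cat([diff_repr, context], dim=-1))     # (hidden,)
```

For example, given `feats = torch.randn(64, 512)`, calling `TPDModule()(feats, boundary_idx=30)` returns a 256-dimensional boundary representation that could feed captioning, grounding, or retrieval heads for the three proposed tasks.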
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- Anticipating Object State Changes [0.8428703116072809]
The proposed framework predicts object state changes that will occur in the near future due to yet unseen human actions.
It integrates learned visual features that represent recent visual information with natural language processing (NLP) features that represent past object state changes and actions.
The proposed approach also underscores the potential of integrating video and linguistic cues to enhance the predictive performance of video understanding systems.
arXiv Detail & Related papers (2024-05-21T13:40:30Z)
- OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z)
- MS-Former: Memory-Supported Transformer for Weakly Supervised Change Detection with Patch-Level Annotations [50.79913333804232]
We propose a memory-supported transformer (MS-Former) for weakly supervised change detection.
MS-Former consists of a bi-directional attention block (BAB) and a patch-level supervision scheme (PSS).
Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method in the change detection task.
arXiv Detail & Related papers (2023-11-16T09:57:29Z)
- Visual Reasoning: from State to Transformation [80.32402545546209]
Existing visual reasoning tasks ignore an important factor, i.e., transformation.
We propose a novel transformation-driven visual reasoning (TVR) task.
We show that state-of-the-art visual reasoning models perform well on Basic, but are far from human-level intelligence on Event, View, and TRANCO.
arXiv Detail & Related papers (2023-05-02T14:24:12Z)
- Video Event Extraction via Tracking Visual States of Arguments [72.54932474653444]
We propose a novel framework to detect video events by tracking the changes in the visual states of all involved arguments.
In order to capture the visual state changes of arguments, we decompose them into changes in pixels within objects, displacements of objects, and interactions among multiple arguments.
arXiv Detail & Related papers (2022-11-03T13:12:49Z)
- What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics [14.624063829492764]
We find that caption diversity is a major driving factor behind the generation of generic and uninformative captions.
We show that state-of-the-art models even outperform held-out ground truth captions on modern metrics.
arXiv Detail & Related papers (2022-05-12T17:55:08Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)