Exploiting Context Information for Generic Event Boundary Captioning
- URL: http://arxiv.org/abs/2207.01050v1
- Date: Sun, 3 Jul 2022 14:14:54 GMT
- Title: Exploiting Context Information for Generic Event Boundary Captioning
- Authors: Jinrui Zhang, Teng Wang, Feng Zheng, Ran Cheng, Ping Luo
- Abstract summary: Generic Event Boundary Captioning (GEBC) aims to generate three sentences describing the status change for a given time boundary.
To tackle this issue, we design a model that directly takes the whole video as input and generates captions for all boundaries in parallel.
- Score: 51.53874954616367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generic Event Boundary Captioning (GEBC) aims to generate three sentences
describing the status change for a given time boundary. Previous methods only
process the information of a single boundary at a time, failing to utilize the
video's context information. To tackle this issue, we design a model that
directly takes the whole video as input and generates captions for all
boundaries in parallel. The model learns the context information for each
time boundary by modeling the boundary-boundary interactions. Experiments
demonstrate the effectiveness of context information. The proposed method
achieved a 72.84 score on the test set, and we reached the $2^{nd}$ place in
this challenge. Our code is available at:
\url{https://github.com/zjr2000/Context-GEBC}
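The abstract's core idea, captioning all boundaries in parallel while modeling boundary-boundary interactions, can be sketched with a small transformer decoder in which each boundary is a query that attends to the other boundaries (self-attention) and to the frame features of the whole video (cross-attention). The following is a minimal illustrative sketch, not the authors' released code: all module and parameter names are hypothetical, and a real model would decode the three captions per boundary autoregressively rather than with the toy one-step head used here.

```python
import torch
import torch.nn as nn

class ContextGEBCSketch(nn.Module):
    def __init__(self, feat_dim=512, d_model=256, nhead=8,
                 num_layers=2, vocab_size=1000, max_len=20):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, d_model)
        self.boundary_proj = nn.Linear(feat_dim, d_model)
        # Decoder layers combine self-attention among boundary queries
        # (the boundary-boundary interactions) with cross-attention over
        # the frame features of the whole video (the context).
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Toy caption head: emits max_len token logits per boundary in one
        # step; a real captioner would decode tokens autoregressively.
        self.caption_head = nn.Linear(d_model, vocab_size * max_len)
        self.vocab_size, self.max_len = vocab_size, max_len

    def forward(self, video_feats, boundary_feats):
        # video_feats:    (B, T, feat_dim) frame features of the whole video
        # boundary_feats: (B, K, feat_dim) one feature per time boundary
        memory = self.video_proj(video_feats)
        queries = self.boundary_proj(boundary_feats)
        ctx = self.decoder(tgt=queries, memory=memory)   # (B, K, d_model)
        logits = self.caption_head(ctx)                  # (B, K, max_len*vocab)
        return logits.view(ctx.size(0), ctx.size(1), self.max_len, self.vocab_size)


# All boundaries of a video are captioned in a single forward pass.
model = ContextGEBCSketch()
video = torch.randn(2, 100, 512)      # 2 videos, 100 frames each
bounds = torch.randn(2, 5, 512)       # 5 time boundaries per video
print(model(video, bounds).shape)     # torch.Size([2, 5, 20, 1000])
```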
Related papers
- ObjectNLQ @ Ego4D Episodic Memory Challenge 2024 [51.57555556405898]
We present our approach for the Natural Language Query track and Goal Step track of the Ego4D Episodic Memory Benchmark at CVPR 2024.
Both challenges require the localization of actions within long video sequences using textual queries.
We introduce a novel approach, termed ObjectNLQ, which incorporates an object branch to augment the video representation with detailed object information.
arXiv Detail & Related papers (2024-06-22T07:57:58Z) - EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video
Grounding with Multimodal Large Language Model [63.93372634950661]
We propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries.
Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within initial pseudo boundaries.
arXiv Detail & Related papers (2023-12-05T04:15:56Z) - SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z) - HTNet: Anchor-free Temporal Action Localization with Hierarchical
Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z) - Submission to Generic Event Boundary Detection Challenge@CVPR 2022:
Local Context Modeling and Global Boundary Decoding Approach [46.97359231258202]
Generic event boundary detection (GEBD) is an important yet challenging task in video understanding.
We present a local context modeling and global boundary decoding approach for GEBD task.
arXiv Detail & Related papers (2022-06-30T13:19:53Z) - Reading Between the Lines: Exploring Infilling in Visual Narratives [5.28005598366543]
We present a new large-scale visual procedure telling (ViPT) dataset with a total of 46,200 procedures and around 340k pairwise images.
We conclusively show a METEOR score of 27.51 on procedures which is higher than the state-of-the-art on visual storytelling.
arXiv Detail & Related papers (2020-10-26T23:09:09Z) - DORi: Discovering Object Relationship for Moment Localization of a
Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.