OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail
Enhancement
- URL: http://arxiv.org/abs/2003.03715v5
- Date: Tue, 14 Jul 2020 16:51:47 GMT
- Title: OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail
Enhancement
- Authors: Fangyi Zhu, Jenq-Neng Hwang, Zhanyu Ma, Guang Chen, Jun Guo
- Abstract summary: We introduce the object-oriented video captioning network (OVC-Net), which combines a temporal graph with detail enhancement.
To demonstrate its effectiveness, we conduct experiments on a new dataset and compare against state-of-the-art video captioning methods.
- Score: 44.228748086927375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional video captioning asks for a holistic description of the
video, so detailed descriptions of specific objects may not be available.
Without associating objects with their moving trajectories, image-based
data-driven methods cannot infer activities from the spatio-temporal
transitions of inter-object visual features. Moreover, training on ambiguous
clip-sentence pairs hinders learning the multi-modal mapping because of its
one-to-many nature. In this paper, we propose a novel task, object-oriented
video captioning, which describes videos at the object level. We introduce the
object-oriented video captioning network (OVC-Net), which combines a temporal
graph with detail enhancement to analyze activities over time and to capture
vision-language connections stably under small-sample conditions. The temporal
graph supplements previous image-based approaches by reasoning about
activities from the temporal evolution of visual features and the dynamic
movement of spatial locations. The detail enhancement captures discriminative
features among different objects, so that the subsequent captioning module can
yield more informative and precise descriptions. We further construct a new
dataset with consistent object-sentence pairs to facilitate effective
cross-modal learning. To demonstrate the effectiveness, we conduct experiments
on the new dataset and compare against state-of-the-art video captioning
methods. The results show that OVC-Net precisely describes concurrent objects
and achieves state-of-the-art performance.
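The abstract outlines the core design: each tracked object contributes a trajectory of per-frame appearance features and box locations, a temporal encoder reasons over that trajectory, and a captioning module then describes that single object. The snippet below is a minimal, illustrative sketch of this idea in PyTorch; the class names, feature dimensions, and the GRU-based encoder/decoder are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of object-oriented captioning:
# encode one object's trajectory over time, then decode a sentence for it.
import torch
import torch.nn as nn


class ObjectTemporalEncoder(nn.Module):
    """Encodes one object's trajectory: appearance features plus box motion over time."""

    def __init__(self, feat_dim=2048, box_dim=4, hidden=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim + box_dim, hidden)
        # A GRU over frames stands in for the temporal graph: it follows the
        # evolution of visual features and the movement of spatial locations.
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, appearance, boxes):
        # appearance: (B, T, feat_dim) per-frame features of one tracked object
        # boxes:      (B, T, 4)        normalized box coordinates per frame
        x = self.proj(torch.cat([appearance, boxes], dim=-1))
        states, last = self.temporal(x)
        return states, last.squeeze(0)   # per-frame states, trajectory summary


class CaptionDecoder(nn.Module):
    """GRU language model conditioned on the trajectory summary (teacher forcing)."""

    def __init__(self, vocab_size, hidden=512, embed=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.cell = nn.GRUCell(embed, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, summary, tokens):
        # summary: (B, hidden) trajectory summary; tokens: (B, L) caption ids
        h, logits = summary, []
        for t in range(tokens.size(1)):
            h = self.cell(self.embed(tokens[:, t]), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)   # (B, L, vocab_size)
```

In such a setup, each tracked object is paired with its own sentence during training, mirroring the consistent object-sentence pairs the new dataset provides.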
Related papers
- LocoMotion: Learning Motion-Focused Video-Language Representations [45.33444862034461]
We propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions.
We achieve this by adding synthetic motions to videos and using the parameters of these motions to generate corresponding captions.
arXiv Detail & Related papers (2024-10-15T19:33:57Z)
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separately pre-trained feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art results.
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning [41.14313691818424]
We propose an Object-Oriented Non-Autoregressive approach (O2NA) for video captioning.
O2NA performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption to a fluent final caption.
Experiments on two benchmark datasets, MSR-VTT and MSVD, demonstrate the effectiveness of O2NA.
arXiv Detail & Related papers (2021-08-05T04:17:20Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
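The last related paper above, like OVC-Net, reasons over object nodes linked in space and time. Below is a rough, hypothetical sketch of one such graph step in PyTorch: nodes are detected objects across frames, edges come from spatial overlap and track identity, and a single round of message passing mixes neighbouring features. The adjacency construction and layer are illustrative assumptions, not either paper's implementation.

```python
# Hypothetical sketch of a spatio-temporal object graph step (not from a paper).
import torch
import torch.nn as nn


def box_iou(a, b):
    """IoU between two sets of boxes in (x1, y1, x2, y2) format."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-6)


class ObjectGraphLayer(nn.Module):
    """One round of message passing over an object graph (illustrative layer)."""

    def __init__(self, dim=512):
        super().__init__()
        self.msg = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, dim) features of all object nodes in a clip
        # adj:        (N, N)   adjacency from spatial IoU and temporal track links
        norm = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.relu(node_feats + norm @ self.msg(node_feats))
```

Pooling the resulting node states (e.g. with attention) would then feed a standard caption decoder; the distillation mechanism mentioned in that entry is one way to cope with the variable number of object nodes.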