Spatio-Temporal Graph for Video Captioning with Knowledge Distillation
- URL: http://arxiv.org/abs/2003.13942v1
- Date: Tue, 31 Mar 2020 03:58:11 GMT
- Title: Spatio-Temporal Graph for Video Captioning with Knowledge Distillation
- Authors: Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan
Adeli, Juan Carlos Niebles
- Abstract summary: We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid unstable performance caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
- Score: 50.034189314258356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning is a challenging task that requires a deep understanding of
visual scenes. State-of-the-art methods generate captions using either
scene-level or object-level information but without explicitly modeling object
interactions. Thus, they often fail to make visually grounded predictions, and
are sensitive to spurious correlations. In this paper, we propose a novel
spatio-temporal graph model for video captioning that exploits object
interactions in space and time. Our model builds interpretable links and is
able to provide explicit visual grounding. To avoid unstable performance caused
by the variable number of objects, we further propose an object-aware knowledge
distillation mechanism, in which local object information is used to regularize
global scene features. We demonstrate the efficacy of our approach through
extensive experiments on two benchmarks, showing our approach yields
competitive performance with interpretable predictions.
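As a rough illustration of the object-aware knowledge distillation described above, the sketch below regularizes a global scene feature with mean-pooled local object features. All module names, dimensions, and the choice of an L2 penalty are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAwareDistillation(nn.Module):
    """Hypothetical sketch: regularize a global scene feature with
    aggregated local object features, as the abstract describes.
    Dimensions and the L2 loss are assumptions, not the paper's spec."""

    def __init__(self, scene_dim=2048, obj_dim=1024, shared_dim=512):
        super().__init__()
        self.scene_proj = nn.Linear(scene_dim, shared_dim)  # global branch
        self.obj_proj = nn.Linear(obj_dim, shared_dim)      # local branch

    def forward(self, scene_feat, obj_feats, obj_mask):
        # scene_feat: (B, scene_dim)   global scene feature per clip
        # obj_feats:  (B, N, obj_dim)  padded per-object features
        # obj_mask:   (B, N)           1.0 for real objects, 0.0 for padding
        s = self.scene_proj(scene_feat)                     # (B, D)
        o = self.obj_proj(obj_feats)                        # (B, N, D)
        # Masked mean pooling makes the target invariant to the
        # (variable) number of detected objects.
        denom = obj_mask.sum(dim=1, keepdim=True).clamp(min=1.0)
        o_pooled = (o * obj_mask.unsqueeze(-1)).sum(dim=1) / denom
        # Distillation term: pull the scene feature toward the pooled
        # object evidence; detach so gradients shape only the global branch.
        return F.mse_loss(s, o_pooled.detach())
```

During training, such a term would typically be added to the captioning objective with a small weight, e.g. loss = caption_loss + 0.1 * distill_loss; the weight here is likewise an assumption.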
Related papers
- Context-Aware Temporal Embedding of Objects in Video Data [0.8287206589886881]
In video analysis, understanding the temporal context is crucial for recognizing object interactions, event patterns, and contextual changes over time.
The proposed model leverages adjacency and semantic similarities between objects from neighboring video frames to construct context-aware temporal object embeddings.
Empirical studies demonstrate that our context-aware temporal embeddings can be used in conjunction with conventional visual embeddings to enhance the effectiveness of downstream applications.
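One plausible reading of this mechanism is sketched below: each object's embedding is refined with similarity-weighted features of objects from the adjacent frames. The function and its parameters are hypothetical and only illustrate the summary, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def context_aware_embeddings(frames, temperature=0.1):
    """frames: list of (N_t, D) tensors, one per video frame.
    Returns embeddings that mix each object with semantically similar
    objects from the neighboring frames (illustrative only)."""
    refined = []
    for t, cur in enumerate(frames):
        neighbors = []
        if t > 0:
            neighbors.append(frames[t - 1])
        if t + 1 < len(frames):
            neighbors.append(frames[t + 1])
        if not neighbors:
            refined.append(cur)
            continue
        ctx = torch.cat(neighbors, dim=0)                  # (M, D)
        # Cosine similarity between current and neighbor-frame objects.
        sim = F.normalize(cur, dim=1) @ F.normalize(ctx, dim=1).T
        attn = F.softmax(sim / temperature, dim=1)         # (N_t, M)
        refined.append(cur + attn @ ctx)                   # residual mix
    return refined
```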
arXiv Detail & Related papers (2024-08-23T01:44:10Z)
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of representations learned on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation and performs favorably against state-of-the-art approaches.
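A compact sketch of one common co-attention formulation for fusing low-level and high-level feature maps is given below; the shapes and the shared affinity matrix are assumptions, and the paper's actual module may differ.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Hypothetical co-attention block: low-level and high-level feature
    maps attend to each other through a shared affinity matrix."""

    def __init__(self, dim=256):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, low, high):
        # low, high: (B, HW, dim) flattened spatial feature maps
        # Affinity between every low-level and high-level location.
        A = torch.bmm(self.affinity(low), high.transpose(1, 2))  # (B, HW, HW)
        low_ctx = torch.bmm(torch.softmax(A, dim=2), high)       # attend to high
        high_ctx = torch.bmm(torch.softmax(A, dim=1).transpose(1, 2), low)
        # Residual fusion keeps the original features intact.
        return low + low_ctx, high + high_ctx
```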
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions [81.88294320397826]
In this weakly supervised setting, a system does not know what human-object interactions are present in a video, nor the actual location of the human and the object.
We introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
arXiv Detail & Related papers (2021-10-07T15:30:18Z)
- Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV).
This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering).
We tackle the challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph.
Experimental results demonstrate that our model not only significantly outperforms baseline approaches, but also produces visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z)
- OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement [44.228748086927375]
We introduce an object-oriented video captioning network (OVC-Net) built on a temporal graph and detail enhancement.
To demonstrate its effectiveness, we conduct experiments on the new dataset and compare against state-of-the-art video captioning methods.
arXiv Detail & Related papers (2020-03-08T04:34:58Z)