Neuro-Symbolic Representations for Video Captioning: A Case for
Leveraging Inductive Biases for Vision and Language
- URL: http://arxiv.org/abs/2011.09530v1
- Date: Wed, 18 Nov 2020 20:21:19 GMT
- Title: Neuro-Symbolic Representations for Video Captioning: A Case for
Leveraging Inductive Biases for Vision and Language
- Authors: Hassan Akbari, Hamid Palangi, Jianwei Yang, Sudha Rao, Asli
Celikyilmaz, Roland Fernandez, Paul Smolensky, Jianfeng Gao, Shih-Fu Chang
- Abstract summary: We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method for learning relations between videos and their paired text descriptions.
- Score: 148.0843278195794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neuro-symbolic representations have proved effective in learning structural information in vision and language. In this paper, we propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning. Our approach uses a dictionary learning-based method for learning relations between videos and their paired text descriptions. We refer to these relations as relative roles and leverage them to make each token role-aware using attention. This results in a more structured and interpretable architecture that incorporates modality-specific inductive biases for the captioning task. Intuitively, the model is able to learn spatial, temporal, and cross-modal relations in a given pair of video and text. The disentanglement achieved by our proposal gives the model more capacity to capture multi-modal structures, which results in higher-quality captions for videos. Our experiments on two established video captioning datasets verify the effectiveness of the proposed approach based on automatic metrics. We further conduct a human evaluation to measure the grounding and relevance of the generated captions and observe consistent improvement for the proposed model. The code and trained models are available at
https://github.com/hassanhub/R3Transformer
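As a rough, non-authoritative sketch of the core idea, one way to make tokens role-aware is to soft-assign each token to entries of a learned role dictionary and fold the blended role embedding into self-attention. The module layout, names, and dimensions below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoleAwareAttention(nn.Module):
    """Illustrative sketch: self-attention whose inputs are modulated by a
    learned dictionary of 'relative role' vectors (hypothetical layout)."""
    def __init__(self, dim=512, n_roles=32, n_heads=8):
        super().__init__()
        self.roles = nn.Parameter(torch.randn(n_roles, dim))   # role dictionary
        self.to_role_logits = nn.Linear(dim, n_roles)          # token -> role scores
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):                                 # (B, T, dim)
        # Soft-assign each token to roles, then add its blended role embedding.
        role_weights = F.softmax(self.to_role_logits(tokens), dim=-1)  # (B, T, n_roles)
        role_aware = tokens + role_weights @ self.roles                # (B, T, dim)
        out, _ = self.attn(role_aware, role_aware, role_aware)
        return out
```

In the paper, the video and text streams would each carry such modality-specific structure; the sketch shows only single-stream mechanics.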
Related papers
- Video In-context Learning [46.40277880351059]
In this paper, we study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences.
To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets.
We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results.
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
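Conceptually, the training objective described above reduces to next-token prediction over discretized video. Below is a minimal sketch under the assumption of a VQ-style tokenizer (not shown); all hyperparameters and names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class VideoARModel(nn.Module):
    """Toy autoregressive Transformer over discrete video tokens; a
    prompt clip conditions generation of future frames."""
    def __init__(self, vocab=8192, dim=512, layers=6, heads=8, max_len=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                                 # (B, T) token ids
        T = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        return self.head(self.blocks(x, mask=mask))            # next-token logits

# Training shifts targets by one position, e.g.:
# loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1))
```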
- Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model [10.742625681420279]
Based on a large-scale pre-trained Vision Language Model (VLM), our model learns and improves visual, textual, and numerical representations of patient gait videos.
Results demonstrate that our model not only significantly outperforms state-of-the-art methods in video-based classification tasks but also adeptly decodes the learned class-specific text features into natural language descriptions.
arXiv Detail & Related papers (2024-03-20T17:03:38Z)
- EC^2: Emergent Communication for Embodied Control [72.99894347257268]
Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments.
We propose Emergent Communication for Embodied Control (EC2), a novel scheme to pre-train video-language representations for few-shot embodied control.
EC2 consistently outperforms previous contrastive learning methods when both videos and texts are used as task inputs.
arXiv Detail & Related papers (2023-04-19T06:36:02Z)
- CaMEL: Mean Teacher Learning for Image Captioning [47.9708610052655]
We present CaMEL, a novel Transformer-based architecture for image captioning.
Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase.
Experimentally, we assess the effectiveness of the proposed solution on the COCO dataset and in conjunction with different visual feature extractors.
arXiv Detail & Related papers (2022-02-21T19:04:46Z)
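For readers unfamiliar with the term, here is a minimal sketch of the mean-teacher update that CaMEL's name alludes to. The surrounding loop and names are assumptions, and the paper's coupling of the two language models is richer than this.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Mean teacher: the teacher is an exponential moving average of the
    student's weights rather than an independently trained model."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)

# Hypothetical loop: the student optimizes a captioning loss plus a
# consistency term toward the teacher's predictions; the teacher tracks it.
# teacher = copy.deepcopy(student)
# for batch in loader:
#     loss = caption_loss(student, batch) + consistency(student, teacher, batch)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     ema_update(teacher, student)
```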
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
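The teacher-to-student transfer step can be pictured with the standard soft-label objective. The sketch below is a generic knowledge-distillation loss, not VidLanKD's exact recipe (the paper also transfers knowledge beyond output logits).

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label knowledge distillation (Hinton et al.): the student
    matches the teacher's softened output distribution."""
    t = temperature
    log_student = F.log_softmax(student_logits / t, dim=-1)
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 to keep gradient magnitudes stable.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```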
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
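One common way to learn such a sentence-video mapping without annotations is a symmetric contrastive objective over paired clips and subtitles. The sketch below is a generic InfoNCE loss, offered as an illustration rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: matched video/sentence pairs are
    pulled together, mismatched pairs pushed apart."""
    v = F.normalize(video_emb, dim=-1)                  # (B, D)
    s = F.normalize(text_emb, dim=-1)                   # (B, D)
    logits = v @ s.t() / temperature                    # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```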
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
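At its simplest, the joint word/tag prediction described in the entry above is a weighted multi-task loss; the sketch below assumes flattened logits and an illustrative auxiliary weight.

```python
import torch.nn.functional as F

def multitask_caption_loss(word_logits, word_targets,
                           tag_logits, tag_targets, tag_weight=0.5):
    """Sketch of a joint objective: predict caption words and, as an
    auxiliary task, an object/predicate tag sequence (shapes assumed:
    logits (B, T, V), targets (B, T); the weighting is an assumption)."""
    word_loss = F.cross_entropy(word_logits.flatten(0, 1), word_targets.flatten())
    tag_loss = F.cross_entropy(tag_logits.flatten(0, 1), tag_targets.flatten())
    return word_loss + tag_weight * tag_loss
```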
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
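As a generic picture of what an object relational graph encoder can look like, the sketch below lets detected-object features exchange information over a learned adjacency. It is a plain attention-style graph layer under assumed shapes, not the ORG formulation from the paper.

```python
import torch
import torch.nn as nn

class RelationalObjectEncoder(nn.Module):
    """Sketch: object features attend over a learned adjacency built from
    pairwise similarity, enriching each object with relational context."""
    def __init__(self, dim=1024):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, objects):                     # (B, N, dim) detected objects
        scores = self.query(objects) @ self.key(objects).transpose(1, 2)
        adj = torch.softmax(scores / objects.size(-1) ** 0.5, dim=-1)  # (B, N, N)
        # Each object aggregates features from its relational neighbors.
        return objects + torch.relu(self.update(adj @ objects))
```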