VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video
Paragraph Captioning
- URL: http://arxiv.org/abs/2211.15103v1
- Date: Mon, 28 Nov 2022 07:39:20 GMT
- Title: VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video
Paragraph Captioning
- Authors: Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le
- Abstract summary: Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling.
We first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities including (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements.
We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video.
- Score: 19.73126931526359
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video paragraph captioning aims to generate a multi-sentence description of
an untrimmed video with several temporal event locations in coherent
storytelling. Following the human perception process, where the scene is
effectively understood by decomposing it into visual (e.g. human, animal) and
non-visual components (e.g. action, relations) under the mutual influence of
vision and language, we first propose a visual-linguistic (VL) feature. In the
proposed VL feature, the scene is modeled by three modalities including (i) a
global visual environment; (ii) local visual main agents; (iii) linguistic
scene elements. We then introduce an autoregressive Transformer-in-Transformer
(TinT) to simultaneously capture the semantic coherence of intra- and
inter-event contents within a video. Finally, we present a new VL contrastive
loss function to guarantee that the learnt embedding features are matched with
the caption semantics. Comprehensive experiments and extensive ablation studies on
ActivityNet Captions and YouCookII datasets show that the proposed
Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior
state-of-the-art methods in terms of accuracy and diversity.
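For a concrete (if simplified) picture of the two components named above, the sketch below fuses the three scene modalities into a single VL feature and applies an InfoNCE-style contrastive loss between pooled video embeddings and caption embeddings. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names (VLFeature, vl_contrastive_loss), the dimensions, the mean-pooling over agents, and the symmetric InfoNCE form are choices made here for clarity, and the paper's exact formulation may differ.

```python
# Illustrative sketch only; all names, shapes, and the loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLFeature(nn.Module):
    """Fuse (i) global environment, (ii) main-agent, and (iii) linguistic features."""
    def __init__(self, d_env=1024, d_agent=1024, d_lang=512, d_model=768):
        super().__init__()
        self.proj_env = nn.Linear(d_env, d_model)     # (i) global visual environment
        self.proj_agent = nn.Linear(d_agent, d_model)  # (ii) local visual main agents
        self.proj_lang = nn.Linear(d_lang, d_model)    # (iii) linguistic scene elements
        self.fuse = nn.Linear(3 * d_model, d_model)

    def forward(self, env, agents, lang):
        # env:    (B, T, d_env)       global scene features per clip
        # agents: (B, T, N, d_agent)  per-agent features, mean-pooled over N agents
        # lang:   (B, T, d_lang)      linguistic scene-element embeddings
        x = torch.cat([self.proj_env(env),
                       self.proj_agent(agents.mean(dim=2)),
                       self.proj_lang(lang)], dim=-1)
        return self.fuse(x)  # (B, T, d_model) VL feature sequence

def vl_contrastive_loss(vl_emb, cap_emb, temperature=0.07):
    """Symmetric InfoNCE between pooled event embeddings and caption embeddings."""
    vl_emb = F.normalize(vl_emb, dim=-1)    # (B, D)
    cap_emb = F.normalize(cap_emb, dim=-1)  # (B, D)
    logits = vl_emb @ cap_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(vl_emb.size(0), device=vl_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In practice such a loss would be added to the usual autoregressive captioning cross-entropy from the TinT decoder; the relative weighting of the two terms is likewise an assumption.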
Related papers
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
- VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning [8.676412113725561]
We leverage the human perceiving process, which involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos.
We propose vision-language features consisting of two modalities, i.e., (i) a vision modality to capture the global visual content of the entire scene and (ii) a language modality to extract scene-element descriptions of both human and non-human objects.
arXiv Detail & Related papers (2022-06-26T20:51:05Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive information change across the video sequence and captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z)
- Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation [25.57530524167637]
Visual dialogue needs to answer a series of coherent questions on the basis of understanding the visual environment.
Visual grounding aims to explicitly locate related objects in the image guided by textual entities.
The multimodal incremental transformer encodes the multi-turn dialogue history, combined with the visual scene, step by step according to the order of the dialogue, and then generates a contextually and visually coherent response.
arXiv Detail & Related papers (2021-09-17T11:39:29Z)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)