VLCap: Vision-Language with Contrastive Learning for Coherent Video
Paragraph Captioning
- URL: http://arxiv.org/abs/2206.12972v1
- Date: Sun, 26 Jun 2022 20:51:05 GMT
- Title: VLCap: Vision-Language with Contrastive Learning for Coherent Video
Paragraph Captioning
- Authors: Kashu Yamazaki, Sang Truong, Khoa Vo, Michael Kidd, Chase Rainwater,
Khoa Luu, Ngan Le
- Abstract summary: We leverage the human perceptual process, which involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos.
We propose vision-language features consisting of two modalities, i.e., (i) a vision modality to capture the global visual content of the entire scene and (ii) a language modality to extract descriptions of scene elements, both human and non-human objects.
- Score: 8.676412113725561
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we leverage the human perceptual process, which involves
vision and language interaction, to generate a coherent paragraph description of
untrimmed videos. We propose vision-language (VL) features consisting of two
modalities, i.e., (i) a vision modality to capture the global visual content of
the entire scene and (ii) a language modality to extract descriptions of scene
elements, both human and non-human objects (e.g., animals, vehicles) and visual
and non-visual elements (e.g., relations, activities). Furthermore, we train the
proposed VLCap model under a contrastive VL loss. The
experiments and ablation studies on ActivityNet Captions and YouCookII datasets
show that our VLCap outperforms existing SOTA methods on both accuracy and
diversity metrics.
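The abstract does not spell out the loss, but a symmetric InfoNCE-style objective is the standard way to realize a contrastive vision-language loss. Below is a minimal PyTorch sketch under that assumption; the function name, the pooled per-event features, and the temperature are illustrative, not taken from the paper.

    import torch
    import torch.nn.functional as F

    def contrastive_vl_loss(vis_feat, lang_feat, tau=0.07):
        # vis_feat, lang_feat: (batch, dim) pooled features for the same
        # events; matching rows act as positives, other rows as negatives.
        v = F.normalize(vis_feat, dim=-1)
        l = F.normalize(lang_feat, dim=-1)
        logits = v @ l.t() / tau                 # (batch, batch) similarities
        targets = torch.arange(v.size(0), device=v.device)
        # symmetric: vision-to-language plus language-to-vision
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

For example, contrastive_vl_loss(torch.randn(8, 512), torch.randn(8, 512)) returns a scalar loss that pulls matching vision-language pairs together and pushes mismatched pairs apart.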
Related papers
- Enhancing Visual Document Understanding with Contrastive Learning in
Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
- Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
- DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
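As a rough illustration of the feature-to-prompt translation described above, the sketch below maps a vision feature to a sequence of soft prompt embeddings for a frozen language model; the linear projection is a stand-in for the transformer translator the summary mentions, and all dimensions are assumptions.

    import torch.nn as nn

    class FeatureToPromptSketch(nn.Module):
        # Maps one vision feature to num_tokens soft prompt embeddings
        # that are prepended to a frozen LM's input embeddings; only
        # this module is trained. Dimensions are placeholders.
        def __init__(self, vis_dim=2048, lm_dim=768, num_tokens=10):
            super().__init__()
            self.proj = nn.Linear(vis_dim, num_tokens * lm_dim)
            self.num_tokens, self.lm_dim = num_tokens, lm_dim

        def forward(self, vis_feat):             # (batch, vis_dim)
            prompts = self.proj(vis_feat)        # (batch, num_tokens * lm_dim)
            return prompts.view(-1, self.num_tokens, self.lm_dim)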
arXiv Detail & Related papers (2023-09-04T13:59:55Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves SOTA results.
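A compact sketch of a two-level objective in the spirit of the clip-level and video-level training described above, reusing contrastive_vl_loss from the VLCap sketch; the pooling and the level weights are assumptions.

    def hierarchical_contrastive_loss(clip_v, clip_t, video_v, video_t,
                                      w_clip=1.0, w_video=1.0):
        # clip_*: (num_clips, dim) features for clips and their narrations;
        # video_*: (num_videos, dim) features pooled over whole videos and
        # their summaries. The same pairwise loss is applied at each level.
        return (w_clip * contrastive_vl_loss(clip_v, clip_t) +
                w_video * contrastive_vl_loss(video_v, video_t))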
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning [19.73126931526359]
Video paragraph captioning aims to generate a coherent multi-sentence description of an untrimmed video spanning several temporal event locations.
We first propose a visual-linguistic (VL) feature in which the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements.
We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video.
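To make the Transformer-in-Transformer structure concrete, here is a skeletal sketch in which inner blocks model tokens within each event and an outer block models dependencies across events; the encoder-only layers, layer counts, mean pooling, and dimensions are placeholders, not the published architecture.

    import torch.nn as nn

    class TinTSketch(nn.Module):
        def __init__(self, dim=512, heads=8):
            super().__init__()
            inner = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            outer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.inner = nn.TransformerEncoder(inner, num_layers=2)  # intra-event
            self.outer = nn.TransformerEncoder(outer, num_layers=2)  # inter-event

        def forward(self, events):               # (batch, events, tokens, dim)
            b, e, t, d = events.shape
            x = self.inner(events.reshape(b * e, t, d))  # context within events
            x = x.mean(dim=1).reshape(b, e, d)           # pool each event
            return self.outer(x)                         # context across events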
arXiv Detail & Related papers (2022-11-28T07:39:20Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
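One plausible way to fold label data into a single contrastive objective, as the UniVLC description suggests, is to verbalize class labels into caption-like strings so that image-label and video-label pairs become ordinary text pairs; the helper and templates below are hypothetical, not from the paper.

    def verbalize(label: str, is_video: bool) -> str:
        # Hypothetical templates: turn a classification label into a
        # caption-style string so labeled samples can share the same
        # contrastive loss as genuinely paired image/video-text data.
        template = "a video of {}" if is_video else "a photo of {}"
        return template.format(label)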
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
- Vision-and-Language Pretrained Models: A Survey [3.270244666687303]
We present an overview of the major advances achieved in Vision-Language Pretrained Models (VLPMs).
We first discuss the language and vision data encoding methods and then present the mainstream VLPM structure as the core content.
arXiv Detail & Related papers (2022-04-15T07:33:06Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method that makes full use of an external language model (ELM) to integrate abundant linguistic knowledge into the captioning model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
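The teacher-recommended learning idea above amounts to distilling an external language model's soft word distribution into the caption model alongside the usual ground-truth cross-entropy; a minimal sketch, where the mixing weight beta and temperature tau are assumptions.

    import torch.nn.functional as F

    def trl_loss(caption_logits, gt_tokens, elm_logits, beta=0.5, tau=2.0):
        # caption_logits, elm_logits: (batch, seq, vocab); gt_tokens: (batch, seq).
        # Hard supervision from ground-truth caption tokens ...
        ce = F.cross_entropy(caption_logits.transpose(1, 2), gt_tokens)
        # ... plus soft supervision recommended by the external LM (ELM).
        kl = F.kl_div(F.log_softmax(caption_logits / tau, dim=-1),
                      F.softmax(elm_logits / tau, dim=-1),
                      reduction="batchmean")
        return ce + beta * kl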
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.