Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision
- URL: http://arxiv.org/abs/2011.09634v2
- Date: Mon, 11 Jan 2021 09:52:47 GMT
- Title: Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision
- Authors: Yujie Zhong, Linhai Xie, Sen Wang, Lucia Specia, Yishu Miao
- Abstract summary: We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset `ApartmenTour' that contains a large number of online videos and subtitles.
- Score: 54.73758942064708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we teach machines to understand visuals and natural language
by learning the mapping between sentences and noisy video snippets without
explicit annotations. First, we define a self-supervised learning framework
that captures cross-modal information. A novel adversarial learning module
is then introduced to explicitly handle the noise in natural videos, where
subtitle sentences are not guaranteed to correspond closely to the video
snippets. For training and evaluation, we contribute a new dataset
`ApartmenTour' that contains a large number of online videos and subtitles. We
carry out experiments on bidirectional retrieval between sentences and videos,
and the results demonstrate that our model achieves state-of-the-art
performance on both retrieval tasks, exceeding several strong baselines. The
dataset can be downloaded at https://github.com/zyj-13/WAL.
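The abstract does not spell out the architecture, but a common realization of such a framework is a dual encoder trained with a bidirectional contrastive objective, plus a learned scorer that down-weights poorly aligned subtitle/snippet pairs. The PyTorch sketch below is illustrative only: all module names, feature dimensions, and the sigmoid-weighted loss are our assumptions, not the authors' exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Projects pre-extracted video and sentence features into a joint space."""
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Small scorer that rates how well a (video, sentence) pair aligns;
        # an illustrative stand-in for the paper's adversarial noise module.
        self.pair_scorer = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def noise_weighted_contrastive_loss(model, v, t, temperature=0.07):
    """Bidirectional InfoNCE loss re-weighted by per-pair alignment scores,
    so noisy subtitle/snippet pairs contribute less to training."""
    sim = v @ t.T / temperature                        # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    w = torch.sigmoid(model.pair_scorer(torch.cat([v, t], dim=-1))).squeeze(-1)
    loss_v2t = F.cross_entropy(sim, targets, reduction="none")
    loss_t2v = F.cross_entropy(sim.T, targets, reduction="none")
    # The paper trains its noise module adversarially; that game is omitted here.
    return (w.detach() * (loss_v2t + loss_t2v)).mean() / 2
```

At evaluation time, both retrieval directions reduce to ranking the rows (video-to-sentence) or columns (sentence-to-video) of the same cosine-similarity matrix and reporting recall@K.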
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The plain, simple text descriptions and visual-only focus of current language-video tasks limit performance on real-world natural-language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science (a minimal pipeline sketch follows this entry).
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
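The entry above suggests a two-stage, zero-shot pipeline: verbalize the video into a textual story, then run language-only understanding on that story. Below is a minimal sketch under our own assumptions (BLIP captioning and an extractive QA model from Hugging Face, uniform frame sampling); the paper's actual models and prompting may differ.

```python
# Illustrative "verbalize, then understand" pipeline; model choices, frame
# sampling, and the QA formulation are assumptions, not the paper's exact setup.
import cv2
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def verbalize(video_path, num_frames=8):
    """Uniformly sample frames and caption each one to build a textual story."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    captions = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            break
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        captions.append(captioner(image)[0]["generated_text"])
    cap.release()
    return " ".join(captions)

story = verbalize("video.mp4")
print(qa(question="What persuasion strategy is used?", context=story))
```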
- Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is that collecting annotations is extremely time- and cost-consuming.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer (a minimal sketch follows this entry).
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
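The TL;DW? entry above names two components: a context-aware temporal video encoder and a segment scoring transformer. A minimal sketch of that shape is below; layer sizes, the linear scoring head, and the top-k selection are our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class InstructionalSummarizer(nn.Module):
    """Illustrative sketch: a transformer encodes segment features in temporal
    context, and a head scores each segment for inclusion in the summary."""
    def __init__(self, feat_dim=512, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.segment_scorer = nn.Linear(feat_dim, 1)

    def forward(self, segment_feats):               # (B, num_segments, feat_dim)
        context = self.temporal_encoder(segment_feats)
        return self.segment_scorer(context).squeeze(-1)    # (B, num_segments)

# The summary is the k highest-scoring segments.
scores = InstructionalSummarizer()(torch.randn(1, 20, 512))
summary_idx = scores.topk(k=5, dim=-1).indices
```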
- Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into a common space for measuring semantic similarity.
Inspired by the reading strategy of humans, we propose Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose RegionLearner, a module for video-text learning that can take the structure of objects into account during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary-learning-based method to learn relations between videos and their paired text descriptions (a minimal sketch of the generic technique follows this entry).
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
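The entry above mentions a dictionary-learning-based method for relating videos to their paired descriptions, without further detail. The sketch below shows only the generic technique: both modalities are encoded as soft codes over a shared learned dictionary, and the codes of paired examples are pulled together. All sizes and the softmax coding scheme are our assumptions, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDictionary(nn.Module):
    """Generic sketch: encode features as soft assignments over a shared
    learned dictionary of atoms (a stand-in for true sparse coding)."""
    def __init__(self, feat_dim=512, num_atoms=256):
        super().__init__()
        self.atoms = nn.Parameter(torch.randn(num_atoms, feat_dim))

    def encode(self, x):                       # (B, feat_dim) -> (B, num_atoms)
        sim = F.normalize(x, dim=-1) @ F.normalize(self.atoms, dim=-1).T
        return F.softmax(sim / 0.1, dim=-1)    # soft assignment over atoms

def relation_alignment_loss(dictionary, video_feats, text_feats):
    """Pull the codes of paired video/text examples together."""
    cv, ct = dictionary.encode(video_feats), dictionary.encode(text_feats)
    return (1 - F.cosine_similarity(cv, ct, dim=-1)).mean()
```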
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.