Auto-captions on GIF: A Large-scale Video-sentence Dataset for
Vision-language Pre-training
- URL: http://arxiv.org/abs/2007.02375v1
- Date: Sun, 5 Jul 2020 16:11:57 GMT
- Title: Auto-captions on GIF: A Large-scale Video-sentence Dataset for
Vision-language Pre-training
- Authors: Yingwei Pan and Yehao Li and Jianjie Luo and Jun Xu and Ting Yao and
Tao Mei
- Abstract summary: The Auto-captions on GIF dataset is a new large-scale pre-training dataset for generic video understanding.
All video-sentence pairs are created by automatically extracting and filtering video caption annotations from billions of web pages.
- Score: 112.91603911837436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present Auto-captions on GIF, a new large-scale
pre-training dataset for generic video understanding. All video-sentence pairs
are created by automatically extracting and filtering video caption annotations
from billions of web pages. The Auto-captions on GIF dataset can be utilized to
pre-train a generic feature representation or encoder-decoder structure for
video captioning, as well as for other downstream tasks (e.g., sentence
localization in videos, video question answering). We present a detailed
analysis of the Auto-captions on GIF dataset in comparison to existing
video-sentence datasets. We also provide an evaluation of a Transformer-based
encoder-decoder structure for vision-language pre-training, which is further
adapted to the downstream video captioning task and yields compelling
generalizability on MSR-VTT. The dataset is available at
\url{http://www.auto-video-captions.top/2020/dataset}.
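As a rough illustration of the pre-training setup the abstract describes, the sketch below implements a minimal Transformer-based encoder-decoder for video captioning, trained with teacher-forced cross-entropy on video-sentence pairs. The frame-feature dimension, vocabulary size, layer counts, and the use of pre-extracted frame features are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch (assumptions: pre-extracted 2048-d frame features, a 10k-token
# vocabulary, and default layer sizes). Frame position encoding is omitted for brevity.
import torch
import torch.nn as nn


class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000,
                 nhead=8, num_layers=6, max_len=30):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, d_model)    # project frame features
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim) per-frame features; captions: (B, L) token ids
        memory_in = self.frame_proj(frame_feats)
        pos = torch.arange(captions.size(1), device=captions.device)
        tgt = self.tok_embed(captions) + self.pos_embed(pos)
        causal = self.transformer.generate_square_subsequent_mask(
            captions.size(1)).to(captions.device)
        out = self.transformer(memory_in, tgt, tgt_mask=causal)
        return self.lm_head(out)                           # (B, L, vocab_size)


# One pre-training step with teacher-forced cross-entropy on a video-sentence pair.
model = VideoCaptioner()
feats = torch.randn(2, 16, 2048)           # 2 clips, 16 frames each (dummy features)
caps = torch.randint(0, 10000, (2, 12))    # dummy caption token ids
logits = model(feats, caps[:, :-1])        # predict the next token at each position
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), caps[:, 1:].reshape(-1))
loss.backward()
```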
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training (see the sketch after this list).
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that can take longer subtitle texts into account, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z)
- Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
arXiv Detail & Related papers (2023-04-04T19:11:05Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation [26.36252496316238]
Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips.
Transformers have shown great performance in both machine translation and image captioning, but a straightforward and reproducible application to VTT has been lacking.
We explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture.
arXiv Detail & Related papers (2021-12-28T10:57:18Z)
- SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
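Several of the related papers above (learning text-to-video retrieval from image captioning, HowToCaption, VidIL) share the idea of captioning sampled frames with an off-the-shelf image captioner and using the resulting pseudo-sentences for training or for prompting a language model. The sketch below illustrates only that shared idea; the checkpoint name and the frame-path interface are assumptions for illustration, not any paper's exact pipeline.

```python
# Hedged sketch: build a pseudo-sentence for a video by captioning a few sampled
# frames with an off-the-shelf image captioner (assumed checkpoint; any
# image-to-text model supported by the pipeline works).
from transformers import pipeline
from PIL import Image

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")


def pseudo_caption_video(frame_paths, num_frames=4):
    """Caption a few uniformly sampled frames and join them into one pseudo-sentence."""
    step = max(1, len(frame_paths) // num_frames)
    sampled = frame_paths[::step][:num_frames]
    captions = []
    for path in sampled:
        result = captioner(Image.open(path))       # [{"generated_text": "..."}]
        captions.append(result[0]["generated_text"].strip())
    return ". ".join(captions)


# The resulting (video, pseudo-sentence) pairs could then serve as retrieval or
# captioning pre-training data, or be composed into an LLM prompt as in
# VidIL/HowToCaption. Example call (paths are placeholders):
# text = pseudo_caption_video(["clip0/frame_%03d.jpg" % i for i in range(32)])
```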
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.