MultiVENT: Multilingual Videos of Events with Aligned Natural Text
- URL: http://arxiv.org/abs/2307.03153v1
- Date: Thu, 6 Jul 2023 17:29:34 GMT
- Title: MultiVENT: Multilingual Videos of Events with Aligned Natural Text
- Authors: Kate Sanders, David Etter, Reno Kriz, Benjamin Van Durme
- Abstract summary: MultiVENT is a dataset of multilingual, event-centric videos grounded in text documents across five target languages.
We analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models.
- Score: 29.266266741468055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Everyday news coverage has shifted from traditional broadcasts towards a wide
range of presentation formats such as first-hand, unedited video footage.
Datasets that reflect the diverse array of multimodal, multilingual news
sources available online could be used to teach models to benefit from this
shift, but existing news video datasets focus on traditional news broadcasts
produced for English-speaking audiences. We address this limitation by
constructing MultiVENT, a dataset of multilingual, event-centric videos
grounded in text documents across five target languages. MultiVENT includes
both news broadcast videos and non-professional event footage, which we use to
analyze the state of online news videos and how they can be leveraged to build
robust, factually accurate models. Finally, we provide a model for complex,
multilingual video retrieval to serve as a baseline for information retrieval
using MultiVENT.
Related papers
- MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval [57.891157692501345]
MultiVENT 2.0 is a large-scale, multilingual event-centric video retrieval benchmark.
It features a collection of more than 218,000 news videos and 3,906 queries targeting specific world events.
Preliminary results show that state-of-the-art vision-language models struggle significantly with this task.
arXiv Detail & Related papers (2024-10-15T13:56:34Z)
- Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews) [25.78619140103048]
We present a large-scale analysis on an in-house dataset collected by the Reuters News Agency, called Reuters Video-Language News (ReutersViLNews) dataset.
The dataset focuses on high-level video-language understanding with an emphasis on long-form news.
The results suggest that news-oriented videos are a substantial challenge for current video-language understanding algorithms.
arXiv Detail & Related papers (2024-01-23T00:42:04Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval [39.41224716332499]
We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval.
Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages.
We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset to 8 other languages.
arXiv Detail & Related papers (2022-10-07T15:30:24Z)
- MuMUR: Multilingual Multimodal Universal Retrieval [19.242056928318913]
We propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs.
We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space.
arXiv Detail & Related papers (2022-08-24T13:55:15Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- 3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos [72.69052180249598]
We present 3MASSIV, a multilingual, multimodal and multi-aspect, expertly-annotated dataset of diverse short videos extracted from the short-video social media platform Moj.
3MASSIV comprises 50K short videos (20 seconds average duration) and 100K unlabeled videos in 11 different languages.
We show how the social media content in 3MASSIV is dynamic and temporal in nature, which can be used for semantic understanding tasks and cross-lingual analysis.
arXiv Detail & Related papers (2022-03-28T02:47:01Z)
- Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models [144.85290716246533]
We study zero-shot cross-lingual transfer of vision-language models.
We propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.
arXiv Detail & Related papers (2021-03-16T04:37:40Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)