Multi-modal News Understanding with Professionally Labelled Videos
(ReutersViLNews)
- URL: http://arxiv.org/abs/2401.12419v1
- Date: Tue, 23 Jan 2024 00:42:04 GMT
- Title: Multi-modal News Understanding with Professionally Labelled Videos
(ReutersViLNews)
- Authors: Shih-Han Chou, Matthew Kowal, Yasmin Niknam, Diana Moyano, Shayaan
Mehdi, Richard Pito, Cheng Zhang, Ian Knopke, Sedef Akinli Kocak, Leonid
Sigal, Yalda Mohsenzadeh
- Abstract summary: We present a large-scale analysis of an in-house dataset collected by the Reuters News Agency, the Reuters Video-Language News (ReutersViLNews) dataset.
The dataset focuses on high-level video-language understanding with an emphasis on long-form news.
The results suggest that news-oriented videos are a substantial challenge for current video-language understanding algorithms.
- Score: 25.78619140103048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While progress has been made in the domain of video-language understanding,
current state-of-the-art algorithms are still limited in their ability to
understand videos at high levels of abstraction, such as news-oriented videos.
In contrast, humans easily amalgamate information from video and language to
infer information beyond what is visually observable in the pixels. For
example, when watching a news story, the context of the event can play as
large a role in understanding the story as the event itself. Toward designing
this ability into algorithms, we present a large-scale analysis of an
in-house dataset collected by the Reuters News Agency: the Reuters
Video-Language News (ReutersViLNews) dataset, which focuses on high-level
video-language understanding with an emphasis on long-form news. The
ReutersViLNews dataset consists of long-form news videos collected and labeled
by news industry professionals over several years and contains prominent news
reporting from around the world. Each video involves a single story and
contains action shots of the actual event, interviews with people associated
with the event, footage from nearby areas, and more. The ReutersViLNews
dataset contains videos from seven subject categories: disaster, finance,
entertainment, health, politics, sports, and miscellaneous. Annotations range
from high-level to low-level and include the title caption, a visual video
description, a high-level story description, keywords, and the location. We first present an
analysis of the dataset statistics of ReutersViLNews compared to previous
datasets. Then we benchmark state-of-the-art approaches for four different
video-language tasks. The results suggest that news-oriented videos pose a
substantial challenge for current video-language understanding algorithms,
and we conclude by providing future directions for designing approaches that
can tackle the ReutersViLNews tasks.
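The annotation scheme above maps naturally onto a simple record type. Below is a minimal sketch, in Python, of what a single ReutersViLNews annotation record might look like; every field name here is hypothetical, since the dataset is in-house and its exact format is not described in the abstract.

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical schema for one ReutersViLNews annotation record.
    # Field names are illustrative only; the actual in-house format is not public.

    CATEGORIES = {
        "disaster", "finance", "entertainment",
        "health", "politics", "sports", "miscellaneous",
    }

    @dataclass
    class NewsVideoAnnotation:
        video_id: str            # unique identifier for the long-form video
        category: str            # one of the seven subject categories
        title_caption: str       # low-level: headline-style caption
        visual_description: str  # what is visually observable in the footage
        story_description: str   # high-level: the story behind the event
        keywords: List[str] = field(default_factory=list)
        location: str = ""       # where the story takes place

        def __post_init__(self) -> None:
            # Reject records outside the seven categories named in the abstract.
            if self.category not in CATEGORIES:
                raise ValueError(f"unknown category: {self.category}")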
Related papers
- A Survey of Video Datasets for Grounded Event Understanding [34.11140286628736]
Multimodal AI systems must be capable of well-rounded, common-sense reasoning akin to human visual understanding.
We survey 105 video datasets that require event understanding capability.
arXiv Detail & Related papers (2024-06-14T00:36:55Z) - Video Summarization: Towards Entity-Aware Captions [75.71891605682931]
We propose the task of summarizing news videos directly into entity-aware captions.
We show that our approach generalizes to existing news image-captioning datasets.
arXiv Detail & Related papers (2023-12-01T23:56:00Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate natural-language descriptions, then performing video-understanding tasks on the generated story rather than on the original video (a minimal sketch of this pipeline appears after this list).
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To address the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - Spoken Moments: Learning Joint Audio-Visual Representations from Video
Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z) - Text Synopsis Generation for Egocentric Videos [72.52130695707008]
We propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video.
Users can read the short text to gain insight into the video and, more importantly, efficiently search the contents of a large video database.
arXiv Detail & Related papers (2020-05-08T00:28:00Z) - VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural-language hypothesis based on the video content, a model must infer whether the hypothesis is entailed or contradicted by the clip (a schematic of this data format appears after this list).
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task; it consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)