LAVIB: A Large-scale Video Interpolation Benchmark
- URL: http://arxiv.org/abs/2406.09754v1
- Date: Fri, 14 Jun 2024 06:44:01 GMT
- Title: LAVIB: A Large-scale Video Interpolation Benchmark
- Authors: Alexandros Stergiou
- Abstract summary: LAVIB comprises a large collection of high-resolution videos sourced from the web through an automated pipeline.
Metrics are computed for each video's motion magnitudes, luminance conditions, frame sharpness, and contrast.
In total, LAVIB includes 283K clips from 17K ultra-HD videos, covering 77.6 hours.
- Score: 58.194606275650095
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces a LArge-scale Video Interpolation Benchmark (LAVIB) for the low-level video task of video frame interpolation (VFI). LAVIB comprises a large collection of high-resolution videos sourced from the web through an automated pipeline with minimal requirements for human verification. Metrics are computed for each video's motion magnitudes, luminance conditions, frame sharpness, and contrast. The collection of videos and the creation of quantitative challenges based on these metrics are under-explored by current low-level video task datasets. In total, LAVIB includes 283K clips from 17K ultra-HD videos, covering 77.6 hours. Benchmark train, val, and test sets maintain similar video metric distributions. Further splits are also created for out-of-distribution (OOD) challenges, with train and test splits including videos of dissimilar attributes.
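The abstract does not define these four metrics precisely. As a minimal sketch, assuming conventional proxies (mean Farneback optical-flow magnitude for motion, mean gray level for luminance, Laplacian variance for sharpness, and RMS contrast), per-clip scores could be computed along the following lines; the function name and frame budget are illustrative, not LAVIB's actual pipeline:

```python
import cv2
import numpy as np

def clip_metrics(path: str, max_frames: int = 60) -> dict:
    """Score one clip on four LAVIB-style attributes (illustrative proxies only)."""
    cap = cv2.VideoCapture(path)
    grays, flow_mags = [], []
    prev = None
    while len(grays) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            # Dense optical flow between consecutive frames; the mean flow
            # magnitude serves as a proxy for motion strength.
            flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            flow_mags.append(float(np.linalg.norm(flow, axis=2).mean()))
        grays.append(gray)
        prev = gray
    cap.release()
    if not grays:
        raise ValueError(f"no frames decoded from {path}")
    stack = np.stack(grays).astype(np.float32)
    return {
        "motion": float(np.mean(flow_mags)) if flow_mags else 0.0,
        "luminance": float(stack.mean()),   # mean gray level
        "sharpness": float(np.mean([cv2.Laplacian(g, cv2.CV_64F).var()
                                    for g in grays])),  # Laplacian variance
        "contrast": float(stack.std()),     # RMS contrast
    }
```

Stratifying clips on such scores would yield train/val/test splits with matched metric distributions, while the OOD splits instead hold out clips whose attributes differ from those seen in training.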
Related papers
- VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding [15.959757105308238]
Video LMMs rely on either image or video encoders to process visual inputs, each of which has its own limitations.
We introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling).
Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering.
arXiv Detail & Related papers (2024-06-13T17:59:59Z)
- Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs [20.168429351519055]
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
VideoNIAH decouples test video content from query-responses by inserting unrelated image/text 'needles' into original videos.
It generates annotations solely from these needles, ensuring diversity in video sources and a variety of query-responses (a minimal sketch of this needle insertion appears after this list).
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation [11.331198234997714]
The third Pixel-level Video Understanding in the Wild (PVUW CVPR 2024) challenge aims to advance the state of the art in video understanding.
This paper details our work that won 1st place in the PVUW'24 VPS challenge.
Our solution stands on the shoulders of the giant vision transformer model (DINOv2 ViT-g) and proven multi-stage Decoupled Video Instance frameworks.
arXiv Detail & Related papers (2024-06-08T04:43:08Z)
- V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning [76.26890864487933]
We introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube.
V2Xum-LLM is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder.
Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks.
arXiv Detail & Related papers (2024-04-18T17:32:46Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset containing 1,200 long videos, each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- Video-Data Pipelines for Machine Learning Applications [0.9594432031144714]
The proposed framework can be scaled to additional video-sequence datasets for versioned ML deployments.
We analyze the performance of the proposed video-data pipeline for versioned deployment and monitoring of object detection algorithms.
arXiv Detail & Related papers (2021-10-15T20:28:56Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods that use video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
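As a minimal sketch of the VideoNIAH needle insertion mentioned above: an unrelated image patch is pasted into a few frames of a host video, and the annotation is derived from the needle alone, so the host video content never constrains the query-response. The array shapes, placement policy, and QA template below are assumptions for illustration, not the paper's exact protocol.

```python
import random
from dataclasses import dataclass

import numpy as np

@dataclass
class NeedleAnnotation:
    """QA pair derived only from the inserted needle, not the host video."""
    question: str
    answer: str
    frame_indices: list

def insert_image_needle(frames: np.ndarray, needle: np.ndarray,
                        label: str, n_insertions: int = 1):
    """frames: (T, H, W, 3) uint8 host video; needle: (h, w, 3) uint8 patch."""
    T, H, W, _ = frames.shape
    h, w, _ = needle.shape
    out = frames.copy()
    idxs = sorted(random.sample(range(T), n_insertions))
    for t in idxs:
        # Paste the needle at a random position in each chosen frame.
        y = random.randint(0, H - h)
        x = random.randint(0, W - w)
        out[t, y:y + h, x:x + w] = needle
    # Because the annotation depends only on the needle, any web video can
    # serve as the haystack, which is the decoupling the summary describes.
    annotation = NeedleAnnotation(
        question="Which inserted object appears briefly in the video?",
        answer=label,
        frame_indices=idxs,
    )
    return out, annotation
```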
This list is automatically generated from the titles and abstracts of the papers on this site.