Lecture Video Visual Objects (LVVO) Dataset: A Benchmark for Visual Object Detection in Educational Videos
- URL: http://arxiv.org/abs/2506.13657v2
- Date: Tue, 17 Jun 2025 04:05:44 GMT
- Title: Lecture Video Visual Objects (LVVO) Dataset: A Benchmark for Visual Object Detection in Educational Videos
- Authors: Dipayan Biswas, Shishir Shah, Jaspal Subhlok
- Abstract summary: The Lecture Video Visual Objects dataset is a new benchmark for visual object detection in educational video content. The dataset consists of 4,000 frames extracted from 245 lecture videos spanning biology, computer science, and geosciences.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the Lecture Video Visual Objects (LVVO) dataset, a new benchmark for visual object detection in educational video content. The dataset consists of 4,000 frames extracted from 245 lecture videos spanning biology, computer science, and geosciences. A subset of 1,000 frames, referred to as LVVO_1k, has been manually annotated with bounding boxes for four visual categories: Table, Chart-Graph, Photographic-image, and Visual-illustration. Each frame was labeled independently by two annotators, resulting in an inter-annotator F1 score of 83.41%, indicating strong agreement. To ensure high-quality consensus annotations, a third expert reviewed and resolved all cases of disagreement through a conflict resolution process. To expand the dataset, a semi-supervised approach was employed to automatically annotate the remaining 3,000 frames, forming LVVO_3k. The complete dataset offers a valuable resource for developing and evaluating both supervised and semi-supervised methods for visual content detection in educational videos. The LVVO dataset is publicly available to support further research in this domain.
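The abstract reports an inter-annotator F1 score of 83.41% but does not specify the box-matching rule behind it. Below is a minimal, hypothetical sketch of how such an agreement score could be computed, assuming class-aware greedy matching of bounding boxes at an IoU threshold of 0.5; the function names, box format, and threshold are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: inter-annotator agreement as an F1 score over bounding boxes.
# Assumptions (not specified in the abstract): boxes are (x1, y1, x2, y2, label),
# two boxes "agree" when their labels match and IoU >= 0.5, and matching is greedy.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def agreement_f1(boxes_a, boxes_b, iou_thr=0.5):
    """F1 between two annotators' boxes: [(x1, y1, x2, y2, label), ...]."""
    matched_b = set()
    tp = 0
    for xa in boxes_a:
        for j, xb in enumerate(boxes_b):
            if j in matched_b or xa[4] != xb[4]:
                continue
            if iou(xa[:4], xb[:4]) >= iou_thr:
                matched_b.add(j)
                tp += 1
                break
    fp = len(boxes_a) - tp  # boxes only annotator A drew
    fn = len(boxes_b) - tp  # boxes only annotator B drew
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Example: one frame annotated by two people.
a = [(10, 10, 100, 80, "Chart-Graph"), (120, 40, 200, 90, "Table")]
b = [(12, 8, 98, 82, "Chart-Graph")]
print(agreement_f1(a, b))  # ~0.667: one matched box, one box only annotator A drew
```

In practice the per-frame counts would be accumulated over all 1,000 frames of LVVO_1k before computing a single corpus-level F1; the exact aggregation used by the authors is not stated in the abstract.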
Related papers
- MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation [3.229267555477331]
MUVOD is a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images with their corresponding segmentation masks in 4D motion, meaning that any object of interest in the scene could be tracked across temporal frames of a given view or across different views belonging to the same camera rig.
arXiv Detail & Related papers (2025-07-10T08:07:59Z)
- Visual Content Detection in Educational Videos with Transfer Learning and Dataset Enrichment [0.0]
This paper reports on a transfer learning approach for detecting visual elements in lecture video frames. YOLO was optimized for lecture video object detection by training on multiple benchmark datasets and deploying a semi-supervised auto-labeling strategy.
arXiv Detail & Related papers (2025-06-27T04:43:05Z)
- A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods [6.076406622352117]
We introduce a new dataset for 360-degree video summarization: the transformation of 360-degree video content into concise 2D video summaries.
The dataset includes ground-truth human-generated summaries that can be used for training and objectively evaluating 360-degree video summarization methods.
arXiv Detail & Related papers (2024-06-05T06:43:48Z)
- 360VOTS: Visual Object Tracking and Segmentation in Omnidirectional Videos [16.372814014632944]
We propose a comprehensive dataset and benchmark that incorporates a new component called omnidirectional video object segmentation (360VOS).
The 360VOS dataset includes 290 sequences accompanied by dense pixel-wise masks and covers a broader range of target categories.
We benchmark state-of-the-art approaches and demonstrate the effectiveness of our proposed 360 tracking framework and training dataset.
arXiv Detail & Related papers (2024-04-22T07:54:53Z)
- Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
- Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene Segmentation [12.104032818304745]
We construct the 'Tencent Ads Video' (TAVS) dataset in the ads domain to escalate multi-modal video analysis to a new level.
TAVS describes videos from three independent perspectives, namely 'presentation form', 'place', and 'style', and contains rich multi-modal information such as video, audio, and text.
It includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500 labels.
arXiv Detail & Related papers (2022-12-09T07:26:20Z)
- Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into common spaces for semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z)
- Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of the facets for model training, resulting in video representations that are biased toward a single facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Few-Shot Video Object Detection [70.43402912344327]
We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions.
FSVOD-500 comprises 500 classes with class-balanced videos in each category for few-shot learning.
Our TPN and TMN+ are jointly and end-to-end trained.
arXiv Detail & Related papers (2021-04-30T07:38:04Z)
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
- Broaden Your Views for Self-Supervised Video Learning [97.52216510672251]
We introduce BraVe, a self-supervised learning framework for video.
In BraVe, one of the views has access to a narrow temporal window of the video while the other view has broad access to the video content.
We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks.
arXiv Detail & Related papers (2021-03-30T17:58:46Z)