The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding
- URL: http://arxiv.org/abs/2503.17116v1
- Date: Fri, 21 Mar 2025 13:01:07 GMT
- Title: The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding
- Authors: Luca Rossetto, Werner Bailer, Duc-Tien Dang-Nguyen, Graham Healy, Björn Þór Jónsson, Onanong Kongmeesub, Hoang-Bao Le, Stevan Rudinac, Klaus Schöffmann, Florian Spiess, Allie Tran, Minh-Triet Tran, Quang-Linh Tran, Cathal Gurrin
- Abstract summary: Egocentric video has seen increased interest in recent years, as it is used in a range of areas. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric video. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second.
- Score: 10.00887999108572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted audio. The dataset is available via https://castle-dataset.github.io/.
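The abstract's figures already pin down the raw scale of the collection: 600 hours at 50 fps corresponds to roughly 600 × 3600 × 50 ≈ 1.08 × 10^8 frames across all streams. The time-aligned, multi-stream layout also suggests a natural access pattern. Below is a minimal sketch, not the official CASTLE tooling: the file names, directory layout, and shared-start-time assumption are hypothetical, and the actual data organisation on https://castle-dataset.github.io/ may differ. It shows how one could pull the same wall-clock instant from every ego- and exo-centric stream with OpenCV, assuming one MP4 per stream recorded at the stated 50 fps.

```python
# Minimal sketch, NOT the official CASTLE tooling: file names, directory layout,
# and the shared-start-time assumption are hypothetical; see
# https://castle-dataset.github.io/ for the actual data organisation.
import cv2  # pip install opencv-python

FPS = 50  # frame rate stated in the abstract


def frame_at(video_path: str, t_seconds: float):
    """Return the frame closest to t_seconds in one stream, or None on failure."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, round(t_seconds * FPS))
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None


# Hypothetical layout: 10 egocentric and 5 exocentric streams for one recording day.
streams = [f"day1/ego_{i:02d}.mp4" for i in range(1, 11)] + \
          [f"day1/exo_{i:02d}.mp4" for i in range(1, 6)]

# Grab the view from every camera at the same instant, e.g. 90 s into the day.
snapshot = {path: frame_at(path, 90.0) for path in streams}
```

Because all sources are time-aligned, indexing by a single timestamp is enough to assemble a synchronized multi-view snapshot; no per-stream offset correction is assumed here.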
Related papers
- Video Individual Counting for Moving Drones [51.429771128144964]
Video Individual Counting (VIC) has received increasing attention recently due to its importance in intelligent video surveillance.
Previous crowd counting datasets are captured with fixed or rarely moving cameras with relatively sparse individuals.
We propose a density-map-based VIC method built on the MovingDroneCrowd dataset.
arXiv Detail & Related papers (2025-03-12T07:09:33Z) - HourVideo: 1-Hour Video-Language Understanding [34.90495038962066]
HourVideo is a benchmark dataset for hour-long video-language understanding.
HourVideo includes 500 manually curated egocentric videos spanning durations of 20 to 120 minutes.
Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance.
arXiv Detail & Related papers (2024-11-07T18:59:16Z) - OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos [58.5538620720541]
The dataset, OVR, contains annotations for over 72K videos.
OVR is almost an order of magnitude larger than previous datasets for video repetition.
We propose a baseline transformer-based counting model, OVRCounter, that can count repetitions in videos up to 320 frames long.
arXiv Detail & Related papers (2024-07-24T08:22:49Z) - CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - Panonut360: A Head and Eye Tracking Dataset for Panoramic Video [0.0]
We present a head and eye tracking dataset involving 50 users watching 15 panoramic videos.
The dataset provides details on the viewport and gaze attention locations of users.
Our analysis reveals a consistent downward offset in gaze fixations relative to the Field of View.
arXiv Detail & Related papers (2024-03-26T13:54:52Z) - EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding [53.275916136138996]
EgoSchema is a very long-form video question-answering dataset, spanning over 250 hours of real video data.
For each question, EgoSchema requires the correct answer to be selected between five given options based on a three-minute-long video clip.
We find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x longer than any other video understanding dataset.
arXiv Detail & Related papers (2023-08-17T17:59:59Z) - Zenseact Open Dataset: A large-scale and diverse multimodal dataset for autonomous driving [3.549770828382121]
Zenseact Open dataset (ZOD) is a large-scale and diverse dataset collected over two years in various European countries.
ZOD boasts the highest range and resolution sensors among comparable datasets.
The dataset is composed of Frames, Sequences, and Drives, designed to encompass both data diversity and support for multimodal-temporal learning.
arXiv Detail & Related papers (2023-05-03T09:59:18Z) - FSVVD: A Dataset of Full Scene Volumetric Video [2.9151420469958533]
In this paper, we focus on the currently most widely used data format, the point cloud, and for the first time release a full-scene volumetric video dataset.
A comprehensive description and analysis of the dataset are provided, along with its potential uses.
arXiv Detail & Related papers (2023-03-07T02:31:08Z) - Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting [64.7364925689825]
Argoverse 2 (AV2) is a collection of three datasets for perception and forecasting research in the self-driving domain.
The Lidar dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose.
The Motion Forecasting dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene.
arXiv Detail & Related papers (2023-01-02T00:36:22Z) - TIMo -- A Dataset for Indoor Building Monitoring with a Time-of-Flight Camera [9.746370805708095]
We present TIMo, a dataset for video-based monitoring of indoor spaces captured using a time-of-flight (ToF) camera.
The resulting depth videos feature people performing a set of different predefined actions.
Person detection for people counting and anomaly detection are the two targeted applications.
arXiv Detail & Related papers (2021-08-27T09:33:11Z) - The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines [88.47608066382267]
We detail how this large-scale dataset was captured by 32 participants in their native kitchen environments.
Recording took place in 4 countries by participants belonging to 10 different nationalities.
Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes.
arXiv Detail & Related papers (2020-04-29T21:57:04Z) - TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval [111.93601253692165]
TV show Retrieval (TVR) is a new multimodal retrieval dataset.
TVR requires systems to understand both videos and their associated subtitle (dialogue) texts.
The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres.
arXiv Detail & Related papers (2020-01-24T17:09:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.