Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera
- URL: http://arxiv.org/abs/2405.19794v1
- Date: Thu, 30 May 2024 08:02:05 GMT
- Title: Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera
- Authors: Inpyo Song, Minjun Joo, Joonhyung Kwon, Jangwon Lee
- Abstract summary: This paper addresses the daily challenges encountered by visually impaired individuals, such as limited access to information, navigation difficulties, and barriers to social interaction.
To alleviate these challenges, we introduce a novel visual question answering dataset.
It features videos captured using a 360-degree egocentric wearable camera, enabling observation of the entire surroundings.
- Score: 2.427410108595295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the daily challenges encountered by visually impaired individuals, such as limited access to information, navigation difficulties, and barriers to social interaction. To alleviate these challenges, we introduce a novel visual question answering dataset. Our dataset offers two significant advancements over previous datasets. First, it features videos captured with a 360-degree egocentric wearable camera, enabling observation of the entire surroundings and departing from the static, image-centric nature of prior datasets. Second, unlike datasets centered on a single challenge, ours addresses multiple real-life obstacles simultaneously through an innovative visual question answering framework. We validate our dataset using various state-of-the-art VideoQA methods and diverse metrics. Results indicate that, while progress has been made, current methods still fall short of the performance required for AI-powered assistive services for visually impaired individuals. Additionally, our evaluation highlights the distinctive characteristics of the proposed dataset, such as the ego-motion in videos captured by 360-degree cameras across varied scenarios.
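The paper does not specify an evaluation API, but a minimal sketch of the exact-match protocol commonly used for such VideoQA benchmarks is shown below; the annotation layout and the `model.answer` call are hypothetical illustrations, not the paper's actual interface.

```python
# A minimal sketch of an exact-match VideoQA evaluation loop. The annotation
# layout (a JSON list of video_path/question/answer dicts) and the
# model.answer() call are hypothetical, not the paper's actual API.
import json

def evaluate(model, annotation_file):
    """Return exact-match accuracy over (video, question, answer) triples."""
    with open(annotation_file) as f:
        samples = json.load(f)

    correct = 0
    for s in samples:
        pred = model.answer(s["video_path"], s["question"])  # hypothetical call
        correct += int(pred.strip().lower() == s["answer"].strip().lower())
    return correct / len(samples)
```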
Related papers
- Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z)
- Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval [85.73149096516543]
We address the choice of viewpoint during sketch creation in Fine-Grained Sketch-Based Image Retrieval (FG-SBIR).
A pilot study highlights the system's struggle when query-sketches differ in viewpoint from target instances.
To address this, we advocate a view-aware system that seamlessly accommodates both view-agnostic and view-specific tasks.
arXiv Detail & Related papers (2024-07-01T21:20:44Z)
- 360+x: A Panoptic Multi-modal Scene Understanding Dataset [13.823967656097146]
To the best of our knowledge, 360+x is the first database that covers multiple viewpoints with multiple data modalities to mimic how daily information is accessed in the real world.
arXiv Detail & Related papers (2024-04-01T08:34:42Z)
- RID-TWIN: An end-to-end pipeline for automatic face de-identification in videos [2.7569134765233536]
RID-Twin is a pipeline that decouples identity from motion to perform automatic face de-identification in videos.
We evaluate the performance of our methodology on the widely employed VoxCeleb2 dataset.
arXiv Detail & Related papers (2024-03-15T06:59:21Z)
- Video Recognition in Portrait Mode [98.3393666122704]
We develop the first dataset dedicated to portrait mode video recognition, namely PortraitMode-400.
We conduct a comprehensive analysis of the impact of video format (portrait mode versus landscape mode) on recognition accuracy and spatial bias due to the different formats.
We design experiments to explore key aspects of portrait mode video recognition, including the choice of data augmentation, evaluation procedure, the importance of temporal information, and the role of audio modality.
arXiv Detail & Related papers (2023-12-21T11:30:02Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
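AE2's learned self-supervised embedding is not reproduced here; as a rough illustration of the underlying temporal-alignment idea, the sketch below aligns two fixed feature sequences with classic dynamic time warping. The function and distance choice are assumptions, not the paper's method.

```python
# Dynamic time warping over two (T, D) feature sequences, shown only to
# illustrate the temporal-alignment step; AE2 itself learns the embedding
# self-supervised rather than using fixed features.
import numpy as np

def dtw_cost(ego, exo):
    """Return the cumulative alignment cost between two embedding sequences."""
    t1, t2 = len(ego), len(exo)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(ego[i - 1] - exo[j - 1])  # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],          # skip an ego frame
                                 cost[i, j - 1],          # skip an exo frame
                                 cost[i - 1, j - 1])      # match both frames
    return cost[t1, t2]
```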
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- Enhancing Egocentric 3D Pose Estimation with Third Person Views [37.9683439632693]
We propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured from a single wearable camera.
We introduce First2Third-Pose, a new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives.
Experimental results demonstrate that the joint multi-view embedded space learned with our dataset is useful to extract discriminatory features from arbitrary single-view egocentric videos.
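The paper's exact training objective is not given in this summary; the sketch below shows one common recipe for learning such a joint first-/third-person embedding space, a symmetric InfoNCE loss over synchronized clip pairs. The loss form and names are illustrative assumptions.

```python
# A symmetric InfoNCE loss over synchronized first-/third-person clip pairs,
# one common way to learn a joint multi-view embedding space. The loss form
# is an illustrative assumption, not the paper's exact objective.
import torch
import torch.nn.functional as F

def paired_view_loss(first_emb, third_emb, temperature=0.07):
    """first_emb, third_emb: (B, D) embeddings of synchronized clip pairs."""
    f = F.normalize(first_emb, dim=1)
    t = F.normalize(third_emb, dim=1)
    logits = f @ t.T / temperature                          # (B, B) similarities
    labels = torch.arange(f.size(0), device=logits.device)  # diagonal = matches
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```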
arXiv Detail & Related papers (2022-01-06T11:42:01Z)
- SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera [97.0162841635425]
We present a solution to egocentric 3D body pose estimation from monocular images captured from downward looking fish-eye cameras installed on the rim of a head mounted VR device.
This unusual viewpoint yields images with a distinctive visual appearance, marked by severe self-occlusions and strong perspective distortions.
We propose an encoder-decoder architecture with a novel multi-branch decoder designed to account for the varying uncertainty in 2D predictions.
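As a rough, hypothetical illustration of an encoder with a multi-branch decoder (layer sizes and branch roles are assumptions, not the paper's architecture):

```python
# A hypothetical encoder with a multi-branch decoder: one branch regresses 3D
# joint coordinates while a second reconstructs 2D heatmaps, whose spread can
# expose 2D uncertainty. Layer sizes are illustrative, not the paper's.
import torch.nn as nn

class MultiBranchPoseNet(nn.Module):
    def __init__(self, num_joints=15):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.pose_branch = nn.Sequential(        # 3D joint coordinates
            nn.Flatten(), nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, num_joints * 3),
        )
        self.heatmap_branch = nn.Sequential(     # one 2D heatmap per joint
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_joints, 1),
        )

    def forward(self, x):
        feats = self.encoder(x)
        return self.pose_branch(feats), self.heatmap_branch(feats)
```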
arXiv Detail & Related papers (2020-11-02T16:18:06Z)
- Perceptual Quality Assessment of Omnidirectional Images as Moving Camera Videos [49.217528156417906]
Two types of VR viewing conditions are crucial in determining the viewing behaviors of users and the perceived quality of the panorama.
We first transform an omnidirectional image to several video representations using different user viewing behaviors under different viewing conditions.
We then leverage advanced 2D full-reference video quality models to compute the perceived quality.
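As a hedged sketch of the first step, the following renders a rectilinear viewport from an equirectangular panorama, the kind of moving-camera frame a viewing scanpath would produce; the projection details and parameters are illustrative assumptions, not the paper's implementation.

```python
# Nearest-neighbor rendering of a rectilinear viewport from an equirectangular
# panorama. Yaw/pitch define the viewing direction; fov_deg and size define
# the virtual pinhole camera. All parameter choices here are illustrative.
import numpy as np

def viewport(pano, yaw, pitch, fov_deg=90.0, size=256):
    """pano: (H, W, 3) equirectangular image; yaw/pitch in radians."""
    H, W, _ = pano.shape
    f = (size / 2) / np.tan(np.radians(fov_deg) / 2)   # pinhole focal length
    u, v = np.meshgrid(np.arange(size) - size / 2,
                       np.arange(size) - size / 2)
    # Rays in camera coordinates, rotated by pitch (x-axis) then yaw (y-axis).
    d = np.stack([u, v, np.full(u.shape, f)], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    d = d @ (Ry @ Rx).T
    lon = np.arctan2(d[..., 0], d[..., 2])             # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))         # latitude in [-pi/2, pi/2]
    x = ((lon / np.pi + 1) / 2 * (W - 1)).astype(int)  # map angles to pixels
    y = ((lat / (np.pi / 2) + 1) / 2 * (H - 1)).astype(int)
    return pano[y, x]
```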
arXiv Detail & Related papers (2020-05-21T10:03:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.