Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal
Attention
- URL: http://arxiv.org/abs/2109.02955v1
- Date: Tue, 7 Sep 2021 09:22:09 GMT
- Title: Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal
Attention
- Authors: Katsuyuki Nakamura, Hiroki Ohashi, Mitsuhiro Okada
- Abstract summary: We propose a new task of sensor-augmented egocentric-video captioning.
We use wearable-sensor data as auxiliary information to mitigate the inherent problems in egocentric vision.
- Score: 0.9668407688201357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatically describing video, or video captioning, has been widely studied
in the multimedia field. This paper proposes a new task of sensor-augmented
egocentric-video captioning, a newly constructed dataset for it called MMAC
Captions, and a method for this task that effectively utilizes multi-modal data
from video and motion sensors, or inertial measurement units (IMUs). While
conventional video captioning tasks have difficulty dealing
with detailed descriptions of human activities due to the limited view of a
fixed camera, egocentric vision has greater potential for generating
finer-grained descriptions of human activities thanks to its much closer view.
In addition, we utilize wearable-sensor data as auxiliary
information to mitigate the inherent problems in egocentric vision: motion
blur, self-occlusion, and out-of-camera-range activities. We propose a method
for effectively utilizing the sensor data in combination with the video data on
the basis of an attention mechanism that dynamically determines the modality
that requires more attention, taking the contextual information into account.
We compared the proposed sensor-fusion method with strong baselines on the MMAC
Captions dataset and found that using sensor data as supplementary information
to the egocentric-video data was beneficial and that the proposed method
outperformed those baselines, demonstrating its effectiveness.
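The core of the proposed method is an attention mechanism that weights the video and sensor modalities according to the current context. The sketch below illustrates this general idea in PyTorch; the module name, feature dimensions, and scoring network are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of context-conditioned modality weighting for video + IMU
# fusion. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class DynamicModalAttention(nn.Module):
    """Weights video and sensor (IMU) features conditioned on decoder context."""

    def __init__(self, video_dim, sensor_dim, context_dim, fused_dim):
        super().__init__()
        # Project both modalities into a common space before fusion.
        self.video_proj = nn.Linear(video_dim, fused_dim)
        self.sensor_proj = nn.Linear(sensor_dim, fused_dim)
        # Produces one scalar score per modality from (projected feature, context).
        self.score = nn.Sequential(
            nn.Linear(fused_dim + context_dim, fused_dim),
            nn.Tanh(),
            nn.Linear(fused_dim, 1),
        )

    def forward(self, video_feat, sensor_feat, context):
        # video_feat: (B, video_dim), sensor_feat: (B, sensor_dim), context: (B, context_dim)
        v = self.video_proj(video_feat)            # (B, fused_dim)
        s = self.sensor_proj(sensor_feat)          # (B, fused_dim)
        scores = torch.cat(
            [self.score(torch.cat([v, context], dim=-1)),
             self.score(torch.cat([s, context], dim=-1))],
            dim=-1,
        )                                          # (B, 2)
        weights = torch.softmax(scores, dim=-1)    # per-modality attention weights
        fused = weights[:, 0:1] * v + weights[:, 1:2] * s
        return fused, weights


if __name__ == "__main__":
    # Usage: fuse per-time-step features before feeding a caption decoder.
    fusion = DynamicModalAttention(video_dim=2048, sensor_dim=128, context_dim=512, fused_dim=512)
    video = torch.randn(4, 2048)   # e.g., CNN features of an egocentric clip
    imu = torch.randn(4, 128)      # e.g., encoded accelerometer/gyroscope signals
    ctx = torch.randn(4, 512)      # e.g., hidden state of the caption decoder
    fused, w = fusion(video, imu, ctx)
    print(fused.shape, w.shape)    # torch.Size([4, 512]) torch.Size([4, 2])
```

Conditioning the modality weights on the decoder context allows the model to lean on IMU features when the visual stream is degraded (e.g., motion blur, self-occlusion, or out-of-camera-range activity) and on video features otherwise, which is the motivation stated in the abstract.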
Related papers
- E-Motion: Future Motion Simulation via Event Sequence Diffusion [86.80533612211502]
Event-based sensors may offer a unique opportunity to predict future motion with a level of detail and precision previously unachievable.
We propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion simulation framework.
Our findings suggest a promising direction for future research in enhancing the interpretative power and predictive accuracy of computer vision systems.
arXiv Detail & Related papers (2024-10-11T09:19:23Z)
- Motion Capture from Inertial and Vision Sensors [60.5190090684795]
MINIONS is a large-scale Motion capture dataset collected from INertial and visION Sensors.
We conduct experiments on multi-modal motion capture using a monocular camera and very few IMUs.
arXiv Detail & Related papers (2024-07-23T09:41:10Z)
- I-MPN: Inductive Message Passing Network for Efficient Human-in-the-Loop Annotation of Mobile Eye Tracking Data [4.487146086221174]
We present a novel human-centered learning algorithm designed for automated object recognition within mobile eye-tracking settings.
Our approach seamlessly integrates an object detector with a spatial relation-aware inductive message-passing network (I-MPN), harnessing node profile information and capturing object correlations.
arXiv Detail & Related papers (2024-06-10T13:08:31Z)
- Object Aware Egocentric Online Action Detection [23.504280692701272]
We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
Our work can be seamlessly integrated into existing models with minimal overhead, bringing consistent performance enhancements.
arXiv Detail & Related papers (2024-06-03T07:58:40Z)
- IndGIC: Supervised Action Recognition under Low Illumination [0.0]
We propose an action recognition method using a deep multi-input network.
Ind-GIC is proposed to enhance poorly illuminated video, generating one gamma value per frame to improve enhancement performance.
Experimental results show that our model achieves high accuracy on the ARID dataset.
arXiv Detail & Related papers (2023-08-29T14:41:10Z)
- EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [90.9111678470214]
We propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features.
Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models.
We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
arXiv Detail & Related papers (2023-01-05T18:39:23Z)
- You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos [19.711703590063976]
We propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at a multi-granularity level.
Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework.
arXiv Detail & Related papers (2022-05-25T16:15:46Z)
- Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
- Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos).
The SAKDN uses multiple wearable-sensors as teacher modalities and uses RGB videos as student modality.
arXiv Detail & Related papers (2020-09-01T03:38:31Z)
- Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition [19.93779132095822]
We argue that learning jointly intertwined features from these two information channels is beneficial.
We propose a single stream architecture able to do so, thanks to the addition of a self-supervised motion prediction block.
Experiments on several publicly available databases show the power of our approach.
arXiv Detail & Related papers (2020-02-10T17:51:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.