Object-Shot Enhanced Grounding Network for Egocentric Video
- URL: http://arxiv.org/abs/2505.04270v1
- Date: Wed, 07 May 2025 09:20:12 GMT
- Title: Object-Shot Enhanced Grounding Network for Egocentric Video
- Authors: Yisen Feng, Haoyu Zhang, Meng Liu, Weili Guan, Liqiang Nie
- Abstract summary: We propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation. We analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information.
- Score: 60.97916755629796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information, which enhances the model's ability to perform modality alignment. Experiments conducted on three datasets demonstrate that OSGNet achieves state-of-the-art performance, validating the effectiveness of our approach. Our code can be found at https://github.com/Yisen-Feng/OSGNet.
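As a rough illustration of the object-enhancement idea described in the abstract, the sketch below fuses detected-object features into frame-level video features with cross-attention. It is a minimal PyTorch sketch under assumed module names and dimensions, not OSGNet's actual architecture; the authors' implementation is in the repository linked above.

```python
# Illustrative sketch only (PyTorch): fusing detected-object features into
# frame-level video features with cross-attention. Module names and dimensions
# are assumptions; OSGNet's actual design is in the authors' repository.
import torch
import torch.nn as nn

class ObjectEnhancedEncoder(nn.Module):
    """Enriches clip features with object features, so that objects mentioned
    in the query but weakly represented in the clip features get emphasized."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip_feats: torch.Tensor, obj_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (B, T, D) frame-level video features
        # obj_feats:  (B, N, D) per-clip object features (e.g., from a frozen detector)
        enhanced, _ = self.cross_attn(query=clip_feats, key=obj_feats, value=obj_feats)
        return self.norm(clip_feats + enhanced)  # residual fusion

# Dummy usage
enc = ObjectEnhancedEncoder()
video = torch.randn(2, 64, 512)    # 2 clips, 64 frames each
objects = torch.randn(2, 20, 512)  # 20 object proposals per clip
fused = enc(video, objects)        # (2, 64, 512)
```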
Related papers
- Fine-grained Spatiotemporal Grounding on Egocentric Videos [13.319346673043286]
We introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. EgoMask is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks. We also create EgoMask-Train, a large-scale training dataset to facilitate model development.
arXiv Detail & Related papers (2025-08-01T10:53:27Z)
- Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z)
- Object Aware Egocentric Online Action Detection [23.504280692701272]
We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
Our work can be seamlessly integrated into existing models with minimal overhead and bring consistent performance enhancements.
arXiv Detail & Related papers (2024-06-03T07:58:40Z)
- Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions? [48.702973928321946]
Egocentric video-language pretraining is a crucial step in advancing the understanding of hand-object interactions in first-person scenarios. Despite successes on existing testbeds, we find that current EgoVLMs can be easily misled by simple modifications. This raises the question: Do EgoVLMs truly understand hand-object interactions?
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
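The following is a generic, illustrative InfoNCE-style sketch of the alignment idea summarized above, in which egocentric and exocentric clip features are both pulled toward the text feature describing the same action; it is not the paper's exact EgoExoNCE loss, and all names and shapes are assumptions.

```python
# Generic, illustrative InfoNCE-style sketch of the idea summarized above:
# egocentric and exocentric clip features are both pulled toward the text
# feature describing the same action. Not the paper's exact EgoExoNCE loss.
import torch
import torch.nn.functional as F

def shared_text_nce(ego: torch.Tensor, exo: torch.Tensor, text: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    # ego, exo, text: (B, D); row i of each corresponds to the same action
    ego, exo, text = (F.normalize(x, dim=-1) for x in (ego, exo, text))
    labels = torch.arange(ego.size(0), device=ego.device)
    loss = ego.new_zeros(())
    for view in (ego, exo):                      # align both views to the shared text
        logits = view @ text.t() / temperature   # (B, B) cosine-similarity logits
        loss = loss + F.cross_entropy(logits, labels)
    return loss / 2
```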
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- Fine-Grained Egocentric Hand-Object Segmentation: Dataset, Model, and Applications [20.571026014771828]
We provide a labeled dataset consisting of 11,243 egocentric images with per-pixel segmentation labels of hands and objects being interacted with.
Our dataset is the first to label detailed hand-object contact boundaries.
We show that our robust hand-object segmentation model and dataset can serve as a foundational tool to boost or enable several downstream vision applications.
arXiv Detail & Related papers (2022-08-07T21:43:40Z)
- Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representation to advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z)
- Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties.
Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
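For readers unfamiliar with the attention mechanisms mentioned in the VAANet summary above, here is a minimal, hypothetical sketch of temporal attention pooling over per-segment CNN features; it is not the paper's implementation, and all names and dimensions are assumptions.

```python
# Hypothetical sketch of temporal attention pooling, the kind of mechanism the
# summary above attributes to VAANet's CNN branches; not the paper's code.
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Scores each temporal segment and returns an attention-weighted sum."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) per-segment CNN features
        weights = torch.softmax(self.score(feats), dim=1)  # (B, T, 1)
        return (weights * feats).sum(dim=1)                # (B, D) clip-level feature
```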