Exploring The Missing Semantics In Event Modality
- URL: http://arxiv.org/abs/2510.17347v1
- Date: Mon, 20 Oct 2025 09:45:13 GMT
- Title: Exploring The Missing Semantics In Event Modality
- Authors: Jingqian Wu, Shengpeng Xu, Yunbo Jia, Edmund Y. Lam
- Abstract summary: Event cameras offer distinct advantages such as low latency, high dynamic range, and efficient motion capture. Event-to-video reconstruction (E2V) remains challenging, particularly for reconstructing and recovering semantic information. We propose Semantic-E2VID, an E2V framework that explores the missing visual semantic knowledge in event modality.
- Score: 15.06471990384093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event cameras offer distinct advantages such as low latency, high dynamic range, and efficient motion capture. However, event-to-video reconstruction (E2V), a fundamental event-based vision task, remains challenging, particularly for reconstructing and recovering semantic information. This is primarily due to the nature of the event camera, as it only captures intensity changes, ignoring static objects and backgrounds, resulting in a lack of semantic information in the captured event modality. Further, semantic information plays a crucial role in video and frame reconstruction, yet it is often overlooked by existing E2V approaches. To bridge this gap, we propose Semantic-E2VID, an E2V framework that explores the missing visual semantic knowledge in event modality and leverages it to enhance event-to-video reconstruction. Specifically, Semantic-E2VID introduces a cross-modal feature alignment (CFA) module to transfer the robust visual semantics from a frame-based vision foundation model, the Segment Anything Model (SAM), to the event encoder, while aligning the high-level features from the distinct modalities. To better utilize the learned semantic features, we further propose a semantic-aware feature fusion (SFF) block to integrate the learned semantics in frame modality to form event representations with rich semantics that can be decoded by the event decoder. Further, to facilitate the reconstruction of semantic information, we propose a novel Semantic Perceptual E2V Supervision that helps the model reconstruct semantic details by leveraging SAM-generated categorical labels. Extensive experiments demonstrate that Semantic-E2VID significantly enhances frame quality, outperforming state-of-the-art E2V methods across multiple benchmarks. The sample code is included in the supplementary material.
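As a rough illustration of the cross-modal feature alignment idea described in the abstract, the sketch below projects event-encoder features into the SAM feature space and pulls them toward frozen SAM features with a cosine alignment loss. The module name, feature dimensions, and the specific loss are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of a cross-modal feature alignment (CFA) style loss.
# Names, shapes, and the cosine loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFeatureAlignment(nn.Module):
    """Project event-encoder features and align them with frozen SAM features."""

    def __init__(self, event_dim: int = 512, sam_dim: int = 256):
        super().__init__()
        # 1x1 projection so event features live in the same space as SAM features.
        self.proj = nn.Conv2d(event_dim, sam_dim, kernel_size=1)

    def forward(self, event_feat: torch.Tensor, sam_feat: torch.Tensor) -> torch.Tensor:
        """event_feat: (B, C_e, H, W); sam_feat: (B, C_s, H, W) from a frozen SAM image encoder."""
        aligned = self.proj(event_feat)
        # Cosine alignment loss: pull per-pixel event features toward SAM features.
        return 1.0 - F.cosine_similarity(aligned, sam_feat.detach(), dim=1).mean()

if __name__ == "__main__":
    cfa = CrossModalFeatureAlignment()
    event_feat = torch.randn(2, 512, 64, 64)   # high-level features from the event encoder
    sam_feat = torch.randn(2, 256, 64, 64)     # features from the frozen SAM image encoder
    print(cfa(event_feat, sam_feat))           # scalar alignment loss
```

Detaching the SAM features keeps the foundation model frozen, so only the event encoder and the projection receive gradients.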
Related papers
- Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction [33.79474114703357]
We propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI).
Our model consists of two core components: Cross-modal Frame Aggregation and Context-aware Feature Enhancement.
Experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.
arXiv Detail & Related papers (2025-11-13T09:48:12Z)
- Multi-Level LVLM Guidance for Untrimmed Video Action Recognition [0.0]
This paper introduces the Event-Temporalized Video Transformer (ECVT), a novel architecture that bridges the gap between low-level visual features and high-level semantic information.
Experiments on ActivityNet v1.3 and THUMOS14 demonstrate that ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14.
arXiv Detail & Related papers (2025-08-24T16:45:21Z)
- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction [65.15449703659772]
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames.
We propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations.
SeC achieves an 11.8-point improvement over SAM on the SeCVOS benchmark, establishing a new state of the art in concept-aware video object segmentation.
arXiv Detail & Related papers (2025-07-21T17:59:02Z)
- Learning Event Completeness for Weakly Supervised Video Anomaly Detection [5.140169437190526]
We present Learning Event Completeness for Weakly Supervised Video Anomaly Detection (LEC-VAD), a novel framework that encodes both category-aware and category-agnostic semantics between vision and language.
We develop a novel memory bank-based prototype learning mechanism to enrich the concise text descriptions associated with anomaly-event categories.
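Purely as an illustration of what a memory bank-based prototype mechanism for anomaly-event categories might look like, here is a small sketch; the EMA update rule, sizes, and names below are assumptions, not LEC-VAD's actual design.

```python
# Toy per-category prototype memory bank (illustrative only).
import torch

class PrototypeMemoryBank:
    """Keeps one prototype embedding per anomaly-event category."""

    def __init__(self, num_categories: int, dim: int, momentum: float = 0.9):
        self.prototypes = torch.zeros(num_categories, dim)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, category: int, embedding: torch.Tensor) -> None:
        # Exponential moving average keeps prototypes stable as new
        # text/visual embeddings for this category arrive.
        proto = self.prototypes[category]
        self.prototypes[category] = self.momentum * proto + (1 - self.momentum) * embedding

    def enrich(self, category: int, text_embedding: torch.Tensor) -> torch.Tensor:
        # A concise text description is enriched by mixing in the stored prototype.
        return 0.5 * (text_embedding + self.prototypes[category])
```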
arXiv Detail & Related papers (2025-06-16T04:56:58Z)
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description.
We propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation.
We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z)
- LaSe-E2V: Towards Language-guided Semantic-Aware Event-to-Video Reconstruction [8.163356555241322]
We propose a novel framework, LaSe-E2V, that can achieve semantic-aware, high-quality E2V reconstruction.
We first propose an Event-guided Spatiotemporal Attention (ESA) module to effectively condition the denoising pipeline on the event data.
We then introduce an event-aware mask loss to ensure temporal coherence and a noise strategy to enhance spatial consistency.
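A minimal sketch of an event-aware masked reconstruction loss in the spirit of this description is given below; how the mask is derived from the event voxel grid and the weighting factor are assumptions, not the paper's exact formulation.

```python
# Rough sketch of an event-aware masked loss (illustrative assumptions).
import torch
import torch.nn.functional as F

def event_aware_mask_loss(pred: torch.Tensor,
                          target: torch.Tensor,
                          event_voxel: torch.Tensor,
                          weight: float = 2.0) -> torch.Tensor:
    """pred/target: (B, C, H, W) frames; event_voxel: (B, T, H, W) event counts."""
    # Pixels that received events recently carry motion information.
    mask = (event_voxel.abs().sum(dim=1, keepdim=True) > 0).float()  # (B, 1, H, W)
    per_pixel = F.l1_loss(pred, target, reduction="none")
    # Up-weight event pixels so moving regions stay temporally coherent.
    weighted = per_pixel * (1.0 + (weight - 1.0) * mask)
    return weighted.mean()
```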
arXiv Detail & Related papers (2024-07-08T01:40:32Z)
- Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
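As a toy illustration of using events rather than individual frames as retrieval units, the sketch below groups contiguous, visually similar frames into event spans and scores a video by its best-matching event; the grouping threshold and mean pooling are illustrative assumptions, not EventFormer's actual architecture.

```python
# Toy event-level retrieval scoring (illustrative assumptions throughout).
import torch
import torch.nn.functional as F

def group_frames_into_events(frame_feats: torch.Tensor, threshold: float = 0.8):
    """frame_feats: (N, D) per-frame embeddings -> list of (start, end) event spans."""
    events, start = [], 0
    for i in range(1, frame_feats.size(0)):
        sim = F.cosine_similarity(frame_feats[i - 1], frame_feats[i], dim=0)
        if sim < threshold:            # content changed: close the current event
            events.append((start, i))
            start = i
    events.append((start, frame_feats.size(0)))
    return events

def rank_video(query_feat: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
    """Score a video by its best-matching event against the query embedding."""
    spans = group_frames_into_events(frame_feats)
    event_feats = torch.stack([frame_feats[s:e].mean(dim=0) for s, e in spans])
    return F.cosine_similarity(query_feat.unsqueeze(0), event_feats, dim=1).max()
```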
arXiv Detail & Related papers (2024-02-21T06:55:20Z)
- E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras or dynamic vision sensors are capable of capturing per-pixel brightness changes (called event-streams) with high temporal resolution and high dynamic range.
This calls for events-to-video (E2V) solutions that take event-streams as input and generate high-quality video frames for intuitive visualization.
We propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z)
- Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the audio-visual event (AVE) localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z)
- In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z)
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
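A loose sketch of what simultaneous (full-duplex) cross-modal feature passing between an appearance stream and a motion stream could look like follows; the gating design below is an assumption, not FSNet's published architecture.

```python
# Illustrative simultaneous bidirectional feature exchange between two streams.
import torch
import torch.nn as nn

class FullDuplexExchange(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.to_motion = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_appearance = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor):
        # Both directions are computed from the *input* features, so transmission
        # and receiving happen simultaneously rather than sequentially.
        a2m = torch.sigmoid(self.to_motion(appearance))
        m2a = torch.sigmoid(self.to_appearance(motion))
        return appearance + appearance * m2a, motion + motion * a2m
```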
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.