Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras
- URL: http://arxiv.org/abs/2507.17664v1
- Date: Wed, 23 Jul 2025 16:29:52 GMT
- Title: Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras
- Authors: Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, Benoit R. Cottereau,
- Abstract summary: We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. We provide over 30,000 validated referring expressions, each enriched with four grounding attributes. We propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations.
- Score: 6.174442475414146
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes -- appearance, status, relation to viewer, and relation to other objects -- bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
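The abstract's core mechanism, dynamically fusing per-attribute representations through a Mixture of Event-Attribute Experts (MoEE), can be read as a gated mixture over one expert embedding per grounding attribute. The sketch below is a minimal illustration of that general idea; the function names, tensor shapes, and softmax gating are assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of gate logits."""
    e = np.exp(x - x.max())
    return e / e.sum()

def moee_fuse(attr_feats, gate_logits):
    """Hypothetical MoEE-style fusion: one expert embedding per grounding
    attribute (appearance, status, relation-to-viewer, relation-to-others),
    mixed by softmax gate weights (e.g. predicted from the query)."""
    weights = softmax(gate_logits)   # (num_attrs,), sums to 1
    return weights @ attr_feats      # (dim,) fused representation

# Toy example: 4 attribute experts producing 8-dim features.
rng = np.random.default_rng(0)
attr_feats = rng.normal(size=(4, 8))
fused = moee_fuse(attr_feats, np.array([2.0, 0.5, -1.0, 0.0]))
```

With uniform (all-zero) gate logits this reduces to a plain average of the attribute embeddings; the gating is what lets the model emphasize, say, motion status over appearance depending on the expression and scene dynamics.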
Related papers
- Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene [13.70771642812974]
We propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans.
We adapt the power of a large language model (LLM) to digest the contextual complexity within textual input.
We employ a high-level module to deliver scalable yet comprehensive context.
arXiv Detail & Related papers (2025-07-25T12:57:05Z)
- Grounded Gesture Generation: Language, Motion, and Space [3.4973270688542626]
We introduce a multimodal dataset and framework for grounded gesture generation.
We provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format.
Our contribution establishes a foundation for advancing research in situated gesture generation and grounded multimodal interaction.
arXiv Detail & Related papers (2025-07-06T20:19:34Z)
- Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale [41.693908591580175]
We develop vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder.
Our models achieve promising performance on the existing 2D and 3D benchmarks and, notably, exhibit effectiveness in open-vocabulary cross-domain generalization.
arXiv Detail & Related papers (2025-06-13T17:57:18Z)
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
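The text-conditional heatmap decoder in the Affogato entry above can be approximated as per-pixel similarity between backbone features and a text embedding, followed by a sigmoid. This is a common open-vocabulary grounding recipe sketched under assumption; it is not Affogato's actual decoder, and the shapes and names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def text_conditional_heatmap(pixel_feats, text_feat):
    """pixel_feats: (H, W, D) features from a (hypothetical) part-aware
    backbone; text_feat: (D,) embedding of the open-vocabulary affordance
    query. Returns an (H, W) map of per-pixel affordance scores in (0, 1)."""
    logits = pixel_feats @ text_feat   # (H, W) dot-product similarity
    return sigmoid(logits)

heatmap = text_conditional_heatmap(np.zeros((4, 5, 16)), np.ones(16))
```

Because the query is a free-form text embedding rather than a fixed class index, the same decoder serves any affordance vocabulary, which is what makes the cross-domain generalization claim plausible.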
VideoMolmo is a model tailored for fine-grained pointing in video sequences.
A novel temporal mask fusion module employs SAM2 for bidirectional point propagation.
To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description.
We propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage conditioned on this representation.
We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z)
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate the promise of event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- LaSe-E2V: Towards Language-guided Semantic-Aware Event-to-Video Reconstruction [8.163356555241322]
We propose a novel framework, called LaSe-E2V, that can achieve semantic-aware high-quality E2V reconstruction.
We first propose an Event-guided Spatiotemporal Attention (ESA) module to effectively condition the denoising pipeline on the event data.
We then introduce an event-aware mask loss to ensure temporal coherence and a noise strategy to enhance spatial consistency.
arXiv Detail & Related papers (2024-07-08T01:40:32Z)
- Double Mixture: Towards Continual Event Detection from Speech [60.33088725100812]
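The event-aware mask loss in the LaSe-E2V entry above can be understood as up-weighting reconstruction error where events fired, so the output stays temporally coherent exactly where the scene changed. The weighting scheme and the `alpha` parameter below are illustrative assumptions, not LaSe-E2V's exact formulation:

```python
import numpy as np

def event_aware_mask_loss(pred, target, event_mask, alpha=2.0):
    """Hypothetical event-aware reconstruction loss: a weighted MSE in
    which pixels with recent event activity (event_mask == 1) are scaled
    by `alpha`; setting alpha=1 recovers plain MSE over all pixels."""
    weights = 1.0 + (alpha - 1.0) * event_mask
    return float(np.mean(weights * (pred - target) ** 2))

# Toy example: one active-event pixel out of four, unit error everywhere.
loss = event_aware_mask_loss(np.ones((2, 2)), np.zeros((2, 2)),
                             np.array([[1.0, 0.0], [0.0, 0.0]]))
```

The design choice this illustrates: since event cameras only report brightness changes, the event mask is a natural (and cheap) indicator of where reconstruction fidelity matters most in time.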
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- Towards Event Extraction from Speech with Contextual Clues [61.164413398231254]
We introduce the Speech Event Extraction (SpeechEE) task and construct three synthetic training sets and one human-spoken test set.
Compared to event extraction from text, SpeechEE poses greater challenges mainly due to complex speech signals that are continuous and have no word boundaries.
Our method brings significant improvements on all datasets, achieving a maximum F1 gain of 10.7%.
arXiv Detail & Related papers (2024-01-27T11:07:19Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization [67.85434518679382]
We present DynaVol, a 3D scene generative model that unifies geometric structures and object-centric learning.
The key idea is to perform object-centric voxelization to capture the 3D nature of the scene.
Voxel features evolve over time through a canonical-space deformation function, forming the basis for global representation learning.
arXiv Detail & Related papers (2023-04-30T05:29:28Z)
- Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities [43.048896440009784]
We propose the task of extracting event hierarchies from multimodal (video and text) data.
This reveals the structure of events and is critical to understanding them.
We show the limitations of state-of-the-art unimodal and multimodal baselines on this task.
arXiv Detail & Related papers (2022-06-14T23:24:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.