Visual Grounding from Event Cameras
- URL: http://arxiv.org/abs/2509.09584v1
- Date: Thu, 11 Sep 2025 16:21:59 GMT
- Title: Visual Grounding from Event Cameras
- Authors: Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, Benoit R. Cottereau,
- Abstract summary: We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding using event data. Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. We envision Talk2Event as a foundation for advancing multimodal and temporally-aware perception, with applications spanning robotics, human-AI interaction, and so on.
- Score: 26.670030443187482
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Event cameras capture changes in brightness with microsecond precision and remain reliable under motion blur and challenging illumination, offering clear advantages for modeling highly dynamic scenes. Yet, their integration with natural language understanding has received little attention, leaving a gap in multimodal perception. To address this, we introduce Talk2Event, the first large-scale benchmark for language-driven object grounding using event data. Built on real-world driving scenarios, Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. Each expression is enriched with four structured attributes -- appearance, status, relation to the viewer, and relation to surrounding objects -- that explicitly capture spatial, temporal, and relational cues. This attribute-centric design supports interpretable and compositional grounding, enabling analysis that moves beyond simple object recognition to contextual reasoning in dynamic environments. We envision Talk2Event as a foundation for advancing multimodal and temporally-aware perception, with applications spanning robotics, human-AI interaction, and so on.
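The abstract does not specify the release format, but the attribute-centric design can be pictured as a per-expression record that carries the four grounding attributes alongside the object it refers to. The sketch below is a hypothetical Python layout; all class and field names (e.g., GroundingAttributes, relation_to_viewer) are assumptions for illustration, not the actual Talk2Event schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of an attribute-centric annotation record; names and
# conventions are illustrative assumptions, not the released Talk2Event schema.
@dataclass
class GroundingAttributes:
    appearance: str          # e.g. "dark sedan with a roof rack"
    status: str              # e.g. "moving left to right"
    relation_to_viewer: str  # e.g. "directly ahead of the ego vehicle"
    relation_to_others: str  # e.g. "behind the cyclist near the crosswalk"

@dataclass
class ReferringExpression:
    text: str                       # free-form, validated referring expression
    attributes: GroundingAttributes

@dataclass
class AnnotatedObject:
    object_id: int
    category: str                   # e.g. "car", "pedestrian"
    bbox: List[float]               # [x, y, w, h]; assumed convention
    expressions: List[ReferringExpression] = field(default_factory=list)

@dataclass
class Scene:
    scene_id: str
    event_stream_path: str          # reference to the raw event data for this scene
    objects: List[AnnotatedObject] = field(default_factory=list)
```

Because each expression carries its attributes explicitly, evaluation can be sliced per attribute (e.g., how often grounding succeeds when the expression relies mainly on motion status rather than appearance), which is the kind of compositional analysis the benchmark is designed to support.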
Related papers
- Articulated 3D Scene Graphs for Open-World Mobile Manipulation [55.97942733699124]
We present MoMa-SG, a framework for building semantic-kinematic 3D scene graphs of articulated scenes. We estimate articulation models using a novel unified twist estimation formulation. We also introduce the novel Arti4D-Semantic dataset.
arXiv Detail & Related papers (2026-02-18T10:40:35Z) - While recognizing actions, LMMs struggle to detect core interaction events [18.828641379675243]
We introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. A total of 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached. We show that although the models can reliably name the target objects, identify the action, and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends.
arXiv Detail & Related papers (2025-11-25T10:38:41Z) - Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene [13.70771642812974]
We propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. We adapt the power of a large language model (LLM) to digest the contextual complexity within textual input. We employ a high-level module to deliver scalable yet comprehensive context.
arXiv Detail & Related papers (2025-07-25T12:57:05Z) - Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras [6.174442475414146]
We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. We provide over 30,000 validated referring expressions, each enriched with four grounding attributes. We propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations (a minimal fusion sketch follows this list).
arXiv Detail & Related papers (2025-07-23T16:29:52Z) - Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale [41.693908591580175]
We develop vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models achieve promising performance on the existing 2D and 3D benchmarks and, notably, exhibit effectiveness in open-vocabulary cross-domain generalization.
arXiv Detail & Related papers (2025-06-13T17:57:18Z) - Object-Centric Image to Video Generation with Language Guidance [17.50161162624179]
TextOCVP is an object-centric model for image-to-video generation guided by textual descriptions. Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions.
arXiv Detail & Related papers (2025-02-17T10:46:47Z) - Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. We propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z) - StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images [5.529078451095096]
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision.
Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics.
Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks.
arXiv Detail & Related papers (2024-06-19T17:59:40Z) - Generating Human Motion in 3D Scenes from Text Descriptions [60.04976442328767]
This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions.
We propose a new approach that decomposes the complex problem into two more manageable sub-problems.
For language grounding of the target object, we leverage the power of large language models; for motion generation, we design an object-centric scene representation.
arXiv Detail & Related papers (2024-05-13T14:30:12Z) - Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z) - Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z) - Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z) - Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
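As an illustration of what the attribute-aware fusion mentioned in the Talk2Event/EventRefer entry above could look like, here is a minimal PyTorch sketch that gates four per-attribute embeddings with weights predicted from the expression embedding. The module, its dimensions, and the gating scheme are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of "dynamically fusing multi-attribute representations".
# Hypothetical module: one softmax gate per grounding attribute, conditioned
# on the sentence-level expression embedding.
class AttributeGatedFusion(nn.Module):
    def __init__(self, dim: int, num_attributes: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_attributes)

    def forward(self, expr_emb: torch.Tensor, attr_embs: torch.Tensor) -> torch.Tensor:
        # expr_emb:  (batch, dim)            expression embedding
        # attr_embs: (batch, num_attr, dim)  one embedding per grounding attribute
        weights = torch.softmax(self.gate(expr_emb), dim=-1)    # (batch, num_attr)
        fused = (weights.unsqueeze(-1) * attr_embs).sum(dim=1)  # (batch, dim)
        return fused

# Example usage with random features.
fusion = AttributeGatedFusion(dim=256)
expr = torch.randn(2, 256)
attrs = torch.randn(2, 4, 256)
out = fusion(expr, attrs)  # -> shape (2, 256)
```

A softmax gate keeps the fused vector a convex combination of the attribute embeddings, so the learned weights can be read directly as per-attribute importance for a given expression.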
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all generated summaries) and is not responsible for any consequences arising from its use.