OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies
- URL: http://arxiv.org/abs/2405.05259v1
- Date: Wed, 8 May 2024 17:59:58 GMT
- Title: OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies
- Authors: Lingdong Kong, Youquan Liu, Lai Xing Ng, Benoit R. Cottereau, Wei Tsang Ooi
- Abstract summary: Event-based semantic segmentation (ESS) is a fundamental yet challenging task for event camera sensing.
We synergize information from image, text, and event-data domains and introduce OpenESS to enable scalable ESS.
We achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic benchmarks without using either event or frame labels.
- Score: 4.940059438666211
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Event-based semantic segmentation (ESS) is a fundamental yet challenging task for event camera sensing. The difficulties in interpreting and annotating event data limit its scalability. While domain adaptation from images to event data can help mitigate this issue, representational differences between the two data domains require additional effort to resolve. In this work, for the first time, we synergize information from image, text, and event-data domains and introduce OpenESS to enable scalable ESS in an open-world, annotation-efficient manner. We achieve this goal by transferring the semantically rich CLIP knowledge from image-text pairs to event streams. To pursue better cross-modality adaptation, we propose a frame-to-event contrastive distillation and a text-to-event semantic consistency regularization. Experimental results on popular ESS benchmarks show that our approach outperforms existing methods. Notably, we achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic without using either event or frame labels.
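The abstract names two cross-modality objectives, frame-to-event contrastive distillation and text-to-event semantic consistency regularization, without spelling them out. Below is a minimal, hedged sketch of how such objectives are commonly instantiated; all function and variable names (event_feats, frame_feats, event_logits, text_logits, temperature) are illustrative assumptions, not the authors' code.

```python
# A hedged sketch (not the authors' released code) of the two objectives
# named in the abstract; names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F


def frame_to_event_distillation(event_feats: torch.Tensor,
                                frame_feats: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Contrastive (InfoNCE-style) distillation between a trainable event
    encoder and a frozen CLIP image encoder.

    event_feats: (B, D) embeddings of event representations.
    frame_feats: (B, D) embeddings of the temporally paired frames.
    """
    event_feats = F.normalize(event_feats, dim=-1)
    frame_feats = F.normalize(frame_feats, dim=-1)
    logits = event_feats @ frame_feats.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Paired frame/event embeddings are positives; every other pair in the
    # batch acts as a negative. The loss is symmetrized over both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def text_to_event_consistency(event_logits: torch.Tensor,
                              text_logits: torch.Tensor) -> torch.Tensor:
    """One plausible instantiation of a text-to-event consistency term:
    a KL divergence pulling class scores predicted from event features
    toward zero-shot scores derived from CLIP text embeddings.

    Both tensors have shape (B, C) or (B, C, H, W); dim=1 indexes classes.
    """
    return F.kl_div(F.log_softmax(event_logits, dim=1),
                    F.softmax(text_logits, dim=1),
                    reduction="batchmean")
```

In practice the two terms would be summed with weighting coefficients alongside any available supervision; the exact losses and weights used by OpenESS may differ.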
Related papers
- OVOSE: Open-Vocabulary Semantic Segmentation in Event-Based Cameras [18.07403094754705]
We introduce OVOSE, the first Open-Vocabulary Semantic Segmentation algorithm for Event cameras.
We evaluate OVOSE on two driving semantic segmentation datasets, DDD17 and DSEC-Semantic.
OVOSE demonstrates superior performance, showcasing its potential for real-world applications.
arXiv Detail & Related papers (2024-08-18T09:56:32Z)
- CEIA: CLIP-Based Event-Image Alignment for Open-World Event-Based Understanding [52.67839570524888]
We present CEIA, an effective framework for open-world event-based understanding.
We leverage the rich event-image datasets to learn an event embedding space aligned with the image space of CLIP (a generic sketch of this alignment-and-zero-shot recipe follows the related-papers list below).
CEIA offers two distinct advantages. First, it allows us to take full advantage of the existing event-image datasets to make up for the shortage of large-scale event-text datasets.
arXiv Detail & Related papers (2024-07-09T07:26:15Z)
- Towards Event Extraction from Speech with Contextual Clues [61.164413398231254]
We introduce the Speech Event Extraction (SpeechEE) task and construct three synthetic training sets and one human-spoken test set.
Compared to event extraction from text, SpeechEE poses greater challenges mainly due to complex speech signals that are continuous and have no word boundaries.
Our method brings significant improvements on all datasets, achieving a maximum F1 gain of 10.7%.
arXiv Detail & Related papers (2024-01-27T11:07:19Z)
- EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding [7.797154022794006]
EventBind is a novel framework that unleashes the potential of vision-language models (VLMs) for event-based recognition.
We first introduce a novel event encoder that subtly models the temporal information from events.
We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind's generalization ability.
arXiv Detail & Related papers (2023-08-06T15:05:42Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- ESS: Learning Event-based Semantic Segmentation from Still Images [48.37422967330683]
Event-based semantic segmentation is still in its infancy due to the novelty of the sensor and the lack of high-quality, labeled datasets.
We introduce ESS, which transfers the semantic segmentation task from existing labeled image datasets to unlabeled events via unsupervised domain adaptation (UDA).
To spur further research in event-based semantic segmentation, we introduce DSEC-Semantic, the first large-scale event-based dataset with fine-grained labels.
arXiv Detail & Related papers (2022-03-18T15:30:01Z)
- CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework to enforce vision-language pretraining models to comprehend events and their argument structures.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
arXiv Detail & Related papers (2022-01-13T17:03:57Z)
- Learning Constraints and Descriptive Segmentation for Subevent Detection [74.48201657623218]
We propose an approach to learning and enforcing constraints that capture dependencies between subevent detection and EventSeg prediction.
We adopt Rectifier Networks for constraint learning and then convert the learned constraints to a regularization term in the loss function of the neural model.
arXiv Detail & Related papers (2021-09-13T20:50:37Z)
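Several of the entries above (OVOSE, CEIA, EventBind, and OpenESS itself) rely on the same open-vocabulary recipe: align an event encoder with CLIP's image space, then classify event features against CLIP text embeddings of class-name prompts. Below is a minimal sketch of that recipe, assuming such an aligned encoder already exists; the prompt template, class list, and helper names are illustrative, and the clip calls follow the public OpenAI CLIP package.

```python
# Hedged illustration of the shared open-vocabulary recipe, not any single
# paper's implementation. It assumes an event encoder already aligned with
# CLIP's image space (e.g. via the distillation sketch earlier on this page).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Illustrative driving-scene classes; real benchmarks define their own sets.
class_names = ["road", "sidewalk", "building", "vegetation", "vehicle", "person"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(prompts).float()          # (C, D)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)


def zero_shot_labels(event_feats: torch.Tensor) -> torch.Tensor:
    """event_feats: (N, D) embeddings from a CLIP-aligned event encoder.
    Returns the index of the best-matching class prompt for each embedding,
    so new categories can be added simply by extending class_names."""
    feats = event_feats / event_feats.norm(dim=-1, keepdim=True)
    similarity = feats @ text_emb.t()                       # (N, C) cosine scores
    return similarity.argmax(dim=-1)
```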
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.