Superevents: Towards Native Semantic Segmentation for Event-based
Cameras
- URL: http://arxiv.org/abs/2105.06091v1
- Date: Thu, 13 May 2021 05:49:41 GMT
- Title: Superevents: Towards Native Semantic Segmentation for Event-based
Cameras
- Authors: Weng Fei Low, Ankit Sonthalia, Zhi Gao, André van Schaik, Bharath
Ramesh
- Abstract summary: Most successful computer vision models transform low-level features, such as Gabor filter responses, into richer representations of intermediate or mid-level complexity for downstream visual tasks.
We present a novel method that employs lifetime augmentation for obtaining an event stream representation that is fed to a fully convolutional network to extract superevents.
- Score: 13.099264910430986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most successful computer vision models transform low-level features, such as
Gabor filter responses, into richer representations of intermediate or
mid-level complexity for downstream visual tasks. These mid-level
representations have not been explored for event cameras, although they are
especially relevant to the visually sparse and often disjoint spatial
information in the event stream. By making use of locally consistent
intermediate representations, termed superevents, numerous visual tasks
ranging from semantic segmentation and visual tracking to depth estimation
stand to benefit. In essence, superevents are perceptually consistent local
units that delineate parts of an object in a scene. Inspired by recent deep
learning architectures, we present a novel method that employs lifetime
augmentation for obtaining an event stream representation that is fed to a
fully convolutional network to extract superevents. Our qualitative and
quantitative experimental results on several sequences of a benchmark dataset
highlight the significant potential for event-based downstream applications.
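The abstract names the pipeline but not its details; below is a minimal sketch of the input side under stated assumptions: events arrive as (x, y, t, polarity) tuples, and a per-event lifetime estimate is already available (for example, from local optical flow, as in prior work on event lifetimes). All function and parameter names are illustrative, not taken from the paper.

```python
import numpy as np

def lifetime_augmented_frame(events, lifetimes, t_ref, height, width):
    """Render events into a 2D frame, keeping each event 'alive' for its
    estimated lifetime so that moving edges appear as continuous strokes
    rather than sparse dots. events: (N, 4) array of (x, y, t, polarity);
    lifetimes: (N,) array of per-event durations in seconds."""
    frame = np.zeros((height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # An event contributes at time t_ref while t <= t_ref < t + lifetime.
    alive = (t <= t_ref) & (t_ref < t + lifetimes)
    # Accumulate signed polarity (+1 / -1) at the surviving event locations.
    np.add.at(frame, (y[alive], x[alive]), np.where(p[alive] > 0, 1.0, -1.0))
    return frame
```

Frames rendered this way can be stacked over several reference times and passed to a fully convolutional network whose per-pixel embeddings are grouped into superevents, in the same spirit as superpixels over intensity images.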
Related papers
- Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras [45.063747874243276]
We present EV-WSSS: a novel weakly supervised approach for event-based semantic segmentation.
The proposed framework performs asymmetric dual-student learning between 1) the original forward event data and 2) the longer reversed event data.
We show that the proposed method achieves substantial segmentation results even without relying on pixel-level dense ground truths.
arXiv Detail & Related papers (2024-07-15T20:00:50Z)
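As a concrete illustration of the reversed branch, here is a hedged sketch of time-reversing an event stream, assuming the usual convention that playing events backwards flips both timestamps and polarities; names are illustrative, not from the paper.

```python
import numpy as np

def reverse_events(events, t_start, t_end):
    """Time-reverse an event stream over the window [t_start, t_end].
    events: (N, 4) array of (x, y, t, polarity) with polarity in {-1, +1}.
    Played backwards, a brightness increase becomes a decrease, so the
    polarity is flipped along with the timestamp."""
    x, y, t, p = events.T
    reversed_events = np.stack([x, y, t_end - (t - t_start), -p], axis=1)
    # Restore chronological order after the reversal.
    return reversed_events[np.argsort(reversed_events[:, 2])]
```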
- Event-to-Video Conversion for Overhead Object Detection [7.744259147081667]
The sparse, asynchronous output of event cameras complicates downstream image processing, especially for complex tasks such as object detection.
We show that there is a significant gap in performance between dense event representations and corresponding RGB frames.
We apply event-to-video conversion models that convert event streams into gray-scale video to close this gap.
arXiv Detail & Related papers (2024-02-09T22:07:39Z)
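The paper's conversion models are learned; as a point of reference, the sketch below shows the crude non-learned baseline they improve on, accumulating signed event polarities into fixed-rate grayscale frames. The frame rate, contrast gain, and normalization are assumptions.

```python
import numpy as np

def events_to_frames(events, height, width, fps, t0, t1):
    """Bin events into fixed-rate grayscale frames by accumulating signed
    polarities per pixel around mid-gray. A learned event-to-video model
    replaces this crude accumulation with far higher-quality frames."""
    n_frames = int(np.ceil((t1 - t0) * fps))
    frames = np.full((n_frames, height, width), 128.0, dtype=np.float32)
    idx = np.clip(((events[:, 2] - t0) * fps).astype(int), 0, n_frames - 1)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    polarity = np.where(events[:, 3] > 0, 1.0, -1.0)
    np.add.at(frames, (idx, y, x), 10.0 * polarity)  # contrast gain is an assumption
    return np.clip(frames, 0, 255).astype(np.uint8)
```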
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- SpikeMOT: Event-based Multi-Object Tracking with Sparse Motion Features [52.213656737672935]
SpikeMOT is an event-based multi-object tracker that uses spiking neural networks to extract sparse spatiotemporal features from the event streams associated with objects.
arXiv Detail & Related papers (2023-09-29T05:13:43Z)
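The summary does not spell out the network, so the sketch below only shows the leaky integrate-and-fire (LIF) dynamics that spiking layers of this kind are built from; the time constant and threshold are assumed values.

```python
import torch

def lif_forward(inputs, tau=2.0, v_threshold=1.0):
    """Run a layer of leaky integrate-and-fire neurons over a sequence of
    input currents. inputs: (T, N) tensor of per-timestep currents; returns
    the (T, N) binary spike trains. Real SNN trackers also need surrogate
    gradients to train through the non-differentiable spike."""
    v = torch.zeros_like(inputs[0])
    spikes = []
    for x in inputs:
        v = v + (x - v) / tau           # leaky integration of input current
        s = (v >= v_threshold).float()  # fire when the threshold is crossed
        v = v * (1.0 - s)               # hard reset after a spike
        spikes.append(s)
    return torch.stack(spikes)
```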
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey of the most relevant and recent advances in deep semantic segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams [19.957857885844838]
Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams.
We propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient representation learning on event streams.
Experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.
arXiv Detail & Related papers (2023-03-07T12:48:02Z)
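A common recipe for building the voxel sets such a model consumes is to discretize the event stream into a spatiotemporal grid and keep only the non-empty cells; the sketch below follows that recipe with assumed bin sizes, not EVSTr's exact configuration.

```python
import numpy as np

def event_voxel_set(events, height, width, t_bins=8, cell=8):
    """Discretize an event stream into a (t, y, x) grid and return only the
    non-empty voxels, i.e. a sparse voxel set. events: (N, 4) array of
    (x, y, t, polarity). Bin counts and cell size are assumed values."""
    t = events[:, 2]
    span = np.ptp(t) + 1e-9                         # avoid division by zero
    ti = np.minimum(((t - t.min()) / span * t_bins).astype(int), t_bins - 1)
    yi = events[:, 1].astype(int) // cell
    xi = events[:, 0].astype(int) // cell
    gh, gw = -(-height // cell), -(-width // cell)  # ceil division
    grid = np.zeros((t_bins, gh, gw), dtype=np.int32)
    np.add.at(grid, (ti, yi, xi), 1)
    coords = np.argwhere(grid > 0)                  # (M, 3) non-empty voxel coords
    counts = grid[grid > 0]                         # (M,) event count per voxel
    return coords, counts
```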
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potential of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
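A minimal sketch of the local half of that design, self-attention restricted to non-overlapping windows, is given below; the global cross-window attention and the pyramid are omitted, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-attention within non-overlapping w x w windows, as in the local
    branch of window-based transformers. The global branch of an HLG-style
    design would attend across per-window summaries and is omitted here."""
    def __init__(self, dim=96, heads=3, window=7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C) with H, W divisible by `window`
        B, H, W, C = x.shape
        w = self.window
        # Partition the feature map into (B * num_windows, w*w, C) token groups.
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)
        x, _ = self.attn(x, x, x)  # attention stays inside each window
        # Reverse the partition back to (B, H, W, C).
        x = x.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H, W, C)
```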
- Deep ViT Features as Dense Visual Descriptors [12.83702462166513]
We leverage deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors.
These descriptors facilitate a variety of applications, including co-segmentation, part co-segmentation and correspondences.
arXiv Detail & Related papers (2021-12-10T20:15:03Z)
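One common recipe for this, hedged: capture the patch tokens of an intermediate transformer block with a forward hook and reshape them into a per-patch descriptor grid. The torchvision model and the layer index below are stand-ins, not the paper's exact setup.

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT).eval()
tokens = {}

# Capture the token sequence produced by an intermediate encoder block.
def hook(module, inputs, output):
    tokens["feat"] = output

model.encoder.layers[-2].register_forward_hook(hook)  # layer choice is an assumption

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))                # stand-in image batch

feat = tokens["feat"][:, 1:, :]                       # drop the class token
B, N, C = feat.shape
side = int(N ** 0.5)                                  # 14 x 14 patch grid for ViT-B/16 at 224
dense_descriptors = feat.reshape(B, side, side, C)    # one descriptor per image patch
```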
- Learning Monocular Dense Depth from Events [53.078665310545745]
Event cameras report per-pixel brightness changes as a stream of asynchronous events instead of intensity frames.
Recent learning-based approaches have been applied to event-based data for tasks such as monocular depth prediction.
We propose a recurrent architecture to solve this task and show significant improvement over standard feed-forward methods.
arXiv Detail & Related papers (2020-10-16T12:36:23Z)
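A hedged sketch of the recurrent idea: carry a hidden state across successive event representations so the network integrates evidence over time, unlike a feed-forward model that sees each slice in isolation. A convolutional GRU cell stands in for the paper's architecture; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: the hidden state carried across event slices
    lets the depth network accumulate evidence over time, which is the key
    difference from a feed-forward model."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

# Usage over a sequence of event tensors shaped (T, B, C, H, W):
cell, head = ConvGRUCell(32), nn.Conv2d(32, 1, 1)
seq = torch.randn(5, 2, 32, 64, 64)
h = torch.zeros(2, 32, 64, 64)
for x in seq:
    h = cell(x, h)
depth = head(h)  # per-pixel depth prediction from the final state
```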
- Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)
- Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)
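A minimal sketch of that pattern under stated assumptions: a lightweight pretrained backbone as the encoder, a globally pooled scene descriptor, and a small decoder that fuses the two into a saliency map. The MobileNetV2 backbone and the fusion scheme are stand-ins, not the paper's exact network.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

class SaliencyNet(nn.Module):
    """Encoder-decoder saliency sketch: local features from a lightweight
    pretrained classifier are fused with a globally pooled scene descriptor
    before decoding to a single-channel saliency map."""
    def __init__(self):
        super().__init__()
        self.encoder = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).features
        self.decoder = nn.Sequential(
            nn.Conv2d(1280 * 2, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1),
        )

    def forward(self, x):
        local = self.encoder(x)                       # (B, 1280, h, w) local features
        scene = local.mean(dim=(2, 3), keepdim=True)  # global scene context
        fused = torch.cat([local, scene.expand_as(local)], dim=1)
        out = self.decoder(fused)                     # coarse saliency logits
        return nn.functional.interpolate(out, size=x.shape[2:], mode="bilinear",
                                         align_corners=False)
```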