Text-to-Events: Synthetic Event Camera Streams from Conditional Text Input
- URL: http://arxiv.org/abs/2406.03439v1
- Date: Wed, 5 Jun 2024 16:34:12 GMT
- Title: Text-to-Events: Synthetic Event Camera Streams from Conditional Text Input
- Authors: Joachim Ott, Zuowen Wang, Shih-Chii Liu
- Abstract summary: Event cameras are advantageous for tasks that require vision sensors with low-latency and sparse output responses.
This paper reports a method for creating new labelled event datasets by using a text-to-X model.
We demonstrate that the model can generate realistic event sequences of human gestures prompted by different text statements.
- Score: 8.365349007799296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event cameras are advantageous for tasks that require vision sensors with low-latency and sparse output responses. However, the development of deep network algorithms using event cameras has been slow because of the lack of large labelled event camera datasets for network training. This paper reports a method for creating new labelled event datasets by using a text-to-X model, where X is one or multiple output modalities; in this work, X is events. Our proposed text-to-events model produces synthetic event frames directly from text prompts. It uses an autoencoder which is trained to produce sparse event frames representing event camera outputs. By combining the pretrained autoencoder with a diffusion model architecture, the new text-to-events model is able to generate smooth synthetic event streams of moving objects. The autoencoder was first trained on an event camera dataset of diverse scenes. In the combined training with the diffusion model, the DVS gesture dataset was used. We demonstrate that the model can generate realistic event sequences of human gestures prompted by different text statements. The classification accuracy of the generated sequences, measured with a classifier trained on the real dataset, ranges from 42% to 92%, depending on the gesture group. These results demonstrate the capability of this method for synthesizing event datasets.
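The pipeline described in the abstract (a text-conditioned diffusion model operating in the latent space of a pretrained autoencoder, whose decoder emits sparse ON/OFF event frames) can be sketched roughly as follows. All names are illustrative, and the random latents stand in for the actual iterative denoising conditioned on a text prompt:

```python
import numpy as np

def decode_to_event_frame(latent, threshold=1.0):
    """Map a dense latent to a sparse event frame in {-1, 0, +1}.

    Values above +threshold become ON events (+1), values below
    -threshold become OFF events (-1); everything else is 'no event'.
    """
    frame = np.zeros(latent.shape, dtype=np.int8)
    frame[latent > threshold] = 1
    frame[latent < -threshold] = -1
    return frame

def sample_event_stream(num_frames, shape, rng, threshold=1.0):
    """Stand-in for diffusion sampling: here we just draw random latents.

    In the actual model, each latent would come from denoising steps
    conditioned on the embedding of the text prompt.
    """
    frames = []
    for _ in range(num_frames):
        latent = rng.standard_normal(shape)
        frames.append(decode_to_event_frame(latent, threshold))
    return np.stack(frames)

rng = np.random.default_rng(0)
stream = sample_event_stream(num_frames=4, shape=(32, 32), rng=rng)
sparsity = float(np.mean(stream == 0))  # most pixels carry no event
```

The thresholded decoder output mimics the sparsity of real DVS frames; in the paper this sparsity is learned by the autoencoder rather than hard-coded.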
Related papers
- Event Camera Data Dense Pre-training [10.918407820258246]
This paper introduces a self-supervised learning framework designed for pre-training neural networks tailored to dense prediction tasks using event camera data.
For training our framework, we curate a synthetic event camera dataset featuring diverse scene and motion patterns.
arXiv Detail & Related papers (2023-11-20T04:36:19Z)
- EGVD: Event-Guided Video Deraining [57.59935209162314]
We propose an end-to-end learning-based network to unlock the potential of the event camera for video deraining.
We build a real-world dataset consisting of rainy videos and temporally synchronized event streams.
arXiv Detail & Related papers (2023-09-29T13:47:53Z)
- EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding [7.797154022794006]
EventBind is a novel framework that unleashes the potential of vision-language models (VLMs) for event-based recognition.
We first introduce a novel event encoder that subtly models the temporal information from events.
We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind's generalization ability.
arXiv Detail & Related papers (2023-08-06T15:05:42Z)
- Event Camera Data Pre-training [14.77724035068357]
Our model is a self-supervised learning framework, and uses paired event camera data and natural RGB images for training.
We achieve top-1 accuracy at 64.83% on the N-ImageNet dataset.
arXiv Detail & Related papers (2023-01-05T06:32:50Z)
- Event Transformer [43.193463048148374]
The event camera's low power consumption and ability to capture brightness changes at microsecond resolution make it attractive for various computer vision tasks.
Existing event representation methods typically convert events into frames, voxel grids, or spikes for deep neural networks (DNNs).
This work introduces a novel token-based event representation, where each event is considered a fundamental processing unit termed an event-token.
arXiv Detail & Related papers (2022-04-11T15:05:06Z)
- Bina-Rep Event Frames: a Simple and Effective Representation for Event-based cameras [1.6114012813668934]
"Bina-Rep" is a simple representation method that converts asynchronous streams of events from event cameras to a sequence of sparse and expressive event frames.
Our method is able to obtain sparser and more expressive event frames thanks to the retained information about event orders in the original stream.
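As a rough illustration of this idea (not the paper's exact algorithm), N consecutive binary event frames can be packed into a single N-bit frame, so that the bit pattern at each pixel preserves the order in which events occurred rather than only whether they occurred:

```python
import numpy as np

def bina_rep(event_frames):
    """Collapse a stack of N binary event frames into one N-bit frame.

    Frame i contributes bit (N-1-i), so the resulting integer at each
    pixel encodes in which of the N slices that pixel fired.
    """
    n = len(event_frames)
    out = np.zeros(event_frames[0].shape, dtype=np.int32)
    for i, frame in enumerate(event_frames):
        out |= (frame.astype(np.int32) & 1) << (n - 1 - i)
    return out

# Toy example: three binary 2x2 frames -> per-pixel values in [0, 7]
frames = [
    np.array([[1, 0], [0, 1]]),
    np.array([[0, 0], [1, 1]]),
    np.array([[1, 1], [0, 0]]),
]
rep = bina_rep(frames)  # e.g. pixel (0, 0) fired in slices 0 and 2 -> 0b101 = 5
```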
arXiv Detail & Related papers (2022-02-28T10:23:09Z)
- CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework that pushes vision-language pretraining models to comprehend events and their argument structures.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
arXiv Detail & Related papers (2022-01-13T17:03:57Z)
- Bridging the Gap between Events and Frames through Unsupervised Domain Adaptation [57.22705137545853]
We propose a task transfer method that allows models to be trained directly with labeled images and unlabeled event data.
We leverage the generative event model to split event features into content and motion features.
Our approach unlocks the vast amount of existing image datasets for the training of event-based neural networks.
arXiv Detail & Related papers (2021-09-06T17:31:37Z)
- VisEvent: Reliable Object Tracking via Collaboration of Frame and Event Flows [93.54888104118822]
Due to the lack of a realistic, large-scale dataset for this task, we propose a large-scale Visible-Event benchmark (termed VisEvent).
Our dataset consists of 820 video pairs captured under low illumination, high speed, and background clutter scenarios.
Based on VisEvent, we transform the event flows into event images and construct more than 30 baseline methods.
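Transforming an event flow into an event image typically means accumulating per-pixel, per-polarity event counts into a fixed-size tensor that frame-based trackers can consume. A minimal sketch (with hypothetical names, not the VisEvent code) might look like:

```python
import numpy as np

def events_to_image(events, height, width):
    """Accumulate (x, y, polarity) events into a 2-channel count image.

    Channel 0 counts ON (+1) events, channel 1 counts OFF (-1) events.
    """
    img = np.zeros((2, height, width), dtype=np.int32)
    for x, y, p in events:
        img[0 if p > 0 else 1, y, x] += 1
    return img

# Toy stream: two ON events at pixel (0, 0) and one OFF event at (1, 1)
events = [(0, 0, 1), (0, 0, 1), (1, 1, -1)]
img = events_to_image(events, height=2, width=2)
```

Count images like this discard precise timing but make event data directly compatible with standard CNN-based baselines.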
arXiv Detail & Related papers (2021-08-11T03:55:12Z)
- EventHands: Real-Time Neural 3D Hand Reconstruction from an Event Stream [80.15360180192175]
3D hand pose estimation from monocular videos is a long-standing and challenging problem.
We address it for the first time using a single event camera, i.e., an asynchronous vision sensor reacting to brightness changes.
Our approach has characteristics previously not demonstrated with a single RGB or depth camera.
arXiv Detail & Related papers (2020-12-11T16:45:34Z)
- Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.