Expanding Event Modality Applications through a Robust CLIP-Based Encoder
- URL: http://arxiv.org/abs/2412.03093v1
- Date: Wed, 04 Dec 2024 07:44:58 GMT
- Title: Expanding Event Modality Applications through a Robust CLIP-Based Encoder
- Authors: Sungheon Jeong, Hanning Chen, Sanggeon Yun, Suhyeon Cho, Wenjun Huang, Xiangjian Liu, Mohsen Imani,
- Abstract summary: This paper introduces a powerful encoder that transfers CLIP's capabilities to event-based data.
We adapt CLIP's architecture to align event embeddings with image embeddings, supporting zero-shot learning and preserving text alignment.
Our encoder achieves strong performance in object recognition, with competitive results in zero-shot and few-shot learning tasks.
- Score: 6.139775949624247
- License:
- Abstract: This paper introduces a powerful encoder that transfers CLIP's capabilities to event-based data, enhancing its utility and expanding its applicability across diverse domains. While large-scale datasets have significantly advanced image-based models, the scarcity of comprehensive event datasets has limited performance potential in event modality. To address this challenge, we adapt CLIP's architecture to align event embeddings with image embeddings, supporting zero-shot learning and preserving text alignment while mitigating catastrophic forgetting. Our encoder achieves strong performance in object recognition, with competitive results in zero-shot and few-shot learning tasks. Notably, it generalizes effectively to events extracted from video data without requiring additional training, highlighting its versatility. Additionally, we integrate this encoder within a cross-modality framework that facilitates interaction across five modalities (Image, Event, Text, Sound, and Depth), expanding the possibilities for cross-modal applications. Overall, this work underscores the transformative potential of a robust event encoder, broadening the scope and utility of event-based data across various fields.
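A minimal sketch of the alignment idea described in the abstract, assuming a frozen CLIP image encoder that maps paired RGB frames to D-dimensional embeddings: a trainable event encoder is optimized so that its embeddings match the frozen image embeddings, which is what lets the event branch reuse CLIP's text alignment for zero-shot recognition. The encoder architecture, voxel-grid input, and symmetric InfoNCE loss below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: align a trainable event encoder with frozen CLIP image embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EventEncoder(nn.Module):
    """Toy encoder: voxelized events (B, C, H, W) -> CLIP-sized embedding (B, D)."""
    def __init__(self, in_channels: int = 10, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, event_voxels: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(self.backbone(event_voxels)), dim=-1)

def alignment_loss(event_emb, image_emb, temperature: float = 0.07):
    """Symmetric InfoNCE between paired event and frozen CLIP image embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    logits = event_emb @ image_emb.t() / temperature
    targets = torch.arange(event_emb.size(0), device=event_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Because the event embeddings land in CLIP's joint space, zero-shot classification can reuse CLIP's text embeddings unchanged; the paper additionally addresses catastrophic forgetting, which this sketch does not model.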
Related papers
- Deep Learning-based Event Data Coding: A Joint Spatiotemporal and Polarity Solution [45.8313373627054]
Event cameras generate a massive number of pixel-level events composed of temporal and polarity information.
This paper proposes a novel lossy Deep Learning-based Joint Event data Coding (DL-JEC) solution adopting a single-point cloud representation.
It is shown that lossy event data coding can be used at a reduced rate without compromising the target computer vision task performance.
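DL-JEC's single point-cloud representation can be pictured as packing each raw event (x, y, t, p) into a 3D point, with polarity kept as a per-point attribute. The coordinate normalization and field layout below are assumptions for illustration, not the paper's coding pipeline.

```python
# Sketch: pack raw events (x, y, t, p) into one 3D point cloud, using normalized
# timestamps as the third axis and polarity as a per-point attribute.
import numpy as np

def events_to_point_cloud(x, y, t, p, width: int = 640, height: int = 480):
    """Return (N, 3) points in [0, 1]^3 and an (N,) polarity attribute array."""
    x = np.asarray(x, dtype=np.float32) / (width - 1)
    y = np.asarray(y, dtype=np.float32) / (height - 1)
    t = np.asarray(t, dtype=np.float64)
    t = (t - t.min()) / max(t.max() - t.min(), 1e-9)   # normalize time to [0, 1]
    points = np.stack([x, y, t.astype(np.float32)], axis=1)
    return points, np.asarray(p, dtype=np.float32)
```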
arXiv Detail & Related papers (2025-02-05T15:39:55Z) - CEIA: CLIP-Based Event-Image Alignment for Open-World Event-Based Understanding [52.67839570524888]
We present CEIA, an effective framework for open-world event-based understanding.
We leverage the rich event-image datasets to learn an event embedding space aligned with the image space of CLIP.
CEIA offers two distinct advantages. First, it allows us to take full advantage of the existing event-image datasets to make up for the shortage of large-scale event-text datasets.
arXiv Detail & Related papers (2024-07-09T07:26:15Z) - MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
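One common way to realize this kind of token compression is a small set of learnable queries that cross-attend to the concatenated expert tokens (resampler/Q-Former style). The sketch below assumes that design and is not MouSi's exact fusion network.

```python
# Sketch: compress tokens from several visual experts into a fixed, small number
# of fused tokens via learnable queries and cross-attention.
import torch
import torch.nn as nn

class ExpertFuser(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, expert_tokens):
        """expert_tokens: list of (B, N_i, dim) tensors -> (B, num_queries, dim)."""
        tokens = torch.cat(expert_tokens, dim=1)                 # (B, sum N_i, dim)
        queries = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(queries, tokens, tokens)            # cross-attention
        return fused
```

With num_queries set to 64 (or even 1), thousands of expert tokens occupy only a handful of positions in the language model's context.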
arXiv Detail & Related papers (2024-01-30T18:09:11Z) - Federated Multi-View Synthesizing for Metaverse [52.59476179535153]
The metaverse is expected to provide immersive entertainment, education, and business applications.
Virtual reality (VR) transmission over wireless networks is data- and computation-intensive.
We have developed a novel multi-view synthesizing framework that can efficiently provide synthesizing, storage, and communication resources for wireless content delivery in the metaverse.
arXiv Detail & Related papers (2023-12-18T13:51:56Z) - EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding [7.797154022794006]
EventBind is a novel framework that unleashes the potential of vision-language models (VLMs) for event-based recognition.
We first introduce a novel event encoder that subtly models the temporal information from events.
We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind's generalization ability.
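As a rough stand-in for prompting a CLIP-style text encoder with event classes, the sketch below builds a few hand-crafted templates per class; EventBind's actual hybrid content/learnable prompt design is more elaborate, so the templates here are purely illustrative assumptions.

```python
# Sketch: hand-crafted prompt templates for zero-shot event classification with a
# CLIP-style text encoder (templates are illustrative, not EventBind's).
TEMPLATES = [
    "an event-camera recording of a {}.",
    "a sparse, edge-like rendering of a {}.",
    "a photo of a {}.",
]

def build_prompts(class_names):
    """Return {class_name: [prompt strings]}; encode each prompt and average per
    class to obtain zero-shot classifier weights."""
    return {n: [t.format(n.replace("_", " ")) for t in TEMPLATES] for n in class_names}
```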
arXiv Detail & Related papers (2023-08-06T15:05:42Z) - Cross-modal Orthogonal High-rank Augmentation for RGB-Event Transformer-trackers [58.802352477207094]
We explore the great potential of a pre-trained vision Transformer (ViT) to bridge the vast distribution gap between two modalities.
We propose a mask modeling strategy that randomly masks a specific modality of some tokens to enforce proactive interaction between tokens from different modalities.
Experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate.
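The masking strategy can be pictured as follows: for randomly chosen token positions, one modality is dropped so the transformer must lean on the other. Zero-filling (rather than a learned mask token) and the masking probability are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch: randomly mask one modality at selected token positions to force
# cross-modal interaction between RGB and event tokens.
import torch

def random_modality_mask(rgb_tokens, event_tokens, p_mask: float = 0.3):
    """rgb_tokens, event_tokens: (B, N, D) token sequences from the two modalities."""
    B, N, _ = rgb_tokens.shape
    selected = torch.rand(B, N, device=rgb_tokens.device) < p_mask
    drop_rgb = selected & (torch.rand(B, N, device=rgb_tokens.device) < 0.5)
    drop_evt = selected & ~drop_rgb
    rgb = rgb_tokens.masked_fill(drop_rgb.unsqueeze(-1), 0.0)
    evt = event_tokens.masked_fill(drop_evt.unsqueeze(-1), 0.0)
    return rgb, evt
```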
arXiv Detail & Related papers (2023-07-09T08:58:47Z) - EventCLIP: Adapting CLIP for Event-based Object Recognition [26.35633454924899]
EventCLIP is a novel approach that utilizes CLIP for zero-shot and few-shot event-based object recognition.
We first generalize CLIP's image encoder to event data by converting raw events to 2D grid-based representations.
We evaluate EventCLIP on N-Caltech, N-Cars, and N-ImageNet datasets, achieving state-of-the-art few-shot performance.
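The grid conversion EventCLIP relies on can be sketched as a two-channel polarity histogram that a CLIP image encoder can consume after replication or projection to three channels; the channel layout and the absence of temporal binning are simplifying assumptions.

```python
# Sketch: accumulate raw events into a 2-channel (positive/negative polarity)
# histogram image of shape (2, H, W).
import numpy as np

def events_to_grid(x, y, p, sensor_size=(480, 640)):
    """x, y: integer pixel coords; p: polarity in {0, 1} or {-1, +1}."""
    H, W = sensor_size
    grid = np.zeros((2, H, W), dtype=np.float32)
    pol = (np.asarray(p) > 0).astype(np.int64)              # map polarity to {0, 1}
    np.add.at(grid, (pol, np.asarray(y, dtype=np.int64),
                     np.asarray(x, dtype=np.int64)), 1.0)   # unbuffered accumulation
    return grid / max(float(grid.max()), 1.0)               # simple normalization
```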
arXiv Detail & Related papers (2023-06-10T06:05:35Z) - A Unified Framework for Event-based Frame Interpolation with Ad-hoc Deblurring in the Wild [72.0226493284814]
We propose a unified framework for event-based frame interpolation that performs deblurring ad-hoc.
Our network consistently outperforms previous state-of-the-art methods on frame interpolation, single image deblurring, and the joint task of both.
arXiv Detail & Related papers (2023-01-12T18:19:00Z) - CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework to enforce vision-language pretraining models to comprehend events and associated argument roles.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
arXiv Detail & Related papers (2022-01-13T17:03:57Z) - Dual Transfer Learning for Event-based End-task Prediction via Pluggable Event to Image Translation [33.28163268182018]
Event cameras perceive per-pixel intensity changes and output asynchronous event streams with high dynamic range and less motion blur.
It has been shown that events alone can be used for end-task learning, e.g., semantic segmentation, based on encoder-decoder-like networks.
We propose a simple yet flexible two-stream framework named Dual Transfer Learning (DTL) to effectively enhance the performance on the end-tasks.
arXiv Detail & Related papers (2021-09-04T06:49:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.