Segment Any Events with Language
- URL: http://arxiv.org/abs/2601.23159v1
- Date: Fri, 30 Jan 2026 16:42:56 GMT
- Title: Segment Any Events with Language
- Authors: Seungjun Lee, Gim Hee Lee
- Abstract summary: We introduce SEAL, the first Semantic-aware Segment Any Events framework that addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given a visual prompt, our model provides a unified framework supporting both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. Our SEAL largely outperforms the proposed baselines in terms of performance and inference speed with a parameter-efficient architecture.
- Score: 68.05185562243356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene understanding with free-form language has been widely explored within diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic-level understanding. We introduce SEAL, the first Semantic-aware Segment Any Events framework that addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given a visual prompt, our model presents a unified framework to support both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. To enable thorough evaluation on OV-EIS, we curate four benchmarks that cover label granularity from coarse to fine class configurations and semantic granularity from instance-level to part-level understanding. Extensive experiments show that our SEAL largely outperforms the proposed baselines in terms of performance and inference speed with a parameter-efficient architecture. In the Appendix, we further present a simple variant of our SEAL achieving generic spatiotemporal OV-EIS that does not require any visual prompts from users at inference. Check out our project page at https://0nandon.github.io/SEAL
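The inference flow the abstract describes, converting an event stream into a learnable representation, segmenting it from a visual prompt, and classifying the resulting mask against free-form text, can be pictured in code. The sketch below is a minimal hypothetical illustration, not the authors' released implementation: the voxel-grid event representation, the `event_encoder` / `mask_decoder` / `text_encoder` callables, and the CLIP-style embedding similarity are all assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical interface illustrating prompt-driven open-vocabulary event
# instance segmentation (OV-EIS). The encoder/decoder callables below are
# stand-ins for the components the abstract describes, not SEAL's code.

def voxelize_events(events: torch.Tensor, bins: int, h: int, w: int) -> torch.Tensor:
    """Accumulate an (N, 4) event stream [x, y, t, polarity] into a
    (bins, H, W) voxel grid -- a common event representation."""
    x, y, t, p = events.unbind(dim=1)
    t = (t - t.min()) / (t.max() - t.min() + 1e-9)   # normalize timestamps to [0, 1]
    b = (t * (bins - 1)).long()                      # temporal bin index per event
    grid = torch.zeros(bins, h, w)
    grid.index_put_((b, y.long(), x.long()), p.float(), accumulate=True)
    return grid

@torch.no_grad()
def segment_with_language(grid, point_prompt, class_names,
                          event_encoder, mask_decoder, text_encoder):
    """Predict a mask from a visual (point) prompt, then classify it
    against free-form class names via embedding similarity."""
    feats = event_encoder(grid.unsqueeze(0))                   # (1, C, H', W') event features
    mask_logits, mask_emb = mask_decoder(feats, point_prompt)  # prompt-conditioned mask + embedding
    text_emb = text_encoder(class_names)                       # (K, C) open-vocabulary text embeddings
    scores = F.softmax(F.normalize(mask_emb, dim=-1)
                       @ F.normalize(text_emb, dim=-1).T, dim=-1)
    return mask_logits.sigmoid() > 0.5, class_names[scores.argmax().item()]
```

Under these assumptions, instance-level versus part-level granularity would be a property of the masks the decoder is trained to produce, with the open-vocabulary classification step unchanged.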
Related papers
- LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [54.683384204063934]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension.
We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation.
LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z) - Cross-Domain Semantic Segmentation with Large Language Model-Assisted Descriptor Generation [0.0]
LangSeg is a novel semantic segmentation method that leverages context-sensitive, fine-grained subclass descriptors.
We evaluate LangSeg on two challenging datasets, ADE20K and COCO-Stuff, where it outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-01-27T20:02:12Z) - Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation [15.941958367737408]
Seg-TTO is a framework for zero-shot, open-vocabulary semantic segmentation.
We focus on segmentation-specific test-time optimization to address this gap.
Seg-TTO demonstrates clear performance improvements (up to 27% mIoU increase on some datasets), establishing a new state-of-the-art.
arXiv Detail & Related papers (2025-01-08T18:58:24Z) - OVOSE: Open-Vocabulary Semantic Segmentation in Event-Based Cameras [18.07403094754705]
We introduce OVOSE, the first Open-Vocabulary Semantic Segmentation algorithm for Event cameras.
We evaluate OVOSE on two driving semantic segmentation datasets, DDD17 and DSEC-Semantic.
OVOSE demonstrates superior performance, showcasing its potential for real-world applications.
arXiv Detail & Related papers (2024-08-18T09:56:32Z) - Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z) - Segment and Caption Anything [126.20201216616137]
We propose a method to efficiently equip the Segment Anything Model with the ability to generate regional captions.
By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation (a hedged sketch of such a mixer appears after this list).
We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice.
arXiv Detail & Related papers (2023-12-01T19:00:17Z) - SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation [87.18373801829314]
In-context segmentation aims at segmenting novel images using a few labeled example images, termed "in-context examples."
We propose SEGIC, an end-to-end segment-in-context framework built upon a single vision foundation model (VFM).
SEGIC is a straightforward yet effective approach that yields state-of-the-art performance on one-shot segmentation benchmarks.
arXiv Detail & Related papers (2023-11-24T18:59:42Z) - Hierarchical Open-vocabulary Universal Image Segmentation [48.008887320870244]
Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions.
We propose a decoupled text-image fusion mechanism and representation learning modules for both "things" and "stuff."
Our resulting model, named HIPIE, tackles HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework.
arXiv Detail & Related papers (2023-07-03T06:02:15Z) - Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models.
ZeroSeg overcomes the need for human labels by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z) - TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation [44.75300205362518]
Unsupervised semantic segmentation aims to obtain high-level semantic representations from low-level visual features without manual annotations.
We propose the first top-down unsupervised semantic segmentation framework for fine-grained segmentation in extremely complicated scenarios.
Our results show that our top-down unsupervised segmentation is robust to both object-centric and scene-centric datasets.
arXiv Detail & Related papers (2021-12-02T18:59:03Z)
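For the "Segment and Caption Anything" entry above, the query-based feature mixer can be pictured as a small cross-attention module that maps pooled region features into a language model's embedding space. The sketch below is a hypothetical reading of that one-sentence summary; the class name `QueryFeatureMixer`, the layer choices, and all shapes are assumptions rather than that paper's actual architecture.

```python
import torch
import torch.nn as nn

class QueryFeatureMixer(nn.Module):
    """Hypothetical lightweight query-based feature mixer: a small set of
    learnable queries cross-attends to region-specific features and is
    projected into a language model's embedding space. Layer choices and
    shapes are assumptions, not the cited paper's architecture."""

    def __init__(self, region_dim: int, lm_dim: int, num_queries: int = 8, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, region_dim))
        self.cross_attn = nn.MultiheadAttention(region_dim, heads, batch_first=True)
        self.to_lm = nn.Linear(region_dim, lm_dim)  # project into the LM embedding space

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, N, region_dim) features pooled from a predicted mask or box
        q = self.queries.unsqueeze(0).expand(region_feats.size(0), -1, -1)
        mixed, _ = self.cross_attn(q, region_feats, region_feats)
        return self.to_lm(mixed)  # (B, num_queries, lm_dim) tokens fed to the caption LM
```

Keeping the mixer this small (a few queries and a single attention block) is presumably what makes the approach lightweight relative to the frozen vision backbone and caption language model.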
This list is automatically generated from the titles and abstracts of the papers in this site.