Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation
- URL: http://arxiv.org/abs/2602.21406v1
- Date: Tue, 24 Feb 2026 22:23:22 GMT
- Title: Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation
- Authors: Asim Unmesh, Kaki Ramesh, Mayank Patel, Rahul Jain, Karthik Ramani
- Abstract summary: Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. We propose Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation.
- Score: 12.112297992589314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.
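The abstract describes the pipeline only at a high level, so the following is a minimal, hedged sketch of its segmentation-by-classification idea: frame and action-label embeddings (which the paper would obtain from one of the 14 VLMs studied) are compared by cosine similarity to form a frame-action similarity matrix, and a median-filter-plus-run-merging heuristic stands in for the temporal-consistency stage. The function names, the smoothing step, and the random stand-in embeddings are illustrative assumptions, not the paper's FAES/SMTS formulations.

```python
# Minimal sketch of the segmentation-by-classification pipeline described above.
# The temporal step (median filter + merging of label runs) is a simple stand-in
# for SMTS, which the abstract does not specify.
import numpy as np
from scipy.ndimage import median_filter


def frame_action_similarity(frame_embs: np.ndarray, label_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between T frame embeddings and K action-label embeddings."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    a = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    return f @ a.T  # (T, K) frame-action similarity matrix


def temporal_segments(similarity: np.ndarray, smooth: int = 15):
    """Label each frame with its best action, smooth over time, merge runs into segments."""
    labels = np.argmax(similarity, axis=1)
    labels = median_filter(labels, size=smooth)  # suppress single-frame label flicker
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t, int(labels[start])))  # [start, end) frame range + label
            start = t
    return segments


# Usage with random stand-in embeddings; in practice these would come from a VLM's
# image encoder (video frames) and text encoder (candidate action names).
rng = np.random.default_rng(0)
frame_embs, label_embs = rng.normal(size=(300, 512)), rng.normal(size=(5, 512))
print(temporal_segments(frame_action_similarity(frame_embs, label_embs)))
```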
Related papers
- CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos [59.391265901911005]
We propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address complex challenges via temporal-semantic reasoning. CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses, for each object, a corresponding frame in which it can be observed effortlessly among all frames (temporal). The framework's training-free nature further allows it to process online video streams, where CoT is used at test time to update the object of interest when a better target starts to emerge.
arXiv Detail & Related papers (2025-05-24T07:01:31Z) - Towards Open-Vocabulary Video Semantic Segmentation [40.58291642595943]
We introduce the Open-Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context.
arXiv Detail & Related papers (2024-12-12T14:53:16Z) - SMC-NCA: Semantic-guided Multi-level Contrast for Semi-supervised Temporal Action Segmentation [53.010417880335424]
Semi-supervised temporal action segmentation (SS-TAS) aims to perform frame-wise classification in long untrimmed videos.
Recent studies have shown the potential of contrastive learning in unsupervised representation learning using unlabelled data.
We propose a novel Semantic-guided Multi-level Contrast scheme with a Neighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise representations.
arXiv Detail & Related papers (2023-12-19T17:26:44Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models.
ZeroSeg achieves this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z) - Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z) - SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z) - Towards Open-Vocabulary Video Instance Segmentation [61.469232166803465]
Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories.
We introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories.
To benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,196 diverse categories.
arXiv Detail & Related papers (2023-04-04T11:25:23Z) - TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
We then cluster the learned frame embeddings and, based on the identified clusters, decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z) - ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [36.90693762365237]
Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
arXiv Detail & Related papers (2022-03-29T01:59:26Z) - Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach that segments actions in a video without requiring any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
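As an illustration of the temporally-weighted hierarchical clustering idea summarized in this entry, the sketch below runs agglomerative clustering over per-frame features with a pairwise distance that grows as frames move farther apart in time. The weighting rule, the alpha parameter, and the use of average linkage are assumptions chosen for illustration, not the authors' exact algorithm.

```python
# Illustrative temporally-weighted hierarchical clustering over per-frame features.
# The feature distance is scaled up with temporal distance so that frames far apart
# in time are less likely to be grouped together; this weighting rule is an assumption.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def temporally_weighted_clusters(frame_feats: np.ndarray, n_clusters: int, alpha: float = 1.0):
    T = len(frame_feats)
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    feat_dist = np.clip(1.0 - f @ f.T, 0.0, None)                                  # cosine distance
    time_dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :]) / max(T - 1, 1)
    dist = feat_dist * (1.0 + alpha * time_dist)                                    # temporal weighting
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")                   # agglomerative merge tree
    return fcluster(Z, t=n_clusters, criterion="maxclust")                          # per-frame cluster ids


# Usage with random stand-in features; real inputs would be per-frame visual features.
labels = temporally_weighted_clusters(np.random.default_rng(0).normal(size=(200, 64)), n_clusters=4)
print(labels[:20])
```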
arXiv Detail & Related papers (2021-03-20T23:30:01Z) - Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in videos.
We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time.
We evaluate our system on challenging public benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)