Related papers: Action100M: A Large-scale Video Action Dataset

URL: http://arxiv.org/abs/2601.10592v1
Date: Thu, 15 Jan 2026 17:02:27 GMT
Title: Action100M: A Large-scale Video Action Dataset
Authors: Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, Pascale Fung
Abstract summary: Action100M is a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration). It yields O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions.
Score: 33.33351591459689
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
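As a rough illustration of the structured annotations the abstract lists (brief/detailed action, actor, brief/detailed caption) attached to temporally localized segments in a hierarchy, the following Python sketch models one possible record layout. All class names, field names, and the seconds-based time unit are hypothetical and are not taken from the Action100M release; the example values are invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Annotation:
    # The five annotation fields named in the abstract (field names are hypothetical).
    brief_action: str
    detailed_action: str
    actor: str
    brief_caption: str
    detailed_caption: str


@dataclass
class Segment:
    # A temporally localized segment; start/end in seconds (assumed unit).
    start: float
    end: float
    annotation: Annotation
    children: List["Segment"] = field(default_factory=list)

    def duration(self) -> float:
        return self.end - self.start

    def walk(self) -> Iterator["Segment"]:
        # Depth-first traversal: the segment itself, then its sub-segments.
        yield self
        for child in self.children:
            yield from child.walk()


def make_example() -> Segment:
    # Invented example: one parent segment with two finer-grained children.
    def ann(action: str) -> Annotation:
        return Annotation(
            brief_action=action,
            detailed_action=f"{action} (detailed description)",
            actor="a person",
            brief_caption=f"A person performs: {action}",
            detailed_caption=f"A longer caption describing: {action}",
        )

    return Segment(
        start=0.0,
        end=12.5,
        annotation=ann("prepare dough"),
        children=[
            Segment(0.0, 5.0, ann("pour flour")),
            Segment(5.0, 12.5, ann("knead dough")),
        ],
    )


if __name__ == "__main__":
    for seg in make_example().walk():
        print(f"[{seg.start:5.1f}s - {seg.end:5.1f}s] {seg.annotation.brief_action}")
```

A tree of segments is only one plausible way to hold hierarchical temporal segmentation results; a flat table with a parent-id column would serve equally well for large-scale storage.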

Related papers

ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries [77.41072125938636]
ARC-Chapter is the first large-scale video chaptering model trained on more than a million long video chapters. It unifies ASR transcripts, scene texts, and visual captions into multi-level annotations, from short titles to long summaries. It establishes a new state-of-the-art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score.
arXiv Detail & Related papers (2025-11-18T10:53:14Z)
DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding [13.256830504062332]
We introduce DEL, a framework for dense semantic action localization. DEL aims to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos.
arXiv Detail & Related papers (2025-06-29T11:50:19Z)
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos [53.723410664944566]
We present Perceive Anything Model (PAM), a framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features into multi-modal tokens.
arXiv Detail & Related papers (2025-06-05T17:51:39Z)
Open-World Human-Object Interaction Detection via Multi-modal Prompts [26.355054079885463]
MP-HOI is a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling high ambiguity in descriptions. MP-HOI could serve as a generalist HOI detector, surpassing the HOI vocabulary of existing expert models by more than 30 times.
arXiv Detail & Related papers (2024-06-11T13:01:45Z)
Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset. We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning [102.54669633984278]
We propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization.
arXiv Detail & Related papers (2024-02-18T03:04:38Z)
Temporal Alignment Networks for Long-term Video [103.69904379356413]
We propose a temporal alignment network that ingests long-term video sequences and associated text sentences. We train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise. Our proposed model, trained on HowTo100M, outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset.
arXiv Detail & Related papers (2022-04-06T17:59:46Z)
Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in a video with an action label. In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos. We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.